Baby Names Question 1 – Nested Data [15 Points]
Use Spark SQL’s native JSON support to read the baby names file into a dataframe. Use this dataframe to create a temporary table containing all the nested data columns (“sid”, “id”, “position”, “created_at”, “created_meta”, “updated_at”, “updated_meta”, “meta”, “year”, “first_name”, “county”, “sex”, “count”) so that they can be queried using SQL.
Hint: you can use dbutils.fs.head(baby_names_path) to take a look at the dataset before reading it in.
Suggested Steps:
1. Read in the JSON data
2. Pull all columns in the nested data column to top level, following the schema specified above. There are built-in Spark SQL functions that will accomplish this.
3. Create a temp table from this expanded dataframe using createOrReplaceTempView()
Cmd 24
# Please provide your code answer for Question 1 here
Command took 0.05 seconds — by a user at 1/12/2021, 5:12:00 PM on unknown cluster
Cmd 25
Please provide your brief, written description of your code here.
Cmd 26
Baby Names Question 2 – Multiple Languages [10 Points]
Using the temp table you created in the question above, write a SQL query that gives the most popular baby name for each year in the dataset. Then, write the same query using either the Scala or Python dataframe APIs.
Cmd 27
# Please provide your code answer for Question 2 here. You will need separate cells for your SQL answer and your Python or Scala answer.
Cmd 28
Please provide your brief, written description of your code here.
Cmd 29
Baby Names Question 3 – Performance [10 Points]
Are there any performance considerations when choosing a language API (SQL vs. Python vs. Scala) in the context of Spark?
Are there any performance considerations when using different data representations (RDD vs. DataFrame) in Spark? Please explain, and provide references if possible. No code answer is required.
Cmd 30
Please write your written answer here.
Cmd 31
Baby Names Question 4 – Nested XML [15 Points]
Imagine that a new upstream system now automatically adds an XML field to the JSON baby dataset. The added field is called visitors. It contains an XML string with visitor information for a given birth. We have simulated this upstream system by creating another JSON file with the additional field.
Using the JSON dataset at dbfs:/interview-datasets/sa/births/births-with-visitor-data.json, do the following:
1. Read the births-with-visitor-data.json file into a dataframe, parse the nested XML field into columns, and print the total record count.
2. Find the county with the highest average number of visitors across all births in that county
3. Find the average visitor age for a birth in the county of KINGS
4. Find the most common birth visitor age in the county of KINGS
Cmd 32
visitors_path = "/interview-datasets/sa/births/births-with-visitor-data.json"
Cmd 33
## Hint: the code below will read in the downloaded JSON file. However, the visitors XML column needs to be given structure. Consider using a UDF.
#df = spark.read.option("inferSchema", True).json(visitors_path)
Cmd 34
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
Cmd 35
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
Cmd 36
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
Cmd 37
Please provide your written answer for Question 4 here
Cmd 38
ML Question 1 [20 Points]
Choose a dataset, then demonstrate the training and testing of a Spark MLlib algorithm. You may use any dataset contained in dbfs:/databricks-datasets, or a publicly available dataset of your choice. If you choose a public dataset, please ensure your code downloads it from a publicly available source, since the grader needs to be able to run all your code. Please cite any sources/guides/blogs used.
Show and discuss an example of using hyperparameter tuning on the algorithm to improve prediction accuracy. Please describe in detail what algorithm you chose and why.
Note: Blogs and tutorials can be used to inform you in writing your answer, but if your code is copied from the blog or tutorial then your answer will be worth at most half-credit.
Cmd 39
# Please provide your code answer for Question 1 Here
Cmd 40
Please provide your brief, written description of your code here