Exercise 5: Summarize Data Using HDInsight Spark
Synopsis: In this exercise, attendees will prepare a summary of flight delay data in HDFS using Spark SQL.
- Navigate to the blade for your Spark cluster in the Azure Portal. An easy way to find your Spark cluster is to navigate to the resource group you created during the workshop setup using the Resource Group item on the leftmost navigation area in the portal. Once you select your resource group you will see all of the resources that were created for the workshop (including the Spark cluster).
In the Quick Links section, click Cluster Dashboards.
From the Cluster Dashboards , click Jupyter Notebooks.
A new browser tab should open and you will be prompted for credentials. Log in with the following:
- User name: cortana
- Password: Password.1!!
Copy the below text and paste it into the Jupyter notebook.
-
Once the previous command has completed, click in the empty paragraph below the paragraph in which you entered your Scala script. In this paragraph, you are going to author a SQL query to view the results of the table you just created. In order to switch from running Scala code, to running SQL, your first line in the paragraph must start with %%. As before, click the Play icon at the top of the screen to run the command.
Click on the Table button.
You should see the results appear in a table form similar to the following:
Next, you will create a table that summarizes the flight delays data. Instead of containing one row per flight, this new summary table will contain one row per origin airport at a given hour along with a count of the quantity of anticipated delays.
In a new paragraph below, try running the following query.
-
Since the summary data looks good, the final step is save this summary calculation as a table that we can later query using Power BI in the next exercise.
To accomplish creating the table, enter a new paragraph and add the following Scala code and run it.
Click the Play icon in the top of the screen to execute this code.
To verify your table was successfully created, go to another new paragraph and enter and run the following query.
Click the Play icon in the top of the screen to execute this code.
You should see results similar to the following:
Next Exercise: Exercise 6 - Visualizing in Power BI Desktop