Recent versions of Spark released the programming abstraction named DataFrame, which can be regarded as a table in a relational database. DataFrame is stored in a distributed manner so that different rows may locate on different machines. On DataFrame you can write sql queries, manipulate columns programatically with API etc.
Spark provides an API to load data from JSON, Parquet, Hive table etc. You can refer to the official Spark SQL programming guide for those formats. Here we show how to load csv files. And we will use the spark-csv module by Databricks.
Start the Spark shell in local mode with the command below to add extra dependencies which are needed to complete this training.
You may want to hide the log messages from spark. You can achieve that by
Now load data into the shell.
The first parameter is path to the data (in HDFS), and second is a class name, the specific adapter required to load a CSV file. Here we specified a directory name instead of a specific file name so that all files in that directory will be read and combined into one file. Next we call toDF to rename the columns in the CSV file with meaningful names. Finally, we add one more column that has double type of value instead of string which we will use ourselves for the rest of this material.
There are two methods to work with the DataFrame, either using SQL or using domain specific language (DSL).
Writing SQL is straightforward assuming you have experiences with relational databases.
Here the patientEvents DataFrame is registered as a table in sql context so that we could run sql commands. Next line is a standard sql command with where, group by and order by statements.
Next, we show how to manipulate data with DSL, the same result of previous SQL command can be achieved by:
For complete DSL functions, see DataFrame class API.
Spark SQL provides a convenient way to save data in a different format just like loading data. For example you can write
to save your transformed data in json format or
to save in csv format.
In many cases the built-in function of SQL like count, max is not enough, you can extend it with your own functions. For example, you want to find the number of different event types with the following UDF.