library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
Model Data
You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within sparklyr
. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.
Exercise
Here’s an example where we use ml_linear_regression()
to fit a linear regression model. We’ll use the built-in mtcars
dataset, and see if we can predict a car’s fuel consumption (mpg
) based on its weight (wt
), and the number of cylinders the engine contains (cyl
). We’ll assume in each case that the relationship between mpg
and each of our features is linear.
Initialize the environment
We will start by creating a local Spark session and load the mtcars
data frame to it.
Prepare the data
Spark provides data frame operations that makes it easier to prepare data for modeling. In this case, we will use the sdf_partition()
command to divide the mtcars
data into “training” and “test”.
partitions <- mtcars_tbl %>%
select(mpg, wt, cyl) %>%
sdf_random_split(training = 0.5, test = 0.5, seed = 1099)
Note that the newly created partitions
variable does not contain data, it contains a pointer to where the data was split within Spark. That means that no data is downloaded to the R session.
Fit the model
Next, we will fit a linear model to the training data set:
fit <- partitions$training %>%
ml_linear_regression(mpg ~ .)
fit
Formula: mpg ~ .
Coefficients:
(Intercept) wt cyl
38.927395 -4.131014 -0.938832
For linear regression models produced by Spark, we can use summary()
to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.
summary(fit)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4891 -1.5262 -0.1481 0.8508 6.3162
Coefficients:
(Intercept) wt cyl
38.927395 -4.131014 -0.938832
R-Squared: 0.8469
Root Mean Squared Error: 2.416
Use the model
We can use ml_predict()
to create a Spark data frame that contains the predictions against the testing data set.
pred <- ml_predict(fit, partitions$test)
head(pred)
# Source: spark<?> [?? x 4]
mpg wt cyl prediction
<dbl> <dbl> <dbl> <dbl>
1 14.3 3.57 8 16.7
2 14.7 5.34 8 9.34
3 15 3.57 8 16.7
4 15.2 3.44 8 17.2
5 15.2 3.78 8 15.8
6 15.5 3.52 8 16.9
Further reading
Spark machine learning supports a wide array of algorithms and feature transformations and as illustrated above it’s easy to chain these functions together with dplyr
pipelines. To learn more see the Machine Learning article on this site. For a list of Spark ML models available through sparklyr
visit Reference - ML