Install

Install the package

You can install the sparklyr package from CRAN as follows:

install.packages("sparklyr")

Install Spark locally

Caution

The steps in this section are only needed if you want to run Spark on your own computer. If you already have a running Spark cluster that you will use to learn sparklyr, skip this section.

This section is meant for developers new to sparklyr. You will need a running Spark environment to connect to. sparklyr can install Spark on your computer with the spark_install() function. The installed Spark environment is meant for learning and prototyping purposes. The installation works on all the major operating systems that R runs on, including Linux, macOS, and Windows.
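As a minimal sketch of the installation step described above: calling spark_install() with no arguments downloads and installs sparklyr's default Spark version, and spark_installed_versions() lists what is installed afterwards. This assumes the machine has internet access; pinning a particular release through the version argument is optional and not shown here.

```r
library(sparklyr)

# Download and install a local copy of Spark.
# With no arguments, sparklyr picks its default Spark version.
spark_install()

# List the Spark versions that are now installed on this machine.
spark_installed_versions()
```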

Please be aware that Spark is not running after installation. The next section explains how to start a single node Spark cluster on your machine.

Connect to Spark

You can use spark_connect() to connect to Spark clusters. The arguments passed to this function depend on the type of Spark cluster you are connecting to. There are several different types of Spark clusters, such as YARN, Standalone, and Kubernetes.

spark_connect() can both start and connect to the single node Spark cluster on your machine. To do that, pass “local” as the master argument:

library(sparklyr)

sc <- spark_connect(master = "local")

The sc variable now contains all of the connection information needed to interact with the cluster.
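One way to confirm that the connection works is to query it. The sketch below assumes a local Spark installation is available; it opens the same “local” connection and then uses connection_is_open() and spark_version(), both from sparklyr, to check that the session is live and which Spark version it runs.

```r
library(sparklyr)

# Start and connect to the single node Spark cluster on this machine.
sc <- spark_connect(master = "local")

# Check that the connection is live.
connection_is_open(sc)

# Report the version of Spark behind the connection.
spark_version(sc)
```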

To learn how to connect to other types of Spark clusters, see the Deployment section of this site.

Disconnect from Spark

For a “local” connection, spark_disconnect() shuts down the single node Spark environment on your machine and tells R that the connection is no longer valid. For other types of Spark clusters, spark_disconnect() only ends the Spark session; it does not shut down the Spark cluster itself.
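A short sketch of the disconnect step, assuming a local Spark installation: it opens a “local” connection, closes it with spark_disconnect(), and then checks the connection state with connection_is_open().

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# End the session. For a "local" connection this also shuts down
# the single node Spark environment started by spark_connect().
spark_disconnect(sc)

# The connection object should no longer report as open.
connection_is_open(sc)
```

To work with Spark again after disconnecting, call spark_connect() again to obtain a new connection.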

Clusters

Here are some examples of how to use spark_connect() to connect to different types of Spark clusters:

Hadoop YARN:

sc <- spark_connect(master = "yarn")

Mesos:

sc <- spark_connect(master = "mesos://host:port")

Kubernetes:

sc <- spark_connect(master = "k8s://https://server")

Apache Livy:

sc <- spark_connect(master = "http://server/livy", method = "livy")

Standalone:

sc <- spark_connect(master = "spark://master-url:7077")

Qubole: (for more info visit the Qubole page on this site)

sc <- spark_connect(method = "qubole")

Databricks: (to review the connection options, visit the Databricks page on this site)
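Cluster connections often need settings beyond the master URL. The sketch below assumes a reachable YARN cluster and shows one way to pass such settings: start from spark_config() and hand the result to the config argument of spark_connect(). The memory and core values are placeholders for illustration, not recommendations.

```r
library(sparklyr)

# Start from sparklyr's default configuration and override a few settings.
# These values are illustrative placeholders; adjust them for your cluster.
conf <- spark_config()
conf$spark.executor.memory <- "2G"
conf$spark.executor.cores <- 2

# Pass the configuration along with the master for your cluster type.
sc <- spark_connect(master = "yarn", config = conf)
```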