Option 2 - Working inside of Databricks

Overview

If the recommended path of connecting to Spark remotely with Databricks Connect does not apply to your use case, then you can install RStudio Workbench directly within a Databricks cluster as described in the sections below.

With this configuration, RStudio Workbench is installed on the Spark driver node and allows users to work locally with Spark using sparklyr.

This configuration can result in increased complexity, limited connectivity to other storage and compute resources, resource contention between RStudio Workbench and Databricks, and maintenance concerns due to the ephemeral nature of Databricks clusters.

For additional details, refer to the FAQ for RStudio in the Databricks Documentation.

Advantages and limitations

Advantages:

  • Ability for users to connect sparklyr to Spark without configuring remote connectivity
  • Provides a high-bandwidth connection between R and the Spark JVM processes because they are running on the same machine
  • Can load data from the cluster directly into an R session since RStudio Workbench is installed within the Databricks cluster

Limitations:

  • If the Databricks cluster is restarted or terminated, then the instance of RStudio Workbench will be terminated and its configuration will be lost
  • If users do not persist their code through version control or the Databricks File System, then you risk losing user’s work if the cluster is restarted or terminated
  • RStudio Workbench (and other RStudio products) installed within a Databricks cluster will be limited to the compute resources and lifecycle of that particular Spark cluster
  • Non-Spark jobs will use CPU and RAM resources within the Databricks cluster
  • Need to install one instance of RStudio Workbench per Spark cluster that you want to run jobs on

Requirements

  • A running Databricks cluster with a runtime version 4.1 or above
  • The cluster must not have “table access control” or “automatic termination” enabled
  • You must have “Can Attach To” permission for the Databricks cluster

Preparation

The following steps walk through the process to install RStudio Workbench on the Spark driver node within your Databricks cluster.

The recommended method for installing RStudio Workbench to the Spark driver node is via SSH. However, an alternative method is available if you are not able to access the Spark driver node via SSH.

Configure SSH access to the Spark driver node

Configure SSH access to the Spark driver node in Databricks by following the steps in the SSH access to clusters section of the Databricks Cluster configurations documentation.

Note: If you are unable to configure SSH access or connect to the Spark driver node via SSH, then you can follow the steps in the Get started with RStudio Workbench section of the RStudio on Databricks documentation to install RStudio Workbench from a Databricks notebook, then skip to the access RStudio Workbench section of this documentation.

Connect to the Spark driver node via SSH

Connect to the Spark driver node via SSH on port 2200 by using the following command on your local machine:

ssh ubuntu@<spark-driver-node-address> -p 2200 -i <path-to-private-SSH-key>

Replace <spark-driver-node-address> with the DNS name or IP address of the Spark driver node, and <path-to-private-SSH-key> with the path to your private SSH key on your local machine.

Install RStudio Workbench on the Spark driver node

After you SSH into the Spark driver node, then you can follow the typical steps to install RStudio Workbench in the RStudio documentation. In the installation steps, you can select Ubuntu as the target Linux distribution.

Configure RStudio Workbench

The following configuration steps are required to be able to use RStudio Workbench with Databricks.

Add the following configuration lines to /etc/rstudio/rserver.conf to use proxied authentication with Databricks and enable the administrator dashboard:

auth-proxy=1
auth-proxy-user-header-rewrite=^(.*)$ $1
auth-proxy-sign-in-url=<domain>/login.html
admin-enabled=1

Add the following configuration line to /etc/rstudio/rsession-profile to set the PATH to be used with RStudio Workbench:

export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH

Add the following configuration lines to /etc/rstudio/rsession.conf to configure sessions in RStudio Workbench to work with Databricks:

session-rprofile-on-resume-default=1
allow-terminal-websockets=0

Restart RStudio Workbench:

sudo rstudio-server restart

Access RStudio Workbench

From the Databricks console, click on the Databricks cluster that you want to work with:

From within the Databricks cluster, click on the Apps tab:

Click on the Set up RStudio button:

To access RStudio Workbench, click on the link to Open RStudio:

If you configured proxied authentication in RStudio Workbench as described in the previous section, then you do not need to use the username or password that is displayed. Instead, RStudio Workbench will automatically login and start a new RStudio session as your logged-in Databricks user:

Other users can access RStudio Workbench from the Databricks console by following the same steps described above. You do not need to create those users in RStudio Workbench or their home directory beforehand.

Configure sparklyr

Use the following R code to establish a connection from sparklyr to the Databricks cluster:

SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")

Additional information

For more information on using RStudio Workbench inside of Databricks, refer to the sections on RStudio on Databricks (AWS) or RStudio on Databricks (Azure) in the Databricks documentation.