Leverages dplyr to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:

  1. Functions that ouput a ggplot2 object
  2. Functions that outputs a data.frame object with the calculations
  3. Creates the formula needed to calculate bins for a Histogram or a Raster plot

Installation

You can install the released version from CRAN:

Or the the development version from GitHub, using the remotes package:

Connecting to a data source

Example

In addition to database connections, the functions work with sparklyr. A Spark DataFrame will be used for the examples in this README.

library(sparklyr)
sc <- spark_connect(master = "local")
spark_flights <- copy_to(sc, nycflights13::flights, "flights")

ggplot

Histogram

By default dbplot_histogram() creates a 30 bin histogram

Use binwidth to fix the bin size

spark_flights %>% 
  dbplot_histogram(distance, binwidth = 400)

Because it outputs a ggplot2 object, more customization can be done

spark_flights %>% 
  dbplot_histogram(distance, binwidth = 400) +
  labs(title = "Flights - Distance traveled") +
  theme_bw()

Raster

To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.

A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.

  • If no fill argument is passed, the default calculation will be count, n()
spark_flights %>%
  dbplot_raster(sched_dep_time, sched_arr_time) 

  • Pass an aggregation formula that can run inside the database

  • Increase or decrease for more, or less, definition. The resolution argument controls that, it defaults to 100

Bar Plot

  • dbplot_bar() defaults to a tally() of each value in a discrete variable

  • Pass a formula, and column name, that will be operated for each value in the discrete variable
spark_flights %>%
  dbplot_bar(origin, avg_delay =  mean(dep_delay, na.rm = TRUE))

Line plot

  • dbplot_line() defaults to a tally() of each value in a discrete variable

  • Pass a formula that will be operated for each value in the discrete variable
spark_flights %>%
  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))

Boxplot

  • It expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn’t calculate outliers. Currently, this feature works with sparklyr and Hive connections.

Calculation functions

If a more customized plot is needed, the data the underpins the plots can also be accessed:

  1. db_compute_bins() - Returns a data frame with the bins and count per bin
  2. db_compute_count() - Returns a data frame with the count per discrete value
  3. db_compute_raster() - Returns a data frame with the results per x/y intersection
  4. db_compute_raster2() - Returns same as db_compute_raster() function plus the coordinates of the x/y boxes
  5. db_compute_boxplot() - Returns a data frame with boxplot calculations

The data can be piped to a plot

spark_flights %>%
  filter(arr_delay < 100 , arr_delay > -50) %>%
  db_compute_bins(arr_delay) %>%
  ggplot() +
  geom_col(aes(arr_delay, count, fill = count))