Motivation
I use R to extract data held in Microsoft SQL Server databases on a daily basis.
When I first started I was confused by all the different ways to accomplish this task. I was a bit overwhelmed trying to choose the, “best,” option given the specific job at hand.
I want to share what approaches I’ve landed on to help others who may want a simple list of options to get started with.
Scope
This post is about reading data from a database, not writing to one.
I prefer to use packages in the tidyverse so I’ll focus on those packages.
While it’s possible to generalize many of the concepts I write about here to other DBMS systems I will focus exclusively on Microsoft SQL Server. I hope this will provide simple, prescriptive guidance for those working in a similar configuration.
The data for these examples is stored using Microsoft SQL Server Express. Free download available here.
One last thing - these are a few options I populated my toolbox with. They have served me well over the past two years as an analyst in an enterprise environment, but are definitely not the only options available.
Setup
Connect to the server
I use the keyring package to keep my credentials out of my R code. You can use the great documentation available from RStudio to learn how do the same.
user_name <-
keyring::key_list("dbi_test")$username
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "SQL Server",
Server = "testsqlserver",
UID = user_name,
PWD = keyring::key_get("dbi_test", user_name)
)
Write some sample data
Note that I set the temporary argument to TRUE so that the data is written to the tempdb on SQL server, which will result in it being deleted on disconnection.
This results in dplyr prefixing the table name with, “##.”
SOURCE: https://db.rstudio.com/dplyr/#connecting-to-the-database
dplyr::copy_to(con, nycflights13::flights, "flights",
temporary = TRUE
)
## Created a temporary table named: ##flights
dplyr::copy_to(con, nycflights13::airlines, "airlines",
temporary = TRUE
)
## Created a temporary table named: ##airlines
dplyr::copy_to(con, nycflights13::planes, "planes",
temporary = TRUE
)
## Created a temporary table named: ##planes
Option 1: Use dplyr syntax and let dbplyr handle the rest
When I use this option
This is my default option.
I do almost all of my analysis in R and this avoids fragmenting my work and thoughts across different tools.
Examples
Example 1: filter rows, and retrieve selected columns
dplyr::tbl(con, "##flights") %>%
dplyr::filter(origin == "JFK", dest == "BQN" ) %>%
dplyr::select(flight, tailnum, dep_time, sched_dep_time,
dep_delay) %>%
# I will remove the following line if exploring the data and
# not interested in actually retrieving all records
dplyr::collect()
## # A tibble: 599 x 5
## flight tailnum dep_time sched_dep_time dep_delay
## <int> <chr> <int> <int> <dbl>
## 1 725 N804JB 544 545 -1
## 2 727 N588JB 2356 2359 -3
## 3 725 N624JB 539 545 -6
## 4 727 N789JB 2354 2359 -5
## 5 727 N618JB 235 2359 156
## 6 725 N779JB 543 545 -2
## 7 725 N652JB 608 545 23
## 8 727 N599JB 2358 2359 -1
## 9 725 N589JB 602 545 17
## 10 727 N649JB 2357 2359 -2
## # ... with 589 more rows
Example 2: join across tables and retrieve selected columns
dplyr::tbl(con, "##flights") %>%
dplyr::select(origin, dest, carrier, tailnum ) %>%
dplyr::left_join( #dplyr can join database tables
dplyr::tbl(con, "##airlines"), by = "carrier"
) %>%
mutate(airline = name) %>%
select(-carrier) %>% # only want the "human readable" airline name
dplyr::collect()
## # A tibble: 336,776 x 5
## origin dest tailnum name airline
## <chr> <chr> <chr> <chr> <chr>
## 1 EWR IAH N14228 United Air Lines Inc. United Air Lines Inc.
## 2 LGA IAH N24211 United Air Lines Inc. United Air Lines Inc.
## 3 JFK MIA N619AA American Airlines Inc. American Airlines Inc.
## 4 JFK BQN N804JB JetBlue Airways JetBlue Airways
## 5 LGA ATL N668DN Delta Air Lines Inc. Delta Air Lines Inc.
## 6 EWR ORD N39463 United Air Lines Inc. United Air Lines Inc.
## 7 EWR FLL N516JB JetBlue Airways JetBlue Airways
## 8 LGA IAD N829AS ExpressJet Airlines Inc. ExpressJet Airlines Inc.
## 9 JFK MCO N593JB JetBlue Airways JetBlue Airways
## 10 LGA ORD N3ALAA American Airlines Inc. American Airlines Inc.
## # ... with 336,766 more rows
Example 3: Summarize and count
dplyr::tbl(con, "##flights") %>%
filter(!is.na(tailnum)) %>%
dplyr::left_join(tbl(con, "##planes"), by = "tailnum") %>%
dplyr::group_by(engine) %>%
dplyr::summarise(flights = n()) %>%
dplyr::collect() %>%
dplyr::arrange(desc(flights))
## # A tibble: 7 x 2
## engine flights
## <chr> <int>
## 1 Turbo-fan 240915
## 2 <NA> 50094
## 3 Turbo-jet 40976
## 4 Reciprocating 1774
## 5 Turbo-shaft 410
## 6 4 Cycle 48
## 7 Turbo-prop 47
Quite a few tailnum values in flights, are not present in planes, interesting!
Option 2: Write SQL syntax and have dplyr and dbplyr run the query
When I use this option
I use this option when I am reusing a fairly short, existing SQL query with minor modifications.
Example 1: Simple selection of records using SQL syntax
query <- "SELECT flight, tailnum, dep_time, origin, dest
FROM ##flights"
dplyr::tbl(con, dplyr::sql(query)) %>%
dplyr::collect()
## # A tibble: 336,776 x 5
## flight tailnum dep_time origin dest
## <int> <chr> <int> <chr> <chr>
## 1 1545 N14228 517 EWR IAH
## 2 1714 N24211 533 LGA IAH
## 3 1141 N619AA 542 JFK MIA
## 4 725 N804JB 544 JFK BQN
## 5 461 N668DN 554 LGA ATL
## 6 1696 N39463 554 EWR ORD
## 7 507 N516JB 555 EWR FLL
## 8 5708 N829AS 557 LGA IAD
## 9 79 N593JB 557 JFK MCO
## 10 301 N3ALAA 558 LGA ORD
## # ... with 336,766 more rows
Example 2: Use dplyr syntax to enhance a raw SQL query
dplyr::tbl(con, dplyr::sql(query)) %>% #reuse query from previous example
# add a filter state on top of the raw sql statement
dplyr::filter(origin == "JFK", dest == "BQN" ) %>%
collect()
## # A tibble: 599 x 5
## flight tailnum dep_time origin dest
## <int> <chr> <int> <chr> <chr>
## 1 725 N804JB 544 JFK BQN
## 2 727 N588JB 2356 JFK BQN
## 3 725 N624JB 539 JFK BQN
## 4 727 N789JB 2354 JFK BQN
## 5 727 N618JB 235 JFK BQN
## 6 725 N779JB 543 JFK BQN
## 7 725 N652JB 608 JFK BQN
## 8 727 N599JB 2358 JFK BQN
## 9 725 N589JB 602 JFK BQN
## 10 727 N649JB 2357 JFK BQN
## # ... with 589 more rows
Option 3: Store the SQL query in a text file and have dplyr and dbplyr run the query
When I use this option
I use this approach under the following conditions:
- I’m reusing existing SQL code or when collaborating with someone who will be writing new code in SQL
- The SQL code is longer than a line or two
I prefer to, “modularize,” my R code. Having an extremely long SQL statement in my R code doesn’t abstract away the complexity of the SQL query. Putting the query into it’s own file helps achieve my desired level of abstraction.
In conjunction with source control it makes tracking changes to the definition of a data set simple.
More importantly, it’s a really useful way to collaborate with others who are comfortable with SQL but don’t use R. For example, I recently used this approach on a project involving aggregation of multiple data sets. Another team member focused on building out the data collection logic for some of the data sets in SQL. Once he had them built and validated he handed off the query to me and I pasted it into a text file.
Step 1: Put your SQL code into a text file
Here is some example SQL code that might be in a file
SELECT f.year
,f.month
,f.day
,f.dep_time
,a.name
,p.manufacturer
,p.model
FROM ##flights f
LEFT JOIN ##airlines a ON f.carrier = a.carrier
LEFT JOIN ##planes p ON f.tailnum = p.tailnum
WHERE p.manufacturer IS NOT NULL
Let’s say that SQL code was stored in a text file called, flights.sql
Step 2: Use the SQL code in the file to retrieve data and execute the query.
#reference: https://stackoverflow.com/a/46105261/7416441
query <- readr::read_file("flights.sql")
dplyr::tbl(con, dplyr::sql(query))
## # A tibble: 284,170 x 7
## year month day dep_time name manufacturer model
## <int> <int> <int> <int> <chr> <chr> <chr>
## 1 2013 1 1 517 United Air Lines Inc. BOEING 737-824
## 2 2013 1 1 533 United Air Lines Inc. BOEING 737-824
## 3 2013 1 1 542 American Airlines Inc. BOEING 757-223
## 4 2013 1 1 544 JetBlue Airways AIRBUS A320-232
## 5 2013 1 1 554 Delta Air Lines Inc. BOEING 757-232
## 6 2013 1 1 554 United Air Lines Inc. BOEING 737-924ER
## 7 2013 1 1 555 JetBlue Airways AIRBUS INDUSTR~ A320-232
## 8 2013 1 1 557 ExpressJet Airlines In~ CANADAIR CL-600-2B~
## 9 2013 1 1 557 JetBlue Airways AIRBUS A320-232
## 10 2013 1 1 558 JetBlue Airways AIRBUS A320-232
## # ... with 284,160 more rows