Spoiler Alert: A fix has been released!
A solution was released after I wrote this post.
odbc 1.3.0 is now on CRAN!🎉
— Jim Hester (@jimhester_) October 28, 2020
odbc allows you to connect to many different databases efficiently from #rstats
This release avoids the infamous 'Invalid Descriptor' error in SQL Server, so you no longer need worry about column order.
See https://t.co/asZXCn7T34 for more info! pic.twitter.com/BDLDC2hzH9
Here are the steps that worked for me:
- Update to the latest released version of
odbc
install.packages("odbc")
Update to the latest version of the Microsoft Driver for SQL Server. I suggest Googling, “odbc driver for sql server,” to find the latest version.
Specify the ODBC driver you downloaded in step 2 when connecting to your database.
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "ODBC Driver 17 for SQL Server",
Server = "testsqlserver",
UID = user_name,
Database = "dbi_error",
PWD = keyring::key_get("dbi_test", user_name)
)
To determine what to specify in the Driver = "ODBC Driver 17 for SQL Server"
line, I used the following command to get the name of drivers installed on my system…
stringr::str_subset(
unique(odbc::odbcListDrivers()$name),
"SQL Server")
## [1] "SQL Server" "SQL Server Native Client 11.0"
## [3] "ODBC Driver 17 for SQL Server"
Motivation
I use the dbplyr, dplyr, DBI and odbc R packages to extract data held in Microsoft SQL on a daily basis.
I used to run into the, “Invalid Descriptor Index,” error documented here constantly. It was awful! I spent so much time trying to find the, “problem,” columns in tables that often had over 100 columns. The time I spent trying to get my queries to work severely reduced the time I had available to clean, transform and analyze my data.
Through the work of many great people sharing their expertise on GitHub, I was able to identify a workaround that worked well for me.
The workaround helped me continue my daily work but did nothing to prevent others from encountering the same issue in the future. One of the great things about R is the community and I would like to do my part to help. That might involve making a code change in a package or in another component involved.
I don’t yet have a firm grasp on what, from a technical standpoint, leads to the failure but I do know many things are involved - specifically multiple R packages, ODBC drivers and Microsoft SQL Server.
I am comfortable as an R user and debugging C++ code individually but have not spent much time debugging multi-component issues in R.
This investigation is an opportunity for me:
- Advance my R debugging skills so I well-armed to tackle other multi-component R issues in the future
- Understand, at a deep level, the root cause, and contributing factors, of the failure that leads to this error. With that understanding I will be a much better position to propose, evaluate and contribute to solutions.
Scope
I’ll be reproducing this issue using Microsoft SQL Server Express which is a free download available here.
I’ll be focusing my analysis on when this error occurs running R on the Microsoft Windows operating system.
Debugging the error
One of the primary motivations for this post is to force me to advance my debugging skills.
I decided to start with the debugging chapter from Hadley Wickham’s Advanced R book.
Debugging Step 1: Google it!
I started out by Googling a portion of the error string, “[Microsoft][ODBC SQL Server Driver]Invalid Descriptor Index
,” which yielded quite a few relevant hits, including a surprising amount of duplicate GitHub issues raised in the odbc project.
Based on Hadley’s recommendation in the book I decided to check out the errorist package which gave me much more targeted hits because it included the entire text of the error message in the search string. It was also automatic, saving me a second or two of copying and pasting. I’ll be incorporating this package into my debugging work-flow going forward!
Here is what I gleaned:
Judging by the sheer frequency, and recency, of this issue being reported, I don’t believe this is a corner case affecting a small minority of R users. I’ve captured a list of relevant issues, and the Stack Overflow thread I found most helpful in Appendix: Some Reports of, and references to, this issue.
This is caused by an ODBC (the Microsoft C interface - not the R odbc package) constraint:
- Called out as a limitation of
SQLGetData()
citing this portion of the documentation:
- Called out as a limitation of
There are several restrictions on using SQLGetData. Generally, columns accessed with SQLGetData:
…
Must have a higher column number than the last bound column. For example, if the last bound column is column 3, it is an error to call SQLGetData for column 2. For this reason, applications should make sure to place long data columns at the end of the select list.
SOURCE: Microsoft documentation on “Getting Long Data”
The most common workaround is to put “long” columns at the end of the select statement, and there are a few different methods to do this. See Appendix: Methods to move “long” data to end of query for options to do this.
People have reported that the error doesn’t occur with:
- RStudio’s professional drivers which are part of their commercial (you need to pay for) offerings. SOURCE: GitHub issue comment from Jim Hester
- RODBC package. Reported in this GitHub issue, although Jim Hester reported here that RODBC is simply truncating the long data.
- FreeTDS drivers SOURCE: Bill Evans comment on GitHub See Appendix: Installing and using the FreeTDS driver on Windows. Note that this is an evolving approach and Bill Evans lists some potential cons to using FreeTDS:
- Might be slower than Microsoft drivers
pool
package doesn’t reconnect after disconnects- Hasn’t been able to identify a way to use SSL for the database connections
Debugging Step 2: Make it repeatable
This is a little more complex than the standard reprex because it requires that you have access to a SQL Server database which has problematic data, and ideally that you can modify data in order to test theories.
Create the test database
Simple steps based on instructions from Microsoft.
library(tidyverse)
user_name <-
keyring::key_list("dbi_test")$username
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "SQL Server",
Server = "testsqlserver",
UID = user_name,
PWD = keyring::key_get("dbi_test", user_name)
)
# kill any connections before trying to drop database
DBI::dbExecute(con,
"IF DB_ID (N'dbi_error') IS NOT NULL
ALTER DATABASE [dbi_error] set single_user
with rollback immediate")
## [1] 0
# Delete the test database if it already exists
DBI::dbExecute(con,
"IF DB_ID (N'dbi_error') IS NOT NULL
DROP DATABASE dbi_error")
## [1] 0
# Create the test database
DBI::dbExecute(con, "CREATE DATABASE dbi_error")
## [1] 0
# Verify it was created
DBI::dbGetQuery(con,
"SELECT name, size, size*1.0/128 AS [Size in MBs]
FROM sys.master_files WHERE name = N'dbi_error'")
## name size Size in MBs
## 1 dbi_error 1024 8
DBI::dbDisconnect(con)
Write test data
Create working and non-working data:
- A table, “problem_data,” with, “long” columns that are not the last columns in the table.
- A table, “working_data,” with the same data, only this time with the, “long,” column at the end of the table.
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "SQL Server",
Server = "testsqlserver",
UID = user_name,
Database = "dbi_error",
PWD = keyring::key_get("dbi_test", user_name)
)
dplyr::copy_to(con,
tibble(a = 1:3,
b = blob::blob(raw(1)), #blob data NOT at
# end of table
c = 1
),
"problem_data", # will expose problem
temporary = FALSE
)
dplyr::copy_to(con,
tibble(a = 1:3,
c = 1,
b = blob::blob(raw(1)) # blob data at end
# of table
),
"working_data", # will not expose problem
temporary = FALSE
)
Reproduce the error
Here the problem occurs - because the column with, “long,” data is NOT the last column in the table.
dplyr::tbl(con,"problem_data") %>%
collect()
## Error in result_fetch(res@ptr, n): nanodbc/nanodbc.cpp:3280: 07009: [Microsoft][ODBC SQL Server Driver]Invalid Descriptor Index
Here the problem does not occur - because the column with “long” data is the last column in the table.
dplyr::tbl(con, "working_data") %>%
collect()
## # A tibble: 3 x 3
## a c b
## <int> <dbl> <blob>
## 1 1 1 <raw 1 B>
## 2 2 1 <raw 1 B>
## 3 3 1 <raw 1 B>
SOURCE: https://github.com/r-dbi/odbc/issues/309#issue-507726267
Debugging Step 3: Figure out where it is (the error)
dplyr::tbl(con,"problem_data") %>%
collect()
## Error in result_fetch(res@ptr, n): nanodbc/nanodbc.cpp:3280: 07009: [Microsoft][ODBC SQL Server Driver]Invalid Descriptor Index
Executing traceback()
yields
Recalling that stack track reads from bottom to top I can see it progresses from dplyr::tbl()
through the final stop()
and that there are some database related functions along the way, for example dbFetch
.
What I cannot see is what packages export these functions and where in the package code (.r file and line #) these functions are located.
Hadley does mention this..
If you’re calling code that you source()d into R, the traceback will also display the location of the function, in the form filename.r#linenumber. These are clickable in RStudio, and will take you to the corresponding line of code in the editor.
Based on what on my previous learnings about how the dplyr, dbplyr, DBI and odbc packages work together to bring data stored in Microsoft SQL Server databases into R, I decided to start by building the following packages from source:
- dpyr
- dbplyr
- DBI
- odbc
See Appendix: Building a package from source using RStudio for my short notes on how to build a package from source using RStudio.
I then sourced all the .R files for each of these packages. I received errors regarding not being able to locate certain functions unless I sourced the files containing those functions before others.
There is probably an easier way to do this.
library(rlang)
library(R6)
library(assertthat)
library(DBI)
library(odbc)
map(list.files("C:\\r_projects\\dplyr\\R", full.names = TRUE),
source)
source("C:\\r_projects\\dbplyr\\R\\translate-sql-helpers.R")
source("C:\\r_projects\\dbplyr\\R\\translate-sql-conditional.R")
source("C:\\r_projects\\dbplyr\\R\\translate-sql-paste.R")
source("C:\\r_projects\\dbplyr\\R\\translate-sql-string.R")
source("C:\\r_projects\\dbplyr\\R\\translate-sql-quantile.R")
source("C:\\r_projects\\dbplyr\\R\\translate-sql-window.R")
map(list.files("C:\\r_projects\\dbplyr\\R", full.names = TRUE),
source)
map(list.files("C:\\r_projects\\DBI\\R", full.names = TRUE),
source)
source("C:\\r_projects\\odbc\\R\\Driver.R")
map(list.files("C:\\r_projects\\odbc\\R", full.names = TRUE),
source)
I then set breakpoints in result_fetch()
because according to the output of traceback()
, that is the last call in the call stack before stop()
is called.
debugonce("result_fetch")
That didn’t change behavior - the breakpoint didn’t trigger a browser as expected.
I went further “up” the stack and set a breakpoint on dbFetch()
debugonce("dbFetch")
dplyr::tbl(con,"problem_data")
Now reproducing the error breaks into the browser and displays the Traceback window.
NOTE: I had to expand the Traceback window to display it.
Note that I still did not see line numbers in the traceback!
Along the way of randomly trying things I found that using debug()
instead of debugonce()
resulted in line numbers being displayed.
debug("dbFetch")
dplyr::tbl(con,"problem_data")
What I found interesting was that the output of traceback()
did not match the output of `traceback() after reproducing the error.
My suspicion was that dbFetch()
is called multiple times as a result of my call to dplyr::tbl()
.
To test that hypothesis, I clicked, “continue,” in the debugger and sure enough the breakpoint was hit again. This time the call stack looked similar to the output of traceback()
when the error occurred.
Note that for some reason the output of the Traceback window in RStudio stopped displaying line numbers. It works for the first attempt after running debug()
and then stops displaying line numbers for subsequent attempts. Weird!
I then clicked the, “next,” button in Rstudio to execute the statement. The output of RStudio’s Traceback window did not change, but running where()
in the Browser terminal yielded additional detail.
A big benefit of adding where()
to my debugging work flow is that it provided the complete path to each source file, whereas Traceback only gave the file name.
We can see that dbFetch()
is about to be called from line 123 of verb-compute.R from inside the db_collect.DBIConnection
function.
Unfortunately, trying to step further into the code with the next or step into debugger function resulted in the error being returned and execution stopping.
I still was not able to figure out where dbFetch()
was defined.
I decided to change approaches so I set the RStudio option to break into the code on error.
I also stopped debugging dbFetch()
undebug("dbFetch")
Reproducing the error now broke into the Browser, and executing where()
shed some additional light.
I could see that dbFetch()
was resulting in a call into odbc::result_fetch()
and then that ends up with a stop()
message from a file called RcppExports.R
Looking at the help for the setMethod()
function I can see that it defines what code gets run when dbFetch()
is called, in this case result_fetch()
.
Still not entirely clear where the definition for result_fetch()
is but moving up the callstack I can see that stop()
is called from RcppExports.R
, line 81 so I took a peek at that code. By the way, the fact that the file contains, “Rcpp,” indicated to me that this had something to do with C++ code.
Well sure enough there was a call to .Call()
, which according to the help provides, “Modern Interfaces to C/C++ code.”
Searching my file system for files with the string "_odbc_result_fetch" yielded hits odbc C++ source files.
At the same position in the call stack there is a string, “message =”nanodbc/nanodbc.cpp:3186…"
I opened r_projects/odbc/src/nanodbc/nanodbc.cpp
and took a look at the code around line 3186.
This code is inside a function result::result_impl::get_ref_impl
and specifically inside a case statement that is executed when a column is of type SQL_C_BINARY
.
The error itself results from a call to NANODBC_CALL_RC
which is a wrapper to ODBC functions. The ODBC function being called is SQLGetData()
, which according to the documentation, “retrieves data for a single column in the result set or for a single parameter after SQLParamData returns SQL_PARAM_DATA_AVAILABLE. It can be called multiple times to retrieve variable-length data in parts.”
Recalling what I read in the other Github issues, I referred back to the Microsoft documentation on getting long data and found this statement.
Because such data cannot be stored in a single buffer, it is retrieved from the driver in parts with SQLGetData after the other data in the row has been fetched.
So my hypothesis is that SQLGetData()
is being called on a “long” column before all other “non-long” columns have been retrieved. This echo’s other comments on some of the issues, and also my testing where moving the “long” column to the end of the select statement.
Taking a broader look at this nanodbc code I can see that SQLGetData()
is being called inside a case statement.
I suspect that this second part of the case statement is being run for “non-long” columns and SQLGetData()
is being called when the column comes back as binary.
I could confirm this by setting up C++ debugging of this nanodbc code but after looking at what it would take for my Windows config I decided not to take the time to do that.
In conclusion, I can see the error is being returned inside nanodbc C++ code and that nanodbc is used by the odbc package.
Debugging Step 4: Fix it and test it
Other than the workarounds detailed in the Debugging Step 1: Google it! section, I believe reading all columns with SQLGetData()
might a suitable option based on the Microsoft Documentation on reading long data
An application can actually retrieve any type of data with SQLGetData, not just long data, although only character and binary data can be retrieved in parts.
The documentation does indicate that this is more complex. I have also come across some comments that this approach could impact performance. I am not aware of whether this a significant concern in practice.
For now, I’ve added a function resembling the first method in Appendix: Methods to move “long” data to end of query to my personal suite of functions. I use it after receiving the, “Invalid descriptor index,” error on a table.
I am optimistics that this will eventually be resolved so R users don’t encounter this issue in the future. I am following this particular PR in nanodbc.
Optional binding to support out-of-order retrival of unbound columns with SQLGetData
Appendix: Some Reports of, and references to, this issue
Issues logged against odbc R package:
- https://github.com/r-dbi/odbc/issues/10
- https://github.com/r-dbi/odbc/issues/86
- https://github.com/r-dbi/odbc/issues/112
- https://github.com/r-dbi/odbc/issues/256
- https://github.com/r-dbi/odbc/issues/309
- https://github.com/r-dbi/odbc/issues/331
- https://github.com/r-dbi/odbc/issues/358
Pull request in odbc:
- looks like it adds better error messaging (tremendously valuable!) vs. fully addressing the problem
Issues logged against DBI R package:
Issues logged against nanodbc:
- https://github.com/nanodbc/nanodbc/issues/149
- https://github.com/nanodbc/nanodbc/issues/228 <- this discusses a potential fix in nanodbc
Appendix: Methods to move “long” data to end of query
Method 1: Identify and move columns over a certain size to the end
tbl_long_cols <- function(con, table, size_thresh = 200) {
long_cols <-
odbc::odbcConnectionColumns(con, table) %>%
dplyr::filter(column_size >= size_thresh) %>%
dplyr::pull(name)
dplyr::tbl(con, table) %>%
dplyr::select(-tidyselect::all_of(long_cols),
tidyselect::everything(),
tidyselect::all_of(long_cols))
}
tbl_long_cols(con,"problem_data") %>%
dplyr::collect()
## # A tibble: 3 x 3
## a c b
## <int> <dbl> <blob>
## 1 1 1 <raw 1 B>
## 2 2 1 <raw 1 B>
## 3 3 1 <raw 1 B>
Method 2: sort by column size and move longest columns to end
This also demonstrates how you can re-order the, “long,” columns freely after the data has been retrieved from the database.
tbl_long_cols_2 <- function(con, table) {
cols_sorted <-
odbc::odbcConnectionColumns(con, table) %>%
dplyr::arrange(column_size) %>%
pull(name)
dplyr::tbl(con, table) %>%
dplyr::select(tidyselect::all_of(cols_sorted))
}
tbl_long_cols_2(con, "problem_data") %>%
dplyr::collect() %>%
dplyr::select(a,b,c) #can move the "long" cols back after retrieval from db
## # A tibble: 3 x 3
## a b c
## <int> <blob> <dbl>
## 1 1 <raw 1 B> 1
## 2 2 <raw 1 B> 1
## 3 3 <raw 1 B> 1
Method 3: Move certain data types to the end
There are certain data types that are, “long.” You can look for these long data types and move them to the end of the select statement.
tbl_long_cols_3 <- function(con, table) {
long_cols <-
odbc::odbcConnectionColumns(con, table) %>%
dplyr::filter(field.type == "varbinary") %>%
pull(name)
dplyr::tbl(con, table) %>%
dplyr::select(-tidyselect::all_of(long_cols),
tidyselect::everything(),
tidyselect::all_of(long_cols))
}
tbl_long_cols_3(con, "problem_data") %>%
collect()
## # A tibble: 3 x 3
## a c b
## <int> <dbl> <blob>
## 1 1 1 <raw 1 B>
## 2 2 1 <raw 1 B>
## 3 3 1 <raw 1 B>
Appendix: Installing and using the FreeTDS driver on Windows
- View the latest FreeTDS build for your architecture (32 or 64 bit). Here is a link to the current FreeTDS build
Click one of the rows.
- Click Artifacts and click a .zip file - I tried
vs2017_64-master.zip
variant but received an error when trying to register the driver dll. I believe this is due to the dependency on OpenSSL. I looked to install OpenSSL but it had quite a few dependencies that I didn’t feel like setting up (Perl and make for example)
Extract the contents of the zip file
Open a command prompt or PowerShell window and execute the following command:
regsvr32 "C:\temp\vs2017_64-master\bin\tdsodbc.dll"
- Success should yield a message like this:
- Verify that R can see the new driver
odbc::odbcListDrivers() %>%
dplyr::filter(name == "FreeTDS")
## name attribute value
## 1 FreeTDS UsageCount 2
## 2 FreeTDS APILevel 2
## 3 FreeTDS ConnectFunctions YYN
## 4 FreeTDS DriverODBCVer 03.00
## 5 FreeTDS FileUsage 0
## 6 FreeTDS SQLLevel 2
Let’s connect using the FreeTDS driver and verify the issue does not occur.
NOTE: I had to make a couple changes, in addition to changing the driver to get this to work:
- Specified SQL port(typically this is 1433)
- Set the client character set to UTF-8 (referenced in this post).
user_name <-
keyring::key_list("dbi_test")$username
con_freetds <- DBI::dbConnect(
odbc::odbc(),
Driver = "FreeTDS",
Server = "testsqlserver",
UID = user_name,
PWD = keyring::key_get("dbi_test", user_name),
ClientCharset = "UTF-8",
port = 1433,
Database = "dbi_error"
)
dplyr::tbl(con_freetds,"problem_data") %>%
collect()
## # A tibble: 3 x 3
## a b c
## <int> <blob> <dbl>
## 1 1 <raw 1 B> 1
## 2 2 <raw 1 B> 1
## 3 3 <raw 1 B> 1
DBI::dbDisconnect(con_freetds)
It works!
A few NOTES:
It’s possible there may be an implication for extended (longer than 8-bit) character data when setting character set to 8-bit. I haven’t evaluated that.
I also tested FreeTDS in a Windows domain joined environment and it worked!
SOURCE: Bill Even’s GitHub Issue post where he shares solution
SOURCE: Setting up FreeTDS to work with ODBC Data Source Administrator on Windows 7
SOURCE: FreeTDS user guide
Appendix: Building a package from source using RStudio
- Install the latest version of rtools
- Follow these instructions from Jenny Bryan to clone the project using RStudio.
- Open the project file for the package
- Build -> Install and Restart