Posted on Categories Coding, data science, Expository Writing, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , , , , ,

Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.

NewImageImage: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

You have your PostgresSQL database up and running. Now you want to work with the data in that database in R. First, let’s create a data frame that we want to insert into the database.

# An example data frame to play with

iris = as.data.frame(iris)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  

The column names of this data frame are problematic for databases (and especially PostgreSQL) for a few reasons: the “.”s in the names can be an issue, and PostgreSQL expects column names to be all lowercase. Here’s a function to make the column names db safe:

# make names db safe: no '.' or other illegal characters,
# all lower case and unique
dbSafeNames = function(names) {
  names = gsub('[^a-z0-9]+','_',tolower(names))
  names = make.names(names, unique=TRUE, allow_=TRUE)
  names = gsub('.','_',names, fixed=TRUE)
  names
}
colnames(iris) = dbSafeNames(colnames(iris))
summary(iris)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50

Now let’s open up a database connection and insert the table.

# Create a connection to the database
library('RPostgreSQL')

## Loading required package: DBI

pg = dbDriver("PostgreSQL")

# Local Postgres.app database; no password by default
# Of course, you fill in your own database information here.
con = dbConnect(pg, user="ninazumel", password="",
                 host="localhost", port=5432, dbname="ninazumel")

# write the table into the database.
# use row.names=FALSE to prevent the query 
# from adding the column 'row.names' to the table 
# in the db
dbWriteTable(con,'iris',iris, row.names=FALSE)

## [1] TRUE

The function dbWriteTable() returns TRUE if the table was successfully written. Note this call will fail if iris already exists in the database. Use overwrite=TRUE to force overwriting of an existing table, and append=TRUE to append to an existing table.

Now you can read the table back out.

# read back the full table: method 1
dtab = dbGetQuery(con, "select * from iris")
summary(dtab)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    

# read back the full table: method 2
rm(dtab)
dtab = dbReadTable(con, "iris")
summary(dtab)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    species         
##  Length:150        
##  Class :character  
##  Mode  :character  

Of course, the point of using a database is to extract subsets or transformations of your data, using SQL.

# get part of the table
rm(dtab)
dtab = dbGetQuery(con, "select sepal_length, species from iris")
summary(dtab)

##   sepal_length     species         
##  Min.   :4.300   Length:150        
##  1st Qu.:5.100   Class :character  
##  Median :5.800   Mode  :character  
##  Mean   :5.843                     
##  3rd Qu.:6.400                     
##  Max.   :7.900

You can use dbSendQuery for sending queries that don’t return a data-frame-like result.

# remove table from database
dbSendQuery(con, "drop table iris")

# commit the change
dbCommit(con)

When you are done, disconnect.

# disconnect from the database
dbDisconnect(con)

And that’s it!

Extra: RPostgreSQL and sqldf

If you are accustomed to manipulating data with SQL, you may prefer SQL notation to the sometimes convoluted calling conventions of the analogous R operations like aggregate(), or the functions in the dplyr package. In this case you have probably already discovered the sqldf package, which allows you to manipulate data frames using SQL. If you are a sqldf user, there is an additional subtlety if you are also using RPostgreSQL or other R packages for talking to databases: sqldf uses its own internal (and ephemeral) database to perform its operations, but if RPostgreSQL is loaded, sqldf will pick up your PostgreSQL driver by default. This is probably not what you want.

options(gsubfn.engine = "R")
library(sqldf)  
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## sqldf will default to using PostgreSQL <=== NOTE THIS!

# Now try to use sqldf to aggregate sepal_length by species
query = "select avg(sepal_length) avg_sepal_length, 
                species 
         from dtab 
         group by species"

sqldf(query)

# Error in postgresqlNewConnection(drv, ...) : 
#   RS-DBI driver: (could not connect postgres@localhost on dbname "test"
# )
# Error in !dbPreExists : invalid argument type

To use sqldf on local data frames, you must specify the driver and dbname explicitly.

sqldf(query, drv="SQLite", dbname=":memory:")

##   avg_sepal_length    species
## 1            5.006     setosa
## 2            5.936 versicolor
## 3            6.588  virginica

Extra: PostgreSQL and dplyr

If you do use dplyr, the good news is that you can connect to a PostgreSQL database directly through the dplyr function src_postgres().

library('dplyr')

# Connect to local PostgreSQL via dplyr
localdb <- src_postgres(dbname = '',
                           host = 'localhost',
                           port = 5432,
                           user = 'ninazumel',
                           password = '')

# cheat and access the db connection directly
# assume we have made the colnames db safe
dbWriteTable(localdb$con,'iris',iris, row.names=FALSE)

## [1] TRUE

The tbl() command lets you access tables in the database remotely, and sql() lets you send queries.

# this is not a data frame; it's a dplyr PostgreSQL handle into the database
d = tbl(localdb, "iris") 
d

## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel]
## From: iris [150 x 5]
## 
##    sepal_length sepal_width petal_length petal_width species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## ..          ...         ...          ...         ...     ...

# this is a data frame
dtab = as.data.frame(d)

# send a query through dplyr
query = "select avg(sepal_length) avg_sepal_length, 
                species 
         from iris
         group by species"
dsub = tbl(localdb, sql(query))
dsub

## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel]
## From:  [?? x 2]
## 
##    avg_sepal_length    species
## 1             5.936 versicolor
## 2             6.588  virginica
## 3             5.006     setosa
## ..              ...        ...

# make it local
dsub = as.data.frame(dsub)
summary(dsub)

##  avg_sepal_length   species         
##  Min.   :5.006    Length:3          
##  1st Qu.:5.471    Class :character  
##  Median :5.936    Mode  :character  
##  Mean   :5.843                      
##  3rd Qu.:6.262                      
##  Max.   :6.588

# shuts down database
rm(list=c('d','localdb')); gc() 

4 thoughts on “Using PostgreSQL in R: A quick how-to”

    1. True (and I am sure Nina knows the quoting rules and was writing proscriptively). Also rpg looks interesting, will have to check it out for our PostgreSQL needs.

      That being said there are reasons Nina wrote what she wrote (I did some editing for her, so I know some of the trade-offs).

      Just about anything can be used as a variable in R with the backwards single-quote. And just about anything can be used in SQL if you configure the column quote (often the double quote for columns and the forward single quote for values). And SQL tends to use dots for table and schema naming, so having dots in SQL column names is about is as bad as having dollar signs in R columns/variables. Finally there is the issue of case sensitivity versus case insensitivity (and lowercase display in other SQL tools).

      However, being able to use the same string in R and in SQL tools without quoting is a great convenience. It is better for beginners to stick to easy column names, as it leaves them more time to think on their actual analysis goals (versus remembering quoting rules).

      One of the points of using RPostgreSQL was the fact it roughly constrains itself to things like the DBI interface (and I thought one of the points of rpg was avoiding such interfaces) so it is quite easy to develop in a database agnostic way (allowing the same code to work with many other database providers) and to use additional packages that depend on such common interfaces (like the mentioned dplyr package).

      Anyway, thanks for your comment (and sorry the reply got so long).

  1. perhaps a latter piece would integrate PL/R. I find the functionality intriguing, albeit not used in the heat of battle. the commercial databases now offer similar functionality, although not all with R.

Comments are closed.