Data Import & Export in R
Please direct questions and comments about these pages, and the R-project in general, to Dr. Tom Philippi.
Much environmental data is either large or complex, and thus requires different approaches than reading a tab or comma delimited file.
Understanding data objects in R is important--for some applications, it is not enough to get the data into R, the data must be imported into an object with the right structure and attributes. While most graphical and statistical tools in R apply to data frames, some analyses require data in specific, more complex structures. For example, SpatialPolygons is a data object in R that is equivalent to an ESRI polygon shapefile containing information for 1 or more polygons.
Flat Files (ASCII text, comma or tab delimited files)
The menu commands in Rcommander handle most simple cases of loading data from a file into an R dataframe. The underlying functions tend to have more optional parameters and much greater flexibility and functionality. The read.table() and scan() functions are sufficient for handling data in simple delimited text files. read.table() is intelligent enough to determine the data types of each variable, and if the variable names are in the first row, they can be directly assigned to the vector objects:
testdata <- read.table("mydata.dat", header=TRUE)
If you need to give the full path name, either use forward slashes or use double backslashes instead of backslashes between levels of folders.
testdata <- read.table("d:/data/mydata.dat", header=TRUE) # valid
testdata <- read.table("d:\\data\\mydata.dat", header=TRUE) # valid
testdata <- read.table("d:\data\mydata.dat", header=TRUE) # invalid
The default delimiter between values is whitespace: a space, tab, or newline. For comma-delimited files, read.table() needs to be told that commas are delimiters:
testdata <- read.table("d:/data/mydata.csv", header=TRUE, sep=",")
You could also use the variants read.csv() or read.csv2().
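The variants differ only in their defaults. A self-contained sketch (writing tiny throwaway files to temporary locations so the example runs anywhere):

```r
# read.csv() presets header=TRUE and sep=","
tf1 <- tempfile(fileext=".csv")
writeLines(c("site,count", "A,3", "B,5"), tf1)
testdata <- read.csv(tf1)

# read.csv2() is for the European convention: sep=";" with comma as decimal point
tf2 <- tempfile(fileext=".csv")
writeLines(c("site;weight", "A;3,5", "B;4,2"), tf2)
testdata2 <- read.csv2(tf2)
```

read.delim() and read.delim2() are the analogous presets for tab-delimited files.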
One can get the standard file open dialog for your operating system by using the file.choose() function:
testdata <- read.table(file.choose(), header=TRUE)
read.table() also works with remote files via URLs:
testdata2 <- read.table("https://science.nature.nps.gov/im/monitor/stats/R/data/MyData.csv", header=TRUE, sep=",")
This is more commonly useful for ftp access via ftp:// urls.
Note that read.table() will return an error if not all lines have the full set of values for variables. This is common for data from spreadsheets when the last (right) columns have missing values.
By default, only blanks are read as missing values. If you have some other indication for a missing value, you need to let R know what that missing value character is by specifying na.strings:
testdata <- read.table("mydata.csv", header=TRUE, sep=",", na.strings=".")
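na.strings accepts a vector, so several missing-value codes can be handled at once. A self-contained sketch with a made-up file:

```r
# write a tiny csv where "." and -9999 both mean "missing"
tf <- tempfile(fileext=".csv")
writeLines(c("site,elev,count", "A,120,3", "B,.,-9999"), tf)
testdata <- read.csv(tf, na.strings=c(".", "-9999"))
# both codes become NA, and elev stays numeric instead of becoming character
testdata
```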
Fixed-column data can be imported via read.fwf(). Columns are specified as a list of field widths instead of Fortran formats or COBOL descriptors, so you may need to define dummy variables to discard unwanted columns, or use read.fortran(). Also, column/variable names generally don't fit in the column widths, so it is common to read the data into a data object, then assign names to the vectors in that object:
myfwffile <- read.fwf("file.txt", widths=c(2,4,1,1,3))
names(myfwffile) <- c("col1", "col2", "third_variable", "v4", "v5")
If your file happens to be documented with a Fortran FORMAT statement, you can use read.fortran().
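read.fortran() takes a vector of Fortran-style descriptors (F for floating point, I for integer, A for character, X to skip columns). A self-contained sketch with a made-up two-line file:

```r
# columns: 2-digit number, 1 skipped space, 2-digit integer, 4 characters
tf <- tempfile()
writeLines(c("12 34ABCD", "56 78EFGH"), tf)
mydata <- read.fortran(tf, format=c("F2.0", "X1", "I2", "A4"))
mydata  # three columns, default names V1, V2, V3
```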
Integrated Taxonomic Information System (ITIS) Tables
The complete ITIS database of recognized taxonomic units is available from: http://www.itis.gov/downloads/
The MS SQL version not only cannot be imported into MS Access; at least the June 2008 version cannot even be loaded into MS SQL Server via the directions given: the schema and instructions tables do not match the actual tables provided. However, once you unzip the download, each table is merely an ASCII text file with values delimited by pipes. The tables do not have variable or attribute names in the first row, as the variable names and types are found in the schema.
Most of these files are quite easy to import in R, and then assign variable names (pulled from the schema):
Vernaculars <- read.table("vernaculars",sep="|")
names(Vernaculars) <- c("tsn","vernacular_name","language","approved_ind", "update_date","vern_id")
Synonyms <- read.table("synonym_links",sep="|")
names(Synonyms) <- c("synonym","valid","date")
The taxonomic_units table is trickier. While the schema has 24 variables, the last column does not appear in the file when its value is missing, and it is missing for more than the first 5 lines. Because R determines the number of variables and the variable types from the first 5 lines of data, it concludes that there are only 23 columns, and reading the file as above gives an error message about the number of items read not being a multiple of the number of columns. One solution is to include the column names in the read.table() call, which lets R know how many columns to read, and fill=TRUE to tell R to fill in that 24th column with blanks where it is missing:
TaxonomicUnits <- read.table("P:/datasets/ITIS/Oct2008/taxonomic_units", sep="|", fill=TRUE,
    col.names = c("tsn", "unit_ind1", "unit_name1", "unit_ind2", "unit_name2",
                  "unit_ind3", "unit_name3", "unit_ind4", "unit_name4",
                  "unnamed_taxon_ind", "usage", "unaccept_reason", "credibility_rtng",
                  "completeness_rtng", "currency_rating", "phylo_sort_seq",
                  "initial_timestamp", "parent_tsn", "taxon_author_id",
                  "hybrid_author_id", "kingdom_id", "rank_id", "update_date",
                  "uncertain_prnt_ind"))
From the R Data Import/Export Guide http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
The most common R data import/export question seems to be "how do I read an Excel spreadsheet?" The first piece of advice is to avoid doing so if possible! If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. Exporting a DIF file and reading it using read.DIF is another possibility.
Other considerations: Excel files can have more than 1 sheet, so it is important to specify which sheet to import. Also, Excel files can have "stuff" in any cell: mixed numerals and characters in the 1345th row of a nominally numeric column, formulas, inserted objects (e.g., graphs), or even macros. "Stuff" is not data, and R will properly choke on it! Excel also does a poor job of keeping track of the extent of the spreadsheet (the bottom right corner), so you will often get multiple extra columns and rows of mostly missing values. Saving the spreadsheet to a .csv file will get rid of the "stuff", replace formulas with their values, and otherwise produce something that may be a dataset, although you may still have extra rows & columns unless you explicitly specify the block to save. If you have columns with date or time information, make sure that you apply format cells to those columns so that what Excel writes out is interpretable as dates (see the Dates topic).
If you must read from spreadsheets, there are at least 3 options: ODBC, clipboard, and package xlsx.
If you know what you are doing in terms of ODBC, the package RODBC has functions for reading and writing .xls spreadsheet files and .xlsx files. Note that the Rcmdr menu "data | import data | from Excel" uses RODBC sqlQuery() to get data from .xls spreadsheets. You don't always get what you want, but it does its best by renaming variables to valid names, omitting objects like graphs, and reading values of formulas. Also note that as of June 2011, Microsoft still does not have 64-bit ODBC tools, so RODBC only works with 32-bit versions of R.
If you have a spreadsheet open (not just Excel, but OpenOffice or any other spreadsheet), you can select the block of cells you wish to import, copy them to the clipboard, and then read them into an R data object with readClipboard() or read.table(file="clipboard").
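A minimal sketch of the clipboard route (spreadsheets place copied cell blocks on the clipboard as tab-delimited text, so sep="\t" is the right delimiter):

```r
# after selecting and copying a block of cells, including a header row:
testdata <- read.table(file="clipboard", header=TRUE, sep="\t")
```

This is handy for one-off imports, but a script that depends on the clipboard is not reproducible, so use it for exploration rather than analyses you need to rerun.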
Finally, the xlsx package reads Excel 2007/2010 .xlsx files via functions read.xlsx() and read.xlsx2(), which include parameters controlling reading formulas or values, whether to keep Excel's formats, etc...
While I don't recommend keeping data in .xlsx files, and I especially don't recommend keeping data in popup cell notes and such in Excel, the xlsx package has several functions to read worksheets, columns, or ranges from Excel .xlsx and .xls files.
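A sketch of reading from a workbook with the xlsx package (the file name and sheet name here are hypothetical):

```r
library(xlsx)
# sheetName or sheetIndex selects among the sheets in the workbook
testdata <- read.xlsx("mydata.xlsx", sheetName="Sheet1")
# read.xlsx2() is much faster on large sheets, at the cost of cruder type detection
testdata2 <- read.xlsx2("mydata.xlsx", sheetIndex=1)
```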
What I find more useful in the xlsx package is the ability to write out results to an Excel .xlsx file: you can colorize cells, freeze panes, define print areas, merge cells, insert images (as wmfs, too!), and otherwise build a pretty complex spreadsheet. Because so many recipients of our Inventory & Monitoring results are much more comfortable with an Excel spreadsheet than other complex formats, I think that this has potential for producing "dynamic" tables and reports, where the recipient / audience can scroll around a table too large to fit on a page, re-sort the table by species or by family, and in general tweak the information to address their needs (alas, I'll learn how many folks don't do spreadsheets, either).
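A minimal sketch of the writing side, assuming a dataframe named mydataframe and hypothetical output file names; real formatted output involves CellStyle objects and more, but the skeleton is:

```r
library(xlsx)
# simplest case: dump a dataframe to a sheet in a new .xlsx file
write.xlsx(mydataframe, "results.xlsx", sheetName="Results")

# for colors, frozen panes, print areas, etc., build the workbook up piece by piece
wb <- createWorkbook()
sheet <- createSheet(wb, sheetName="Results")
addDataFrame(mydataframe, sheet)
createFreezePane(sheet, rowSplit=2, colSplit=1)  # freeze the header row
saveWorkbook(wb, "results_formatted.xlsx")
```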
DBF files started as binary files for the DBase data management program. However, many other applications can create .dbf files, most notably ESRI's ARC/GIS shapefiles, which store attribute data in .dbf files. DBF files are individual tables, so they import directly as dataframes.
Rcommander can read .dbf files, but you might need more flexibility or power than it provides, or you may want a script for repeated analyses.
The library foreign includes a function read.dbf() explicitly for importing a .dbf file into an R dataframe. Beware, because several other packages include read.dbf() functions with conflicting arguments (e.g., the filename is specified as "myfile" instead of "myfile.dbf"). This function understands logical, numeric, and date fields; any other field type in the .dbf file becomes a character vector.
read.dbf(file, as.is = FALSE)
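For example (the file name is hypothetical, e.g. the attribute table of a shapefile):

```r
library(foreign)
# as.is=TRUE keeps character fields as character vectors instead of factors
attribs <- read.dbf("mylayer.dbf", as.is=TRUE)
str(attribs)  # check the imported variable types
```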
More complex datasets are usually stored as relational databases, where information can be contained in several related tables. For instance, one table might have data on species composition at a number of sites. A second table might have information about several park units. These tables would be linked by a pair of columns indicating which park each site belonged to. This pair of columns could be part of the table of site data, or it could be a separate table (as long as the site table has a name or label for each site). One advantage of storing the data in related tables is that information common to an entire park is not duplicated for each site, but only stored once, both saving space and simplifying editing and updating of the park-level data.
ODBC is a standard for "open database connectivity", originally for Microsoft products, but now applicable to most database applications on most operating systems. RODBC is an R package for communicating with most databases, where communicating includes reading, writing, and manipulating the remote database.
The simplest way to get data from a relational database into R requires 2 steps: first defining the data source, then grabbing the dataframe. The odbcConnect() family of functions define the data source, and have versions tweaked for Access, Excel, and Dbase, so that you don't have to specify the parameters for each source. Note that because Microsoft slightly tweaks Access and Excel between versions of Office, the odbcConnectAccess() function only works with Office 2003; odbcConnectAccess2007() works with Access 2007 and most Access 2010 files. When in doubt, either consult the documentation (?odbcConnect), or try appending a version on the end: odbcConnectAccess2010().
sqlTables() returns a list of tables and views in that data source. One key point is that saved or pre-defined queries in the DBMS are actually "views", so their data can be imported as easily as data tables per se. sqlFetch() returns a data table or view from that data source.
Note that one can have "virtual" tables in an Access .mdb file that are really links to external files or to tables in other .mdb files. Many I&M networks keep their common lookup tables (species names, locations, etc.) as external tables. If you are running from their table locations, these external links will often work. If you are working from a copy of their database, you are likely to get error messages that the table is not found, or that the table / file is locked.
# Road Density
rddall <- odbcConnectAccess("n:/products/Allparks/Roads_RDDALL.mdb")
rdd.raw <- sqlFetch(rddall,"Roads_rddall") # a raw table
rdd.qry <- sqlFetch(rddall,"qry_Roads_RDDALL") # a saved query or view
rdd.qry.final <- sqlFetch(rddall,"qryroadsRDDALL_FINAL") # a saved query or view
While the menu in Rcommander can read an individual table in an MSAccess database as a dataframe, it does not handle relationships among tables. One approach is to use commands in MSAccess or MySQL or even a spreadsheet to perform all of the data manipulations first, creating a table that you can directly import as the required dataframe. That is the simplest from the R side, and is what Paul assumes for the Learn R course. However, one does not always have write access to the relevant database, which can be somewhere on the internet as easily as somewhere on your local hard drive.
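If you do import the individual tables into R, the relationships can be reproduced with merge(), which behaves like an SQL inner join on the shared column. A self-contained sketch with two made-up tables (the park codes are real NPS units, but the numbers are invented):

```r
# sites within parks, and park-level attributes, as separate tables
sites <- data.frame(site=c("A1", "A2", "B1"),
                    park=c("JOTR", "JOTR", "CABR"),
                    richness=c(12, 9, 15))
parks <- data.frame(park=c("JOTR", "CABR"),
                    area_ha=c(320825, 658))
# merge() joins the two tables on the shared "park" column,
# attaching the park-level attributes to every matching site row
sitepark <- merge(sites, parks, by="park")
```

merge() also has all.x= and all.y= arguments for the equivalent of left and right outer joins.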
RODBC also allows you to connect to remote SQL servers, whether MS SQL Server, Oracle, MySQL, or any other data source that you can define as a data source in ODBC. For example, if I already have the NPS Climate data server defined in my local ODBC as a source named "NPS_climate", I can connect with:
climate.db <- odbcConnect("NPS_climate", uid="climate_reader", pwd="******")
If I have not yet defined the ODBC data source, I can either define it myself in ODBC, or call odbcConnect() with a null parameter to open up the ODBC handler to define it on the fly:
A complete screen by screen example is given in the page on connecting to the NPS climate data.
RODBC allows any SQL query to be sent to the attached database: complex SELECT commands involving joins among several related tables, UPDATE, DELETE, CREATE (a table in the remote database if you have write access!). My queries always seem to be lengthy, so I assign them as character values, then call sqlQuery():
NPspecies <- odbcConnectAccess("p:/mammals/test_NPSpecies.mdb")
q <- paste("SELECT PARK_CODE AS unit, accepted_TSN AS TSN, SciName,",
           "CommonName, status_type AS status, Abundance_type AS abundance,",
           "residency_type AS residency, nativity, Order_Name, Family_Name",
           "FROM CertifiedUSN_There2 WHERE TaxonCat = 500;")
mammals.long <- sqlQuery(NPspecies,q)
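Continuing the example above, it is good practice to close ODBC channels when you are done with them:

```r
# close this channel; odbcCloseAll() closes every open connection at once
odbcClose(NPspecies)
```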
As an alternative to ODBC, RJDBC allows connecting to any DBMS that has a JDBC (java) driver defined. There are packages specifically for many major relational databases: ROracle, RMySQL, RSQLite, RpgSQL, and RPostgreSQL for PostgreSQL. DBI is a project and package attempting to make a universal interface to relational databases.
XML is perhaps the most portable format for complex data. Unfortunately, that complexity comes at the cost of large file sizes. There are several packages for reading and writing (and parsing) XML files in R. In my experience, these packages all produce a parsable tree or list, which may be important for many applications, but means that .xml is not very useful for importing data tables per se. I'll add recommendations once I have an idea of what works for common environmental datasets.
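For completeness, a sketch of parsing with the XML package (the file name and element names here are hypothetical); note that what you get back is a tree to walk, not a dataframe:

```r
library(XML)
# parse the file into an internal tree, then inspect the structure
doc <- xmlTreeParse("metadata.xml", useInternalNodes=TRUE)
top <- xmlRoot(doc)
xmlName(top)              # name of the root element
names(top)                # names of its child elements
xmlValue(top[["title"]])  # text content of a child element named "title"
```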
The attributes of an ESRI shapefile are stored as a .dbf file, so analyses of attributes can use read.dbf or the Rcommander menu mentioned above. However, spatial analyses in R require spatial information, and require importing the extent, projection, and coordinates as well as the attribute data.
I strongly recommend using the rgdal and raster packages for pretty much everything, with the exception of kml files and possibly NetCDF. Package rgdal provides bindings to the open source GDAL (Frank Warmerdam's Geospatial Data Abstraction Library for raster data), proj4, and OGR (the complement to GDAL for vector data). Under MS Windows, the rgdal package now installs the core of the GDAL (open source) tool and common file format definitions, along with OGR and its format definitions. If you are using a different operating system, you may need to first install GDAL, then the rgdal package.
The readOGR() function reads vector files (e.g., shapefiles with points, polylines, polygons, etc.) into spatial classes (e.g., spatialPolygonsDataframe).
readOGR(dsn, layer, verbose = TRUE, p4s = NULL, ...)
VegMap <- readOGR(".",layer="CABR_VegMap_Draft_June_2011") # ESRI shapefile
The readGDAL() function reads raster files (e.g., grids, ERDAS .img files, etc.) into appropriate spatial classes.
readGDAL(fname, offset, region.dim, output.dim, band, p4s=NULL, ...,
half.cell=c(0.5, 0.5), silent = FALSE)
Elev <- readGDAL("ned3m.img") # ERDAS Imagine .img file
The raster package includes functions for reading and writing raster (grid) files, including automatic tiling to allow processing of files larger than the 2GB limit in 32-bit R or larger than your physical & virtual memory in 64-bit R.
bigDEM <- raster("e:/GRSM/GRSM_DEM_2180601/grsmdem") # make it an object but leave it on disk
Note that in readOGR() and raster(), you do not include the file extension in the filename. Also, the specification of directories v. file names is a bit tricky, as GDAL & OGR drivers for different file formats interpret dsn and layer differently (sometimes as directory and file, sometimes as file and layer within file). If the vector or raster file had projection information (e.g., in a .prj file), readOGR(), readGDAL(), and raster() read that projection information and translate it to a proj4string, the standard for projection information in R spatial classes.
One minor difficulty in using rgdal and gdal is that spatial objects in R use proj4 strings to define projections and coordinate systems, and there is no complete mapping of ESRI projection names to proj4 strings. However, as long as the file read by readOGR or readGDAL had projection information associated with it (e.g., a .prj file), that projection information will be translated correctly to the proj4 string for that spatial object.
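Continuing with the VegMap object from the readOGR() example above, a sketch of checking and converting projections:

```r
library(rgdal)
# the proj4 string that readOGR() attached to the object
proj4string(VegMap)
# reproject to geographic coordinates (WGS84) with spTransform()
VegMap.ll <- spTransform(VegMap, CRS("+proj=longlat +datum=WGS84"))
```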
The biggest shortcoming is that there is no current tool in R for pulling data from an ESRI v10 geodatabase. As far as I know, the API for these geodatabases is still proprietary, so no one could write such a function. My workaround is to grab what I need from the geodatabase in ARC/Map, then export that feature to a shapefile. Yes, that's a pain for separating boundaries for each NPS unit; I can export as a single shapefile, then use R to split into separate objects. Note that gdal has a driver for "personal geodatabases". Brian Ripley and Uwe Ligges have enhanced the forthcoming R version 2.15 (due in April 2012) to include that driver (PGeo) in rgdal.
As a special case, there are several packages for working with either map tiles or kml/kmz files from Google maps and Google Earth: ggmap, googleVis, plotGoogleMaps, and RgoogleMaps that I know of.
There are many other tools for reading and manipulating GIS data included in other packages, because until recently rgdal required a separate installation of stand-alone GDAL and proj4. I have used readShapePoly in package maptools to read shapefiles with 1 to 9000 polygons each into SpatialPolygonsDataFrame objects. maptools also has readShapeLines, readShapePoints, and readShapeSpatial, as well as read.AsciiGrid to read ESRI asciigrid files, and the same set of writeShape* functions to write R objects to ESRI formats. The shapefiles package includes tools for reading .shp, .shx, .sbn, .sbx, and .prj files into point, line, or polygon objects.
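A sketch of the maptools route (the file name and projection here are hypothetical):

```r
library(maptools)
# read a polygon shapefile directly into a SpatialPolygonsDataFrame
bounds <- readShapePoly("park_boundaries.shp")
# note: readShape* does NOT read the .prj file, so assign the projection yourself
proj4string(bounds) <- CRS("+proj=utm +zone=11 +datum=NAD83")
```

That last step is the main argument for preferring readOGR(), which picks up the .prj automatically.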
RArcInfo has tools to read and write ARC/Info v7 coverages and .e00 export files.
Download and install the "foreign" package. [This is so useful that I include "foreign" in my defaultPackages list in Rprofile.site.]
If you have SAS datasets saved as SAS xport files,
read.xport("fromsas.xpt") # works whether you have SAS installed or not
If you have SAS installed on your local machine, you can use
read.ssd(libname="c:/sasdata", sectionnames="sasfile") # reads sasfile.ssd or sasfile.sas7bdat IF you have SAS on your machine
In package Hmisc, sas.get() also reads SAS datasets if you have SAS installed. I believe that both read.ssd() and sas.get() generate the appropriate SAS code to format and write the selected variables to an XPORT dataset, then use the equivalent of read.xport() to grab the xport file.
Bob Muenchen's "R for SAS and SPSS Users" can be useful. You may want to search CRAN's contributed documentation for newer versions.
If you have old S-Plus datasets, function read.S() in library foreign can read vectors, matrices, data frames, and lists from S-Plus 3, 4, or 2000. It also works across platforms: S-Plus Unix files can be read by R under MS Windows, etc.
Look at the documentation for library foreign: it includes tools for SPSS, STATA, MINITAB, SYSTAT, EpiInfo.
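For example (both file names are hypothetical):

```r
library(foreign)
# SPSS .sav files; to.data.frame=TRUE returns a dataframe instead of a list
spssdata <- read.spss("survey.sav", to.data.frame=TRUE)
# Stata .dta files
statadata <- read.dta("panel.dta")
```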
If you need to send data back and forth with MATLAB, package R.matlab has functions.
Large scientific and environmental datasets present 2 problems: they are large, and they are often complex. Large size means that efficient data compression is important for file transfer if not for storage. The good news is that R is used by many researchers, and is extensible, so tools are available for reading and writing many of these data formats. Many packages with field-specific analytical tools include functions for reading and writing the common file formats in that field, creating the appropriate data objects for those tools in R. The process is roughly the same for all: identify the appropriate package, load the library, point the tool to the data file, and create an R data object.
netCDF (network Common Data Form) via ncdf
netCDF is one of many binary data formats, used especially for atmospheric and oceanographic data with spatial, temporal, and attribute dimensions. Some hydrologic data are now archived in netCDF, and even instrumentation that produces large data cubes uses these files. One advantage of netCDF is that it is self-describing: any program that can read the netCDF file learns not just the variable names, but the structure of the dataset, flags for missing values, etc.
The National Center for Atmospheric Research (NCAR) has a guide to netCDF and R. One major problem is that many atmospheric datasets are larger than 2GB, and thus can't be imported in their entirety into 32-bit R and then subsetted. Thus, the ncdf package has tools for extracting and importing individual variables, or subsets of the observations based on attribute values.
The major current issue is that there are no tools available under MS windows that can read or write netCDF version 4 files. This is not limited to R, but is an issue with incompatibilities among Fortran compilers for MS windows.
Extracting data from netCDF files requires several simple steps: loading the ncdf library, opening the netCDF file, examining the structure of the ncdf data, and extracting values to variables. The following example is from the NCAR guide, and uses their example.nc file.
library(ncdf) # loads the library
ex.nc <- open.ncdf("example.nc") # opens netcdf file example.nc as R object
print(ex.nc) # show information about the structure of the data
Note that in this case, print() with a ncdf object shows just the header information: the description of the complicated dataframe. When the author of the ncdf package defined the ncdf class, they also defined a print.ncdf method to do something sensible like print the header information instead of something stupid like print the entire data hypercube. Similarly, they defined a summary.ncdf method to provide complementary information about the variables in the ncdf object:
 "file example.nc has 2 dimensions:"
 "EW Size: 87"
 "SN Size: 61"
 "file example.nc has 1 variables:"
 "float Elevation[EW,SN] Longname:Elevation Missval:-1"
summary(ex.nc)
# grab variables into R objects
y = get.var.ncdf(ex.nc, "SN")
x = get.var.ncdf(ex.nc,"EW")
elev = get.var.ncdf(ex.nc,"Elevation")
# image plot of terrain
filled.contour(x, y, elev, color=terrain.colors, asp=1)
Another example from NCAR (download the file STN_050258.nc):
# for time plots below
station = open.ncdf(con="STN_050258.nc", write=FALSE, readunlim=FALSE)
lons = get.var.ncdf(nc=station,varid="longitude") # reads entire coordinate variable
lats = get.var.ncdf(nc=station,varid="latitude")  # kinda boring, since the station
elev = get.var.ncdf(nc=station,varid="elevation") # didn't move
timearr = get.var.ncdf(nc=station,varid="time") # reads entire time array
prcp = get.var.ncdf(nc=station,varid="PRCP") # reads entire precip array
tmin = get.var.ncdf(nc=station,varid="TMIN")
tmax = get.var.ncdf(nc=station,varid="TMAX")
tobs = get.var.ncdf(nc=station,varid="TOBS")
snow = get.var.ncdf(nc=station,varid="SNOW")
# plot 'em up
plot( timearr, snow, main='SNOW' )
plot( timearr, prcp, main='PRCP')
plot( timearr, tmin, main='TMIN')
plot( timearr, tmax, main='TMAX')
plot( timearr, tobs, main='TOBS')
# note that there are many more observations of snow and precip than
# temperature. The unobserved temperature data must be coded as 'missing'
# to use the same time coordinate variable.
For the downscaled GCM projections from LBL:
library(ncdf) # loads the library ncdf
# opens the netcdf file Extraction_TavgPrcp.nc as R object
jotr.nc <- open.ncdf("d:/R/R_course/climate/Extraction_TavgPrcp.nc")
jotr.nc # show information about the structure of the data
# note that these data objects are 5 dimensional matrices
Tavg <- get.var.ncdf(jotr.nc, "Tavg")
Prcp <- get.var.ncdf(jotr.nc,"Prcp")
For my request (Joshua Tree NP) the frame is a 5-dimensional cube: 112 different model projections, 1800 (monthly) times, 5 latitudes, 2 bounds (upper & lower), and 11 longitudes.
Note from above that instead of a filename, the read.csv() and odbcConnect() functions can use a url to a remote data source. Therefore, the REST interface for IRMA is easy to hit in R: use read.csv() and build the REST url to request data in .csv. Currently this is only available to NPS (and some FWS) users from the NPS intranet. For details, including several examples hitting the units and NPSpecies services, see my REST page.
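The shape of such a request (the url here is a made-up placeholder, not a real endpoint; see the REST page for the actual url patterns):

```r
# build the request url asking for csv output, then read it like any file
resturl <- "https://irmaservices.example.nps.gov/rest/units?format=csv"  # hypothetical
units <- read.csv(resturl)
```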