Monitoring Statistics
Monitoring
I & M
NPS.gov

Data Import/Export in R

Introduction

Much environmental data is either large or complex, and thus requires different approaches than reading a tab or comma delimited file.

Understanding data objects in R is important--for some applications, it is not enough to get the data into R, the data must be imported into an object with the right structure and attributes. While most graphical and statistical tools in R apply to data frames, some analyses require data in specific, more complex structures. For example, SpatialPolygons is a data object in R that is equivalent to an ESRI polygon shapefile containing information for 1 or more polygons.

Basics

 

Flat Files (ASCII text, comma or tab delimited files)

The menu commands in Rcommander handle most simple cases of loading data from a file into an R dataframe. The underlying functions tend to have more optional parameters and much greater flexibility and funcitonality. read.table() and scan() functions are sufficient for handling data in simple delimited text files. read.table() is intelligent enough to determine the data types of each variable, and if the variable names are in the first row, they can be directly assigned to the vector objects:

If you need to give the full path name, either use forward slashes or use double backslashes instead of backslashes between levels of folders.

The default delimiter between values is whitespace: a space, tab, or newline. For comma-delimited files, read.table() needs to be told that commas are delimiters:

You could also use the variants read.csv() or read.csv2().

One can get the standard file open dialog for your operating system by using the file.choose() function:

read.table also works with remote files with urls:

This is more commonly useful for ftp access via ftp:// urls.

Note that read.table() will return an error if not all lines have the full set of values for variables. This is common for data from spreadsheets when the last (right) columns have missing values.

Missing Values

By default, only blanks are read as missing values. If you have some other indication for a missing value, you need to let R know what that missing value character is by specifying na.strings:


Fixed-Column Data

Fixed-column data can be imported via read.fwf(). Columns are specified as a list of field widths instead of Fortran formats or COBOL descriptors, you may need to define dummy variables to discard unwanted columns, or use read.fortran. Also, column/variable names don't generally fit in the column widths, so it is common to read the data into a data object, then assign names to the vectors in that object:

file.txt:
01ABCD1C001
50DEF10X002
33SKID2D003

If your file happens to be documented with a Fortran FORMAT statement, you can use read.fortran().

Integrated Taxonomical Information (ITIS) Tables

The complete ITIS database of recognized taxonomic units is available from: http://www.itis.gov/downloads/

The MS SQL version not only cannot be imported to MS Access, and at least the June 2008 version cannot be loaded into MS SQL server via the directions given: the schema and instructions tables do not match with the actual tables provided. However, once you unzip the download, each table is merely an ASCII text file with values delimited by pipes. The tables do not have variable or attribute values in the first row, as the variable names and types are found in the schema.

Most of these files are quite easy to import in R, and then assign variable names (pulled from the schema). :


The taxonomic_units table is trickier. While the schema has 24 variables, the last column in the file does not appear when it is a missing value, and it is missing for more than the first 5 lines. Therefore, R determines that there are only 23 columns it determines the number of variables and the variable types from the first 5 lines of data, and reading the file as above gives an error message about the number of items being read not being a multiple of the number of columns. One solution is to include the column names in the read.table() function, which lets R know how many columns to read , and fill=TRUE to tell R to fill in that 24th column with blanks where it is missing:

TaxonomicUnits <- read.table("P:/datasets/ITIS/Oct2008/taxonomic_units", sep="|", fill=TRUE, col.names = c("tsn", "unit_ind1", "unit_name1", "unit_ind2", "unit_name2", "unit_ind3", "unit_name3", "unit_ind4", "unit_name4", "unnamed_taxon_ind", "usage", "unaccept_reason", "credibility_rtng", "completeness_rtng", "currency_rating", "phylo_sort_seq", "initial_timestamp", "parent_tsn", "taxon_author_id", "hybrid_author_id", "kingdom_id", "rank_id", "update_date", "uncertain_prnt_ind"))


Spreadsheets

From the R Data Import/Export Guide http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets

The most common R data import/export question seems to be `how do I read an Excel spreadsheet'. The first piece of advice is to avoid doing so if possible! If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. Exporting a DIF file and reading it using read .DIF is another possibility.

Other considerations: Excel files can have more than 1 sheet, so it is important to specify which sheet to import. Also, Excel files can have "stuff" in any cell: mixed numerals and characters in the 1345th row of a nominally numeric column, formulas, inserted objects (e.g., graphs), or even macros. "stuff" are not data! R will properly choke on them. Saving the spreadsheet to a .csv file will get rid of the "stuff", replacing formulas with their values, and otherwise producing something that may be a dataset.

If you must read from spreadsheets, and you know what you are doing in terms of ODBC, the package RODBC has functions for reading and writing .xls spreadsheet files and .xlsx files . Note that the Rcmdr menu "data | import data | from Excel" uses RODBC sqlQuery() to get data from .xls spreadsheets. You don't always get what you want, but it does its best by renaming variables to valid names, omitting objects like graphs, and reading values of formulas. Also note that as of June 2011, Microsoft still does not have 64-bit ODBC tools, so RODBC only works with 32-bit versions of R.

If you have a spreadsheet open (not just Excel, but OpenOffice or any other spreadsheet), you can select the block of cells you wish to import, copy them to the clipboard , and then read them into an R data object with readClipboard() or read.table(file="clipboard").

 

DBF Files

DBF files started as binary files for the DBase data management program. However, many other applications can create .dbf files, most notably ESRI's ARC/GIS shapfiles, which store attribute data in .dbf files. DBF files are individual tables, so they import directly as dataframes.

Rcommander can read .dbf files, but you might need more flexibility or power than it provides, or you may want a script for repeated analyses.

The library foreign includes a function read.dbf() explicitly for importing a .dbf file into an R dataframe. Beware, because several other packages include read.dbf() functions, with conflicting arguments (e.g., the filename is specified as "myfile" instead of "myfile.dbf"). This function understands logical, numeric and date fields, any other field type in the .dbf file becomes a character vector.


Relational Databases

More complex datasets are usually stored as relational databases, where information can be contained in several related tables. For instance, one table might have data on species composition at a number of sites. A second table might have information about several park units. These tables would be linked by a pair of columns indicating which park each site belonged to. This pair of columns could be part of the table of site data, or it could be a separate table (as long as the site table has a name or label for each site). One advantage of storing the data in related tables is that information common to an entire park is not duplicated for each site, but only stored once, both saving space and simplifying editing and updating of the park-level data.

ODBC is a standard for "open database connectivity", originally for Microsoft products, but now applicable to most database applications on most operating systems. RODBC is an R package for communicating with most databases, where communicating includes reading, writing, and manipulating the remote database .

The simplest way to get data from a relational database into R requires 2 steps: first defining the data source, then grabbing the dataframe. The odbcConnect() family of functions define the data source, and have versions tweaked for Access, Excel, Dbase, so that you don't have to specify the parameters for each source. Note that because Microsoft slightly tweaks Access and Excel between versions of Office, the odbcConnectAccess() function only works with Office 2003; odbcConnectAccess2007() works with Access 2007 and most Access 2010 files. When in doubt, either consult the documentation (?odbcConnect), or try appending a version on the end: odbcConnectAccess2010(). sqlTables() returns a list of tables and views in that data source. One key is that saved or pre-defined queries in the DBMS are actually "views", so their data can be imported as easily as data tables per se. sqlFetch() returns a data table or view from that data source. Note that one can have "virtual" tables in an Access .mdb file that are really links to external files or to tables in other .mdb files. Many I&M networks keep their common lookup tables (species names, locations, etc.) as external tables. If you are running from their table locations, these external links will often work. If you are working from a copy of their database, you are likely to get error messages that the table is not found, or the table / file is locked.

An example:

While the menu in Rcommander can read an individual table in an MSAccess database as a dataframe, it does not handle relationships among tables. One approach is to use commands in MSAccess or MySQL or even a spreadsheet to perform all of the data manipulations first, creating a table that you can directly import as the required dataframe. That is the simplest from the R side, and is what Paul assumes for the Learn R course. However, one does not always have write access to the relevant database, which can be somewhere on the internet as easily as somewhere on your local hard drive.

RODBC also allows you to connect to remote SQL servers, whether MS SQLserver, Oracle, MYsql, or any other datasource that you can define as a datasource in ODBC. For example, if I already have the NPS Cliamte data server defined in my local ODBC as a source named "NPS_climate", I can connect with:

climate.db <- odbcConnect("NPS_climate",uid="climate_reader",pwd="******"
sqlTables(climate.db)

If I have not yet defined te ODBC data source, I can either define it myself in ODBC, or call odbcConnect() with a null parameter to open up the ODBC handler to define it on the fly:

odbcConnect("")

A complete screen by screen example is given in the page on connecting to the NPS climate data.

RODBC allows any SQL query to be sent to the attached database: complex SELECT commands involving joins among several related tables, UPDATE, DELETE, CREATE (a table in the remote database if you have write access!). My queries always seem to be lengthy, so I assign them as character values, then call sqlQuery():

NPspecies <- odbcConnectAccess("p:/mammals/test_NPSpecies.mdb")

q <- paste("SELECT PARK_CODE AS unit, accepted_TSN AS TSN, SciName, CommonName, "," status_type AS status, Abundance_type AS abundance, residency_type as residency, "," nativity, Order_Name, Family_Name "," FROM CertifiedUSN_There2 WHERE TaxonCat = 500;")

mammals.long <- sqlQuery(NPspecies,q)

The ROracle package includes many more functions that appear to be utility and administrative functions in Oracle as well as SQL querying. Both RODBC and ROracle accomodate permissions and credentials to secure databases, and both can connect to remote databasesd via urls.

XML

XML is perhaps the most portable format for complex data. Unfortunately, that complexity comes at the cost of large file sizes. There are several packages for reading and writing (and parsing) XML files in R. I'll add recommendations once I have an idea of what works for common environmental datasets.

GIS Files

The large number of geostatistical tools available as packages in R make it a natural tool for analysis of geospatial data. The two major open source GIS systems (SAGA and GRASS) both work well with R via the RSAGA and spgrass6 packages. However, ESRI's ARC/GIS is both the dominant GIS platform and the standard for NPS and most other agencies, and thus the focus of this section. There is a useful (but cryptic) wiki-R page on getting different GIS and spatial data files in and out of R.

The attributes of an ESRI shapefile are stored as a .dbf file, so analyses of attributes can use read.dbf or the Rcommander menu mentioned above. However, spatial analyses in R require spatial information.

If you have either vector or raster data in a modern format, you can use GDAL and the rdgal package to read and write files. Under MS windows, the rgdal package now installs the core of the GDAL (open source) tool along with common file format definitions. If you are using a different operating system, you may need to first install GDAL, then the rgdal package.

The read(OGR() function reads vector files (e.g., shapefiles with points, polylines, polygons, etc.) into spatial classes (e.g., spatialPolygonsDataframe).

The read(GDAL() function reads raster files (e.g., grids, ERDAS .img files, etc.) into appropriate spatial classes.

The major difficulty in using rgdal and gdal is that you need to specify projections with proj4 strings, and there is no complete mapping of ESRI projection names to proj4 strings. However, most functions for analyzing and manipulating spatial data require proj4 strings as the projection information on the geospatial data object, so you will need proj4 strings anyway.

The raster package includes functions for reading and writing raster (grid) files, including automatic tiling to allow processing of files larger than the 2GB limit in 32-bit R or larger than your physical & virtual memory in 64-bit R.

There are many other tools for reading and manipulating GIS data included in other packages, because until recently GDAL required a separate insallation of stand-alone GDAL and proj4. I have used readShapePoly in package maptools to read shapefiles with 1 to 9000 polygons each into SpatialPolygonsDataFrame objects. maptools also has readShapeLines, readShapePoints, and readShapeSpatial, as well as read.AsciiGrid to read ESRI asciigrid files, and the same set of writeShape* functions to write R objects to ESRI formats. The shapefiles package includes tools for reading .shp .shx .sbn .sbx and .prj files into point, line, or polygon objects

RArcInfo has tools to read and write ARC/Info v7 coverages and .e00 export files.

The raster package includes functions for reading and writing raster (grid) files, including automatic tiling to allow processing of files larger than the 2GB limit.

Other Statistical Programs

Download and install the "foreign" package. [This is so useful that I include "foreign" in my defaultPackages list in Rprofile.site.]

SAS

If you have SAS datasets saved as SAS xport files,

If you have SAS installed on your local machine, you can use

In package HMISC, sas.get() also reads SAS datasets if you have SAS installed. I believe that both read.ssd() and sas.get() generate the appropriate SAS code to format and write the selected variables to an XPORT dataset, then use the equivalent of read.xport() to grab the xport file.

Bob Muenchen's "R for SAS and SPSS Users" can be useful. You may want to search CRAN's contributed documentation for newer versions.

S-Plus

If you have old S-Plus datasets, function read.S() in library foreign can read vectors, matrices, data frames, and lists from S-Plus 3,4,or 2000. It also works across platforms: S-Plus unix files can be read by R MSwin, etc..

Other Formats

Look at the documentation for library foreign: it includes tools for SPSS, STATA, MINITAB, SYSTAT, EpiInfo.

If you need to send data back and forth with MATLAB, package R.matlab has functions.

Scientific File Formats

Large scientific and environmental datasets present 2 problems: they are large, and they are often complex. Large size means that efficient data compression is important for file transfer if not for storage. The good news is that R is used by many researchers, and is extensible, so tools are available for reading and writing many of these data formats. Many packages with field-specific analytical tools include functions for reading and writing the common file formats in that field, creating the appropriate data objects for those tools in R. The process is roughly the same for all: identify the appropriate package, load the library, point the tool to the data file, and create an R data object.

netCDF (network Common Data Form) via ncdf

netCDF is one of many binary data formats, used especially for atmospheric and oceanographic data with spatial, temporal, and attribute dimensions. Some hydrologic data are now archived in netCDF, and even instrumentation that produces large data cubes use these files. One advantage of netCDF is that it is self-describing: any program that can read the netCDF file learns not just the variable names, but the structure of the dataset, flags for missing values, etc.

The National Center for Atmospheric Research (NCAR) has a guide to netCDF and R. One major problem is that many atmospheric datasets are larger than 2GB, and thus can't be imported in entirety into 32 bit R and then subsetted. Thus, netCDF has tools for extracting and importing individual variables, or subsets of the observations based on attribute values.

The major current issue is that there are no tools available under MS windows that can read or write netCDF version 4 files. This is not limited to R, but is an issue with incompatibilities among Fortran compilers for MS windows.

Extracting data from netCDF files requires several simple steps: loading the ncdf libarary, loading the netCDF file, examining the structure of the ncdf data, and extracting values to variables. The following example is from the NCAR guide, and uses their example.nc file.

Note that in this case, print() with a ncdf object shows just the header information: the description of the complicated dataframe. When the author of the ncdf pakage defined the ncdf class, they also defined a print.ncdf method to do something sensible like print the header information instead of something stupid like print the entire data hypercube. Similarly, they defined a summary.ncdf method to provide complementary information about the variables in the ncdf object:
[1] "file example.nc has 2 dimensions:"
[1] "EW Size: 87"
[1] "SN Size: 61"
[1] "------------------------"
[1] "file example.nc has 1 variables:"
[1] "float Elevation[EW,SN] Longname:Elevation Missval:-1"
  Length Class Mode
id 1 -none- numeric
ndims 1 -none- numeric
natts 1 -none- numeric
unlimdimid 1 -none- numeric
filename 1 -none- character
varid2Rindex 3 -none- numeric
writable 1 -none- logical
dim 2 -none- list
nvars 1 -none- numeric
var 1 -none- list
filled contour map

Another example from NCAR (download the file STN_050258.nc ):

For the downscaled GCM projections from LBL:

For my request (Joshua Tree NP) the frame is a 5 dimensional cube: 112 different model projections, 1800 (monthly) times, 5 lattitudes, 2 bounds (upper & lower), and 11 longitudes.

IRMA

Note from above that instead of a filename, read.table(), read.csv(), and odbcConnect() functions can all use a url to a remote data source. Therefore, the RESTinterface for IRMA is easy to hit in R. Use read.csv() and build the REST url to request data in .csv. Currently this is only available to NPS (and some FWS) users from the NPS intranet. For details, including several examples hitting the units and NPSpecies services, see my REST page.

update on 01/28/2010  |      |   Webmaster of NPS R pages
This site is best viewed with Any (modern) Browser