National Park Service

Inventory & Monitoring (I&M)

R For Natural Resources Course (Spring 2013)

These pages are in support of a series of Webinar sessions on R for Natural Resources in March & April 2013. Please direct questions and comments about these pages, or about R in NPS, to Dr. Tom Philippi.

2013 Course Overview

Dates and Times: Tuesdays and Thursdays at 12:00 noon Pacific Time (3:00 pm Eastern Time). The first 8 sessions (see the schedule of topics below) will build upon each other and provide an introduction to the fundamentals of R. My intent is to have each session consist of a 30-45 minute presentation, followed by 30-45 minutes of real-world examples and more advanced aspects. For example, the presentation in the session on getting data into R would cover various text files, .csv files from spreadsheets, relational databases (e.g., Access), and simple geospatial data, with simple examples. The second half of that session would include examples of pulling tables & stored queries from Access and remote SQL databases, as well as using REST calls and APIs to pull data from internet services, and reading vector & raster geospatial data.
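
As a taste of that session, here is a minimal sketch of common data imports. The file, database, and table names are hypothetical placeholders, and the Access example assumes the RODBC package on MS Windows.

    # Delimited text and spreadsheet exports (hypothetical file names)
    veg    <- read.csv("veg_plots.csv", stringsAsFactors = FALSE)
    counts <- read.table("bird_counts.txt", header = TRUE, sep = "\t")

    # Pulling a table and a stored query from an Access database via RODBC (Windows)
    library(RODBC)
    con   <- odbcConnectAccess2007("C:/Data/Monitoring.accdb")
    sites <- sqlFetch(con, "tbl_Sites")                         # a whole table
    obs   <- sqlQuery(con, "SELECT * FROM qry_Observations")    # a stored query or SQL
    odbcClose(con)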

Those 8 sessions will be followed by a set of webinars on more specific topics related to natural resources. The topic sessions will be free-standing in the sense that they will build upon the content of the first 8 sessions, but not on each other, so you may choose the subset of interest to you.

Webinar Registration

To register for this webinar, go to https://www1.gotomeeting.com/register/697704864

Why R?

R is an open-source implementation of the S language for statistical computing. For over 20 years, applied statisticians have been submitting implementations of their new techniques to StatLib. When most of those implementations were written as libraries for the commercial S-Plus implementation of the S language, statisticians were providing software for free, but users (including those same statisticians) had to pay a third party to be able to run the software. A very small group of statisticians took it upon themselves to write a complete open-source implementation of S that would run under most operating systems, which they called R. Since then, the vast majority of implementations of new statistical techniques have been made available as R packages, which include the code as a library of functions and at least some documentation. In fairness to S-Plus, while there are several GUI interfaces available for R, S-Plus provides a much more polished and complete GUI interface and user experience.

Because R is very useful for "computing with data", experts in many fields use it for their work. Because R is open source, many of those experts make their field-specific code and functions freely available as packages (currently over 4000 packages; see http://cran.cnr.berkeley.edu/web/packages/ ). For example, climate researchers use R with netCDF files, so there are packages for reading and writing netCDF files (netCDF, ncdf4) as well as for generating standard climate diagrams, imputing missing weather data, downscaling from coarse data, etc. (climatol, clim.pact, seas, anm, zyp). Phenology researchers provide the packages bise and pheno, as well as a package for pulling data directly from the National Phenology Network. Jari Oksanen (with help from others) provides the package vegan for vegetation analysis (ordination, classification, analysis of similarity, and much more). There are several packages for species richness, diversity, rarefaction, etc. Wildlife biologists provide several packages for estimating occupancy & abundance from various forms of data: unmarked, mra, Rcapture, secr, PresenceAbsence. The key point is that by learning how to use R, at least to the level of writing code to reshape our data into the required structures and call the provided functions, we can leverage their efforts and expertise rather than reinvent those wheels. In order to produce informative and valid results we still have to understand the topic (e.g., water quality or wildlife population assessment), but we do not need to translate the approaches and equations in the literature into computer code, as experts have done that, tested it, and (to varying degrees) documented it.
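
As one illustration, here is a minimal sketch of leveraging a contributed package: computing diversity and richness with vegan. The tiny community matrix is made up purely for illustration.

    install.packages("vegan")   # one-time download from CRAN
    library(vegan)

    # rows = plots, columns = species, cells = abundances (toy data)
    comm <- matrix(c(10, 2, 0, 4,
                      3, 8, 1, 0,
                      0, 5, 6, 2),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(paste0("plot", 1:3), paste0("sp", 1:4)))

    diversity(comm, index = "shannon")   # Shannon-Wiener H' for each plot
    specnumber(comm)                     # species richness for each plot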

Why Coding?

There are two major reasons you may want to learn to write R code rather than use the Rcmdr GUI. First, while more and more of the general statistical methods are being added to Rcmdr via plugins, almost all of the field-specific packages require R code to use. Packages are sets of one or more functions useful for a set of tasks. The advantage of functions in R is that we don't need to understand or modify anything inside a function in order to use the package (although the source code is available if we need to inspect it or improve it). We only need to know what parameters we need to pass to the function, and how to use the objects (figures, analysis results, or data objects) it returns. Therefore, the amount of coding required of the user is quite limited: mostly creating the data objects the functions require, then calling the functions in the desired order.
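
A minimal sketch of that pattern, using base R's lm() on a toy data frame: pass the required arguments, then work with the object the function returns.

    # toy data frame standing in for real monitoring data
    vegdat <- data.frame(year = 2008:2012, cover = c(23, 25, 28, 27, 31))

    fit <- lm(cover ~ year, data = vegdat)   # call the function with its arguments
    summary(fit)                             # the returned object has its own methods
    coef(fit)                                # extract just the coefficients
    residuals(fit)                           # or pull out other components as needed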

Second, scripts document the analysis and workflow in an unambiguous manner, and make the work reproducible. Most scientific work in ecology involves decisions about outliers and missing values, and many options during the statistical analysis: far too many decisions and options to be documented in a standard methods section of a paper. [Analyses can also be difficult to rerun 6 months later when editors and reviewers want one slight change, or a colleague needs to perform a similar analysis.] Because these details can greatly affect the results, some ecological journals and ecoinformatics groups are considering encouraging or requiring some form of documentation or journaling of the entire scientific workflow. R code (or SAS or SPSS code) that includes querying the database, merging and cleansing the data, generating the figures and tables, and performing the analyses themselves is one way to meet that requirement.
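
A sketch of what such a self-documenting script might look like (the file, column, and variable names are hypothetical): every decision from raw data to analysis lives in one rerunnable file.

    dat <- read.csv("raw_streamflow.csv")            # pull the raw data
    dat <- subset(dat, !is.na(flow) & flow >= 0)     # outlier and missing-value decisions, stated explicitly
    dat$logflow <- log(dat$flow + 1)                 # transformations documented in code

    png("flow_trend.png")                            # figures regenerated on every run
    plot(logflow ~ water_year, data = dat, type = "b")
    dev.off()

    summary(lm(logflow ~ water_year, data = dat))    # the analysis itself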

There are at least two major projects about scientific workflow. The Kepler project (https://kepler-project.org/) is building a visual tool for tracking the scientific workflow across disparate software, specifically for ecological (broadly defined) work. The vision is that most science involves using more than one tool to handle the data, so raw data with standards-compliant metadata would be tracked in and out of a SQL database, the results (with automatically-updated metadata) perhaps tracked into ArcGIS or MATLAB, those results tracked into R for analysis and graphing, and so on. So far Kepler supports R and MATLAB, and web services such as data sources and EarthGrid. Second, there is an initiative on "Reproducible Research" (http://reproducibleresearch.net/) coming out of computer science and some biomedical journals. Their goal is to make the entire research workflow from raw data to finished paper reproducible. The philosophy is that the publication about the science is not the scholarship; it is only advertising of the scholarship. That perhaps makes more sense in computer science than in natural resource management, where the product is the information for management communicated with figures and analytical results. To meet these goals, they have developed tools such as Sweave, ODFweave, knitr, and Sword that support embedding R objects into complex documents, whether self-executing documents or self-documenting executable code.
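
To give the flavor of those tools, here is a minimal sketch of a Sweave/knitr source file: R code chunks and inline expressions embedded in a LaTeX document, with the computed results inserted when the document is compiled. (precip is a small dataset that ships with R; everything else is illustrative.)

    \documentclass{article}
    \begin{document}
    Mean annual precipitation across the sampled cities was
    \Sexpr{round(mean(precip), 1)} inches.

    <<precip-hist, echo=FALSE, fig=TRUE>>=
    hist(precip, main = "", xlab = "Annual precipitation (inches)")
    @
    \end{document}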


But I'm a Busy Resource Manager, Why Should I Care?

Perhaps you shouldn't. But cleaning, analyzing, and reporting results requires a quarter to a third of the total time and effort of both ecological science in general and NPS Inventory & Monitoring in particular. If you consider an average of 9 parks per I&M network and 8 vital signs per park, I&M networks simply cannot manually generate all of those figures and tables and insert them one by one each year for that many annual reports. Routine reporting must occur, but as much of the repetitive work as possible must be automated, so that network folks can keep up with the workload and have time to devote to occasional larger syntheses. R code, and a bit of thought put into that R code, can make generating the tables, figures, and analyses for annual reports as simple as appending the current year's data onto the cumulative dataset in a database and rerunning a script from previous years in R. Sweave and its MS Office cousin Sword have the potential to embed properly formatted tables, figures, and any other R object (e.g., years or dates from the database) in the correct places in a document template that has section headings and boilerplate text, allowing the author to focus on writing just the short interpretation and discussion of the results. Done right, the power of coding can speed the repetitive tasks and get us more time out in the field. [Alas, that hasn't actually happened for me; I just have more time for more tasks.]
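
A hedged sketch of that "append this year's data and rerun the script" idea, looping over parks to regenerate one trend figure each; the park codes, file names, and column names are hypothetical.

    dat <- read.csv("cumulative_veg_cover.csv")   # database export, all years to date

    for (park in unique(dat$park_code)) {
      sub <- dat[dat$park_code == park, ]
      png(paste0("cover_trend_", park, ".png"))   # one figure file per park
      plot(mean_cover ~ year, data = sub, type = "b",
           main = park, xlab = "Year", ylab = "Mean cover (%)")
      dev.off()
    }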


Other Learning Options

We learn in very different ways, and we have a wide range of backgrounds. Even if it goes well, this webinar will not be the most efficient way for some of us to learn to use R in our work (it wouldn't work for me). You may do better simply working through the web pages associated with this course at your own (faster!) pace. CRAN has a list of user-contributed documentation ( http://cran.r-project.org/other-docs.html ), including several substantial books for learning R. The main R-project website has a broader list of resources at http://www.r-project.org/other-docs.html . Coursera has offered several massive open online courses on data analysis with R. UCLA maintains a site with links to resources for learning R ( http://www.ats.ucla.edu/stat/r/ ), including a stack of slides for their own R course ( http://www.ats.ucla.edu/stat/r/seminars/intro.htm ). There are a number of dead-tree books that provide an introduction to R. Past versions of this webinar have used Everitt & Hothorn's "A Handbook of Statistical Analyses Using R" and Horton & Kleinman's "Using R for Data Management, Statistical Analysis, and Graphics." Any of these resources will help you learn the basics of R. Combined with a quick scroll through the web pages for this webinar to pick up specific topics such as pulling data from SQL services or geospatial data, they should equip you for any of the advanced topics I will offer after the first 8 sessions.

Pre-course Installation

We will start on March 5 with the expectation that everyone has R installed and running on their computer. I recommend that you either install R yourself, following the directions in the Install - Configure link on the left (which will be updated by mid-September), or else let your IT folks do the install if you don't have administrative rights or are not comfortable installing software.

My recommendations for quick installation:

  1. Grab the MS Windows binary at http://cran.r-project.org/bin/windows/base/ . If you are running a 64-bit version of Windows, this installer will install both 32-bit and 64-bit R, which is what you want. For Mac OS X, go to the page at http://cran.r-project.org/bin/macosx/ and follow the directions. For Linux, binaries for common Linux distributions are available at http://cran.r-project.org/bin/linux/ .
  2. On MS Windows, install R in c:/R/R-2.15.2, not c:/Program Files/R/... This is especially important if you don't have administrative rights on your computer and cannot write to c:/Program Files (DOI computers are set up so that users do not have write access to c:/Program Files). You will be downloading additional packages frequently, and they will need to write files under your R directory; see the short sketch after this list.
  3. If you don't have a favorite ASCII/text/programming editor, or you prefer a slightly more integrated development environment than I use, you might want either RStudio, an integrated environment for use with R that works on MS Windows, Mac OS X, and Linux, or Tinn-R, an editor for MS Windows that integrates with R. Both are free downloads from their respective websites.
  4. If this isn't sufficient guidance, Paul Geissler offers a free (registration required) website and video with step-by-step directions for installing R and RStudio at http://paulrstat.com/Courses.aspx#install .
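
Once R itself is installed, additional packages are downloaded from inside R. Here is a quick sketch (the package name is just an example) that also shows how to confirm that packages are being written somewhere you have permission to write, which is why the install location in item 2 matters.

    install.packages("vegan")   # downloads the package from CRAN and installs it
    library(vegan)              # loads it for the current session
    .libPaths()                 # shows where packages are installed; the first entry
                                # should be a directory you can write to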


Course Logistics

I will post a separate web page for each webinar session. These pages will be linked from the table of topics at the bottom of this page, and possibly from the navigation panel on the left of these course web pages under "Learn R". The R Topics part of the left navigation panel supports navigation to pages divided up by topic rather than by course session, and is meant as a reference tool or cheat sheet for NPS IMD R users. In many cases the topic page will have more material, at greater depth or a more advanced level, than the corresponding course session. In other cases, such as graphics, the material is spread over several different course sessions, but by the end of this course it will be combined into the graphics page.

When you are participating in the webinar, you will have a small webinar toolbar somewhere on your desktop. The two most important tools on it are the button for raising your hand and the box for typing in questions. Normally I will keep all participants muted from my control panel: with 100 participants, even 3-4% not muting on their end and taking other calls, discussing a blind date (which happened in the R course a few years ago), etc., can disrupt the webinar for the rest of us. Therefore, the two ways you have to ask or answer questions are to raise your hand, or to type your question (or answer to my question) into the question box. If you need further information on using GoToWebinar, their Attendee QuickRef Guide (PDF) is available from the GoToWebinar support pages (click on Documents, then the desired document; they don't allow direct linking to the document).


Session Topics

This schedule is tentative, and subject to change if the webinar is going too fast or too slow for the majority of the participants.

Recorded Session Download Link | Topic (link to session webpage) | Date (2013)
Session 1  | Introduction & Fundamentals | Tuesday, March 5
Session 2  | Getting Data In and Out; Simple Manipulations | Thursday, March 7
Session 3  | Data Exploration [session3history.R] | Tuesday, March 12
Session 4  | Data Manipulation and Basic Graphics; Writing Functions [session4history.R] | Thursday, March 14
Session 5  | Simple Inferences; Simple Linear Models (ANOVA & Regression); Formulas [session5history.R] | Tuesday, March 19
Session 6  | Advanced Graphics (lattice) [session6history.R] | Thursday, March 21
Session 7  | Generalized Linear Models; Mixed Models [session7(reconstructed)history.R] | Tuesday, March 26
Session 8  | Generalized Linear Mixed Models; Real Examples [session8history.R] | Thursday, March 28
Session 9  | Automated Reporting | Tuesday, April 2
Session 10 | Geospatial Data | Thursday, April 4
Session 11 | Vegetation Data 1: Import & Cleansing | Tuesday, April 9
Session 12 | Vegetation: Summarization and Analyses | Thursday, April 11


Last Updated: December 30, 2016