Inventory & Monitoring (I&M)
Data Manipulation in R
Page Contents
 Missing Values
 Factors
 Dates and Times
 Numerical Transformations
 Manipulating Character Variables
 Reshaping Dataframes
 Merging Dataframes
 Operating By Groups
Please direct questions and comments about these pages, and the R project in general, to Dr. Tom Philippi.
Introduction
When we work through a couple of examples, you may find one of the R reference cards handy. I've put a copy of Tom Short's reference card here.
This page aims to provide a bit of background on data and data manipulation in R, and a cheat sheet of tools for the most common manipulations: dealing with missing values, categorical variables (factors), and dates & times; numerical transformations; reshaping higher-dimensional data between wide and long 2-dimensional tables; merging datasets, including using an update file with just old/new pairs for the values that need updating; and working by groups or subsets of a dataframe.
If you have substantial needs for manipulating data in R, I recommend Phil Spector's small book Data Manipulation with R. Also, note that if you think or dream in SQL, you can use SQL queries to manipulate R data frames using package sqldf.
Missing Values
There are 2 parts to dealing with missing values. First, during data import, only blanks are guaranteed to default to missing values, so you may need to specify na.strings or otherwise tell R what to treat as missing values. For example, if your csv file uses the SAS convention of dot for missing values:
read.csv("MyData.csv",header=TRUE,na.strings=".")
Note that the parameter is na.strings, plural! If you have more than one indicator for missing values in your file, na.strings accepts a vector of strings, which is easiest to specify with the c() function:
read.csv("MyDataHasHoles.csv",header=TRUE,na.strings=c(".","....",".....","****","*****"," "))
Second, R is careful about handling missing values. By default, any operation (including well-written functions) that operates on an object with missing values will return a missing value. For example, mean(x) will return NA if any of the values of x are missing. That helps prevent missing values from sneaking past you through several steps of an analysis: the first operation that includes at least one NA will return NA. If you want summary statistics computed on the non-missing values, you can use na.rm=TRUE:
mean(x,na.rm=TRUE)
gives the mean of the non-missing values of x. For other functions, you can apply them to the non-missing subset using the logical test not NA:
x[!is.na(x)]
is.na() is a function that is TRUE whenever a value is NA; ! is the logical NOT operator (so != is "not equal to", even though you might expect !==, by analogy with == for "equal to" plus ! for "not"). The square brackets [] subset the data object (in this case x) to where the logical statement in the brackets is TRUE.
Most statistical functions (e.g., lm()) have something like na.action which applies to the model, not to individual variables. na.fail() returns the object (the dataset) if there are no NA values, otherwise it returns NA (stopping the analysis). na.pass() returns the data object whether or not it has NA values, which is useful if the function deals with NA values internally. na.omit() returns the object with entire observations (rows) omitted if any of the variables used in the model are NA for that observation. na.exclude() is the same as na.omit(), except that it allows functions using naresid() or napredict(). You can think of na.action as a function on your data object, the result being the data object used by lm(). The syntax of lm() also allows specification of the na.action as a parameter:
lm(y~a+b+c, data=na.omit(dataset))
lm(y~a+b+c, data=dataset, na.action=na.omit) # same as above, and the more common usage
You can set your default handling of missing values with:
options(na.action = "na.omit")
Factors
Factors are grouping variables. Many functions in R allow processing by groups, and thus require a factor to specify which group each observation belongs to. What is important about factors is that they are stored as integers 1,2,..., with a separate vector of labels. Therefore, a character vector (named g below) with only the 3 values "group1", "group2", and "group3" can be converted into a factor (named f) with the values 1,2,3 and the labels or names "group1", "group2", and "group3":
g <- c("group1","group1","group2","group3","group1")
f <- as.factor(g)
Both g and f have lengths of 5.
For simple, obvious situations, wellwritten functions will test for the right object type, and "coerce" the object into the right type. For instance, the tapply() function for applying summary statistics functions to a numeric data object requires a factor object (technically, a list of factors each the same length as the numeric variable or vector) to define the groups. If the object (variable) passed for grouping is not a factor, tapply() tries to convert the grouping variable via the as.factor() function.
The most common problem I have with factors is that by default, character variables read from files or databases are stored as factors. This is fine unless there is an error in the data. I can use an assignment statement to fix the errors:
region[region=="National Capital"] <- "National Capital Region"
However, the factor still includes a level "National Capital", which means that it will appear in legends of figures. When character values need to be updated, it is easiest to explicitly keep them as character variables (not factors), update the character variable values, and then convert them to factors.
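A minimal sketch of that workflow, using a hypothetical region vector (the variable and values here are illustrative):

```r
# Keep the values as character while cleaning them
region <- c("National Capital", "Intermountain", "National Capital Region")

# Fix the bad value while it is still a character vector
region[region == "National Capital"] <- "National Capital Region"

# Convert to a factor only after the values are clean
region <- as.factor(region)
levels(region)  # "Intermountain" "National Capital Region"
```

If the variable is already a factor, droplevels() removes levels that are no longer used after an update.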
Assigning Values
In R, values are assigned to an object with the assignment operator <- (a single equal sign = can also be used, but <- is preferred):
object <- value
Object can be any data object; value must meet the requirements for that object type or class. If x is numeric, R has the sequence operator (start:stop) and seq() and rep() functions for easily generating patterned data:
x <- 1:10 # x is a vector with the values 1,2,3,4,...,10
x <- seq(1,9,by=2) # 1,3,5,7,9; seq(from,to,by)
x <- rep(1:3,2) # 1,2,3,1,2,3
x <- rep(1:3,each=2) # 1,1,2,2,3,3
Numerical Transformations
Numerical transformations are simple. Because data frames are lists, a new variable can be added to the data frame:
dataframe$logvar <- log(dataframe$var)
dataframe['logvar'] <- log(dataframe['var']) # same as above
In many cases, where you would use the name of a variable, you can use function(var), and this includes model formulas. The exceptions are arithmetic operators such as ^ and *, which have special meanings inside a formula and must be wrapped in I(); you can also create the transformed variable explicitly first:
lm(y ~ a + log(b), data=d) # log() works directly in the formula
d$logb <- log(d$b)
lm(y ~ a + logb, data=d) # same model, with an explicit transformed variable
Note that you can even specify complex transformations that apply different functions depending on the value, using [x<0] etc., or using ifelse().
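A short sketch of both approaches, with a made-up vector x; note that ifelse() evaluates both branches for every element, so both expressions should be valid for all values:

```r
x <- c(-3, -1, 0, 2, 5)

# Subscript form: square only the negative values
y <- x
y[x < 0] <- x[x < 0]^2

# ifelse() form: the same transformation in one vectorized expression
y2 <- ifelse(x < 0, x^2, x)
# y and y2 are both 9 1 0 2 5
```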
Character Variables
Values of character variables can be manipulated by functions for subsetting, concatenating, searching, replacing, and changing case. Pattern matching uses (unix) regular expressions, so they are very powerful and flexible.
Functions for Manipulating Character Variables

nchar(x) : a vector of the lengths of each value in x
paste(a,b,sep="_") : concatenates character values, using sep between them
substr(x,start,stop) : extracts characters from positions start to stop from x
strsplit(x,split) : splits each value of x into a list of strings using split as the delimiter
grep(pattern,x) : returns a vector of the indices of the elements of x that contain pattern
grepl(pattern,x) : returns a logical vector indicating whether each element of x contains pattern
regexpr(pattern,x) : returns the integer position of the first occurrence of pattern in each element of x
gregexpr(pattern,x) : returns a list of the integer positions of all occurrences of pattern in each element of x
gsub(pattern,replacement,x) : replaces each occurrence of pattern with replacement
tolower(x) : converts x to all lower case
toupper(x) : converts x to all upper case
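A few of these functions in action, on a hypothetical vector of plot IDs:

```r
sites <- c("YOSE_plot_01", "SEKI_plot_12", "YOSE_plot_27")

nchar(sites)                  # 12 12 12
park <- substr(sites, 1, 4)   # "YOSE" "SEKI" "YOSE"
grepl("YOSE", sites)          # TRUE FALSE TRUE
grep("YOSE", sites)           # 1 3
gsub("plot", "quad", sites)   # "YOSE_quad_01" "SEKI_quad_12" "YOSE_quad_27"
strsplit(sites, "_")          # a list: c("YOSE","plot","01"), ...
tolower(park)                 # "yose" "seki" "yose"
```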
Dates and Times
Dates are complicated. Too complicated, with at least 4 different sets of classes and functions. The issue is that different applications in different fields have very different needs for dates and times. The best recommendation is to use the simplest class of date and time objects that can handle your needs.
The base of R handles dates as the number of days since January 1, 1970 (so Jan 1 1970 is day 0). Dates before then are simply negative numbers.
The date package provides additional functions to convert {day, month, year} to numeric dates {mdy.date()} and vice versa {date.mdy()}, but it uses January 1, 1960 as day 0.
The chron package uses chron objects to handle dates and times. Dates and times are stored as fractional numbers of days since January 1, 1970: so 1.5 is noon on January 2, 1970.
POSIX classes handle not only dates and times but also time zones and daylight saving time. They are important for timestamps on international transactions, and for things like aligning timestamps from GPS satellites with local timestamps.
The nice feature of these methods of storing dates is that simple arithmetic works well: the difference between 2 dates is the number of days between them, periods of time may be multiplied or divided, periods of time may be added to dates to produce new dates, etc.
The {base} of R has many functions for manipulating date and time objects. There are functions to show a number as a date {as.Date()}) or a date as a number {julian() and as.numeric()}, and to show the day of the week, the month, and the quarter {weekdays(), months(), quarters()}. These are generic functions, so the value of any of them may be different depending on whether they are operating on base, date, or POSIX class objects.
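A quick sketch of base-R Date objects (the dates here are chosen arbitrarily):

```r
d1 <- as.Date("2010-06-15")                      # default format is "%Y-%m-%d"
d2 <- as.Date("06/30/2010", format = "%m/%d/%Y") # other layouts need format=

as.numeric(d1)   # days since 1970-01-01
d2 - d1          # Time difference of 15 days
weekdays(d1)     # "Tuesday"
months(d1)       # "June"
quarters(d1)     # "Q2"
d1 + 30          # "2010-07-15": simple arithmetic works
```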
For more information on dates, see my separate page on dates.
Subsetting Observations and Arrays
One of the many powerful features of the S language is the ability to subset data in multiple, useful ways. If you want to subset by observation number, you can include ranges of numbers in the subscripting brackets:
x[1:10] # the first 10 values of x
x[c(1:10,101:110)] # the first 10 observations and observations 101 to 110
You can use negative values to select all observations except those positions:
x[-1] # all observations except the first observation
You can provide a vector of logical values, and thus operate on only the rows where that logical value is TRUE.
The x[!is.na(x)] example above first creates a logical vector that is TRUE when x is NA and FALSE otherwise, then reverses TRUE and FALSE via the NOT operator !, and finally subsets the x vector with a logical vector that is TRUE for all values that are not NA. The region example above, region[region=="National Capital"], created a logical vector of which elements of region equal "National Capital", then used that vector to choose which values of the region vector I wanted to change.
Arrays can have from one to at least 10 dimensions. Vectors or variables can be treated as 1-dimensional arrays. Unlike a data frame, where each column (vector) can have a different type of object, all elements of an array must be the same type. Each dimension i is subscripted from 1 to dim(array)[i].
For these examples, let a be a 4-dimensional array a[1:20,1:5,1:100,1:2].
You can extract a subset of a dimension by using an explicit subscript or a range of values. For a 2- or higher-dimensional array, you can obtain a slice with all values for a given dimension by leaving the subscript for that dimension empty; you don't have to specify 1:dim(a)[i]. So,
z <- a[1,2:3,,] results in a 3-dimensional array z with dim(z) = c(2,100,2), containing data from all values of the 3rd and 4th dimensions of a.
For the purposes of subsetting, data frames are the equivalent of 2-dimensional arrays. In order to subset the rows of a data frame but keep all of the variables, the syntax is:
PWR.info <- UnitInfo[UnitInfo$Region_Code=="PWR",]
Note the comma after the logical condition. The logical condition indicates which rows to grab, the comma followed by nothing indicates that for the second dimension (columns or variables), select all.
Think of a dataframe as a table: columns are variables or attributes, rows are cases or observations. Suppose the dataframe is named MyData and has 4 variables: PlotID, Lat, Lon, and Count.
Elements in dataframes can be addressed by numerical subscripts, including ranges (start:stop) or arbitrary orders { c(3,5,2,6,4,1) }, with the first subscript referring to rows, the second referring to columns:
MyData[10:20,1] # PlotID (col1) for the 10th through 20th rows
MyData[seq(1,nrow(MyData),by=2),] # odd rows the hard way
If you want all rows or all columns, leave that dimension blank:
MyData[5,] # extracts the 5th row, and all columns
If you want all rows or columns EXCEPT the specified ones, you can use negative numbers:
MyData[-5,] # extracts all except the 5th row
In addition to using the column number as a subscript, we can refer to any variable by its name, using a $ between the dataframe name and the variable name:
MyData$PlotID # same as MyData[,1]
The names() function either retrieves the column names:
cnames <- names(MyData)
or assigns the column names:
names(MyData) <- c("PlotID","Lat","Lon","Count")
If we just need to rename the first variable name (leaving the others as they were):
names(MyData)[1] <- "Site"
Similarly, we can subset rows by either their numerical subscripts or by values of rownames if we define them with the row.names() function.
row.names(MyData) <- MyData[,1] # assign the first column (Site) as row names
MyData <- MyData[,-1] # drop it as a variable
In addition to subsetting by numerical subscripts that we want to include or exclude, we can also subset rows by a vector of logical values of the same length as the number of rows in the dataframe.
MyData[tf,] # a new dataframe object with only the rows where object tf has values of TRUE
This is usually done with a logical expression in the subscripting:
NorthernData <- MyData[MyData$Lat>=45,]
BadPlots <- c("Site13","Site39","Site0") # SiteIDs of bad plots
GoodData <- MyData[!(MyData$SiteID %in% BadPlots),]
GoodData <- MyData[!(rownames(MyData) %in% BadPlots),] # if SiteID was dropped
Logical Operators

==   : is equal to
!=   : is not equal to
>    : greater than
>=   : greater than or equal to
<    : less than
<=   : less than or equal to
%in% : is in the list
!    : not (reverses TRUE & FALSE)
&    : and
|    : or
There are functions to find out how many rows a vector or dataframe has:
length(MyData$Count) # length of a vector = number of rows
length(MyData) # number of variables of a data frame
nrow(MyData) # number of rows of a data frame
dim(MyData) # numbers of rows and columns of a data frame or array
Reshaping (wide v. long)
Often, especially with split-plot designs or measurements repeated in time, data are written as a row for each subject, and repeated measures (or within-subject factors) as columns. This is the "wide" format, and contains all of the information in a logical layout.
id   A1  A2  A3  A4
a     0   1   2   3
b    10  11  12  13
c    21  22  23  24
Most statistical functions require data in the "long" format: one measurement per row, thus multiple rows per subject, and additional columns to indicate which date or within-subject factor the measurement is from.
id  rep   A
a    1    0
a    2    1
a    3    2
a    4    3
b    1   10
b    2   11
b    3   12
This simple reshaping of a data object is common enough that R includes the stack() function to go from wide to long and unstack() to go from long to wide for perfectly rectangular data, the reshape() function for more complex transposition, and melt() and recast() in the reshape package for very complex situations.
stack() and unstack()
Note that all columns in the wide version must be of the same length, or each block of values in the long version must be of the same length.
long <- stack(wide)
wide2 <- unstack(long)
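For example, with a small made-up rectangular dataframe:

```r
wide <- data.frame(A = c(0, 1, 2), B = c(10, 11, 12))

long <- stack(wide)
# long has 6 rows and 2 columns: 'values' (0,1,2,10,11,12)
# and 'ind' (a factor: A,A,A,B,B,B)

wide2 <- unstack(long)  # back to columns A and B
```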
reshape()
If you are a SAS user, reshape() is the rough equivalent of PROC TRANSPOSE. It is more flexible and powerful than stack() and unstack(), but its syntax is a bit more complicated (use ?reshape to view the documentation).
reshape(data=df, direction=c("long","wide"), ids="name", varying=c("v1","v2","v3"))
direction can be "long" or "wide". ids names the id variable identifying separate subjects; varying lists the columns in the wide format that become separate rows in the long format. There are several other parameters that specify other behavior. One example that isn't in the documentation is what to do when you have more than 1 id variable:
Original data form:
C1  C2  V1  V2  V3
A   B    1   2   3
C   D    4   5   6
E   F    7   8   9
Ultimately I want:
C1  C2  Var  Value
A   B   V1     1
A   B   V2     2
A   B   V3     3
C   D   V1     4
C   D   V2     5
C   D   V3     6
df <- data.frame(C1, C2, V1, V2, V3) # glue stuff into a data frame
long <- reshape(data=df, direction="long", ids=paste(df$C1,df$C2,sep=""), varying=c("V1","V2","V3"), sep="")
The trick is to generate a new ids variable on the fly by gluing together the values of the 2 desired ids variables, using the paste() function for string concatenation.
package reshape
If you have even more complex needs, package reshape lets you first break your data into atomic elements via melt(), then specify how you want those elements arranged with function cast().
package mefa
mefa (meta faunistics) is a package built specifically for dealing with species count data for multiple species at multiple sites. It defines an stcs class as Samples, Taxa, Counts, and Subsites: the long format that data are often entered into a computer. mefa provides functions to transform stcs data objects into wide format mefa objects: tables where rows are sites, columns are species, values are counts of that species in that site, and each segment has a separate samples by taxa table. The mefa object also has a table of attributes for each sample, and a table of taxon attributes for each taxon. One nice feature of this package is that it not only provides diversity metrics, species accumulation curves, etc., it also generates data objects appropriate for analyses in vegan.
Combining Data Frames
The 2 general ways of combining data objects are appending cases or rows from one object to another {rbind()}; or appending additional columns or variables to a data object, either by position {cbind()} or with explicit matching of rows {merge()}.
Appending
rbind() binds rows of data from 2 or more data objects.
 rbind(A,B)
 A1
 A2
 A3
 B1
 B2
 B3
Merging
cbind() binds columns from 2 or more data objects that have the same number of rows, effectively merging by position (row):
 cbind(A,B)
 A1 B1
 A2 B2
 A3 B3
Note that cbind() glues columns next to each other, matching the first value in A to the first value in B, etc. If the columns are of different lengths, the values from the shorter columns are recycled, which is rarely what you want! If length(A)=5 and length(B)=3:
 A1 B1
 A2 B2
 A3 B3
 A4 B1
 A5 B2
Merging assembles variables (columns) from 2 data frames, aligned by matching values of one or more columns in each data frame.
merge(x,y,by=intersect(names(x),names(y)),all=TRUE,sort=TRUE)
merge(x,y,by.x="UNIT_CODE",by.y="")
merge() syntax:
 x, y: 2 data frames or data objects that can be coerced to data frames
 by: the variables (columns) in the 2 objects to match on; the default is all names that occur in both x and y
 by="unit_code" uses only the variable unit_code to match
 by.x="UNIT_CODE", by.y="Unit_Code" matches by unit code, even though it was given slightly different names in the 2 data frames. Note that if you want to match by more than one variable, you use the c() function to provide the list: by=c("unit_code","year")
 all=FALSE: the resulting object will only have rows for by values that occurred in both input data objects
 all=TRUE: the resulting object will have rows for each by value that occurred in at least 1 of the 2 input data objects. For by values that occur in only one of the input objects, the variables from the other object will be set to NA.
 all.x=TRUE: the output object will have a row for each by value found in x (the first input object), but not those found in y but not x.
 all.y=TRUE: the output object will have a row for each by value found in y (the second input object), but not those found in x but not y.
 sort=TRUE: make the output object's rows appear in sorted order of the by values.
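A small worked example with hypothetical park-code tables, showing how all= controls which rows survive:

```r
plots  <- data.frame(unit_code = c("YOSE", "SEKI", "PINN"),
                     n_plots   = c(30, 25, 10))
visits <- data.frame(unit_code = c("YOSE", "SEKI", "LAVO"),
                     n_visits  = c(4, 3, 2))

merge(plots, visits)                # only YOSE and SEKI (in both)
merge(plots, visits, all = TRUE)    # PINN and LAVO too, padded with NA
merge(plots, visits, all.x = TRUE)  # every row of plots; LAVO dropped
```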
Update Merging
A special case of merging is updating a subset of values in one variable. Consider a large data frame with species names in birds$SciName, and a second dataframe updates with variables BadName and NewName. If birds$SciName is matched by a value in updates$BadName, we want birds$SciName to take the value in updates$NewName; otherwise we want it to keep its current value. In order to do this, we need to use the match() function, which returns a vector of the positions of the first match of the first argument in the second. In the example below, fixes is the same length as birds. If a value in birds$SciName is found in updates$BadName, fixes has the position of that match. If a value in birds$SciName does not occur in updates$BadName, fixes has the value 0 (nomatch=0):
fixes <- match(birds$SciName, updates$BadName, nomatch=0)
birds$SciName[fixes != 0] <- updates$NewName[fixes]
The trick to this is that for a vector of positive subscripts (in this case fixes), R automatically drops zero values. Therefore, we need the left side, birds$SciName[fixes!=0], to subset that vector as well. While I could have written everything in one line, splitting it into 2 steps let me look at fixes to see what was happening.
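Here is the whole update merge on a tiny made-up birds table (the name change from Parus atricapillus to Poecile atricapillus is real; the rest is illustrative):

```r
birds <- data.frame(
  SciName = c("Sturnus vulgaris", "Parus atricapillus", "Corvus corax"),
  stringsAsFactors = FALSE)

updates <- data.frame(
  BadName = "Parus atricapillus",
  NewName = "Poecile atricapillus",
  stringsAsFactors = FALSE)

fixes <- match(birds$SciName, updates$BadName, nomatch = 0)
fixes  # 0 1 0: only the second row needs updating

birds$SciName[fixes != 0] <- updates$NewName[fixes]
birds$SciName  # "Sturnus vulgaris" "Poecile atricapillus" "Corvus corax"
```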
Manipulating By Groups or Subsets
Often, one needs to apply some function to data separately by groups: for instance, the mean or maximum or sample size for each group. While sample sizes can be computed with table(), everything else can be computed using one of several functions that apply other functions by subsets of a data object: aggregate(), tapply(), lapply(), sapply(), and mapply(). I give explanations and examples of aggregate() and tapply(); you can always find syntax and details on the other functions using the ? help operator.
aggregate()
The aggregate() function splits data into subsets and computes summary statistics for each subset.
aggregate(x,...)
aggregate(x,by=grouping,FUN)
aggregate(formula,data,na.action,...)
In the second syntax, x is the data object (a variable or entire data frame), grouping is a factor vector of the same length as x, and FUN is the function to be applied to groups of x defined by grouping: mean, max, min, etc. The by= variable should be a factor, but will be coerced to a factor if it is not. Note that you can define and use your own function, as long as it returns a single value for each call. [Prior to R 2.11, FUN had to return a scalar value; since 2.11, FUN can return any value, including a list.]
In the third syntax, a formula such as y ~ x or cbind(y1,y2,y3) ~ x1 + x2 uses the right side of the formula to define the subsets or groups, and the left side to define the objects to summarize.
Note that aggregate has a separate method for time series objects, which aggregates to different time intervals, and still requires that FUN return a scalar.
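Both syntaxes on a small made-up counts dataframe:

```r
counts <- data.frame(
  park = c("YOSE", "YOSE", "SEKI", "SEKI", "SEKI"),
  year = c(2010, 2011, 2010, 2010, 2011),
  n    = c(5, 7, 2, 4, 6))

# by= syntax: grouping variables go in a list
aggregate(counts$n, by = list(park = counts$park), FUN = mean)
#   park x
#   SEKI 4
#   YOSE 6

# formula syntax: the right side defines the groups
aggregate(n ~ park + year, data = counts, FUN = sum)
```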
tapply()
The more general functions are tapply(), and lapply() with its variants sapply(), vapply(), and replicate().
tapply(x,INDEX,FUN,simplify=TRUE,...)
applies function FUN to values of x separately for each nonempty group of values given by unique combinations of the levels of the factors in INDEX. x must be an atomic object, usually a vector; INDEX is a list of factors, each the same length as x. FUN is the function to apply; the magic dots are optional additional arguments to FUN. If simplify=TRUE (the default) and FUN returns a scalar, tapply() returns an array with the same mode as that scalar. If simplify=FALSE, tapply() returns a list. Because x must be atomic, fitting a separate model (such as a call to lm() or glm()) to each subset of a data frame calls for by(), the data-frame analogue of tapply(). So, if you need to perform separate analyses for each year, you could use something like:
YearlyResults <- by(LottiaData, LottiaData$Year, function(d) glm(Count ~ zone, family=poisson, data=d))
to obtain a list of glm objects, one for each value of LottiaData$Year.
lapply(x,FUN,...)
returns a list object from applying FUN to each element of x. sapply() is a user-friendly version of lapply(). replicate() is a wrapper to sapply() for repeatedly evaluating an expression, usually involving random number generation.
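For instance, split() plus sapply() gives group summaries much like tapply() (made-up data again):

```r
counts <- c(5, 7, 2, 4, 6)
park   <- c("YOSE", "YOSE", "SEKI", "SEKI", "SEKI")

groups <- split(counts, park)  # a list with elements $SEKI and $YOSE
sapply(groups, mean)           # SEKI 4, YOSE 6
lapply(groups, range)          # the same idea, returned as a list
```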
Finally, apply() does not subset the object into groups of rows or cases, but applies a function over a margin of an array. Because dataframes can be treated as arrays, apply() can be used to apply a function to the column margin, that is to say to each variable in a dataframe:
SpeciesCounts <- apply(SpeciesData[,-1], 2, sum) # column (species) totals, excluding the first (ID) column