National Park Service

Inventory & Monitoring (I&M)

Statistical Formulas

Please direct questions and comments about these pages, and the R-project in general, to Dr. Tom Philippi.

A formula in S is an object that is a symbolic expression of a relationship between one or more response or dependent variables (on the left), and one or more predictors or independent variables (on the right), with a tilde ~ separating the two sides. Note that the variables in the formula do not need to exist for it to be a valid formula. Most manipulations require only the formula; only fitting the model to data and estimating parameters requires a dataframe with corresponding variables.
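As a minimal sketch of the point that a formula is valid without its variables existing, the snippet below builds and inspects a formula in a session where Fuel, Weight, and Displacement are assumed not to be defined. It uses only base R functions (class(), all.vars(), update()).

```r
# A formula can be created even though none of its variables exist yet.
f <- Fuel ~ Weight + Displacement

class(f)      # "formula"
all.vars(f)   # "Fuel" "Weight" "Displacement"
length(f)     # 3: the tilde, the left side, and the right side
f[[2]]        # Fuel (the left side, as an unevaluated symbol)
f[[3]]        # Weight + Displacement (the right side, as an unevaluated call)

# update() edits a formula symbolically; "." stands for the existing side.
g <- update(f, . ~ . - Displacement)
g             # Fuel ~ Weight
```

None of these steps touches data; data are needed only when the formula is handed to a fitting function.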

For the classical regression/ANOVA approach, functions such as lm() translate formula objects to design matrix X and coefficient vector B, then use whatever means necessary to obtain best estimates of the B vector. In general, each term in a formula is translated into one or more columns in X. While you are unlikely to ever need to use it explicitly, the function model.matrix() translates the formula object passed as a parameter into a design matrix X.
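To make the translation from formula to design matrix concrete, here is a small sketch. The data frame toy and its values are made up for illustration; only model.matrix(), lm(), and coef() are real base/stats functions.

```r
# Hypothetical toy data; names and values are invented for this example.
toy <- data.frame(
  Fuel         = c(3.2, 4.1, 5.0, 4.4),
  Weight       = c(2.1, 2.9, 3.6, 3.1),
  Displacement = c(110, 160, 250, 200)
)

# Design matrix X: one column per coefficient, one row per observation.
X <- model.matrix(Fuel ~ Weight + Displacement, data = toy)
colnames(X)   # "(Intercept)" "Weight" "Displacement"
dim(X)        # 4 3

# lm() builds the same matrix internally and estimates one
# coefficient per column.
fit <- lm(Fuel ~ Weight + Displacement, data = toy)
names(coef(fit))   # same names as colnames(X)
```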

The notation for formulas in R is an extension of that developed for GLIM and GENSTAT. The major extensions are that in R, the response (left side) is part of the formula, and that the terms in the formulas can in fact be full expressions, not just simple variables. These extensions make formulas in R much more flexible, applying to complex graphs and tree-based models as well as the original linear models framework. They also make it much easier to specify and test various statistical models on the fly in R, without having to first generate explicit variables from each expression for use in the specification. Chambers and Hastie (1993) "Statistical Models in S" is the definitive reference for formulas in S and thus R, although their conception of formulas has been extended for other uses, and there is a package "Formula" that defines a new object class Formula that inherits from class formula. At the R command line, ?formula will give you some details.


Example Formula

Fuel ~ Weight + Displacement

Fuel is the response variable; Weight and Displacement are the predictors. Fitting this formula means estimating the parameters of the statistical model:

Fuel = α + β1Weight + β2Displacement + ε

Note that the formula explicitly includes neither the intercept α nor the error term ε. The intercept is implicit in the formula, but can be made explicit by including a term "1":

Fuel ~ 1 + Weight + Displacement    # formula with explicit intercept term

To specify a model without an intercept (i.e., intercept forced to 0), you must explicitly indicate the lack of an intercept with a 0 term in the formula, or with a minus sign operator removing the intercept term:

Fuel ~ 0 + Weight + Displacement      # formula with no intercept term
Fuel ~ -1 + Weight + Displacement     # formula with no intercept term

The error term ε is controlled by other arguments in the call to the fitting function, such as the family argument to glm(), which sets the distribution type in Generalized Linear Models.
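The intercept rules above can be checked without any data. The sketch below uses the "intercept" attribute that terms() attaches to a formula; has_intercept is a hypothetical helper name introduced only for this example.

```r
# Hypothetical helper: TRUE if the formula implies an intercept column.
has_intercept <- function(f) attr(terms(f), "intercept") == 1

has_intercept(Fuel ~ Weight + Displacement)        # TRUE  (implicit intercept)
has_intercept(Fuel ~ 1 + Weight + Displacement)    # TRUE  (explicit intercept)
has_intercept(Fuel ~ 0 + Weight + Displacement)    # FALSE (removed with 0)
has_intercept(Fuel ~ -1 + Weight + Displacement)   # FALSE (removed with -1)
```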

Terms in formulas

The terms on either side of a formula can be any of 3 types:

  • A numeric vector, implying a single coefficient (e.g., slope β)
  • A factor or ordered factor, implying one coefficient for each level
  • A matrix, implying a coefficient for each column

Additionally, each term can be any S (R) expression that evaluates to one of the 3 types above, including transformations and calls to functions. Some especially useful example transformations and functions include:

Term                 Function/Transformation
cbind(live, die)     two-column matrix of successes and failures; the binomial response for logistic regression in glm()
(age > 40)           evaluates to a logical variable
cut(Age, 3)          evaluates to a 3-level factor
I(age^2)             evaluates to age squared
poly(Age, 3)         evaluates to a 3-column matrix of orthogonal polynomials in Age
bs(x, df)            B-spline basis for x with df degrees of freedom (package splines)
smooth(x, kind = s)  Tukey (running median) smoother; kind selects the variant

Note the use of the identity or AsIs function I(). Several operators, such as +, -, * and ^, have different meanings in formula syntax than in ordinary R arithmetic. If you need their arithmetic meaning inside a term, wrap the expression in I(), where the operators are interpreted in their usual R sense.
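The effect of I() can be seen from the term labels that R derives from a formula. This is a minimal sketch: term_labels is a hypothetical helper, and the age values are made up for illustration.

```r
# Hypothetical helper: the term labels R derives from a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

# Without I(), ^ is the formula crossing operator, so age^2 reduces to age.
term_labels(y ~ age + age^2)      # "age"
# Inside I(), ^ is ordinary arithmetic and the squared term survives.
term_labels(y ~ age + I(age^2))   # "age" "I(age^2)"

# Made-up ages, used only to show what each kind of term evaluates to.
age <- c(23, 35, 41, 52, 67, 70)
class(age > 40)         # "logical"
nlevels(cut(age, 3))    # 3
dim(poly(age, 3))       # 6 3

X <- model.matrix(~ age + I(age^2))
colnames(X)             # "(Intercept)" "age" "I(age^2)"
```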

Symbolic Operators in Formulas

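Symbol   Use
+        separate effects in a formula
-        remove a term (e.g., -1 removes the intercept)
:        interaction (A:B is the interaction of A and B)
*        main effects plus interactions (A*B is equivalent to A + B + A:B)
^        crossing to a given degree ((A+B)^2 is equivalent to A + B + A:B)
%in%     nested within (B %in% A is B nested within A)
/        nesting (A/B is equivalent to A + A:B, i.e., A plus B nested within A)
|        conditional on; defines separate panels in lattice, where the conditioning variable may be a factor or a shingle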
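One way to see what these operators do is to ask terms() for the expanded term labels. The sketch below needs no data; expand is a hypothetical helper name, and A, B, C are placeholder variable names.

```r
# Hypothetical helper: expand a one-sided formula into its term labels.
expand <- function(f) attr(terms(f), "term.labels")

expand(~ A + B)            # "A" "B"
expand(~ A:B)              # "A:B"
expand(~ A * B)            # "A" "B" "A:B"
expand(~ (A + B + C)^2)    # "A" "B" "C" "A:B" "A:C" "B:C"
expand(~ A / B)            # "A" "A:B"
expand(~ A * B - A:B)      # "A" "B"  (the minus sign removes a term)
```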

Parameterizations

When functions translate formula specifications into statistical models to be fit via ML, REML, or other numeric techniques, continuous numeric terms are parameterized as single coefficients. Likewise, numeric matrices are parameterized with a single coefficient for each column. Parameterization of factors is more complicated. In principle, a factor with N distinct levels is parameterized as if it were N separate dummy (0/1) variables, each getting a coefficient. However, together with an intercept, that leads to models that are over-parameterized. Consider the formula:

Salary ~ Age + Sex

and the corresponding model, where Sex is parameterized as dummy variables M and F:

Salary = μ + β1Age + β2M + β3F + ε

Because M + F = 1 for each observation, there are infinitely many combinations of μ, β2, and β3 that fit the same equation. The usual parameterization is to retain the intercept μ but set the coefficient of the first level of the factor equal to 0 (β2 = 0). The model then has coefficients for all but the first level of the factor, and those coefficients are interpretable as differences from the first level. [Note that μ is then no longer an overall average, but rather the intercept for the level with coefficient set to 0.] Under that parameterization, it is useful to have the first factor level be the "control" or reference group. The relevel() function in package stats allows specification of a reference level of a factor, and moves it to the first position (integer code 1). The alternative is to fit a model without an intercept or grand mean, in which case the predicted value for each observation can be computed from:

Salary ~ -1 + Age + Sex

Predicted Salary = β1Age + β2M + β3F

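One can also specify explicit contrasts among levels of a factor via the contrasts() function. For unordered factors, the default contrasts in R are treatment contrasts (contr.treatment), which give the reference-level parameterization described above; S-PLUS instead defaulted to Helmert contrasts (contr.helmert): 1 v 2, avg(1,2) v 3, etc. For ordered factors, the default contrasts are orthogonal polynomials (contr.poly) of the integer values of the levels.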
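The parameterizations discussed in this section can be inspected through the columns of the design matrix. This is a sketch under stated assumptions: the data frame toy and its values are invented, and the session is assumed to use R's stock contrast settings.

```r
# Hypothetical toy data; names and values are invented for this example.
toy <- data.frame(
  Salary = c(40, 52, 61, 45, 58, 70),
  Age    = c(25, 35, 45, 28, 38, 50),
  Sex    = factor(c("F", "F", "F", "M", "M", "M"))
)

# With an intercept, the first level (F) is absorbed as the reference.
with_int <- colnames(model.matrix(Salary ~ Age + Sex, data = toy))
with_int   # "(Intercept)" "Age" "SexM"

# Without an intercept, every level of the factor gets its own column.
no_int <- colnames(model.matrix(Salary ~ -1 + Age + Sex, data = toy))
no_int     # "Age" "SexF" "SexM"

# relevel() moves the chosen reference level to the first position.
toy$Sex2 <- relevel(toy$Sex, ref = "M")
levels(toy$Sex2)   # "M" "F"
releveled <- colnames(model.matrix(Salary ~ Age + Sex2, data = toy))
releveled  # "(Intercept)" "Age" "Sex2F"

# The contrast functions the current session uses by default:
getOption("contrasts")

# Explicit contrasts can be assigned to a factor with contrasts()<-.
g <- factor(c("a", "b", "c"))
contrasts(g) <- contr.helmert(3)
contrasts(g)       # Helmert coding: b vs a, then c vs the average of a and b
```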

Example Right-Side Formula Specifications

Right Side of Formula            Meaning
A + B                            main effects of A and B
A:B                              interaction of A with B
A*B                              main effects and interactions = A + B + A:B
A*B*C                            main effects and interactions A+B+C+A:B+A:C+B:C+A:B:C
(A+B+C)^2                        A, B, and C crossed to level 2: A+B+C+A:B+A:C+B:C
A*B*C-A:B:C                      same as above: main effects plus 2-way interactions
1 + state + state:county         nested ANOVA
1 + state + county%in%state      nested ANOVA emphasizing county nested in state
state / county                   nested ANOVA
(1 | subject)                    fit random intercepts for subjects (mixed-model syntax, as in package lme4)
(1 + time | subject)             fit both random intercepts and random subject-specific slopes (mixed-model syntax, as in package lme4)
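Several of the equivalences in this table can be checked directly by comparing the terms that two formulas expand to. A minimal sketch follows; same_terms is a hypothetical helper, and the random-effects rows are left out because they require a mixed-model package rather than base R.

```r
# Hypothetical helper: TRUE if two formulas expand to the same set of
# terms and agree on whether an intercept is present.
same_terms <- function(f1, f2) {
  t1 <- terms(f1)
  t2 <- terms(f2)
  setequal(attr(t1, "term.labels"), attr(t2, "term.labels")) &&
    attr(t1, "intercept") == attr(t2, "intercept")
}

same_terms(~ A * B, ~ A + B + A:B)                         # TRUE
same_terms(~ (A + B + C)^2, ~ A * B * C - A:B:C)           # TRUE
same_terms(~ state / county, ~ 1 + state + state:county)   # TRUE
same_terms(~ (A + B + C)^2, ~ A * B * C)                   # FALSE (3-way term differs)
```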

Last Updated: December 04, 2012