Please direct questions and comments about these pages, and the R-project in general, to Dr. Tom Philippi.
A formula in S is an object that is a symbolic expression of a relationship between one or more response or dependent variables (on the left), and one or more predictors or independent variables (on the right), with a tilde ~ separating the two sides. Note that the variables in the formula do not need to exist for it to be a valid formula. Most manipulations require only the formula; only fitting the model to data and estimating parameters requires a dataframe with corresponding variables.
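As a small sketch of the point above, a formula can be created and inspected before any of its variables exist. The object name f is only an illustration.

```r
# None of Fuel, Weight, or Displacement exists in the workspace,
# yet this is a perfectly valid formula object.
f <- Fuel ~ Weight + Displacement

class(f)      # "formula"
all.vars(f)   # "Fuel" "Weight" "Displacement"
length(f)     # 3: the tilde, the left side, and the right side
f[[2]]        # left side:  Fuel
f[[3]]        # right side: Weight + Displacement
```

Only when the formula is handed to a fitting function such as lm() do the variable names have to be found, usually in the data frame given as the data argument.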
For the classical regression/ANOVA approach, functions such as lm() translate formula objects to design matrix X and coefficient vector B, then use whatever means necessary to obtain best estimates of the B vector. In general, each term in a formula is translated into one or more columns in X. While you are unlikely to ever need to use it explicitly, the function model.matrix() translates the formula object passed as a parameter into a design matrix X.
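The translation from formula to design matrix can be inspected directly. The sketch below uses a made-up data frame called toy (the values are hypothetical, for illustration only) and shows that the columns of X line up with the coefficients that lm() estimates.

```r
# Hypothetical toy data; the numbers are invented for illustration.
toy <- data.frame(
  Fuel         = c(3.5, 4.1, 5.0, 4.4),
  Weight       = c(2.2, 2.9, 3.6, 3.1),
  Displacement = c(97, 135, 231, 180)
)

# Design matrix X built from the right side of the formula.
X <- model.matrix(Fuel ~ Weight + Displacement, data = toy)
colnames(X)   # "(Intercept)" "Weight" "Displacement"
dim(X)        # 4 rows (observations), 3 columns (coefficients)

# lm() estimates one coefficient per column of X.
fit <- lm(Fuel ~ Weight + Displacement, data = toy)
names(coef(fit))
```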
The notation for formulas in R is an extension of that developed for GLIM and GENSTAT. The major extensions are that in R the response (left side) is part of the formula, and that the terms in a formula can be full expressions, not just simple variables. These extensions make formulas in R much more flexible: they apply to complex graphs and tree-based models as well as the original linear models framework, and they make it much easier to specify and test various statistical models on the fly, without first having to generate explicit variables from each expression for use in the specification. Chambers and Hastie (1993) "Statistical Models in S" is the definitive reference for formulas in S and thus R, although their conception of formulas has since been extended for other uses; for example, the package "Formula" defines an object class Formula that inherits from class formula. At the R command line, ?formula will give you some details.
Fuel ~ Weight + Displacement
Fuel is the response variable; Weight and Displacement are the predictors. Fitting this formula means estimating the parameters of the statistical model:
Fuel = α + β₁·Weight + β₂·Displacement + ε
Note that the formula explicitly includes neither the intercept α nor the error term ε. The intercept is implicit in the formula, but can be made explicit by including a term "1":
Fuel ~ 1 + Weight + Displacement # formula with explicit intercept term
To specify a model without an intercept (i.e., with the intercept forced to 0), you must remove it explicitly, either with a 0 term in the formula or with a minus sign operator removing the intercept term:
Fuel ~ 0 + Weight + Displacement # formula with no intercept term
Fuel ~ -1 + Weight + Displacement # formula with no intercept term
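The effect of the intercept terms above can be checked without fitting anything, by asking terms() whether an intercept is present. The helper name has_intercept is hypothetical, introduced only for this sketch.

```r
# Hypothetical helper: returns 1 if the formula has an intercept, 0 if not.
has_intercept <- function(f) attr(terms(f), "intercept")

has_intercept(Fuel ~ Weight + Displacement)       # 1 (implicit intercept)
has_intercept(Fuel ~ 1 + Weight + Displacement)   # 1 (explicit intercept)
has_intercept(Fuel ~ 0 + Weight + Displacement)   # 0
has_intercept(Fuel ~ -1 + Weight + Displacement)  # 0
```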
The error term ε is controlled by other specifications in the call to a function, including distribution type in Generalized Linear Models.
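As a rough illustration of the error term being set outside the formula, the sketch below fits the same formula with different error distributions through the family argument of glm(). The data frame d holds invented counts, used only for this example.

```r
# Hypothetical count data, made up for illustration.
d <- data.frame(x = 1:9, y = c(2, 3, 6, 7, 8, 9, 10, 12, 15))

# Same formula, three calls: only the function and family differ.
fit_lm      <- lm(y ~ x, data = d)                     # normal errors
fit_gauss   <- glm(y ~ x, data = d, family = gaussian) # normal errors via glm
fit_poisson <- glm(y ~ x, data = d, family = poisson)  # Poisson errors, log link

all.equal(coef(fit_lm), coef(fit_gauss))  # the two normal-error fits agree
family(fit_poisson)$family                # "poisson"
family(fit_poisson)$link                  # "log"
```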
Terms in formulas
The terms on either side of a formula can be any of 3 types:
- A numeric vector, implying a single coefficient (e.g., slope β)
- A factor or ordered factor, implying one coefficient for each level
- A matrix, implying a coefficient for each column
Additionally, each term can be any S (R) expression that evaluates to one of the 3 types above, including transformations and calls to functions. Some especially useful example transformations and functions include:
| Expression | Meaning |
|---|---|
| cbind(live, die) | which evaluates to a two-column matrix of counts, the binomial response variable for logistic regression in glm |
| (age > 40) | which evaluates to a logical variable |
| cut(Age, 3) | which evaluates to a 3-level factor |
| I(age^2) | which evaluates to age squared |
| poly(Age, 3) | which evaluates to a 3-column matrix of orthogonal polynomials in Age |
| bs(x, df) | which evaluates to a B-spline basis matrix with df columns (package splines) |
| smooth(x, kind=s) | smoother of various forms |
Note the use of the identity or AsIs I() function. Several operators such as +, -, * and ^ have different meanings in formula syntax; if you need them in your R expressions, wrap the R expression in the I() function where they will be interpreted in their usual R operation context.
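The difference that I() makes can be seen by listing the terms R extracts from a formula. This is a minimal sketch; the helper term_labels and the data in d and Age are hypothetical and exist only for illustration.

```r
# Hypothetical helper: list the terms on the right side of a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

# Without I(), ^ is read as the formula crossing operator, so x^2 is just x.
term_labels(y ~ x + x^2)      # "x"
# With I(), ^ is ordinary arithmetic and a separate squared term appears.
term_labels(y ~ x + I(x^2))   # "x" "I(x^2)"

# The squared column really is x squared in the design matrix.
d <- data.frame(y = c(1.2, 3.9, 9.1, 15.8), x = 1:4)
X <- model.matrix(y ~ x + I(x^2), data = d)
X[, "I(x^2)"]                 # 1 4 9 16

# Other expressions evaluate to the term types described above.
Age <- c(21, 25, 33, 40, 47, 52, 60, 68)
is.logical(Age > 40)          # TRUE: a logical variable
nlevels(cut(Age, 3))          # 3: a factor with 3 levels
ncol(poly(Age, 3))            # 3: a matrix with 3 columns
```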
Symbolic Operators in Formulas
| Operator | Meaning |
|---|---|
| + | separate effects in a formula |
| : | interaction (A:B is the interaction of A and B) |
| * | main effects plus interactions (A*B is equivalent to A + B + A:B) |
| \| | conditional on; defines separate panels or shingles in lattice |
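The equivalence between * and its expanded form can be checked with terms(), which expands a formula without needing any data. The helper name term_labels is hypothetical.

```r
# Hypothetical helper: list the expanded terms of a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ A + B)         # "A" "B"
term_labels(y ~ A:B)           # "A:B"
term_labels(y ~ A * B)         # "A" "B" "A:B"
term_labels(y ~ A + B + A:B)   # "A" "B" "A:B"
```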
When various functions translate formula specifications into statistical models to be fit via ML, REML, or other numeric techniques, continuous numeric terms are obviously parameterized as single coefficients. Likewise, numeric matrices are parameterized with a single coefficient for each column. However, parameterization of factors is a bit more complicated. In general, a factor with N distinct levels is parameterized as if it were N separate dummy (0 1) variables, each getting a coefficient. However, that leads to models that are over-parameterized. Consider the formula:
Salary ~ Age + Sex
and the corresponding model, where Sex is parameterized as dummy variables M and F:
Salary = μ + β₁·Age + β₂·M + β₃·F + ε
Because M + F = 1 for each observation, there are infinitely many combinations of μ, β₂, and β₃ that fit the same equation. The usual parameterization is to retain the intercept μ but set the coefficient of the first level of the factor equal to 0 (β₂ = 0). The model then has coefficients for all but the first level of the factor, and those coefficients are interpretable as differences from the value for the first level. [Note that the μ term is no longer an overall average, but rather the average for the level with its coefficient set to 0.] Under that parameterization, it is useful to have the first factor level be the "control" or reference group. The relevel() function in package stats allows specification of a reference level of a factor, and assigns it the integer value of 1. The alternative is to fit a model without an intercept or grand mean, in which case the predicted value for each observation can be computed from:
Salary ~ -1 + Age + Sex
Predicted Salary = β₁·Age + β₂·M + β₃·F
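The two parameterizations can be compared by looking at the design matrix columns. This sketch assumes a default R session and uses an invented data frame called staff. Because factor levels are sorted alphabetically by default, F is the first level until relevel() makes M the reference.

```r
# Hypothetical data, invented for illustration.
staff <- data.frame(
  Salary = c(41, 52, 48, 60, 55, 67),
  Age    = c(25, 31, 38, 44, 50, 57),
  Sex    = factor(c("F", "M", "F", "M", "F", "M"))
)

# With an intercept, the first level (F) is dropped and absorbed
# into the intercept.
colnames(model.matrix(Salary ~ Age + Sex, data = staff))
# "(Intercept)" "Age" "SexM"

# Without an intercept, every level of Sex gets its own column.
colnames(model.matrix(Salary ~ -1 + Age + Sex, data = staff))
# "Age" "SexF" "SexM"

# relevel() makes M the reference level, so F gets the coefficient.
staff$Sex <- relevel(staff$Sex, ref = "M")
levels(staff$Sex)   # "M" "F"
colnames(model.matrix(Salary ~ Age + Sex, data = staff))
# "(Intercept)" "Age" "SexF"
```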
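One can also specify explicit contrasts among levels of a factor via the contrasts() function. For unordered factors, the default contrasts in R are treatment contrasts (contr.treatment), which give the reference-level parameterization described above; Helmert contrasts (contr.helmert: 1 v 2, avg(1,2) v 3, etc.) were the default in S-PLUS and remain available in R. For ordered factors, the default contrasts are orthogonal polynomials (contr.poly) of the integer values of the levels.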
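The contrast matrices themselves are easy to inspect, which helps when deciding how coefficients should be interpreted. A minimal sketch, using a hypothetical three-level factor g:

```r
# Which contrast functions the current session uses for
# unordered and ordered factors.
getOption("contrasts")

# Hypothetical three-level factor.
g <- factor(c("a", "b", "c"))

contr.treatment(3)  # each column compares one level with the first level
contr.helmert(3)    # 1 v 2, then avg(1,2) v 3
contr.poly(3)       # linear and quadratic orthogonal polynomials

# Assign Helmert contrasts to this particular factor.
contrasts(g) <- contr.helmert(3)
contrasts(g)
```

Contrasts assigned this way travel with the factor, so any model fitted with g afterwards uses them.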
Example Right-Side Formula Specifications
| Right Side of Formula | Meaning |
|---|---|
| A + B | main effects of A and B |
| A:B | interaction of A with B |
| A*B | main effects and interactions = A + B + A:B |
| A*B*C | main effects and interactions A+B+C+A:B+A:C+B:C+A:B:C |
| (A+B+C)^2 | A, B, and C crossed to level 2: A+B+C+A:B+A:C+B:C |
| A*B*C - A:B:C | same as above: main effects plus 2-way interactions |
| 1 + state + state:county | nested ANOVA |
| 1 + state + county %in% state | nested ANOVA emphasizing county nested in state |
| state / county | nested ANOVA |
| (1 \| subject) | fit random intercepts for subjects (mixed-model syntax, as in package lme4) |
| (1 + time \| subject) | fit both random intercepts and random subject-specific slopes (mixed-model syntax, as in package lme4) |
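Several rows of the table can be verified with terms(), which expands a right-side specification without any data. The helper name term_labels is hypothetical; the random-effects rows are not covered here because they need a mixed-model package.

```r
# Hypothetical helper: list the expanded terms of a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(~ A * B * C)
# "A" "B" "C" "A:B" "A:C" "B:C" "A:B:C"

term_labels(~ (A + B + C)^2)
# "A" "B" "C" "A:B" "A:C" "B:C"

term_labels(~ A * B * C - A:B:C)
# "A" "B" "C" "A:B" "A:C" "B:C"

# The nesting operator expands to a main effect plus an interaction.
term_labels(~ state / county)
# "state" "state:county"
```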