library(tidyverse)
library(nlme)
data("Orthodont")
<-
Orthodont %>%
Orthodont as.data.frame() %>%
mutate(
age = as.integer(age),
Subject = factor(Subject, levels = sort(levels(Subject)), ordered = FALSE),
Sex_rev = fct_rev(Sex),
Sex3 = factor(if_else(
%in% c("M11","M12","M13","M14","M15","M16"),
Subject "Non-binary",
levels = c("Female","Male","Non-binary")
Sex),
) )
The nlme
package is a venerable, now quite old package for fitting hierarchical models in R. Compared to the better-known lme4
package, nlme
provides greater flexibility for specifying dependence structures at the finest-grain level of the model, such as allowing for heteroskedastic errors or correlated errors that follow an auto-regressive structure within each higher-level unit.1 Due to its quite advanced age (the first entry in the changelog is from December 23, 1999!), it is perhaps not surprising that the package has a number of what one might call quirks—odd, counter-intuitive behaviors that are unlike the behavior of other core model-fitting functions in R. A few of my past posts have documented some of these quirks. Here’s another one.
1 nlme
is a minor obsession for me because Several packages that I’ve worked on provide formal enhancements to or are at least very closely connected to it. The clubSandwich
package provides cluster-robust standard errors for fitted models, and it supports models fitted with nlme::lme()
and nlme::gls()
. The lmeInfo
package provides analytic derivatives for models fitted with nlme::lme()
and nlme::gls()
. The scdhlm
package for calculating effect sizes from single-case designs relies heavily on nlme::lme()
and on lmeInfo::g_mlm()
.
varIdent()
In nlme
’s syntax, varIdent()
is used to specify a model in which the variance of the lowest-level error term differs depending on a categorical covariate. I’ll call this a “heteroskedastic error” model. Say that we have repeated measures (indexed by \(h\)) nested within units (indexed by \(i\)), so that \(y_{hi}\) is measurement \(h\) on unit \(i\) for \(h = 1,...,H_i\) and \(i=1,...,N\). Let \(c_{hi} \in \{1,...,C\}\) denote the category of observation \(hi\), and suppose that the last category serves as a reference level. Say we have some model with predictors \(\mathbf{x}_{hi}\) and potentially also with random effects terms described by \(\mathbf{z}_{hi}\): \[
y_{hi} = \mathbf{x}_{hi} \boldsymbol\beta + \mathbf{z}_{hi} \mathbf{u}_i + e_{hi}
\] where \[
\text{Var}(e_{hi}) = \sigma^2 \times \exp\left( \sum_{j=1}^{C-1} \lambda_j I(c_{hi} = j)\right).
\] The coefficients \(\lambda_1,...,\lambda_{C-1}\) capture how the variance of errors in category \(j\) differ from the variance of errors in category \(C\). The question is, how can one control the reference level (i.e., choose which category is \(C\)) when fitting this model?
Let me give a few examples of models with this feature using Orthodont
, one of the standard datasets included in the nlme
package. Below I do a bit of tidying up on the factors in this dataset.
Here is a simple example of a heteroskedastic error model, without any random effects or any other bells or whistles.
<- gls(
gls1 ~ age + Sex,
distance weights = varIdent(form = ~ 1 | Sex),
data = Orthodont
)summary(gls1)
Generalized least squares fit by REML
Model: distance ~ age + Sex
Data: Orthodont
AIC BIC logLik
494.3251 507.5949 -242.1626
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | Sex
Parameter estimates:
Male Female
1.0000000 0.9378976
Coefficients:
Value Std.Error t-value p-value
(Intercept) 17.811621 1.1119255 16.018718 0
age 0.650648 0.0975569 6.669421 0
SexFemale -2.321023 0.4396048 -5.279794 0
Correlation:
(Intr) age
age -0.965
SexFemale -0.173 0.000
Standardized residuals:
Min Q1 Med Q3 Max
-2.58304984 -0.65117352 -0.02923653 0.51805238 2.30992386
Residual standard error: 2.329342
Degrees of freedom: 108 total; 105 residual
The reference level of the variance structure is Male
.2
2 For ease of exposition, let me make a little helper function that pulls out the reference level of the variance structure.
<- function(mod) {
ref_level attr(mod$modelStruct$varStruct, "groupNames")[1]
}ref_level(gls1)
[1] "Male"
Here’s another gls
model, this time allowing the errors to be correlated within subject:
<- update(
gls2
gls1, correlation = corCompSymm(form = ~ 1 | Subject)
)ref_level(gls2)
[1] "Female"
The reference category has now changed to Female. The same thing happens if I fit a heteroskedastic error model using lme()
:
<- lme(
lme1 ~ age + Sex,
distance random = ~ 1 | Subject,
weights = varIdent(form = ~ 1 | Sex),
data = Orthodont
)ref_level(lme1)
[1] "Female"
It would be nice to be able to control the reference level.3
3 I’m not the only one seeking to do this. Someone on the R-SIG-mixed-models listserv had the same query in July of 2013, but none of the responses provide a solution.
What doesn’t work
The usual way that one would specify reference levels is to set a contrast for the factor:
contrasts(Orthodont$Sex) <- contr.treatment(2, base = 1)
But this does not seem to have any effect on the selected reference level of the variance structures:
update(gls1, data = Orthodont) |> ref_level()
[1] "Male"
update(gls2, data = Orthodont) |> ref_level()
[1] "Female"
update(lme1, data = Orthodont) |> ref_level()
[1] "Female"
Nor does using a factor with the levels in reverse order:
update(gls1, weights = varIdent(form = ~ 1 | Sex_rev)) |> ref_level()
[1] "Male"
update(gls2, weights = varIdent(form = ~ 1 | Sex_rev)) |> ref_level()
[1] "Female"
update(lme1, weights = varIdent(form = ~ 1 | Sex_rev)) |> ref_level()
[1] "Female"
Emma Knight suggested on R-SIG-mixed-models that the reference level is selected based on the category with the largest number of observations (which is Male
). That could be what determines the reference level in gls1
(though it clearly can’t be so for gls2
or lme
). I’ve created another factor called Sex3
where the largest category is Female. If this theory is true, then the reference level of gls1
should change to Female:
update(gls1, weights = varIdent(form = ~ 1 | Sex3)) |> ref_level()
[1] "Male"
Nothing changes with the other models either:
update(gls2, weights = varIdent(form = ~ 1 | Sex3)) |> ref_level()
[1] "Female"
update(lme1, weights = varIdent(form = ~ 1 | Sex3)) |> ref_level()
[1] "Female"
What about the order of the data? There is a bug report documenting that varIdent()
depends on the sort order of the categorical grouping variable when used in a gls()
call.4 Sorting the data in descending order of Sex3
does affect the reference level for gls1
:
4 This StackOverflow question notes the same issue and links to a patch from Ben Bolker.
<- arrange(Orthodont, Sex3)
Orthodont_sort <- arrange(Orthodont, desc(Sex3))
Orthodont_rev head(Orthodont_sort$Sex, 4)
[1] Female Female Female Female
attr(,"contrasts")
2
Male 0
Female 1
Levels: Male Female
head(Orthodont_rev$Sex3, 4)
[1] Non-binary Non-binary Non-binary Non-binary
Levels: Female Male Non-binary
update(
gls1, data = Orthodont_sort
|>
) ref_level()
[1] "Female"
update(
gls1, weights = varIdent(form = ~ 1 | Sex3),
data = Orthodont_rev
|>
) ref_level()
[1] "Non-binary"
However, this doesn’t seem to affect gls2
or lme1
at all:
update(
gls2, data = Orthodont_sort
|>
) ref_level()
[1] "Female"
update(
gls2, weights = varIdent(form = ~ 1 | Sex3),
data = Orthodont_rev
|>
) ref_level()
[1] "Female"
update(
lme1, data = Orthodont_sort
|>
) ref_level()
[1] "Female"
update(
lme1, weights = varIdent(form = ~ 1 | Sex3),
data = Orthodont_rev
|>
) ref_level()
[1] "Female"
Very curious.
How to control the reference level
I posted about this issue to R-SIG-mixed-models with some of the above notes. Sebastian Meyer provided a very helpful response, noting that
lme()
internally reorders the data by the grouping factor(s) before initialization, so the order of that factor (‘Subject’ in your example…) will indirectly determine the reference level of varIdent(), regardless of how your data or strata levels are ordered originally.
Similar re-ordering behavior occurs in gls()
when the model includes a grouping structure, as it does in gls2
because of the corCompSymm()
correlation structure. This insight provides a key for controlling the reference level: for a given categorical variable, the reference level will be taken as the level of the first observation in the first level of the grouping factor. Thus, for models with grouping structure, one can set the reference level to Male by re-ordering the levels of Subject
so that a unit with Sex = 'Male'
occurs first.
$Subject2 <- fct_relevel(Orthodont$Subject, "M01")
Orthodonttable(Orthodont$Subject2) |> head()
M01 F01 F02 F03 F04 F05
4 4 4 4 4 4
update(
gls2, correlation = corCompSymm(form = ~ 1 | Subject2)
|>
) ref_level()
[1] "Male"
update(
lme1, random = ~ 1 | Subject2
|>
) ref_level()
[1] "Male"
For models without grouping structure, the reference level is indeed controlled by sort order, as if there were only a single group. To control the reference level, one would need to sort the data accordingly. For example, to make Female the reference level in gls1
, the first observation in the data needs to have Sex = 'Female'
. This is so in Orthodont_sort
:
update(gls1, data = Orthodont_sort) |> ref_level()
[1] "Female"
What if the categorical variable varies within the grouping factor? For instance, say that we want to fit a model with heteroskedastic errors varying by age of the subject, which varies from 8 to 14 across the repeated measures of each participant. Here’s a basic heteroskedastic error gls()
:
<- gls(
gls3 ~ age + Sex,
distance weights = varIdent(form = ~ 1 | age),
data = Orthodont
)
And here is one with a correlation structure:
<- update(
gls4
gls3, correlation = corCompSymm(form = ~ 1 | Subject)
)
In both cases, the reference level is the youngest age:
c(ref_level(gls3), ref_level(gls4))
[1] "8" "8"
To set this to something else in the first model, just re-arrange the first few rows:
<- Orthodont[c(4,1,2,3,5:108),]
Orthodont2
update(
gls3, data = Orthodont2
|>
) ref_level()
[1] "14"
However, this doesn’t work for gls4
5 because the first level of the grouping factor is not the first one to appear in the data.6 But we could sort the data by the grouping factor and then by age
so that the desired reference level appears first:
5 See
update(gls4, data = Orthodont2) |> ref_level()
[1] "8"
6 The first level of Subject
corresponds to rows 65 through 68:
which(levels(Orthodont2$Subject)[1] == Orthodont2$Subject)
[1] 65 66 67 68
<-
Orthodont2_sort %>%
Orthodont arrange(Subject, desc(age))
head(Orthodont2_sort)
distance age Subject Sex Sex_rev Sex3 Subject2
68 23.0 14 F01 Female Female Female F01
67 21.5 12 F01 Female Female Female F01
66 20.0 10 F01 Female Female Female F01
65 21.0 8 F01 Female Female Female F01
72 25.5 14 F02 Female Female Female F02
71 24.0 12 F02 Female Female Female F02
update(gls4, data = Orthodont2_sort) |> ref_level()
[1] "14"
It’s not elegant, nor particularly intuitive, but it seems to work.
Setting a value
Sebastian also suggested another technique for directly controlling the reference level of the variance structure:
It seems that only if there are more than two strata, the reference level for varIdent() can be chosen via a named initial parameter ‘value’, leaving out the desired reference level, but I haven’t tested if this works as intended with both gls() and lme(). It would then make sense to support such a choice also in the case of only two strata, so for a single parameter.
Setting the argument value
of varIdent()
specifies initialization values for the variance structure. Sebastian’s suggestion is to specify values for every level of the categorical variable except for the desired reference level. Let me give this a try with the Sex3
variable, which has three distinct levels.
<- update(
gls_M
gls1, weights = varIdent(form = ~ 1 | Sex3)
)
<- update(
gls_F
gls1, weights = varIdent(
value = c(`Non-binary` = 0.01, Male = 0.01),
form = ~ 1 | Sex3
)
)
<- update(
gls_NB
gls1, weights = varIdent(
value = c(Female = 0.01, Male = 0.01),
form = ~ 1 | Sex3
)
)
coef(gls_M$modelStruct$varStruct, unconstrained = FALSE)
Non-binary Female
0.7438755 0.8608531
coef(gls_F$modelStruct$varStruct, unconstrained = FALSE)
Non-binary Male
0.864114 1.161639
coef(gls_NB$modelStruct$varStruct, unconstrained = FALSE)
Female Male
1.157254 1.344311
Indeed, this works for models with variance structures but no correlation structures or random effects. Likewise for gls
models that include grouping structure:
update(
gls2, weights = varIdent(form = ~ 1 | Sex3)
|>
) ref_level()
[1] "Female"
update(
gls2, weights = varIdent(
value = c(`Non-binary` = 0.01, Female = 0.01),
form = ~ 1 | Sex3
)|>
) ref_level()
[1] "Male"
update(
gls2, weights = varIdent(
value = c(Female = 0.01, Male = 0.01),
form = ~ 1 | Sex3
)|>
) ref_level()
[1] "Non-binary"
And for lme
models with random effects:
update(lme1, weights = varIdent(form = ~ 1 | Sex3)) |>
ref_level()
[1] "Female"
update(
lme1, weights = varIdent(
value = c(`Non-binary` = 0.01, Female = 0.01),
form = ~ 1 | Sex3
)|>
) ref_level()
[1] "Male"
update(
lme1, weights = varIdent(
value = c(Female = 0.01, Male = 0.01),
form = ~ 1 | Sex3
)|>
) ref_level()
[1] "Non-binary"
However, as Sebastian noted, it does not work for categories with only two levels (so only one non-reference level):
update(
gls1, weights = varIdent(
value = c(Male = 0.01),
form = ~ 1 | Sex
)|>
) ref_level()
[1] "Male"
update(
gls2, weights = varIdent(
value = c(Female = 0.01),
form = ~ 1 | Sex
)|>
) ref_level()
[1] "Female"
update(
lme1, weights = varIdent(
value = c(Female = 0.01),
form = ~ 1 | Sex
)|>
) ref_level()
[1] "Female"
Thus, this is only a partial solution. I agree with Sebastian that it would be useful to support user-specified levels within the value
argument, even for factors with just one non-reference level.
Colophon
::session_info() sessioninfo
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14 ucrt)
os Windows 11 x64 (build 22631)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.utf8
ctype English_United States.utf8
tz America/Chicago
date 2024-12-28
pandoc 3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P cli 3.6.3 2024-06-21 [?] RSPM
P colorspace 2.1-1 2024-07-26 [?] RSPM
P digest 0.6.37 2024-08-19 [?] RSPM (R 4.4.0)
P dplyr * 1.1.4 2023-11-17 [?] RSPM
P evaluate 0.23 2023-11-01 [?] RSPM (R 4.4.0)
P fansi 1.0.6 2023-12-08 [?] RSPM
P fastmap 1.1.1 2023-02-24 [?] RSPM (R 4.4.0)
P forcats * 1.0.0 2023-01-29 [?] RSPM
P generics 0.1.3 2022-07-05 [?] RSPM
P ggplot2 * 3.5.1 2024-04-23 [?] RSPM
P glue 1.8.0 2024-09-30 [?] RSPM (R 4.4.0)
P gtable 0.3.6 2024-10-25 [?] RSPM (R 4.4.0)
P hms 1.1.3 2023-03-21 [?] RSPM
P htmltools 0.5.7 2023-11-03 [?] RSPM (R 4.4.1)
P htmlwidgets 1.6.4 2023-12-06 [?] RSPM
P jsonlite 1.8.8 2023-12-04 [?] RSPM
P knitr 1.47 2024-05-29 [?] RSPM (R 4.4.0)
P lattice 0.22-5 2023-10-24 [?] RSPM (R 4.4.1)
P lifecycle 1.0.4 2023-11-07 [?] RSPM
P lubridate * 1.9.3 2023-09-27 [?] RSPM
P magrittr 2.0.3 2022-03-30 [?] RSPM
P munsell 0.5.1 2024-04-01 [?] RSPM
P nlme * 3.1-166 2024-08-14 [?] RSPM
P pillar 1.9.0 2023-03-22 [?] RSPM
P pkgconfig 2.0.3 2019-09-22 [?] RSPM
P purrr * 1.0.2 2023-08-10 [?] RSPM
P R6 2.5.1 2021-08-19 [?] RSPM
P readr * 2.1.5 2024-01-10 [?] RSPM
renv 1.0.5 2024-02-29 [1] RSPM (R 4.4.0)
P rlang 1.1.4 2024-06-04 [?] RSPM
P rmarkdown 2.27 2024-05-17 [?] RSPM
P rstudioapi 0.17.1 2024-10-22 [?] RSPM (R 4.4.0)
P scales 1.3.0 2023-11-28 [?] RSPM
P sessioninfo 1.2.2 2021-12-06 [?] RSPM (R 4.4.0)
P stringi 1.8.4 2024-05-06 [?] RSPM
P stringr * 1.5.1 2023-11-14 [?] RSPM
P tibble * 3.2.1 2023-03-20 [?] RSPM
P tidyr * 1.3.1 2024-01-24 [?] RSPM
P tidyselect 1.2.1 2024-03-11 [?] RSPM
P tidyverse * 2.0.0 2023-02-22 [?] RSPM
P timechange 0.3.0 2024-01-18 [?] RSPM
P tzdb 0.4.0 2023-05-12 [?] RSPM
P utf8 1.2.4 2023-10-22 [?] RSPM
P vctrs 0.6.5 2023-12-01 [?] RSPM
withr 3.0.2 2024-10-28 [1] RSPM (R 4.4.0)
P xfun 0.45 2024-06-16 [?] RSPM (R 4.4.0)
P yaml 2.3.8 2023-12-11 [?] RSPM (R 4.4.0)
[1] C:/Users/jamespustejovsky/Documents/jepusto-quarto/renv/library/R-4.4/x86_64-w64-mingw32
[2] C:/Users/jamespustejovsky/AppData/Local/R/cache/R/renv/sandbox/R-4.4/x86_64-w64-mingw32/e0da0d43
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────