2 years ago
#70623
G. Channing
Species-by-year random effects in a GLMM (point count data)
SUMMARY
- I'm analyzing avian point count data using
glmmTMB
. - I'm trying to estimate year-specific mean abundance for each species.
- Models with interactions of fixed terms are not working, I think because limited data are split across several factors (species, year, week, site).
- I'm wondering if adopting a random-effects parameterization is reasonable (shrinking estimates to a realistic range)?
- I'm seeking guidance on what the code for that parameterization would look like.
Any and all recommendations or lessons are greatly appreciated! Thank you.
Intro
The data. I'm working on an analysis of a pre-existing database. The data are semi-structured, opportunistic observations of bird species abundance (zero-filled) collected via a stationary point count methodology. So, each species can be recorded at a site during a week of each year, but there are many "missing" observations since it's an opportunistic design. I'm looking for advice on modeling techniques, particularly related to random effects.
Modeling approach.
I want to estimate annual abundance for each species through a single model (akin to a multi-species, dynamic N-mix model, but assuming p = 1). Since the data are opportunistic counts, a zero-inflated and negative binomial model should make the most sense. Additionally, there is some pseudo-replication of counts at sites, so I know I need site as a random effect, e.g.: + (1|site)
. My understanding is that mgcv
or glmmTMB
are my best options for this type of modeling, and I know Gavin Simpson has mentioned that glmmTMB
is likely preferable over mgcv
when a factor used as a random effect has a large number (100s) of levels (here, 272 sites).
The issue
I've tried to use interaction effects comprised of fixed terms of species
and year
(similar to the Salamander example) to capture the species-specific annual estimates I'm interested in, but the model runs for hours, only to end up crashing. (Note: I can only get it to run & converge if I use a gaussian model, but I don't think that's reasonable given the data.) The terms week
and year
are factors (not integers) because I expect both to have non-linearity, which is important. Overall, I think there's not enough data for fully-independent estimates of these terms.
m0 <- glmmTMB(count ~ species*year + species*week + (1|site),
ziformula = ~ species,
family = nbinom2,
data = df)
Current direction
Random effects. I've often been taught that random effects should only be used to try to eliminate effects that are not of interest, but I was digging into resources online and I came across some of Ben Bolker's writing, which included a discussion of how random effects can have a practical utility beyond stricter definitions. So, I tried switching from interaction effects to various random effect parameterizations, in hopes of allowing levels of species and years to borrow from each other (shrinkage to "population" average).
However, I've gotten a bit confused along the way and I could use some help from others who have more experience working with this type of data.
Starting over.
I'm trying to restart by going back to the essentials, focusing only on species-year estimation. When I include a simple random effect structure, such as (1|species) + (1|year)
, the model estimates the same trend for each species, only varying the intercept, whereas I want each species to be able to be totally different. I think I need some sort of crossed or nested structure, but in reading up on those I got a bit confused for this case (i.e., several schools with their own students makes more sense, with lots of examples and explanations!).
Currently working.
What I can get running is m1
below, which produces the estimates I want. but I'm not sure if it's justifiable, or if there's something better. I also need something that can include week
and site
, too.
m1 <- glmmTMB(count ~ (1|species) + (0+species|winter),
ziformula = ~ (1|species),
family = nbinom2,
data = df)
Data
I added the data to Google Drive, which can be downloaded from this link.
Data summary
Two representations of the same data:
# A tibble: 262,040 × 6
count species checklist site year week
<dbl> <chr> <fct> <fct> <fct> <fct>
1 0 American Crow C1262 S174 2020 5
2 0 American Goldfinch C1262 S174 2020 5
3 0 American Robin C1262 S174 2020 5
4 0 American Tree Sparrow C1262 S174 2020 5
5 2 Black-capped Chickadee C1262 S174 2020 5
6 0 Blue Jay C1262 S174 2020 5
7 0 Brown Creeper C1262 S174 2020 5
8 0 Brown-headed Cowbird C1262 S174 2020 5
9 0 Carolina Wren C1262 S174 2020 5
10 0 Cedar Waxwing C1262 S174 2020 5
# … with 262,030 more rows
'data.frame': 262040 obs. of 6 variables:
$ count : num 0 0 0 0 2 0 0 0 0 0 ...
$ species : chr "American Crow" "American Goldfinch" "American Robin" "American Tree Sparrow" ...
$ checklist: Factor w/ 6551 levels "C0001","C0002",..: 1262 1262 1262 1262 1262 1262 1262 1262 1262 1262 ...
$ site : Factor w/ 272 levels "S001","S002",..: 174 174 174 174 174 174 174 174 174 174 ...
$ year : Factor w/ 33 levels "1989","1990",..: 32 32 32 32 32 32 32 32 32 32 ...
$ week : Factor w/ 21 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
Follow-up tests (from comments)
Fixed effect tests
1) Limiting to the 5 most abundant species, I get convergence warning 10.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; iteration limit reached without convergence (10). See vignette('troubleshooting')
2) Limiting to top 10 most abundant species, I get convergence warning 9.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
3) Limiting to the top 2 most abundant species: The model appears to run without issue if the ziformula is just ~1
(count ~ species*year, ziformula = ~1
). But, if I extend it to include the top 5 or the 10 most abundant species, it gives me convergence warning (9), and if I include all 40 species, it crashes R entirely.
4) Using just data from the top 2 most abundant species: if I include the week term, too (because species migrate over the 21 weeks), then I get a warning about the Hessian and also convergence warning (9):
# Model: count ~ species*week*year, ziformula = ~ 1
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
Random effect tests
1) Contrarily, if I use random effects (see model below), then I can include species in the ziformula as a random effect and the model runs without errors.
count ~ (species|year), ziformula = ~ (1|species)
So, it seems like random effects might be the only option? However, I’m not quite sure which random effects coding is justifiable for species-by-year. It seems to me that species should only be crossed with year, but besides (species|year)
, I don’t see any other way to produce separate species-by-year estimates without using a nested structure, which does not reflect reality given my understanding of what nested means (vs crossed). Is that the case?
2) Another note: Limiting to the top 10 species, if I use: count ~ species + (species|year)
, then the model allows a fixed species effect for zero-inflation: ziformula: ~species
. (I'm currently running this for all 40 species, but it's taking a while.)
r
lme4
mixed-models
mgcv
random-effects
0 Answers
Your Answer