Objectives

In this session we will fit some non-spatial models to spatial data in order to demonstrate key ideas in statistical modelling that will be built on in future modelling sessions.

The session will cover:

Revising how to fit linear models
Fitting GLMs
Some model diagnostics

A good text on learning the ins and outs of Generalised Linear Models is Dobson and Barnett (2008); for Generalised Additive models, Wood (2017); and for more general multilevel/hierarchical modelling, Gelman and Hill (2006).

Packages

# Load the packages needed for this practical
if (!require("pacman")) install.packages("pacman")
pkgs = c("sf",
         "scales",
         "mgcv",
         "tidyverse", "magrittr")
pacman::p_load(pkgs, character.only = T)

Refresher - linear models

Reading data in and visualising

Load the tidyverse package and read in the rats data. Ensure that the ID column is a factor

library(tidyverse)
rats_df <- read_csv("./data/rats/rats.csv", 
                    col_types = list(ID = 'f'))

rats_df

## # A tibble: 150 × 3
##    ID     time weight
##    <fct> <dbl>  <dbl>
##  1 1         8    151
##  2 2         8    145
##  3 3         8    147
##  4 4         8    155
##  5 5         8    135
##  6 6         8    159
##  7 7         8    141
##  8 8         8    159
##  9 9         8    177
## 10 10        8    134
## # ℹ 140 more rows

ggplot(data = rats_df,
       aes(x = time, y = weight)) + 
  geom_line(aes(group = ID)) +
  theme_minimal() +
  xlab("Time (days)") +
  ylab("Weight (grain?)")

Exercise: Fit a model showing the weight of the rats as a function of their age

rats_lm <- ...

Exercise: Look at the model summary() to determine whether or not there’s statistical evidence that the rats’ age affects their weight

Diagnostics

Exercise: Use the residuals() and fitted() functions to extract the residuals and fitted values from the model and add them to the rats_df data frame.

Exercise: Plot the residuals against the fitted values.

ggplot(data = rats_df,
       aes(x = ...,
           y = ...)) + 
  ...

Exercise: Do the residuals look IID with, as we move from left to right, the same variance and a mean of 0?

GLMs

Remember that Generalised Linear Models (GLMs) relate the observed outcome variable, \(\boldsymbol{Y}\), to the explanatory variables, \({X}\), through a link function, \(g(\cdot)\) and coefficients, \(\boldsymbol{\beta}\).

\[ g\left(\mathbb{E}(\boldsymbol{y})\right) = X\boldsymbol{\beta} \]

Here, \(g(\cdot)\) is the link between

the expected outcome, \(\mathbb{E}(\boldsymbol{y})\), and
the linear predictor, \(X\boldsymbol{\beta}\), a linear combination of the variables used to predict the outcome

In the following exercises, you’ll need to make sure you specify the family of model you wish to use (e.g. poisson, binomial) otherwise glm() defaults to using a Gaussian/normal model and will give the equivalent to lm()

A Binomial example

Loading and visualising data

We will consider data from Haiti on vaccine coverage for the tri-valent diphtheria-tetanus-pertussis vaccine in 2016. Make a fresh script and load the data (stored in geojson format).

library(sf)
adm2 <- st_read('adm2.geojson',
                stringsAsFactors = F,
                quiet = TRUE)

Exercise: Make a plot showing the relationship between the included dtp3coverage2016 and urbanpct variables. This variable is the proportion of people living in urban areas in the given arrondissement (level 2 admin unit). Include a scatterplot smoother to help visualise the relationship.

The loess smooth (locally weighted polynomial regression) doesn’t replace the need for a model. There may be a relationship but we can’t dismiss or accept it without modelling.

Model fitting

In order to fit the model, we need to tell glm() how many successes and failures we had. In the data frame, the relevant variables are

vacc_num - numerator for vaccine coverage, \(y\)
vacc_denom - denominator for vaccine coverage, \(n\)
urbanpct - proportion (not percentage) explanatory variable, \(x\)

So our number of failures is

adm2 <- mutate(adm2, vacc_fail = vacc_denom - vacc_num)

and the model call is

adm2_binom <- glm(data = adm2,
                  cbind(vacc_num, vacc_fail) ~ urbanpct,
                  family = 'binomial')

Diagnostics

Exercise: Show the model summary

We can also ask for tidier model summaries either as a data frame for the coefficients

library(broom)
tidy(adm2_binom, conf.int = TRUE)

or the model diagnostics

glance(adm2_binom)

Prediction

Our observed range of proportion of each region living in an urban area is (0.03, 1) which covers both almost completely urban and almost completely rural areas.

If we want to know what the predicted vaccine coverage was for each region, we can ask for the fitted values from our model

head(fitted.values(adm2_binom)) # this is the first six of 138

or use the predict() function on the fitted model

head(predict(adm2_binom, type = 'response'))

One of the easiest ways to do this is to use the augment() function from the broom package. It gives us a data frame with the variables used to fit the model and a number of model diagnostics.

rats_aug <- augment(rats_lm)
head(rats_aug)

We need to specify what type of predictions we want, as the default behaviour is to return them on the link (here, logit) scale

head(predict(adm2_binom))

##         1         2         3         4         5         6 
## 1.0302790 1.0367966 1.0241369 0.9863693 0.6342611 0.7650887

and we see that some predicted values are \(> 1\) which means these can’t possibly be predicted proportions.

Exercise: Add the predicted values (on the response scale) to the adm2 data frame and make a plot showing the predicted and observed values.

Exercise: Create a spatial visualisation of the residuals (on the logit scale). Discuss whether they appear to be IID with correlation of 0.

Extension exercise: Calculate the relative risk of vaccine coverage for a completely rural vs a completely urban population, given by finding the ratio of the probabilities in each.

Extension exercise: Fit a Poisson GLM to the vaccine coverage data. The offset variable to be used is the vacc_exp, which is derived from the proportion of vaccinees expected to be found in each arrondissement if the coverage was equal.

Extension exercise: Calculate the change in risk (relative) associated with a change from urbanicity of 1 to 0 by exponentiating the coefficient.

References

Dobson, Annette J, and Adrian G Barnett. 2008. An Introduction to Generalized Linear Models. Chapman; Hall/CRC.

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Wood, S. N. 2017. Generalized Additive Models: An Introduction with r. 2nd ed. Chapman; Hall/CRC.

Introduction to Modelling I - Statistical modelling with the Generalised Linear Model

Objectives

Packages

Refresher - linear models

Reading data in and visualising

Diagnostics

GLMs

A Binomial example

Loading and visualising data

Model fitting

Diagnostics

Prediction

References