Modelling 1 - Spatial random effects (solutions)

Slides

Exercises

Packages

# Load the packages needed for this practical
if (!require("pacman")) install.packages("pacman")
pkgs = c("sf",        # for simple features representing spatial data
         "mgcv",      # for fitting models
         "tidyverse") # for ggplot2, tidyr, readr, dplyr, %>% pipe
pacman::p_load(pkgs, character.only = T)

Don’t forget to define the functions that provide the Queen and Rook adjacency rules.

st_queen <- function(a, b = a,...) st_relate(a, b, pattern = "F***T****", ...)
st_rook  <- function(a, b = a,...) st_relate(a, b, pattern = "F***1****", ...)

Case study - Haiti vaccine coverage

The functions in sf allow us to define neighbourhood structures and convert these to spatial weights matrices in R. The spatial weights matrix will be used later for regression modelling and assessing spatial correlation.

Exploratory analysis

Exercise 1

  1. Load the adm2.geojson file and use the stringsAsFactors = F option
  2. Create a spatial weights matrix using st_queen()
  3. Look at the structure of the spatial weights matrix either by printing or Viewing it. Hint: you can add 0 to the matrix to convert from logical to numeric entries
library(sf)
sf_use_s2(FALSE)
adm2 <- st_read('data/haiti/adm2.geojson', 
                stringsAsFactors = F,
                quiet = TRUE)

# create a neighbours object
adm2_nb <- st_queen(adm2, sparse=FALSE)

We can show the adjacency matrix by converting from R’s default matrix class to the Matrix class of the Matrix package.

library(Matrix)
adm2_nb %>%
  Matrix %>%
  image

Or if we want a ggplot2 version, we would convert from a matrix to a data frame. This data frame has 138 rows and columns and is symmetric, so we add a column that denotes which row of the matrix each row of the data frame represents, and then pivot from wide to long format, so that we can plot that for this row and column in the adjacency matrix, the value is given by a 0 or a 1. We need to ensure that the row and column variables are both integers, as if they were character labels, the string 1 is followed by the string 11, then 12, rather than \(1,2,3\) as we know they should follow.

as.data.frame(adm2_nb) %>%
    tibble::rowid_to_column(var = "row") %>% 
  pivot_longer(cols = -row, names_to = "column", values_to = "value") %>%
  mutate(value = value + 0) %>%
  mutate_at(.vars = vars(column, row, value), 
            .funs = function(x){parse_number(as.character(x))}) %>%
  ggplot(data = ., aes(x = column, y=row)) + 
  geom_raster(aes(fill = factor(value))) +
  coord_equal() +
  scale_y_reverse() +
  theme_bw() +
  scale_fill_manual(values = c("0" = "white", "1" = "black")) +
  theme(legend.position = "none", panel.grid = element_blank())
Adjacency matrix for adm2

Figure 1: Adjacency matrix for adm2

Exercise 2

The variable dtp3coverage2016 is the vaccine coverage proportion,

\[ p_{i}= \frac{Y_{i}}{m_{i}}:i=1,\ldots,n \]

  1. Create a plot showing the vaccine coverage for each area.
  2. Without performing a formal test, does it appear that there is a spatial effect at play here?

We can use either the mapview package or use ggplot2’s ability to plot simple features to show the vaccine coverage.

library(mapview)
mapview(adm2, zcol = "dtp3coverage2016")
ggplot() + geom_sf(data = adm2, aes(fill = dtp3coverage2016)) +
  scale_fill_viridis_c() +
  theme_bw()

We can see that regions with high coverage tend to be near other regions of high coverage, but at this point we are unsure if this spatial variability is due to a predictor variable we have measured or due to some spatial process that we can’t directly model.

Exercise 3

Conduct an exploratory analysis to look at the relationship between DTP3 coverage in 2016 and the percentage of residents living in urban settings using the empirical logit transform of the proportion data.

  1. Create a variable elogitp which is the empirical logit of the vaccine coverage using the formula
    \[ \tilde{p_{i}}=\log\left(\frac{Y_{i}+1/2}{m_{i}-Y_{i}+1/2}\right):i=1,\ldots,n \]

  2. Create a visualisation to compare the empirical logit vaccine coverage to the proportion of people living in urban environments in each administrative region.

  3. If we wish to explain the variability in vaccine coverage, does it appear that there’s an association with urban living?

adm2 <- mutate(adm2,
               p = vacc_num/vacc_denom,
               elogitp = log(
                 (vacc_num + 1/2) /              # successes + 1/2n
                   (vacc_denom - vacc_num + 1/2) # failures + 1/2n
               ) # end log
) # end mutate

# Compare to the percentage of urban
ggplot(adm2, aes(x = urbanpct, y = elogitp)) + 
  geom_point() +
  geom_smooth() +
  theme_bw()

Fit intercept logistic model

Exercise 4

  1. Fit a GLM with just an intercept. This will give us an idea of the average vaccine coverage across the study area.

We can look at the summary of the fitted model to get a table of coefficient estimates (on the logit scale) and the AIC.

mod_base <- glm(p ~ 1, data = adm2, 
                family = "binomial", 
                weights = vacc_denom)

summary(mod_base)
## 
## Call:
## glm(formula = p ~ 1, family = "binomial", data = adm2, weights = vacc_denom)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.709750   0.003804   186.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48588  on 137  degrees of freedom
## Residual deviance: 48588  on 137  degrees of freedom
## AIC: 49609
## 
## Number of Fisher Scoring iterations: 4

Alternatively, the glance function from the broom package can give us some quick model summaries as a data frame.

library(broom)
glance(mod_base)
## # A tibble: 1 × 8
##   null.deviance df.null  logLik    AIC    BIC deviance df.residual  nobs
##           <dbl>   <int>   <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
## 1        48588.     137 -24803. 49609. 49612.   48588.         137   138
  1. Take the inverse logit of the estimate of the intercept and its 95% confidence interval. What is the average vaccine coverage in the study region?
library(boot) # for the inv.logit() function
bind_cols(data.frame(estimate = coef(mod_base)),
      t(confint(mod_base))) %>%
  mutate_all(.funs = inv.logit)
##              estimate     2.5 %    97.5 %
## (Intercept) 0.6703458 0.6686967 0.6719922

Exercise 5

  1. Fit a logistic regression model with urbanpct as a predictor variable.
  2. What unit is urbanpct measured in?
  3. Calculate the Odds ratio of urbanpct and its 95% confidence interval for each unit increase
  4. Use a calculation within the function I() (which inhibits the conversion of objects) so that urbanpct is in units of 10%.
  5. Refit the model and calculate the Odds ratio.

The urbanpct is measured as a proportion between 0 and 1, representing the effect of moving everyone in a region from living in a rural area to everyone living in an urban area in that same region.

mod_urbanpct <- glm(p ~ 1 + urbanpct,
                    data = adm2, 
                    family = "binomial",
                    weights = vacc_denom)

tidy(mod_urbanpct, conf.int = TRUE) %>%
  dplyr::filter(term != "(Intercept)") %>%
  dplyr::select(term, estimate, conf.low, conf.high) %>%
  dplyr::mutate(estimate = exp(estimate),
                conf.low = exp(conf.low),
                conf.high = exp(conf.high))
## # A tibble: 1 × 4
##   term     estimate conf.low conf.high
##   <chr>       <dbl>    <dbl>     <dbl>
## 1 urbanpct     1.97     1.92      2.01
# re-scaled

glm(p ~ 1 + I(urbanpct*10), # convert to effect of 10% increases instead of 100% increase
    data = adm2, 
    family = "binomial",
    weights = vacc_denom) %>%
  tidy(., conf.int = TRUE) %>%
  dplyr::filter(term != "(Intercept)")  %>%
  dplyr::mutate(estimate = exp(estimate),
                conf.low = exp(conf.low),
                conf.high = exp(conf.high))
## # A tibble: 1 × 7
##   term             estimate std.error statistic p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 I(urbanpct * 10)     1.07   0.00111      60.7       0     1.07      1.07

NB: Regarding the Odds Ratios, \(1.07^{10} \approx 1.97\)

Exercise 6

  1. Extract the residuals from the urbanpct model
  2. Add the residuals to the data.frame perhaps called urban_res
  3. Map the urban_res residuals.
  4. After fitting the model with one spatially varying variable, does it appear that there is any leftover spatial variability in the model?
adm2_urbanpct_aug <- bind_cols(adm2, augment(mod_urbanpct)) %>%
  rename(urban_res=.resid)

ggplot() + geom_sf(data = adm2_urbanpct_aug,
                   aes(fill = urban_res)) +
  scale_fill_gradient2() +
  theme_bw()

We see that the eastern regions tend to have a bluer colour, indicating that these have a higher residual value, indicating that the model underpredicts in these regions.

Areal spatial regression models

Exercise 7

  1. Extract the fitted values and add them to the data frame containing the adm2 data
  2. Reshape the data frame to prepare for visualisation of the observed values as well as the predictions from the models with and without the MRF smooth.
  3. Make a visualisation with small multiples that shows the fitted values and observed values

Fitting the model it isn’t necessary to pass in the list of matrices representing the polygons, as we will generally do our plotting by combining model output with the adm2 simple feature rather than directly plotting the model object.

library(mgcv)
adm2 <- mutate(adm2, 
               adm1code = factor(adm1code))

iid_urbanpct <- gam(p ~ 1 + urbanpct + 
                      s(adm1code, bs="re"),
                    data = adm2, 
                    family = "binomial",
                    weights = vacc_denom)

adm2 <- mutate(adm2, 
               adm2code = factor(adm2code),
               REGION_ID = levels(adm2code))


nb_adm2 <- adm2 %>% 
    st_transform(crs = "+proj=utm +zone=18 ellps=WGS84") %>%
    st_queen(sparse=FALSE)

W_adm2 <- diag(rowSums(nb_adm2)) -nb_adm2 

adm2_polygons <- st_coordinates(adm2) %>%
  as.data.frame %>%
  select(X, Y, L3) %>%
  split(.$L3) %>%
  map(~as.matrix(.x[,c(1,2)]))

names(adm2_polygons) <- adm2$adm2code

mrf_urbanpct <- gam(p ~ 1 + urbanpct + 
                      s(adm2code,
                        bs="mrf", k = 100,
                        xt = list(penalty = nb_adm2,
                                  polys = adm2_polygons)),
                    data = adm2, 
                    family = "binomial",
                    weights = vacc_denom)

Now we gather on the observed p and fitted .fitted columns so we can plot with small multiples,

adm2_fitted <-
  adm2 %>%
  mutate(GLM = fitted(mod_urbanpct, type = "response"),
         IID = fitted(iid_urbanpct, type = "response"),
         MRF = fitted(mrf_urbanpct, type = "response"))

adm2_fitted %>%
  rename(Observed = p) %>%
  pivot_longer(cols = c(Observed, GLM, IID, MRF), names_to = "key", values_to = "value") %>%
  ggplot(data = .) +
  geom_sf(aes(fill = value)) +
  facet_wrap(~key) +
  scale_fill_viridis_c(limits = c(0,1)) + theme_bw()

Exercise 8

  1. Extract the residuals from the fitted model objects
  2. Reshape the data frame appropriately
  3. Create a spatial visualisation for the residuals from each model
  4. Create a scatterplot showing how the fitted values (on the link scale) vary with the empirical logit, to show how close the predictions are to the data
adm2_residuals <-
  adm2 %>%
  mutate(GLM = residuals(mod_urbanpct, type = "pearson"),
         IID = residuals(iid_urbanpct, type = "pearson"),
         MRF = residuals(mrf_urbanpct, type = "pearson"))

adm2_residuals %>%
  pivot_longer(cols = c(GLM, IID, MRF), names_to = "key", values_to = "value") %>%
  ggplot(data = .) +
  geom_sf(aes(fill = value)) +
  facet_wrap(~key) +
  scale_fill_gradient2() + theme_bw() +
  theme(legend.position = "bottom")

And for the scatterplot with the empirical logit

adm2_fitted %>%
  pivot_longer(cols = c(GLM, IID, MRF), names_to = "key", values_to = "value") %>%
  ggplot(data = .) +
  geom_point(aes(x = elogitp, y = logit(value))) +
  facet_wrap(~key) +
  geom_abline(intercept = 0, slope = 1) +
  theme_bw() +
  coord_equal() + 
  ylab("logit(fitted value)") + xlab("Empirical logit of observed coverage")

Exercise 9

  1. Calculate the spatial correlation of the residuals from these two models

  2. Calculate the standard deviation of the residuals to show how much leftover variation there is.

library(spdep)
adm2_listw <- mat2listw(adm2_nb, style = "B", zero.policy = TRUE)

calculate_moran <- function(x){
  moran(x = x,# what is correlated?
        listw = adm2_listw, # how are the regions connected?
        n = nrow(adm2_nb),  # how many regions?
        S0 = sum(adm2_nb),  # how many connections are there? 
        zero.policy = TRUE)$I
}

adm2_residuals %>%
  as.data.frame %>%
  summarise(GLM = calculate_moran(GLM),
            IID = calculate_moran(IID),
            MRF = calculate_moran(MRF))
##         GLM        IID       MRF
## 1 0.1207994 0.02177376 0.6328509
adm2_residuals %>%
  as.data.frame %>%
  summarise(GLM = sd(GLM),
            IID = sd(IID),
            MRF = sd(MRF))
##        GLM      IID      MRF
## 1 17.99376 16.60707 9.388051
glance(mod_urbanpct)
## # A tibble: 1 × 8
##   null.deviance df.null  logLik    AIC    BIC deviance df.residual  nobs
##           <dbl>   <int>   <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
## 1        48588.     137 -22935. 45875. 45881.   44852.         136   138
glance(iid_urbanpct)
## # A tibble: 1 × 7
##      df  logLik    AIC    BIC deviance df.residual  nobs
##   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>       <dbl> <int>
## 1  11.0 -19446. 38913. 38945.   37873.        127.   138
glance(mrf_urbanpct)
## # A tibble: 1 × 7
##      df logLik    AIC    BIC deviance df.residual  nobs
##   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>       <dbl> <int>
## 1  101. -5999. 12199. 12495.   10979.        37.0   138
  1. Which of these models best accounts for the spatial variability in the data and why?

The Markov Random Field model is best. While the spatial correlation of the residuals is higher, their standard deviation is much lower; indicating that more of the spatial variability has been accounted for. This is confirmed by the AIC of the MRF model being 12199 which is much lower than the values of 45875 and 38913 for the non-spatial GLM and the model with an IID spatial random effect at the adm1 level respectively.

Extension exercises

Exercise 10

Modify the code above to conduct the analysis for an MRF spatial effect for the administrative regions in the adm1code variable and compare the effect of urbanicity, residuals and AIC to the model with the MRF for the adm2code variable.

First, we need to aggregate the level 2 admin regions to their level 1 counterparts

adm1 <- aggregate(x = adm2, by = list(adm2$adm1_en), FUN = length)

The FUN argument here is a bit arbitrary, as we just want the geometry. Here we’ve chosen length because it doesn’t matter what type of variable is being aggregated.

We can then calculate the spatial adjacency

W1       <- st_queen(adm1, sparse = F)
W1_mat   <- as.matrix(W1) + 0L
W1_listw <- mat2listw(W1_mat, style = "B", zero.policy = TRUE)

W1
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
##  [2,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [3,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [4,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
##  [5,]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [9,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [10,]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Or turn the level 1 geometry into a list of matrices representing the polygons and have R work it out for us in the model fitting.

nb_adm1 <- adm1 %>% 
    split(.$Group.1) %>%
    map(~st_coordinates(.x)) %>%
    map(~matrix(.x[,1:2], ncol = 2, byrow=F))

We now fit our model,

gam_adm1 <- gam(data = adm2,
                p ~ 1 + urbanpct + 
                    s(factor(adm1_en), 
                      bs = 'mrf',
                      xt = list(polys = nb_adm1)),
                family = 'binomial',
                weights = vacc_denom)

Having fit the model we can now visualise the predictions alongside the data

# here we use as.numeric because predict() wants to return an array
adm2 %<>% mutate(ADM1 = as.numeric(predict(gam_adm1,     ., type = 'response')))
adm2 %<>% mutate(ADM2 = as.numeric(predict(mrf_urbanpct, ., type = 'response')))

adm2 %>% select(geometry, data = p, ADM1, ADM2) %>%
    gather(key, value, -geometry) %>%
    ggplot(data = .) +
    geom_sf(aes(fill = value),
            size = 0.5,
            color = 'grey40') +
    facet_grid(. ~ key) +
    theme_minimal() +
    scale_fill_viridis_c()

We also want to look at the AIC

models <- list(ADM1 = gam_adm1,
               ADM2 = mrf_urbanpct)

lapply(models, glance)
## $ADM1
## # A tibble: 1 × 7
##      df  logLik    AIC    BIC deviance df.residual  nobs
##   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>       <dbl> <int>
## 1  11.0 -19446. 38913. 38945.   37873.        127.   138
## 
## $ADM2
## # A tibble: 1 × 7
##      df logLik    AIC    BIC deviance df.residual  nobs
##   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>       <dbl> <int>
## 1  101. -5999. 12199. 12495.   10979.        37.0   138

and see that the level 2 admin level model far outperforms the level 1 admin model, which makes sense given the level 1 model has 10 terms in its spatial effect (one for each department) vs the 100 found in the level 2 model (low rank approximation to the 138 arrondissements).

We have been warned about a region with no neighbours, so Moran’s \(I\) won’t work here just yet. But we can at least look at the standard deviation of the residuals

models %>%
    lapply("residuals") %>%
    lapply(sd)
## $ADM1
## [1] 16.6261
## 
## $ADM2
## [1] 8.949951

and visualise them

models %>%
    lapply("residuals") %>%
    lapply(function(x){mutate(adm2, .resid = x)}) %>%
    bind_rows(.id = "Model") %>%
    ggplot(data = .) +
    geom_sf(aes(fill = .resid)) +
    facet_grid(. ~ Model) +
    scale_fill_gradient2(midpoint = 0)

We see that the level 2 model has residuals which are smaller in magnitude even if they likely have higher spatial autocorrelation. The residuals for the level 1 model have less spatial autocorrelation because we have fit a (spatially smoothed) mean for each department.

Exercise 11

  1. Investigate which of the regions in the adm2 data set have zero neighbours or do not appear to be connected to their neighbours properly and determine an appropriate set of neighbours for them
  2. Modify the weights matrix to include these neighbours
  3. Refit the models and assess the goodness of fit and correlation of the residuals
W     <- st_queen(adm2)
W_mat <- as.matrix(W) + 0
W_mat <- W_mat - diag(diag(W_mat)) # subtract the diagonals, so W[i,i] = 0
W_empties <- which(rowSums(W_mat) == 0)

adm2 %<>% mutate(RECORD_ID = 1:nrow(.))

haiti_no_empties <- adm2[-c(W_empties), ]
haiti_empties    <- adm2[c(W_empties), ] 


# find each region's nearest neighbour
W_nn <- haiti_no_empties[
  st_nearest_feature(haiti_empties,
                     haiti_no_empties), "RECORD_ID"] %>%
  pull(RECORD_ID) 

for (i in 1:length(W_empties)){
  W_mat[W_nn[i], W_empties[i]] <- 1
  W_mat[W_empties[i], W_nn[i]] <- 1
}

W_listw <- mat2listw(W_mat, style = "B")

coords <- st_centroid(adm2) %>% st_coordinates

plot(W_listw, coords)

References