1.1. Let’s begin
This session builds on the techniques taught in last week’s tutorials focused on spatial dependence & spatial autocorrelation. Today, you will learn how to build spatially-based regression models for dealing with spatial autocorrelation whenever you want to determine relationship between an outcome and other independent variables. In this exercise we will be using London’s Lower Super Output Area (LSOA) data from 2015 pertained to house prices (as a dependent variable), and assessing it’s relationship with public transport accessibility (PTA), average income and socioeconomic deprivation (IMD) as independent variables while accounting for spatial autocorrelation.
We will implement three models:
Before you begin do make sure to download all data from Moodle. If you are working from a UCL workstation, do create a folder called “Week 5” within your “GEOG0114” folder stored in the N-drive.
Extract all data from the zip folder and stored into “Week 5” folder. Open a new R script and set the work directory to Week 5’s folder (i.e., Home (N:) > GEOG0114 > Week 5)
# Set working directory to Week 5 folder
setwd("N:/GEOG0114/Week 5")
1.2. Loading and installing packages
We will need to load the following packages:
sf
: Simple Featurestmap
: Thematic Mappingspdep
: Spatial Dependence (Weighting schemes & Spatial Statistics)sp
: Package for providing classes for spatial data (points, lines, polygons and grids)# Load packages using library() function
library("sf")
library("tmap")
library("spdep")
library("sp")
The above packages sf
, tmap
, spdep
& sp
should have been installed in the previous session(s).
We will need to install the following packages:
spatialreg
: provides functions for spatial regression modelling.# Install the packages: spatialreg using the install.package()
install.packages("spatialreg")
# Load the packages with library()
library("spatialreg")
1.3. Loading datasets
Let us first import the quantitative data i.e., London LSOA 2015 data.csv
into R/RStudio.
# Use read.csv() to import
datafile <- read.csv(file = "London LSOA 2015 data.csv", header = TRUE, sep = ",")
NOTE: The description of the column names are as follows:
Column Name | Description |
---|---|
LSOACODE |
Unique identification code for the geographic area |
AVEPRICE (Dependent variable) |
Average house price estimated for the LSOA in 2015 |
AVEINCOME |
Estimated average annual income for households within an LSOA in 2015 |
IMDSCORE |
Deprivation score for an LSOA in 2015 |
PTAINDEX |
Measures levels of access/connectivity to public transport |
PTACAT |
PTAINDEX rendered into a categorical variable |
The above data were sourced from the London DATASTORE. Next, we import the shape files for London (i.e., LSOA- and Borough-level):
London LSOA Areas.shp
London Borough Areas.shp
# Use read_sf() function to load shape file
LSOAshp <- read_sf("London LSOA Areas.shp")
BOROUGHshp <- read_sf("London Borough Areas.shp")
The code chunk below generates an empty map with the tmap
functions. It inspects the spatial configuration of London’s LSOA with the Boroughs superimposed.
# Generate an empty map to visualise the spatial configuration and hierarchy of LSOA and Boroughs
# First add LSOA layer
tm_shape(LSOAshp) + tm_polygons() +
# Add Borough layer on top of LSOA layer and make it transparent with alpha = 0
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
# Apply cosmetics by adding compass and scale
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("left", "bottom"))
2.1. Reporting basic summary statistical measures
For any statistical & spatial analysis its always useful to conduct some descriptive analysis of your dataset. You can obtain basic summary measures using the summary()
function which reports the overall result i.e., minimum, maximum, median, mean, 25th and 75th percentile values. For instance, we can report these estimates for the house price variable:
summary(datafile$AVEPRICE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 133997 310040 409982 528509 590064 5751342
We can see that the mean price across the LSOAs is £528,509
with ranges from £133,997
to £5,751,342
. You can also report the standard deviation across all LSOAs using the sd()
function. As you can see the result is large and estimated as ±£418,842.5
.
sd(datafile$AVEPRICE)
## [1] 418842.5
NOTE: We have compiled some useful functions for reporting basic measures for descriptive analysis of the fly:
min()
: minimummax()
: maximummean()
: meanmedian()
: mediansd()
: standard deviation2.2. Reporting the spatial distribution of our variables
Now lets examine the spatial distribution for the house price. First, the LSOA data is stored in datafile
which is just a data frame object. We need to merge this to the spatial object LSOAshp
before using the tmap
functions. Lets create a new spatial data frame and name it spatialdatafile
# Merge datafile to LSOAshp uniquely by using "LSOACODE column
spatialdatafile <- merge(LSOAshp, datafile, by.x = "LSOACODE", by.y = "LSOACODE")
Lets now generate our first map to inspect the distribution of house prices. We can actually store a picture as an image object called plot1
plot1 <- tm_shape(spatialdatafile) + tm_fill("AVEPRICE", style = "quantile", n = 7, palette = "Greens") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
# plot the image object
plot1
Descriptively, we can observe already an interesting pattern. LSOAs in parts of North, Central and the central West of Inner London tend to have high-priced properties on average exceeding £758,000.00
, whereas in parts of Hillingdon, Enfield, Croydon, Bexley, Barking & Dagenham have on average cheaper properties below £273,000.00
. We need to take note that the patterns look clustered; however, this descriptive reporting is by no means an evidence-based analysis for assessing clustering or dispersion (i.e., spatial autocorrelation) - a Moran’s I test will be able to diagnosis this problem for house price but when other factors are involved we will need to use the residuals in the Moran’s I test.
Let us visualise three other variables which will subsequently be treated as independent variables in a regression to eyeball whether they are correlated with the house prices. Let us plot maps for income, deprivation and PAT (categories); and then stitch them together by invoking the tmap_arrange()
function.
# create 3 separate maps and store them in plot2, plot3 & plot4 objects
# map for income
plot2 <- tm_shape(spatialdatafile) + tm_fill("AVEINCOME", style = "quantile", n = 7, palette = "Oranges") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
# map for socioeconomic deprivation
plot3 <- tm_shape(spatialdatafile) + tm_fill("IMDSCORE", style = "quantile", n = 7, palette = "Reds") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
# map for public transport accessibility categories (PTACAT)
plot4 <- tm_shape(spatialdatafile) + tm_fill("PTACAT", style = "cat", palette = "Blues") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
# stitch the maps together using tmap_arrange() function
tmap_arrange(plot1, plot2, plot3, plot4, nrow = 2)
Visually, the income appears to be strongly correlated with house price. The patterns for socioeconomic deprivation on the other hand appears to have a negative correlation with house price. PTA’s relationship is unclear.
2.3. Fitting a non-spatial Linear Regression on spatial data & checking the residuals
Next, we are going to fit a linear regression where the response variable is AVEPRICE
and the predictors are AVEINCOME
, IMDSCORE
and PTACAT
, and then extract the residuals from the model in order to test for spatial autocorrelation.
When fitting a linear regression - we must test that the residuals have the following properties:
QQ-plot
.IMPORTANT NOTE: Things that can violate the above requirements for our residuals are the variables inputted to the model being very skewed. In our case, all variables are very skewed. So, it is best to transform them using the log10()
scale. Just do this transformation in the models since, we only care about the residuals. Note that there is no right or wrong when it comes to implementing the transformation, as long as the assumption for the residuals from a linear regression are not violated then all is fine.
If one of these characteristics are violated then its an indication that the data are not independent. This could possibly be due to some data artefact (i.e., a critical error in the data itself), or the residuals being correlated with each other. If they are correlated, then we should check first by mapping the residuals on the map to examine it’s spatial patterns for clustering and testing it with the Moran’s I test.
Let’s demonstrate by using the built-in function lm()
to create a simple linear or multivariable linear regression.
# lm() function builds a regression model and stores model output into the object 'modelMLR'
modelMLR <- lm(log10(AVEPRICE) ~ log10(AVEINCOME) + log10(IMDSCORE) + log10(PTAINDEX), data = spatialdatafile)
# Include the 'scipen=7' argument in the summary() function remove those annoying scientific notation!
options(scipen = 7)
# summary() calls report the output stored in object 'modelMLR'
summary(modelMLR)
##
## Call:
## lm(formula = log10(AVEPRICE) ~ log10(AVEINCOME) + log10(IMDSCORE) +
## log10(PTAINDEX), data = spatialdatafile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39249 -0.06489 -0.00572 0.06046 0.62993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.100992 0.095592 -42.901 < 2e-16 ***
## log10(AVEINCOME) 2.036354 0.019340 105.292 < 2e-16 ***
## log10(IMDSCORE) 0.136713 0.007681 17.798 < 2e-16 ***
## log10(PTAINDEX) 0.030055 0.004816 6.241 0.000000000471 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1027 on 4964 degrees of freedom
## Multiple R-squared: 0.789, Adjusted R-squared: 0.7889
## F-statistic: 6189 on 3 and 4964 DF, p-value: < 2.2e-16
IMPORTANT SIDENOTE: The above results presented in the above table indicates the relationship between the dependent and independent variable. Under the column for Estimate
reports the overall intercept and coefficients for each independent variable. The p-values for the estimates are reported under the column Pr(>|t|)
which determines whether the relationship between the dependent and independent variables are statistically significant or not.
Lastly, the other two important pieces of information is the Adjusted R-Squared
value and the model’s p-value
at the bottom. These tells us the performance of the model in general. The former tells you the percentage of variation explained in the house price when including AVEINCOME
, IMDSCORE
, PTAINDEX
in the regression, while the latter informs us whether if this is significant or not.
HOW TO INTERPRET RESULTS: This is how we fully interpret the coefficients and model performance from the above table:
AVEINCOME
was to increase by 1.0%
, we expect the house prices to increase by 2.04%
and this average increase is statistically significant since the p-value = 0.00000000002
<0.05
.PTAINDEX
was to increase by 1.0%
, we expect the house prices to increase marginally by 0.03%
. The marginally increase is statistically significant since the p-value = 0.000000000471
< 0.05
.IMDSCORE
was to increase by 1.0%
, we expect the house prices to increase marginally by 0.13%
and this average increase is statistically significant since the p-value = 0.00000000002
<0.05
.Adjusted R-Squared
value 0.7889 (78.89%)
of the variation in the house prices across LSOAs were explained by the model after accounting for AVEINCOME
, IMDSCORE
, PTAINDEX
. Since the Adjusted R-Squared
more than 50.0%
it is hence a very good model and significant (i.e., p-value = 0.0000000002
< 0.05
).Now that we have fitted the model, we can extract the residuals and insert them as a new column into the spatialdatafile
. To perform this action, use the modelMLR
object and extract the residuals output from it.
# Extract residuals from "modelLMR" object and dump into "spatialdatafile" and call the column "RESIDUALS"
spatialdatafile$RESIDUALS <- modelMLR$residuals
# Reporting basic summary measures to have an idea of its distribution before plotting them on map
summary(spatialdatafile$RESIDUALS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.392492 -0.064891 -0.005722 0.000000 0.060463 0.629932
Let us generate a map to examine if these residuals show patterns of spatial autocorrelation. We have divergent values for the legends (i.e., negative and positive value) therefore it best to specify in the tm_fill()
function that style = "cont"
and the midpoint = 0
, and a divergent colour scheme e.g. Reds to Blue (-RdBu
).
tm_shape(spatialdatafile) + tm_fill("RESIDUALS", style = "cont", midpoint = 0, palette = "-RdBu") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
Notice the spatial patterning and clusters of the LSOAs and the over-prediction (i.e., areas that have negative residuals, or blue tones) and under-prediction (i.e., areas that positive residuals, or red tones). This visual inspection of the residuals is telling you that spatial autocorrelation may be present here. This, however, would require a more formal test.
Now, let use the Moran’s I test to confirm the presence of spatial autocorrelation. Recall last week’s lectures and computer labs session? Lets create the spatial adjacency matrix and apply the Moran’s I test on the modelMLR
object using the lm.morantest()
function.
#generate unique number for each row
spatialdatafile$ROWNUM <- 1:nrow(spatialdatafile)
# We need to coerce the sf spatialdatafile object into a new sp object
spatialdatafile_2.0 <- as(spatialdatafile, "Spatial")
# Create spatial weights matrix for areas
Weights <- poly2nb(spatialdatafile_2.0, row.names = spatialdatafile_2.0$ROWNUM)
WeightsMatrix <- nb2mat(Weights, style='B')
Residual_WeightMatrix <- mat2listw(WeightsMatrix , style='W')
# Run the test on the regression model output object "modelMLR" using lm.morantest()
lm.morantest(modelMLR, Residual_WeightMatrix, alternative="two.sided")
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = log10(AVEPRICE) ~ log10(AVEINCOME) +
## log10(IMDSCORE) + log10(PTAINDEX), data = spatialdatafile)
## weights: Residual_WeightMatrix
##
## Moran I statistic standard deviate = 56.28, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## 0.47489527088 -0.00060260241 0.00007138138
You will notice we obtained a statistically significant value (i.e., p-value <0.001
) for Moran’s I of value = 0.475
. The value of the Moran’s I test somewhat high. This is an indication that the errors (the residuals) are somewhat related to each other and thus not independent. A spatial regression would be much appropriate for modelling this type of data since there’s evidence of spatial autocorrelation.
3.1. Types of Spatial Lag and Error Models
At this point, we will introduce two kinds of spatial regressions models that can address the issues of spatial autocorrelation. These models are similar in a sense that they all require the inclusion of a spatial weight matrix to account for the spatial configuration of the areas under investigation.
Previously, we fitted a linear regression model that’s non-spatial, which takes the mathematical formula as follows:
\[y = \beta_{0} + x_{1}\beta_{1} + x_{2}\beta_{2} + ... +\epsilon\]
Here, \(y\) is the response variable (i.e., AVEPRICE
). The \(x_{1}\), \(x_{2}\)., etc., are the independent variables (i.e, \(x_{1}\), \(x_{2}\) & \(x_{3}\) are AVEINCOME
, IMDSCORE
and PTAINDEX
respectively). \(\beta_{1}\), \(\beta_{2}\)., etc., are the estimated coefficients for the independent variables. The \(\epsilon\) represents the uncorrelated error term.
To make this a Spatial Lag Model lagged on the dependent variable, we tweak the above equation by including the spatial weight matrix \(W\) which is multiplied by the dependent variable \(y\). This product will have an estimated coefficient termed \(\rho\). The \(\rho\) parameter tells us the degree at which our observed outcome inside the area of interest is influenced by outcomes measured from its neighbours. See model below:
The Spatial Error Model, on the other hand, incorporates spatial weight matrix \(W\) into the error term, where \(\lambda\) is an estimated coefficient for the product between \(W\) and \(u\), which \(u\) is a correlated spatial error term. See model below:
Here, we show you how to implement the two models to see which one better address the issue of spatial autocorrelation. But there are some important points:
impacts
which we will use in the interpretation.3.2. Spatial Lag Model lagged on the dependent variable
A Spatial Lag Model on the dependent variable assumes that dependencies exist directly among the levels of the dependent variable. That is, the observed outcome in one primary location is affected by other outcomes in nearby or neighbouring locations. For instance, if we implement this model on the LSOA house price, we are assuming that the property prices in one LSOA is impacted by the property prices from nearby neighbouring LSOAs.
We can perform this analysis in four steps:
lagsarlm()
function.summary()
function to report the results. Here, we are interested in the \(\rho\) parameter & its p-value. Here, we also want to check if the model is appropriate than a non-spatial model - examine the AIC (for lag versus LM) and the one with the lowest AIC is the better model.moran.mc()
to ensure that the statistic is less than what was obtained for the Moran’s I test for the linear model. Use the tmap
to examine its spatial patterning.impact()
functionSTEP ONE
# Fit model using lagsarlm()
# reuse spatial weight matrix created earlier as an object called "Residual_WeighMatrix"
modelSLY <- lagsarlm(log10(AVEPRICE) ~ log10(IMDSCORE) + log10(AVEINCOME) + log10(PTAINDEX), data = spatialdatafile_2.0, Residual_WeightMatrix)
STEP TWO
# Report results with summary()
# We are interested in the rho-coefficient, log-likelihood ratio test's p-value and the AIC
summary(modelSLY)
##
## Call:lagsarlm(formula = log10(AVEPRICE) ~ log10(IMDSCORE) + log10(AVEINCOME) +
## log10(PTAINDEX), data = spatialdatafile, listw = Residual_WeightMatrix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3608275 -0.0540603 -0.0039772 0.0518209 0.6492007
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6649683 0.0924917 -28.8130 < 2.2e-16
## log10(IMDSCORE) 0.0435882 0.0068588 6.3551 2.083e-10
## log10(AVEINCOME) 1.2144821 0.0286833 42.3412 < 2.2e-16
## log10(PTAINDEX) 0.0106795 0.0041275 2.5874 0.009671
##
## Rho: 0.4522, LR test value: 1354.5, p-value: < 2.22e-16
## Asymptotic standard error: 0.012282
## z-value: 36.819, p-value: < 2.22e-16
## Wald statistic: 1355.6, p-value: < 2.22e-16
##
## Log likelihood: 4937.637 for lag model
## ML residual variance (sigma squared): 0.0077091, (sigma: 0.087801)
## Number of observations: 4968
## Number of parameters estimated: 6
## AIC: -9863.3, (AIC for lm: -8510.8)
## LM test for residual autocorrelation
## test value: 443.54, p-value: < 2.22e-16
INTERPRETATION: The \(\rho\) statistic informs us of how the neighbouring LSOA house price values affect the house price at \(y\). The \(\rho\) value is a positive value of 0.4522
which means the neighbouring LSOAs affect is a positive manner, and it is statistically significant (i.e., p-value < 0.05
). We can see the AIC for the lag model is lower than the original linear regression model (i.e., Lag: -9863.3
vs LM: -8510.8
) therefore the lag model is okay.
IMPORTANT NOTE: In a lag model, do not even try to interpret the coefficients for the independent variables - ignore them and their p-values… they are nonsense! Why? This is because there is a global feedback effect happening here - i.e., whenever we change something in our own region (i.e., LSOA) for instance, like the AVEINCOME
in a LSOA, that will not only affect our own house price, but when it causes the house price to go up in its own area, this in turn will cause the house prices to increase in its neighbour’s area; and when the neighbour’s house price go up - it is again going to affect our house price… so its an infinite loop. Instead, we interpret the result churn out from the impact
function which reports their direct
and indirect
effects.
STEP THREE
# extract the residuals for modelSLY object and dump back to original sf spatialdatafile object
spatialdatafile$RESID_SLY <- modelSLY$residuals
# use Moran's I test using moran.mc() function
moran.mc(spatialdatafile$RESID_SLY, Residual_WeightMatrix, 1000, zero.policy = T)
##
## Monte-Carlo simulation of Moran I
##
## data: spatialdatafile$RESID_SLY
## weights: Residual_WeightMatrix
## number of simulations + 1: 1001
##
## statistic = 0.13417, observed rank = 1001, p-value = 0.000999
## alternative hypothesis: greater
INTERPRETATION: The Moran’s I from the original model was 0.4748
. Here, it is 0.1341
which is much lower thus the lag model has accounted for a lot of spatial autocorrelation although it still significantly remains. We can conclude that spatial lag model does address some of the issues of spatial autocorrelation in the model’s residuals but not all since it significantly positive. This is evidenced in the map output of the residual lags.
IMPORTANT NOTE: We have to make a mental note of this and compare it with the performance of the Spatial Error Model.
# generate the map
tm_shape(spatialdatafile) + tm_fill("RESID_SLY", style = "cont", midpoint = 0, palette = "-RdBu") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
STEP FOUR
# Interpretation of results using impacts
# impacts
Weights_2.0 <- as(Residual_WeightMatrix, "CsparseMatrix")
trMC <- trW(Weights_2.0, type="MC")
summary(impacts(modelSLY, tr = trMC, R=100), zstats=TRUE)
## Impact measures (lag, trace):
## Direct Indirect Total
## log10(IMDSCORE) 0.04549157 0.03407862 0.07957020
## log10(AVEINCOME) 1.26751610 0.94952099 2.21703709
## log10(PTAINDEX) 0.01114590 0.00834961 0.01949551
## ========================================================
## Simulation results ( variance matrix):
## Direct:
##
## Iterations = 1:100
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 100
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## log10(IMDSCORE) 0.04562 0.007422 0.0007422 0.0007422
## log10(AVEINCOME) 1.26928 0.026387 0.0026387 0.0026387
## log10(PTAINDEX) 0.01088 0.004328 0.0004328 0.0003738
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## log10(IMDSCORE) 0.032402 0.040469 0.04610 0.05076 0.05849
## log10(AVEINCOME) 1.229476 1.251055 1.26728 1.28584 1.32415
## log10(PTAINDEX) 0.003192 0.007979 0.01125 0.01371 0.01960
##
## ========================================================
## Indirect:
##
## Iterations = 1:100
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 100
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## log10(IMDSCORE) 0.034022 0.005253 0.0005253 0.0005253
## log10(AVEINCOME) 0.948021 0.028339 0.0028339 0.0028339
## log10(PTAINDEX) 0.008139 0.003280 0.0003280 0.0001944
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## log10(IMDSCORE) 0.024297 0.031156 0.034191 0.03725 0.04410
## log10(AVEINCOME) 0.894938 0.927677 0.954014 0.96687 1.00306
## log10(PTAINDEX) 0.002246 0.005972 0.008407 0.01032 0.01484
##
## ========================================================
## Total:
##
## Iterations = 1:100
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 100
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## log10(IMDSCORE) 0.07964 0.012563 0.0012563 0.0012563
## log10(AVEINCOME) 2.21730 0.030430 0.0030430 0.0030430
## log10(PTAINDEX) 0.01902 0.007596 0.0007596 0.0006497
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## log10(IMDSCORE) 0.056781 0.07171 0.08026 0.08809 0.10206
## log10(AVEINCOME) 2.160813 2.19553 2.21715 2.23481 2.28226
## log10(PTAINDEX) 0.005438 0.01400 0.01958 0.02396 0.03445
##
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## log10(IMDSCORE) 0.007422257 0.005253003 0.012562739
## log10(AVEINCOME) 0.026387444 0.028338708 0.030430407
## log10(PTAINDEX) 0.004327630 0.003279803 0.007595969
##
## Simulated z-values:
## Direct Indirect Total
## log10(IMDSCORE) 6.146614 6.476621 6.339657
## log10(AVEINCOME) 48.101697 33.453228 72.864687
## log10(PTAINDEX) 2.513220 2.481530 2.503330
##
## Simulated p-values:
## Direct Indirect Total
## log10(IMDSCORE) 7.9154e-10 9.3799e-11 2.3028e-10
## log10(AVEINCOME) < 2.22e-16 < 2.22e-16 < 2.22e-16
## log10(PTAINDEX) 0.011963 0.013082 0.012303
INTERPRETATION: Here is where we derive meaningful interpretation of the coefficients. A big table is churned out - all we care about is the first table titled: Impact Measures (lag, trace):
and the last table titled: Simulated p-values
. For instance, let’s interpret the log10(AVEINCOME)
(on the log-scale), for the direct effects in its own LSOA, if the levels of income were to increase by 1%, this will cause an increase in the property prices by 1.267%
(p < 0.05
) in its own LSOA. But for the indirect affects, if the log10(AVEINCOME)
were to change across neighbouring LSOAs, this will affect the value of our house prices by 0.949%
(p < 0.05
). The total column is the combined effect.
3.3. Spatial Error Models In a Spatial Error Model, we assume that the error terms are correlated across observations (i.e., the error of an observed value affects the errors of its neighbors).
We essentially repeat the first 3 steps highlighted in section 3.2. for this analysis:
errorsarlm()
function.summary()
function to report the results for the \(\lambda\) parameter and it’s p-value to check if the model is appropriate than a non-spatial one. You can also check with the AIC.moran.mc()
to ensure that the statistic is less than what was obtained for the Moran’s I test for the linear model. Use the tmap
to examine its spatial patterning.STEP ONE
modelSER <- errorsarlm(log10(AVEPRICE) ~ log10(IMDSCORE) + log10(AVEINCOME) + log10(PTAINDEX), data = spatialdatafile_2.0, Residual_WeightMatrix)
STEP TWO
# Report results with summary()
# We are interested in the rho-coefficient, log-likelihood ratio test's p-value and the AIC
summary(modelSER)
##
## Call:
## errorsarlm(formula = log10(AVEPRICE) ~ log10(IMDSCORE) + log10(AVEINCOME) +
## log10(PTAINDEX), data = spatialdatafile_2.0, listw = Residual_WeightMatrix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3756388 -0.0458748 -0.0037674 0.0427292 0.6020240
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.99537424 0.14501324 -20.6559 < 2.2e-16
## log10(IMDSCORE) 0.05526337 0.00896825 6.1621 7.178e-10
## log10(AVEINCOME) 1.82950617 0.02925677 62.5327 < 2.2e-16
## log10(PTAINDEX) -0.00028732 0.00487203 -0.0590 0.953
##
## Lambda: 0.72207, LR test value: 2196, p-value: < 2.22e-16
## Asymptotic standard error: 0.012525
## z-value: 57.65, p-value: < 2.22e-16
## Wald statistic: 3323.6, p-value: < 2.22e-16
##
## Log likelihood: 5358.42 for error model
## ML residual variance (sigma squared): 0.0060109, (sigma: 0.07753)
## Number of observations: 4968
## Number of parameters estimated: 6
## AIC: -10705, (AIC for lm: -8510.8)
INTERPRETATION: The \(\lambda\) statistic informs us that if there’s a sudden change in the error term for house prices in neighbouring LSOAs how did it impact the error term for the house price in our LSOA at \(y\). The \(\lambda\) value is a positive value of 0.7221
which means the affect of neighbouring LSOAs are positive and the impact is statistically significant (i.e., p-value < 0.05
). We can see the AIC for the error model is lower than both the original linear regression & lag model (i.e., Error: -10705.3
vs LM: -8510.8
& Lag:-9863.3
) therefore the error model better than the two.
IMPORTANT NOTE: Unlike the lag model, we can interpret the coefficients from the error model for the independent variables! We can interpret them the same way we did for the linear regression model (see section 2.3.
).
STEP THREE
# extract the residuals for modelSLY object and dump back to original sf spatialdatafile object
spatialdatafile$RESID_SER <- modelSER$residuals
# use Moran's I test using moran.mc() function
moran.mc(spatialdatafile$RESID_SER, Residual_WeightMatrix, 1000, zero.policy = T)
##
## Monte-Carlo simulation of Moran I
##
## data: spatialdatafile$RESID_SER
## weights: Residual_WeightMatrix
## number of simulations + 1: 1001
##
## statistic = -0.057944, observed rank = 1, p-value = 0.999
## alternative hypothesis: greater
INTERPRETATION: The Moran’s I from the original model was 0.4748
. Here, it is -0.0579
which is negative and the lowest for the error model. On top of that there is no evidence of spatial autocorrelation since its p-value is not significant. Therefore, we can conclude that the spatial error model does address the issue of spatial autocorrelation in the residuals. Its a better model than the Linear regression and Lag model for exploring the relationship with those independent variables and accounting for spatial autocorrelation. This is also evidenced in the map output of the residual errors.
IMPORTANT NOTE:
# generate the map
tm_shape(spatialdatafile) + tm_fill("RESID_SER", style = "cont", midpoint = 0, palette = "-RdBu") +
tm_shape(BOROUGHshp) + tm_polygons(alpha = 0, border.alpha = 1, border.col = "black") +
tm_text("BOROUGHN", size = "AREA") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("left", "bottom")) +
tm_layout(frame = FALSE, legend.title.size = 0.5, legend.text.size = 0.5)
4.1. Reading materials