8A: Geographically Weighted Predictive Models

In this exercise, we will learn how to build predictive models using the geographical random forest method to predict outcomes based on geospatial factors and historical geospatial locations.

Author

Teng Kok Wai (Walter)

Published

September 28, 2024

Modified

September 29, 2024

1 Exercise 8A Reference

R for Geospatial Data Science and Analytics - 14  Geographically Weighted Predictive Models

2 Overview

In this exercise, we will learn how to build predictive models using the geographical random forest method to predict outcomes based on geospatial factors and historical geospatial locations.

Predictive modeling uses statistical and machine learning techniques to forecast future outcomes. To build these models, we start with a dataset where the outcomes are already known, along with various input variables (predictors) that might influence those outcomes. The model learns from these known examples to make accurate predictions about future events.

Geospatial predictive modelling is conceptually rooted in the principle that the occurrences of events being modeled are limited in distribution.

When geographically referenced data are used, occurrences of events are neither uniform nor random in distribution over space. There are geospatial factors (infrastructure, sociocultural, topographic, etc.) that constrain and influence where the locations of events occur.

Geospatial predictive modeling identifies and analyzes these factors by examining the relationship between past event locations and the environmental conditions that might have influenced those locations. This helps to better understand and predict where similar events are likely to happen in the future.

Tip

Several sections in this exercise will take a while to compute… , we will save several intermediate rds file as checkpoints for future uses.

3 Learning Outcome

  • Prepare training and test datasets using appropriate data sampling methods.
  • Calibrate predictive models using both geospatial statistical learning and machine learning methods.
  • Compare and select the best model for predicting future outcomes.
  • Predict future outcomes using the best-calibrated model.

4 The Data

The following datasets will be used in this study:

Data Type Description Format
Aspatial dataset HDB Resale data: a list of HDB resale transacted prices in Singapore from Jan 2017 onwards. CSV
Geospatial dataset MP14_SUBZONE_WEB_PL: A polygon feature dataset providing information on URA 2014 Master Plan Planning Subzone boundary data. ESRI Shapefile
Locational factors with geographic coordinates Various data sets including Eldercare, Hawker Centre, Parks, Supermarket, CHAS clinics, Childcare services, Kindergartens, MRT, and Bus stops. GeoJSON/Shapefile
Locational factors without geographic coordinates Primary school data, CBD coordinates, Shopping malls, Good primary schools (ranking by popularity). CSV/Other Sources

5 Installing and Launching the R Packages

The following R packages will be used in this exercise:

Package Purpose Use Case in Exercise
sf Handles vector-based geospatial data. Importing and manipulating polygon and point feature data.
spdep Provides functions for spatial dependence analysis, including spatial weights and spatial autocorrelation. Performing spatially constrained cluster analysis using geographically weighted regression (GWR).
GWmodel Provides geographically weighted modeling methods. Calibrating models to predict HDB resale prices using geographically weighted regression.
SpatialML Supports geographical random forest models and spatial machine learning methods. Calibrating models using geographically weighted random forest (GW RF).
tmap Creates static and interactive thematic maps. Visualizing geospatial data, model predictions, and other geographic patterns.
rsample Provides tools for data resampling. Splitting datasets into training and testing subsets.
Metrics Provides evaluation metrics for statistical and machine learning models. Calculating RMSE (Root Mean Square Error) to evaluate model accuracy.
tidyverse A collection of packages for data science tasks such as data manipulation, visualization, and modeling. Data wrangling, visualization, and performing statistical operations on datasets.

To install and load these packages, use the following code:

pacman::p_load(sf, spdep, GWmodel, SpatialML, tmap, rsample, Metrics, tidyverse)

6 Import Data and Preparation

6.1 Reading Data File

We begin by loading the input data, which is stored as a simple feature data frame.

mdata <- read_rds("data/aspatial/mdata.rds")

6.2 Data Sampling

The data is split into 65% training and 35% test sets using the initial_split() function from the rsample package (part of tidymodels).

set.seed(1234)

resale_split <- initial_split(mdata, 
                              prop = 6.5/10,)

train_data <- training(resale_split)
test_data <- testing(resale_split)

Save the split datasets:

write_rds(train_data, "data/model/train_data.rds")
write_rds(test_data, "data/model/test_data.rds")

6.3 Computing Correlation Matrix

It is important to check for multicollinearity using a correlation matrix before building the predictive model.

mdata_nogeo <- mdata %>% st_drop_geometry()
corrplot::corrplot(cor(mdata_nogeo[, 2:17]), 
                   diag = FALSE, 
                   order = "AOE", 
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

Note

Observations: All correlation values are below 0.8, indicating no sign of multicollinearity.

6.4 Retrieving Stored Data

Finally, load the previously saved training and test data:

train_data <- read_rds("data/model/train_data.rds")
test_data <- read_rds("data/model/test_data.rds")

7 Building a Non-Spatial Multiple Linear Regression

Next, we will build a multiple linear regression (MLR) model to predict resale_price using various predictors:

price_mlr <- lm(resale_price ~ 
                  floor_area_sqm +
                  storey_order + 
                  remaining_lease_mths +
                  PROX_CBD + 
                  PROX_ELDERLYCARE + 
                  PROX_HAWKER +
                  PROX_MRT + 
                  PROX_PARK + 
                  #PROX_GOOD_PRISCH + 
                  PROX_MALL +
                  #PROX_CHAS + 
                  PROX_SUPERMARKET + 
                  WITHIN_350M_KINDERGARTEN +
                  WITHIN_350M_CHILDCARE + 
                  WITHIN_350M_BUS +
                  WITHIN_1KM_PRISCH,
                data=train_data)

summary(price_mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + storey_order + remaining_lease_mths +
    PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER + PROX_MRT + PROX_PARK +
    PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
    WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH,
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max
-205193  -39120   -1930   36545  472355

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)              107601.073  10601.261  10.150  < 2e-16 ***
floor_area_sqm             2780.698     90.579  30.699  < 2e-16 ***
storey_order              14299.298    339.115  42.167  < 2e-16 ***
remaining_lease_mths        344.490      4.592  75.027  < 2e-16 ***
PROX_CBD                 -16930.196    201.254 -84.124  < 2e-16 ***
PROX_ELDERLYCARE         -14441.025    994.867 -14.516  < 2e-16 ***
PROX_HAWKER              -19265.648   1273.597 -15.127  < 2e-16 ***
PROX_MRT                 -32564.272   1744.232 -18.670  < 2e-16 ***
PROX_PARK                 -5712.625   1483.885  -3.850 0.000119 ***
PROX_MALL                -14717.388   2007.818  -7.330 2.47e-13 ***
PROX_SUPERMARKET         -26881.938   4189.624  -6.416 1.46e-10 ***
WITHIN_350M_KINDERGARTEN   8520.472    632.812  13.464  < 2e-16 ***
WITHIN_350M_CHILDCARE     -4510.650    354.015 -12.741  < 2e-16 ***
WITHIN_350M_BUS             813.493    222.574   3.655 0.000259 ***
WITHIN_1KM_PRISCH         -8010.834    491.512 -16.298  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 61650 on 10320 degrees of freedom
Multiple R-squared:  0.7373,    Adjusted R-squared:  0.737
F-statistic:  2069 on 14 and 10320 DF,  p-value: < 2.2e-16

Save the model for future use:

write_rds(price_mlr, "data/model/price_mlr.rds") 

8 GWR Predictive Method

In this section, we use Geographically Weighted Regression (GWR) to predict HDB resale prices.

8.1 Converting sf Data Frame to SpatialPointsDataFrame

Convert the training data into a SpatialPointsDataFrame for use in GWR:

train_data_sp <- as_Spatial(train_data)
train_data_sp
class       : SpatialPointsDataFrame
features    : 10335
extent      : 11597.31, 42623.63, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables   : 17
names       : resale_price, floor_area_sqm, storey_order, remaining_lease_mths,          PROX_CBD,     PROX_ELDERLYCARE,        PROX_HAWKER,           PROX_MRT,          PROX_PARK,   PROX_GOOD_PRISCH,        PROX_MALL,            PROX_CHAS,     PROX_SUPERMARKET, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, ...
min values  :       218000,             74,            1,                  555, 0.999393538715878, 1.98943787433087e-08, 0.0333358643817954, 0.0220407324774434, 0.0441643212802781, 0.0652540365486641,                0, 6.20621206270077e-09, 1.21715176356525e-07,                        0,                     0, ...
max values  :      1186888,            133,           17,                 1164,  19.6500691667807,     3.30163731686804,   2.86763031236184,   2.13060636038504,   2.41313695915468,   10.6223726149914, 2.27100643784442,    0.808332738794272,     1.57131703651196,                        7,                    20, ... 

8.2 Computing Adaptive Bandwidth

Next, we use bw.gwr() to compute the optimal adaptive bandwidth using cross-validation:

bw_adaptive <- bw.gwr(resale_price ~ 
                        floor_area_sqm +
                        storey_order + 
                        remaining_lease_mths +
                        PROX_CBD + 
                        PROX_ELDERLYCARE + 
                        PROX_HAWKER +
                        PROX_MRT + 
                        PROX_PARK + 
                        #PROX_GOOD_PRISCH +
                        PROX_MALL +
                        #PROX_CHAS +
                        PROX_SUPERMARKET + 
                        WITHIN_350M_KINDERGARTEN +
                        WITHIN_350M_CHILDCARE + 
                        WITHIN_350M_BUS +
                        WITHIN_1KM_PRISCH,
                      data=train_data_sp,
                      approach="CV",
                      kernel="gaussian",
                      adaptive=TRUE,
                      longlat=FALSE)
Note

Observations:

Based on the output from bw.gwr() function, the optimal bandwidth is determined to be 40 neighbor points. This means the model will consider the 40 nearest neighbors when estimating parameters for a specific location.

Save the result:

write_rds(bw_adaptive, "data/model/bw_adaptive.rds")

8.3 Constructing the Adaptive Bandwidth GWR Model

Load the saved bandwidth and calibrate the GWR model:

bw_adaptive <- read_rds("data/model/bw_adaptive.rds")
bw_adaptive
[1] 40

Now, we can go ahead to calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code below.

gwr_adaptive <- gwr.basic(formula = resale_price ~
                            floor_area_sqm + 
                            storey_order +
                            remaining_lease_mths + 
                            PROX_CBD + 
                            PROX_ELDERLYCARE + 
                            PROX_HAWKER +
                            PROX_MRT + 
                            PROX_PARK +
                            #PROX_GOOD_PRISCH +
                            PROX_MALL + 
                            #PROX_CHAS +
                            PROX_SUPERMARKET + 
                            WITHIN_350M_KINDERGARTEN +
                            WITHIN_350M_CHILDCARE + 
                            WITHIN_350M_BUS +
                            WITHIN_1KM_PRISCH,
                          data = train_data_sp,
                          bw = bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive = TRUE,
                          longlat = FALSE)

Save the GWR model:

write_rds(gwr_adaptive, "data/model/gwr_adaptive.rds")

8.4 Retrieving the GWR Model

To retrieve and display the saved GWR model:

gwr_adaptive <- read_rds("data/model/gwr_adaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-09-29 03:07:48.292409
   Call:
   gwr.basic(formula = resale_price ~ floor_area_sqm + storey_order +
    remaining_lease_mths + PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER +
    PROX_MRT + PROX_PARK + PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
    WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH,
    data = train_data_sp, bw = bw_adaptive, kernel = "gaussian",
    adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  floor_area_sqm storey_order remaining_lease_mths PROX_CBD PROX_ELDERLYCARE PROX_HAWKER PROX_MRT PROX_PARK PROX_MALL PROX_SUPERMARKET WITHIN_350M_KINDERGARTEN WITHIN_350M_CHILDCARE WITHIN_350M_BUS WITHIN_1KM_PRISCH
   Number of data points: 10335
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max
-205193  -39120   -1930   36545  472355

   Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
   (Intercept)              107601.073  10601.261  10.150  < 2e-16 ***
   floor_area_sqm             2780.698     90.579  30.699  < 2e-16 ***
   storey_order              14299.298    339.115  42.167  < 2e-16 ***
   remaining_lease_mths        344.490      4.592  75.027  < 2e-16 ***
   PROX_CBD                 -16930.196    201.254 -84.124  < 2e-16 ***
   PROX_ELDERLYCARE         -14441.025    994.867 -14.516  < 2e-16 ***
   PROX_HAWKER              -19265.648   1273.597 -15.127  < 2e-16 ***
   PROX_MRT                 -32564.272   1744.232 -18.670  < 2e-16 ***
   PROX_PARK                 -5712.625   1483.885  -3.850 0.000119 ***
   PROX_MALL                -14717.388   2007.818  -7.330 2.47e-13 ***
   PROX_SUPERMARKET         -26881.938   4189.624  -6.416 1.46e-10 ***
   WITHIN_350M_KINDERGARTEN   8520.472    632.812  13.464  < 2e-16 ***
   WITHIN_350M_CHILDCARE     -4510.650    354.015 -12.741  < 2e-16 ***
   WITHIN_350M_BUS             813.493    222.574   3.655 0.000259 ***
   WITHIN_1KM_PRISCH         -8010.834    491.512 -16.298  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   Residual standard error: 61650 on 10320 degrees of freedom
   Multiple R-squared: 0.7373
   Adjusted R-squared: 0.737
   F-statistic:  2069 on 14 and 10320 DF,  p-value: < 2.2e-16
   ***Extra Diagnostic information
   Residual sum of squares: 3.922202e+13
   Sigma(hat): 61610.08
   AIC:  257320.2
   AICc:  257320.3
   BIC:  247249
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian
   Adaptive bandwidth: 40 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                   Min.     1st Qu.      Median     3rd Qu.
   Intercept                -3.2594e+08 -4.7727e+05 -8.3004e+03  5.5025e+05
   floor_area_sqm           -2.8714e+04  1.4475e+03  2.3011e+03  3.3900e+03
   storey_order              3.3186e+03  8.5899e+03  1.0826e+04  1.3397e+04
   remaining_lease_mths     -1.4431e+03  2.6063e+02  3.9048e+02  5.2865e+02
   PROX_CBD                 -1.0837e+07 -5.7697e+04 -1.3787e+04  2.6552e+04
   PROX_ELDERLYCARE         -3.2291e+07 -4.0643e+04  1.0562e+04  6.1054e+04
   PROX_HAWKER              -2.3985e+08 -5.1365e+04  3.0026e+03  6.4287e+04
   PROX_MRT                 -1.1660e+07 -1.0488e+05 -4.9373e+04  5.1037e+03
   PROX_PARK                -6.5961e+06 -4.8671e+04 -8.8128e+02  5.3498e+04
   PROX_MALL                -1.8112e+07 -7.4238e+04 -1.3982e+04  4.9779e+04
   PROX_SUPERMARKET         -4.5761e+06 -6.3461e+04 -1.7429e+04  3.5616e+04
   WITHIN_350M_KINDERGARTEN -4.1881e+05 -6.0040e+03  9.0209e+01  4.7127e+03
   WITHIN_350M_CHILDCARE    -1.0273e+05 -2.2375e+03  2.6668e+02  2.6388e+03
   WITHIN_350M_BUS          -1.1757e+05 -1.4719e+03  1.1626e+02  1.7584e+03
   WITHIN_1KM_PRISCH        -6.6465e+05 -5.5959e+03  2.6916e+02  5.7500e+03
                                  Max.
   Intercept                1.6493e+08
   floor_area_sqm           5.0907e+04
   storey_order             2.9537e+04
   remaining_lease_mths     1.8119e+03
   PROX_CBD                 2.2489e+07
   PROX_ELDERLYCARE         8.2444e+07
   PROX_HAWKER              5.9654e+06
   PROX_MRT                 2.0189e+08
   PROX_PARK                1.5224e+07
   PROX_MALL                1.0443e+07
   PROX_SUPERMARKET         3.8330e+06
   WITHIN_350M_KINDERGARTEN 6.6799e+05
   WITHIN_350M_CHILDCARE    1.0802e+05
   WITHIN_350M_BUS          3.7313e+04
   WITHIN_1KM_PRISCH        5.0262e+05
   ************************Diagnostic information*************************
   Number of data points: 10335
   Effective number of parameters (2trace(S) - trace(S'S)): 1730.101
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 8604.899
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 238871.8
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 237036.9
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 238209
   Residual sum of squares: 4.829177e+12
   R-square value:  0.9676571
   Adjusted R-square value:  0.9611535

   ***********************************************************************
   Program stops at: 2024-09-29 03:09:04.929015 

8.5 Converting Test Data to SpatialPointsDataFrame

Convert the test data:

test_data_sp <- test_data %>%
  as_Spatial()
test_data_sp
class       : SpatialPointsDataFrame
features    : 5566
extent      : 11597.31, 42623.63, 28287.8, 48669.59  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables   : 17
names       : resale_price, floor_area_sqm, storey_order, remaining_lease_mths,         PROX_CBD,     PROX_ELDERLYCARE,        PROX_HAWKER,           PROX_MRT,          PROX_PARK,   PROX_GOOD_PRISCH,        PROX_MALL,            PROX_CHAS,     PROX_SUPERMARKET, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, ...
min values  :       230888,             74,            1,                  546, 1.00583660772922, 3.34897933104965e-07, 0.0474019664161957, 0.0414043955932523, 0.0502664084494264, 0.0907500295577619,                0, 4.55547870890763e-09, 1.21715176356525e-07,                        0,                     0, ...
max values  :      1050000,            138,           14,                 1151,  19.632402730488,     3.30163731686804,   2.83106651960209,   2.13060636038504,   2.41313695915468,   10.6169590126272, 2.26056404492346,     0.79249074802552,     1.53786629004208,                        7,                    16, ... 

8.6 Computing Adaptive Bandwidth for the Test Data

Similarly, we use the bw.gwr() function from the GWmodel package to determine the optimal bandwidth for our GWR model on the test data.

gwr_bw_test_adaptive <- bw.gwr(resale_price ~ 
                                 floor_area_sqm +
                                 storey_order + 
                                 remaining_lease_mths +
                                 PROX_CBD + 
                                 PROX_ELDERLYCARE + 
                                 PROX_HAWKER +
                                 PROX_MRT + 
                                 PROX_PARK + 
                                 #PROX_GOOD_PRISCH +
                                 PROX_MALL + 
                                 #PROX_CHAS +
                                 PROX_SUPERMARKET + 
                                 WITHIN_350M_KINDERGARTEN +
                                 WITHIN_350M_CHILDCARE + 
                                 WITHIN_350M_BUS +
                                 WITHIN_1KM_PRISCH,
                               data = test_data_sp,
                               approach = "CV",
                               kernel = "gaussian",
                               adaptive = TRUE,
                               longlat = FALSE)
write_rds(gwr_bw_test_adaptive, "data/model/gwr_bw_test_adaptive.rds")
gwr_bw_test_adaptive <- read_rds("data/model/gwr_bw_test_adaptive.rds")

8.7 Computing Predicted Values of the Test Data

Finally, we use the gwr.predict() function from the GWmodel package to compute the predicted values of the test data based on our GWR model. We specify our formula, training data, test data, bandwidth, kernel type, and set adaptive=TRUE and longlat=FALSE.

gwr_pred <- gwr.predict(formula = resale_price ~
                          floor_area_sqm + storey_order +
                          remaining_lease_mths + PROX_CBD + 
                          PROX_ELDERLYCARE + PROX_HAWKER + 
                          PROX_MRT + PROX_PARK + PROX_MALL + 
                          PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                          WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + 
                          WITHIN_1KM_PRISCH, 
                        data=train_data_sp, 
                        predictdata = test_data_sp, 
                        bw=40, 
                        kernel = 'gaussian', 
                        adaptive=TRUE, 
                        longlat = FALSE)

9 Preparing Coordinates Data

9.1 Extracting Coordinates

The code below extracts the x, y coordinates for the full, training, and test datasets:

coords <- st_coordinates(mdata)
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)

Save the extracted coordinates for future use:

coords_train <- write_rds(coords_train, "data/model/coords_train.rds")
coords_test <- write_rds(coords_test, "data/model/coords_test.rds")
coords_train <- read_rds("data/model/coords_train.rds")
coords_test <- read_rds("data/model/coords_test.rds")

9.2 Dropping Geometry Field

We remove the geometry column from the training data using st_drop_geometry():

train_data <- train_data %>% 
  st_drop_geometry()

10 Calibrating Random Forest Model

We will now calibrate a random forest model using the ranger package to predict HDB resale prices:

set.seed(1234)

rf <- ranger(resale_price ~ floor_area_sqm + storey_order + 
               remaining_lease_mths + PROX_CBD + PROX_ELDERLYCARE + 
               PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_MALL + 
               PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
               WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + 
               WITHIN_1KM_PRISCH,
             data=train_data)

Save the model:

write_rds(rf, "data/model/rf.rds")
rf <- read_rds("data/model/rf.rds")
rf
Ranger result

Call:
 ranger(resale_price ~ floor_area_sqm + storey_order + remaining_lease_mths +      PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER + PROX_MRT + PROX_PARK +      PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +      WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH,      data = train_data)

Type:                             Regression
Number of trees:                  500
Sample size:                      10335
Number of independent variables:  14
Mtry:                             3
Target node size:                 5
Variable importance mode:         none
Splitrule:                        variance
OOB prediction error (MSE):       731404460
R squared (OOB):                  0.9493789 

11 Calibrating Geographical Random Forest Model

We now calibrate a geographic random forest model using grf() from the SpatialML package.

11.1 Calibrating with Training Data

Calibrate the model with an adaptive bandwidth:

set.seed(1234)
gwRF_adaptive <- grf(formula = resale_price ~ floor_area_sqm + storey_order +
                       remaining_lease_mths + PROX_CBD + PROX_ELDERLYCARE +
                       PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_MALL +
                       PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                       WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                       WITHIN_1KM_PRISCH,
                     dframe=train_data, 
                     bw=55,
                     kernel="adaptive",
                     coords=coords_train)

Save the model:

write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/model/gwRF_adaptive.rds")

11.2 Predicting with Test Data

11.2.1 Preparing the Test Data

Combine the test data with its coordinates:

test_data <- cbind(test_data, coords_test) %>% 
  st_drop_geometry()

11.2.2 Predicting Resale Prices

Use the trained geographical random forest model to predict prices:

gwRF_pred <- predict.grf(gwRF_adaptive, 
                           test_data, 
                           x.var.name = "X",
                           y.var.name = "Y", 
                           local.w = 1,
                           global.w = 0)

Save the predicted values:

GRF_pred <- write_rds(gwRF_pred, "data/model/GRF_pred.rds")

11.2.3 Converting Predicted Output into a Data Frame

The output of the predict.grf() function is a vector of predicted values. For further visualization and analysis, it’s useful to convert it into a data frame. To convert the prediction output to a data frame for analysis:

GRF_pred <- read_rds("data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)
# append pred values into the test data
test_data_p <- cbind(test_data, GRF_pred_df)
write_rds(test_data_p, "data/model/test_data_p.rds")

11.3 Calculating Root Mean Square Error (RMSE)

Compute RMSE to evaluate the model’s predictive accuracy:

test_data_p <- read_rds("data/model/test_data_p.rds")

rmse(test_data_p$resale_price, 
     test_data_p$GRF_pred)
[1] 27302.9

The lower the RMSE value, the better the predictive model is.

11.4 Visualizing the Predicted Values

Create a scatterplot to compare actual vs. predicted resale prices:

ggplot(data = test_data_p, aes(x = GRF_pred, y = resale_price)) +
  geom_point()+
    geom_abline(slope = 1, intercept = 0, color = "red", linetype = "solid")

Note

A good predictive model should have the scatter point close to the diagonal line. The scatter plot can be also used to detect if any outliers in the model.