::p_load(sf, tidyverse, tmap, spdep, httr,
pacman
onemapsgapi, units, matrixStats, readxl, jsonlite,
olsrr, corrplot, ggpubr, GWmodel, SpatialML, tidymodels, devtools, kableExtra, plotly, ggthemes, onemapsgapi, rgdal)
IS415 - Project Data Preprocessing
We will first need to Pre processed all the information that we need.
Loading the Packages
The R packages we’ll use for this analysis are:
sf: used for importing, managing, and processing spatial data
tidyverse: a collection of packages for data science tasks
readr for importing delimited files (.csv)
tidyr for manipulating and tidying data
dplyr for wrangling and transforming data
ggplot2 for visualising data
tmap: used for creating thematic maps, such as choropleth and bubble maps
spdep: used to create spatial weights matrix objects, global and local spatial autocorrelation statistics and related calculations (e.g. spatially lag attributes)
onemapsgapi: used to query Singapore-specific spatial data, alongside additional functionalities.
[httr]((https://cran.r-project.org/web/packages/httr/): used to make API calls, such as a GET request
units: used to for manipulating numeric vectors that have physical measurement units associated with them
matrixStats: a set of high-performing functions for operating on rows and columns of matrices
jsonlite: a JSON parser that can convert from JSON to the appropraite R data types
rgdal: provides bindings to the ‘Geospatial’ Data Abstraction Library
In addition, these R packages are specific to building + visualising hedonic pricing models:
olsrr: used for building least squares regression models
coorplot + ggpubr: both are used for multivariate data visualisation & analysis
GWmodel: provides a collection of localised spatial statistical methods, such as summary statistics, principal components analysis, discriminant analysis and various forms of GW regression
SpatialML: Implements a spatial extension of the random forest algorithm
Lastly, here are the extra R packages that aren’t necessary for the analysis itself, but help us go the extra mile with visualisations and presentation of our analysis:
devtools: used for installing any R packages which is not available in RCRAN. In this context, it’ll be used to download the xaringanExtra package for panelsets
kableExtra: an extension of kable, used for table customisation
plotly: used for creating interactive web graphics, and can be used in conjunction with ggplot2 with the
ggplotly()
functionggthemes: an extension of ggplot2, with more advanced themes for plotting
The Data
List of Data
Note that this list is not exhaustive and if you find other point data you can add them as well.
Type | Name | Format | Source |
---|---|---|---|
Aspatial | Resale Flat Price | .csv | data.gov.sg |
Geospatial | Singapore National Boundary | .shp | data.gov.sg |
Geospatial | Master Plan 2019 Subzone Boundary (Web) | .shp | Prof Kam |
Geospatial | MRT Exit Point | .shp | LTA Data Mall |
Geospatial | Bus Stop Locations Aug 2023 | .shp | LTA Data Mall |
Geospatial - Extracted | Childcare Services | .shp | OneMap API |
Geospatial - Extracted | Eldercare Services | .shp | OneMap API |
Geospatial - Extracted | Hawker Centres | .shp | OneMap API |
Geospatial - Extracted | Kindergartens | .shp | OneMap API |
Geospatial - Extracted | Parks | .shp | OneMap API |
Geospatial - Extracted | Supermarkets | .kml | OneMap API |
Geospatial - Extracted | Primary Schools | MOE Website | |
Geospatial - Selfsourced | List of Shopping Mall | .html | Wikipedia |
Geospatial - Selfsourced | Pharmacy | .kml | OneMap API |
Geospatial - Selfsourced | Integrated Screening Programme (ISP) Clinics | .kml | OneMap API |
Geospatial - Selfsourced | Libraries | .kml | OneMap API |
Geospatial - Selfsourced | Tourist | .kml | OneMap API |
Preparing the Aspatial Data
Importing the Resale Data
<- read_csv("data/aspatial/resale-flat-prices.csv")
resale glimpse(resale)
Taking a look at the data we have noticed that the data set contains 11 columns with 148576 observations. They have the following columns: months, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, remaining_lease, resale_price. We are only interested in the following period from: 1st January 2021 to 31st December 2022 and January 2023 to February 2022.
Filtering the Aspatial Data
We are making use of the filter function from the dplyr package to help us filter out the data. Find out more here. As there is more steps to be done on the aspatial data before it can be used for geographically weighted regression, we will split the data set into the period: 1st January 2021 to February 2022, for now.
<- filter(resale,flat_type == "4 ROOM") %>%
resale_flat_full filter(month >= "2021-01" & month <= "2023-02")
Once we have split it, we can make use of the unique function of Base R to ensure that all the data extracted out are correct. Find out more here.
unique(resale_flat_full$flat_type)
unique(resale_flat_full$month)
From the result above, we can confirm that the data for data set is extracted correctly.
Transforming the Resale Data
Now that we have correctly filter out the dataset that we wish to use, we are left with another problem. Lets have a look at our data as an example for me to better illustrate the problem.
head(resale_flat_full)
In our analysis, we are looking at the following key_factors:
Area of the unit
Floor level
Remaining lease
Age of the unit
With regards to the key factor. Notice how there are 4 key issues that we need to addressed:
No geospatial data: There is no geospatial data for us to plot out the points. This is worrying as the geospatial data is needed for us to perform geographically weighted regression. Fortunately, the data frame provided 2 columns that are critical in retrieving the coordinates of the flat, however it is found in 2 different columns: block, street_name. We would need to concatenate them together into to search for their coordinates.
remaining_lease is recorded as a string: The remaining leases data is found as a string, when it should be an continuos variable. It is current written as a string currently, which will be treated as a categorical data instead, as such we would need to convert it into the correct format first.
storey_range is given as a range: In our dataset, the floor of the exact unique is not given, but rather a range is given this could be a huge potential issue as this would mean thatthe data will treated as a categorical data. We would need to convert it into the correct format first
No age: There is no age in our dataset, which would mean that we would need to solve this issue as well.
Retrieving Postal Codes and Coordinates of the address
As mentioned before, one of the key issues that we would need to perform is to retrieve all the relevant data such as postal code and coordinates of the address that is needed for later analysis.
The steps are as followed:
Combining Block and Street Name to form an address
Filtering out Unique Address
Retrieving coordinates from OneMap.API
Inspecting the Result and Fixing the Issues
Step 1: Combining Block and Street Name to form an address
In this step, we will be combining the street_name and block to form an address.
We can make use of the paste()
function from base R to concatenate the two data together. Find out more here.
Afterwards, we will placed the data in a new columns called address in the dataframe by using the mutate()
function from dplyr. Find out more about here.
<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, address = paste(block,street_name))
head(resale_flat_full)
Step 2: Filtering out Unique Address.
This step is performed in order to minimize the amount of API Call that we need to perform. Furthermore, this also makes it easier for us to see which of the address will result in an error.
We will first get the unique address out first before sorting the data. This can be done with the sort()
from base R. Find out more here.
<- sort(unique(resale_flat_full$address)) add_list
Step 3: Retrieiving Coordinates from OneMap API
To retrieve the coordinates from OneMap API, it will be easier to create a function to retrieve all the coordinate instead. To do so, lets take a deep dive into the OneMap API. Documentation can be found here. In this case, we will be making use of the OneMap API search API to retrieve the necessary coordinates.
According to the documentation, the request requires the following.
searchVal: Keywords entered by user that is used to filter out the results.
returnGeom {Y/N}: Checks if user wants to return the geometry.
getAddrDetails {Y/N}: Checks if user wants to return address details for a point.
pageNum: Specifies the page to retrieve your search results from. This is optional (We will not be using this in this case)
A provided example link would be something like this: https://developers.onemap.sg/commonapi/search?searchVal=revenue&returnGeom=Y&getAddrDetails=Y&pageNum=1
This will be the following response (Taken from OneMap API):
We are only interested in the LATITUDE and LONGITUDE data in this case.
Now that we understand the API better, we will now create a function that will help us sort through all the data. The following code chunk below does a number of critical steps:
We will create a data frame called postal_coords that will store all the data frame.
We will make use of the
GET()
function from httr package to make a get request call. Find out more here.We will create a data frame called new role to store all the coordinates
We also need to check the number of responses returned and append to the main data frame accordingly. This is because there are a few conditions to take note of
The number of results can be greater than one or none at all. (indicated by found in the JSON).
The results returned can have no postal code (which we will not consider as valid)
We will take the first result with a valid postal code as the correct coordinates.
Lastly, we will append the returned response (new_row) with the necessary fields to the main dataframe (postal_coords) using
rbind()
function of base R package. Find out more here.
All of this can be found in the code chunk below:
<- function(add_list){
get_coords
# Create a data frame to store all retrieved coordinates
<- data.frame()
postal_coords
for (i in add_list){
#print(i)
<- GET('https://developers.onemap.sg/commonapi/search?',
r query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
<- fromJSON(rawToChar(r$content))
data <- data$found
found <- data$results
res
# Create a new data frame for each address
<- data.frame()
new_row
# If single result, append
if (found == 1){
<- res$POSTAL
postal <- res$LATITUDE
lat <- res$LONGITUDE
lng <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
new_row
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
<- res[res$POSTAL != "NIL", ]
res_sub
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
<- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
new_row
}
else{
<- head(res_sub, n = 1)
top1 <- top1$POSTAL
postal <- top1$LATITUDE
lat <- top1$LONGITUDE
lng <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
new_row
}
}
else {
<- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
new_row
}
# Add the row
<- rbind(postal_coords, new_row)
postal_coords
}return(postal_coords)
}
With the function define we can call the function to generate out the data frame of the postal codes.
Please note that this function does take some time to run.
<- get_coords(add_list) coords
Step 4: Inspecting the Result and Fixing the Issues
Remember all the issues that we can face when making the api call, we would need to check to make sure that all the data is accounted for.
We can make use of the is.na()
function to check which street address contains any NA values
is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ] coords[(
Based on the data frame above, we have noticed that there seems to be 2 address with no postal code
215 CHOA CHU KANG
216 CHOA CHU KANG
When searching directly with OneMap API instead, we found that OneMap API classified them as the same building instead with the results being “BLK 216 AND 215 CHOA CHU KANG CENTRAL”. Furthermore a brief check on the website: property indicates the postal code as
We shall proceed with keeping them as the same as there is the latitude and longitude data.
Transforming Remaining Lease
In this section, we will be transforming the remaining lease into an integer.
First, we will split the remaining lease into years and months columns for calculation. To do so, we will make use of str_sub()
function from tidyverse. Find out more here.
we will convert the string into an integer using the as.integer()
function from base R. Find out more here.
<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(resale_flat_full, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))
We will then convert the NA values of the remaining_lease_mth columns into 0. We can make use of the is.na()
function from base R. Find out more here.
Followed by changing the remaining_lease_yr into months.
Finally, we can sum the two columns together with rowSums()
function from base R. Find out more here.
$remaining_lease_mth[is.na(resale_flat_full$remaining_lease_mth)] <- 0
resale_flat_full$remaining_lease_yr <- resale_flat_full$remaining_lease_yr * 12
resale_flat_full<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, remaining_lease_mths = rowSums(resale_flat_full[, c("remaining_lease_yr", "remaining_lease_mth")]))
Transforming Storey Range
Categorical variables require special attention in regression analysis because, unlike continuous variables, they cannot be entered into the regression equation just as they are. A method we can used to convert categorical variables into continuous variable is called “treatment” coding, or “dummy” coding. This method involve converting the categorical variable into a reference number.
However, we must understand that storey range is an ordinal data, in other words, each categorical has an order, or in this case, low to high. Using a dummy variable might not make as much sense in this case. Hence, we will assign a higher value to higher floors. The reasoning behind it is that a higher floor offers more privacy, better security and lower noise pollution due to the high height.
We will first create a dataframe to store all the unique storey range and at the same time sort them. After they are sorted we will create a list of encoding based on the unique values then merging them to form a data frame
<- sort(unique(resale_flat_full$storey_range))
storeys <- 1:length(storeys)
storey_order <- data.frame(storeys, storey_order) storey_range_order
head(storey_range_order)
From the above results, we can see that:
01 TO 03 is assigned the value: 1
04 TO 06 is assigned the value: 2
07 TO 09 is assigned the value: 3
10 TO 12 is assigned the value: 4
13 TO 15 is assigned the value: 5
16 TO 18 is assigned the value: 6
Hence, the storey range are in the correct order
Transforming Age
Age is one of the key factors that we are using in our analysis. However, there is no direct reference to age of the HDB. One method we can use is to infer the age of the building based on the remaining lease. It is well known that Singapore HDB are leased to us for 99 years as such we can calculate the age based on the difference between the total lease and remaining lease.
<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, age = 99 * 12 - resale_flat_full$remaining_lease_mths)
head(resale_flat_full)
Combining all the data and storing as RDS
We will now combine all the relevant fields to a data frame using left_join()
function of dplyr package. Find out more here.
<- left_join(resale_flat_full, storey_range_order, by= c("storey_range" = "storeys"))
resale_flat_full
<- left_join(resale_flat_full, coords, by = c('address' = 'address')) rs_coords
We can then store the data frame into a rds for use later.
<- write_rds(rs_coords, "data/aspatial/rds/rs_coords.rds") rs_coords_rds
Viewing RDS Data
Reading RDS
Instead of performing all the steps above again, once you save the data frame in a rds format, you can read the rds format again. Notice how the data frame is saved exactly as it is.
<- read_rds("data/aspatial/rds/rs_coords.rds")
rs_coords head(rs_coords)
Converting to Sf Object
Since the coordinate columns are Latitude & Longitude which are in decimal degrees, the projected CRS will be WGS84. We will need to assign them the respective EPSG code 4326 first before transforming it to 3414 which is the EPSG code for SVY21.
We will first use the st_as_sf()
function of sf package to convert the data frame into sf object. Find out more here.
Followed by the st_transform()
function of sf package to transform the coordinates of the sf object.
<- st_as_sf(rs_coords,
rs_coords_sf coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)
rs_coords_sf
Checking for invalid geometries
The step below is not exactly mandatory, but it is good to check if the sf object is valid or not.
length(which(st_is_valid(rs_coords_sf) == FALSE))
Plotting the HDB Resale points
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="blue", size = 0.02)
tmap_mode("plot")
Preparing the Locational Data
In this section, we will be retrieving all the other co location factors from the OneMap API.
Extracting the Data from OneMap API
Before we do a deep dive into examining the data, the first step that we need to do first is to extract out all the relevant data that we need. But first we need to see what OneMap API allows us to extract first.
Before we start off, I will first be introducing to you the onemapsgapi package that we will be using. You can learn more about the package here. You will need to sign up for an account so that they will generate a unique token for you to use. I have already pre-loaded my token into a variable called token.
Step 1: Searching for Themes
The first step that we can do is to see what themes are available for us to use, with the code chunk below. In this case, we will be using the function search_themes()
function to help us view all the available list. Find out more here.
# Please replace token with the values of your own token.
<- search_themes(token)
avail_themes avail_themes
As you can see there seems to be a huge amount of themes available for us to use, which in this case I am interested only in a few namely, Eldercare Services, Childcare Services, Hawker Centres, Parks, Kindergartens. As you have noticed, there seems to be some themes that have repeated themselves, we would need to do a deep dive to find out more about what each theme contains.
Step 2: A Closer Look at each theme
One ability of the search_themes()
function is that you can make a query as well. This way there is no need to see all the other information that is not necessary for our evaluation.
Eldercare
<- search_themes(token, "elder")
avail_themes_elder avail_themes_elder
There is only one available theme relating to Eldercare, so this theme will be use. The query is eldercare
Childcare
<- search_themes(token, "childcare")
avail_themes_child avail_themes_child
There is only one available theme relating to ChildCare, so this theme will be use. The query is childcare
Hawker Centres
<- search_themes(token, "hawker")
avail_themes_hawker avail_themes_hawker
In this case, there appears to be 3 different themes with the name “Hawker”. In this case, we can ignore the healthier_hawker_centres, as it category is in health rather than environment.
<- get_theme(token, "hawkercentre_new")
hawkercentre_new <- get_theme(token, "hawkercentre")
hawkercentre head(hawkercentre_new)
head(hawkercentre)
If we were to compare to 2 data frame, both data set has the same number of observations, with the only difference being that hawkercentre_new theme has more variables as compared to that of the hawkercentre. However, the difference in variables does seems to add value to the data frame at all and will be drop as those variables are not essential. Using either of the themes should have be fine, but I will be making use of the former for this exercise.
The query picked is hawkercentre_new
Parks
<- search_themes(token, "parks")
avail_themes_parks avail_themes_parks
Based on the dataframe, above there seems to be a total of 25 different themes that matches the name “parks”, however this is due to the fact that it is the result also returns the results from the “National Parks Boards”. There are two themes that stands out the most here: “Parks” and “NParks Parks and Nature Reserve”. If we were to look into their category we found that the former belongs to recreation, while the latter belongs to environment. Seeing as we’re trying to relate the locational factors to the pricing of resale housing units, it makes more sense to go with the former!
The query chosen is nationalparks
Kindergarten
<- search_themes(token, "kinder")
avail_themes_kindergarten avail_themes_kindergarten
There is only one available theme relating to Kindergartens, so this theme will be use. The query is kindergartens
Clinic
<- search_themes(token, "health")
avail_themes_clinic avail_themes_clinic
There is only one available theme relating to Clinic, so this theme will be use. The query is moh_isp_clinics
Another theme of interest could also be Pharmacy, so this theme will be use. The query is registered_pharmacy
Tourist
<- search_themes(token, "tourism")
avail_themes_tourist avail_themes_kindergarten
There is only one available theme relating to Tourism, so this theme will be use. The query is tourism
Libraries
<- search_themes(token, "libraries")
avail_themes_libraries avail_themes_libraries
There is only one available theme relating to Tourism, so this theme will be use. The query is libraries
Others
One thing to note is that we are no longer able to access some information that Megan has access to due to the update of the API such as the name of the primary school as well, which we will address in the other section.
Step 3: Saving the Data into a data frame
In order to facilitate in the retrieving of themes, I have created the following function to aid me in retrieving the themes and saving them as a data frame.
We will be making use of the get_theme()
function from the onemapsgapi package to retrieve the relevant and output it in a tibble frame. Read more here.
Afterwards, we will convert the tibble to simple features dataframe. All the themes for this project use Lat and Lng as the latitude and longitude respectively, and our project coordinates system should be in the WGS84 system, aka ESPG code 4326.
To ensure that we can access the data again, we will save it into asf into a shapefile, which we can do with st_write()
function.
<- function(themename){
save_theme_sf
<- get_theme(token, themename) %>%
themetibble select(c("NAME", "Lat", "Lng") )
<- st_as_sf(themetibble, coords=c("Lng", "Lat"), crs=4326)
themesf <- paste("data/extracted/", themename, ".shp", sep="")
themename_file st_write(themesf, themename_file)
}
save_theme_sf("eldercare")
save_theme_sf("childcare")
save_theme_sf("hawkercentre_new")
save_theme_sf("nationalparks")
save_theme_sf("kindergartens")
save_theme_sf("moh_isp_clinics")
save_theme_sf("registered_pharmacy")
save_theme_sf("libraries")
save_theme_sf("tourism")
Extracting the Other Data
Now that we have handled those data, one issue that we have face is that there is no data for most the Shopping Mall or Primary School.
Shopping Mall Data
Unlike Megan’s Time where both Shopping Mall Data can be found in OneMap API, we need will need to make to do with the list from Wikipedia, which might not necessary be the most accurate, however, it appears to be the best list of data that we have.
You can find the link here. https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore
I have extracted the all the data from the wikipedia and clean it up manually.
= c("100 AM","313@Somerset","Aperia","Balestier Hill Shopping Centre","Bugis Cube","Bugis Junction","Bugis+","Capitol Piazza","Cathay Cineleisure Orchard","Clarke Quay Central","The Centrepoint","City Square Mall","City Gate Mall","CityLink Mall","Duo","Far East Plaza","Funan","Great World City","HDB Hub","Holland Village Shopping Mall","ION Orchard","Junction 8","Knightsbridge","Liat Towers","Lucky Plaza","Marina Bay Sands","The Shoppes at Marina Bay Sands","Marina Bay Link Mall","Marina Square","Millenia Walk","Mustafa Shopping Centre","Ngee Ann City","Orchard Central","Orchard Gateway","Orchard Plaza","Midpoint Orchard","Palais Renaissance","People's Park Centre","People's Park Complex","Plaza Singapura","PoMo","Raffles City","Scotts Square","Shaw House and Centre","Sim Lim Square","Singapore Shopping Centre","The South Beach","Square 2","Sunshine Plaza","Suntec City","Tanglin Mall","Tanjong Pagar Centre","Tekka Centre","The Adelphi","The Paragon","Tiong Bahru Plaza","The Poiz","Thomson Plaza","United Square","Thomson V","Velocity@Novena Square","Wheelock Place","Wisma Atria","Zhongshan Mall","Bedok Mall","Century Square","Our Tampines Hub","Changi City Point","Downtown East","Djitsun Mall Bedok","Eastpoint Mall","Jewel Changi Airport","KINEX","Katong Shopping Centre","Katong Square","Kallang Wave Mall","Leisure Park Kallang","i12 Katong","Parkway Parade","Paya Lebar Square","Paya Lebar Quarter","Roxy Square","Singpost Centre","Tampines 1","Tampines Mall","White Sands","City Plaza","Elias Mall","Loyang Point","888 Plaza","Admiralty Place","AMK Hub","Canberra Plaza","Causeway Point","Woodlands Civic Centre","Broadway Plaza","Djitsun Mall","Jubilee Square","Junction 8","Junction Nine","Marsiling Mall","Northpoint City","Sembawang Shopping Centre","Sun Plaza","Vista Point","Wisteria Mall","Woodlands Mart","Woodlands North Plaza","Waterway Point","Compass One","Hougang Mall","Heartland Mall","NEX","Buangkok Square","Greenwich V","Hougang 1","Hougang Green Shopping Mall","Hougang Rivercourt","myVillage At Serangoon Garden","Northshore Plaza","Oasis Terraces","Punggol Plaza","Rivervale Mall","Rivervale Plaza","The Seletar Mall","Upper Serangoon Shopping Centre","Beauty World Centre","Beauty World Plaza","Bukit Panjang Plaza","Bukit Timah Plaza","Fajar Shopping Centre","Greenridge Shopping Centre","Hillion Mall","HillV2","Junction 10","Keat Hong Shopping Centre","Limbang Shopping Centre","Lot One","Rail Mall","Sunshine Place","Teck Whye Shopping Centre","West Mall","Yew Tee Point","Yew Tee Square","VivoCity","HarbourFront Centre","Alexandra Retail Centre","321 Clementi","The Clementi Mall","IMM","JCube","Jem","Westgate","Jurong Point","Pioneer Mall","The Star Vista","Alexandra Central","Anchorpoint","OD Mall","Boon Lay Shopping Centre","Grantral Mall","Fairprice Hub","Gek Poh Shopping Centre","Rochester Mall","Taman Jurong Shopping Centre","West Coast Plaza","Queensway Shopping Centre") shopping_mall_list
Remember the function that we have created to retrieve the hdb, we can make use of it here as well to find their coordinates.
Note that this function does take some time to run.
<- get_coords(shopping_mall_list) shop_coords
shop_coords
Now then we can check which Mall does not have the necessary data
is.na(shop_coords$postal) | is.na(shop_coords$latitude) | is.na(shop_coords$longitude) | shop_coords$postal=="NIL"), ] shop_coords[(
From the results above, we can see that there are 7 addresses that does not contain any information at all:
Clarke Quay Central
City Gate Mall
Holland Village Shopping Mall
Mustafa Shopping Centre
PoMo
Shaw House and Centre
OD Mall
I research deeper into the situation and found reverse search the malls to find their Postal Code, afterwards I will search out the name based on the One Map API and found the following:
Clarke Quay Central is called
The Central
City Gate Mall is called
City Gate
Holland Village Shopping Mall is called
Holland Road Shopping Centre
Mustafa Shopping Centre is called
Mustafa Centre
PoMO is called
GR.ID
Shaw House and Centre is called
Shaw Centre
OD Mall is called
The GrandStand
I will append the names in the list below as it is easier:
= c("100 AM","313@Somerset","Aperia","Balestier Hill Shopping Centre","Bugis Cube","Bugis Junction","Bugis+","Capitol Piazza","Cathay Cineleisure Orchard","The Central","The Centrepoint","City Square Mall","City Gate","CityLink Mall","Duo","Far East Plaza","Funan","Great World City","HDB Hub","Holland Road Shopping Centre","ION Orchard","Junction 8","Knightsbridge","Liat Towers","Lucky Plaza","Marina Bay Sands","The Shoppes at Marina Bay Sands","Marina Bay Link Mall","Marina Square","Millenia Walk","Mustafa Centre","Ngee Ann City","Orchard Central","Orchard Gateway","Orchard Plaza","Midpoint Orchard","Palais Renaissance","People's Park Centre","People's Park Complex","Plaza Singapura","GR.ID","Raffles City","Scotts Square","Shaw Centre","Sim Lim Square","Singapore Shopping Centre","The South Beach","Square 2","Sunshine Plaza","Suntec City","Tanglin Mall","Tanjong Pagar Centre","Tekka Centre","The Adelphi","The Paragon","Tiong Bahru Plaza","The Poiz","Thomson Plaza","United Square","Thomson V","Velocity@Novena Square","Wheelock Place","Wisma Atria","Zhongshan Mall","Bedok Mall","Century Square","Our Tampines Hub","Changi City Point","Downtown East","Djitsun Mall Bedok","Eastpoint Mall","Jewel Changi Airport","KINEX","Katong Shopping Centre","Katong Square","Kallang Wave Mall","Leisure Park Kallang","i12 Katong","Parkway Parade","Paya Lebar Square","Paya Lebar Quarter","Roxy Square","Singpost Centre","Tampines 1","Tampines Mall","White Sands","City Plaza","Elias Mall","Loyang Point","888 Plaza","Admiralty Place","AMK Hub","Canberra Plaza","Causeway Point","Woodlands Civic Centre","Broadway Plaza","Djitsun Mall","Jubilee Square","Junction 8","Junction Nine","Marsiling Mall","Northpoint City","Sembawang Shopping Centre","Sun Plaza","Vista Point","Wisteria Mall","Woodlands Mart","Woodlands North Plaza","Waterway Point","Compass One","Hougang Mall","Heartland Mall","NEX","Buangkok Square","Greenwich V","Hougang 1","Hougang Green Shopping Mall","Hougang Rivercourt","myVillage At Serangoon Garden","Northshore Plaza","Oasis Terraces","Punggol Plaza","Rivervale Mall","Rivervale Plaza","The Seletar Mall","Upper Serangoon Shopping Centre","Beauty World Centre","Beauty World Plaza","Bukit Panjang Plaza","Bukit Timah Plaza","Fajar Shopping Centre","Greenridge Shopping Centre","Hillion Mall","HillV2","Junction 10","Keat Hong Shopping Centre","Limbang Shopping Centre","Lot One","Rail Mall","Sunshine Place","Teck Whye Shopping Centre","West Mall","Yew Tee Point","Yew Tee Square","VivoCity","HarbourFront Centre","Alexandra Retail Centre","321 Clementi","The Clementi Mall","IMM","JCube","Jem","Westgate","Jurong Point","Pioneer Mall","The Star Vista","Alexandra Central","Anchorpoint","The GrandStand","Boon Lay Shopping Centre","Grantral Mall","Fairprice Hub","Gek Poh Shopping Centre","Rochester Mall","Taman Jurong Shopping Centre","West Coast Plaza","Queensway Shopping Centre") new_shopping_mall_list
<- get_coords(new_shopping_mall_list)
shop_coords_new shop_coords_new
We will check again to make sure that the data is correct
is.na(shop_coords_new$postal) | is.na(shop_coords_new$latitude) | is.na(shop_coords_new$longitude) | shop_coords_new$postal=="NIL"), ] shop_coords_new[(
Since we have confirm that the data is clean, we will just rename the columns to make it easier to understand.
rename(shop_coords_new, "name" = "address")
We will save it as a shape file so that we can access it much more easily
<- st_as_sf(shop_coords_new, coords=c("longitude", "latitude"), crs=4326) shopping_sf
st_write(shopping_sf, "data/extracted/shopping_mall.shp")
Primary School
Unlike Megan’s Time where Primary School Data can be found in OneMap API and is found in Data.gov, we need will need to make to do with the list from MOE, which appears to be the best list of data that we have.
You can find the link here. https://www.moe.gov.sg/about-us/organisation-structure/sd/school-clusters
I have extracted the all the data from the MOE and clean it up manually.
= c("North Vista Primary School","Palm View Primary School","Rivervale Primary School","Seng Kang Primary School","Xinmin Primary School","Ahmad Ibrahim Primary School","Chongfu School","Endeavour Primary School","Jiemin Primary School","Northland Primary School","Northoaks Primary School","Xishan Primary School","Singapore Chinese Girls' School (Primary)","Anchor Green Primary School","Compassvale Primary School","Edgefield Primary School","Fernvale Primary School","Hougang Primary School","Yio Chu Kang Primary School","Catholic High School (Primary)","Anchor Green Primary School","Compassvale Primary School","Edgefield Primary School","Fernvale Primary School","Hougang Primary School","Yio Chu Kang Primary School","Catholic High School (Primary)","Greendale Primary School","Horizon Primary School","Mee Toh School","Montfort Junior School","Nan Chiau Primary School","North Spring Primary School","Maris Stella High School (Primary)","Admiralty Primary School")
primary_sch_list
=c("Evergreen Primary School","Greenwood Primary School","Marsiling Primary School","Qihua Primary School","Woodgrove Primary School","Woodlands Ring Primary School","Anderson Primary School","Huamin Primary School","Naval Base Primary School","North View Primary School","Peiying Primary School","Sembawang Primary School","Yishun Primary School","CHIJ St Nicholas (Primary)","Ang Mo Kio Primary School","CHIJ Our Lady of Nativity","Holy Innocents' Primary School","Jing Shan Primary School","Mayflower Primary School","Punggol Primary School","Blangah Rise Primary School","Fairfield Methodist School (Primary)","New Town Primary School","Pei Tong Primary School","Queenstown Primary School","Anglo-Chinese Primary School","CHIJ (Toa Payoh) Primary School","First Toa Payoh Primary School","Kheng Cheng School","Marymount Convent School","Pei Chun Public School","Raffles Girls’ Primary School","Alexandra Primary School","Cantonment Primary School","CHIJ (Kellock)","Gan Eng Seng Primary School","Hong Wen School","Radin Mas Primary School","River Valley Primary School","Zhangde Primary School","Anglo-Chinese Junior","Bendemeer Primary School","Farrer Park Primary School","St. Andrew's Junior School","St. Joseph's Institution Junior","St. Margaret's Primary School","Cedar Primary School","CHIJ Our Lady of Good Counsel","St Gabriel’s Primary School","Xinghua Primary School","Yangzheng Primary School","Zhonghua Primary School","Ai Tong School","Kuo Chuan Presbyterian Primary School","Teck Ghee Primary School","Townsville Primary School","Elias Park Primary School","Meridian Primary School","Northshore Primary School","Punggol Cove Primary School","Punggol Green Primary School","Punggol View Primary School","Valour Primary School","Bedok Green Primary School","Junyuan Primary School","Poi Ching School","Red Swastika School","Temasek Primary School","Yu Neng Primary School","Angsana Primary School","Chongzheng Primary School","East Spring Primary School","Fern Green Primary School","Gongshang Primary School","Sengkang Green Primary School","Springdale Primary School","Yumin Primary School","Changkat Primary School","Damai Primary School","Kong Hwa School","St Anthony’s Canossian Primary School","Telok Kurau Primary School","Canossa Catholic Primary School","Fengshan Primary School","Geylang Methodist School (Primary)","Haig Girls’ School","Paya Lebar Methodist Girls’ School (Primary)","Tanjong Katong Primary School","Casuarina Primary School","Oasis Primary School","Park View Primary School","Pasir Ris Primary School","St. Hilda’s Primary School","Tampines North Primary School","Tampines Primary School","Waterway Primary School","White Sands Primary School","CHIJ (Katong) Primary","Maha Bodhi School","Ngee Ann Primary School","Opera Estate Primary School","St Stephen's Primary School","Tao Nan School","Clementi Primary School","Henry Park Primary School","Nan Hua Primary School","Qifa Primary School","Yuhua Primary School","Chua Chu Kang Primary School","Concord Primary School","De La Salle School","Nanyang Primary School","South View Primary School","Unity Primary School","Corporation Primary School","Frontier Primary School","Jurong West Primary School","Pioneer Primary School","West Grove Primary School","Xingnan Primary School","Methodist Girls' School (Primary)","Bukit View Primary School","Dazhong Primary School","Jurong Primary School","Keming Primary School","Lianhua Primary School","St. Anthony's Primary School","Beacon Primary School","Greenridge Primary School","Pei Hwa Presbyterian Primary School","Teck Whye Primary School","West View Primary School","Zhenghua Primary School","Boon Lay Garden Primary School","Bukit Panjang Primary School","CHIJ Our Lady Queen of Peace","Kranji Primary School","West Spring Primary School","Westwood Primary School","Yew Tee Primary School","Bukit Timah Primary School","Fuhua Primary School","Lakeside Primary School","Princess Elizabeth Primary School","Rulang Primary School","Shuqun Primary School") primary_sch_list_2
We will make use of the same function to get the list of Primary School Coordinates of Primary School
<- get_coords(primary_sch_list)
primary_sch_list_coor <- get_coords(primary_sch_list_2) primary_sch_list_coor2
Once we Again we will check if both have any error
is.na(primary_sch_list_coor$postal) | is.na(primary_sch_list_coor$latitude) | is.na(primary_sch_list_coor$longitude) | primary_sch_list_coor$postal=="NIL"), ] primary_sch_list_coor[(
is.na(primary_sch_list_coor2$postal) | is.na(primary_sch_list_coor2$latitude) | is.na(primary_sch_list_coor2$longitude) | primary_sch_list_coor2$postal=="NIL"), ] primary_sch_list_coor2[(
From the results above, we can see that there are 7 addresses that does not contain any information at all:
Maris Stella High School (Primary)
CHIJ St Nicholas (Primary)
CHIJ (Toa Payoh) Primary School
St Gabriel’s Primary School
St Anthony’s Canossian Primary School
St. Hilda’s Primary School
St Stephen’s Primary School
I research deeper into the situation and found reverse search the malls to find their Postal Code, afterwards I will search out the name based on the One Map API and found the following:
Maris Stella High School (Primary) is called
Maris Stella High School
CHIJ St Nicholas (Primary) is called
CHIJ St Nicholas
CHIJ (Toa Payoh) Primary School is called
CHIJ Primary (Toa Payoh)
St Gabriel’s Primary School is called
Saint Gabriel Primary School
St Anthony’s Canossian Primary School is called
Saint Anthony Canossian Primary School
St. Hilda’s Primary School is called
Saint Hilda Primary School
St Stephen’s Primary School is called
Saint Stephen School
We will just replace it in the names in the list again.
= c("North Vista Primary School","Palm View Primary School","Rivervale Primary School","Seng Kang Primary School","Xinmin Primary School","Ahmad Ibrahim Primary School","Chongfu School","Endeavour Primary School","Jiemin Primary School","Northland Primary School","Northoaks Primary School","Xishan Primary School","Singapore Chinese Girls' School (Primary)","Anchor Green Primary School","Compassvale Primary School","Edgefield Primary School","Fernvale Primary School","Hougang Primary School","Yio Chu Kang Primary School","Catholic High School (Primary)","Anchor Green Primary School","Compassvale Primary School","Edgefield Primary School","Fernvale Primary School","Hougang Primary School","Yio Chu Kang Primary School","Catholic High School (Primary)","Greendale Primary School","Horizon Primary School","Mee Toh School","Montfort Junior School","Nan Chiau Primary School","North Spring Primary School","Maris Stella High School","Admiralty Primary School")
primary_sch_list
=c("Evergreen Primary School","Greenwood Primary School","Marsiling Primary School","Qihua Primary School","Woodgrove Primary School","Woodlands Ring Primary School","Anderson Primary School","Huamin Primary School","Naval Base Primary School","North View Primary School","Peiying Primary School","Sembawang Primary School","Yishun Primary School","CHIJ St Nicholas","Ang Mo Kio Primary School","CHIJ Our Lady of Nativity","Holy Innocents' Primary School","Jing Shan Primary School","Mayflower Primary School","Punggol Primary School","Blangah Rise Primary School","Fairfield Methodist School (Primary)","New Town Primary School","Pei Tong Primary School","Queenstown Primary School","Anglo-Chinese Primary School","CHIJ Primary (Toa Payoh)","First Toa Payoh Primary School","Kheng Cheng School","Marymount Convent School","Pei Chun Public School","Raffles Girls’ Primary School","Alexandra Primary School","Cantonment Primary School","CHIJ (Kellock)","Gan Eng Seng Primary School","Hong Wen School","Radin Mas Primary School","River Valley Primary School","Zhangde Primary School","Anglo-Chinese Junior","Bendemeer Primary School","Farrer Park Primary School","St. Andrew's Junior School","St. Joseph's Institution Junior","St. Margaret's Primary School","Cedar Primary School","CHIJ Our Lady of Good Counsel","Saint Gabriel Primary School","Xinghua Primary School","Yangzheng Primary School","Zhonghua Primary School","Ai Tong School","Kuo Chuan Presbyterian Primary School","Teck Ghee Primary School","Townsville Primary School","Elias Park Primary School","Meridian Primary School","Northshore Primary School","Punggol Cove Primary School","Punggol Green Primary School","Punggol View Primary School","Valour Primary School","Bedok Green Primary School","Junyuan Primary School","Poi Ching School","Red Swastika School","Temasek Primary School","Yu Neng Primary School","Angsana Primary School","Chongzheng Primary School","East Spring Primary School","Fern Green Primary School","Gongshang Primary School","Sengkang Green Primary School","Springdale Primary School","Yumin Primary School","Changkat Primary School","Damai Primary School","Kong Hwa School","Saint Anthony Canossian Primary School","Telok Kurau Primary School","Canossa Catholic Primary School","Fengshan Primary School","Geylang Methodist School (Primary)","Haig Girls’ School","Paya Lebar Methodist Girls’ School (Primary)","Tanjong Katong Primary School","Casuarina Primary School","Oasis Primary School","Park View Primary School","Pasir Ris Primary School","Saint Hilda Primary School","Tampines North Primary School","Tampines Primary School","Waterway Primary School","White Sands Primary School","CHIJ (Katong) Primary","Maha Bodhi School","Ngee Ann Primary School","Opera Estate Primary School","Saint Stephen School","Tao Nan School","Clementi Primary School","Henry Park Primary School","Nan Hua Primary School","Qifa Primary School","Yuhua Primary School","Chua Chu Kang Primary School","Concord Primary School","De La Salle School","Nanyang Primary School","South View Primary School","Unity Primary School","Corporation Primary School","Frontier Primary School","Jurong West Primary School","Pioneer Primary School","West Grove Primary School","Xingnan Primary School","Methodist Girls' School (Primary)","Bukit View Primary School","Dazhong Primary School","Jurong Primary School","Keming Primary School","Lianhua Primary School","St Anthony's Primary School","Beacon Primary School","Greenridge Primary School","Pei Hwa Presbyterian Primary School","Teck Whye Primary School","West View Primary School","Zhenghua Primary School","Boon Lay Garden Primary School","Bukit Panjang Primary School","CHIJ Our Lady Queen of Peace","Kranji Primary School","West Spring Primary School","Westwood Primary School","Yew Tee Primary School","Bukit Timah Primary School","Fuhua Primary School","Lakeside Primary School","Princess Elizabeth Primary School","Rulang Primary School","Shuqun Primary School") primary_sch_list_2
<- get_coords(primary_sch_list)
primary_sch_list_coor <- get_coords(primary_sch_list_2) primary_sch_list_coor2
We will need to check again to make sure that there is no missing data
is.na(primary_sch_list_coor$postal) | is.na(primary_sch_list_coor$latitude) | is.na(primary_sch_list_coor$longitude) | primary_sch_list_coor$postal=="NIL"), ] primary_sch_list_coor[(
is.na(primary_sch_list_coor2$postal) | is.na(primary_sch_list_coor2$latitude) | is.na(primary_sch_list_coor2$longitude) | primary_sch_list_coor2$postal=="NIL"), ] primary_sch_list_coor2[(
Since we have confirm that the data is clean, we will just rename the columns to make it easier to understand.
<- rbind(primary_sch_list_coor, primary_sch_list_coor2) primary_sch_list_coor_full
rename(primary_sch_list_coor_full, "name" = "address")
We can write it into a shape file for us to access everything again.
<- st_as_sf(primary_sch_list_coor_full, coords=c("longitude", "latitude"), crs=4326) primary_sch_sf
st_write(primary_sch_sf, "data/extracted/primary_school.shp")
Good Primary School
Education and academic institutions are an especially important locational factors for families with children, or expect to have children. Since it has already been proven that distance affects priority admission, the number of good Primary School around the area is important as well. For this analysis, our focus will be on the primary-school level of education, we would need to understand what is defined as “Good Primary School”. MOE does not released a list of Primary Schools ranked based on their result as such, there is now way to determine how the Primary School are actually ranked.
One possible example of the ranking would be the schlah’s Primary School Rankings from the 2020. Although the list is a bit dated, It does offer a transparent method on how they determined what is considered a “Good Primary School” primarily through:
Popularity in Primary 1 (P1) Registration: 20%
Gifted Education Programme (GEP): 20%
Special Assistance Plan (SAP): 15%
Achievements in the Singapore Youth Festival Arts Presentation: 15%
Representation in the Singapore National School Games: 15%
Singapore Uniformed Groups Unit Recognition: 15%
For my analysis, I will consider the Top 10 Primary School as Good Primary School:
<- c("Nanyang Primary School","Tao Nan School","Catholic High School (Primary)","Nan Hua Primary School","Saint Hilda Primary School","Henry Park Primary School","Anglo-Chinese Primary School","Raffles Girls’ Primary School","Pei Hwa Presbyterian Primary School","CHIJ St Nicholas") good_primary_school
We will once again make use of the OneMap API to get the theme
<- get_coords(good_primary_school) good_primary_school_coor
We will need to check again to make sure that there is no missing data.
is.na(good_primary_school_coor$postal) | is.na(good_primary_school_coor$latitude) | is.na(good_primary_school_coor$longitude) | good_primary_school_coor$postal=="NIL"), ] good_primary_school_coor[(
Since we have confirm that the data is cleaned we can save it as a shape file to reference again.
rename(good_primary_school_coor, "name" = "address")
<- st_as_sf(good_primary_school_coor, coords=c("longitude", "latitude"), crs=4326) good_primary_sch_sf
st_write(good_primary_sch_sf, "data/extracted/good_primary_school.shp")
CBD
Lastly, we need to factor in the proximity to the Central Business District - in the Downtown Core. It’s located in the southwest of Singapore. As such, let’s take the coordinates of Downtown Core to be the coordinates of the CBD, we will store it as a CBD as well.
<- 1.287953
lat <- 103.851784
lng
<- data.frame(lat, lng) %>%
cbd_sf st_as_sf(coords = c("lng", "lat"), crs=4326)
st_write(cbd_sf, "data/extracted/cbd.shp")
Load in all the shape data
Remember all the Shape File that we have save from the above pre processing we can now load them again.
<- st_read(dsn="data/geospatial", "MPSZ-2019")
mpsz <- st_read(dsn="data/geospatial", "CostalOutline")
national_boundary <- st_read(dsn="data/geospatial", layer="childcare")
childcare <- st_read(dsn="data/geospatial", layer="eldercare")
eldercare <- st_read(dsn="data/geospatial", layer="hawkercentre")
hawker_centre <- st_read(dsn="data/geospatial", layer="kindergartens")
kindergarten <- st_read(dsn="data/geospatial", layer="nationalparks")
parks <- st_read(dsn="data/geospatial", layer="libraries")
libraries <- st_read(dsn="data/geospatial", layer="moh_isp_clinics")
isp_clinics <- st_read(dsn="data/geospatial", layer="tourism")
tourism <- st_read(dsn="data/geospatial", layer="BusStop")
bus_stop <- st_read(dsn="data/geospatial", layer="primary_sch") %>% st_transform(primary_sch, crs=4326)
primary_sch <- st_read(dsn="data/geospatial", layer="top10_pri_sch") %>% st_transform(top_primary_sch, crs=4326)
top_primary_sch <- st_read(dsn="data/geospatial", layer="registered_pharmacy") %>% st_transform(registered_pharmacy, crs=4326)
registered_pharmacy <- st_read(dsn="data/geospatial", layer="cbd") %>% st_transform(CBD, crs=4326)
CBD <- st_read(dsn="data/geospatial", layer="mrt_lrt") %>% st_transform(mrts, crs=4326) mrts
We will need to convert the shape file into CRS 3414 just in case.
<- st_transform(childcare, crs=3414)
childcare3414 <- st_transform(eldercare, crs=3414)
eldercare3414 <- st_transform(hawker_centre, crs=3414)
hawker_centre3414 <- st_transform(kindergarten, crs=3414)
kindergarten3414 <- st_transform(parks, crs=3414)
parks3414 <- st_transform(libraries, crs=3414)
libraries3414 <- st_transform(isp_clinics, crs=3414)
isp_clinics3414 <- st_transform(tourism, crs=3414)
tourism3414 <- st_transform(primary_sch, crs=3414)
primary_sch_sf_3414 <- st_transform(top_primary_sch, crs=3414)
top_primary_sch_sf_3414 <- st_transform(registered_pharmacy, crs=3414)
registered_pharmacy_3414 <- st_transform(CBD, crs=3414)
CBD_3414 <- st_transform(mrts, crs=3414)
mrts_3414 <- st_transform(bus_stop, crs=3414) bus_stop_3414
We already have a shopping mall shape file that we can used from the pre processing done above. Another method which is faster could be done if you already have a csv file, which can be done in the code chunk below.
<- read_csv("data/aspatial/shopping_mall.csv") shopping_mall
Please note that there must be longitude and latitude attributes in the csv file.
<- st_as_sf(shopping_mall, coords=c("longitude", "latitude"), crs=4326)
shopping_mall_sf <- st_transform(shopping_mall_sf, crs=3414) shopping_mall_sf_3414
Importing SuperMarket Data
<- st_read("data/geospatial/supermarkets.kml") supermarket_sf
Notice how the supermarket has a Z coordinates. The Z coordinates only contains 0 values and hence we would need to drop it.
<- st_zm(supermarket_sf)
supermarket_sf <- st_transform(supermarket_sf, crs=3414) supermarket_sf_3414
Importing the Resale Data
<- read_csv("data/aspatial/resale-flat-prices-from-jan-2017-onwards.csv")
resale_flat_full glimpse(resale_flat_full)
Taking a look at the data we have noticed that the data set contains 11 columns with 23656 observations. Depending on when your data is loaded, it can have more observations. They have the following columns: months, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, remaining_lease, resale_price.
Transforming the Resale Data
Now that we have correctly filter out the dataset that we wish to use, we are left with another problem. Lets have a look at our data as an example for me to better illustrate the problem.
head(resale_flat_full)
In our analysis, we are looking at the following key_factors:
Area of the unit
Floor level
Remaining lease
Age of the unit
With regards to the key factor. Notice how there are 4 key issues that we need to addressed:
No geospatial data: There is no geospatial data for us to plot out the points. This is worrying as the geospatial data is needed for us to perform geographically weighted regression. Fortunately, the data frame provided 2 columns that are critical in retrieving the coordinates of the flat, however it is found in 2 different columns: block, street_name. We would need to concatenate them together into to search for their coordinates.
remaining_lease is recorded as a string: The remaining leases data is found as a string, when it should be an continuos variable. It is current written as a string currently, which will be treated as a categorical data instead, as such we would need to convert it into the correct format first.
storey_range is given as a range: In our dataset, the floor of the exact unique is not given, but rather a range is given this could be a huge potential issue as this would mean thatthe data will treated as a categorical data. We would need to convert it into the correct format first
No age: There is no age in our dataset, which would mean that we would need to solve this issue as well.
Retrieving Postal Codes and Coordinates of the address
As mentioned before, one of the key issues that we would need to perform is to retrieve all the relevant data such as postal code and coordinates of the address that is needed for later analysis.
The steps are as followed:
Combining Block and Street Name to form an address
Filtering out Unique Address
Retrieving coordinates from OneMap.API
Inspecting the Result and Fixing the Issues
Step 1: Combining Block and Street Name to form an address
In this step, we will be combining the street_name and block to form an address.
We can make use of the paste()
function from base R to concatenate the two data together. Find out more here.
Afterwards, we will placed the data in a new columns called address in the dataframe by using the mutate()
function from dplyr. Find out more about here.
<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, address = paste(block,street_name))
head(resale_flat_full)
Step 2: Filtering out Unique Address.
This step is performed in order to minimize the amount of API Call that we need to perform. Furthermore, this also makes it easier for us to see which of the address will result in an error.
We will first get the unique address out first before sorting the data. This can be done with the sort()
from base R. Find out more here.
<- sort(unique(resale_flat_full$address)) add_list
Step 3: Retrieiving Coordinates from OneMap API
To retrieve the coordinates from OneMap API, it will be easier to create a function to retrieve all the coordinate instead. To do so, lets take a deep dive into the OneMap API. Documentation can be found here. In this case, we will be making use of the OneMap API search API to retrieve the necessary coordinates.
According to the documentation, the request requires the following.
searchVal: Keywords entered by user that is used to filter out the results.
returnGeom {Y/N}: Checks if user wants to return the geometry.
getAddrDetails {Y/N}: Checks if user wants to return address details for a point.
pageNum: Specifies the page to retrieve your search results from. This is optional (We will not be using this in this case)
A provided example link would be something like this: https://developers.onemap.sg/commonapi/search?searchVal=revenue&returnGeom=Y&getAddrDetails=Y&pageNum=1
We are only interested in the LATITUDE and LONGITUDE data in this case.
Now that we understand the API better, we will now create a function that will help us sort through all the data. The following code chunk below does a number of critical steps:
We will create a data frame called postal_coords that will store all the data frame.
We will make use of the
GET()
function from httr package to make a get request call. Find out more here.We will create a data frame called new role to store all the coordinates
We also need to check the number of responses returned and append to the main data frame accordingly. This is because there are a few conditions to take note of
The number of results can be greater than one or none at all. (indicated by found in the JSON).
The results returned can have no postal code (which we will not consider as valid)
We will take the first result with a valid postal code as the correct coordinates.
Lastly, we will append the returned response (new_row) with the necessary fields to the main dataframe (postal_coords) using
rbind()
function of base R package. Find out more here.
All of this can be found in the code chunk below:
<- function(add_list){
get_coords
# Create a data frame to store all retrieved coordinates
<- data.frame()
postal_coords
for (i in add_list){
#print(i)
<- GET('https://developers.onemap.sg/commonapi/search?',
r query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
<- fromJSON(rawToChar(r$content))
data <- data$found
found <- data$results
res
# Create a new data frame for each address
<- data.frame()
new_row
# If single result, append
if (found == 1){
<- res$POSTAL
postal <- res$LATITUDE
lat <- res$LONGITUDE
lng <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
new_row
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
<- res[res$POSTAL != "NIL", ]
res_sub
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
<- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
new_row
}
else{
<- head(res_sub, n = 1)
top1 <- top1$POSTAL
postal <- top1$LATITUDE
lat <- top1$LONGITUDE
lng <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
new_row
}
}
else {
<- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
new_row
}
# Add the row
<- rbind(postal_coords, new_row)
postal_coords
}return(postal_coords)
}
With the function define we can call the function to generate out the data frame of the postal codes.
Please note that this function does take some time to run.
<- get_coords(add_list) coords
Step 4: Inspecting the Result and Fixing the Issues
Remember all the issues that we can face when making the api call, we would need to check to make sure that all the data is accounted for.
We can make use of the is.na()
function to check which street address contains any NA values
is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ] coords[(
Based on the data frame above, we have noticed that there seems to be 2 address with no postal code
215 CHOA CHU KANG
216 CHOA CHU KANG
When searching directly with OneMap API instead, we found that OneMap API classified them as the same building instead with the results being “BLK 216 AND 215 CHOA CHU KANG CENTRAL”. Furthermore a brief check on the website: property indicates the postal code as
We shall proceed with keeping them as the same as there is the latitude and longitude data.
Transforming Remaining Lease
In this section, we will be transforming the remaining lease into an integer.
First, we will split the remaining lease into years and months columns for calculation. To do so, we will make use of str_sub()
function from tidyverse. Find out more here.
we will convert the string into an integer using the as.integer()
function from base R. Find out more here.
<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(resale_flat_full, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))
We will then convert the NA values of the remaining_lease_mth columns into 0. We can make use of the is.na()
function from base R. Find out more here.
Followed by changing the remaining_lease_yr into months.
Finally, we can sum the two columns together with rowSums()
function from base R. Find out more here.
$remaining_lease_mth[is.na(resale_flat_full$remaining_lease_mth)] <- 0
resale_flat_full$remaining_lease_yr <- resale_flat_full$remaining_lease_yr * 12
resale_flat_full<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, remaining_lease_mths = rowSums(resale_flat_full[, c("remaining_lease_yr", "remaining_lease_mth")]))
Transforming Storey Range
Categorical variables require special attention in regression analysis because, unlike continuous variables, they cannot be entered into the regression equation just as they are. A method we can used to convert categorical variables into continuous variable is called “treatment” coding, or “dummy” coding. This method involve converting the categorical variable into a reference number.
However, we must understand that storey range is an ordinal data, in other words, each categorical has an order, or in this case, low to high. Using a dummy variable might not make as much sense in this case. Hence, we will assign a higher value to higher floors. The reasoning behind it is that a higher floor offers more privacy, better security and lower noise pollution due to the high height.
We will first create a dataframe to store all the unique storey range and at the same time sort them. After they are sorted we will create a list of encoding based on the unique values then merging them to form a data frame
<- sort(unique(resale_flat_full$storey_range))
storeys <- 1:length(storeys)
storey_order <- data.frame(storeys, storey_order) storey_range_order
head(storey_range_order)
From the above results, we can see that:
01 TO 03 is assigned the value: 1
04 TO 06 is assigned the value: 2
07 TO 09 is assigned the value: 3
10 TO 12 is assigned the value: 4
13 TO 15 is assigned the value: 5
16 TO 18 is assigned the value: 6
Hence, the storey range are in the correct order
Transforming Age
Age is one of the key factors that we are using in our analysis. However, there is no direct reference to age of the HDB. One method we can use is to infer the age of the building based on the remaining lease. It is well known that Singapore HDB are leased to us for 99 years as such we can calculate the age based on the difference between the total lease and remaining lease.
<- resale_flat_full %>%
resale_flat_full mutate(resale_flat_full, age = 99 * 12 - resale_flat_full$remaining_lease_mths)
head(resale_flat_full)
<- left_join(resale_flat_full, storey_range_order, by= c("storey_range" = "storeys"))
resale_flat_full
<- left_join(resale_flat_full, coords, by = c('address' = 'address')) rs_coords
<- write_rds(rs_coords, "data/rds/rs_coords.rds") rs_coords_rds
<- read_rds("data/rds/rs_coords.rds") rs_coords
<- st_as_sf(rs_coords,
rs_coords_sf coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)
Calculating Proximity
One of the things we need to find is the proximity to particular facilities as one of factors.
Megan has kindly provided the function for us to use above. First, she use compute with st_distance(),
to compute the proximity to the facility and find the closest facility (shortest distance) with the rowMins()
function of our matrixStats package. The values will be appended to the data frame as a new column.
<- function(df1, df2, varname) {
proximity <- st_distance(df1, df2) %>%
dist_matrix ::drop_units()
units<- rowMins(dist_matrix)
df1[,varname] return(df1)
}
<-
rs_coords_sf # the columns will be truncated later on when viewing
# so we're limiting ourselves to two-character columns for ease of viewing between
proximity(rs_coords_sf, CBD_3414, "PROX_CBD") %>%
proximity(., childcare3414, "PROX_CHILDCARE") %>%
proximity(., eldercare3414, "PROX_ELDERCARE") %>%
proximity(., hawker_centre3414, "PROX_HAWKER") %>%
proximity(., mrts_3414 , "PROX_MRT") %>%
proximity(., parks3414, "PROX_PARK") %>%
proximity(., top_primary_sch_sf_3414, "PROX_TOPPRISCH") %>%
proximity(., shopping_mall_sf_3414, "PROX_MALL") %>%
proximity(., supermarket_sf_3414, "PROX_SPRMKT") %>%
proximity(., isp_clinics3414, "PROX_CLINIC") %>%
proximity(., registered_pharmacy_3414, "PROX_PHARMACY") %>%
proximity(., tourism3414, "PROX_TOURISM") %>%
proximity(., libraries3414, "PROX_LIBRARY")
Facility Count Within Radius
Other than proximity, which calculates the shortest distance, we also want to find the number of facilities within a particular radius as another factors.
Megan has kindly provided the function for us to use above. First, she use st_distance()
to compute the distance between the flats and the desired facilities, and then sum up the observations with rowSums()
. The values will be appended to the data frame as a new column.
<- function(df1, df2, varname, radius) {
num_radius <- st_distance(df1, df2) %>%
dist_matrix drop_units() %>%
as.data.frame()
<- rowSums(dist_matrix <= radius)
df1[,varname] return(df1)
}
<-
rs_coords_sf num_radius(rs_coords_sf, kindergarten3414, "NUM_KNDRGTN", 350) %>%
num_radius(., childcare3414, "NUM_CHILDCARE", 350) %>%
num_radius(., bus_stop_3414, "NUM_BUS_STOP", 350) %>%
num_radius(., isp_clinics3414, "NUM_ISP_CLIN", 350) %>%
num_radius(., registered_pharmacy_3414, "NUM_REGISTERED_PHARM", 350) %>%
num_radius(., libraries3414, "NUM_LIBRARIES", 350) %>%
num_radius(., primary_sch_sf_3414, "NUM_PRI_SCH", 1000)
Saving the Data into RDS
Well we do not really need all the columns, and we also do not want to read all the files again, we can save it as a rds format so that we can access it again.
Before we save it into RDS, we can clean up columns that we do not really need.
<- write_rds(rs_coords_sf, "data/rds/rs_coords_full.rds") rs_coords_full_rds
You have now pre processed the data and you can now load the data into the Shiny application
Building the Models
Generating the new GeoGraphically Weighted Linear Model
Loading the Data
<- read_rds("data/rds/rs_coords_full.rds") rs_coords_full_rds
Filtering the Data
<- "2021-01"
start_date <- "2022-12"
end_date <- "4 ROOM"
filter_flat_type <- filter(rs_coords_full_rds, flat_type == filter_flat_type) %>%
resale_flat_full_nogeo filter(month >= start_date & month <= end_date) %>%
select(2, 7, 11, 15:17, 19:38) %>%
rename("AREA_SQM" = "floor_area_sqm",
"LEASE_YRS" = "remaining_lease_mths",
"PRICE"= "resale_price",
"AGE"= "age",
"STOREY_ORDER" = "storey_order") %>%
relocate("PRICE") %>%
relocate(geometry, .after = last_col()) %>%
st_drop_geometry()
<- filter(rs_coords_full_rds, flat_type == filter_flat_type) %>%
resale_flat_full filter(month >= start_date & month <= end_date) %>%
select(2, 7, 11, 15:17, 19:38) %>%
rename("AREA_SQM" = "floor_area_sqm",
"LEASE_YRS" = "remaining_lease_mths",
"PRICE"= "resale_price",
"AGE"= "age",
"STOREY_ORDER" = "storey_order") %>%
relocate("PRICE") %>%
relocate(geometry, .after = last_col())
<- resale_flat_full %>% select(3:26) resale_flat_full_perdict
<- write_rds(as_Spatial(tail(resale_flat_full_perdict,10)), "data/rds/rs_coords_predict.rds") rs_coords_pred_rds
Creating a Linear Regression Model
<- lm(PRICE ~ AREA_SQM + LEASE_YRS + AGE + STOREY_ORDER + PROX_CBD + PROX_CHILDCARE + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_TOPPRISCH + PROX_MALL + PROX_SPRMKT + PROX_CLINIC + PROX_PHARMACY + PROX_TOURISM + PROX_LIBRARY + NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_ISP_CLIN + NUM_REGISTERED_PHARM + NUM_LIBRARIES, resale_flat_full_nogeo) resale_mlr1
Moran I Calculation
<- cbind(resale_flat_full,resale_mlr1$residuals) %>%
resale_res_sf rename(`MLR_RES` = resale_mlr1.residuals)
<- as_Spatial(resale_res_sf) resale_sp
Calculating Nearest Neighbour
<- dnearneigh(coordinates(resale_sp), 0, 2500, longlat = FALSE) nb
Calculating Neighborhood Matrix
<- nb2listw(nb, style = 'W', zero.policy = TRUE) nb_lw
Calculating Moran I
<- lm.morantest(resale_mlr1, nb_lw) moran_value
<- write_rds(moran_value, "data/rds/moran_value.rds") moran_value_rds
moran_value_rds
Generating Geographically Weighted Regression
Calculating Adaptive Bandwidth
<- bw.gwr(formula = PRICE ~ AREA_SQM + LEASE_YRS + AGE + STOREY_ORDER + PROX_CBD + PROX_CHILDCARE + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_TOPPRISCH + PROX_MALL + PROX_SPRMKT + PROX_CLINIC + PROX_PHARMACY + PROX_TOURISM + PROX_LIBRARY + NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_ISP_CLIN + NUM_REGISTERED_PHARM + NUM_LIBRARIES, data=resale_sp, approach="CV", kernel="gaussian",
bw_adaptive adaptive=TRUE, longlat=FALSE)
Building GWR Model
<- gwr.basic(formula = PRICE ~ AREA_SQM + LEASE_YRS + AGE + STOREY_ORDER + PROX_CBD + PROX_CHILDCARE + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_TOPPRISCH + PROX_MALL + PROX_SPRMKT + PROX_CLINIC + PROX_PHARMACY + PROX_TOURISM + PROX_LIBRARY + NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_ISP_CLIN + NUM_REGISTERED_PHARM + NUM_LIBRARIES,
gwr_adaptive data=resale_sp, bw=9121,
kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
<- write_rds(gwr_adaptive, "data/rds/gwr_adaptive.rds") gwr_adaptive_rds