Adding Google Drive Times and Distance Coefficients to Regression Models with ggmap and sp

Space, a wise man once said, is the final frontier.

Not the Buzz Aldrin/Buzz Lightyear, Neil deGrasse Tyson kind (but seriously, have you seen Cosmos?). Geographic space. Distances have been finding their way into metrics since the cavemen (probably). GIS seems to make nearly every science way more fun…and accurate!

Most of my research deals with spatial elements of real estate modeling. Unfortunately, “location, location, location” has become a clichéd way to begin any paper or presentation pertaining to spatial real estate methods. For you geographers, it’s like setting the table with Tobler’s first law of geography: a quick fix (I’m not above that), but you’ll get some eye-rolls. But location is important!

One common method of taking location and space into account in real estate valuation models is to include distance coefficients (e.g. distance to downtown, distance to the center of the city). The geographers have the straight-line calculation of distance covered, and R can spit out distances between points in a host of measurement systems (Euclidean, great circle, etc.). A straight-line distance coefficient is a helpful tool when you want to reduce some spatial autocorrelation in a model, but it doesn't always tell the whole story by itself. A quick note before we go on: the purpose of this post is to focus on the tools of R and to introduce elements of spatial consideration into modeling, so I'm purposefully avoiding any lengthy discussion of spatial econometrics or other spatial modeling techniques. If you would like to learn more about the sheer awesomeness that is spatial modeling, as well as the pitfalls and pros and cons of each approach, check out Luc Anselin and Stewart Fotheringham for starters. I also have papers being published this fall and would be more than happy to forward you a copy if you email me. They are:

Bidanset, P. & Lombard, J. (2014). The effect of kernel and bandwidth specification in geographically weighted regression models on the accuracy and uniformity of mass real estate appraisal. Journal of Property Tax Assessment & Administration. 11(3). (copy on file with editor).

and

Bidanset, P. & Lombard, J. (2014). Evaluating spatial model accuracy in mass real estate appraisal: A comparison of geographically weighted regression (GWR) and the spatial lag model (SLM). Cityscape: A Journal of Policy Development and Research. 16(3). (copy on file with editor).

Straight-line distance coefficients certainly can help account for location, as well as certain distance-based effects on price. Say you are trying to model the negative externalities of a landfill in August: assuming wind is either random or non-existent, straight-line distance from the landfill to house sales could help capture the cost of said stank. Likewise with capturing the potential spill-over effects of an airport – the sound of jets will diminish as distance increases, and the path of the sound will be more or less a straight line.

But again, certain distance-based elements cannot be accurately represented with this method. You may expect 'distance to downtown' to have an inverse relationship with price: the further out you go, the more of a cost is incurred (in time, gas, and overall inconvenience) getting to work and social activities, so demand for these further-out homes decreases, resulting in cheaper-priced homes (pardon the hasty economics). Using straight-line distances to account for commute in a model presents some problems (aside: there is nary a form of visualization capable of presenting one's point more professionally than Paint, and as anyone who has ever had the misfortune of being included in a group email chain with me knows, I am a bit of a Paint artist). If a trip between a person's work and a person's home followed a straight line, this would be less of a problem (artwork below).

[commute1]

But we all know commuting is more complicated than this. There could be a host of things between you and your place of employment that would make a straight-line distance coefficient an inept method of quantifying this effect on home values … such as a lake:

[commute2]

… or a Sarlacc pit monster:

[commute3]

Some cutting edge real estate valuation modelers are now including a ‘drive time’ variable. DRIVE TIME! How novel is that? This presents a much more accurate way to account for a home’s distance – as a purchaser would see it – from work, shopping, mini-golf, etc. Sure it’s been available in (expensive) ESRI packages for some time, but where is the soul in that? The altruistic R community has yet again risen to the task.

To put some real-life spin on the example above, let’s run through a very basic regression model for modeling house prices.

sample = read.csv("C:/houses.csv", header=TRUE)
model1 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition, data = sample)

We read in a csv file "houses" that is stored on the C:/ drive and name it "sample". You can name it anything, even willywonkaschocolatefactory. We'll name the first model "model1". The dependent variable, ln.ImpSalePrice., is the log form of the sale price. TLA is 'total living area' in square feet. Age is, well, the age of the house, and quality and condition are dummy variables. The squared versions of TLA and Age are there to capture any diminishing marginal returns.
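If your data doesn't already contain the squared terms as columns, they take only a line each to create (assuming, as the names suggest, that they're simply TLA and Age squared):

sample$TLA.2 <- sample$TLA^2   # squared living area
sample$Age.2 <- sample$Age^2   # squared age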

AIC stands for 'Akaike information criterion'. A Japanese statistician, Hirotugu Akaike, coined it in the 1970s, and it's a goodness-of-fit measure for comparing models fit to the same sample (the lower the AIC, the better).

> AIC(model1)
[1] 36.35485

The AIC of model1 is 36.35.
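For the curious, AIC is just -2 times the model's log-likelihood plus 2 times the number of estimated parameters, so you can reproduce R's number by hand:

logLik(model1)                                                  # the model's log-likelihood (and its df)
-2*as.numeric(logLik(model1)) + 2*attr(logLik(model1), "df")    # matches AIC(model1)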

Now we are going to create some distance variables to add to the model. First we'll do the straight-line distances. We make a matrix called "origin" consisting of start points, which in this case are the coordinates of each house in our dataset. One gotcha: sp expects points in (longitude, latitude) order when longlat=TRUE, so longitude goes in the first column.

origin <- matrix(c(sample$lon, sample$lat), ncol=2)   # longitude in the first column, latitude in the second

We next create a destination – the point to which we will be measuring the distance. For this example, I decided to measure the distance to a popular shopping mall downtown (why not?). I obtained the coordinates for the mall by right-clicking on it in Google Maps and clicking "What's here?" (we also could have geocoded it in R; more on that below).

destination <- c(-76.288018, 36.84895)   # again (longitude, latitude)
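If you'd rather stay in R for that step, ggmap's geocode() will hand you the same pair of coordinates (the mall name below is just a placeholder for whatever destination you're after). It returns lon and lat columns, which slot straight into the (longitude, latitude) order sp wants:

mall <- geocode("MacArthur Center, Norfolk, VA")   # hypothetical destination name
destination <- c(mall$lon, mall$lat)               # (longitude, latitude), as above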

Now we use the spDistsN1 function from the sp package to calculate the distances. Setting longlat=TRUE returns the distance from each origin to the destination in kilometers. The second line just adds this newly created column of distances to our dataset and names it dist.

library(sp)

km <- spDistsN1(origin, destination, longlat=TRUE)
sample$dist <- km

This command I learned from a script on Github – initially committed by Peter Schmiedeskamp – which alerted me to the fact that R was capable of grabbing drive-times from the Google Maps API.  You can learn a great deal from his/their work so give ‘em a follow!

library(ggmap)
library(plyr)

google_results <- rbind.fill(apply(subset(sample, select=c("location", "locMall")), 1, function(x) mapdist(x[1], x[2], mode="driving")))

location is the column containing each house's lat/long coordinates, in the following format: (36.841287,-76.218922). locMall is a column in my data set with the lat/long coordinates of the mall in each row. Just to clarify: every cell in locMall has the exact same value, while each cell of location is different. Also something amazing: mode can be "driving," "walking," or "bicycling"!
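If your data only has separate lat and lon columns to start with, those two character columns are quick to build. A rough sketch (Google's API is happy with plain "lat,lon" strings, so I'm skipping the parentheses here):

sample$location <- paste(sample$lat, sample$lon, sep=",")   # one "lat,lon" string per house
sample$locMall  <- "36.848950,-76.288018"                   # the same mall coordinates in every row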

Now let’s look at the results:

> head(google_results,4)
           from                        to          m     km    miles seconds minutes
1 (36.901373,-76.219024) (36.848950, -76.288018) 10954 10.954 6.806816 986 16.433333
2 (36.868871,-76.243859) (36.848950, -76.288018) 7279 7.279 4.523171 662 11.033333
3 (36.859805,-76.296122) (36.848950, -76.288018) 2101 2.101 1.305561 301 5.016667
4 (36.938692,-76.264474) (36.848950, -76.288018) 12844 12.844 7.981262 934 15.566667
    hours
1 0.27388889
2 0.18388889
3 0.08361111
4 0.25944444

Amazing, right? And we can add these new columns to our sample and name the result "newsample":

newsample <- cbind(sample, google_results)   # append the Google results as new columns

Now let’s add these variables to the model and see what happens.

model2 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition + dist, data = newsample)
> AIC(model2)
[1] 36.44782

Gah, well, no significant change. Hmm…let’s try the drive-time variable…

model3 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition + minutes, data = newsample)
> AIC(model3)
[1] 36.10303

Hmm…still no dice. Let’s try them together.

model4 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition + minutes + dist, data = newsample)
> AIC(model4)
[1] 32.97605

Alright! The AIC has dropped by more than 2 relative to model1, a difference that's generally taken as meaningful evidence of better fit, so together the two distance variables are earning their keep.
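A handy way to line all four candidates up at once: AIC() happily accepts multiple fitted models and returns a little table of their degrees of freedom and AIC values.

AIC(model1, model2, model3, model4)   # one row per model: df and AIC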

Of course this is a grossly reduced model, and would never be used for actual valuation/appraisal purposes, but it does lay elementary ground work for creating distance-based variables, integrating them, and demonstrating their ability to marginally improve models.

Thanks for reading. So to bring back Cap'n Kirk: I think a frontier more ultimate than space, in the modeling sense, is space-time – not Einstein's, but rather the 'spatiotemporal' kind. That will be for another post!

Toodles,

Paul

Creating an Interactive Map of Craft Breweries in VA Using the plotly R Package

Well folks, another new year’s resolution down the drain. I was initially shooting for a post each month for 2014. More projects came. Plates were full. Plates were emptied. More plates were filled again. I think I will just alter my resolution to 12 posts this year. That’s a fair compromise with myself, right? That’s what we Americans do. Needless to say, it will likely be a busy last week of December for me.

I'm taking a short break from the previous series to share a great data visualization platform I stumbled upon called plotly. There is even an R package that allows you to feed data directly to their site for further analysis and manipulation. Blew my mind and I had to share. Anyway, check out their site for some mesmerizing graphics and data visualization capabilities!

This post is based on a guest blog post by Matt Sundquist of plotly on Corey Chivers' blog bayesianbiologist. I tweaked the code only slightly to accommodate my data and added a geocoding section. Other than that, they are the masterminds.

Alright so with the obvious boom of craft breweries here in Virginia (and well, across the country), I thought I’d be well-received doing a post on two of my favorite things: geographic data visualization and booze.

First off, in order to harness the great powers of plotly, you must register at https://plot.ly/ for your own account. Next, we install the package that will allow us to connect from R to our fresh, new plotly account.

install.packages("devtools")
library("devtools")
devtools::install_github("R-api","plotly")

After loading the packages, we can log in to our plotly account straight from R by typing in our respective username and API key (to obtain your API key, log in to plot.ly via your web browser, click Profile > Edit Profile, and you will see it).

library(plotly)
library(maps)
p <- plotly(username="bobdole", key="abcbaseonme") 

For my data set of craft brewery locations in Virginia, I queried a data set of current brewery licensees in the state from the Virginia Department of Alcoholic Beverage Control website. I then removed the 'big guys' (sorry, this Bud is not for you) and aggregated the count of breweries by city/town, saving the result as a .csv file.
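I did the filtering and counting outside of R, but the aggregation itself is only a couple of lines if you'd rather start from the raw licensee file (the file name and column names below are made up for illustration):

licensees <- read.csv("C:/abc_licensees.csv", header=TRUE)       # hypothetical raw file, one row per licensed brewery
breww <- aggregate(Licensee ~ City, data=licensees, FUN=length)  # count breweries per city/town
names(breww)[2] <- "No"                                          # the count column used as data$No below
write.csv(breww, "C:/breww.csv", row.names=FALSE)

Now we read in our data: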

data = read.csv("C:/breww.csv", header=TRUE)

Matt’s data already had location coordinates. Since mine only has the respective city/state, I need to geocode it so R will understand how to plot locations on the map. For this I am using the ever-faithful ggmap package.

We named the sheet "data" when we read it in, and the column that holds the city/state of each brewery is called "City". We can now batch geocode each city. The function geocode() returns an m x 2 data frame, where m is the number of rows of data (cities) and the two columns are the longitude (default column name lon) and latitude (default column name lat) of each respective city. We create two new columns in our data set and set them equal to the two columns of the data frame loc we just created.

library(ggmap)
loc <- geocode(as.character(data$City))
data$lon<-loc$lon
data$lat<-loc$lat

We call the state outlines using the map() function, take its xy coordinates, and assign this as the first trace for plotting the map.

trace1 <- list(x=map("state")$x,
               y=map("state")$y)

We then create the second trace by extracting the longitude and latitude from our data (assigning as x and y plots, respectively). We specify that the size of the bubbles on the map is based on data$No (i.e. bigger bubble, more breweries), which is the column containing the number of breweries in each respective city.

trace2 <- list(x= data$lon,
               y=data$lat,
               text=data$City,
               type="scatter",
               mode="markers",
               marker=list(
                 "size"=sqrt(data$No/max(data$No))*100,
                 "opacity"=0.5)
)

Finally, we combine the two traces and send our data to our plotly profile.

response <- p$plotly(trace1,trace2)
url <- response$url
filename <- response$filename
browseURL(response$url)

Like magic, running the last code will open your browser and load your fancy new map in the plot.ly interface, ready for you to zoom, crop, and manipulate to your heart’s content!

Map Browser Interface

Static shots can also be exported at very high resolutions from the plotly site:

Craft Breweries in Virginia via R & plotly

Maps like this can mislead, though: they often just reflect population rather than a higher propensity to consume craft beer. More people in an area (i.e. Richmond, DC, Virginia Beach) means the capacity and demand for more breweries overall. A 'craft breweries per capita' map would arguably tell a more interesting story.
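A minimal sketch of that tweak, assuming you added a Pop column of city populations to the data (which my file doesn't have):

data$PerCapita <- data$No / data$Pop * 10000                            # breweries per 10,000 residents
trace2$marker$size <- sqrt(data$PerCapita / max(data$PerCapita)) * 100  # rescale the bubbles
response <- p$plotly(trace1, trace2)                                    # re-send to plotly

Thanks for reading!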

Presenting Paper @ URISA and IAAO 18th GIS/CAMA Technologies Conference, Feb 24-27, Jacksonville, Florida

I will be attending as well as presenting an original paper at this conference next week. Come say hello if you’ll be there!

“Learning more about Geographically Weighted Regression: Optimal Spatial Weighting Functions Used in Mass Appraisal of Residential Real Estate” by Paul E. Bidanset and John R. Lombard

http://www.urisa.org/gis-cama-technologies-conference/

Shapefile Polygons Plotted on Google Maps Using ggmap in R – Throw some, throw some STATS on that map…(Part 2)

Well it’s been long enough since my last post. Had a few things on my plate (vacation, holidays, another holiday, some more holidays, and quite a lot of research). March is almost here but the good news is that I have plenty of work stored up to start serving out some intuitive approaches for learning R. Speaking of that…

In the hefty amounts of research I’ve been doing lately, I’ve come across many, MANY R-based blogs and tutorials. There are so many fantastic resources out there. But there are also a few not-so-good ones. Some code examples seem to confuse more than clarify. Scrolling to the bottom of a tutorial for a glance at the comments is usually a good way to gauge whether or not the audience received it well (I’ve also noticed that R-learners are much less negative and troll-like than the majority of those who comment on say, well, every other community-based website in the world). A couple tutorials don’t even include code. Unfortunately, I think often times these bloggers have ulterior motives (showing off their technical/statistical/whatever capacity), consequently flushing a graspable, empowering learning experience down the toilet.

With this site, I'm going to continue attempting to hammer on what is actually transpiring in R, ideally without dragging my feet and stalling the more advanced users, using an intuitive approach so readers understand not just THAT something happens, but HOW and WHY it happens. This hopefully means they remember, are able to reproduce results, and ultimately grow in their learning. So please, keep the feedback coming (even if it is in troll form)!

Alright enough about me. Let’s pick up from where we left off.

For this post, I am going to show you how to plot, or overlay, the polygons of a shapefile on top of a Google Map. The polygons in this example will be of neighborhoods in the city of Baltimore.

The City of Baltimore is a kind of paragon when it comes to a municipality’s dissemination of public data. Rob Mealy’s amazing blog (on R and maps n’ stuff) first tipped me off to this website. I downloaded the neighborhood shapefiles (Neighborhood 2010.zip) from the site and unzipped the file to my C drive.

Now since we are going to be reading shapefiles into R, we need to install a package that is capable of doing so. There are several, but for this example we are going to use rgdal.

install.packages('rgdal')
library(rgdal)

Since I unzipped my shapefile data to my C drive, I am going to tell R it is from THERE I will be working. I set this as my working directory with:

setwd("C:/")

From now on during this session, R will automatically use this location to retrieve and save files, unless specifically told to do otherwise.

We read in the shapefile with:

Neighborhoods <- readOGR(".", "nhood_2010")

We named the shapefile "Neighborhoods" (by typing the name to the left of <-). The first set of quotations in the command is looking for the location of the data. We already set the working directory to C:/, so the dot is telling R "slow your roll; you don't need to look any further". The second set of quotations is looking for the name of the layer, which in this case is "nhood_2010".

Now we need to prepare our object so that it may be portrayed on a map. R doesn’t know what to do with it in its current form. A few lines of code will transform this caterpillar into a beautiful, map-able butterfly. First, run:

Neighborhoods <- spTransform(Neighborhoods, CRS("+proj=longlat +datum=WGS84"))

spTransform allows us to convert and transform between different mapping projections and datums. This line of code is telling R to convert our Neighborhoods file to longitude/latitude projection and World Geodetic System 1984 datum – a global coordinate (GPS) system used by Google Maps (the initial object was set to a Lambert Conic Conformal projection and a NAD83 datum, as well as a GRS80 ellipsoid). This last bit of information is useful, but you really don’t have to know exactly what it means. Just know that there are a bunch of various coordinate systems that die-hard geography nerds have created (for what I’m sure are good reasons), and all you have to do is smile and remember that we’re essentially just converting our coordinates into a friendly format for integrating with Google Maps (I’m sure I’m going to get heat from one of those geography nerds for diluting it in this way).
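If you want to peek at the coordinate system attached to the object before or after converting, sp will print it for you:

proj4string(Neighborhoods)   # shows the object's current projection/datum string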

Now the fortify command (from the ggplot2 package) takes all that wonderful spatial data and converts it into a data frame that R knows how to put onto a map.

install.packages('ggplot2')
library(ggplot2)

Neighborhoods <- fortify(Neighborhoods)

Alright meow, we are going to take the map we previously created, BaltimoreMap, and add polygons outlining the neighborhoods from our shapefile. Side note: I keep the name of the object the same with each transformation I make. This is a preferential thing. As you are learning, you may wish to name each step differently (e.g. BaltimoreMap1, BaltimoreMap2; Neighborhoods1, Neighborhoods2) so you can go back, look at each one, and understand the transformation that takes place at each step; this also lets you identify where you messed up if you end up receiving an error message along the way.

And now we run the final code:

BaltimoreMap <- BaltimoreMap + geom_polygon(aes(x=long, y=lat, group=group), fill='grey', size=.2,color='green', data=Neighborhoods, alpha=0)
BaltimoreMap

Shapefile Polygons Plotted on Google Maps Using ggmap in R

There we have it. She looks good! Notice in the last command we specified that the name of our data is Neighborhoods. This is important. When we set x=long and y=lat, this isn’t just us declaring that we want to use longitude and latitude for our projection; we are telling R that the coordinates for the x (horizontal) and y (vertical) axis of our plot (map) are stored in the columns of our data (Neighborhoods) called ‘long’ and ‘lat’, respectively.
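A quick head() call shows exactly which columns geom_polygon is pulling from:

head(Neighborhoods)   # fortified polygons: long, lat, order, hole, piece, id, group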

Now play around a bit with the various options for fill, size, color, and alpha (which is the level of transparency from 0 to 1, with the level of opaqueness increasing as you approach 1), as well as the various maptypes and zoom levels from part one. Next session we’ll plot some values (more examples below). Thanks for reading!

[Three more example maps: shapefile polygons plotted on Google Maps using ggmap in R, with varied fill, color, and basemap settings]


Throw some, throw some STATS on that map…(Part 1)

R is a very powerful and free (and fun) software package that allows you to do pretty much anything you could ever want. Someone told me that there's even code that allows you to order pizza (spoiler alert: you actually cannot order pizza using R :( ). But if you're not hungry, the statistical capabilities are astounding. Some people hate code; their brains shut down and they get sick when they look at it, subsequently falling to the floor restricted to the fetal position. I used to be that guy, but I have since stood up, gained composure, sat back down, and developed a passion for statistical programming. I hope to teach the R language with some more intuition in order to keep the faint-of-heart vertical and well.

Alright so for the start in this series, I’m going to lay the foundation for a Baltimore, MD real estate analysis and demonstrate some extremely valuable spatial and statistical functions of R. So without too much blabbing, let’s jump in…

For those of you completely new to R, its interface allows you to download different packages that perform different functions. People use R for so many different data-related reasons that bundling all or most of the packages into one install would be HUGE, so each package is housed on servers around the world and can be downloaded on its own. The first time you use a package you'll need to install it; after that it lives on your machine and you simply load it each future session.

For the initial map creation, we need to install the following (click Packages -> Install package(s), and holding Ctrl allows you to select multiple ones at a time; a console one-liner that does the same thing follows this list):

foreign
RgoogleMaps
ggmap
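Or, equivalently, straight from the console:

install.packages(c("foreign", "RgoogleMaps", "ggmap"))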

Since these are now installed on our machine, we simply load these packages each session we use them. Loading just ggmap and RgoogleMaps will automatically load the others we just downloaded. With each session, open a script and once you've written out your code, highlight it and right-click "Run Line or Selection," or just press Ctrl+R. A quick note: unlike other programming languages like SAS and SQL, R is case sensitive.

To load them, run:

library(ggmap)
library(RgoogleMaps)

We will specify the object of the map center as CenterOfMap. Anything to the left of <- in R is the name of the object and anything to the right is its specified contents. Now for the map we're using, the shape of Baltimore behaves pretty well, so we can just type "Baltimore, MD" inside the geocode() command (R is smart and that's all it takes).

CenterOfMap <- geocode("Baltimore, MD")

Not all areas are as symmetrically well behaved as Baltimore, and for other cases, my preferred method of centrally displaying an area's entirety begins with entering the lat/long coordinates of your preferred center. For this, I go to Google Maps, find the area I wish to map, right-click on my desired center, click "What's here?", and take the lat/long coordinates that then populate in the search bar above. For Baltimore, I'm going to click just north of the harbor.

The code would then look like this:

CenterOfMap <- geocode("39.299768,-76.614929")

Now that we've told R where the center of our map will be, let's make a map! So remember, whatever is to the left of the <- will be our name. I'd say naming the map 'BaltimoreMap' will do.

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12, maptype = "terrain", source = "google")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap 

Alright, to explain what just happened: get_map() is the command that constructs the map perimeter and lays down its foundation. Here is the same call again with each argument spelled out more intuitively.

The get_map() arguments, one by one:

lon = the longitude coordinate of the CenterOfMap object we created (the dollar sign means what follows belongs to what comes before it, e.g. ExcelSpreadsheet$ColumnA)
lat = the latitude coordinate of the CenterOfMap object we created
zoom = the zoom level of the map display (play around with this and see how it changes moving from, say, 5 to 25)
maptype = we assigned "terrain", but there are others to suit your tastes and preferences (more on this later)
source = we assigned "google", but other providers offer mapping data too

And the grand unveiling of the first map…

Now that is one good lookin’ map. Just a few lines of code, too.

I'll show you some other ways to manipulate it. I often set the map to black & white so the contrast (or lack thereof) of the values plotted later is more defined. I prefer the Easter bunny/night club/glow-in-the-dark type spectrums, so I usually plot on the following:

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12, maptype = "toner", source = "stamen")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap 

We just set the night sky for the meteor shower. Notice that all we did was change maptype from “terrain” to “toner,” and source from “google” to “stamen.”

A few other examples:

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12,source = "osm")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap

This map looks great but it’s pretty busy – probably not the best to use if you will be plotting a colorful array of values later.

Here’s a fairly standard looking one, similar to Google terrain we covered above.

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom=12)
BaltimoreMap <- ggmap(Baltimore, extent="normal")
BaltimoreMap

And one for the hipsters…

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12, maptype = "watercolor",source = "stamen")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap

George Washington and the cartographers of yesteryear would be doing cartwheels if they could see this now. The upcoming installments in this series will cover:

1) Implementing Shapefiles and GIS Data
2) Plotting Statistics and other Relationship Variables on the Maps
3) Analyzing Real Estate Data and Patterns of Residential Crime and Housing Prices

Thanks for reading this! If you have any problems with coding or questions whatsoever, please shoot me an email (pbidanset[@]gmail.com) or leave a comment below and I'll get back to you as my schedule permits (should be quickly). Cheers.

All works on this site (spatioanalytics.com) are subject to copyright (all rights reserved) by Paul Bidanset, 2013-2014. All that is published is my own and does not represent my employers or affiliated institutions.
