a blog on spatial research and data visualization in R. by paul bidanset © 2013-15

Giving a Darn About Statistics: Baseball, Shark Attacks, and Green M&M’s


Trying to hide from statistics is tough. Believe me. I tried.

And I did a pretty decent job keeping the subject at bay until my early twenties. Much like grammar, statistics are everywhere (wait…statistics …is everywhere?).  It’s unavoidable. We see statistics in the news:

in the news…

and, well, it’s really all over the news.

With such an abundance of statistics being thrown around in today’s society, most people must have a firm grasp on their meaning, right? Could it be we don’t have as strong a grasp on statistics as we thought? Well, this certainly would explain why people still buy lottery tickets (odds of winning: 1:175,000,000) but are too scared of sharks to swim in the ocean (odds of an attack: 1:11,500,000).

Statistics education seems to lie dormant throughout grade school. Take me for example (pardon me while I make a widespread claim based solely upon my own experience – not very stats-like). Up until my junior year of high school, the only stats lessons I received were from gym class teachers and sports coaches.

And it didn’t get any better. My first real encounter with a statistics course in high school did everything it possibly could to prevent any sort of intuitive, applied, ‘real world’ relevance to the subject.  Distribution curves, t-tests, z-scores, and countless problems about flipping coins and green M&Ms didn’t win me over.

I did the bare minimum, and true story: I got a D in that class. On the last day, I literally got down on my knees and begged my teacher for a C- because I was applying to colleges. She did. I should send her some flowers.

Eventually senior college courses happened and my plague-like avoidance of the subject became a natural fascination. Fast forward a few years and I now do it for a living. What made the switch? Thanks to the dedication and passion of many professors, students, authors, and Wikipedia editors, I finally realized that I’d be hard pressed to find something MORE likely to be used ‘in the real world’ – a previous struggle that unfortunately fueled many years of academic apathy for me. This was my road to Damascus.

Statistics makes everything better. It equips us with the power not only to measure what’s going on, but also to monitor changes over time and, most importantly, to solve problems that arise.

It’s the driving force of modern medicine.  With clinical research, it helps doctors know what makes us better, and what makes us worse.

The car you ride in made it off the lot because each part fell within a certain range of acceptability. Statistics ensures safety.

Statistical tools can tell us whether or not public health campaigns are working. Do advertisements that highlight how bad tobacco is make people smoke less?

Businesses use statistics to measure customer happiness and to see what they can do better.

Making customers happy means more business.

More business means more people have jobs,  and more people earn money, to buy more … stuff.

Statistics even makes sports better! Games are more entertaining.  If teams drafted players with low batting averages and low RBIs, baseball would be more boring than it already is (if that’s possible!).

R is an amazing statistical tool to create, execute, visualize … and essentially solve (almost) all of the world’s problems. I’m looking forward to the rest of the year and continuing to hammer out some intuitive exercises with this blog. If you had some rough early encounters with stats or statistical programming, and they’ve left you with an unpleasant taste in your mouth, I urge you to reconsider and revisit this area. Statistics isn’t very hard; it’s just all about finding a teaching approach that makes things click for you, and I am here to *attempt* to provide that using a relevant, intuitive approach. If I had thrown in the towel after my first one (or three) terrible encounters with statistics teachers who may have been a little too dry – a little too abstract in their teachings for my personal learning style – I never would’ve gotten into this field. And man, do I love this field (in case you couldn’t tell from my crudely illustrated Paint renderings). Please sign up for my email list (at the very bottom) and even email me with some things you’d like to see on the site!

Statistics makes everything better (110%).

Shapefile Polygons Plotted on Google Maps Using ggmap in R – Throw some, throw some STATS on that map…(Part 2)


Well it’s been long enough since my last post. Had a few things on my plate (vacation, holidays, another holiday, some more holidays, and quite a lot of research). March is almost here but the good news is that I have plenty of work stored up to start serving out some intuitive approaches for learning R. Speaking of that…

In the hefty amounts of research I’ve been doing lately, I’ve come across many, MANY R-based blogs and tutorials. There are so many fantastic resources out there. But there are also a few not-so-good ones. Some code examples seem to confuse more than clarify. Scrolling to the bottom of a tutorial for a glance at the comments is usually a good way to gauge whether or not the audience received it well (I’ve also noticed that R-learners are much less negative and troll-like than the majority of those who comment on say, well, every other community-based website in the world). A couple tutorials don’t even include code. Unfortunately, I think often times these bloggers have ulterior motives (showing off their technical/statistical/whatever capacity), consequently flushing a graspable, empowering learning experience down the toilet.

With this site, I’m going to continue attempting to hammer on what is actually transpiring in R, ideally without dragging my feet and stagnating the more advanced users, using an intuitive approach so readers understand not just THAT something happens, but HOW and WHY it happens. This hopefully means they remember, are able to reproduce results, and ultimately grow in their learning. So please, keep the feedback coming (even if it is in troll form)!

Alright enough about me. Let’s pick up from where we left off.

For this post, I am going to show you how to plot, or overlay, the polygons of a shapefile on top of a Google Map. The polygons in this example will be of neighborhoods in the city of Baltimore.

The City of Baltimore is a kind of paragon when it comes to a municipality’s dissemination of public data. Rob Mealy’s amazing blog (on R and maps n’ stuff) first tipped me off to this website. I downloaded the neighborhood shapefiles (Neighborhood 2010.zip) from the site and unzipped the file to my C drive.

Now since we are going to be reading shapefiles into R, we need to install a package that is capable of doing so. There are several, but for this example we are going to use rgdal.

install.packages('rgdal')
library(rgdal)

Since I unzipped my shapefile data to my C drive, I am going to tell R it is from THERE I will be working. I set this as my working directory with:

setwd("C:/")

From now on during this session, R will automatically use this location to retrieve and save files, unless specifically told to do otherwise.
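To see what the working directory actually does, here is a minimal sketch. It uses a temporary folder as a stand-in for C:/ so it runs on any machine, and the file name example.txt is made up for illustration.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# Use a temporary folder as a stand-in for "C:/"
setwd(tempdir())

# A bare file name is now resolved against the working directory
writeLines("hello", "example.txt")
found_in_wd <- file.exists(file.path(tempdir(), "example.txt"))
print(found_in_wd)

# Restore the original working directory
setwd(old_wd)
```

The file was written with nothing but its name, yet it landed in the folder we pointed R at. That is all a working directory is: the default place R looks when you don't give it a full path.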

We read in the shapefile with:

Neighborhoods <- readOGR(".","nhood_2010")

We named the shapefile "Neighborhoods" (by typing the name to the left of <-). The first set of quotation marks in the command gives the location of the data. We already set the working directory to C: so the dot is telling R "slow your roll; you don't need to look any further". The second set of quotation marks gives the name of the layer, which in this case is "nhood_2010".
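If you're wondering where a layer name like "nhood_2010" comes from: for a shapefile it is simply the file name without the .shp extension. Here is a rough sketch of finding it. The folder and the empty stand-in file are created just for this illustration (it is not a real shapefile), so nothing here touches your actual data.

```r
# Build a throwaway folder with an empty stand-in file (not a real shapefile)
demo_dir <- file.path(tempdir(), "shapefile_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "nhood_2010.shp"))

# List the .shp files in that folder and drop the extension to get layer names
shp_files <- list.files(demo_dir, pattern = "\\.shp$")
layer_names <- tools::file_path_sans_ext(shp_files)
print(layer_names)
```

One caveat for anyone following along today: rgdal has since been retired from CRAN. As far as I know, sf::st_read(dsn = ".", layer = "nhood_2010") reads the same layer, but it returns an sf object rather than the sp object the rest of this post works with.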

Now we need to prepare our object so that it may be portrayed on a map. R doesn't know what to do with it in its current form. A few lines of code will transform this caterpillar into a beautiful, map-able butterfly. First, run:

Neighborhoods <- spTransform(Neighborhoods, CRS("+proj=longlat +datum=WGS84"))

spTransform allows us to convert and transform between different mapping projections and datums. This line of code is telling R to convert our Neighborhoods file to longitude/latitude projection and World Geodetic System 1984 datum - a global coordinate (GPS) system used by Google Maps (the initial object was set to a Lambert Conic Conformal projection and a NAD83 datum, as well as a GRS80 ellipsoid). This last bit of information is useful, but you really don't have to know exactly what it means. Just know that there are a bunch of various coordinate systems that die-hard geography nerds have created (for what I'm sure are good reasons), and all you have to do is smile and remember that we're essentially just converting our coordinates into a friendly format for integrating with Google Maps (I'm sure I'm going to get heat from one of those geography nerds for diluting it in this way).

Now the fortify command (from the package ggplot2) takes all that wonderful spatial data and converts it into a data frame that R understands how to put onto a map.

install.packages('ggplot2')
library(ggplot2)

Neighborhoods <- fortify(Neighborhoods)

Alright meow, we are going to take the map we previously created, BaltimoreMap, and add polygons outlining the neighborhoods from our shapefile. *Side Note: I keep the name of the object the same with each transformation I make. This is a preferential thing. As you are learning, you may wish to name each step differently (e.g. BaltimoreMap1, BaltimoreMap2; Neighborhoods1, Neighborhoods2) so you may go back and look at each one and understand the transformations that take place at each step, which also allows you to identify the area you messed up if you end up receiving an error message along the way.

And now we run the final code:

BaltimoreMap <- BaltimoreMap + geom_polygon(aes(x=long, y=lat, group=group), fill='grey', size=.2,color='green', data=Neighborhoods, alpha=0)
BaltimoreMap

Shapefile Polygons Plotted on Google Maps Using ggmap in R

There we have it. She looks good! Notice in the last command we specified that the name of our data is Neighborhoods. This is important. When we set x=long and y=lat, this isn't just us declaring that we want to use longitude and latitude for our projection; we are telling R that the coordinates for the x (horizontal) and y (vertical) axes of our plot (map) are stored in the columns of our data (Neighborhoods) called 'long' and 'lat', respectively.
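The group=group part deserves a quick look too. Here is a small sketch with a hand-made data frame: two toy squares with made-up coordinates, laid out with the same long, lat, and group columns that fortify() produces. It only uses base R, so it runs without any packages.

```r
# Two toy squares in the same column layout fortify() produces (values are made up)
toy <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c("0.1", "1.1"), each = 4)
)

# group tells the plot which rows belong to the same polygon
pieces <- split(toy, toy$group)
print(names(pieces))
print(sapply(pieces, nrow))

# With ggplot2 loaded, the same mapping as in the post would draw both squares:
# ggplot() + geom_polygon(aes(x = long, y = lat, group = group),
#                         data = toy, fill = NA, color = "green")
```

Without the group mapping, ggplot2 would treat all eight rows as one shape and connect the two squares with a stray line, which is why it matters for a shapefile with many neighborhoods.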

Now play around a bit with the various options for fill, size, color, and alpha (which is the level of transparency from 0 to 1, with the level of opaqueness increasing as you approach 1), as well as the various maptypes and zoom levels from part one. Next session we'll plot some values (more examples below). Thanks for reading!

Shapefile Polygons Plotted on Google Maps Using ggmap in R

All works on this site (spatioanalytics.com) are subject to copyright (all rights reserved) by Paul Bidanset, 2013-2014. All that is published is my own and does not represent my employers or affiliated institutions.

Throw some, throw some STATS on that map…(Part 1)


R is a very powerful and free (and fun) software package that allows you to do pretty much anything you could ever want. Someone told me that there’s even code that allows you to order pizza (spoiler alert: you actually cannot order pizza using R :( ). But if you’re not hungry, the statistical capabilities are astounding. Some people hate code; their brains shut down and they get sick when they look at it, subsequently falling to the floor restricted to the fetal position. I used to be that guy, but have since stood up, gained composure, sat back down, and developed a passion for statistical programming. I hope to teach the R language with some more intuition in order to keep the faint-of-heart vertical and well.

Alright so for the start in this series, I’m going to lay the foundation for a Baltimore, MD real estate analysis and demonstrate some extremely valuable spatial and statistical functions of R. So without too much blabbing, let’s jump in…

For those of you completely new to R, its interface allows you to download different packages which perform different functions. People use R for so many different data-related reasons, and the inclusion of all or most of the packages would be HUGE, so each one, housed in various servers located around the world, can be downloaded simply. For the first-time use of each package, you’ll need to install it. They will then be on your machine and you will simply load them for each future use.

For the initial map creation, we need to install the following (click Packages -> Install Package(s); holding Ctrl allows you to select multiple ones at a time):

foreign
RgoogleMaps
ggmap
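If you'd rather not click through the menu, the same thing can be done from a script. This is only a sketch: missing_packages() is a small helper I made up for illustration (it is not part of R), and the install line is left commented out so nothing gets downloaded until you decide to run it.

```r
# Hypothetical helper: return the packages from a list that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("foreign", "RgoogleMaps", "ggmap")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to actually download and install whatever is missing:
# install.packages(to_install)
```

install.packages() accepts several names at once, so one line covers all three packages.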

Since these are now installed on our machine, we simply load these packages each session we use them. Loading just ggmap and RgoogleMaps will automatically load the others we just downloaded. With each session, open a script and once you’ve written out your code, highlight it and right-click “Run Line or Selection,” or just press Ctrl+R. A quick note: unlike some other languages such as SAS and SQL, R is case sensitive.
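Case sensitivity trips up a lot of newcomers, so here is a tiny sketch of what it means in practice. The object name MyCity is made up for this example.

```r
MyCity <- "Baltimore, MD"

# Same letters, different capitalization: R treats these as different names
exact_match <- exists("MyCity")
wrong_case  <- exists("mycity")
print(c(exact_match = exact_match, wrong_case = wrong_case))
```

If you ever get an "object not found" error for something you know you created, a stray capital letter is a good first suspect.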

To load, run:

library(ggmap)
library(RgoogleMaps)

We will store the map center in an object called CenterOfMap. Anything to the left of “<-” in R is the name of the object and anything to the right is its contents. Now for the map we’re using, the shape of Baltimore behaves pretty well, so we can just type “Baltimore, MD” within the geocode() command (R is smart and that’s all it takes).

CenterOfMap <- geocode("Baltimore, MD")

Not all areas are as symmetrically well behaved as Baltimore, and for other cases, my preferred method of centrally displaying an area's entirety begins with entering the lat/long coordinates of your preferred center. For this, I go to Google Maps, find the area I wish to map, right-click on my desired center, click "What's here?", and take the lat/long coordinates that then populate the search bar above. For Baltimore, I'm going to click just north of the harbor.

The code would then look like this:

CenterOfMap <- geocode(" 39.299768,-76.614929")
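Since we already have the numbers in hand, another option is to skip the lookup and type them in directly. This is a sketch that assumes the map code only needs a lon value and a lat value from CenterOfMap, which is how the object is used in the rest of this post. Watch the order: Google shows latitude first, while the columns here are named explicitly so there is no mix-up.

```r
# Same point as above, typed in directly instead of looked up
CenterOfMap <- data.frame(lon = -76.614929, lat = 39.299768)

# The dollar sign pulls a single column out of the object
print(CenterOfMap$lon)
print(CenterOfMap$lat)
```

This version needs no internet connection, which is handy if the geocoding service is slow or unavailable.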

Now that we've told R where the center of our map will be, let's make a map! So remember, left of the "<-" will be our name. I'd say naming the map 'BaltimoreMap' will do.

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12, maptype = "terrain", source = "google")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap

Alright, to explain what just happened, get_map() is the command to construct the map perimeters and lay down its foundation. I'm going to retype the code with annotations that will hopefully explain it more intuitively.

get_map(
  # lon: the longitude coordinate of the CenterOfMap object we created. The dollar sign
  # shows that what follows is part of what comes before it, for example ExcelSpreadsheet$ColumnA.
  # lat: the latitude coordinate of the CenterOfMap object we created.
  c(lon = CenterOfMap$lon, lat = CenterOfMap$lat),
  # zoom: the zoom level of the map display. Play around with this and see how it changes
  # moving from, say, 5 to 15 (ggmap accepts whole numbers from 3 to 21).
  zoom = 12,
  # maptype: we assigned "terrain" but there are others to suit your tastes. Will show more later.
  maptype = "terrain",
  # source: we assigned "google" but there are other providers of mapping data.
  source = "google"
)

And the grand unveiling of the first map...

Now that is one good lookin' map. Just a few lines of code, too.

I'll show you some other ways to manipulate it. I often like to set the map to black & white so the contrast (or lack thereof) of the values plotted later is more defined. I prefer the Easter bunny/night club/glow-in-the-dark type spectrums, and so I usually plot on the following:

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12, maptype = "toner", source = "stamen")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap

We just set the night sky for the meteor shower. Notice that all we did was change maptype from "terrain" to "toner," and source from "google" to "stamen."

A few other examples:

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12,source = "osm")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap

This map looks great but it's pretty busy - probably not the best to use if you will be plotting a colorful array of values later.

Here's a fairly standard looking one, similar to Google terrain we covered above.

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom=12)
BaltimoreMap <- ggmap(Baltimore, extent="normal")
BaltimoreMap

And one for the hipsters...

Baltimore <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat),zoom = 12, maptype = "watercolor",source = "stamen")
BaltimoreMap <- ggmap(Baltimore)
BaltimoreMap

George Washington and the cartographers of yesteryear would be doing cartwheels if they could see this now. The upcoming installments in this series will cover:

1) Implementing Shapefiles and GIS Data
2) Plotting Statistics and other Relationship Variables on the Maps
3) Analyzing Real Estate Data and Patterns of Residential Crime and Housing Prices

Thanks for reading this! If you have any problems with coding or questions whatsoever, please shoot me an email (pbidanset[@]gmail.com) or leave a comment below and I'll get back to you as my schedule permits (should be quickly). Cheers.