Friday, May 23, 2014

Housing Cost Comparison Tool with Zillow and R

Thinking of moving?

Using some of my favorite R packages and Zillow data I built a comparison function for median house price between two localities.


Zillow posts data outputs or derivations of its Zestimate home price model to s series of .csv. This data is freely available for researchers and includes many breakouts for home and rental values as well as some very interesting housing market metrics (price to rent, sold for loss etc.).  Enhancing the value of these series is the level of geographic detail available.  Beyond the state and county levels, zipcode and neighborhood detail levels are also available.  Zillow even gives access to the shapefiles of defined neighborhoods, enabling work with GIS tools, R. or other languages.  An R package has also been developed to work with the free API the site provides.


For some exploration of this data with R I chose a set which uses an easily consumed nominal value; Median Sale Price.

I began by reading the Median sale Price .csv into a data frame:

library(reshape)
library(quantmod)
 
zipSale <- read.csv(file="http://files.zillowstatic.com/research/public/Zip/Zip_MedianSoldPrice_AllHomes.csv", header=T)
Created by Pretty R at inside-R.org


What I found by a [1:4,] head view is that the Zillow time series data was defined in a columnar format, not as a friendly grouping and value format.  To handle this I used the melt function from the ever useful reshape package.  Using melt I created a "value" variable and assigned the month columns to a single "date_val" grouping variable.  it was especially helpful to know that the first five variables in the Zillow home price data are not the month variable data and can be excluded from the melt reshaping by assigning them to the "id" variables (similar to df[,(-1:5)].  All other variables (the Zillow data months) will be melted as "measure" variables by default.  I also created a date type value from the YYYY-MM that Zillow provides.

zipSale<- melt(zipSale,id=c(1:5),variable_name="date_val")
zipSale$rep_month <- as.Date(paste(
  substr(zipSale$date_val,2,5),"-",
  substr(zipSale$date_val,7,9),"-",
  "01", sep="")
)
Created by Pretty R at inside-R.org


 Now, with an easily manipulated set I used another of my favorite packages, quantmod, to create some charts in the style of standard Bloomberg output.

To do this I first had to subset the frames into XTS objects of the zoo construct.  After some charting I decided to divide the sales price values by 1000 better formatting.  For my initial pass I chose my hometown and the zip of my current address:


lpzipSale <- zipSale[zipSale$RegionName=="20646",]
lpXTS <- xts(lpzipSale$value/1000, order.by=lpzipSale$rep_month)
axzipSale <- zipSale[zipSale$RegionName=="22314",]
axXTS <- xts(axzipSale$value/1000, order.by=axzipSale$rep_month)
Created by Pretty R at inside-R.org

Now with the two sets I created a composite with which I could create a global range and do some other testing:
lpax <- complete.cases(c(lpXTS,axXTS))
lpax <- c(lpXTS,axXTS)[lpax,]
Created by Pretty R at inside-R.org

I then used these XTS objects with the chartSeries() to plot 20646 and 22314:

chartSeries(lpXTS, TA="addTA(axXTS, on=1)", yrange = (c(min(lpax),max(lpax))), name = "20646 vs. 22314 Median Prices")
Created by Pretty R at inside-R.org


Expensive area...

From here I felt it would be best to create a function to abstract the chart creation.

Other than some formatting changes I largely just re-purposed the above code using two zipcode value arguments.  I also used these values for the chart title, for some simple dynamic formatting.  The full code is:

zipCompare <- function(zip1, zip2) {
 
  require(quantmod)
  zip1df <- zipSale[zipSale$RegionName==zip1,]
  zip1XTS <- xts(as.integer(zip1df$value)/1000, order.by=zip1df$rep_month)
  zip2df <- zipSale[zipSale$RegionName==zip2,]
  last2 <- xts(as.integer(zip2df$value)/1000, order.by=zip2df$rep_month)
 
  zip12 <- complete.cases(c(zip1XTS,last2))
  zip12 <- c(zip1XTS,last2)[zip12,]
 
  lineChart(zip1XTS, name = paste(zip1, "vs.", zip2, "Median Sales Price"), yrange = c(min(zip12),max(zip12)))
  addTA(last2, on=1)
 
}
Created by Pretty R at inside-R.org

Here I compared my home and work:
zipCompare("22314", "22101")


I also tested these scripts using the county geographic level. With the naming conventions of the Zillow url and file format conventions this was very easy to do.  To add additional value I think exploring other plotting options would be best.  While I am partial to the aesthetic of chartSeries() it does have flaws when dealing with this type of data, specifically for axes and attribute naming.


Full code for data prep:

library(reshape)
library(quantmod)
 
zipSale <- read.csv(file="http://files.zillowstatic.com/research/public/Zip/Zip_MedianSoldPrice_AllHomes.csv", header=T)
 
zipSale<- melt(zipSale,id=c(1:5),variable_name="date_val")
zipSale$rep_month <- as.Date(paste(
  substr(zipSale$date_val,2,5),"-",
  substr(zipSale$date_val,7,9),"-",
  "01", sep="")
)
 
zipSale[1:4,]
 
 
lpzipSale <- zipSale[zipSale$RegionName=="20646",]
lpXTS <- xts(lpzipSale$value/1000, order.by=lpzipSale$rep_month)
axzipSale <- zipSale[zipSale$RegionName=="22314",]
axXTS <- xts(axzipSale$value/1000, order.by=axzipSale$rep_month)
 
lpax <- complete.cases(c(lpXTS,axXTS))
lpax <- c(lpXTS,axXTS)[lpax,]
 
chartSeries(lpXTS, TA="addTA(axXTS, on=1)", yrange = (c(min(lpax),max(lpax))), name = "20646 vs. 22314 Median Prices")
Created by Pretty R at inside-R.org