Showing posts with label Data tools.

Friday, June 27, 2014

Article: Open CPU Database; Race Your Favorite Processors Head-to-Head

I thought I'd share “CPU DB: Recording Microprocessor History,” written by members of the Stanford-sponsored team that put together the database. I highly recommend this article; I think it is a great example of the way that data, science, and data science can work with the hardware industry and the broader computer science community.


The CPU DB (cpudb.stanford.edu) is an open database, curated by a group at Stanford, which stores information on processors throughout history. While the goals addressed in the article are mostly concerned with using the database to analyze and predict processor performance, the database is intended to be usable by researchers with any subject in mind. I was especially impressed by how accessible the DB is, with multiple ways to interact with the data and example code for analysis. The article is both an introduction to the database by its proud parents and a demonstration of the analytical power the database allows.

Trendy Processors
The main research subject the authors demonstrate is the use of the historical data to test for trends in the manufacture and performance of processors over time. The authors work through several examples with graphical representations and statistical insights. Addressing the ubiquitous laws of Moore and Pollack, they use data from the DB to show the performance gains over time and the impact of the density and area afforded by advances in manufacturing. Despite my lack of deep knowledge in the area of hardware and architecture, this aspect was the most interesting to me. While many who have worked in the hardware industry may know the history of the decisions made to improve CPUs, this DB gives a bird's-eye view of what the actual results have been. While we sometimes take for granted today that processors inevitably improve over time, the CPU DB gives us many parameters with which to answer empirically the question of how those improvements were made.

Among many other examples, the authors show that while clock frequency increased dramatically from the 1980s to the mid-2000s, progress on single-core clock frequency has all but halted in the past decade. This fact, somewhat obvious from the macro trend, is made more interesting by the detail the CPU DB provides. Citing constant improvements in compiler optimization, the authors show that even though clock frequencies have been stagnant for some time, single-core performance has still been increasing (albeit very slightly) over the last decade. That detail might be overlooked by a more narrowly focused industry report or by consumer-oriented technology journalism.


Open Analysis
The idea this article presents is fascinating to me, and I'm especially excited by the open data. Before I got to page two I was already thinking of how I might chop this data up with R. Much to my surprise, when I went to the “Downloads” page I found the team had posted some sample R code. Just for fun I put together a quick chart of my own based on the CPU DB data: a wonky-looking count of transistors per die over time.



R script:
require(ggplot2)

#Read the processor table downloaded from cpudb.stanford.edu
processor <- read.csv("processor.csv")

#Drop rows missing a date or a transistor count
processor <- processor[!is.na(processor$date),]
processor <- processor[!is.na(processor$transistors),]

#Convert the date column and keep observations from 2000 onward
processor$date <- as.Date(processor$date)
processor <- processor[processor$date >= '2000-01-01',]

ggplot(data=processor, aes(x=date, y=transistors)) +
  geom_line(colour="blue", size=1) +
  ggtitle("Transistor Count 2000 to Today")
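
Given how wonky the raw counts look, a log scale on the y-axis (my own addition, not part of the posted sample code) makes the roughly exponential growth much easier to read:

#Same chart with a log10 y-axis
ggplot(data=processor, aes(x=date, y=transistors)) +
  geom_line(colour="blue", size=1) +
  scale_y_log10() +
  ggtitle("Transistor Count 2000 to Today (log scale)")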

Friday, May 23, 2014

Housing Cost Comparison Tool with Zillow and R

Thinking of moving?

Using some of my favorite R packages and Zillow data I built a comparison function for median house price between two localities.


Zillow posts data outputs and derivations of its Zestimate home price model as a series of .csv files. This data is freely available to researchers and includes many breakouts for home and rental values as well as some very interesting housing market metrics (price-to-rent, sold-for-loss, etc.). Enhancing the value of these series is the level of geographic detail available. Beyond the state and county levels, zipcode and neighborhood detail levels are also offered. Zillow even gives access to the shapefiles of its defined neighborhoods, enabling work with GIS tools, R, or other languages. An R package has also been developed to work with the free API the site provides.


For some exploration of this data with R, I chose a set that uses an easily consumed nominal value: Median Sale Price.

I began by reading the Median Sale Price .csv into a data frame:

library(reshape)
library(quantmod)
 
zipSale <- read.csv(file="http://files.zillowstatic.com/research/public/Zip/Zip_MedianSoldPrice_AllHomes.csv", header=T)


What I found from a [1:4,] head view is that the Zillow time series data is laid out in a columnar (wide) format, not a friendly grouping-and-value format. To handle this I used the melt function from the ever-useful reshape package. Using melt I created a "value" variable and assigned the month columns to a single "date_val" grouping variable. It was especially helpful to know that the first five variables in the Zillow home price data are not month data and can be excluded from the melt reshaping by assigning them as the "id" variables (similar to df[,-(1:5)]); all other variables (the Zillow data months) are melted as "measure" variables by default. I also created a date-typed value from the YYYY-MM strings that Zillow provides.

#Melt the month columns into rows; the first five columns are identifiers
zipSale <- melt(zipSale, id=c(1:5), variable_name="date_val")
#Build a proper Date ("YYYY-MM-01") from the melted column names
zipSale$rep_month <- as.Date(paste(
  substr(zipSale$date_val,2,5),"-",
  substr(zipSale$date_val,7,9),"-",
  "01", sep="")
)


Now, with an easily manipulated set, I used another of my favorite packages, quantmod, to create some charts in the style of standard Bloomberg output.

To do this I first had to subset the frame into xts objects (built on the zoo class). After some charting I decided to divide the sale price values by 1000 for better formatting. For my initial pass I chose my hometown zip and the zip of my current address:


lpzipSale <- zipSale[zipSale$RegionName=="20646",]
lpXTS <- xts(lpzipSale$value/1000, order.by=lpzipSale$rep_month)
axzipSale <- zipSale[zipSale$RegionName=="22314",]
axXTS <- xts(axzipSale$value/1000, order.by=axzipSale$rep_month)

Now with the two sets I created a composite with which I could create a global range and do some other testing:
lpax <- complete.cases(c(lpXTS,axXTS))
lpax <- c(lpXTS,axXTS)[lpax,]

I then used these xts objects with chartSeries() to plot 20646 and 22314:

chartSeries(lpXTS, TA="addTA(axXTS, on=1)", yrange = (c(min(lpax),max(lpax))), name = "20646 vs. 22314 Median Prices")


Expensive area...

From here I felt it would be best to create a function to abstract the chart creation.

Other than some formatting changes I largely just re-purposed the above code using two zipcode value arguments.  I also used these values for the chart title, for some simple dynamic formatting.  The full code is:

zipCompare <- function(zip1, zip2) {
 
  require(quantmod)
  #Build an xts series (in $ thousands) for each zip code
  zip1df <- zipSale[zipSale$RegionName==zip1,]
  zip1XTS <- xts(as.integer(zip1df$value)/1000, order.by=zip1df$rep_month)
  zip2df <- zipSale[zipSale$RegionName==zip2,]
  zip2XTS <- xts(as.integer(zip2df$value)/1000, order.by=zip2df$rep_month)
 
  #Combine the two series to compute a shared y-axis range
  zip12 <- complete.cases(c(zip1XTS,zip2XTS))
  zip12 <- c(zip1XTS,zip2XTS)[zip12,]
 
  #Chart the first zip and overlay the second on the same panel
  lineChart(zip1XTS, name = paste(zip1, "vs.", zip2, "Median Sales Price"), yrange = c(min(zip12),max(zip12)))
  addTA(zip2XTS, on=1)
 
}

Here I compared my home and work:
zipCompare("22314", "22101")


I also tested these scripts at the county geographic level; given the naming conventions of the Zillow URLs and file formats this was very easy to do (a rough sketch is below). To add more value, I think exploring other plotting options would be the next step. While I am partial to the aesthetic of chartSeries(), it does have flaws when dealing with this type of data, specifically around axes and attribute naming.
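
Here is that county-level sketch; the exact file path and the number of leading identifier columns are my assumptions, extrapolated from the zip-level URL conventions above rather than verified:

#Assumed URL pattern, mirroring the zip-level file used above
countySale <- read.csv(file="http://files.zillowstatic.com/research/public/County/County_MedianSoldPrice_AllHomes.csv", header=T)
 
#The same melt/date-parsing prep should apply; adjust id= if the county file
#carries a different number of leading identifier columns
countySale <- melt(countySale, id=c(1:5), variable_name="date_val")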


Full code for data prep:

library(reshape)
library(quantmod)
 
zipSale <- read.csv(file="http://files.zillowstatic.com/research/public/Zip/Zip_MedianSoldPrice_AllHomes.csv", header=T)
 
zipSale<- melt(zipSale,id=c(1:5),variable_name="date_val")
zipSale$rep_month <- as.Date(paste(
  substr(zipSale$date_val,2,5),"-",
  substr(zipSale$date_val,7,9),"-",
  "01", sep="")
)
 
zipSale[1:4,]
 
 
lpzipSale <- zipSale[zipSale$RegionName=="20646",]
lpXTS <- xts(lpzipSale$value/1000, order.by=lpzipSale$rep_month)
axzipSale <- zipSale[zipSale$RegionName=="22314",]
axXTS <- xts(axzipSale$value/1000, order.by=axzipSale$rep_month)
 
lpax <- complete.cases(c(lpXTS,axXTS))
lpax <- c(lpXTS,axXTS)[lpax,]
 
chartSeries(lpXTS, TA="addTA(axXTS, on=1)", yrange = (c(min(lpax),max(lpax))), name = "20646 vs. 22314 Median Prices")

Tuesday, April 22, 2014

Pretty, Fast D3 charts using Datawrapper

While reading a news article I came across a US state choropleth that piqued my curiosity.

An economic news article on state unemployment rates at The New Republic included a state-level map with a two-color scale and tooltips.  As always when I see a new chart that I like, I look for two things to steal: the data and the method used to make the chart.

In this case I was in luck.  The embedded chart included links for both the method used and the data (great features of Datawrapper).

Datawrapper.de is a set of open source visual analytics tools (mostly JavaScript from what I've seen) integrated into an easy-to-use UI.  For those of us still learning D3.js, this is a great way to build beautiful, interactive charts with the style and capabilities of the D3-based graphics used by many online publications and bloggers.


State-level maps (still in beta at the time of writing) are really easy and fun to create.  You simply start by uploading or pasting in your data; I was able to paste in just two columns, state abbreviation and value.  Using R this might have taken me fifteen or twenty minutes or so; in Datawrapper it only took five (for comparison, a rough sketch of the R route follows).
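
A minimal sketch of what that R route might look like with ggplot2 and the maps package; here 'stateData' is a hypothetical frame with lower-case state names in a 'state' column and the unemployment rate in 'value' (my own illustration, not the data from the article):

library(ggplot2)
library(maps)
 
#Hypothetical input: stateData has columns 'state' (lower-case name) and 'value'
states_map <- map_data("state")
ggplot(stateData, aes(map_id = state)) +
  geom_map(aes(fill = value), map = states_map, colour = "white") +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_gradient(low = "#DEEBF7", high = "#08519C") +
  theme_bw() + labs(x = NULL, y = NULL)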


I took another stab at this data, trying out one of the line chart templates provided.  Here I wanted to mimic (in substance, not style) one of my favorite data tools, FRED:




Following steps similar to the US state map above, I pasted in the time series data from FRED and moved through the Datawrapper wizard.  I selected the line chart, then updated a few options to better mimic FRED (sadly, neither the iconic recession shading nor grid lines are available natively).  In some ways the result is even more aesthetically pleasing, and it could be a nice, easy addition to a blog post or article.



Friday, April 18, 2014

googleMap-ing with R; Commute Loop

There are two main routes I can take to get to work each day; together they make a comically circular commute.

I thought I'd use that commute to try out yet another cool use case for R: GPS plotting with dynamically created HTML/JS using the googleVis package.

First we need some data points.

For simplicity I logged the GPS data using my phone with the GPS Logger for Android app. This app provides coordinates logged to .gpx, .kml, or comma delimited .txt files at a chosen interval.  A few metrics are included like elevation, speed and bearing.  I plotted my morning and evening commutes while driving two different routes to the office.

The code for this is very simple; the gvisMap call to the Google Maps API does almost all of the heavy lifting.  We simply pass gvisMap the latitude and longitude coordinates in the expected format (lat:lon) and then an optional label for each point, plotted as a "Tip".  Here I chose elevation relative to sea level, but any parameter in the set could be chosen.



#googleVis must be installed
library(googleVis)
 
#Read in the gps files.  I used .txt, but .gpx can be used also, with preparation
gps418 <- read.csv("20140418.txt")
gps418_2 <- read.csv("20140418_2.txt")
 
#Add morning and evening commutes
gps418<-merge(gps418,gps418_2, all.x=TRUE, all.y=TRUE)
 
#Prepare Latitude and Longitude for gvisMap
gps418$latlon <- paste(gps418$lat,":",gps418$lon, sep="")
 
#Call gvisMap function, here I'm using elevation as a label
gpsTry <- gvisMap(gps418,"latlon", "elevation",
                  options=list(showTip=TRUE, showLine=TRUE, enableScrollWheel=TRUE,
                               mapType='hybrid', useMapTypeControl=TRUE,
                               width=1600,height=800))
 
#Plot (Send gvisMap generated HTML to browser at local address)
plot(gpsTry)
 
#Yeah, it's that easy


The result is automatically rendered at a local address in your browser, with the generated HTML calling the Google API.  I've pasted the relevant portion here to display the map:



Hat tip to Geo-Kitchen; I noticed their post while doing this one.  I like the plot of my alma mater.

Friday, April 11, 2014

GeoGraphing with R; Part 5: Adding Intensity to the County Red/Blue Map

Today I took the Red/Blue exercise from the last post a bit further.  I'm a big advocate for the power of subtly layering multiple indicators in a single chart, and I thought I'd try this with the county-level election graphs from the last post.  In my experience the usability of a chart peaks at three or so data dimensions per visual; any more than that and there is a real risk of boredom or confusion for the reader.  The election maps in the last post had two layers (geography and winner), so here I tried expanding the second layer to include a measure of intensity.





Luckily, R includes a package that makes this quite simple.  The scales package provides the very useful alpha() function, which adjusts a color value's transparency using a scalar modifier.  To achieve this in the scripts from the last post, I simply had to create that scalar and use it to modify the existing color scheme.

This only adds a couple of lines to the earlier scripts (plus updating the existing color-matching line to use the new transparent colors):

#scales provides the alpha() transparency helper
library(scales)
#Calculate winning percentage to use for shading
elect12$WinPct <- elect12$Win_Votes/elect12$TOTAL.VOTES.CAST
#Create transparent colors using scales package
elect12$alphaCol <- alpha(elect12$col,elect12$WinPct)
#Match the transparent colors to county.FIPS positions (replaces the earlier matching line)
elect12colors <- elect12$alphaCol[match(cnty.fips, elect12$FIPS.Code)]


I really think this helps add another dimension to the chart, answering an inevitable question that the reader might have.

There are a couple of issues with this methodology, however, since the observed winning percentages cluster around similar values: the mid-fifties for many counties, with a 2012 range from 46.2% (Eastford county, Connecticut) to 95.9% (King county, Texas). This causes a washout effect on the colors in the chart. A non-linear scaling, using a log scale or a binned color mapping, could help with this (a quick sketch of one option follows).
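
One simple version of that idea (my own sketch, not code from the original scripts): stretch the winning percentages onto a fuller range before converting them to transparency, so close races and landslides separate more visibly.

library(scales)
 
#Rescale the observed winning percentages (roughly 0.46-0.96) onto 0.2-1
#before converting them to alpha values
elect12$WinPctScaled <- rescale(elect12$WinPct, to = c(0.2, 1))
elect12$alphaCol <- alpha(elect12$col, elect12$WinPctScaled)
elect12colors <- elect12$alphaCol[match(cnty.fips, elect12$FIPS.Code)]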

Additionally, some measure of population could be added to improve the readability of the chart. Election maps (like most US value maps) suffer from the cognitive dissonance of a seemingly uniform land distribution overlaid on a very uneven population distribution. I was really influenced by Mark Newman's fun take on election mapping. The cartograms he posted are especially interesting, and I hope to create those in R sometime. I love the way the population-weighted cartogram lets the reader intuit the average value from an otherwise misleading two-color heatmap.

Thursday, April 10, 2014

GeoGraphing with R; Part 4: County Level Presidential Election Results


I've always loved US county-level mapping because it provides enough detail to give an impression of complexity but retains a level of quick readability. I thought I'd try this out in R with a cliché but hopefully appealing set of charts. While I'm not particularly interested in political science, I've always loved the graphics that define it. There is a kind of pop art beauty in the Red-Blue charts we're all used to seeing, and I thought I'd try to mimic those using R.

First, the data had to be located. As always, securing county-level data is a little more difficult than for other metrics. The basic problem with county election results is that while the data is well sourced at the state government level, it is not easily found for all states in one accessible place. I found a few great resources while searching, and I ended up using two sources which seemed to be authorities on the subject, one for the 2008 and one for the 2012 presidential election.

2008
For the 2008 election I used a file from Ducky/Webfoot, a blogger who cleaned up a contemporary fileset provided by USA Today. Since this set was already well cleaned there was little to do but read.csv() and code.

2012
Here I relied on a set provided by the Guardian which was referenced by some interesting blog posts on the subject. The Guardian provides the data in .xls or Google fusion table format. I chose to use the .xls file, which I cleaned somewhat and re-saved as a .csv.

I began by making sure the county FIPS codes lined up with those in the R maps package. It turned out that both sets were well populated with FIPS codes, but 2012 seemed to be missing some detail for Alaska (here I imputed a Romney win for the 10 or so counties without data), and the 2008 set needed a transformation to create the five-digit county FIPS code (state code multiplied by 1000, plus county code).
After reading in the .csv's I assigned a color value (#A12830 and #003A6F, two of my favorite hex colors) to each FIPS based on the winning candidate (classic Red and Blue, no surprises here). This allows me to do a little trick later and quickly assign each county a color. I then assigned these colors and the candidate names to vectors to help create a legend later on:
elect12$col <- ifelse(elect12$Won=="O","#003A6F","#A12830") 
colorsElect = c("#003A6F","#A12830")
leg <- c("Obama", "Romney")

Next I created a vector of colors matched and sorted on county using the county.fips data in the maps package:
elect12colors <- elect12$col [match(cnty.fips, elect12$FIPS.Code)]


After this we're ready to build the map. Here I used the png device because I wanted to make a really big, zoomable image. The map() function here is pretty straightforward, but I'll note that it is the "matched" color vector that assigns the Red/Blue to the map, which keeps the color mapping outside of the map() call.




2008 Election R script
#R script for plotting the county level Presidential popular vote results of 2008
#The maps package supplies the map() function and the county.fips data
library(maps)
 
#Read in pre-formatted csv with binary M/O values for "Won"
elect08 <- read.csv("prez2008.csv")
 
#Assign appropriate color to winning candidate
#"#A12830" is a dark red, "#003A6F" a darker blue
elect08$col <- ifelse(elect08$Won=="M","#A12830","#003A6F")
 
#Transform for FIPS from 2008 data
elect08$newfips <- (elect08$State.FIPS*1000)+elect08$FIPS
 
#Create lists for legend
colorsElect = c("#A12830","#003A6F")
leg <- c("McCain", "Obama")
 
#Match colors to county.FIPS positions
elect08colors <- elect08$col[match(cnty.fips, elect08$newfips)]
 
#Map values using standard map() function, output to png device
png("elect08.png", width = 3000, height = 1920, units = "px")
map("county", col = elect08colors, fill = TRUE, resolution = 0,
    lty = 0, projection = "polyconic")
 
#Add white borders for readability
map("county", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2,
    projection="polyconic")
 
title("2008 Presidential Election Results by County", cex.lab=5, cex.axis=5, cex.main=5, cex.sub=5)
box()
legend("bottomright", leg, horiz = FALSE, fill = colorsElect, cex = 4)
dev.off()



2012 Election Map R Script
#R script for plotting the county level Presidential popular vote results of 2012
#The maps package supplies the map() function and the county.fips data
library(maps)
 
#Read in pre-formatted csv with binary R/O values for "Won"
elect12 <- read.csv("2012_Elect.csv")
 
#Assign appropriate color to winning candidate
#"#A12830" is a dark red, "#003A6F" a darker blue
elect12$col <- ifelse(elect12$Won=="O","#003A6F","#A12830")
 
#Create lists for legend
colorsElect = c("#003A6F","#A12830")
leg <- c("Obama", "Romney")
 
#Match colors to county.FIPS positions
elect12colors <- elect12$col[match(cnty.fips, elect12$FIPS.Code)]
 
#Map values using standard map() function, output to png device
png("elect12.png", width = 3000, height = 1920, units = "px")
map("county", col = elect12colors, fill = TRUE, resolution = 0,
    lty = 0, projection = "polyconic")
 
#Add white borders for readability
map("county", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2,
    projection="polyconic")
 
title("2012 Presidential Election Results by County", cex.lab=5, cex.axis=5, cex.main=5, cex.sub=5)
box()
legend("bottomright", leg, horiz = FALSE, fill = colorsElect, cex = 4)
dev.off()

Monday, January 27, 2014

Favorite Tools: NDBC and the Chesapeake Bay Interpretive Buoy System

This one is not really a tool I use at work, just a favorite public data source of mine.  I really love the combination of physical computing and data, and the NOAA buoy system might be considered one of the most widespread Internet of Things installations.

The NOAA buoy system consists of a network of buoys from different programs, all tracked by NOAA.  Many of these are not under the direct supervision of NOAA; some are academic installations, others belong to state or local governments.

The National Data Buoy Center website and database provide instant access to the status of many of these buoys.  Also included within the network are observations from volunteer ships outfitted with sensors and telemetry equipment.  Individual buoys can be found via a map applet or within a mobile-optimized site.


The Chesapeake Bay Interpretive Buoy System (CBIBS) is part of the NDBC network and was designed to track the health of the bay using a network of smart buoys installed around the Chesapeake Bay watershed.  The smart buoys include a suite of sensors that any DIY Arduino weather station builder might dream of.  The program supplements this environmental data with a parallel historical lesson, combining bay health with the history of development in the watershed area, including the connection of the buoy locations with the historic journeys of a favorite historical figure of mine, Captain John Smith.

The CBIBS includes some cool data visualization features, like a graphing applet and csv downloads.  It also has a mobile app, which I've added to my wonkApp collection along with the FRED app.


While this data is used more urgently by mariners and scientists, I love checking it to get a sense of conditions at some of my favorite places in the area: the lower Potomac near where I grew up, Jamestown Island (visible from the fort site), and the upper Potomac (visible from my apartment).


I've even written some shell commands which I use to check on the Alexandria Buoy for real-time weather stats 200 yards from my apartment building while at work:

alias bTemp='wget -q http://www.ndbc.noaa.gov/mobile/station.php?station=44042 -O - | grep Air | cut -c1-15'

(Buoy Station changed in code to reference an active station.  Sadly, the Upper Potomac is offline for winter maintenance)

Wednesday, January 22, 2014

GeoGraphing with R; Part 1: Zipcode Mapping

I'd like to share some graphing work I've done with the R programming language.  I have been interested in R for a few years now, and have enjoyed the extremely intuitive platform it provides for data analysis.  Although I don't make much use of the powerful statistical tools R provides, I've found that this is part of the charm of R: it provides a platform for any use you could need, with an intuitive interface much like Python.  I keep R on my personal Ubuntu and Windows machines, use it at work, and have even installed R on my Raspberry Pis.

I am a big fan of the RStudio IDE, which adds editing and data/file management conveniences on top of the utilitarian base R GUI.  I have also tested similar code on a Raspberry Pi, where R installs with a simple call to apt-get.


After seeing a demonstration of some of the geographical features of Tableau (GIS-lite within their visual analytics platform), I was inspired to experiment with mapping visuals, for free.

Using the wonderful wealth of user packages, I was able to get started on this quickly with some tutorials and documentation I found.  I am especially indebted to Jeffrey Breen, the creator of the zipcode package, whose tutorial I found immensely helpful in creating this particular chart.  This charting program is built around plotting latitude and longitude points against a contiguous United States map defined by state borders.  Since the borders and the points share the same coordinate system, the matching between them is exact.


This particular chart is a version of a project I created for work, plotting the locations of bank branches for the top five banks by number of branches.  In an era of thin branch banking, deep networks of brick-and-mortar branches aren't always considered key to retail banking success, but this type of analysis is still useful.  The program is based on the publicly available branch location data from the FDIC, downloaded as .csv files and parsed by R into data.frame objects.  I have yet to find a public API for this data; bonus points to anyone who has.

The code below makes use of the zipcode package mentioned above as well as the ever useful ggplot2 graphing library.  This is ready to run on any R platform with these packages installed.


#Load needed libraries (note that the zipcode package is used for its dataset)
library(zipcode)
library(ggplot2)
data(zipcode)
 
#Read and format .csv's downloaded from the FDIC 
#Source http://research.fdic.gov/bankfind/
#csv's were renamed to the stock ticker of each bank but are otherwise unchanged
#The raw csv's include 7 rows of metadata, this is removed allowing row 8 to be used as headers
#Since Zip and Bank are all we care about, for now other headers are ignored
#Bank name is added to allow aggregation by entity later
#I've created a quick function for importing the data
readBank <- function(filename) {
  bank <- read.csv(paste(filename,".csv",sep=""), header=TRUE,skip=7)
  bank$Bank <- filename
  bank
}
WFC <- readBank("WFC")
JPM <- readBank("JPM")
BAC <- readBank("BAC")
USB <- readBank("USB")
PNC <- readBank("PNC")
 
#Concatenate bank files together
top5 <- rbind(WFC,JPM, BAC, USB, PNC)
#merge five bank set with zipcode to make mapping possible
top5Zip <- merge(zipcode,top5, by.x= "zip",by.y ="Zip" ) 
 
#Much of the following has been taken from Jeffrey Breen at http://jeffreybreen.wordpress.com/2011/01/05/cran-zipcode/
#Begin mapping function.  Colors denote bank names.  "size" is increased to enhance the final plot
g <- ggplot(data=top5Zip) + geom_point(aes(x=longitude, y=latitude, colour=Bank), size = 1.25)
 
#Simplify display and limit to the "lower 48"
#Some banks have Alaska branches (specifically Wells Fargo in this data), this is included, but ignored by the ggplot
g <- g + theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL)
g <- g + scale_y_continuous(limits = c(25,50), breaks = NULL)
 
#Don't need axis labels
g <- g + labs(x=NULL, y=NULL)
g <- g + borders("state", colour="black", alpha=0.5)
g <- g + scale_color_brewer(palette = "Set1")
#Arbitrary title
g <- g + ggtitle("Top Five Banks by Number of Branches") + theme(plot.title = element_text(lineheight=.8, face="bold"))
g <- g+ theme(legend.direction = "horizontal", legend.position = "bottom", legend.box = "vertical")
g

Following the creation of this plot I usually use the ggplot2 ggsave feature to save the plot to an image file:
ggsave("branches5.png", plot=g)


The resulting plot:


As the simplicity of the merge statement shows, you could substitute nearly any zipcode-keyed data.  Other charts I've created have included asset locations and temperature data; a quick sketch of that kind of substitution is below.
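
A minimal sketch of the substitution; the 'sensorData' frame and its Zip and temperature columns are hypothetical, just to illustrate the pattern:

#Hypothetical zip-keyed input: sensorData has columns Zip and temperature
sensorZip <- merge(zipcode, sensorData, by.x="zip", by.y="Zip")
 
#Same plotting pattern as above, coloring by the measured value instead of bank name
g2 <- ggplot(data=sensorZip) +
  geom_point(aes(x=longitude, y=latitude, colour=temperature), size = 1.25) +
  theme_bw() +
  scale_x_continuous(limits = c(-125,-66), breaks = NULL) +
  scale_y_continuous(limits = c(25,50), breaks = NULL) +
  labs(x=NULL, y=NULL) +
  borders("state", colour="black", alpha=0.5)
g2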


As a preview, the next R GeoGraphing post will focus on state-level mapping data and include some animation tricks.

Sunday, January 19, 2014

Favorite Tools: FRED and St. Louis Fed Research Tools

I'd like to use this series as a set of love notes on my favorite data tools.  Some of these I use almost constantly at work; others are personal favorites I have come across.


FRED is a tool I came across a few years ago while reading economics blogs.  The distinctive color of a standard FRED graph (with obligatory recession shading) was something I began to associate with the econ blogger crowd.  It seems this has been noticed by many; Paul Krugman, whose blog was one of the first places I noticed FRED, is quoted as saying "I think just about everyone doing short-order research — trying to make sense of economic issues in more or less real time — has become a FRED fanatic."

After using these tools at work and home I have come to feel the same way about the tool, even evangelizing its merits to my coworkers and friends.

FRED graphs are distinctive and immediately recognizable


In my data analysis work at a national bank, I have come to greatly value FRED for two main reasons: it is a singularly well-organized and well-populated database, and it allows immediate reference to data that is often needed in a one-off fashion.  Pulling this data up during a meeting has more than once earned me some recognition for economic knowledge that might not otherwise have occurred.

The breadth of data available is somewhat astounding.  International data might usually take you all over the web and to a few commercial sites, but FRED has enough to do most high-level macroeconomic survey work.  I find the somewhat more obscure metrics very interesting at times, and it's fun to eyeball them for trends.

It's too easy to make weird charts...


After discovering FRED's website I was ecstatic to find that an Excel Add-In had been developed.  I immediately made use of the feature and made sure to spread the news around.  Being able to quickly pull in common economic data while doing simple (or complex) analysis can save a lot of time, and outsourcing the data storage and update costs to FRED is wonderful.  Being able to cut down on some of the user-table creation and maintenance I owned was a real time saver.

To facilitate access to my company's internal economic data hub, I even created my own version of the FRED Excel Add-In, which I named ED.  Using some simple VBA GUI elements (drop-downs, radio buttons, many MsgBoxes...) and an ODBC connection, I was able to mimic the Excel Add-In functionality of FRED.  Adding some charting code, I was able to mimic the distinctive graphs as well.  Given that the data is proprietary, I don't see any issue with my imitation of FRED, and I view it as a labor of love in tribute to the data tool.
Tying FRED into R was an obvious next step, and I've already begun to make use of this data.  Being able to pull series down into the R environment makes it even easier to manipulate the data quickly, without worrying about Excel resources (Autosave, I'm looking at you!) or adding the data to a database structure.  An R programming project I'll detail later, exhibiting geographical plotting, uses similar data; maybe I'll tie FRED in to show off the functionality.  A quick sketch of one way to do the pull is below.
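
As a minimal sketch (my own example, not the project described above), the quantmod package used elsewhere on this blog can fetch a series directly from FRED by its series ID:

library(quantmod)
 
#Fetch the civilian unemployment rate from FRED by its series ID; an xts object named UNRATE is created
getSymbols("UNRATE", src = "FRED")
 
#Chart it with the same quantmod tooling used in the Zillow post
chartSeries(UNRATE, name = "Civilian Unemployment Rate (FRED: UNRATE)")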

I also happily own the FRED mobile app, which I find entirely too amusing; it has come in handy for wonky discussions and for proving my data nerdiness to anyone in sight.

If they sold T-shirts, sign me up for two.


The St. Louis Fed includes three other tools: GeoFRED (data mapping), ALFRED (historical versions of economic series), and CASSIDI (a personal favorite of mine, which details US banking industry data).  I believe I'll include love notes on these as well, CASSIDI especially.