PMF Data 2009-2011

[Update 2/12/2011: By popular request, I have also formatted this for CSV/Excel: http://pmfellow.kodingen.com/scripts/getcsv.php The format is slightly different and includes is_vet, which is a field denoting whether the finalist is/was a veteran.]

Here is another quick update to let you know that I have made all of the finalist data from 2009-2011 available in JavaScript Object Notation (JSON) format. If you would like it in another format, just ask and I can probably offer that as well.

The data can be retrieved here: http://pmfellow.kodingen.com/scripts/getjson.php

Available fields and descriptions are as follows:

  • label: Either an MD5 hash of the original finalist name or, where I didn’t have the names available when I imported the data, a placeholder like “applicantX,” where X is an incrementing number.
  • type: Currently only “finalists” is available. Once I get to it, the other valid values for which rows exist will be “semifinalists” and “nominees.”
  • year: The PMF class year. Not every record type is available for every year.
  • rank: This is just the database’s unique record identifier; you don’t really need it for anything.
  • school: The corrected name of the school the individual PMF attended. By corrected, I mean the standardization I undertook as part of the record cleanup.
  • field: Individual’s academic field. No effort to standardize or clean these up occurred.
  • latlng: The latitude and longitude of the school, as determined by a separate geocoding script. I expect some percentage of error to have occurred here, but see below for error reporting.
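For anyone who wants to start poking at the JSON, here is a minimal sketch of reading records with the fields listed above. The sample data is made up, and the shape is an assumption on my part: a top-level list of objects, with latlng stored as a "lat,lng" string. Adjust the parsing if the actual output differs.

```python
import json

# Hypothetical sample shaped like the fields described above. The real
# feed's structure (top-level list, latlng as a "lat,lng" string) is assumed.
SAMPLE = """
[
  {"label": "applicant1", "type": "finalists", "year": 2011, "rank": 1,
   "school": "Example University", "field": "Public Policy",
   "latlng": "38.9,-77.0"},
  {"label": "applicant2", "type": "finalists", "year": 2010, "rank": 2,
   "school": "Example University", "field": "Law", "latlng": ""}
]
"""


def parse_latlng(value):
    """Return (lat, lng) as floats, or None if the value is missing or malformed."""
    try:
        lat, lng = (float(part) for part in value.split(","))
    except (AttributeError, ValueError):
        return None
    return lat, lng


def load_records(text):
    """Parse the JSON text and attach a parsed 'coords' entry to each record."""
    records = json.loads(text)
    for record in records:
        record["coords"] = parse_latlng(record.get("latlng"))
    return records


records = load_records(SAMPLE)
for record in records:
    print(record["year"], record["school"], record["coords"])
```

Records whose coordinates fail to parse get coords set to None, which makes it easy to list the schools that still need a corrected lat/long.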

If you have questions about the data or spot any obvious errors, please let me know in the comments. As stated above, I have the greatest expectation of errors in the latitude and longitude data, but this can be fixed pretty easily if you just tell me which school is wrong, and what the correct lat/long should be.

Also, feel free to use the data however you see fit. If you have anything you’re trying to put together, I would be happy to link to it. Similarly, I would be happy to help if you want data that’s not currently there (assuming I have it).


Really interesting data. Would be interesting to build some visualizations around the data. What do you think would be interesting ways to use/visualize the data that would be useful?

PMF Fellow

Stuart: See the update above. I meant to post the update yesterday but forgot. If there’s anything else you want in the data, let me know and I will see if I have it.

PMF Fellow

I’ve spent what spare time I had over the last week or two digging through visualization tools to see what might be an interesting use of this data, and I have some ideas. The most obvious one, which I’ve already started exploring, is geographic. School names are easy enough to geocode, and the results are easy enough to map, especially with a magnitude-based (i.e., count) representation. Another possibility is a choropleth map (http://en.wikipedia.org/wiki/Choropleth_map), in which we count the finalists per state, normalize the counts, and shade each state according to the resulting normalized magnitude (think density). I am working on one of these now, but the map data I am using to render the US is such low resolution that it completely omits DC, which skews the visual results a bit. A third is a Dorling cartogram (see http://vis.stanford.edu/protovis/ex/cartogram.html for an example), which does something similar but represents the geographic regions as non-overlapping circles. Again, the tool I am exploring uses the same low-resolution US map, which omits DC entirely.
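To make the choropleth idea a bit more concrete, here is a rough sketch of the count, normalize, and shade steps. It assumes each finalist has already been mapped to a state code, which the published data does not include, so that lookup is hypothetical. It also uses min-max scaling as one possible reading of "normalize"; per-capita scaling would be another. The function names and the five shading buckets are illustrative choices, not anything taken from a particular mapping tool.

```python
from collections import Counter


def normalized_counts(states):
    """Count finalists per state and min-max scale the counts into [0, 1]."""
    counts = Counter(states)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    # If every state has the same count, treat them all as fully shaded.
    return {
        state: (count - lo) / span if span else 1.0
        for state, count in counts.items()
    }


def shade_bucket(value, buckets=5):
    """Map a normalized value in [0, 1] to a shade index from 0 to buckets - 1."""
    return min(int(value * buckets), buckets - 1)


# Hypothetical input: one state code per finalist.
finalist_states = ["DC", "DC", "DC", "VA", "VA", "MD"]
scaled = normalized_counts(finalist_states)
shades = {state: shade_bucket(value) for state, value in scaled.items()}
print(shades)
```

The shade index can then be used to pick a color from whatever scale the mapping tool provides.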

Aside from 3D (peak/valley) map overlays, that’s probably all we can do with geographic data. Given the full set of nominee, semifinalist, and finalist data (where they even exist), there are probably things we can do in the realm of statistical modeling of degree fields, veterans, and the like, but I admit to knowing far less about how to approach these than the geographic stuff, at least beyond simple pie charts. If we had even more data, I could think of better or more useful visualizations, but what I want doesn’t really exist as far as I know.