It’s a natural precursor to the publication by O’Reilly of “Making Data Work: Practical Applications of Data Science”. In summary, Mike’s message is that the “future belongs to the companies and people that turn data into products.” A few of his points are discussed below.
Mike’s starting point is the observation that the web is full of “data-driven apps”. Data science is what enables the creation of such apps and their data products.
An example of an early Web data product was the CDDB database. Gracenote built a database of track lengths and coupled it to a database of album metadata (track titles, artists, album titles); iTunes now takes advantage of this database. He goes on to cover Google’s ability to spot trends in the Swine Flu epidemic about two weeks before the CDC by analyzing the searches people were making in different regions of the country.
Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, while Amazon saves your searches so it can correlate what you search for with what other users search for. The resulting analytic product is used to make surprisingly appropriate recommendations.
Mike examines the data lifecycle, noting the importance of Moore’s law as applied to data, but starts with the first step of any data analysis project: “data conditioning”. This activity gets data into a “state where it’s usable.” That includes formats that are easier to consume: Atom data feeds, web services, microformats, etc. Conditioning may also require data cleaning, data quality work, and natural language processing to disambiguate the data.
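To make the conditioning step concrete, here is a minimal sketch of what getting data into a “usable state” can look like in practice. The field names and normalization rules below are hypothetical illustrations, not taken from the article:

```python
import re

# Raw records arrive in inconsistent shapes: stray whitespace, mixed
# case, and durations recorded as either "M:SS" strings or raw seconds.
raw_records = [
    {"artist": "  The Beatles ", "track_length": "2:30"},
    {"artist": "the beatles", "track_length": "150"},
]

def to_seconds(value):
    """Normalize a duration: accept 'M:SS' strings or plain seconds."""
    if re.fullmatch(r"\d+:\d{2}", value):
        minutes, seconds = value.split(":")
        return int(minutes) * 60 + int(seconds)
    return int(value)

def condition(record):
    """Clean one record into a consistent, usable form."""
    return {
        "artist": record["artist"].strip().title(),  # trim and normalize case
        "track_length": to_seconds(record["track_length"]),
    }

cleaned = [condition(r) for r in raw_records]
# Both records now agree: artist "The Beatles", length 150 seconds.
```

The point is not these particular rules but the pattern: only after this kind of conditioning can the two records above be recognized as duplicates of the same track.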
As we scale our data up, people will increasingly build what are called information platforms or dataspaces. “Information platforms are similar to traditional data warehouses, but different – they go beyond relational DBs. … They are the logical descendants of Google’s BigTable and Amazon’s Dynamo, and are designed to be distributed across many nodes, to provide “eventual consistency” but not absolute consistency, and to have very flexible schema.”
They also “expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting.”
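The “very flexible schema” point is the sharpest contrast with relational databases, and it can be sketched in a few lines. The toy in-memory store below is an illustration only (it is not BigTable’s or Dynamo’s actual API, and the records are invented): unlike rows in a relational table, each record carries its own set of fields.

```python
# A toy in-memory key-value store standing in for a flexible-schema
# data platform. No table definition constrains what fields a record
# may have, so differently shaped records coexist without migrations.
store = {}

def put(key, record):
    store[key] = record

def get(key):
    return store.get(key)

# Two records with different "schemas" live side by side:
put("user:1", {"name": "Ada", "friends": ["user:2"]})
put("user:2", {"name": "Grace", "location": "NYC"})
```

In a relational design, adding a `location` or `friends` column would touch every row; here each record simply declares whatever it has.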
Vast datasets also present computational problems. “Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. … The open source implementation of MapReduce is the Hadoop project. … Hadoop has been instrumental in enabling “agile” data analysis. In software development, “agile practices” are associated with faster product cycles, closer interaction between developers and consumers, and testing.”
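The map and reduce stages described in the quote can be sketched with the canonical word-count example. This single-process version is only an illustration of the pattern; a real Hadoop job runs the same two stages distributed across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: each document is processed independently, emitting (word, 1)
    pairs. Because documents don't depend on each other, this stage can
    be split across many processors."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: combine the intermediate (word, count) pairs into
    per-word totals."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["big data big cluster", "data products"]
counts = reduce_phase(map_phase(docs))
# counts == {"big": 2, "data": 2, "cluster": 1, "products": 1}
```

The divide-and-conquer payoff is entirely in the map stage: the subtasks are identical and independent, so scaling up is a matter of adding machines, not redesigning the algorithm.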
Machine learning, along with statistical packages, is now another essential tool for the data scientist. “There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de facto standard.”
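Rather than reproduce the API of any library named above, here is a from-scratch sketch of one of the simplest algorithms such libraries package up, a 1-nearest-neighbor classifier. The feature vectors and labels are invented for illustration:

```python
import math

def nearest_neighbor(train, query):
    """Classify `query` with the label of its closest training example.
    `train` is a list of (feature_vector, label) pairs."""
    def distance(a, b):
        # Euclidean distance between two feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(train, key=lambda item: distance(item[0], query))
    return label

training_data = [
    ((1.0, 1.0), "short_track"),
    ((9.0, 8.0), "long_track"),
]
result = nearest_neighbor(training_data, (8.5, 9.0))  # "long_track"
```

Libraries earn their keep on exactly the parts this sketch omits: efficient indexing of large training sets, model selection, and evaluation.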
After analysis, data visualization, as part of information architecture, is key to allowing data to “tell its story”. “There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Ben Fry’s Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM’s Many Eyes, many of the visualizations are full-fledged interactive applications.”
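Even without GnuPlot or R, the idea of letting data “tell its story” can be sketched with a few lines of text-based plotting. The search-volume numbers below are invented for illustration:

```python
def bar_chart(data, width=20):
    """Render a {label: value} mapping as proportional text bars."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)  # scale bar to the peak
        lines.append(f"{label:>6} {bar} {value}")
    return "\n".join(lines)

weekly_searches = {"wk 1": 120, "wk 2": 340, "wk 3": 610}
print(bar_chart(weekly_searches))
```

Crude as it is, the chart makes the trend (here, a week-over-week surge) visible at a glance, which is exactly what a column of raw numbers fails to do.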
Nathan Yau’s FlowingData blog is a great place to look for creative visualizations.
A final section of the blog covers the skills a data scientist needs, ranging from traditional computer science to mathematics to art.