There are many definitions of data science, but here is a common one: data science is the set of techniques, methods, and technologies that allow people to collect, refine, analyze, and visualize data so that knowledge is created. Many experts will say that this is a new field, but, in my opinion, you can consider Sir Francis Bacon the first data scientist because he pioneered the experimental method: he used data from his experiments to create knowledge.
What is different about today’s data science is that we have access to far more data than early scientists could have imagined. This is the “big data” you have heard and read about. So, to understand today’s data science, we first need to define big data.
Thanks to the Internet and other digital technologies, we are producing, consuming, and recombining data in ever-increasing amounts. When describing big data, many experts use the following three characteristics:
Volume: There is simply more data being produced, from an increasing number of sources. For example, the Sunday edition of the New York Times contains as much information as a well-educated person of the 17th century would encounter in a lifetime. Twenty years ago, megabyte hard drives were considered enough storage for the average computer user’s programs and files. Ten years ago, users needed gigabyte drives, and now we commonly see hard drives with terabyte storage.
Variety: Along with volume, the kinds of data have increased. It used to be that most data was structured (think of a spreadsheet or a table), but we now have semi-structured data (XML documents) and unstructured data (YouTube videos or Flickr photos). I will discuss the differences in a future column, but for now, realize that as we connect more devices to the Internet and build more apps, data variety is going to increase.
Velocity: This is the speed at which data is produced and consumed. In the early days of data processing (punch cards), it could take weeks for data to be analyzed and reported. Now, with cloud computing and data visualization, data can be created and used almost instantaneously (“real time”). In fact, some data scientists talk about the advent of “faster than real life” data.
These are the three basic Vs of big data. You will often see articles that add more Vs, such as vitality, viscosity, and so on, to describe how data has changed both in its characteristics and in its impact on our world.
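To make the variety distinction above concrete, here is a minimal sketch in Python, using only the standard library. The record and field names (a person’s name and city) are made up for illustration; the point is that structured data has a fixed rows-and-columns layout, while semi-structured data like XML carries its own labels without enforcing a fixed schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns with a fixed layout, like a spreadsheet.
structured = io.StringIO("name,city\nAda,London\n")
row = next(csv.DictReader(structured))

# Semi-structured: the XML tags label each value, but nothing enforces
# that every document has the same fields in the same order.
xml_doc = ET.fromstring("<person><name>Ada</name><city>London</city></person>")
name_from_xml = xml_doc.findtext("name")

# Unstructured data (a video, a photo) carries no such labels at all;
# extracting meaning from it requires far heavier processing.

print(row["name"], row["city"], name_from_xml)
```

Both forms yield the same fact here, but only because this toy example is tidy; with real big data, the lack of a fixed schema is exactly what makes variety hard.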
Before big data, descriptive statistics and methods like linear regression were used to analyze and present data. These tools are still used, but they have been supplemented with techniques such as machine learning, predictive analytics, Bayesian statistics, topological data analysis, and other ways to extract knowledge from big data sets. Just as important are communication tools such as storytelling, graphic design, and visualization, which help people make sense of the data and turn it into actionable knowledge.
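As a small illustration of those pre-big-data tools, here is a sketch of descriptive statistics and an ordinary least-squares linear regression in Python, using only the standard library. The dataset is made up for the example (hypothetical advertising spend versus sales); the formulas are the textbook ones, not any particular product’s implementation.

```python
import statistics

# Hypothetical data: advertising spend (x) versus sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Descriptive statistics: summarize the data before modeling it.
mean_y = statistics.mean(y)    # 6.02
stdev_y = statistics.stdev(y)

# Ordinary least squares:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
n = len(x)
mean_x = statistics.mean(x)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
slope = cov_xy / statistics.variance(x)
intercept = mean_y - slope * mean_x

print(f"mean={mean_y:.2f}, stdev={stdev_y:.2f}")
print(f"fit: y = {slope:.2f}x + {intercept:.2f}")
```

A fit like this answers a narrow, well-posed question about a small table of numbers; the newer techniques in the paragraph above exist because big data sets are too large, too varied, and too fast-moving for such simple summaries alone.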
It is easy to become lost in the tools, data sets, and visualizations, but the ultimate purpose of data science is to answer questions using evidence. Questions such as: Who is a good credit risk? Where are the best places to recruit people for government service? Is that email spam, or is it valuable information? In the next column, I will discuss how questions should drive the data science process.