Should You Be a Data Scientist?

An interdisciplinary approach to data science has the advantage of giving young professionals and students a broad knowledge of its foundations. The vendors and functions discussed below each portray “data science” in their own particular way. Sometimes this is for the purpose of recruiting students, sometimes to require sweeping credentials for jobs, and sometimes to advertise the functions of a particular technology.

Paradoxically, given the breadth of what “data science” means, the specific training and expertise are often very narrow. Students and young professionals need to be well informed about the options and about their personal strengths in programming, analytics, or subject matter expertise in order to apply the output of data science to their academic training.

A great number of descriptions, professional programs, and jobs identify “data scientist” as requiring programming knowledge and/or statistical expertise. Programming in Python, R, and JavaScript is commonly treated as essential to data science. (SAS is a more self-contained environment but has its own data science functions.) Python and R are constantly compared with each other, and the capabilities of both languages expand every day.

Expertise in data storage and manipulation techniques with Hadoop is often taken to be a prerequisite skill for data science. The term “Big Data” is sometimes associated with the application of “data science.” Some skill in that kind of programming is essential for most jobs; however, familiarity with RDF and OWL, which relate to semantics, logic, linguistics, and library science, is growing in importance.

The use of neural networks is viewed as the outer limits of data science. “Machine learning” is a commonplace term in data science, often used without reference to its origins in AI and cybernetics. While “data mining” is a static analytic exercise, “machine learning” embraces the old idea of “feedback” and response to new data. As techniques, both data mining and machine learning search for patterns, numeric and textual.
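The contrast between a static analysis and a feedback loop can be sketched in a few lines of Python. This is a deliberately simplified toy, not a description of any real data mining or machine learning product: the class name, the learning rate, and all of the numbers are invented for illustration.

```python
# Hypothetical illustration: every name and number here is made up.

def static_mean(values):
    """One-off summary of a fixed data set (the 'static' analysis)."""
    return sum(values) / len(values)


class FeedbackEstimator:
    """Keeps an estimate and corrects it after every new observation."""

    def __init__(self, initial_estimate, learning_rate=0.5):
        self.estimate = initial_estimate
        self.learning_rate = learning_rate

    def update(self, observed):
        # Feedback step: compare the current estimate with what was observed,
        # then move the estimate a fraction of the way toward the new data.
        error = observed - self.estimate
        self.estimate += self.learning_rate * error
        return error


historical = [10.0, 12.0, 11.0, 9.0]
baseline = static_mean(historical)  # computed once, never revised

model = FeedbackEstimator(initial_estimate=baseline)
new_observations = [20.0, 20.0, 20.0, 20.0]  # assume the world has shifted
errors = [model.update(x) for x in new_observations]
```

The static baseline stays where the historical data left it, while the feedback estimator's error shrinks with each new observation as it adapts.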

For decades, library science has been central to the fast-growing approach to semantic data. “Text analysis” (with origins in computational linguistics, mathematics, and philosophy) and “semantics” are increasing in relevance. This is in contrast to relational databases, which have been the standard since the 1980s. There are multiple standards for semantic databases, such as RDF and OWL, and major query languages, such as SPARQL and Cypher. This approach has its own very specialized database software.
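To make the contrast with relational tables concrete, here is a minimal sketch in plain Python. It is not a real RDF store and does not use SPARQL; the identifiers such as `book:1` and the `match` helper are hypothetical, chosen only to show how facts stored as subject, predicate, object triples differ from rows in a fixed schema.

```python
# Hypothetical data: identifiers and predicates are invented for illustration.

# Relational style: a fixed schema, one row per entity.
relational_rows = [
    {"id": 1, "title": "Moby-Dick", "author": "Herman Melville"},
]

# Semantic (RDF-like) style: each fact is a (subject, predicate, object) triple.
triples = [
    ("book:1", "title", "Moby-Dick"),
    ("book:1", "author", "person:1"),
    ("person:1", "name", "Herman Melville"),
    ("person:1", "bornIn", "place:NewYork"),
]


def match(pattern, store):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# A new kind of fact can be added without altering any schema.
triples.append(("book:1", "subject", "Whaling"))

# "Who wrote book:1, and where was that person born?" as two chained matches.
author = match(("book:1", "author", None), triples)[0][2]
birthplace = match((author, "bornIn", None), triples)[0][2]
```

In a real project the same idea would be handled by a dedicated triple store and a query language rather than list comprehensions, but the shape of the data is the point here.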

Approaches often claim to use geospatial data or GIS. Geospatial data is used in a variety of ways. These conclusions are based on research and conversations rather than on marketing literature. References to dedicated GIS applications, such as ArcGIS, Oracle MapViewer, or OpenStreetMap, are not included here, nor are GIS applications built on relational databases. In most cases, the geospatial capability amounts to point data based on place names; these tools do not allow editing, geoprocessing, or geostatistics. MarkLogic stores image data, text data, relational data, and geospatial data. MongoDB has limited, indirect capabilities.

This variety of professional training and university study forces difficult choices about which area to concentrate on and where to obtain that training. One ramification is whether students can or should get training in programming within Computer Science, or should just get the basics of programming in some other university department or through a certification.

That depends on the goals of a course of study or training. Students and young professionals may or may not want that level of programming skill and may not be looking for immediate industry jobs, of which there are thousands. Or they, like me, may find the array of programming options annoying to choose from, or they may not be best suited to programming.

Another approach is to combine training and study with the methods of data science. Moreover, data science is applicable to many traditional courses and bodies of substantive knowledge, for example in geography, political science, and, especially, health care. Often, because professional training is steered toward computer programming and database administration, its applicability to substantive knowledge is neglected. New technology and methods should be wedded to substantive knowledge and experience. Young professionals need to carefully weigh whether they want to become “data scientists” per se, or experts with sound “data science” backgrounds.

One Comment

Mark Hammer

Let me put in a word for “measurement science.” Not to take anything away from the data science folks, but the data is always in service of measuring something. The data itself, no matter how big or small or exotic, is meaningless unless it yields valid, meaningful, and actionable measurement of something important or potentially mission-critical. The gulf between the data people and the policy folks or senior management can generally be found with respect to the one group not having any sense of how to best measure what is important in their decision-making, and the other group not having much sense of how to anticipate the data needs of the other.

So while pursuing training in all the technical stuff required to gather and manage the data itself, consider training in research methodology, so that you can discern when and how the data you have can provide a valid and insightful answer to the questions that need to be answered, and how to design your data systems to provide better tests of hypotheses and better answers.