
How English Class Helps You Understand Unstructured Data


This is a salute to my brethren from the liberal arts, and to anyone who has ever thought English composition class was useless. Why does understanding how writing happens seem so boring, even when writing itself is fun, especially when you have only 140 characters and are not bound by spelling or syntax? You may have learned in composition class how to diagram a sentence. That is the beginning of text analysis. If you also learned logic, set theory, or library science, you are well on your way.

Unstructured data is structured data, just not by the limited grammatical relationships of entity-relationship diagrams and relational databases. The grammar of ER diagrams and UML diagrams rests on a handful of predicates such as “is,” “has,” and other basic verbs. Even Codd, the guru of relational databases, wrote in 1970 that a first-order predicate calculus of logic and sets underlies the linguistic aspects of relational databases. Text analytics expands the relationships a computer can recognize, even something like the grammar of Twitter messages. Graph (RDF) databases seem almost limitless in the grammar they can represent. Programming RDF is one matter; the underlying role of grammar lives in the province of reading and writing. (See, for example, “If Algorithms Know All, How Much Should Humans Help?” Steve Lohr, April 6, 2015, International New York Times.)
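
To make the contrast concrete, here is a minimal Python sketch with invented example data. It contrasts a relational row, whose predicates are fixed in advance by the schema, with RDF-style triples, whose predicate vocabulary is as open-ended as the verbs in a sentence. The names and the toy match() helper are illustrative, not any particular product’s API.

```python
# Relational thinking: the schema decides in advance which relationships exist.
employee_row = {"id": 42, "name": "Ada", "department": "Analytics"}  # "has a department"

# Graph (RDF) thinking: every statement is a subject-predicate-object triple,
# so new kinds of relationships can be added without changing a schema.
triples = [
    ("Ada",       "worksIn",  "Analytics"),
    ("Ada",       "mentions", "colorless green ideas"),
    ("Analytics", "likes",    "Twitter data"),   # verbs beyond "is" and "has"
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the non-None positions (a toy, SPARQL-ish query)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(match(triples, subject="Ada"))   # everything said about "Ada"
```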

Chomsky’s arguments in “Aspects of the Theory of Syntax” are invaluable in understanding this logic. Put too simply, the object of this theory, otherwise known as “deep structure,” is the sentence’s components of nouns, verbs, and objects, plus the structures of adjectives, adverbs, prepositions, conjunctions, and so on. Chomsky’s famous sentence, “Colorless green ideas sleep furiously,” has a structure, as he argued in his 1957 Syntactic Structures. The structure makes sense, but the illusory meaning (compared to proper categories and taxonomies) is nonsense. There is a structure to sentences, plus rules about relationships and hierarchies. “Unstructured” and “non-relational” data are indeed structured, just not exactly as Codd originally argued for the superiority of the relational approach.
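
For readers who want to see that point mechanically, the sketch below is one way to do it, assuming the NLTK toolkit and a tiny hand-written grammar of my own invention. The parse of Chomsky’s sentence succeeds because the structure is well formed, even though the meaning is nonsense.

```python
import nltk

# A toy grammar just large enough to cover the famous sentence.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> ADJ NP | N
    VP  -> V ADV
    ADJ -> 'colorless' | 'green'
    N   -> 'ideas'
    V   -> 'sleep'
    ADV -> 'furiously'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("colorless green ideas sleep furiously".split()):
    tree.pretty_print()   # the sentence has a well-formed deep structure...
# ...but nothing in the grammar can say whether "colorless green ideas" sleep.
```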

What differs with text databases is that these relationships can be “discovered” automatically, bypassing the step of converting text into a relational data model and loading it into a database before statistical analysis. A variety of software can do this. On one hand, this seems to dispense with the “human” intervention those steps require. Yet only people can make sense of the results. For example, one company claims its software can take assorted text and generate narratives or stories. But that such a collection of words is a story only makes sense in light of socio-linguistics and semantics.
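
As a rough illustration of what “discovered automatically” means in practice, here is a small Python sketch using invented snippets. It counts word co-occurrences straight from raw text, with no relational model built first; the associations it finds are purely statistical, and a reader still has to decide whether they mean anything.

```python
from collections import Counter
from itertools import combinations

snippets = [
    "green ideas sleep furiously",
    "green tea helps ideas",
    "tea and ideas at the library",
]

# Count how often two words appear in the same snippet: a crude association measure.
pair_counts = Counter()
for text in snippets:
    words = sorted(set(text.split()))
    pair_counts.update(combinations(words, 2))

# The top pairs are "relationships" only in a statistical sense.
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```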

In the past, empirical observation was to be codified in words, or at least the attempt was made. Now words, eventually loaded into a relational, non-relational (in the narrow sense), or graph database, are supposed to reveal empirical facts. That is supposed to be “discovery” of information based on quantifiable associations, which are taken to lead to facts. As a fictional example, one book on text analysis does not hesitate to treat correlations “discovered” this way as indicative of patterns in texts by “terrorists.” The imaginary story makes text analysis sound tantalizing for “discovering” terrorist data. But the same pattern could just as well connect hikers, bikers, lovers, or diners. Is that what we want to hear about the power of IT?
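
A small arithmetic sketch of that point, with invented counts and a toy pointwise-mutual-information-style score: identical counts produce identical “discoveries,” whether the terms come from a thriller about terrorists or from hikers’ trail notes. The arithmetic cannot tell the difference.

```python
import math

def association(co_occurrences, count_a, count_b, total_docs):
    """Toy PMI-style score from document counts: log( p(a,b) / (p(a) * p(b)) )."""
    p_ab = co_occurrences / total_docs
    p_a = count_a / total_docs
    p_b = count_b / total_docs
    return math.log(p_ab / (p_a * p_b))

# Identical counts, different vocabularies, identical "discovery".
print(association(co_occurrences=8, count_a=10, count_b=12, total_docs=100))  # "cell", "target"
print(association(co_occurrences=8, count_a=10, count_b=12, total_docs=100))  # "trail", "summit"
```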

IT’s fundamental contradiction is that words have become their own enemy. After millennia of trying to put facts into words, we now think we can turn words into facts automatically. What we should have learned in English composition class is that grammar is just the beginning of the story of words. A machine cannot discover the meaning of words. There is more to that story in theories of language.

The simplicity of English composition class short-circuits a theory of language by explaining so much through grammar and vocabulary alone. Yet that simplification is what prepares us for text analytics. The basics of grammar persist in the programming, while analyzing the results becomes more complex. Text analytics continues the work of library science to reproduce reading, writing, and translation by machine. Nevertheless, a race continues between increasingly complex machines and increasingly complex views of grammar. It is an odd task to account for an infinite combination of words, and the machines will never catch up. As Jorge Luis Borges wrote in An Investigation of the Word, “Language is nourished not by original intuitions – there are a few – but by variations, happenstance, [and] mischief.”

Dennis Crow is part of the GovLoop Featured Blogger program, where we feature blog posts by government voices from all across the country (and world!). To see more Featured Blogger posts, click here.
