Say it in a Word Cloud! Visualizing Large Documents

What does the Defense Information Systems Agency Campaign Plan say about DISA?

The surface of the ocean may seem static, with only the ripples of waves on top, but underneath it is teeming with life, and large currents act like enormous rivers circulating and pulsing throughout. If our ocean were made of words and information, how could we digest any large-scale document and obtain a comprehensive view of its themes and undercurrents without a lot of time and effort? The Defense Information Systems Agency Campaign Plan is an important document, but it is rather long, and developing a holistic understanding of how it will help guide and lead the agency mission is quite challenging. We’d have to go beneath the surface and really get under the hood, connecting the parts to see how it all fits together. After putting it together, would the document hold true to its intended purpose, or would the contents somehow alter it?

What we need is a good method of visualizing the themes and relationships in one clear picture. After all, if a picture is worth a thousand words, we could use a word cloud to create one from thousands of them and reverse engineer the whole thing. Word clouds are an excellent way to visualize the main themes and relationships hidden in large amounts of text. On the Internet, popular social media applications use tag clouds to quickly communicate what is popular, and other sites use them to recommend tags that make information more discoverable. But these clouds did not appear one day out of the blue. They are the result of many years of research developing a complex data analysis technique called database tomography (1). In this type of analysis, word frequency and even relative proximity to other words can be used to tease out the themes present in a paper. Rather than describing the document by hand with key words, or tags in the Internet vernacular, someone could use database tomography as a shortcut to describing their document.
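To make the idea of frequency and proximity concrete, here is a minimal Python sketch. It is not database tomography itself, only an illustration of counting how often words occur and how often two words appear near each other. The function names, the sample sentence, and the window size of two words are all assumptions made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words made of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(words):
    """Count how often each word appears."""
    return Counter(words)


def cooccurrences(words, window=2):
    """Count unordered pairs of distinct words that fall within `window` positions of each other."""
    pairs = Counter()
    for i, word in enumerate(words):
        # Look only at the next `window` words so each pair is counted once per occurrence.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


# Made-up sample text, not taken from the campaign plan.
sample = "Enterprise services support the mission. Enterprise services enable the mission."
words = tokenize(sample)
freq = word_frequencies(words)
pairs = cooccurrences(words, window=2)

print(freq.most_common(3))
print(pairs.most_common(3))
```

A word cloud uses only the frequency counts, sizing each word by how often it appears. The pair counts hint at how proximity could surface relationships between themes.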

We must begin with the premise that a word cloud generated from the campaign plan would support the DISA vision statement: Leaders enabling information dominance in defense of our Nation. Finding two good word cloud generators available for public use, Wordle and Tagxedo, was a start, but pasting the text directly into the applications was considered risky because the document was labeled FOUO and not for distribution outside of DISA. To get around this inconvenience, all the text was put into an Excel table and consolidated into a single column. Clutter words (e.g., and, of, like, it) as well as dates and DISA organization names were removed because their frequency created a distraction in the cloud. The final list, totaling almost twenty thousand words, was then sorted alphabetically, making it nearly impossible to reconstruct any of the document’s original contents, and exported to a text file. Below is the picture generated with Wordle:


The original picture created with Wordle was nice, but Tagxedo had a neat option that allowed the generation of a word cloud that formed letters, spelling out DISA with the campaign plan themes.


While these are neat to look at, they are useful too. Going back to the DISA vision statement, we can take a stab at creating a sensible sentence from the most frequently used words: Providing information, developing enterprise services and capabilities while planning and supporting DoD Agency missions. It sounds pretty good, and it looks like this is how the campaign plan enables DISA to meet its vision and support the warfighter.
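For anyone who would rather script the scrubbing step than do it in Excel, the following Python sketch approximates the process described earlier: drop clutter words, dates, and organization names, sort what is left alphabetically, and then look at the most frequent words. It is an illustration under stated assumptions, not the exact procedure used for the pictures. The clutter list beyond the four words named in the post, the organization name list, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical clutter list; the post names only "and", "of", "like", and "it".
CLUTTER = {"and", "of", "like", "it", "the", "a", "to", "in"}
# Hypothetical organization names to drop; the post does not list the ones it removed.
ORG_NAMES = {"disa"}


def scrub(text, clutter=CLUTTER, org_names=ORG_NAMES):
    """Return the text's words, minus clutter and organization names, sorted alphabetically."""
    # Keeping only alphabetic tokens drops numeric dates such as 2010.
    # Month names would have to be added to the clutter list to remove them too.
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in clutter and w not in org_names]
    # Sorting destroys the original word order but leaves every word's count unchanged.
    return sorted(kept)


def top_words(words, n=5):
    """Return the n most frequent words with their counts."""
    return Counter(words).most_common(n)


# Made-up sample text, not taken from the campaign plan.
sample = (
    "DISA provides enterprise services and capabilities. "
    "Enterprise services support the mission of DoD in 2010."
)
scrubbed = scrub(sample)

# One word per line, ready to save to a text file and paste into a word cloud generator.
export_text = "\n".join(scrubbed)
print(export_text)
print(top_words(scrubbed, 2))
```

Because sorting changes only the order of the words and not how many times each appears, the scrubbed list yields the same cloud as the original text would, which is why the alphabetical list is safe to hand to a generator that sizes words by frequency.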

References:

1) “Database tomography is an information extraction and analysis system which operates on textual databases. Its primary use to date has been to identify pervasive technical thrusts and themes, and the interrelationships among these themes and sub-themes, which are intrinsic to large textual databases.” Journal of Information Science, Vol. 23, No. 4, 301-311 (1997)

2) www.wordle.net

3) http://www.tagxedo.com/app.html (requires Microsoft’s Silverlight)

* Disclaimer: As always, all thoughts and opinions expressed are my own and do not support any official position of the US Gov’t, the DoD, DISA, etc.

4 Comments

Bill Brantley

Data visualization is a great tool! Sites like Many Eyes and Wordle are great for doing graphics that really communicate your point.

Alas, these sites are also blocked by the OPM firewall! Hmmm.

Teri Centner

Yeah, but how do we get this capability behind the firewall so we can do the same neat stuff with documents that aren’t releasable to Wordle and Tagxedo?

Brock Webb

Teri, the document by itself was not releasable, but as I described, by taking all the words, putting them into a column and using a sorted list, it would be impossible to reconstruct the document. Those tools are Java apps and I don’t believe they store any of the data beyond the session anyway… I had looked into getting a tool, but the cost of getting one behind the firewall just so that I didn’t have to go through my process didn’t make sense. On the other hand, it would be nice to have one to help generate tags for my blog posts, etc…