Cloudera and Platfora Leveraged to Address Hard Challenge: What do “they” know about my network?


Editor’s note: This guest post by Wayne Wheeles focuses on a topic I’ve struggled with for over 15 years and shows great promise in addressing challenges no one else has tackled. Wayne is a Network Forensics Analytic/Enrichment Developer at Six3 Systems. – bg

For a decade now, many Network Forensics Analysts, Network Security Engineers, and Cyber Security Professionals have pondered that most interesting of questions: What do “they” know about my network? From time to time over the years, discussions about what external entities may know about the attack surface of a network start up and then fizzle out. Often, organizations collect and store a great deal of data to piece together a defensive view of a network but do not piece together what external entities know about, or have shown interest in, on that same network. Big Data offers the potential to evaluate this question in ways that were unimaginable just five years ago. New technologies and techniques enable organizations to ask: what is the known attack surface of my network? I addressed this question head-on using a variety of cyber security data sets, enrichment techniques, Cloudera CDH 4 (a Hadoop distribution), and Platfora, a relative newcomer that is one of the most powerful tools I have worked with in some time.

In this day and age it is amazing how little is known about what activities are occurring on our networks. The “they” alluded to earlier describes external entities that engage in scanning and network mapping, seeking to learn more about all aspects of a target network: what devices reside on the network, what ports are open, and what potential avenues for exploitation exist. This scanning occurs at a scale that is almost unimaginable and often goes unnoticed. For those who wonder whether this kind of network scanning is common: in the working data set used for this article, I determined that over 4000 large-scale scans of the target network occurred each year, originating from at least 95 countries worldwide.
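To make the idea of a large-scale scan concrete, here is a minimal Python sketch of one way to flag sweeping scanners: group flow records by source IP and count the distinct destination ports each source touched. The record layout, the sample values, the placeholder country codes, and the threshold are all assumptions for illustration; this is not the analytic used for this article.

```python
from collections import defaultdict

# Hypothetical flow records: (source IP, source country, destination port).
# The addresses come from documentation ranges and the country codes are
# placeholders; none of this is data from the article.
FLOWS = [
    ("198.51.100.7", "AA", 22),
    ("198.51.100.7", "AA", 23),
    ("198.51.100.7", "AA", 80),
    ("198.51.100.7", "AA", 443),
    ("203.0.113.9", "BB", 443),
    ("203.0.113.9", "BB", 443),
]


def find_sweeping_scanners(flows, min_distinct_ports):
    """Return {source_ip: country} for sources that touched at least
    min_distinct_ports different destination ports."""
    ports_by_source = defaultdict(set)
    country_by_source = {}
    for src_ip, country, dst_port in flows:
        ports_by_source[src_ip].add(dst_port)
        country_by_source[src_ip] = country
    return {
        src: country_by_source[src]
        for src, ports in ports_by_source.items()
        if len(ports) >= min_distinct_ports
    }


# The threshold of 4 only suits this toy sample; a real cutoff would be
# tuned against the network being studied.
scanners = find_sweeping_scanners(FLOWS, min_distinct_ports=4)
countries = set(scanners.values())
```

Counting the distinct values in `countries` is the same kind of roll-up that produces a "scans from N countries" figure once geographic enrichment has been applied to each flow.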

As always, the real story is told through the data: in this case, netflow data with port and geographic enrichment. In order to more effectively share the tale at scale, we worked with Platfora to explore and visualize the data. The screen shot below is of the Platfora Data Catalog, which makes it easy to look at all of the data sets available in the cluster. The data catalog provides the instrument for defining data sets and relationships between different classes of data within the cluster.


Next, we loaded a series of derivative data sets, which captured all of the major scans on the network during 2013, into the Platfora Data Catalog. From the Platfora Data Catalog, we generated a series of lenses, or views, of the data. For creating lenses, Platfora provides a wide range of functions, operators, and aggregates for working with data, which were really helpful in generating the visualizations in this blog.

Platfora provided a wide range of capabilities for preparing the data for analysis, which considerably reduced data preparation time. Once the data was prepared, the emphasis shifted to exploring and understanding it using a variety of visualization techniques. In the Platfora VizBoard below, what was of interest was not the fact that high ports (x-axis) were scanned, but rather the number of times (indicated by the color of the bars) that they were scanned by the same source IP address (y-axis). Each of the source IP addresses in the set below scanned ports of the targeted network over 1000 times in a 90-day timeframe.
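For readers who want to see the aggregation behind a view like this, the following is a rough Python sketch: count hits per (source IP, destination port) pair to get the heat map cells, then total the hits per source and keep only the heavy hitters. The sample probes, function names, and threshold are hypothetical; in the article this work was done with Platfora lenses, not with this code.

```python
from collections import Counter

# Hypothetical (source IP, destination port) pairs from an observation
# window; illustrative values only, not the article's data.
PROBES = [
    ("198.51.100.7", 61000),
    ("198.51.100.7", 61000),
    ("198.51.100.7", 61000),
    ("198.51.100.7", 61001),
    ("203.0.113.9", 61000),
]


def heatmap_cells(probes):
    """Count how many times each source IP hit each destination port."""
    return Counter(probes)


def heavy_sources(cells, min_total):
    """Return the source IPs whose total hits across all ports
    reach min_total."""
    totals = Counter()
    for (src_ip, _port), hits in cells.items():
        totals[src_ip] += hits
    return {src for src, total in totals.items() if total >= min_total}


cells = heatmap_cells(PROBES)
# A threshold of 4 fits the toy sample; the sources discussed above each
# exceeded 1000 scans in 90 days.
heavy = heavy_sources(cells, min_total=4)
```

Each entry in `cells` corresponds to one colored cell: the key gives the source IP and port, and the count drives the color.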


The heat map above shows that the source IP addresses (y-axis) not only scanned large numbers of destination ports (x-axis) on the target network, but in many instances returned between four and six times to the same port during the observation period. When building the data sets, we defined references that describe the relationships between the different types of data resident in the cluster. In the graphic above, when port 61000 is highlighted, the netflow information that served as the base data set is augmented with information from other data sets: known exploits for a given port, Intrusion Detection Signature information for a given port over time, and Intrusion Detection Signature information for a given IP address. Platfora was very useful for “following where the data will lead”, enabling the analyst to pivot in a chosen direction with all details on a port or IP address (such as bytes and packets) and to generate new derivative lenses with two clicks of a button.
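The reference relationships described here behave much like lookups keyed on port or IP address. As a minimal sketch of that enrichment step in Python, assuming simple in-memory dictionaries as the reference data sets (the table names, field names, and entries below are placeholders, not real exploit or signature data, and not Platfora's API):

```python
# Hypothetical reference data sets keyed by port or by source IP.
# All entries are placeholders for illustration.
KNOWN_EXPLOITS_BY_PORT = {61000: ["example-exploit-a"]}
IDS_SIGNATURES_BY_PORT = {61000: ["example-signature-1", "example-signature-2"]}
IDS_SIGNATURES_BY_IP = {"198.51.100.7": ["example-signature-3"]}


def enrich(flow):
    """Return a copy of a netflow record with reference data attached.
    Missing keys fall back to empty lists, so unenriched flows survive."""
    enriched = dict(flow)
    port = flow["dst_port"]
    enriched["known_exploits"] = KNOWN_EXPLOITS_BY_PORT.get(port, [])
    enriched["ids_signatures_for_port"] = IDS_SIGNATURES_BY_PORT.get(port, [])
    enriched["ids_signatures_for_ip"] = IDS_SIGNATURES_BY_IP.get(flow["src_ip"], [])
    return enriched


flow = {"src_ip": "198.51.100.7", "dst_port": 61000, "bytes": 120, "packets": 2}
enriched = enrich(flow)
```

Keeping the base record intact and attaching the reference data as extra fields is what makes it cheap to pivot: the same enriched record can be grouped by port, by IP address, or by signature.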

In review, what do “they” know about my network? Based on the analysis of the actors described above, the following observations were made: over 300 scans a month occurred; roughly 4000 large scans (sweeping scans covering a large number of ports) occurred each year; in all, over 22,500 ports were probed; and of those, no fewer than twelve ports were revisited up to ten times. Based on the analysis using Platfora, several areas were identified for additional investigation, and recommendations were made to improve the overall network security posture.

In order to put this article together, I used a four-node Hadoop cluster built using Cloudera CDH 4, IBM Pure Data for Analytics 2001, and Platfora’s exploratory BI tool for Hadoop.

Based on what I had read previously, my view of Platfora was that it was just a visualization package, but to my surprise it turned out to be a complete end-to-end data integration and visualization platform, fully integrated with Hadoop and Hive.

Finally, I would like to thank Keith McClellan and Six3 Systems for helping me pull this off, and Bob Gourley (CTO Vision) for posting my blog.
