Data science has many analytical methods, such as K-Means Clustering, the Naïve Bayes Classifier, and Random Forests, and their names can make the field sound very mysterious. The math behind the methods can be complex, but for many of them the concepts are easy to understand, especially when you know what question the method is being used to answer.
There are six common questions in data science:
- Clustering – Are there similarities in the data which I can use to create groups?
- Association Rules – Are there relationships between data items?
- Regression – What is the relationship between input(s) and an outcome?
- Classification – How can I label specific data items?
- Time Series Analysis – What role does time play in the structure or behavior of data items?
- Text Analysis – Are there patterns and/or relationships in text data?
Let’s examine a common method used for clustering analysis: K-Means Clustering. Imagine that you have a dataset of visitors to your government web site. You have launched some new features, and you want to group visitors by the features they use most often.
The “K” in K-Means Clustering is the number of clusters into which the data is grouped. Usually you don’t know ahead of time what K should be, so you will often group the data several times, trying different values. In this case, you could start by setting K equal to the number of new features. Each cluster has a center, which is the mean of the data items assigned to it (hence the “Means” in the name). You determine the group that a particular data item belongs to by how near it is to the center of a given cluster: the item joins the cluster whose center is closest, the centers are recalculated, and the process repeats until the groups stop changing.
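The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the visitor data (age, visits per month), the function names, and the choice to start from the first K data items are all assumptions made for the example. Real tools usually pick the starting centers at random and run the grouping more than once.

```python
def squared_distance(a, b):
    """Squared straight-line distance between two data items."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Center of a group: the average of each characteristic."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k, max_rounds=100):
    # Assumption for this sketch: use the first k items as starting centers
    # so the result is repeatable.
    centers = list(points[:k])
    labels = None
    for _ in range(max_rounds):
        # Step 1: put each item in the group whose center is nearest.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing moved, so the groups are stable
        labels = new_labels
        # Step 2: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = mean_point(members)
    return labels, centers


# Hypothetical visitors: (age, visits per month)
visitors = [(22, 3), (45, 12), (67, 2), (25, 4), (48, 14),
            (70, 1), (21, 5), (43, 13), (66, 3)]

labels, centers = k_means(visitors, k=3)
for visitor, label in zip(visitors, labels):
    print(visitor, "-> group", label)
print("group centers:", centers)
```

Because age and visits per month are on different scales, a real analysis would normally rescale the characteristics first so that no single one dominates the distance.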
In the diagram below, K equals three, and each data item has one of three colors indicating the group it belongs to. If the groups correspond to the web site features, you can then look for the characteristics that each feature’s users have in common (age, how they access the feature, profession, or other characteristics you have captured). As you can see from the diagram, there is some overlap between the groups, which indicates that you may want to try another K value to tighten the clustering around the group means.
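One way to put a number on how tightly the data clusters around the group means is to add up the squared distance from every data item to the mean of its own group. The sketch below does this for two hand-labeled groupings of the same made-up visitors (age, visits per month). The data, the labels, and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def within_group_spread(points, labels):
    """Sum of squared distances from each item to its own group's mean."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    total = 0.0
    for members in groups.values():
        # The group mean: the average of each characteristic.
        center = [sum(values) / len(members) for values in zip(*members)]
        total += sum(
            (value - mean) ** 2
            for point in members
            for value, mean in zip(point, center)
        )
    return total


# Hypothetical visitors: (age, visits per month)
visitors = [(22, 3), (45, 12), (67, 2), (25, 4), (48, 14),
            (70, 1), (21, 5), (43, 13), (66, 3)]

# Two candidate groupings of the same visitors (labels assigned by hand).
labels_k2 = [0, 0, 1, 0, 0, 1, 0, 0, 1]   # K = 2: two groups merged
labels_k3 = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # K = 3: three separate groups

spread_k2 = within_group_spread(visitors, labels_k2)
spread_k3 = within_group_spread(visitors, labels_k3)
print("K=2 spread:", spread_k2)
print("K=3 spread:", spread_k3)
```

A smaller number means tighter groups. Keep in mind that this measure tends to keep shrinking as K grows, so a common rule of thumb is to look for the K value after which the improvement levels off rather than simply choosing the smallest number.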
Next week: The Bayes Rule and How It Fights Spam.