With Datasets, Spider Impact can explore large amounts of unstructured data. With data clustering, you can unlock powerful insights by analyzing the relationships between your datasets’ multiple fields. Clustering creates profiles in your data, helping you to understand the types of records most likely to show up in your dataset.
Clustering is best explained by example. Let’s imagine that we have a dataset of customers, and we want to discover the types of people who buy our products. Each point on the scatter plots below represents a customer. Let’s imagine that the X axis is age, and the Y axis is income.
We can see that the clustering algorithm has found three clusters in the data. The three demographics of people who buy this product are young high-income people, middle-aged low-income people, and older middle-income people.
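To make the idea concrete, here is a minimal sketch of how an algorithm could arrive at those three groups. It is not Impact's implementation: it uses plain k-means with hand-picked starting centroids, and the customer values (age in years, income in thousands so both axes have a similar scale) are invented for illustration.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then move each centroid to the mean of its points."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = updated
    return labels, centroids

# Hypothetical customers: (age in years, income in thousands).
customers = [
    (24, 95), (27, 102), (29, 98),   # young, high income
    (41, 30), (45, 28), (48, 33),    # middle-aged, low income
    (63, 60), (67, 58), (70, 64),    # older, middle income
]

# Starting centroids are chosen by hand here, one per visible group.
labels, centroids = kmeans(customers, [customers[0], customers[3], customers[6]])
for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
```

Because `math.dist` works on points of any length, the same function would accept a third value per customer, such as years of education.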
Looking at two dataset fields is interesting, but now let’s imagine extending these scatter plots into a third dimension by adding a Z axis. In addition to tracking age and income, let’s say that we’re also tracking years of formal education. By seeing points in three-dimensional space, we could find even more interesting clusters of people. We could discover that our product is often purchased by older, higher-income people with little formal education, or middle-aged, low-income people with graduate degrees.
The human mind has trouble imagining data in more than three dimensions, but clustering algorithms do not. The more dimensions of data that you’re able to provide to Impact, the more powerful clustering becomes. Your datasets may have dozens of fields, and there are meaningful insights to be discovered in the relationships between them.
For an animated explanation of clustering, take a look at the clustering section of our What is BI? article.
To create a clustering field in your dataset, click the “Add” button in the Fields table on the Edit tab.
Then choose Data Clusters as the field type.
Next, choose which fields you want to cluster on and click the Analyze button.
This opens a second-level dialog showing the quality of various numbers of clusters. You can see here that 17 clusters is the best fit for our data, but that 6 clusters is almost as good.
In this situation we want to go with 6 clusters to keep things simple, so we’ll tell Impact that we want 6 clusters instead of “Auto”.
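The trade-off between the best-scoring number of clusters and a simpler one can be sketched in code. The example below is an illustration only, not Impact's implementation: it scores each candidate number of clusters with the Calinski-Harabasz index, seeds k-means with a deterministic farthest-point rule so the result is repeatable, and uses a hypothetical `simplest_good_k` helper with an invented 5% tolerance to mimic choosing a smaller number that is almost as good.

```python
import math

def farthest_point_seeds(points, k):
    """Deterministic seeding: start at the first point, then keep adding the point farthest from all seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, iterations=50):
    """Plain k-means returning one cluster label per point."""
    centroids = farthest_point_seeds(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append(
                tuple(sum(v) / len(members) for v in zip(*members)) if members else centroids[c]
            )
        if updated == centroids:
            break
        centroids = updated
    return labels

def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion; higher is better.
    Assumes at least two non-empty clusters and some spread inside them."""
    n = len(points)
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)
    overall = tuple(sum(v) / n for v in zip(*points))
    between = within = 0.0
    for members in clusters.values():
        center = tuple(sum(v) / len(members) for v in zip(*members))
        between += len(members) * math.dist(center, overall) ** 2
        within += sum(math.dist(p, center) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

def simplest_good_k(scores, tolerance=0.05):
    """Hypothetical rule: the smallest k whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(k for k, score in scores.items() if score >= best * (1 - tolerance))

# Hypothetical customers: (age in years, income in thousands).
customers = [
    (24, 95), (27, 102), (29, 98),
    (41, 30), (45, 28), (48, 33),
    (63, 60), (67, 58), (70, 64),
]

scores = {k: calinski_harabasz(customers, kmeans(customers, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
chosen_k = simplest_good_k(scores)
for k, score in scores.items():
    print(f"{k} clusters: quality {score:.1f}")
```

In this tiny dataset the best score and the simplest acceptable choice coincide. With real data, as in the example above, a much smaller number of clusters can score nearly as well as the best one.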
Finally, we’ll give each cluster a name based on its characteristics for each of the fields we’ve chosen.
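One way to think about naming is to look at each cluster's average value for every field you clustered on. The sketch below shows that idea with made-up records and labels; the field names `age` and `income` and the `cluster_profiles` helper are hypothetical and are not part of Impact.

```python
from collections import defaultdict
from statistics import mean

def cluster_profiles(records, labels, fields):
    """Average each field per cluster, giving a short profile to base a name on."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    return {
        label: {field: mean(r[field] for r in members) for field in fields}
        for label, members in grouped.items()
    }

# Hypothetical records (income in thousands) and the cluster each one fell into.
records = [
    {"age": 24, "income": 95}, {"age": 28, "income": 105},
    {"age": 44, "income": 30}, {"age": 46, "income": 32},
]
labels = [0, 0, 1, 1]

profiles = cluster_profiles(records, labels, ["age", "income"])
for label, profile in profiles.items():
    print("cluster", label, profile)
```

A cluster averaging a young age and a high income might then be named something like "Young high earners".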
We can now use our new data clusters field just like we would any other dataset field. The cluster that a record falls into is the cluster field’s value. In this example we’ve added the field to the Datasets Explore tab, but you can also use it in Reports, Charts, and Dashboards.
Spider Impact uses the k-means++ algorithm for clustering, and the quality of each candidate number of clusters is evaluated using the Calinski-Harabasz index.
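For readers curious about what k-means++ adds to ordinary k-means, the difference is in how the starting centroids are picked: the first is chosen at random, and each later one is chosen with probability proportional to its squared distance from the nearest centroid already picked, which tends to spread the starting points out. The following is a minimal sketch of that seeding step under invented sample data; it is not Impact's code, and the function name `kmeans_pp_seeds` is hypothetical.

```python
import math
import random

def kmeans_pp_seeds(points, k, rng):
    """k-means++ seeding: pick starting centroids that tend to be far apart."""
    seeds = [rng.choice(points)]  # first seed: uniformly at random
    while len(seeds) < k:
        # Weight every point by its squared distance to the closest existing seed.
        # Points that are already seeds get weight 0 and cannot be picked again.
        weights = [min(math.dist(p, s) for s in seeds) ** 2 for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

# Hypothetical customers: (age in years, income in thousands).
customers = [
    (24, 95), (27, 102), (29, 98),
    (41, 30), (45, 28), (48, 33),
    (63, 60), (67, 58), (70, 64),
]

# A seeded random generator keeps the example repeatable.
seeds = kmeans_pp_seeds(customers, 3, random.Random(0))
print(seeds)
```

After seeding, the usual k-means loop of assigning points to the nearest centroid and recomputing centroids runs as normal.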