For the most part, data mining tells us, about very large and complex data sets, the kinds of information that would be readily apparent in small and
simple ones. For example, it can tell us that "one of these things is not like the other,"
à la Sesame Street, or it can show us categories and then sort things into predetermined categories. But what's simple with 5 data points is not so simple with 5 billion.
And these days, there's always more data. We gather far more of it than we
can digest. Nearly every transaction or interaction leaves a data signature that someone somewhere is capturing
and storing. This is, of course, true on the internet, but ubiquitous computing and digitization have made it increasingly true of our
lives away from our computers (do we still have those?). The sheer scale of this data has far exceeded human sense-making capabilities. At these
scales, patterns are often too subtle, and relationships too complex or multi-dimensional, to observe by simply looking at the data. Data mining is a
means of automating part of this process to detect interpretable patterns; it helps us see the forest without getting lost in the trees.
Discovering information from data takes two major forms: description and prediction. At the scale we are talking about, it is hard to know what the
data shows. Data mining is used to simplify and summarize the data in a manner that we can understand (description), and it then allows us to infer things about specific
cases based on the patterns we have observed (prediction). Of course, specific applications of data mining methods are limited by the data and computing power
available, and are tailored for specific needs and goals. However, there are several main types of pattern detection that are commonly used.
These general forms illustrate what data mining can do.
Anomaly detection: in a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if
something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific
returns that differ from this model and flag them for review and audit.
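The idea of modeling the typical case and flagging what deviates from it can be sketched in a few lines. This is only a minimal illustration, assuming a made-up list of deduction amounts and one simple technique (the modified z-score, which measures distance from the median in units of the median absolute deviation). The data, the function name `modified_z_scores`, and the 3.5 cutoff are all assumptions for the example, not a description of how the IRS or any real system works.

```python
from statistics import median

def modified_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD.

    The median and the median absolute deviation (MAD) are used instead of
    the mean and standard deviation because a single extreme value would
    otherwise distort the very "typical case" we are comparing against.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # Simplification for this sketch: no spread means nothing is scored.
        return [0.0 for _ in values]
    return [0.6745 * (v - center) / mad for v in values]

# Hypothetical deduction amounts claimed on ten tax returns.
deductions = [4200, 3900, 4100, 4500, 3800, 4300, 4000, 4400, 4150, 25000]

scores = modified_z_scores(deductions)

# 3.5 is a commonly quoted rule-of-thumb cutoff, treated here as an assumption.
flagged = [
    (index, value)
    for index, (value, score) in enumerate(zip(deductions, scores))
    if abs(score) > 3.5
]

print(flagged)
```

With this sample data only the last return is flagged. A real system would model many features of a return at once, not a single number.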
Association learning: this is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail
shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising.
Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.
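The cocktail-shaker example can be made concrete with two standard measures: support (how often a set of items is bought together) and confidence (how often the extra item shows up given that the first items were bought). The sketch below is a toy illustration on five invented shopping baskets. The basket contents and the helper names `support` and `confidence` are assumptions for this example, and real recommendation systems are far more elaborate.

```python
# Hypothetical shopping baskets, one set of items per customer order.
baskets = [
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book"},
    {"recipe book", "garden hose"},
    {"martini glasses"},
]

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    antecedent = set(antecedent)
    matching = [basket for basket in baskets if antecedent <= basket]
    if not matching:
        return 0.0
    return sum(1 for basket in matching if consequent in basket) / len(matching)

rule_from = {"cocktail shaker", "recipe book"}
rule_to = "martini glasses"

rule_support = support(baskets, rule_from | {rule_to})
rule_confidence = confidence(baskets, rule_from, rule_to)

print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high confidence and reasonable support is a candidate for a "customers also bought" suggestion or a targeted coupon.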
Cluster detection: one type of pattern recognition that is particularly useful is recognizing distinct clusters or sub-categories within the data. Without data mining, an
analyst would have to look at the data and decide on a set of categories which they believe captures the relevant distinctions between apparent groups
in the data. This would risk missing important categories. With data mining it is possible to let the data itself determine the groups. This
is one of the black-box types of algorithm that are hard to understand. But in a simple example, again with purchasing behavior, we can imagine that
the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all
be quite distinct. Machine learning algorithms can detect subgroups like these within a dataset when they differ significantly from each other.
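To make the hobbyist example less of a black box, here is a small sketch of k-means, one common clustering technique, chosen here only for illustration. It assumes each customer is described by three invented numbers (spending on gardening, fishing, and model airplane supplies) and that we ask for three groups. The data, the function name `kmeans`, and the farthest-point starting rule are assumptions made for this example.

```python
import math

def kmeans(points, k, iterations=20):
    """Group points into k clusters with plain k-means (Lloyd's algorithm)."""
    # Deterministic start: take the first point, then repeatedly add the
    # point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return labels, centroids

# Hypothetical yearly spend per customer: (gardening, fishing, model airplanes).
customers = [
    (120, 5, 0), (95, 0, 10), (110, 10, 5),    # mostly gardening purchases
    (5, 130, 0), (0, 105, 10), (10, 115, 5),   # mostly fishing purchases
    (0, 10, 140), (5, 0, 100), (10, 5, 125),   # mostly model airplane purchases
]

labels, centroids = kmeans(customers, k=3)
print(labels)
```

Nobody tells the algorithm what a "gardener" is. It only sees that some customers' numbers sit close together. Note that the number of groups is still an input here, and choosing it well is part of what makes clustering hard in practice.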