Thoughts on Data Mining

March 10, 2012

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information (see prior blogs including The Data Information Hierarchy series). The term is overused and conjures impressions that do not reflect the true state of the industry. Knowledge Discovery from Databases (KDD) is more descriptive and not as misused – but the base meaning is the same.

Nevertheless, this definition is very general and does not convey the different aspects of data mining / knowledge discovery.

The basic types of Data Mining are:

  • Descriptive data mining, and
  • Predictive data mining

Descriptive Data Mining generally seeks groups, subgroups and clusters. Algorithms are developed that draw associative relationships from which actionable results may be derived (e.g., a snake with a diamond-shaped head should be considered poisonous).

Generally, a descriptive data mining result will appear as a series of if – then – elseif – then … conditions. Alternatively, a system of scoring may be used, much like some magazine-based self-assessment exams. Regardless of the approach, the end result is a clustering of the samples with some measure of quality.
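To make the two forms concrete, here is a minimal sketch in Python built on the snake example above. The traits, point values and cutoff are hypothetical, chosen only for illustration and not taken from any real data set.

```python
# Hypothetical traits and point values, for illustration only.

def classify_by_rules(head_shape, has_rattle):
    """Rule-chain form: if - then - elseif - then ..."""
    if has_rattle:
        return "treat as poisonous"
    elif head_shape == "diamond":
        return "treat as poisonous"
    else:
        return "unknown"


def classify_by_score(head_shape, has_rattle, slit_pupils):
    """Scoring form: add points per trait, then compare to a cutoff."""
    score = 0
    score += 2 if head_shape == "diamond" else 0
    score += 3 if has_rattle else 0
    score += 1 if slit_pupils else 0
    label = "treat as poisonous" if score >= 2 else "unknown"
    return label, score


print(classify_by_rules("diamond", False))
print(classify_by_score("round", False, True))
```

In the scoring form the score itself can serve as a rough measure of quality: the further a sample sits from the cutoff, the more confident the grouping.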

Predictive Data Mining then analyzes previous data to derive a prediction of the next outcome. For example: newly incorporated businesses tend to look for credit card merchant solutions. This may seem obvious, but someone had to discover this tendency – and then exploit it.
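One very simple way to derive such a tendency is to count how often each outcome followed each condition in past records. The sketch below assumes a made-up history of (condition, outcome) pairs; the labels and counts are invented for illustration.

```python
from collections import Counter


def outcome_rates(history):
    """Estimate P(outcome | condition) from (condition, outcome) pairs."""
    totals = Counter(cond for cond, _ in history)
    pairs = Counter(history)
    return {(cond, out): n / totals[cond] for (cond, out), n in pairs.items()}


# Made-up history, not real data.
history = (
    [("new_business", "merchant_account")] * 3
    + [("new_business", "none")]
    + [("established", "none")] * 2
)

rates = outcome_rates(history)
print(rates[("new_business", "merchant_account")])
```

A real predictive model would use many more attributes and guard against small sample sizes, but the idea is the same: past frequencies become the prediction for the next case.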

Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature: 1) massive data collection, 2) powerful multiprocessor computers, and 3) data mining algorithms (

Kurt Thearling identifies five types of data mining (definitions taken from Wikipedia):

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a Probability model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
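As a rough sketch of the decision-analysis use, the Python below evaluates a small tree by expected value. The tuple-based tree representation and the payoffs are my own assumptions for illustration, not a standard format.

```python
def expected_value(node):
    """Evaluate a tree where a node is one of:
    - a number (leaf payoff)
    - ("chance", [(probability, child), ...])
    - ("decision", {choice_name: child, ...})
    """
    if isinstance(node, (int, float)):
        return node
    kind, branches = node
    if kind == "chance":
        # Chance node: probability-weighted average of the children.
        return sum(p * expected_value(child) for p, child in branches)
    if kind == "decision":
        # Decision node: assume the best available choice is taken.
        return max(expected_value(child) for child in branches.values())
    raise ValueError(f"unknown node kind: {kind!r}")


def best_choice(decision_node):
    """Return the name of the choice with the highest expected value."""
    _, branches = decision_node
    return max(branches, key=lambda name: expected_value(branches[name]))


# Hypothetical payoffs: launching is a 50/50 gamble, holding is a sure thing.
tree = ("decision", {
    "launch": ("chance", [(0.5, 100), (0.5, -40)]),
    "hold": 10,
})

print(best_choice(tree), expected_value(tree))
```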

Nearest neighbour or shortest distance is a method of calculating distances between clusters in hierarchical clustering. In single linkage, the distance between two clusters is computed as the distance between the two closest elements in the two clusters.
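The single-linkage distance described above can be sketched in a few lines of Python. The points are made up, and Euclidean distance is an assumption; any distance function could be substituted.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest elements of two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


# Made-up 2-D points for illustration.
a = [(0, 0), (1, 0)]
b = [(4, 0), (9, 0)]

print(single_linkage(a, b))
```

Here the closest pair is (1, 0) and (4, 0), so the clusters are 3 units apart under single linkage even though their other members are much farther from each other.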

The term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
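For a feel of what one artificial neuron does, here is a minimal sketch: a weighted sum of inputs plus a bias, passed through a step function. The weights and bias are hand-picked (not learned) so that the neuron behaves like a logical AND.

```python
def neuron(inputs, weights, bias):
    """Single artificial neuron with a step activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


# Hand-picked values (an assumption for illustration): fires only when both inputs are 1.
AND_WEIGHTS = (1.0, 1.0)
AND_BIAS = -1.5

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, neuron(pair, AND_WEIGHTS, AND_BIAS))
```

A network is many of these nodes wired together, with the weights adjusted from data rather than set by hand.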

Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
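One of the simplest rule induction schemes is the "one rule" (1R) approach: for each attribute, predict the most common class for each of its values, then keep the attribute whose rule makes the fewest mistakes. The sketch below is a small version of that idea on a made-up toy table; it is one technique among many, not the only way rules are induced.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (attribute, {value: predicted_class}, error_count) for the best single-attribute rule."""
    best = None
    attributes = [key for key in rows[0] if key != target]
    for attr in attributes:
        # Count target classes seen for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        # Errors are the rows that do not match the majority class for their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Made-up observations for illustration.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

attribute, rule, errors = one_r(rows, target="play")
print(attribute, rule, errors)
```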

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
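As one concrete instance of clustering, here is a bare-bones k-means sketch on one-dimensional data. K-means is just one of many clustering methods; the points and starting centers are made up, and the fixed iteration count is a simplification.

```python
def k_means_1d(points, centers, iterations=10):
    """Very small k-means for numbers on a line. Returns (centers, groups)."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


points = [1, 2, 3, 10, 11, 12]
centers, groups = k_means_1d(points, centers=[0, 5])
print(centers, groups)
```

The objects in each resulting group sit closer to their own center than to the other one, which is the "more similar to each other" idea in the definition above.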


Data Mining and Data, Information, Knowledge

January 4, 2011

Throughout my Computer Science courses, I strive to teach the difference between data and information.

Information is what the user wanted when he asked for the data.

Seldom does an end user want reams of data or Excel spreadsheets hundreds or thousands of rows long. The user needs a discernible, digestible presentation so that he may cognitively absorb the concept being presented.

Should your supervisor ask for the 3rd quarter sales data in the mid-west, do not unload a truckload of sales receipts on his desk. He is looking for a presentation that represents the significance of the data.
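In code terms, the step from data to information is often just a summary. The sketch below assumes the receipts arrive as (state, amount) pairs; the states and figures are invented for illustration.

```python
from collections import defaultdict


def summarize_sales(receipts):
    """Collapse raw (state, amount) receipts into per-state count, total and average."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for state, amount in receipts:
        totals[state]["count"] += 1
        totals[state]["total"] += amount
    return {
        state: {**t, "average": t["total"] / t["count"]}
        for state, t in totals.items()
    }


# Invented receipts, not real sales figures.
receipts = [("Ohio", 100.0), ("Ohio", 300.0), ("Iowa", 50.0)]
summary = summarize_sales(receipts)

for state, stats in summary.items():
    print(state, stats)
```

The receipts are the data; the handful of lines printed at the end is closer to what the supervisor actually asked for.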

But Information is not the end of the story either. An excellent article by Gene Bellinger et al. extends the concept of Data –> Information to be
Data –> Information –> Knowledge –> Understanding –> Wisdom

Presenting Data as Information has long been the objective of spreadsheets, reports, and even three-dimensional presentations. But moving from Information to Knowledge has been a little more challenging.

Data Mining is a tool to move to this next level. A great overview of this is provided by Bill Palace at

But Data Mining has advanced past Knowledge into Understanding through the advancement of ‘semantics‘. (Often referred to as the Semantic Web.) The Wikipedia article provides good coverage on this topic.
