
Thoughts on Data Mining

March 10, 2012

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information (see prior blogs, including The Data Information Hierarchy series). The term is overused and conjures impressions that do not reflect the true state of the industry. Knowledge Discovery in Databases (KDD) is more descriptive and not as misused – but the base meaning is the same.

Nevertheless, this is a very general definition and does not convey the different aspects of data mining / knowledge discovery.

The basic types of Data Mining are:

  • Descriptive data mining, and
  • Predictive data mining

Descriptive Data Mining generally seeks groups, subgroups, and clusters. Algorithms are developed that draw associative relationships from which actionable results may be derived (e.g., a diamond-headed snake should be considered venomous).

Generally, a descriptive data mining result will appear as a series of if – then – elseif – then … conditions. Alternatively, a system of scoring may be used, much like some magazine-based self-assessment exams. Regardless of the approach, the end result is a clustering of the samples with some measure of quality.
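To make this concrete, here is a minimal sketch (in Python) of what such an if – then style result might look like once coded. The customer fields, thresholds, and segment names are invented purely for illustration.

```python
# A minimal sketch of a descriptive-mining result expressed as if/then rules.
# The field names, thresholds, and segment labels are hypothetical.

def assign_segment(customer):
    """Place a sample into a descriptive segment (cluster)."""
    if customer["visits_per_month"] >= 8 and customer["avg_ticket"] >= 50:
        return "loyal-high-value"
    elif customer["visits_per_month"] >= 8:
        return "loyal-low-ticket"
    elif customer["avg_ticket"] >= 50:
        return "occasional-big-spender"
    else:
        return "infrequent"

samples = [
    {"visits_per_month": 10, "avg_ticket": 75},
    {"visits_per_month": 2,  "avg_ticket": 20},
]
print([assign_segment(s) for s in samples])   # ['loyal-high-value', 'infrequent']
```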

Predictive Data Mining, in contrast, performs analysis on previous data to derive a prediction of the next outcome. For example: newly incorporated businesses tend to look for credit card merchant solutions. This may seem obvious, but someone had to discover this tendency – and then exploit it.
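On the predictive side, a hedged sketch using scikit-learn: fit a model on past outcomes, then score a new case. The features and figures below are invented, not actual merchant data.

```python
# Predictive data mining sketch: learn from previous outcomes, predict the next one.
# Data and feature names are hypothetical, for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row: [months_since_incorporation, monthly_card_volume_in_$k]
X = [[1, 5], [2, 8], [3, 2], [12, 1], [24, 0], [36, 1]]
y = [1, 1, 1, 0, 0, 0]   # 1 = sought a credit card merchant solution, 0 = did not

model = LogisticRegression().fit(X, y)

# Likelihood that a two-month-old business with modest card volume will seek one.
print(model.predict_proba([[2, 6]])[0][1])
```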

Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature: 1) massive data collection, 2) powerful multiprocessor computers, and 3) data mining algorithms (http://www.thearling.com/text/dmwhite/dmwhite.htm).

Kurt Thearling identifies five types of data mining (definitions taken from Wikipedia):

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a Probability model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
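As an illustration only, a toy decision tree can be fit and its if – then structure printed with scikit-learn; the credit-approval features and labels below are made up.

```python
# Fit a small decision tree on invented data and print its rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [credit_score, years_in_business]; label: 1 = approve, 0 = decline
X = [[580, 1], [700, 3], [640, 10], [720, 8], [600, 2], [680, 6]]
y = [0, 1, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["credit_score", "years_in_business"]))
```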

Nearest neighbour or shortest distance is a method of calculating distances between clusters in hierarchical clustering. In single linkage, the distance between two clusters is computed as the distance between the two closest elements in the two clusters.
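A short single-linkage (nearest neighbour) example using SciPy's hierarchical clustering on arbitrary two-dimensional points:

```python
# Single-linkage hierarchical clustering: cluster distance is the distance
# between the two closest members of the clusters.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.5]]

Z = linkage(points, method="single")            # build the cluster hierarchy
labels = fcluster(Z, t=3, criterion="maxclust") # cut it into three clusters
print(labels)                                   # e.g. [1 1 2 2 3]
```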

The term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
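For a flavor of the artificial variety, a minimal sketch using scikit-learn's multi-layer perceptron on the classic XOR toy problem (which a single linear model cannot separate):

```python
# A tiny artificial neural network (multi-layer perceptron) on XOR.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                     # XOR of the two inputs

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=5000, random_state=0).fit(X, y)
print(net.predict(X))                # typically recovers [0 1 1 0]
```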

Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
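A from-scratch sketch of one very simple form of rule induction – a OneR-style learner that keeps the single attribute whose if – then rules make the fewest mistakes. The weather-style observations are invented.

```python
# OneR-style rule induction: for each attribute, map each value to its majority
# class, then keep the attribute whose rules misclassify the fewest observations.
from collections import Counter, defaultdict

observations = [
    {"outlook": "sunny",  "windy": "no",  "play": "yes"},
    {"outlook": "sunny",  "windy": "yes", "play": "no"},
    {"outlook": "rainy",  "windy": "yes", "play": "no"},
    {"outlook": "rainy",  "windy": "no",  "play": "yes"},
    {"outlook": "cloudy", "windy": "no",  "play": "yes"},
]

def one_rule(rows, target):
    best_attr, best_rules, best_errors = None, None, len(rows) + 1
    for attr in (a for a in rows[0] if a != target):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

print(one_rule(observations, target="play"))   # ('windy', {'no': 'yes', 'yes': 'no'})
```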

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
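And a brief k-means illustration with scikit-learn; the points are arbitrary.

```python
# Partition arbitrary points into three clusters with k-means.
from sklearn.cluster import KMeans

points = [[1, 1], [1.2, 0.8], [5, 5], [5.1, 5.3], [9, 1], [8.8, 1.1]]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.labels_)   # objects in the same cluster share a label, e.g. [0 0 1 1 2 2]
```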


Artificial Intelligence vs Algorithms

February 9, 2012

I first considered aspects of artificial intelligence (AI) in the 1980s while working for General Dynamics as an Avionics Systems Engineer on the F-16. Over the following 3 decades, I continued to follow the concept until I made a realization – AI is just an algorithm. Certainly the goals of AI will one day be reached, but the manifestation metric of AI is not well defined.


Consider the Denver International Airport. The baggage handling system was state of the art and touted as AI-based; it delayed the airport’s opening by 16 months and cost $560M to fix. In the end, the entire system was replaced with a more stable system based not on a learning or deductive system, but upon much more basic routing and planning algorithms which may be deterministically designed and tested.

Consider the Houston traffic light system. Mayors have been elected on the promise to apply state-of-the-art computer intelligence: interconnected traffic lights, traffic prediction, automatic traffic redirection. Yet the desired AI resulted in identifiable computer algorithms with definitive behavior and expectations. Certainly an improvement, but not a thinking machine. The closest thing to automation is the remote triggering feature used by the commuter rail and emergency vehicles.

So algorithms form the basis for computer advancement. And these algorithms may be applied, with human interaction, to learn the new lessons so necessary to achieving behavioral improvement with computers. Toward this objective, distinct fields of study are untangling interrelated elements – clustering, neural networks, case-based reasoning, and predictive analytics are just a few.

When AI can be achieved, it will be revolutionary. But until that time, deterministic algorithms, data mining, and predictive analytics will be at the core of qualitative and quantitative advancement.

The Value of Real-Time Data

August 31, 2011

Real-time data is a challenge to any process-oriented operation. But the functionality of the data is difficult to describe in a way that team members not well versed in data management can appreciate. Toward that end, four distinct phases of data have been identified:

  1. Real-Time: streaming data
    visualized and considered – system responds
  2. Forensic: captured data
    archived, condensed – system learns
  3. Data Mining: consolidated data
    hashed and clustered – system understands
  4. Predictive Analytics: patterned data
    compared and matched – system anticipates

A more detailed explanation of these phases follows:

Control and Supervision: Real-time data is used to provide direct HMI (human-machine-interface) and permit the human computer to monitor / control the operations from his console. The control and supervision phase of real-time data does not, as part of its function, record the data. (However, certain data logs may be created for legal or application development purposes.) Machine control and control feedback loops require, as a minimum, real-time data of sufficient quality to provide steady operational control.
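As a rough illustration only (not a description of any particular control system), the control and supervision phase might reduce to a loop like the sketch below; the sensor, limits, setpoint, and gain are all hypothetical.

```python
# A minimal control-and-supervision loop: read a real-time value, apply an
# elemental quality gate, and drive a proportional feedback correction.
# Nothing is recorded or archived in this phase.
import random
import time

SETPOINT, GAIN = 100.0, 0.2           # illustrative setpoint and loop gain

def read_sensor():
    # Stand-in for a real-time data source of "sufficient quality".
    return 100.0 + random.uniform(-10.0, 10.0)

def quality_ok(value):
    # Elemental quality check: reject values outside a plausible physical range.
    return 0.0 < value < 500.0

def control_step(output):
    reading = read_sensor()
    if not quality_ok(reading):
        return output                  # hold the last output on bad data
    return output + GAIN * (SETPOINT - reading)

output = 0.0
for _ in range(5):
    output = control_step(output)
    time.sleep(0.1)                    # stand-in for the control-loop period
    print(round(output, 2))
```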

Forensic Analysis and Lessons Learned: Captured data (and, to a lesser extent, data and event logs) are utilized to investigate specific performance metrics and operations issues. Generally, this data is kept in some form for posterity, but it may be filtered, processed, or purged. Nevertheless, the forensic utilization does represent post-operational analytics. Forensic analysis is also critical to prepare an operator for an upcoming similar process – similar in function, geography, or sequence.

Data Mining: Data mining is used to research previous operational events to locate trends, identify areas for improvement, and prepare for upcoming operations. Data mining is used to identify a bottleneck or problem area as well as to correlate events that are less than obvious.

Proactive / Predictive Analytics: The utilization of data streams, both present and previous, in an effort to predict the immediate (or distant) future requires historical data, data mining, and the application of learned correlations. Data mining may provide correlated events and properties, but predictive analytics provides the conversion of those correlations into positive, immediate performance and operational changes. (This utilization is not, explicitly, artificial intelligence (AI), but the two are closely related.)

The Big Crew Change

May 17, 2011

“The Big Crew Change” is an approaching event within the oil and gas industry when the mantle of leadership will move from the “calculators and memos” generation to the “connected and Skype” generation. In a blog post 4 years ago, Rembrandt observed:

“The retirement of the workforce in the industry is normally referred to as “the big crew change”. People in this sector normally retire at the age of 55. Since the average age of an employee working at a major oil company or service company is 46 to 49 years old, there will be a huge change in personnel in the coming ten years, hence the “big crew change”. This age distribution is a result of the oil crises in ‘70s and ‘80s as shown in chart 1 & 2 below. The rising oil price led to a significant increase in the inflow of petroleum geology students which waned as prices decreased.”

Furthermore, a Society of Petroleum Engineers study found:

“There are insufficient personnel or ‘mid-careers’ between 30 and 45 with the experience to make autonomous decisions on critical projects across the key areas of our business: exploration, development and production. This fact slows the potential for a safe increase in production considerably.”

A study undertaken by Texas Tech University makes several points about the state of education and the employability of graduates during this crew change:

  • Employment levels at historic lows
  • 50% of current workers will retire in 6 years
  • Job prospects: ~100% placement for the past 12 years
  • Salaries: Highest major in engineering for new hires

The big challenge: Knowledge Harvesting. “The loss of experienced personnel combined with the influx of young employees is creating unprecedented knowledge retention and transfer problems that threaten companies’ capabilities for operational excellence, growth, and innovation.” (Case Study: Knowledge Harvesting During the Big Crew Change).

In a blog by Otto Plowman, “Retaining knowledge through the Big Crew Change”, we see that

“Finding a way to capture the knowledge of experienced employees is critical, to prevent “terminal leakage” of insight into decisions about operational processes, best practices, and so on. Using optimization technology is one way that producers can capture and apply this knowledge.”

When the retiring workforce fails to convey the important (critical) lessons learned, the gap is filled by data warehouses, knowledge systems, adaptive intelligence, and innovation. Perhaps the biggest challenge is innovation. Innovation will drive the industry through the next several years. Proactive intelligence, coupled with terabyte upon terabyte of data, will form the basis.

The future: the nerds will take over from the wildcatters.

Real-Time Data in an Operations/Process Environment

May 16, 2011

The operations/process environment differs from the administrative and financial environments in that operations is charged with getting the job done. As such, the requirements placed on computers, information systems, instrumentation, controls, and data are different too. Data is never ‘in balance’, data always carries uncertainty, and the process cannot stop. Operations personnel have learned to perform their jobs while waiting for systems to come online, waiting for systems to upgrade, or even waiting for systems to be invented.

Once online, systems must be up 100% of the time, but aren’t. Systems must process data from a myriad of sources, but those sources are frequently intermittent or sporadic. Thus the processing, utilization, storage, and analysis of real-time data is a challenge totally unlike the systems seen in administrative or financial operations.

Real time systems must address distinct channels of data flow – from the immediate to the analysis of terabytes of archived data.

Control and Supervision: Real-time data is used to provide direct HMI (human-machine-interface) and permit the human computer to monitor / control the operations from his console. The control and supervision phase of real-time data does not, as part of its function, record the data. (However, certain data logs may be created for legal or application development purposes.) Machine control and control feedback loops require, as a minimum, real-time data of sufficient quality to provide steady operational control.

Forensic Analysis and Lessons Learned: Captured data (and, to a lesser extent, data and event logs) are utilized to investigate specific performance metrics and operations issues. Generally, this data is kept in some form for posterity, but it may be filtered, processed, or purged. Nevertheless, the forensic utilization does represent post-operational analytics. Forensic analysis is also critical to prepare an operator for an upcoming similar process – similar in function, geography, or sequence.

Data Mining: Data mining is used to research previous operational events to locate trends, identify areas for improvement, and prepare for upcoming operations. Data mining is used to identify a bottleneck or problem area as well as to correlate events that are less than obvious.

Proactive / Predictive Analytics: The utilization of data streams, both present and previous, in an effort to predict the immediate (or distant) future requires historical data, data mining, and the application of learned correlations. Data mining may provide correlated events and properties, but predictive analytics provides the conversion of those correlations into positive, immediate performance and operational changes. (This utilization is not, explicitly, artificial intelligence (AI), but the two are closely related.)

The data-information-knowledge-understanding-wisdom paradigm: Within the data –> wisdom paradigm, real-time data is just that – data. The entire tree breaks out as:

  • data – raw, untempered data from the operations environment (elemental data filtering and data quality checks are, nevertheless, required).
  • information – presentation of the data in human comprehensible formats – the control and supervision phase of real-time data.
  • knowledge – forensic analytics, data mining, and correlation analysis
  • understanding – proactive and forward-looking changes in behavior characteristic of the proactive / predictive analytics phase.
  • wisdom – the wisdom phase remains the domain of the human computer.

Related Posts:

Data Mining and Data, Information, Understanding, Knowledge
https://profreynolds.wordpress.com/2011/01/30/data-mining-and-data-information-understanding-knowledge/

The Digital Oilfield, Part 1
https://profreynolds.wordpress.com/2011/01/30/the-digital-oilfield-part-1/

The Data-Information Hierarchy
https://profreynolds.wordpress.com/2011/01/31/the-data-information-hierarcy/

Multi-Nodal, Multi-Variable, Spatio-Temporal Datasets

April 21, 2011

Multi-Nodal, Multi-Variable, Spatio-Temporal Datasets are large-scale datasets encountered in real-world data-intensive environments.

Example Dataset #1

A basic example would be the heat distribution within a chimney at a factory. Heat sensors are distributed throughout the chimney and readings are taken at periodic intervals. Since the laws of thermodynamics within a chimney are well understood, the interaction between the monitoring devices can be modeled. Predictive analysis could, conceivably, be performed on the dataset, and chimney cracks could be detected, or even predicted, in real time.

In this scenario, data points consist of 1) multiple sensors or data acquisition devices, 2) multiple spatial locations, and 3) temporally separated samples. When a sensor fails, it is simply removed from the processing and kept out until it is repaired (during plant maintenance).
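One plausible way to lay out such readings – a hedged sketch with assumed column names – is a long table of (sensor, location, timestamp, value) rows from which failed sensors are simply excluded:

```python
# Multi-nodal, spatio-temporal readings in long format; failed sensors are
# dropped from processing until repaired. All names and values are invented.
import pandas as pd

readings = pd.DataFrame({
    "sensor_id": ["T1", "T2", "T3", "T1", "T2", "T3"],
    "height_m":  [2.0, 10.0, 18.0, 2.0, 10.0, 18.0],
    "timestamp": pd.to_datetime(["2012-03-10 00:00"] * 3 + ["2012-03-10 00:05"] * 3),
    "temp_c":    [210.5, 180.2, None, 211.0, 179.8, None],   # sensor T3 has failed
})

failed = {"T3"}                                     # known-bad sensors awaiting repair
usable = readings[~readings["sensor_id"].isin(failed)]

# Average chimney temperature per sampling interval, ignoring the failed sensor.
print(usable.groupby("timestamp")["temp_c"].mean())
```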

Example Dataset #2

An example would be the interconnected river and lake levels within a single geographic area. Distinct monitoring points are located at specific geo-spatial locations; geo-spatial points with interconnected transfer functions and models. Each of the monitoring points consists of multiple data acquisitions, and each data acquisition is sampled at random (or predetermined) intervals.

As a result, data points consist of 1) multiple sensors, 2) multiple spatial locations, and 3) temporally separated samples. In this scenario, sensors may fail – or go temporarily offline – in a random, unpredictable manner. Sensors must be taken out of the processing until data validity returns. Due to the interconnectedness of the sensor locations, and the interrelationships between the sensors, sufficient redundant data could be present to permit suitable analytical processing even when some readings are absent.

Example Dataset #3

The most complex example could be aerial chemical contamination sampling. In this scenario, the chemical distribution is continuously changing as the result of understood, but not fully predictable, weather behavior. Sampling devices would consist of 1) airborne sampling devices (balloons) providing specific, limited sample sets, 2) ground-based mobile sampling units (trucks) providing extensive sample sets, and 3) fixed-base (pole-mounted) sampling units whose data is downloaded at relatively long intervals (hours or days).

In this scenario, multiple, non-uniform data sampling elements are positioned at non-uniform (and mobile) locations, with data collection performed in a fully asynchronous fashion. This data cannot be stored in flat-table structures, and the dataset must provide enough relevant information to fill in the gaps in the data.
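A sketch of what such a non-flat, long-format layout might look like: each record carries its own source, position, and time, and gaps are filled by interpolation onto a common clock. All identifiers and values are invented.

```python
# Fully asynchronous samples from heterogeneous sources, stored one record per
# observation rather than in a flat table keyed by a shared clock.
import pandas as pd

samples = pd.DataFrame([
    {"source": "balloon-1", "lat": 29.76, "lon": -95.36, "time": "2011-04-21 08:03", "ppm": 1.2},
    {"source": "truck-7",   "lat": 29.74, "lon": -95.40, "time": "2011-04-21 08:10", "ppm": 3.4},
    {"source": "pole-12",   "lat": 29.75, "lon": -95.38, "time": "2011-04-21 06:00", "ppm": 2.1},
    {"source": "truck-7",   "lat": 29.73, "lon": -95.41, "time": "2011-04-21 08:40", "ppm": 3.9},
])
samples["time"] = pd.to_datetime(samples["time"])

# Resample one mobile source onto a 15-minute clock and interpolate the gaps.
truck = (samples[samples["source"] == "truck-7"]
         .set_index("time")["ppm"]
         .resample("15min").mean()
         .interpolate())
print(truck)
```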

The Data-Information Hierarchy, Part 2

February 11, 2011

Data, as has been established, consists of the organic, elemental source quantities. Data, by itself, does not produce cognitive information and decision-making ability. But without it, the chain Data –> Information –> Knowledge –> Understanding –> Wisdom is broken before it starts. Data is recognized for its discrete characteristics.

Information is the logical grouping and presentation of the data. Information, in a more general sense, is the structure and encoded explanation of phenomena. It answers the who, what, when, where questions (http://www.systems-thinking.org/dikw/dikw.htm) but does not explain these answers, nor instill the wisdom necessary to act on the information. “Information is sometimes associated with the idea of knowledge through its popular use rather than with uncertainty and the resolution of uncertainty.” (An Introduction to Information Theory, John R. Pierce)

The leap from Information through Knowledge and Understanding into Wisdom is the objective sought by data / information analysis, data / information mining, and knowledge systems.

What knowledge is gleaned from the information; how can we understand the interactions (particularly at a system level); and how is this understanding used to make correct decisions and navigate a course to a desired end point?

Wisdom is the key. What good are the historical volumes of data unless informed decisions result? (a definition of Wisdom)

How do informed decisions (Wisdom) result if the data and process are not Understood?

How do we achieve understanding without the knowledge?

How do we achieve the Knowledge without the Information?

But in a very real sense, Information is not merely the who, what, when, where answers to the questions assimilating the data. Quantified, Information is the measure of the order of the system – or, put another way, of the lack of disorder – the measure of the entropy of the system.
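For the quantified sense, Shannon's entropy gives a concrete measure; a small illustration in Python (the state probabilities are arbitrary):

```python
# Shannon entropy, in bits, of a distribution over system states.
# Lower entropy means more order (less uncertainty) in the system.
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # maximally disordered: 2.0 bits
print(entropy([0.97, 0.01, 0.01, 0.01]))   # highly ordered: about 0.24 bits
```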

Taken together, information is composed of the informational context, the informational content, and the informational propositions. Knowledge, then, is the informed assimilation of the information. And the cognitive conclusion is the understanding.

Thus Data –> Information –> Knowledge –> Understanding –> Wisdom through the judicious use of:

  • data acquisition,
  • data retention,
  • data transition,
  • knowledge mining,
  • cognitive processing, and
  • the application of this combined set into a proactive and forward action plan.
