Principles of Data Mining

mic64 · 2004-05-20, 04:49 PM

Chapter 1: Introduction
1.1 Introduction to Data Mining
Progress in digital data acquisition and storage technology has resulted in the growth of
huge databases. This has occurred in all areas of human endeavor, from the mundane
(such as supermarket transaction data, credit card usage records, telephone call details,
and government statistics) to the more exotic (such as images of astronomical bodies,
molecular databases, and medical records). Little wonder, then, that interest has grown
in the possibility of tapping these data, of extracting from them information that might be
of value to the owner of the database. The discipline concerned with this task has
become known as data mining.
Defining a scientific discipline is always a controversial task; researchers often disagree
about the precise range and limits of their field of study. Bearing this in mind, and
accepting that others might disagree about the details, we shall adopt as our working
definition of data mining:
Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and
useful to the data owner.
The relationships and summaries derived through a data mining exercise are often
referred to as models or patterns. Examples include linear equations, rules, clusters,
graphs, tree structures, and recurrent patterns in time series.
The definition above refers to "observational data," as opposed to "experimental data."
Data mining typically deals with data that have already been collected for some purpose
other than the data mining analysis (for example, they may have been collected in order
to maintain an up-to-date record of all the transactions in a bank). This means that the
objectives of the data mining exercise play no role in the data collection strategy. This is
one way in which data mining differs from much of statistics, in which data are often
collected by using efficient strategies to answer specific questions. For this reason, data
mining is often referred to as "secondary" data analysis.
The definition also mentions that the data sets examined in data mining are often large. If
only small data sets were involved, we would merely be discussing classical exploratory
data analysis as practiced by statisticians. When we are faced with large bodies of data,
new problems arise. Some of these relate to housekeeping issues of how to store or
access the data, but others relate to more fundamental issues, such as how to determine
the representativeness of the data, how to analyze the data in a reasonable period of
time, and how to decide whether an apparent relationship is merely a chance occurrence
not reflecting any underlying reality. Often the available data comprise only a sample
from the complete population (or, perhaps, from a hypothetical superpopulation); the aim
may be to generalize from the sample to the population. For example, we might wish to
predict how future customers are likely to behave or to determine the properties of
protein structures that we have not yet seen. Such generalizations may not be
achievable through standard statistical approaches because often the data are not
(classical statistical) "random samples," but rather "convenience" or "opportunity"
samples. Sometimes we may want to summarize or compress a very large data set in
such a way that the result is more comprehensible, without any notion of generalization.
This issue would arise, for example, if we had complete census data for a particular
country or a database recording millions of individual retail transactions.
The relationships and structures found within a set of data must, of course, be novel.
There is little point in regurgitating well-established relationships (unless the exercise is
aimed at "hypothesis" confirmation, in which one seeks to determine whether an
established pattern also exists in a new data set) or necessary relationships (that, for
example, all pregnant patients are female). Clearly, novelty must be measured relative to
the user's prior knowledge. Unfortunately, few data mining algorithms take into account a
user's prior knowledge. For this reason, we will not say very much about novelty in this
text. It remains an open research problem.
While novelty is an important property of the relationships we seek, it is not sufficient to
qualify a relationship as being worth finding. In particular, the relationships must also be
understandable. For instance, simple relationships are more readily understood than
complicated ones, and may well be preferred, all else being equal.
Data mining is often set in the broader context of knowledge discovery in databases, or
KDD. This term originated in the artificial intelligence (AI) research field. The KDD
process involves several stages: selecting the target data, preprocessing the data,
transforming them if necessary, performing data mining to extract patterns and
relationships, and then interpreting and assessing the discovered structures. Once again
the precise boundaries of the data mining part of the process are not easy to state; for
example, to many people data transformation is an intrinsic part of data mining. In this
text we will focus primarily on data mining algorithms rather than the overall process. For
example, we will not spend much time discussing data preprocessing issues such as
data cleaning, data verification, and defining variables. Instead we focus on the basic
principles for modeling data and for constructing algorithmic processes to fit these
models to data.
The process of seeking relationships within a data set (that is, of seeking accurate, convenient,
and useful summary representations of some aspect of the data) involves a number of
steps:
§ determining the nature and structure of the representation to be used;
§ deciding how to quantify and compare how well different representations fit
the data (that is, choosing a "score" function);
§ choosing an algorithmic process to optimize the score function; and
§ deciding what principles of data management are required to implement the
algorithms efficiently.

Example 1.1
Regression analysis is a tool with which many readers will be familiar. In its simplest form,
it involves building a predictive model to relate a predictor variable, X, to a response
variable, Y, through a relationship of the form Y = aX + b. For example, we might build a
model which would allow us to predict a person's annual credit-card spending given their
annual income. Clearly the model would not be perfect, but since spending typically
increases with income, the model might well be adequate as a rough characterization. In
terms of the steps listed above, we would have the following scenario:
§ The representation is a model in which the response variable, spending,
is linearly related to the predictor variable, income.
§ The score function most commonly used in this situation is the sum of
squared discrepancies between the predicted spending from the model
and observed spending in the group of people described by the data.
The smaller this sum is, the better the model fits the data.
§ The optimization algorithm is quite simple in the case of linear
regression: a and b can be expressed as explicit functions of the
observed values of spending and income. We describe the algebraic
details in chapter 11.
§ Unless the data set is very large, few data management problems arise
with regression algorithms. Simple summaries of the data (the sums,
sums of squares, and sums of products of the X and Y values) are
sufficient to compute estimates of a and b. This means that a single pass
through the data will yield estimates.
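
The single-pass property in the last point can be sketched in Python. This is only an illustration under stated assumptions: the function names fit_line and sum_squared_error and the income and spending figures are invented for the example, and the formulas are the standard least-squares closed form rather than anything quoted from chapter 11. The sum of squared Y values is not needed for a and b themselves, so the sketch leaves it out.

```python
def fit_line(pairs):
    """Least-squares estimates of a and b in y = a*x + b, from one pass over (x, y) pairs."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in pairs:              # single pass: only running sums are kept
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:                  # fewer than two distinct x values
        raise ValueError("need at least two distinct x values")
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - a * sum_x) / n
    return a, b


def sum_squared_error(pairs, a, b):
    """Score function: sum of squared discrepancies between predicted and observed y."""
    return sum((y - (a * x + b)) ** 2 for x, y in pairs)


# Hypothetical data: (annual income, annual card spending), both in thousands.
data = [(20, 2.1), (35, 3.9), (50, 5.2), (65, 6.8), (80, 8.0)]

a, b = fit_line(data)
print(f"spending ~ {a:.3f} * income + {b:.3f}")

# The fitted line should score no worse than an arbitrary alternative line.
print(sum_squared_error(data, a, b) <= sum_squared_error(data, 0.1, 0.0))
```

Because fit_line keeps only a handful of running totals, it works just as well when the pairs come from a generator that streams records from disk.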

Data mining is an interdisciplinary exercise. Statistics, database technology, machine
learning, pattern recognition, artificial intelligence, and visualization all play a role. And
just as it is difficult to define sharp boundaries between these disciplines, so it is difficult
to define sharp boundaries between each of them and data mining. At the boundaries,
one person's data mining is another's statistics, database, or machine learning problem.

1.2 The Nature of Data Sets
We begin by discussing at a high level the basic nature of data sets.
A data set is a set of measurements taken from some environment or process. In the
simplest case, we have a collection of objects, and for each object we have a set of the
same p measurements. In this case, we can think of the collection of the measurements
on n objects as a form of n × p data matrix. The n rows represent the n objects on which
measurements were taken (for example, medical patients, credit card customers, or
individual objects observed in the night sky, such as stars and galaxies). Such rows may
be referred to as individuals, entities, cases, objects, or records depending on the
context.
The other dimension of our data matrix contains the set of p measurements made on
each object. Typically we assume that the same p measurements are made on each
individual although this need not be the case (for example, different medical tests could
be performed on different patients). The p columns of the data matrix may be referred to
as variables, features, attributes, or fields; again, the language depends on the research
context. In all situations the idea is the same: these names refer to the measurement that
is represented by each column. In chapter 2 we will discuss the notion of measurement
in much more detail.
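
As a concrete picture of the n × p layout, here is a small Python sketch that stores a data matrix as a list of rows. The column names are variables that this chapter mentions; the three rows of values are invented for illustration, and the helper name column is hypothetical.

```python
# Column names (the p variables). The values below are invented for illustration.
columns = ["Age", "Income", "Marital Status"]

# Each row is one object (individual, case, record): n = 3 rows here.
rows = [
    [34, 41000, "married"],
    [52, 67500, "single"],
    [29, 38200, "married"],
]

n = len(rows)       # number of objects
p = len(columns)    # number of measurements per object


def column(name):
    """Return all n values of one variable, that is, one column of the matrix."""
    j = columns.index(name)
    return [row[j] for row in rows]


# This simple layout only holds if every row carries the same p measurements.
assert all(len(row) == p for row in rows)

print(n, p)
print(column("Income"))
```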

Data come in many forms and this is not the place to develop a complete taxonomy.
Indeed, it is not even clear that a complete taxonomy can be developed, since an
important aspect of data in one situation may be unimportant in another. However, there
are certain basic distinctions to which we should draw attention. One is the difference
between quantitative and categorical measurements (different names are sometimes
used for these). A quantitative variable is measured on a numerical scale and can, at
least in principle, take any value. The columns Age and Income in table 1.1 are
examples of quantitative variables. In contrast, categorical variables such as Sex, Marital
Status, and Education in table 1.1 can take only certain, discrete values. The common
three-point severity scale used in medicine (mild, moderate, severe) is another example.
Categorical variables may be ordinal (possessing a natural order, as in the Education
scale) or nominal (simply naming the categories, as in the Marital Status case). A data
analytic technique appropriate for one type of scale might not be appropriate for another
(although it does depend on the objective—see Hand (1996) for a detailed discussion).
For example, were marital status represented by integers (e.g., 1 for single, 2 for
married, 3 for widowed, and so forth) it would generally not be meaningful or appropriate
to calculate the arithmetic mean of a sample of such scores using this scale. Similarly,
simple linear regression (predicting one quantitative variable as a function of others) will
usually be appropriate to apply to quantitative data, but applying it to categorical data
may not be wise; other techniques that have similar objectives (to the extent that the
objectives can be similar when the data types differ), might be more appropriate with
categorical scales.
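
The marital status point can be made concrete with a short Python sketch. The integer coding follows the text (1 for single, 2 for married, 3 for widowed); the sample and the helper names recode and category_counts are invented for illustration. Swapping two of the arbitrary codes changes the arithmetic mean, while the category counts stay the same.

```python
from collections import Counter
from statistics import mean

labels = {1: "single", 2: "married", 3: "widowed"}
sample = [1, 1, 2, 3, 3, 3]     # invented sample of coded marital statuses


def recode(codes, mapping):
    """Apply a different, equally arbitrary integer coding to the same data."""
    return [mapping[c] for c in codes]


def category_counts(codes, code_labels):
    """Count how often each category occurs, independent of the coding used."""
    return Counter(code_labels[c] for c in codes)


# Same people, but the codes for single and widowed are swapped.
alt_mapping = {1: 3, 2: 2, 3: 1}
alt_labels = {3: "single", 2: "married", 1: "widowed"}
alt_sample = recode(sample, alt_mapping)

# The mean depends on the arbitrary coding, so it says nothing about the data.
print(mean(sample), mean(alt_sample))

# Counts (and hence the most frequent category) do not depend on the coding.
print(category_counts(sample, labels) == category_counts(alt_sample, alt_labels))
```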
Measurement scales, however defined, lie at the bottom of any data taxonomy. Moving
up the taxonomy, we find that data can occur in various relationships and structures.
Data may arise sequentially in time series, and the data mining exercise might address
entire time series or particular segments of those time series. Data might also describe
spatial relationships, so that individual records take on their full significance only when
considered in the context of others.
Consider a data set on medical patients. It might include multiple measurements on the
same variable (e.g., blood pressure), each measurement taken at different times on
different days. Some patients might have extensive image data (e.g., X-rays or magnetic
resonance images), others not. One might also have data in the form of text, recording a
specialist's comments and diagnosis for each patient. In addition, there might be a
hierarchy of relationships between patients in terms of doctors, hospitals, and
geographic locations. The more complex the data structures, the more complex the data
mining models, algorithms, and tools we need to apply.
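
One way to picture such a patient record is the following Python sketch. It is only an assumed layout for illustration: the class and field names (Patient, BloodPressureReading, image_files, and so on) and the sample values are hypothetical. It shows why these data do not fit a fixed n × p matrix: the number of readings varies by patient, and images and notes may be absent.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class BloodPressureReading:
    taken_at: datetime      # repeated measurements of the same variable over time
    systolic: int
    diastolic: int


@dataclass
class Patient:
    patient_id: str
    doctor: str             # hierarchy: patient -> doctor -> hospital -> location
    hospital: str
    location: str
    blood_pressure: List[BloodPressureReading] = field(default_factory=list)
    image_files: List[str] = field(default_factory=list)   # empty if no images exist
    specialist_notes: Optional[str] = None                  # free text, may be missing


def readings_per_patient(patients):
    """Number of blood pressure readings for each patient; it varies by patient."""
    return {pt.patient_id: len(pt.blood_pressure) for pt in patients}


# Invented example records.
p1 = Patient("p1", "Dr. A", "Hospital 1", "North")
p1.blood_pressure.append(BloodPressureReading(datetime(2004, 1, 5, 9, 0), 128, 82))
p1.blood_pressure.append(BloodPressureReading(datetime(2004, 1, 6, 17, 30), 131, 85))

p2 = Patient("p2", "Dr. B", "Hospital 1", "North", specialist_notes="Follow up in six weeks.")

print(readings_per_patient([p1, p2]))
```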

