Data Warehouse Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online. When implemented on high-performance client/server or parallel-processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining, examples of profitable applications that illustrate its relevance to today's business environment, and a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50-gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining.
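The promotional-mailing question above is, at bottom, a predictive scoring problem: learn from clients who did or did not respond to past mailings, then rank the current client list by estimated response probability. The white paper does not tie this to any particular tool; the sketch below is only an illustration under assumed inputs, using scikit-learn's logistic regression (a technique defined in the glossary later in this paper) and made-up client attributes such as age, income, and prior purchases.

# Illustrative sketch only: score clients for a promotional mailing using
# logistic regression. All column names and values are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical campaign: one row per client, with a known outcome (responded: 1/0).
history = pd.DataFrame({
    "age":             [34, 52, 23, 45, 61, 38],
    "income_k":        [48, 92, 31, 67, 80, 55],   # annual income, $ thousands
    "prior_purchases": [2, 7, 0, 4, 9, 3],
    "responded":       [0, 1, 0, 1, 1, 0],
})
features = ["age", "income_k", "prior_purchases"]

model = LogisticRegression()
model.fit(history[features], history["responded"])

# Current mailing list: same attributes, outcome unknown.
clients = pd.DataFrame({
    "age":             [29, 58, 41],
    "income_k":        [40, 85, 62],
    "prior_purchases": [1, 8, 5],
})

# Estimated probability of response, used to rank the list ("which clients ...").
clients["response_score"] = model.predict_proba(clients[features])[:, 1]
print(clients.sort_values("response_score", ascending=False))

# The fitted coefficients give a rough answer to "... and why?":
# larger absolute weights mean stronger influence on the score.
print(dict(zip(features, model.coef_[0])))

Logistic regression appears here only because it is simple and its coefficients hint at the "why" part of the question; any of the classification techniques named in the glossary, such as nearest neighbor, could fill the same role.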
From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

Table 1. Steps in the Evolution of Data Mining

Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | What was my total revenue in the last five years? | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | What were unit sales in New England last March? | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | What were unit sales in New England last March? Drill down to Boston. | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | What's likely to happen to Boston unit sales next month? Why? | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery

Glossary of Data Mining Terms

dimension: In a flat or relational database, each field in a record represents a dimension. In a multidimensional database, a dimension is a set of similar entities; for example, a multidimensional sales database might include the dimensions Product, Time, and City.
exploratory data analysis: The use of graphical and descriptive statistical techniques to learn about the structure of a dataset.
genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
linear model: An analytical model that assumes linear relationships in the coefficients of the variables being studied.
linear regression: A statistical technique used to find the best-fitting linear relationship between a target (dependent) variable and its predictors (independent variables).
logistic regression: A linear regression that predicts the proportions of a categorical target variable, such as type of customer, in a population.
multidimensional database: A database designed for online analytical processing, structured as a multidimensional hypercube with one axis per dimension.
multiprocessor computer: A computer that includes multiple processors connected by a network. See parallel processing.
nearest neighbor: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.

The modeling idea behind these techniques is straightforward: build a model from cases where the answer is already known, then apply it to new cases. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda, and that there are certain characteristics of the ocean currents and certain routes likely taken by the ships' captains in that era.
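To tie the glossary's nearest neighbor entry to the treasure-hunting analogy above, here is a minimal sketch under assumed data: a handful of made-up historical ocean locations labelled by whether treasure was found there, and a candidate location classified by majority vote of its k most similar historical records. Nothing in the data or the feature choice comes from the white paper.

# Illustrative k-nearest-neighbor sketch (k >= 1). All records are hypothetical.
import numpy as np

# Historical records: (latitude, longitude, current speed in knots) per past search site.
historical = np.array([
    [32.3, -64.8, 1.2],   # near Bermuda
    [32.1, -64.5, 1.0],   # near Bermuda
    [25.8, -80.2, 0.4],   # off Florida
    [36.8, -75.9, 0.7],   # off Virginia
    [32.5, -64.9, 1.1],   # near Bermuda
])
found = np.array([1, 1, 0, 0, 1])  # 1 = treasure found at that site, 0 = nothing found

def knn_classify(query, records, labels, k=3):
    """Classify `query` by majority vote of the k most similar historical records."""
    distances = np.linalg.norm(records - query, axis=1)  # plain Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest records
    return int(round(labels[nearest].mean()))            # majority class (odd k avoids ties)

# A candidate search location: do its nearest historical neighbors suggest treasure?
candidate = np.array([32.2, -64.7, 1.15])
print("Worth searching?", bool(knn_classify(candidate, historical, found, k=3)))

In real use the features would be normalized and k chosen by validation; the point of the sketch is only the mechanism the glossary describes: a new case takes its class from the k most similar cases in a historical dataset.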