High-Performance Parallel Database Processing and Grid Databases- P10

Document information:

High-Performance Parallel Database Processing and Grid Databases- P10: Parallel databases are database systems that are implemented on parallel computing platforms. Therefore, high-performance query processing focuses on query processing, including database queries and transactions, that makes use of parallelism techniques applied to an underlying parallel computing platform in order to achieve high performance.
Content extracted from the document:
Chapter 16 Parallel Data Mining: Association Rules and Sequential Patterns

[Figure 16.2 Building a data warehouse: operational data (DB) passes through data extraction (extract, filter, transform, integrate, classify, aggregate, summarize) into a data warehouse that is integrated, non-volatile, time-variant, and subject-oriented.]

A data warehouse is integrated and subject-oriented, since the data is already integrated from various sources through the cleaning process, and each data warehouse is developed for a certain domain or subject area in an organization, such as sales, and therefore is subject-oriented. The data is nonvolatile, meaning that the data in a data warehouse is not update-oriented, unlike operational data. The data is also historical and normally grouped to reflect a certain period of time, and hence it is time-variant.

Once a data warehouse has been developed, management is able to perform operations on the data warehouse, such as drill-down and rollup. Drill-down is performed in order to obtain a more detailed breakdown of a certain dimension, whereas rollup, which is exactly the opposite, is performed in order to obtain more general information about a certain dimension. Business reporting often makes use of data warehouses in order to produce historical analysis for decision support. Parallelism of OLAP has already been presented in Chapter 15.

As can be seen from the above, the main difference between a database and a data warehouse lies in the data itself: operational versus historical. However, decision support based on a data warehouse has its own limitations. A query for historical reporting still has to be formulated explicitly, in the same way as a query on operational data. If the management does not know what information or pattern or knowledge to expect, data warehousing is not able to satisfy this requirement. A typical anecdote is that a manager gives a pile of data to subordinates and asks them to find something useful in it.
The manager does not know what to expect but is sure that something useful and surprising may be extracted from this pile of data. This is not a typical database query or data warehouse processing. This raises the need for a data mining process.

Data mining, defined as a process to mine knowledge from a collection of data, generally involves three components: the data, the mining process, and the knowledge resulting from the mining process (see Fig. 16.1). The data itself needs to go through several processes before it is ready for the mining process. This preliminary process is often referred to as data preparation. Although Figure 16.1 shows that the data for data mining is coming from a data warehouse, in practice this may or may not be the case. It is likely that the data may be coming from any data repositories. Therefore, the data needs to be somehow transformed so that it becomes ready for the mining process.

Data preparation steps generally cover:

• Data selection: Only relevant data to be analyzed is selected from the database.
• Data cleaning: Data is cleaned of noise and errors. Missing and irrelevant data is also excluded.
• Data integration: Data from multiple, heterogeneous sources may be integrated into one simple flat table format.
• Data transformation: Data is transformed and consolidated into forms appropriate for mining by performing summary or aggregate operations.

Once the data is ready for the mining process, the mining process can start. The mining process employs an intelligent method applied to the data in order to extract data patterns. There are various mining techniques, including but not limited to association rules, sequential patterns, classification, and clustering.
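
The four data preparation steps listed above (selection, cleaning, integration, transformation) can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's own procedure: the two record layouts, the field names (`customer`, `cust_id`, `amount`, `price`, and so on), the sample values, and the helper functions `select`, `clean`, `integrate`, `transform`, and `prepare` are all hypothetical, chosen to show how two heterogeneous sources might end up in one flat, summarized form.

```python
from collections import defaultdict

# Hypothetical raw records from two heterogeneous sources (made-up data).
store_sales = [
    {"customer": "c1", "item": "bread", "amount": 2.5, "channel": "store"},
    {"customer": "c1", "item": "milk", "amount": None, "channel": "store"},  # missing value
    {"customer": "c2", "item": "bread", "amount": 2.5, "channel": "store"},
    {"customer": "c2", "item": "milk", "amount": -1.0, "channel": "store"},  # noisy value
]
web_sales = [
    {"cust_id": "c1", "product": "butter", "price": 3.0},
    {"cust_id": "c3", "product": "bread", "price": 2.4},
]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r.get(f) for f in fields} for r in records]


def clean(records, amount_key):
    """Data cleaning: drop records whose amount is missing or not positive."""
    return [r for r in records
            if r[amount_key] is not None and r[amount_key] > 0]


def integrate(store, web):
    """Data integration: map both sources onto one flat schema."""
    unified = [dict(r) for r in store]
    for r in web:
        unified.append({"customer": r["cust_id"],
                        "item": r["product"],
                        "amount": r["price"]})
    return unified


def transform(records):
    """Data transformation: aggregate total spend and item set per customer."""
    summary = defaultdict(lambda: {"total": 0.0, "items": set()})
    for r in records:
        summary[r["customer"]]["total"] += r["amount"]
        summary[r["customer"]]["items"].add(r["item"])
    return dict(summary)


def prepare(store, web):
    """Run the four steps in the order given in the text."""
    store_sel = select(store, ["customer", "item", "amount"])
    web_sel = select(web, ["cust_id", "product", "price"])
    store_ok = clean(store_sel, "amount")
    web_ok = clean(web_sel, "price")
    return transform(integrate(store_ok, web_ok))


prepared = prepare(store_sales, web_sales)
```

The per-customer item sets in `prepared` are the kind of flat, transaction-like input that an association rule or sequential pattern miner would then consume. The cleaning rule used here (drop missing or non-positive amounts) is just one possible policy.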
The results of this mining process are knowledge or patterns.

16.2 DATA MINING: A BRIEF OVERVIEW

As mentioned earlier, data mining is a process for discovering useful, interesting, and sometimes surprising knowledge from a large collection of data. Therefore, we need to understand various kinds of data mining tasks and techniques. Also required is a deeper understanding of the main difference between querying and the data mining process. Accepting the difference between querying and data mining can be ...