Data Mining P2


10.10.2023


1. In English text files, common words (e.g., "is", "are", "the") or similar patterns of character strings (e.g., "ze", "th", "ing") are usually used repeatedly. It is also observed that the characters in an English text occur in a well-documented distribution, with letter e and space being the most popular.
2. In numeric data files, often we observe runs of similar numbers or predictable interdependency amongst the numbers.
3. The neighboring pixels in a typical image are highly correlated to each other, with the pixels in a smooth region of an image having similar values.
4. Two consecutive frames in a video are often mostly identical when motion in the scene is slow.
5. Some audio data beyond the human audible frequency range are useless for all practical purposes.

Data compression is the technique to reduce the redundancies in data representation in order to decrease data storage requirements and, hence, communication costs when transmitted through a communication network [24, 25]. Reducing the storage requirement is equivalent to increasing the capacity of the storage medium. If the compressed data are properly indexed, it may improve the performance of mining data in the compressed large database as well. This is particularly useful when interactivity is involved with a data mining system. Thus the development of efficient compression techniques, particularly suitable for data mining, will continue to be a design challenge for advanced database management systems and interactive multimedia applications.

Depending upon the application criteria, data compression techniques can be classified as lossless and lossy. In lossless methods we compress the data in such a way that the decompressed data can be an exact replica of the original data. Lossless compression techniques are applied to compress text, numeric, or character strings in a database - typically, medical data, etc.
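The lossless property described above, together with the redundancy in runs of numbers and repeated words, can be illustrated with a minimal sketch. It is not taken from the book: the run-length helpers `rle_encode` and `rle_decode` and the sample data are hypothetical names and values chosen for illustration, and Python's standard `zlib` module stands in for a general-purpose lossless text compressor.

```python
import zlib


def rle_encode(values):
    """Collapse consecutive equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


# Hypothetical numeric column containing long runs of identical readings.
readings = [7] * 50 + [8] * 30 + [7] * 20
runs = rle_encode(readings)
assert rle_decode(runs) == readings   # decoding gives an exact replica

# Hypothetical English text with frequently repeated common words.
text = ("the data is in the database and the data are compressed " * 40).encode("ascii")
packed = zlib.compress(text)
restored = zlib.decompress(packed)
assert restored == text               # lossless: byte-for-byte identical

print(len(readings), "values ->", len(runs), "runs")
print(len(text), "bytes ->", len(packed), "bytes")
```

In both cases the shorter representation exploits redundancy only; no information is discarded, which is what distinguishes these methods from the lossy techniques discussed next.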
On the other hand, there are application areas where we can compromise with the accuracy of the decompressed data and can, therefore, afford to lose some information. For example, typical image, video, and audio compression techniques are lossy, since the approximation of the original data during reconstruction is good enough for human perception.

In our view, data compression is a field that has so far been neglected by the data mining community. The basic principle of data compression is to reduce the redundancies in data representation, in order to generate a shorter representation for the data to conserve data storage. In earlier discussions, we emphasized that data reduction is an important preprocessing task in data mining. Need for reduced representation of data is crucial for the success of very large multimedia database applications and the associated economical usage of data storage. Multimedia databases are typically much larger than, say, business or financial data, simply because an attribute itself in a multimedia database could be a high-resolution digital image. Hence storage and subsequent access of thousands of high-resolution images, which are possibly interspersed with other datatypes as attributes, is a challenge. Data compression offers advantages in the storage management of such huge data. Although data compression has been recognized as a potential area for data reduction in literature [13], not much work has been reported so far on how the data compression techniques can be integrated in a data mining system.

Data compression can also play an important role in data condensation. An approach for dealing with the intractable problem of learning from huge databases is to select a small subset of data as representatives for learning. Large data can be viewed at varying degrees of detail in different regions of the feature space, thereby providing adequate importance depending on the underlying probability density [26].
However, these condensation techniques are useful only when the structure of data is well-organized. Multimedia data, being not so well-structured in its raw form, leads to a big bottleneck in the application of existing data mining principles. In order to avoid this problem, one approach could be to store some predetermined feature set of the multimedia data as an index at the header of the compressed file, and subsequently use this condensed information for the discovery of information or data mining.

We believe that integration of data compression principles and techniques in data mining systems will yield promising results, particularly in the age of multimedia information and their growing usage in the Internet. Soon there will arise the need to automatically discover or access information from such multimedia data domains, in place of well-organized business and financial data only. Keeping this goal in mind, we intended to devote significant discussions on data compression techniques and their principles in multimedia data domain involving text, numeric and non-numeric data, images, etc.

We have elaborated on the fundamentals of data compression and image compression principles and some popular algorithms in Chapter 3. Then we have described, in Chapter 9, how some data compression principles can improve the efficiency of information retrieval particularly suitable for multimedia data mining.

1.4 INFORMATION RETRIEVAL

Users approach large information spaces like the Web with different motives, namely, to (i) search for a specific piece of information or topic, (ii) gain familiarity with, or an overview of, some general topic or domain, and (iii) locate something that might be of interest, without a clear prior notion of what interesting should look like. The field of information retrieval d ...