A Concise Introduction to Data Compression - P7
The first enhancement improves compression for small alphabets. In Unicode, most small alphabets start on a 128-code-point boundary, although an alphabet may contain more than 128 symbols. This suggests computing a difference not between the current and previous code values, but between the current code value and the value in the middle of the 128-code-point segment that contains the previous code value. Specifically, the difference is computed by subtracting a base value from the current code point. The base value is obtained from the previous code point as follows. If the previous code value lies in the interval xxxx00₁₆ to xxxx7F₁₆ (i.e., its low byte is between 00₁₆ and 7F₁₆), the base value is set to xxxx40₁₆ (the middle of the lower half of the segment), and if the previous code point lies in the range xxxx80₁₆ to xxxxFF₁₆ (its low byte is between 80₁₆ and FF₁₆), the base value is set to xxxxC0₁₆ (the middle of the upper half). This way, if the current code point is within 128 positions of the base value, the difference falls in the range [−128, +127], which makes it fit in one byte.

The second enhancement has to do with remote symbols. A document in a non-Latin alphabet (where the code points are very different from the ASCII codes) may use spaces between words. The code point for a space is the ASCII code 20₁₆, so any pair of code points that includes a space results in a large difference. BOCU therefore computes a difference by first computing the base values of the three previous code points and then subtracting the smallest base value from the current code point.

BOCU-1 is the version of BOCU that is commonly used in practice [BOCU-1 02]. It differs from the original BOCU method by using a different set of byte-value ranges and by encoding the ASCII control characters U+0000 through U+0020 with byte values 0 through 20₁₆, respectively. These features make BOCU-1 suitable for compressing input files that are MIME (text) media types.
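The base-value rule of the first enhancement can be sketched in a few lines of Python (a minimal illustration; the function name and the Cyrillic example are ours, not part of any BOCU specification):

```python
def base_value(prev):
    """Base value derived from the previous code point: the middle of the
    lower half (xxxx40) or the upper half (xxxxC0) of the 128-code-point
    segment that contains it."""
    # Clearing the seven low bits lands on xxxx00 or xxxx80;
    # adding 0x40 then yields xxxx40 or xxxxC0, respectively.
    return (prev & ~0x7F) + 0x40

# Cyrillic example: previous code point is 'О' (U+041E), current is 'Д' (U+0414).
prev, cur = 0x041E, 0x0414
base = base_value(prev)   # 0x0440, the middle of the segment 0x0400..0x047F
diff = cur - base         # -44, fits in one signed byte
print(hex(base), diff)
```

Because successive letters of a small alphabet fall in the same 128-code-point segment, their differences from this base stay within [−128, +127], so each encodes in a single byte.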
Il faut avoir beaucoup étudié pour savoir peu (it is necessary to study much in order to know little).
—Montesquieu (Charles de Secondat), Pensées diverses

Chapter Summary

This chapter is devoted to data compression methods and techniques that are not based on the approaches discussed elsewhere in this book. The following algorithms illustrate some of these original techniques:

The Burrows–Wheeler method (Section 7.1) starts with a string S of n symbols and scrambles (i.e., permutes) them into another string L that satisfies two conditions: (1) any region of L tends to have a concentration of just a few symbols; (2) it is possible to reconstruct the original string S from L. Since its inception in the early 1990s, this unexpected method has been the subject of much research.

The technique of symbol ranking (Section 7.2) uses context, rather than probabilities, to rank symbols.

Sections 7.3 and 7.3.1 describe two algorithms, SCSU and BOCU-1, for the compression of Unicode-based documents.

Chapter 8 of [Salomon 07] discusses other methods, techniques, and approaches to data compression.

Self-Assessment Questions

1. The term "fractals" appears early in this chapter. One of the applications of fractals is to compress images, and it is the purpose of this note to encourage the reader to search for material on fractal compression and study it.

2. The Burrows–Wheeler method has been the subject of much research and of attempts to speed up its decoding and improve it. Using the paper at [JuergenAbel 07] as your starting point, try to gain a deeper understanding of this interesting method.

3. The term "lexicographic order" appears in Section 7.1. This is an important term in computer science in general, and the conscientious reader should make sure this term is fully understood.

4.
Most Unicode code points are 16 bits long, but the standard has provisions for longer codes. Use [Unicode 07] as a starting point to learn more about Unicode and how codes longer than 16 bits are structured.

In comedy, as a matter of fact, a greater variety of methods were discovered and employed than in tragedy.
—T. S. Eliot, The Sacred Wood (1920)

Bibliography

Ahmed, N., T. Natarajan, and R. K. Rao (1974) "Discrete Cosine Transform," IEEE Transactions on Computers, C-23:90–93.

Bell, Timothy C., John G. Cleary, and Ian H. Witten (1990) Text Compression, Englewood Cliffs, NJ, Prentice Hall.

BOCU (2001) is http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html.

BOCU-1 (2002) is h ...