High-Performance Parallel Database Processing and Grid Databases- P4

Số trang: 50 Loại file: pdf Dung lượng: 391.11 KB Lượt xem: 5 Lượt tải: 0

10.10.2023

Hỗ trợ phí lưu trữ khi tải xuống: 10,000 VND

Báo xấu

Xem trước 5 trang đầu tiên của tài liệu này:

Thông tin tài liệu:

High-Performance Parallel Database Processing and Grid Databases- P4: Parallel databases are database systems that are implemented on parallel computingplatforms. Therefore, high-performance query processing focuses on queryprocessing, including database queries and transactions, that makes use of parallelismtechniques applied to an underlying parallel computing platform in order toachieve high performance.
Nội dung trích xuất từ tài liệu:
High-Performance Parallel Database Processing and Grid Databases- P4130 Chapter 5 Parallel Join Ri and Si in the cost equation indicate the fragment size of both tables in eachprocessor. ž Receiving records cost is: ..Ri =P/ C .Si =P// ð .m p / Both data transfer and receiving costs look similar, as also mentioned abovefor the divide and broadcast cost. However, for disjoint partitioning the size of Riand Si in the data transfer cost is likely to be different from that of the receivingcost. The reason is as follows. Following the example in Figures 5.14 and 5.16,Ri and Si in the data transfer cost are the size of each fragment of both tablesin each processor. Again, assuming that the initial data placement is done witha round-robin or any other equal partitioning, each fragment size will be equal.Therefore, Ri and Si in the data transfer cost are simply dividing the total tablesize by the available number of processors. However, Ri and Si in the receiving cost are most likely skewed (as alreadymentioned in Chapter 2 on analytical models). As shown in Figures 5.14 and 5.16,the spread of the fragments after the distribution is not even. Therefore, the skewmodel must be taken into account, and consequently the values of Ri and Si in thereceiving cost are different from those of the data transfer cost. Finally, the last phase is data storing, which involves storing all records receivedby each processor. ž Disk cost for storing the result of data distribution is: ..Ri =P/ C .Si =P// ð IO5.4.3 Cost Models for Local JoinFor the local join, since a hash-based join is the most efﬁcient join algorithm, itis assumed that a hash-based join is used in the local join. The cost of the localjoin with a hash-based join comprises three main phases: data loading from eachprocessor, the joining process (hashing and probing), and result storing in eachprocessor. The data loading consists of scan costs and select costs. These are identical tothose of the disjoint partitioning costs, which are: ž Scan cost D ..Ri =P/ C .Si =P// ð IO ž Select cost D .jRi j C jSi j/ ð .tr C tw / It has been emphasized that (jRi j C jSi j) as well as (.Ri =P/ C .Si =P/) corre-spond to the values in the receiving and disk costs of the disjoint partitioning. The join process itself is basically incurring hashing and probing costs, whichare as follows: 5.4 Cost Models 131 ž Join costs involve reading, hashing, and probing: .jRi j ð .tr C th / C .jSi j ð .tr C th C t j // The process is basically reading each record R and hashing it to a hash table.After all records R have been processed, records S can be read, hashed, and probed.If they are matched, the matching records are written out to the query result. The hashing process is very much determined by the size of the hash table thatcan ﬁt into main memory. If the memory size is smaller than the hash table size,we normally partition the hash table into multiple buckets whereby each bucketcan perfectly ﬁt into main memory. All but the ﬁrst bucket are spooled to disk. Based on this scenario, we must include the I/O cost for reading and writingoverﬂow buckets, which is as follows. ž Reading/writing of overﬂow buckets cost is the I/O cost associated with the limited ability of main memory to accommodate the entire hash table. This cost includes the costs for reading and writing records not processed in the ﬁrst phase of hashing. Â Â ÃÃ Â Ã H Si 1 min ;1 ð ð 2 ð IO jSi j P Although this looks similar to that mentioned in other chapters regarding theoverhead of overﬂow buckets, there are two signiﬁcant differences. One is thatonly Si is included in the cost component, because only the table S is hashed; andthe second difference is that the projection and selection variables are not included,because all records S are hashed. The ﬁnal cost is the query results storing cost, consisting of generating resultcost and disk cost. ž Generating result records cost is the number of selected records multiplied by the writing unit cost. jRi j ð σj ð jSi j ð tw Note that the cost is reduced by the join selectivity factor σj , where the smallerthe selectivity factor, the lower the number of records produced by the join opera-tion. ž Disk cost for storing the ﬁnal result is the number of pages needed to store the ﬁnal aggregate values times the disk unit cost, which is: .πR ð Ri ð σj ð πS ð Si =P/ ð ...