
High-Performance Parallel Database Processing and Grid Databases- P11

Pages: 50      File type: pdf      Size: 343.10 KB
10.10.2023


Document information:

High-Performance Parallel Database Processing and Grid Databases- P11: Parallel databases are database systems that are implemented on parallel computing platforms. Therefore, high-performance query processing focuses on query processing, including database queries and transactions, that makes use of parallelism techniques applied to an underlying parallel computing platform in order to achieve high performance.
Content extracted from the document:
Chapter 17 Parallel Clustering and Classification

Rec#  Weather       Temperature  Time    Day      Jog (Target Class)
 1    Fine          Mild         Sunset  Weekend  Yes
 2    Fine          Hot          Sunset  Weekday  Yes
 3    Shower        Mild         Midday  Weekday  No
 4    Thunderstorm  Cool         Dawn    Weekend  No
 5    Shower        Hot          Sunset  Weekday  Yes
 6    Fine          Hot          Midday  Weekday  No
 7    Fine          Cool         Dawn    Weekend  No
 8    Thunderstorm  Cool         Midday  Weekday  No
 9    Fine          Cool         Midday  Weekday  Yes
10    Fine          Mild         Midday  Weekday  Yes
11    Shower        Hot          Dawn    Weekend  No
12    Shower        Mild         Dawn    Weekday  No
13    Fine          Cool         Dawn    Weekday  No
14    Thunderstorm  Mild         Sunset  Weekend  No
15    Thunderstorm  Hot          Midday  Weekday  No

Figure 17.11 Training data set

... thunderstorm, whereas the possible values for temperature are hot, mild, and cool. Continuous values are real numbers (e.g., the height of a person in centimetres).

Figure 17.11 shows the training data set for the decision tree shown previously. This training data set consists of only 15 records. For simplicity, only categorical attributes are used in this example. Examining the first record and matching it with the decision tree in Figure 17.10, the target is a Yes for fine weather and mild temperature, disregarding the other two attributes. This is because all records in this training data set follow this rule (see records 1 and 10). Other records, such as records 9 and 13, use all four attributes.

17.3.2 Decision Tree Classification: Processes

Decision Tree Algorithm

There are many different algorithms for constructing a decision tree, such as ID3, C4.5, SPRINT, etc. Constructing a decision tree is generally a recursive process. At the start, all training records are at the root node. The algorithm then partitions the training records recursively by choosing one attribute at a time, and the process is repeated for each partitioned data set. The recursion stops when a stopping condition is reached, which is when all of the training records in a partition have the same target class label. Figure 17.12 shows an algorithm for constructing a decision tree. The decision tree construction algorithm uses a divide-and-conquer method and builds the tree in a depth-first fashion. Branching can be binary (only 2 branches) or multiway (more than 2 branches).

Algorithm: Decision Tree Construction
Input: training data set D
Output: decision tree T
Procedure DTConstruct(D):
1.  T ← Ø
2.  Determine best splitting attribute
3.  T ← create root node and label with splitting attribute
4.  T ← add arc to root node for each split predicate with label
5.  For each arc do
6.     D ← data set created by applying splitting predicate to D
7.     If stopping point reached for this path Then
8.        T' ← create leaf node and label with appropriate class
9.     Else
10.       T' ← DTConstruct(D)
11.    T ← add T' to arc

Figure 17.12 Decision tree algorithm

Note that in the algorithm shown in Figure 17.12, the key element is the splitting attribute selection (line 2). The splitting attribute is the attribute chosen to split the training data set into a number of partitions. The splitting attribute step is also often known as feature selection, because the algorithm needs to select a feature (or an attribute) of the training data set to create a node. As mentioned earlier, choosing a different attribute as the splitting attribute will cause the resulting decision tree to be different. The difference between the decision trees produced by an algorithm lies in how the features (input attributes) are positioned. Hence, choosing a splitting attribute that will result in an optimum decision tree is desirable.
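To make the divide-and-conquer structure of Figure 17.12 concrete, the following is a minimal Python sketch of the recursive construction, not the book's own implementation. Records are assumed to be dictionaries mapping attribute names to values plus a "class" key, and the helper choose_splitting_attribute stands in for the feature selection step on line 2 (a concrete criterion is sketched in the next subsection); all names are illustrative.

from collections import Counter

# Minimal sketch of the DTConstruct procedure in Figure 17.12.
# Each training record is a dict of attribute -> value pairs,
# plus a "class" key holding the target class label.

def majority_class(records):
    # Label a leaf with the most frequent target class in the partition.
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def dt_construct(records, attributes, choose_splitting_attribute):
    classes = {r["class"] for r in records}

    # Stopping condition: every record in this partition has the same
    # target class label (or no attributes are left to split on).
    if len(classes) == 1 or not attributes:
        return {"leaf": True, "class": majority_class(records)}

    # Line 2 of Figure 17.12: determine the best splitting attribute.
    split_attr = choose_splitting_attribute(records, attributes)
    node = {"leaf": False, "attribute": split_attr, "branches": {}}

    # Lines 5-11: one arc (branch) per value of the splitting attribute;
    # recurse on the partition selected by each split predicate.
    for value in {r[split_attr] for r in records}:
        partition = [r for r in records if r[split_attr] == value]
        remaining = [a for a in attributes if a != split_attr]
        node["branches"][value] = dt_construct(
            partition, remaining, choose_splitting_attribute)
    return node

As a usage illustration, encoding the 15 records of Figure 17.11 as dictionaries (e.g., {"Weather": "Fine", "Temperature": "Mild", "Time": "Sunset", "Day": "Weekend", "class": "Yes"}) and calling dt_construct(records, ["Weather", "Temperature", "Time", "Day"], criterion) reproduces the recursive partitioning described above, where criterion is a feature selection function such as the one sketched below.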
The way in which a splitting node is determined is described in greater detail in the following.

Splitting Attributes or Feature Selection

When constructing a decision tree, it is necessary to have a means of determining the importance of the attributes for the classification. Hence, a calculation is needed to find the best splitting attribute at a node. All possible splitting attributes are evaluated with a feature selection criterion to find the best attribute. The feature selection criterion still does not guarantee the best decision tree, as it also relies on the completeness of the training data set and on whether or not the training data set provides enough information. The main aim of feature selection or ...
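As one concrete illustration of a feature selection criterion, the sketch below computes information gain (entropy reduction), the criterion used by ID3; the criterion adopted by the book may differ, and the function names are assumptions chosen to match the dt_construct sketch above.

import math
from collections import Counter

def entropy(records):
    # Entropy of the target class distribution within a partition.
    counts = Counter(r["class"] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(records, attribute):
    # Entropy reduction obtained by splitting the partition on the attribute.
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        partition = [r for r in records if r[attribute] == value]
        remainder += (len(partition) / total) * entropy(partition)
    return entropy(records) - remainder

def choose_splitting_attribute(records, attributes):
    # Evaluate every candidate attribute and keep the highest-scoring one.
    return max(attributes, key=lambda a: information_gain(records, a))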
