Handbook of Neural Network Signal Processing P2

Assume that $g_i(\mathbf{x}) = 1$ (hence $g_k(\mathbf{x}) = 0$ for $k \neq i$) and update expert $i$ based on the output error. Then update the gating network so that $g_i(\mathbf{x})$ is even closer to unity. Alternatively, a batch training method can be adopted:

1. Apply a clustering algorithm to cluster the set of training samples into $n$ clusters. Use the membership information to train the gating network.
2. Assign each cluster to an expert module and train the corresponding expert module.
3. Fine-tune the performance using gradient-based learning.

Note that the function of the gating network is to partition the feature space into largely disjoint regions and assign each region to an expert module. In this way, an individual expert module only needs to learn a subregion of the feature space and is likely to yield better performance. By combining the $n$ expert modules under the gating network, the overall performance is expected to improve.

Figure 1.19 shows an example using the batch training method presented above. The dots are the training and testing samples. The circles are the cluster centers that represent the individual experts. These cluster centers are found by applying the k-means clustering algorithm to the training samples. The gating network output is proportional to the inverse of the squared distance from each sample to each of the three cluster centers, and the output values are normalized so that they sum to unity. Each expert module implements a simple linear model (a straight line in this example). We did not implement the third step, so the results are obtained without fine-tuning. The corresponding MATLAB m-files are moedemo.m and moegate.m.

Figure 1.19: Illustration of a mixture of experts network using the batch training method.
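To make the batch procedure concrete, the following is a minimal Python sketch, assuming one-dimensional inputs, a hand-rolled k-means, inverse-squared-distance gating, and least-squares linear experts. The function names and toy data are illustrative assumptions, the gradient-based fine-tuning step is omitted (as in the example above), and the code does not reproduce the book's MATLAB files moedemo.m and moegate.m.

# Minimal sketch of the batch-trained mixture of experts described above.
# Assumptions (not from the handbook): plain NumPy, 1-D inputs, a simple
# k-means, and straight-line experts fitted by least squares.
import numpy as np

def kmeans(x, n_clusters, n_iter=100, seed=0):
    # Step 1: cluster the training samples and keep the membership labels.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean()
    return centers, labels

def gating(x, centers, eps=1e-12):
    # Gating output proportional to the inverse squared distance to each
    # cluster center, normalized so the outputs sum to unity.
    d2 = (x[:, None] - centers[None, :]) ** 2 + eps
    g = 1.0 / d2
    return g / g.sum(axis=1, keepdims=True)

def fit_experts(x, y, labels, n_clusters):
    # Step 2: each expert is a straight line y = a*x + b fitted to its cluster.
    experts = []
    for k in range(n_clusters):
        a, b = np.polyfit(x[labels == k], y[labels == k], deg=1)
        experts.append((a, b))
    return experts

def predict(x, centers, experts):
    g = gating(x, centers)                                     # (N, n) gating weights
    preds = np.stack([a * x + b for a, b in experts], axis=1)  # (N, n) expert outputs
    return (g * preds).sum(axis=1)                             # gated combination

# Toy usage: a piecewise-linear target learned by three linear experts.
x = np.linspace(-3, 3, 200)
y = np.where(x < -1, -x - 2, np.where(x < 1, x, -x + 2))
centers, labels = kmeans(x, n_clusters=3)
experts = fit_experts(x, y, labels, n_clusters=3)
y_hat = predict(x, centers, experts)
print("mean squared error:", np.mean((y - y_hat) ** 2))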
1.2.6 Support Vector Machines (SVMs)

A support vector machine [14] has the basic format depicted in Figure 1.20, where $\varphi_k(\mathbf{x})$ is a nonlinear transformation of the input feature vector $\mathbf{x}$ into a new high-dimensional feature vector $\boldsymbol{\varphi}(\mathbf{x}) = [\varphi_1(\mathbf{x})\ \varphi_2(\mathbf{x})\ \ldots\ \varphi_p(\mathbf{x})]$. The output $y$ is computed as

$$y(\mathbf{x}) = \sum_{k=1}^{p} w_k \varphi_k(\mathbf{x}) + b = \boldsymbol{\varphi}(\mathbf{x})\,\mathbf{w}^T + b$$

where $\mathbf{w} = [w_1\ w_2\ \ldots\ w_p]$ is the $1 \times p$ weight vector and $b$ is the bias term. The dimension of $\boldsymbol{\varphi}(\mathbf{x})$ ($= p$) is usually much larger than that of the original feature vector ($= m$). It has been argued that mapping a low-dimensional feature vector into a higher-dimensional feature space is likely to make the resulting feature vectors linearly separable. In other words, using $\boldsymbol{\varphi}$ as the feature vector is likely to yield better pattern classification results.

Figure 1.20: An SVM neural network structure.

Given a set of training vectors $\{\mathbf{x}(i);\ 1 \leq i \leq N\}$, one can solve for the weight vector $\mathbf{w}$ as

$$\mathbf{w} = \sum_{i=1}^{N} \gamma_i \boldsymbol{\varphi}(\mathbf{x}(i)) = \boldsymbol{\gamma}\,\boldsymbol{\Phi}$$

where $\boldsymbol{\Phi} = [\boldsymbol{\varphi}(\mathbf{x}(1))\ \boldsymbol{\varphi}(\mathbf{x}(2))\ \ldots\ \boldsymbol{\varphi}(\mathbf{x}(N))]^T$ is an $N \times p$ matrix and $\boldsymbol{\gamma}$ is a $1 \times N$ vector. Substituting $\mathbf{w}$ into $y(\mathbf{x})$ yields

$$y(\mathbf{x}) = \boldsymbol{\varphi}(\mathbf{x})\,\mathbf{w}^T + b = \sum_{i=1}^{N} \gamma_i \boldsymbol{\varphi}(\mathbf{x})\boldsymbol{\varphi}^T(\mathbf{x}(i)) + b = \sum_{i=1}^{N} \gamma_i K(\mathbf{x}, \mathbf{x}(i)) + b$$

where the kernel $K(\mathbf{x}, \mathbf{x}(i))$ is a scalar-valued function of the testing sample $\mathbf{x}$ and a training sample $\mathbf{x}(i)$. For $N$ …

Figure 1.21: A linearly separable pattern classification example. $\rho$ is the distance from each class to the decision boundary.

To identify the support vectors from a set of training data samples, consider the linearly separable pattern classification example shown in Figure 1.21. According to Cortes and Vapnik [15], the empirical risk is minimized in a linearly separable two-class pattern classification problem, as shown in Figure 1.21, if the decision boundary is located such that the minimum distance from each training sample of each class to the decision boundary is maximized. In other words, the parameter $\rho$ in Figure 1.21 should be maximized subject to the constraints that all "o" class samples lie on one side of the decision boundary and all "x" class samples lie on the other side. This can be formulated as a nonlinear constrained quadratic optimization problem. Using the Karush–Kuhn–Tucker conditions, it can be shown that not all training samples contribute to the determination of the decision boundary. In fact, as shown in Figure 1.21, only those training samples that are closest to the decision boundary (marked with color in the figure) contribute to the solution of $\mathbf{w}$ and $b$. These training samples are then identified as the support vectors. There are many public domain implementations of …
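As a paraphrase of this result (not quoted from the handbook), the standard hard-margin quadratic program, written in the notation above with class labels $d(i) \in \{-1, +1\}$, is

$$\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad d(i)\left(\mathbf{x}(i)\,\mathbf{w}^T + b\right) \geq 1, \quad 1 \leq i \leq N,$$

and the resulting margin is $\rho = 1/\|\mathbf{w}\|$. The sketch below illustrates the kernel-expansion form of $y(\mathbf{x})$ and the fact that only the support vectors receive nonzero $\gamma_i$. It uses scikit-learn's SVC as one possible public-domain implementation (an assumption; the handbook does not name one), and the two Gaussian blobs are hypothetical data rather than the samples of Figure 1.21.

# Minimal sketch of the support-vector ideas above, using scikit-learn's SVC.
# The data below are hypothetical, not the handbook's Figure 1.21 example.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable classes ("o" and "x" in the handbook's notation).
X_o = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
X_x = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(20, 2))
X = np.vstack([X_o, X_x])
y = np.array([-1] * 20 + [+1] * 20)

# A large C approximates the hard-margin (maximal rho) formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the samples closest to the decision boundary become support vectors;
# all other training samples get gamma_i = 0 and drop out of the solution.
print("number of support vectors:", len(clf.support_vectors_))

# The decision function is y(x) = sum_i gamma_i * K(x, x(i)) + b, summed over
# the support vectors only (here K is the linear kernel, i.e., a dot product).
x_test = np.array([0.5, 0.3])
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_test) + clf.intercept_
print("manual kernel expansion:", manual)
print("library decision_function:", clf.decision_function(x_test[None, :]))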
