
Recurrent Neural Networks for Prediction, Chapter 12

Pages: 21      File type: pdf      Size: 460.34 KB


Document information:

Exploiting Inherent Relationships Between Parameters in Recurrent Neural Networks. Perspective: Optimisation of complex neural network parameters is a rather involved task. It becomes particularly difficult for large-scale networks, such as modular networks, and for networks with complex interconnections, such as feedback networks. Therefore, if an inherent relationship between some of the free parameters of a neural network can be found, which holds at every time instant for a dynamical network, it would help to reduce the number of degrees of freedom in the optimisation task of learning in a particular network. ...
Content extracted from the document:
Recurrent Neural Networks for Prediction
Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)

12 Exploiting Inherent Relationships Between Parameters in Recurrent Neural Networks

12.1 Perspective

Optimisation of complex neural network parameters is a rather involved task. It becomes particularly difficult for large-scale networks, such as modular networks, and for networks with complex interconnections, such as feedback networks. Therefore, if an inherent relationship between some of the free parameters of a neural network can be found, which holds at every time instant for a dynamical network, it would help to reduce the number of degrees of freedom in the optimisation task of learning in a particular network.

We derive such relationships between the gain β in the nonlinear activation function of a neuron Φ and the learning rate η of the underlying learning algorithm, for both gradient descent and extended Kalman filter trained recurrent neural networks.

The analysis is then extended in the same spirit to modular neural networks. Both networks with parallel modules and networks with nested (serial) modules are analysed. A detailed analysis is provided for the latter, since the former can be considered a linear combination of modules that consist of feedforward or recurrent neural networks.

For all these cases, the static and dynamic equivalence between an arbitrary neural network described by β, η and W(k) and a referent network described by β_R = 1, η_R and W_R(k) is derived. A deterministic relationship between these parameters is provided, which allows one degree of freedom less in the nonlinear optimisation task of learning in this framework. This is particularly significant for large-scale networks of any type.

12.2 Introduction

When using neural networks, many of their parameters are chosen empirically. Apart from the choice of topology, architecture and interconnection, the parameters that influence training time and performance of a neural network are the learning rate η, the gain of the activation function β and the set of initial weights W_0. The optimal values for these parameters are not known a priori and generally they depend on external quantities, such as the training data. Other parameters that are also important in this context are

• the steepness of the sigmoidal activation function, defined by γβ; and
• the dimensionality of the input signal to the network, and the dimensionality and character of the feedback for recurrent networks.

It has been shown (Thimm and Fiesler 1997a,b) that the distribution of the initial weights has almost no influence on the training time or the generalisation performance of a trained neural network. Hence, we concentrate on the relationship between the parameters of a learning algorithm (η) and those of a nonlinear activation function (β).

To improve the performance of a gradient descent trained network, Jacobs (1988) proposed that the acceleration of convergence of learning in neural networks be achieved through learning rate adaptation. His arguments were that

1. every adjustable learning parameter of the cost function should have its own learning rate parameter; and
2. every learning rate parameter should vary from one iteration to the next.

These arguments are intuitively sound.
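The flavour of Jacobs' proposal can be made concrete with a short sketch. This is not code from the book: the sign-agreement heuristic and the adaptation constants below are illustrative assumptions, chosen only to show per-parameter learning rates that change from one iteration to the next.

```python
import numpy as np

def adaptive_step(w, grad, prev_grad, rates, up=1.05, down=0.5):
    """One gradient step in which every weight keeps its own learning rate
    (argument 1) and that rate is revised at every iteration (argument 2):
    it grows while the sign of the gradient component is unchanged and
    shrinks when the sign flips."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    rates = np.where(same_sign, rates * up, rates * down)
    return w - rates * grad, rates

# Toy usage on the cost E(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0, 3.0])
rates = np.full_like(w, 0.1)        # an individual learning rate per weight
prev_grad = np.zeros_like(w)
for _ in range(50):
    grad = w
    w, rates = adaptive_step(w, grad, prev_grad, rates)
    prev_grad = grad
print(w)                            # every component is driven towards zero
```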
However, if there is a dependence between some of the parameters in the network, this approach would lead to suboptimal learning and oscillations, since coupled parameters would be trained using different learning rates and different speeds of learning, which would deteriorate the performance of the network. To circumvent this problem, some heuristics on the values of the parameters have been derived (Haykin 1994). To shed further light onto this problem and offer feasible solutions, we therefore concentrate on finding relationships between coupled parameters in recurrent neural networks. The derived relationships are also valid for feedforward networks, since recurrent networks degenerate into feedforward networks when the feedback is removed.

Let us consider again a common choice for the activation function,

    Φ(γ, β, x) = γ / (1 + e^(−βx)).    (12.1)

This is a Φ : R → (0, γ) function. The parameter β is called the gain and the product γβ the steepness (slope) of the activation function.[1] The reciprocal of the gain is also referred to as the temperature. The gain of a node in a neural network is a constant that amplifies or attenuates the net input to the node. In Kruschke and Movellan (1991), it has been shown that the use of gradient descent to adjust the gain of the node increases learning speed.

[1] The gain and steepness are identical for activation functions with γ = 1. Hence, for such networks, we often use the term slope for β.

Let us consider again the general gradient-descent-based weight adaptation algorithm, given by

    W(k) = W(k − 1) − η ∇_W E(k),    (12.2)

where E(k) = (1/2) e²(k) is a cost function, W(k) is the weight vector/matrix at the time instant k and η is a learning rate. The gradient ∇_W E(k) in (12.2) comprises the first derivative of the nonlinear activation function (12.1), which is a function of β (Narendra and Parthasarathy 1990). For instance, for a si ...
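Equations (12.1) and (12.2) already show where the β–η coupling enters: the gradient in (12.2) carries a factor of β through the derivative of Φ, so a change of gain can be traded against a change of learning rate and a rescaling of the weights. The sketch below is not from the book; it is a minimal numerical check for a single logistic neuron (γ = 1) trained on the instantaneous cost E(k) = (1/2)e²(k), under the assumed scaling W_R(k) = βW(k) and η_R = β²η for a referent network with β_R = 1.

```python
import numpy as np

def phi(v, beta=1.0, gamma=1.0):
    """Logistic activation of Eq. (12.1): gamma / (1 + exp(-beta * v))."""
    return gamma / (1.0 + np.exp(-beta * v))

def train(w, x_seq, d_seq, eta, beta):
    """Gradient descent of Eq. (12.2) for one logistic neuron (gamma = 1),
    using the instantaneous cost E(k) = 0.5 * e(k)**2."""
    outputs = []
    for x, d in zip(x_seq, d_seq):
        y = phi(np.dot(w, x), beta)
        e = d - y
        # dE/dw = -e * beta * y * (1 - y) * x, so the update carries beta
        w = w + eta * e * beta * y * (1.0 - y) * x
        outputs.append(y)
    return w, np.array(outputs)

rng = np.random.default_rng(0)
x_seq = rng.standard_normal((200, 3))
d_seq = rng.uniform(0.1, 0.9, size=200)
beta, eta = 2.5, 0.05
w0 = rng.standard_normal(3)

# Arbitrary network: gain beta, learning rate eta, initial weights w0
w_a, y_a = train(w0.copy(), x_seq, d_seq, eta, beta)

# Referent network: gain 1, initial weights beta * w0, rate beta**2 * eta
w_r, y_r = train(beta * w0, x_seq, d_seq, beta**2 * eta, 1.0)

print(np.allclose(y_a, y_r))         # True: identical outputs at every step
print(np.allclose(beta * w_a, w_r))  # True: W_R(k) = beta * W(k) throughout
```

The printed checks are a numerical illustration only; the chapter derives the corresponding relationships analytically, including the extended Kalman filter and modular cases.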
