
Recurrent Neural Networks for Prediction, Chapter 12

Pages: 21      File type: pdf      Size: 460.34 KB


Document information:

Exploiting Inherent Relationships Between Parameters in Recurrent Neural Networks. Perspective: Optimisation of complex neural network parameters is a rather involved task. It becomes particularly difficult for large-scale networks, such as modular networks, and for networks with complex interconnections, such as feedback networks. Therefore, if an inherent relationship between some of the free parameters of a neural network can be found, which holds at every time instant for a dynamical network, it would help to reduce the number of degrees of freedom in the optimisation task of learning in a particular network. ...
Content extracted from the document:
Recurrent Neural Networks for Prediction
Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)

12 Exploiting Inherent Relationships Between Parameters in Recurrent Neural Networks

12.1 Perspective

Optimisation of complex neural network parameters is a rather involved task. It becomes particularly difficult for large-scale networks, such as modular networks, and for networks with complex interconnections, such as feedback networks. Therefore, if an inherent relationship between some of the free parameters of a neural network can be found, which holds at every time instant for a dynamical network, it would help to reduce the number of degrees of freedom in the optimisation task of learning in a particular network.

We derive such relationships between the gain β in the nonlinear activation function of a neuron Φ and the learning rate η of the underlying learning algorithm, for both gradient descent and extended Kalman filter trained recurrent neural networks.

The analysis is then extended in the same spirit to modular neural networks. Both networks with parallel modules and networks with nested (serial) modules are analysed. A detailed analysis is provided for the latter, since the former can be considered a linear combination of modules that consist of feedforward or recurrent neural networks.

For all these cases, the static and dynamic equivalence between an arbitrary neural network described by β, η and W(k) and a referent network described by β_R = 1, η_R and W_R(k) is derived. A deterministic relationship between these parameters is provided, which allows one degree of freedom less in the nonlinear optimisation task of learning in this framework. This is particularly significant for large-scale networks of any type.

12.2 Introduction

When using neural networks, many of their parameters are chosen empirically. Apart from the choice of topology, architecture and interconnection, the parameters that influence training time and performance of a neural network are the learning rate η, the gain of the activation function β and the set of initial weights W_0. The optimal values for these parameters are not known a priori and generally they depend on external quantities, such as the training data. Other parameters that are also important in this context are

• the steepness of the sigmoidal activation function, defined by γβ; and
• the dimensionality of the input signal to the network, and the dimensionality and character of the feedback for recurrent networks.

It has been shown (Thimm and Fiesler 1997a,b) that the distribution of the initial weights has almost no influence on the training time or the generalisation performance of a trained neural network. Hence, we concentrate on the relationship between the parameters of a learning algorithm (η) and those of a nonlinear activation function (β).

To improve the performance of a gradient descent trained network, Jacobs (1988) proposed that the acceleration of convergence of learning in neural networks be achieved through learning rate adaptation. His arguments were that

1. every adjustable learning parameter of the cost function should have its own learning rate parameter; and
2. every learning rate parameter should vary from one iteration to the next.

These arguments are intuitively sound.
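The flavour of Jacobs' proposal can be made concrete with a short sketch. This is not code from the book: the sign-agreement heuristic and the adaptation constants below are illustrative assumptions, chosen only to show per-parameter learning rates that change from one iteration to the next.

```python
import numpy as np

def adaptive_step(w, grad, prev_grad, rates, up=1.05, down=0.5):
    """One gradient step in which every weight keeps its own learning rate
    (argument 1) and that rate is revised at every iteration (argument 2):
    it grows while the sign of the gradient component is unchanged and
    shrinks when the sign flips."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    rates = np.where(same_sign, rates * up, rates * down)
    return w - rates * grad, rates

# Toy usage on the cost E(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0, 3.0])
rates = np.full_like(w, 0.1)        # an individual learning rate per weight
prev_grad = np.zeros_like(w)
for _ in range(50):
    grad = w
    w, rates = adaptive_step(w, grad, prev_grad, rates)
    prev_grad = grad
print(w)                            # every component is driven towards zero
```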
However, if there is a dependence between some of the parameters in the network, this approach would lead to suboptimal learning and oscillations, since coupled parameters would be trained using different learning rates and different speeds of learning, which would deteriorate the performance of the network. To circumvent this problem, some heuristics on the values of the parameters have been derived (Haykin 1994). To shed further light onto this problem and offer feasible solutions, we therefore concentrate on finding relationships between coupled parameters in recurrent neural networks. The derived relationships are also valid for feedforward networks, since recurrent networks degenerate into feedforward networks when the feedback is removed.

Let us consider again a common choice for the activation function,

    Φ(γ, β, x) = γ / (1 + e^(−βx)).    (12.1)

This is a Φ : R → (0, γ) function. The parameter β is called the gain and the product γβ the steepness (slope) of the activation function.[1] The reciprocal of the gain is also referred to as the temperature. The gain of a node in a neural network is a constant that amplifies or attenuates the net input to the node. In Kruschke and Movellan (1991), it has been shown that the use of gradient descent to adjust the gain of the node increases learning speed.

[1] The gain and steepness are identical for activation functions with γ = 1. Hence, for such networks, we often use the term slope for β.

Let us consider again the general gradient-descent-based weight adaptation algorithm, given by

    W(k) = W(k − 1) − η ∇_W E(k),    (12.2)

where E(k) = (1/2) e²(k) is a cost function, W(k) is the weight vector/matrix at the time instant k and η is a learning rate. The gradient ∇_W E(k) in (12.2) comprises the first derivative of the nonlinear activation function (12.1), which is a function of β (Narendra and Parthasarathy 1990). For instance, for a si ...
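Equations (12.1) and (12.2) already show where the β–η coupling enters: the gradient in (12.2) carries a factor of β through the derivative of Φ, so a change of gain can be traded against a change of learning rate and a rescaling of the weights. The sketch below is not from the book; it is a minimal numerical check for a single logistic neuron (γ = 1) trained on the instantaneous cost E(k) = (1/2)e²(k), under the assumed scaling W_R(k) = βW(k) and η_R = β²η for a referent network with β_R = 1.

```python
import numpy as np

def phi(v, beta=1.0, gamma=1.0):
    """Logistic activation of Eq. (12.1): gamma / (1 + exp(-beta * v))."""
    return gamma / (1.0 + np.exp(-beta * v))

def train(w, x_seq, d_seq, eta, beta):
    """Gradient descent of Eq. (12.2) for one logistic neuron (gamma = 1),
    using the instantaneous cost E(k) = 0.5 * e(k)**2."""
    outputs = []
    for x, d in zip(x_seq, d_seq):
        y = phi(np.dot(w, x), beta)
        e = d - y
        # dE/dw = -e * beta * y * (1 - y) * x, so the update carries beta
        w = w + eta * e * beta * y * (1.0 - y) * x
        outputs.append(y)
    return w, np.array(outputs)

rng = np.random.default_rng(0)
x_seq = rng.standard_normal((200, 3))
d_seq = rng.uniform(0.1, 0.9, size=200)
beta, eta = 2.5, 0.05
w0 = rng.standard_normal(3)

# Arbitrary network: gain beta, learning rate eta, initial weights w0
w_a, y_a = train(w0.copy(), x_seq, d_seq, eta, beta)

# Referent network: gain 1, initial weights beta * w0, rate beta**2 * eta
w_r, y_r = train(beta * w0, x_seq, d_seq, beta**2 * eta, 1.0)

print(np.allclose(y_a, y_r))         # True: identical outputs at every step
print(np.allclose(beta * w_a, w_r))  # True: W_R(k) = beta * W(k) throughout
```

The printed checks are a numerical illustration only; the chapter derives the corresponding relationships analytically, including the extended Kalman filter and modular cases.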
