The predictions are obtained as E ( y ∗ g ) = E ( y ∗ f ) + R T E ( β ), cov ( y ∗ g ) = cov ( y ∗ f ) + R T ( B − 1 + ( Z ) K − 1 ( Z ) T ) R , (2.41) where R = ( ( Z ∗ ) − ( Z ) K − 1 k ( Z ∗ , Z )) and E ( β ) = ( B − 1 + ( Z ) K − 1 ( Z ) T ) − 1 ( ( Z ) K − 1 y + B − 1 b ) . In cases when the prior knowledge about B is vague, which means that B − 1 approaches to the matrix of zeros, which is elaborated in [ 49 ]. 2.4.5 Asymptotic Properties of GP Models In model identification, it is important that the model is consistent. This means that the posterior distribution concentrates around the true distribution of the parameters as more and more data is observed or the sample size increases. The theory on consistency of GP models with sufficient conditions for the posterior consistency of GP regression is explained in [ 76 ] and reviewed in [ 49 ], with further references describing studies in various general settings in both listed references. As stated in [ 76 ], these sufficient conditions are difficult to validate and may not be intuitive when applied to concrete models. In [ 77 ], the alternative concept of the so-called information consistency for GP regression models is described. The consistency of the information is checked with the information criterion that is the Césaro average of the sequence of prediction errors. See [ 76 , 77 ] for more details and the background theory.
62 2 System Identification with GP Models 2.5 Computational Implementation A noticeable drawback of the system identification with GP models is the computa- tion time necessary for the modelling. GP regression, on which system identification is based, involves several matrix computations. This increases the number of oper- ations with the third power of the number of input data, i.e. O ( N 3 ) , such as matrix inversion and the calculation of the log determinant of the used covariance matrix. This computational greed restricts the number of training data to at most a few thousand cases on modern workstations. A common approach to computing the objective function from Eq.( 2.32 ) and its gradient makes use of the Cholesky decomposition [ 49 ] of K to compute α = K − 1 y . The training algorithm in pseudocode is as follows: Algorithm: GP training ( Z , y , C , initial θ ) repeat change hyperparameters θ compute K ( θ ) = [ C ( z p , z q ) ] N × N compute Cholesky decomposition L = chol ( K ) solve L γ = y for γ and L T α = γ for α to get α = K − 1 y compute ( θ ) and ∇ ( θ ) using α until − ( θ ) is minimal 2.5.1 Direct Implementation One option to deal with the computational implementation is to approach the compu- tation problem from the utilised hardware technology point of view. Since hardware capabilities are increasing every day, this approach might seem inefficient when looking over the longer term, but it is undoubtedly effective in the short term.
You've reached the end of your free preview.
Want to read all 281 pages?
- Fall '19
- dr. ahmed