The predictions are obtained as
\[
E(\mathbf{y}_g^*) = E(\mathbf{y}_f^*) + \mathbf{R}^{\mathrm{T}} E(\boldsymbol{\beta}), \qquad
\operatorname{cov}(\mathbf{y}_g^*) = \operatorname{cov}(\mathbf{y}_f^*)
+ \mathbf{R}^{\mathrm{T}} \left( \mathbf{B}^{-1} + \mathbf{H}(\mathbf{Z})\,\mathbf{K}^{-1}\mathbf{H}(\mathbf{Z})^{\mathrm{T}} \right)^{-1} \mathbf{R},
\tag{2.41}
\]
where \(\mathbf{H}(\cdot)\) denotes the matrix of explicit basis functions evaluated at the training inputs \(\mathbf{Z}\) or the test inputs \(\mathbf{Z}^*\),
\[
\mathbf{R} = \mathbf{H}(\mathbf{Z}^*) - \mathbf{H}(\mathbf{Z})\,\mathbf{K}^{-1}\mathbf{k}(\mathbf{Z}^*, \mathbf{Z})
\quad \text{and} \quad
E(\boldsymbol{\beta}) = \left( \mathbf{B}^{-1} + \mathbf{H}(\mathbf{Z})\,\mathbf{K}^{-1}\mathbf{H}(\mathbf{Z})^{\mathrm{T}} \right)^{-1}
\left( \mathbf{H}(\mathbf{Z})\,\mathbf{K}^{-1}\mathbf{y} + \mathbf{B}^{-1}\mathbf{b} \right).
\]
In cases when the prior knowledge about B is vague, B⁻¹ approaches the matrix of zeros; this limiting case is elaborated in [49].
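As an illustration only, and not the book's code, the following Python sketch evaluates Eq. (2.41) for given basis-function matrices H(Z) and H(Z*) and a prior β ~ N(b, B). The squared-exponential covariance function, the noise variance sn2 and the linear basis h(z) = [1, z] in the usage example are assumptions of the sketch.

import numpy as np

def se_cov(a, b, sf2=1.0, ell2=1.0):
    """Squared-exponential covariance between two sets of scalar inputs."""
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell2)

def predict_with_basis(Z, y, Zs, H, Hs, b, B, sn2=1e-2):
    """Mean and covariance of y*_g according to Eq. (2.41)."""
    K = se_cov(Z, Z) + sn2 * np.eye(len(Z))        # noisy training covariance matrix K
    ks = se_cov(Zs, Z)                             # k(Z*, Z)
    Kinv_y = np.linalg.solve(K, y)                 # K^-1 y
    Kinv_ks = np.linalg.solve(K, ks.T)             # K^-1 k(Z*, Z)^T
    mean_f = ks @ Kinv_y                           # E(y*_f): zero-mean GP prediction
    cov_f = se_cov(Zs, Zs) - ks @ Kinv_ks          # cov(y*_f)
    A = np.linalg.inv(B) + H @ np.linalg.solve(K, H.T)                # B^-1 + H(Z) K^-1 H(Z)^T
    beta_mean = np.linalg.solve(A, H @ Kinv_y + np.linalg.solve(B, b))  # E(beta)
    R = Hs - H @ Kinv_ks                           # R = H(Z*) - H(Z) K^-1 k(Z*, Z)
    mean_g = mean_f + R.T @ beta_mean              # E(y*_g)
    cov_g = cov_f + R.T @ np.linalg.solve(A, R)    # cov(y*_g)
    return mean_g, cov_g

# Hypothetical usage with a linear basis h(z) = [1, z] on scalar inputs:
rng = np.random.default_rng(1)
Z = rng.uniform(0, 5, 30)
y = 0.5 * Z + np.sin(Z) + 0.1 * rng.standard_normal(30)
Zs = np.linspace(0, 5, 7)
H, Hs = np.vstack([np.ones_like(Z), Z]), np.vstack([np.ones_like(Zs), Zs])
mu, Sigma = predict_with_basis(Z, y, Zs, H, Hs, b=np.zeros(2), B=10.0 * np.eye(2))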
2.4.5 Asymptotic Properties of GP Models
In model identification, it is important that the model is consistent. This means that the posterior distribution concentrates around the true distribution of the parameters as more and more data are observed, i.e. as the sample size increases. The theory of the consistency of GP models, with sufficient conditions for the posterior consistency of GP regression, is explained in [76] and reviewed in [49]; both references point to further studies in more general settings.
As stated in [76], these sufficient conditions are difficult to validate and may not be intuitive when applied to concrete models. In [77], the alternative concept of so-called information consistency for GP regression models is described. Information consistency is assessed with an information criterion that is the Cesàro average of the sequence of prediction errors. See [76, 77] for more details and the background theory.
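For illustration only, the Cesàro average of a sequence of prediction errors e_1, ..., e_n is simply their running mean; the error values in this sketch are placeholders, not data from [76, 77].

import numpy as np

def cesaro_average(errors):
    """Cesàro (running) average (1/n) * sum_{i<=n} e_i of a prediction-error sequence."""
    errors = np.asarray(errors, dtype=float)
    return np.cumsum(errors) / np.arange(1, len(errors) + 1)

# Information consistency roughly requires this average to vanish as n grows (see [77]).
print(cesaro_average([0.9, 0.5, 0.3, 0.2, 0.1]))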

2.5 Computational Implementation
A noticeable drawback of system identification with GP models is the computation time necessary for the modelling. GP regression, on which the identification is based, involves several matrix computations, such as the inversion of the covariance matrix and the calculation of its log determinant, whose cost increases with the third power of the number of input data, i.e. O(N³). This computational burden restricts the number of training data to at most a few thousand cases on modern workstations.
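As a rough illustration of this scaling, and with problem sizes assumed for the example rather than taken from the book, timing the Cholesky factorisation that dominates GP training shows the run time growing roughly eightfold when N doubles:

import time
import numpy as np

def covariance_matrix(N, ell2=1.0, sn2=1e-2):
    """Squared-exponential covariance matrix on N scalar inputs plus noise."""
    z = np.linspace(0.0, 10.0, N)[:, None]
    K = np.exp(-0.5 * (z - z.T) ** 2 / ell2)
    return K + sn2 * np.eye(N)

for N in (500, 1000, 2000):
    K = covariance_matrix(N)
    t0 = time.perf_counter()
    np.linalg.cholesky(K)                    # the O(N^3) operation
    print(f"N = {N:5d}: {time.perf_counter() - t0:.3f} s")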
A common approach to computing the objective function from Eq. (2.32) and its gradient makes use of the Cholesky decomposition [49] of K to compute α = K⁻¹y.
The training algorithm in pseudocode is as follows:
Algorithm: GP training(Z, y, C, initial θ)
    repeat
        change hyperparameters θ
        compute K(θ) = [C(z_p, z_q)]_{N×N}
        compute Cholesky decomposition L = chol(K)
        solve Lγ = y for γ and Lᵀα = γ for α, to get α = K⁻¹y
        compute ℒ(θ) and ∇ℒ(θ) using α
    until −ℒ(θ) is minimal

where ℒ(θ) denotes the log marginal likelihood objective of Eq. (2.32).
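The following Python sketch mirrors the pseudocode above for a squared-exponential covariance function. The particular hyperparameter set θ = (signal variance, length scale, noise variance), the log-parameterisation and the use of scipy.optimize.minimize with numerical gradients in place of an analytic ∇ℒ(θ) are assumptions of the sketch, not the book's implementation.

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, Z, y):
    """-log p(y | Z, theta), computed via the Cholesky decomposition of K(theta)."""
    sf2, ell2, sn2 = np.exp(log_theta)               # log-parameters keep theta positive
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = sf2 * np.exp(-0.5 * sq_dist / ell2) + sn2 * np.eye(len(y))   # K(theta) = [C(z_p, z_q)]
    L = cholesky(K, lower=True)                      # L = chol(K)
    gamma = solve_triangular(L, y, lower=True)       # solve L gamma = y
    alpha = solve_triangular(L.T, gamma)             # solve L^T alpha = gamma, i.e. alpha = K^-1 y
    return (0.5 * y @ alpha                          # data-fit term 1/2 y^T K^-1 y
            + np.sum(np.log(np.diag(L)))             # 1/2 log|K| from the Cholesky factor
            + 0.5 * len(y) * np.log(2.0 * np.pi))

# Hypothetical usage: noisy observations of a smooth function of a scalar input.
rng = np.random.default_rng(0)
Z = rng.uniform(0, 10, size=(50, 1))
y = np.sin(Z[:, 0]) + 0.1 * rng.standard_normal(50)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(Z, y))
print("optimised hyperparameters:", np.exp(res.x))

The Cholesky factor is reused for the log determinant, so the dominant cost remains the single O(N³) factorisation per iteration.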
2.5.1 Direct Implementation
One option for dealing with the computational implementation is to approach the computation problem from the point of view of the utilised hardware technology. Since hardware capabilities are increasing every day, this approach might seem inefficient over the longer term, but it is undoubtedly effective in the short term.

