For a parametric model, the goal is to find a predefined parametric relationship between features and targets; once that is done, the model no longer needs the training data for prediction. The features, $x_{*i}$, $i = 1, \ldots, n$, provide all the information needed to predict the $y_{*i}$'s. In Gaussian process regression, however, the predictions in eq. 2.31 and eq. 2.32 require the training set in order to obtain the posterior distribution of the test set.

2.3 Prediction from Gaussian Process Regression

From eq. 2.31, we know that the posterior distribution of the test data is Gaussian, with mean and variance given in eq. 2.32. But if one wants an explicit prediction for the target function rather than a distribution, what value should one report? An intuitive answer is the mean of the posterior distribution, and in practice this prediction is appropriate for most real-life tasks. Rasmussen and Williams provide an in-depth explanation in [13, sec. 2.4]; we summarize the rationale in the following.

To find an optimal prediction value, we need a way to measure performance. Let us define a loss function, $L(\hat{y}, y)$, which specifies the loss incurred by predicting $\hat{y}$ when the true value is $y$. The loss function can be, for example, the mean squared error or the mean absolute error, both of which are symmetric loss functions.

One wants to give a prediction that produces the minimum possible loss. But how can we achieve that when we do not know the true value? Following [13, sec. 2.4], we can instead minimize the expected loss, obtained by averaging the loss with respect to the posterior distribution,

    \tilde{R}_L(\hat{y} \mid x_*) = \int L(y_*, \hat{y}) \, p(y_* \mid x_*) \, dy_*.    (2.33)

Our best prediction is the one that minimizes this expected loss, i.e.

    y_{\mathrm{optimal}} \mid x_* = \operatorname*{argmin}_{\hat{y}} \tilde{R}_L(\hat{y} \mid x_*).    (2.34)

For the mean squared error loss function, the minimum occurs at the mean of $y_*$; for the mean absolute error loss function, it occurs at the median of $y_*$. In our case, the distribution of $y_*$ is Gaussian, so the mean and the median coincide. More generally, for any symmetric loss function and symmetric posterior distribution, the optimal prediction occurs at the mean of the distribution. For asymmetric loss functions, the optimal prediction can be computed from eq. 2.33 and eq. 2.34. For more detail, see [13, sec. 2.4].
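The argument above can be checked numerically: for a Gaussian posterior, the minimizers of the expected squared loss and the expected absolute loss both land on the posterior mean. The sketch below discretizes a hypothetical Gaussian posterior (the values of the mean and standard deviation are illustrative, not taken from the text) and approximates the integral in eq. 2.33 on a grid.

```python
import numpy as np

# Hypothetical Gaussian posterior p(y_* | x_*); mu and sigma are
# illustrative values chosen for this sketch, not from the text.
mu, sigma = 1.5, 0.8

# Discretize y_* finely enough to approximate the integral in eq. 2.33.
y = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 20001)
dy = y[1] - y[0]
p = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def expected_loss(loss, y_hat_grid):
    # R_L(y_hat | x_*) = integral of L(y_*, y_hat) p(y_* | x_*) dy_*
    # approximated by a Riemann sum on the uniform grid above.
    return np.array([np.sum(loss(y, yh) * p) * dy for yh in y_hat_grid])

# Candidate predictions y_hat; the grid contains mu exactly.
y_hat = np.linspace(mu - 2.0, mu + 2.0, 401)

r_sq = expected_loss(lambda ys, yh: (ys - yh) ** 2, y_hat)   # squared error
r_abs = expected_loss(lambda ys, yh: np.abs(ys - yh), y_hat)  # absolute error

# Both minimizers coincide with the posterior mean (= median) here.
print(y_hat[np.argmin(r_sq)], y_hat[np.argmin(r_abs)])
```

With an asymmetric loss one would see the two minimizers separate, which is exactly why eq. 2.34 is needed in general.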
3 Model Selection

When modelling the target function as a zero-mean Gaussian process, i.e.

    f(x) \sim \mathcal{GP}\bigl(0, k(x, x')\bigr),

the main task is to find a good configuration of the covariance matrix, generated by the kernel function $k$ and a set of hyperparameters, that gives the best performance. Model selection is thus a combination of choosing a good kernel function and optimizing the hyperparameters of that kernel function. In this section, we discuss several aspects of model selection, namely the marginal likelihood, the covariance matrix, kernel functions, and hyperparameter optimization.
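As a concrete illustration of how a kernel and its hyperparameters generate the covariance matrix, the sketch below builds the matrix $K$ with $K_{ij} = k(x_i, x_j)$ for the squared-exponential (RBF) kernel, one common choice of $k$; the hyperparameter names `signal_var` and `length_scale` are this sketch's own, not notation from the text.

```python
import numpy as np

def rbf_kernel(x1, x2, signal_var=1.0, length_scale=1.0):
    # Squared-exponential kernel k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2));
    # signal_var (s^2) and length_scale (l) are the hyperparameters that
    # model selection would tune.
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale ** 2)

# Covariance matrix of f(x) at five input locations under GP(0, k).
x = np.linspace(0.0, 1.0, 5)
K = rbf_kernel(x, x)

print(K.shape)        # (5, 5)
print(np.diag(K))     # k(x, x) = signal_var on the diagonal
```

Changing `length_scale` changes how quickly the off-diagonal covariances decay with distance, which is one way the hyperparameters control the smoothness of functions drawn from the prior.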