University of California, Los Angeles
Department of Statistics

Statistics C173/C273
Instructor: Nicolas Christou

Cross Validation

Cross validation is a technique that allows us to compare predicted values with true values. With spatial data, it can help us decide which variogram model to choose or which prediction method gives better results. The basic idea of cross validation is the following: we omit point $i$ from the data set and predict it using the remaining $n - 1$ data points. We can then compare the predicted value with the true value at location $s_i$.

Another way is to split the data set into two parts. The first part is used for modeling the variogram. The spatial locations of the second part serve as our prediction grid. Once we predict the values at those locations, we can compare them with the values observed there.

We will again use the Maas river data from:

    a <- read.table("http://www.stat.ucla.edu/~nchristo/statistics_c173_c273/soil.txt", header=TRUE)

    # Save the original image function:
    image.orig <- image

We will randomly split the data into two parts. Of the 155 observations, 100 will be used for modeling and 55 for prediction. Here are the commands:

    choose100 <- sample(1:155, 100)
    part_model <- a[choose100, ]
    part_valid <- a[-choose100, ]

Note: This is a random selection, so every time we run these commands we will get different samples.

Now we can use part_model to estimate the variogram.
    library(gstat)   # provides gstat(), variogram(), fit.variogram(), vgm(), krige()

    g <- gstat(id="log_lead", formula = log(lead)~1, locations = ~x+y, data=part_model)
    q <- variogram(g)
    plot(q)
    v.fit <- fit.variogram(q, vgm(1, "Sph", 800, 1))
    plot(q, v.fit)

The predictions will be performed at the 55 spatial locations of the part_valid data set:

    part_valid_pr <- krige(id="log_lead", log(lead)~1, locations=~x+y, model=v.fit, data=part_model, newdata=part_valid)

Let's compute the differences between the true values and the predicted values:

    difference <- log(part_valid$lead) - part_valid_pr$log_lead.pred
    summary(difference)

A simple plot of the predicted values against the true values is shown below:

    plot(log(part_valid$lead), part_valid_pr$log_lead.pred, xlab="Observed values", ylab="Predicted values")

[Figure: scatter plot of predicted values against observed values of log(lead) at the 55 validation locations.]

The correlation coefficient between these two sets of values is

    > cor(part_valid_pr$log_lead.pred, log(part_valid$lead))
    [1] 0.84808

The corresponding coefficient of determination is $R^2 = 0.84808^2 = 0.71924$. With this value of $R^2$, the kriging predictions are better estimates than the sample mean....
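The leave-one-out idea described at the start of these notes (omit point $i$, predict it from the remaining $n - 1$ points, compare with the true value) can be sketched in base R. This is only an illustration under stated assumptions: the data are synthetic, the names `pts`, `idw_predict`, and `loo_pred` are hypothetical, and inverse distance weighting is swapped in for kriging so that the example runs without gstat or the soil data.

```r
# Leave-one-out cross validation on synthetic data (base R only).
# Assumption: inverse distance weighting (IDW) stands in for kriging,
# and the data below are simulated, not the Maas river data.
set.seed(1)                                   # make the example reproducible
n <- 60
pts <- data.frame(x = runif(n), y = runif(n))
# Hypothetical smooth surface plus noise
pts$z <- sin(3 * pts$x) + cos(2 * pts$y) + rnorm(n, sd = 0.1)

# Predict at (x0, y0) from the points in dat, with weights 1 / distance^power
idw_predict <- function(x0, y0, dat, power = 2) {
  d <- sqrt((dat$x - x0)^2 + (dat$y - y0)^2)
  w <- 1 / d^power
  sum(w * dat$z) / sum(w)
}

# Omit point i and predict it from the remaining n - 1 points
loo_pred <- sapply(seq_len(n), function(i) {
  idw_predict(pts$x[i], pts$y[i], pts[-i, ])
})

err  <- pts$z - loo_pred                      # true minus predicted
me   <- mean(err)                             # mean error (bias)
rmse <- sqrt(mean(err^2))                     # root mean squared error
r2   <- cor(loo_pred, pts$z)^2                # squared correlation

# Baseline: predict each omitted point by the mean of the other n - 1 values
loo_mean  <- (sum(pts$z) - pts$z) / (n - 1)
rmse_mean <- sqrt(mean((pts$z - loo_mean)^2))

print(c(ME = me, RMSE = rmse, R2 = r2, RMSE_mean = rmse_mean))
```

Comparing `rmse` with `rmse_mean` gives a direct check of whether the spatial predictor does better than the sample mean. For kriging itself, gstat provides `krige.cv()`, which performs leave-one-out cross validation when `nfold` is left at its default.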
This note was uploaded on 02/11/2012 for the course STATS C173/C273 taught by Professor Nicolas Christou during the Spring '11 term at UCLA.