Course: TEACHING 673, Fall 2009
School: Texas A&M
A course in Time Series Analysis
Suhasini Subba Rao
Email: suhasini.subbarao@stat.tamu.edu
October 27, 2008

Contents

1 Introduction: Why do time series?
  1.1 Stationary processes
2 Linear time series
  2.1 Difference equations and back-shift operators
  2.2 The ARMA model
  2.3 The autocovariance function
    2.3.1 The autocovariance of an autoregressive process
    2.3.2 The autocovariance of a moving average process
    2.3.3 The autocovariance of an autoregressive moving average process
    2.3.4 The partial covariance
  2.4 The autocovariance function, invertibility and causality
3 Prediction
  3.1 Basis and linear vector spaces
    3.1.1 Orthogonal basis
    3.1.2 Spaces spanned by infinite number of elements
  3.2 Durbin-Levinson algorithm
  3.3 Prediction for ARMA processes
  3.4 The Wold Decomposition
4 Estimation for Linear models
  4.1 Estimation of the mean and autocovariance function
    4.1.1 Estimating the mean
    4.1.2 Estimating the covariance
    4.1.3 Some asymptotic results on the covariance estimator
  4.2 Estimation for AR models
    4.2.1 The Yule-Walker estimator
    4.2.2 The Gaussian maximum likelihood (least squares estimator)
  4.3 Estimation for ARMA models
    4.3.1 The Hannan and Rissanen AR($\infty$) expansion method
    4.3.2 The Gaussian maximum likelihood estimator
5 Almost sure convergence, convergence in probability and asymptotic normality
  5.1 Modes of convergence
  5.2 Ergodicity
  5.3 Sampling properties
  5.4 Showing almost sure convergence of an estimator
    5.4.1 Proof of Theorem 5.4.2 (The stochastic Ascoli theorem)
  5.5 Almost sure convergence of the least squares estimator for an AR(p) process
  5.6 Convergence in probability of an estimator
  5.7 Asymptotic normality of an estimator
  5.8 Asymptotic normality of the least squares estimator
6 Sampling properties of ARMA parameter estimators
  6.1 Asymptotic properties of the Hannan and Rissanen estimation method
    6.1.1 Proof of Theorem 6.1.1 (A rate for $\|\hat{b}_T - b_T\|_2$)
  6.2 Asymptotic properties of the GMLE
7 Residual Bootstrap for estimation in autoregressive processes
  7.1 The residual bootstrap
  7.2 The sampling properties of the residual bootstrap estimator
8 Spectral Analysis
  8.1 Some Fourier background
  8.2 Motivation
  8.3 Spectral representations
    8.3.1 The spectral distribution
    8.3.2 The spectral representation theorem
    8.3.3 The spectral densities of MA, AR and ARMA models
    8.3.4 Higher order spectrums
  8.4 The Periodogram and the spectral density function
    8.4.1 The periodogram and its properties
    8.4.2 Estimating the spectral density
  8.5 The Whittle Likelihood
9 Nonlinear Time Series
  9.1 The ARCH model
    9.1.1 Some properties of the ARCH process
  9.2 The quasi-maximum likelihood for ARCH processes
    9.2.1 Consistency of the quasi-maximum likelihood estimator
    9.2.2 Asymptotic normality of the quasi-maximum likelihood estimator
  9.3 Testing for linearity of a time series
    9.3.1 Motivating the test statistic
    9.3.2 Estimates of the higher order spectrum
    9.3.3 Hotelling's $T^2$-statistic
    9.3.4 The test statistic for the test for linearity
10 Mixingales
  10.1 Obtaining almost sure rates of convergence for some sums
  10.2 Proof of Theorem 6.1.3
A Appendix
  A.1 Background: some definitions and inequalities

Preface

The material for these notes comes from all over the place: some of it from books and articles, and some of it from my own work. For those interested in reading around the material, the following books may be useful (it is by no means an exhaustive list). For linear time series: Priestley (1983), Brockwell and Davis (1998), Fuller (1995), Grimmett and Stirzaker (1994), Chapter 9, and Shumway and Stoffer (2006). This is not a comprehensive list, and I am sure more books will be included. Bartlett (1981) and M. and Grenander (1997) are very early books on time series; these books were shortly followed by Parzen (1999) (these are references to the latest editions; they were first published in the late 50s and early 60s). The book which brought time series to the masses is Box and Jenkins (1970), which is very useful for any practitioner. At about the same time Hannan (1970) and Anderson (1994) were published, which deal with time series analysis. I have yet to start writing the nonlinear notes. I will update this list.

Chapter 1
Introduction: Why do time series?

A time series is a series of observations $x_t$, each observed at the time $t$. Typically the observations can be over an entire interval, randomly sampled on an interval or at fixed time points. Different types of time sampling require different approaches to the data analysis. However, in this course we will focus on the case that observations are observed at fixed time points, hence we will suppose we observe $\{x_t : t = 1, \ldots, n\}$. Below we give examples of typical time series. Figure 1.1 is of the daily exchange rate between the British pound and the US dollar (after taking log differences).
Figure 1.2 is of the monthly minimum temperatures recorded in the Antarctic and Figure 1.3 is of the global temperature anomalies. Comparing the Antarctic, exchange rate and global temperature data with the simulation of white noise (iid random variables) in Figure 1.4, we see that, unlike the iid realisation, there appears to be 'more smoothness' in the plots and dependence between observations which are located close in time. Figures 1.1, 1.2 and 1.3 are examples of time series, and various time series models are fitted to this type of data.

Hence we observe the time series $\{x_t\}$; usually we assume that $\{x_t\}$ is a realisation from a random process $\{X_t\}$. We formalise this notion below. The random process $\{X_t ; t \in \mathbb{Z}\}$ (where $\mathbb{Z}$ denotes the integers) is defined on the probability space $\{\Omega, \mathcal{F}, P\}$. We explain what these mean below:

(i) $\Omega$ is the set of all possible outcomes. Suppose that $\omega \in \Omega$, then $\{X_t(\omega)\}$ is one realisation from the random process. For any given $\omega$, $\{X_t(\omega)\}$ is not random. In time series we will usually assume that what we observe, $x_t = X_t(\omega)$ (for some $\omega$), is a typical realisation. That is, for any other $\omega^* \in \Omega$, $X_t(\omega^*)$ will be different, but its general or overall characteristics will be similar.

(ii) $\mathcal{F}$ is known as a sigma algebra. It is a set of subsets of $\Omega$ (though not necessarily the set of all subsets, as this can be too large). It consists of all sets to which a probability can be assigned. That is, if $A \in \mathcal{F}$, then $P(A)$ is known.

(iii) $P$ is the probability measure.

After all this formalisation, let us return to the plots in Figures 1.1 and 1.2. We see that Figure 1.1 can be considered as one realisation from the stochastic process $\{X_t\}$. Now, based on the one realisation, we want to make inference about parameters associated with the process $\{X_t\}$, such as the mean. Let us consider estimators of the mean, noting that the discussion below applies equally to any population parameter.
We recall that in classical statistics we usually assume we observe several independent realisations, $\{Z_k\}$, from a random variable $Z$, and use the multiple realisations to make inference about the mean: $\bar{Z} = \frac{1}{n}\sum_{k=1}^{n} Z_k$. Roughly speaking, by using several independent realisations we are sampling over the entire probability space and obtaining a good estimate of the mean. On the other hand, if the samples were highly dependent, then it is likely that $\{Z_k\}$ would be concentrated about a small part of the probability space. In this case, the sample mean would be highly biased.

Figure 1.1: The GBP and USD exchange rate from 2000-2008 (after taking log differences). Axes: log differences against day.
Figure 1.2: The monthly minimum temperatures at Faraday station in the Antarctic. Axes: degrees Celsius against months.
Figure 1.3: The global yearly temperature anomalies from 1850-present. Axes: temperature anomaly against year.
Figure 1.4: A simulation from 150 iid random variables. Axes: white noise value against time.

Now let us consider the time series. For most time series we need to estimate parameters based on only one realisation $x_t = X_t(\omega)$. Therefore, it would appear impossible to obtain a good estimator of the mean. However, good estimates of the mean can be made based on just one realisation, so long as certain assumptions are satisfied: (i) the process is stationary (this is a type of invariance assumption, that is, the main characteristics of the process, such as the mean, do not change over time) and (ii) despite the fact that each time series is generated from one realisation, there is 'short' memory in the observations. That is, what is observed today, $x_t$, has little influence on observations in the future, $x_{t+k}$ (when $k$ is relatively large).
Hence, even though we observe one trajectory, that trajectory traverses much of the probability space. The amount of dependency in the time series determines the 'quality' of the estimator. There are several ways to measure the dependency. The most common measure of linear dependency is the covariance. The covariance in the stochastic process $\{X_t\}$ is defined as
$$\mathrm{cov}(X_t, X_{t+k}) = E(X_t X_{t+k}) - E(X_t)E(X_{t+k}).$$
Hence if $\{X_t\}$ has mean zero, then the above reduces to $\mathrm{cov}(X_t, X_{t+k}) = E(X_t X_{t+k})$. In a lot of statistical analysis the covariance is often sufficient as a measure. However, it is worth bearing in mind that the covariance only measures linear dependence; usually, given $\mathrm{cov}(X_t, X_{t+k})$, we cannot say anything about $\mathrm{cov}(g(X_t), g(X_{t+k}))$, where $g$ is a nonlinear function. There are occasions where we require a more general measure of dependence. Examples of more general measures include mixing (in its various flavours), first introduced by Rosenblatt in the 50s (M. and Grenander (1997)). However, in this course we will not cover mixing.

Websites where the data can be obtained include:
http://www.cru.uea.ac.uk/
http://www.federalreserve.gov/releases/h10/Hist/
http://bossa.pl/notowania/daneatech/metastock/

1.1 Stationary processes

Stationarity is a rather intuitive concept. It is an invariance property which means that the statistical characteristics of the time series do not change over time. For example, the yearly rainfall may vary year by year, but the average rainfall in two time intervals of equal length will be roughly the same, as would the number of times the rainfall exceeds a certain threshold. Of course, over long periods of time this assumption may not be so plausible. For example, the climate change that we are currently experiencing is causing changes in the overall weather patterns (we will consider nonstationary time series towards the end of this course).
However, in many situations, and over shorter intervals, the assumption of stationarity is quite plausible. Indeed, the statistical analysis of a time series is often done under the assumption that the time series is stationary. There are two definitions of stationarity: weak stationarity, which only concerns the covariance of a process, and strict stationarity, which is a much stronger condition and supposes the distributions are invariant over time.

Definition 1.1.1 (Strict stationarity) The time series $\{X_t\}$ is said to be strictly stationary if for any finite sequence of integers $t_1, \ldots, t_k$ and shift $h$ the distributions of $(X_{t_1}, \ldots, X_{t_k})$ and $(X_{t_1+h}, \ldots, X_{t_k+h})$ are the same.

Definition 1.1.2 (Second order stationarity/weak stationarity) The time series $\{X_t\}$ is said to be second order stationary if for any $t$ and $k$ the covariance between $X_t$ and $X_{t+k}$ only depends on the lag difference $k$. In other words, there exists a function $c : \mathbb{Z} \to \mathbb{R}$ such that for all $t$ and $k$ we have $c(k) = \mathrm{cov}(X_t, X_{t+k})$.

Remark 1.1.1 It is easy to show that strict stationarity implies second order stationarity (provided the variance is finite), but the converse is not necessarily true. To show that strict stationarity implies second order stationarity, suppose that $\{X_t\}$ is a strictly stationary process with zero mean and finite variance. Then for any $t$ and $t_1$,
$$\mathrm{cov}(X_t, X_{t+k}) = \int xy\, P_{X_t, X_{t+k}}(dx, dy) = \int xy\, P_{X_{t_1}, X_{t_1+k}}(dx, dy) = \mathrm{cov}(X_{t_1}, X_{t_1+k}),$$
where $P_{X_t, X_{t+k}}$ is the joint distribution of $(X_t, X_{t+k})$. Clearly $\mathrm{cov}(X_t, X_{t+k})$ does not depend on $t$, and $\{X_t\}$ is second order stationary.

The covariance of a stationary process has several very interesting properties. One of the main properties is that it is non-negative definite, which we define below.

Definition 1.1.3 (Non-negative definite) A sequence $\{c(k)\}$ is said to be non-negative definite if for any positive integer $n$ and sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ the following is satisfied:
$$\sum_{i,j=1}^{n} x_i\, c(i-j)\, x_j \geq 0.$$
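To make Definition 1.1.3 concrete, the following minimal sketch evaluates the double sum $\sum_{i,j} x_i c(i-j) x_j$ for two illustrative sequences. Both sequences and the helper name `quadratic_form` are assumptions chosen for illustration and do not come from the notes. Random trials cannot prove non-negative definiteness; they only fail to find a violation, whereas a single vector with a negative value is enough to rule it out.

```python
import random

def quadratic_form(c, x):
    """Return sum_{i,j} x_i * c(i - j) * x_j for a lag function c."""
    n = len(x)
    return sum(x[i] * c(i - j) * x[j] for i in range(n) for j in range(n))

def geometric(k):
    # Illustrative sequence c(k) = 0.5**|k| (an assumption, not from the notes).
    return 0.5 ** abs(k)

def truncated(k):
    # Illustrative sequence with c(0) = 1, c(+-1) = 0.9 and zero elsewhere.
    return {0: 1.0, 1: 0.9}.get(abs(k), 0.0)

random.seed(1)
trials = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(200)]

# No violation is found for the geometric sequence in these random trials.
smallest = min(quadratic_form(geometric, x) for x in trials)

# One alternating vector already gives a negative value for the second sequence,
# so it cannot be the autocovariance of a stationary process.
violation = quadratic_form(truncated, [1.0, -1.0, 1.0])

print(smallest >= 0.0, violation < 0.0)
```

The second sequence fails because a lag-one value of 0.9 is too large relative to $c(0) = 1$ when all higher lags are zero.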
Remark 1.1.2 You have probably encountered this notion before when dealing with non-negative definite (positive definite) matrices. Recall that the $n \times n$ matrix $\Sigma_n$ is non-negative definite if for all $x \in \mathbb{R}^n$, $x'\Sigma_n x \geq 0$. To see how this is related to non-negative definite sequences, suppose that the matrix $\Sigma_n$ has a special form, that is, the elements of $\Sigma_n$ are $(\Sigma_n)_{i,j} = c(i-j)$. Then $x'\Sigma_n x = \sum_{i,j=1}^{n} x_i\, c(i-j)\, x_j$. We observe that in the case that $\{X_t\}$ is a stationary process with covariance $c(k)$, the variance covariance matrix of $\underline{X}_n = (X_1, \ldots, X_n)$ is $\Sigma_n$, where $(\Sigma_n)_{i,j} = c(i-j)$.

We now take the above remark further and show that the covariance of a stationary process is non-negative definite.

Theorem 1.1.1 Suppose that $\{X_t\}$ is a stationary time series with covariance function $\{c(k)\}$, then $\{c(k)\}$ is a non-negative definite sequence. Conversely, for any non-negative definite sequence there exists a stationary time series which has this sequence as its covariance function.

PROOF. To show that $\{c(k)\}$ is non-negative definite, consider any sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ and the double sum $\sum_{i,j=1}^{n} x_i\, c(i-j)\, x_j$. Define the random variable $Y = \sum_{i=1}^{n} x_i X_i$. It is straightforward to see that $\mathrm{var}(Y) = x'\,\mathrm{var}(\underline{X}_n)\,x = \sum_{i,j=1}^{n} x_i\, c(i-j)\, x_j$, where $\underline{X}_n = (X_1, \ldots, X_n)$. Since for any random variable $Y$, $\mathrm{var}(Y) \geq 0$, this means that $\sum_{i,j=1}^{n} x_i\, c(i-j)\, x_j \geq 0$, hence $\{c(k)\}$ is a non-negative definite sequence.

To show the converse, that for any non-negative definite sequence $\{c(k)\}$ we can find a corresponding stationary time series with the covariance $\{c(k)\}$, is relatively straightforward, but depends on defining the characteristic function of a process and using Kolmogorov's extension theorem. We omit the details but refer an interested reader to Brockwell and Davis (1998), Section 1.5.

It is worth noting that a simple way to check for non-negative definiteness of a sequence is to consider its Fourier transform.
If the Fourier transform is positive, then the sequence is non-negative definite; we will look at this in more depth when we consider the spectral density. The above theorem also applies to spatial processes, which is why in spatial statistics one often looks at the construction of positive definite covariance functions.

Chapter 2
Linear time series

Estimating the autocovariances of a time series gives us information about the linear dependence structure of the process. It is therefore desirable to find models which help to explain some of the characteristics which we see in the autocovariances. An important class of models are linear time series models and MA($\infty$) models. We recall that in the linear regression model the dependent variable is influenced by current values of the independent variables. The linear time series model is a generalisation of this idea, where the dependent variable is influenced by past, present and future independent variables. The MA($\infty$) model is a subclass which has a more natural interpretation: here the dependent variable is influenced by the current and past values only. There are two popular sub-groups of linear time series models, (a) the autoregressive and (b) the moving average models, which can be combined to make the autoregressive moving average models. A nice feature of the autoregressive models is that the previous observations linearly influence the current observation. A nice feature of moving average processes is that there is non-zero correlation only for a finite number of lags; for a large enough lag the covariance will be zero.

Before defining a linear time series, we consider the MA($q$) model, which is a small subclass of linear time series. Let us suppose that $\{\varepsilon_t\}$ are iid random variables with mean zero and finite variance. Suppose the time series satisfies
$$X_t = \sum_{j=0}^{q} \theta_j \varepsilon_{t-j}.$$
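As a small illustration of the MA($q$) equation above and of the claim that the covariance vanishes beyond a finite lag, the sketch below uses the standard formula $c(k) = \sigma^2 \sum_{j=0}^{q-|k|} \theta_j \theta_{j+|k|}$ for iid innovations with variance $\sigma^2$. This formula is not derived at this point in the notes, and the weights and the function name `ma_autocovariance` are hypothetical choices made for the example.

```python
def ma_autocovariance(theta, k, sigma2=1.0):
    """Autocovariance at lag k of X_t = sum_{j=0}^{q} theta[j] * eps[t-j].

    Assumes iid innovations with mean zero and variance sigma2.
    """
    k = abs(k)
    q = len(theta) - 1
    if k > q:
        # No innovation is shared by X_t and X_{t+k}, so the covariance is zero.
        return 0.0
    return sigma2 * sum(theta[j] * theta[j + k] for j in range(q - k + 1))

theta = [1.0, 0.5, -0.25]  # hypothetical MA(2) weights theta_0, theta_1, theta_2
autocov = [ma_autocovariance(theta, k) for k in range(5)]
print(autocov)  # lags 3 and 4 are exactly zero
```

For lags larger than $q$ the two observations share no innovations, which is why the function returns zero without any summation.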
It is clear that $X_t$ is a rolling finite weighted sum of $\{\varepsilon_t\}$, therefore $\{X_t\}$ must be well defined (which basically means it is almost surely finite; this you can see because it has a finite variance). Now we extend this idea and look not only at finite sums but at infinite sums of random variables. Things become more complicated. Care must always be taken whenever we deal with anything involving infinite sums! For example, for $\sum_{j=-\infty}^{\infty} \psi_j X_{t-j}$ to make sense, its partial sums $S_n = \sum_{j=-n}^{n} \psi_j X_{t-j}$ must be (almost surely) finite and the sequence $\{S_n\}$ must converge (hence $|S_{n_1} - S_{n_2}| \to 0$ as $n_1, n_2 \to \infty$). Effectively everything must be finite. We give conditions under which this is true in the following lemma.

Lemma 2.0.1 Suppose $\{X_t\}$ is a strictly stationary time series with $E|X_t| < \infty$, then $\{Y_t\}$ defined by
$$Y_t = \sum_{j=-\infty}^{\infty} \psi_j X_{t-j},$$
where $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$, is a strictly stationary time series (and converges almost surely, that is, $Y_{n,t} = \sum_{j=-n}^{n} \psi_j X_{t-j} \to Y_t$ almost surely). If $\mathrm{var}(X_t) < \infty$, then $\{Y_t\}$ is second order stationary and converges in mean square (that is, $E(Y_{n,t} - Y_t)^2 \to 0$).

PROOF. See Brockwell and Davis (1998), Proposition 3.1.1 or Fuller (1995), Theorem 2.1.1 (page 31) (also Shumway and Stoffer (2006), page 86).

Example 2.0.1 Suppose $\{X_t\}$ is any stationary process and $\mathrm{var}(X_t) < \infty$. Define $\{Y_t\}$ as the following infinite sum
$$Y_t = \sum_{j=0}^{\infty} j^k \rho^j X_{t-j},$$
where $|\rho| < 1$. Then $\{Y_t\}$ is also a stationary process with a finite variance. We will use this example later in the course.

Having derived conditions under which infinite sums are well defined, we can now define the general class of linear and MA($\infty$) processes.

Definition 2.0.4 (The linear process and moving average (MA($\infty$)))

(i) A time series is said to be a linear time series if it can be represented as
$$X_t = \sum_{j=-\infty}^{\infty} \psi_j \varepsilon_{t-j},$$
where $\{\varepsilon_t\}$ are iid random variables with finite variance.

(ii) The time series $\{X_t\}$ has an MA($\infty$) representation if it satisfies
$$X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j},$$
where $\{\varepsilon_t\}$ are iid random variables, $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $E(|\varepsilon_t|) < \infty$. If $E(|\varepsilon_t|^2) < \infty$, then it is second order stationary.

The difference between an MA($\infty$) process and a linear process is quite subtle. The difference is that the linear process involves past, present and future innovations $\{\varepsilon_t\}$, whereas the MA($\infty$) uses only past and present innovations. From a modelling perspective, the MA($\infty$) process has a better interpretation.

A very interesting class of models which have MA($\infty$) representations are autoregressive and ARMA models. But in order to define this class we need to take a brief look at difference equations.

2.1 Difference equations and back-shift operators

The autoregressive and ARMA models are defined in terms of inhomogeneous difference equations. Often difference equations are defined in terms of backshift operators, so we start by defining them and how they work below. This representation can be very useful as it can be used to obtain a solution to the equations.

The autoregressive process (AR($p$)) is defined as
$$X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p} = \varepsilon_t,$$
where $\{\varepsilon_t\}$ are zero mean, finite variance random variables. Often the above is written as
$$X_t - \phi_1 B X_t - \ldots - \phi_p B^p X_t = \varepsilon_t, \qquad \text{that is,} \qquad \phi(B) X_t = \varepsilon_t,$$
where $\phi(B) = 1 - \sum_{j=1}^{p} \phi_j B^j$, and $B$ is the backshift operator, defined such that $B^k X_t = X_{t-k}$. Simply rearranging $\phi(B) X_t = \varepsilon_t$ gives the solution of the equation as $X_t = \phi(B)^{-1} \varepsilon_t$; however, this is a simple algebraic manipulation. We need to investigate whether it really has any meaning. To do this, we start with an example.

Example (the AR(1) process)

(i) Consider the AR(1) process
$$X_t = 0.5 X_{t-1} + \varepsilon_t. \qquad (2.1)$$
Notice this is an equation (rather like $3x^2 + 2x + 1 = 0$, or an infinite number of simultaneous equations), which may or may not have a solution. To obtain the solution we note that $X_t = 0.5 X_{t-1} + \varepsilon_t$ and $X_{t-1} = 0.5 X_{t-2} + \varepsilon_{t-1}$. Using this we get $X_t = \varepsilon_t + 0.5(0.5 X_{t-2} + \varepsilon_{t-1}) = \varepsilon_t + 0.5\varepsilon_{t-1} + 0.5^2 X_{t-2}$. Continuing this backward iteration we obtain at the $k$th iteration
$$X_t = \sum_{j=0}^{k} (0.5)^j \varepsilon_{t-j} + (0.5)^{k+1} X_{t-k-1}.$$
Because $(0.5)^{k+1} \to 0$ as $k \to \infty$, by taking the limit we can show that $X_t = \sum_{j=0}^{\infty} (0.5)^j \varepsilon_{t-j}$ is almost surely finite and a solution of (2.1). Of course, like any other equation, one may wonder whether it is the unique solution (recalling that $3x^2 + 2x + 1 = 0$ has two solutions). We show in a later example that it is the unique solution.

Now let us see whether we can obtain a solution using the difference equation representation. We recall that, crudely taking inverses, the solution would be $X_t = (1 - 0.5B)^{-1} \varepsilon_t$. The obvious question is whether this has any meaning. Note that $(1 - 0.5B)^{-1} = \sum_{j=0}^{\infty} (0.5B)^j$ for $|B| < 2$, hence substituting this power series expansion gives
$$X_t = (1 - 0.5B)^{-1} \varepsilon_t = \Big(\sum_{j=0}^{\infty} (0.5B)^j\Big) \varepsilon_t = \Big(\sum_{j=0}^{\infty} (0.5)^j B^j\Big) \varepsilon_t = \sum_{j=0}^{\infty} (0.5)^j \varepsilon_{t-j},$$
which corresponds to the solution above. Hence the backshift operator in this example helps us to obtain a solution.

(ii) Now let us consider the equation
$$X_t = 2 X_{t-1} + \varepsilon_t. \qquad (2.2)$$
Doing what we did in (i), we find that after the $k$th back iteration we have $X_t = \sum_{j=0}^{k} 2^j \varepsilon_{t-j} + 2^{k+1} X_{t-k-1}$. However, unlike example (i), $2^k$ does not converge as $k \to \infty$. This suggests that if we continue the iteration, $X_t = \sum_{j=0}^{\infty} 2^j \varepsilon_{t-j}$ is not a quantity that is well defined (almost surely finite). Since a quantity that is not almost surely finite does not make much sense as the solution of an equation, $X_t = \sum_{j=0}^{\infty} 2^j \varepsilon_{t-j}$ cannot be considered as a solution of (2.2).

Let us see whether the difference equation can also offer a solution. Since $(1 - 2B) X_t = \varepsilon_t$, using the crude manipulation we have $X_t = (1 - 2B)^{-1} \varepsilon_t$. Now we see that $(1 - 2B)^{-1} = \sum_{j=0}^{\infty} (2B)^j$ for $|B| < 1/2$. Using this expansion gives $X_t = \sum_{j=0}^{\infty} 2^j B^j \varepsilon_t$, but as we pointed out above this sum is not well defined. What we find is that $\phi(B)^{-1} \varepsilon_t$ only makes sense (is well defined) if the series expansion of $\phi(B)^{-1}$ converges in a region that includes the unit circle $|B| = 1$. What we need is another series expansion of $(1 - 2B)^{-1}$ which converges in a region which includes $|B| = 1$.
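The contrast between the two backward iterations can be checked numerically. The sketch below is only an illustration under assumed settings (simulated Gaussian innovations, a fixed seed, and the helper name `partial_sum`, none of which come from the notes). It evaluates the partial sums $\sum_{j=0}^{k} \phi^j \varepsilon_{t-j}$ for $\phi = 0.5$ and $\phi = 2$.

```python
import random

def partial_sum(phi, eps, t, k):
    """Backward-iteration partial sum: sum_{j=0}^{k} phi**j * eps[t - j]."""
    return sum(phi ** j * eps[t - j] for j in range(k + 1))

random.seed(0)
eps = [random.gauss(0.0, 1.0) for _ in range(200)]  # simulated innovations
t, k = 150, 100

# phi = 0.5: the truncated sum satisfies X_t = 0.5 X_{t-1} + eps_t up to a
# remainder of order 0.5**(k+1), which is negligible here.
x_t = partial_sum(0.5, eps, t, k)
x_prev = partial_sum(0.5, eps, t - 1, k)
residual = x_t - (0.5 * x_prev + eps[t])

# phi = 2: the partial sums grow with the truncation point instead of settling.
growth = [abs(partial_sum(2.0, eps, t, m)) for m in (10, 40)]

print(abs(residual), growth)
```

With $\phi = 0.5$ the residual is at the level of floating point rounding, while with $\phi = 2$ the partial sums are dominated by the term $2^k \varepsilon_{t-k}$ and keep increasing in magnitude.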
We note that a function does not necessarily have a unique series expansion; it can have different series expansions which may converge in different regions. We now show that the appropriate series expansion will be in negative powers of $B$, not positive powers. We have $(1 - 2B) = -(2B)(1 - (2B)^{-1})$, therefore
$$(1 - 2B)^{-1} = -(2B)^{-1} \sum_{j=0}^{\infty} (2B)^{-j},$$
which converges for $|B| > 1/2$. Using this expansion we have
$$X_t = -\sum_{j=0}^{\infty} (0.5)^{j+1} B^{-j-1} \varepsilon_t = -\sum_{j=0}^{\infty} (0.5)^{j+1} \varepsilon_{t+j+1},$$
which is a well defined (almost surely finite) solution of (2.2), since its coefficients are absolutely summable. The same solution can be obtained by forward iteration. Rewriting (2.2) we have $X_{t-1} = 0.5 X_t - 0.5 \varepsilon_t$. Forward iterating this we get
$$X_{t-1} = -(0.5) \sum_{j=0}^{k} (0.5)^j \varepsilon_{t+j} + (0.5)^{k+1} X_{t+k}.$$
Since $(0.5)^{k+1} \to 0$ as $k \to \infty$, we have $X_{t-1} = -(0.5) \sum_{j=0}^{\infty} (0.5)^j \varepsilon_{t+j}$ as a solution of (2.2).

Let us now summarise our observations for the general AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$. If $|\phi| < 1$, then the solution is in terms of past values of $\{\varepsilon_t\}$; if on the other hand $|\phi| > 1$, the solution is in terms of future values of $\{\varepsilon_t\}$. In terms of the polynomial $\phi(B) = 1 - \phi B$ (often called the characteristic polynomial), we are looking for regions which include the unit circle $|B| = 1$, on which the inverse $\phi(B)^{-1}$ has a convergent power series expansion. The root of $\phi(z) = 1 - \phi z$ is $z = 1/\phi$. We see that if the root of $\phi(z)$ has absolute value greater than one, then the power series of $\phi(B)^{-1}$ is in terms of positive powers; if it is less than one, then $\phi(B)^{-1}$ is in terms of negative powers.

Generalising this argument to a general polynomial, if all the roots of $\phi(z)$ have absolute value greater than one, the power series of $\phi(B)^{-1}$ is in terms of positive powers (hence the solution $\phi(B)^{-1} \varepsilon_t$ will be in terms of past and present values of $\{\varepsilon_t\}$). If on the other hand some roots have absolute value less than one and others greater than one (but none lie on the unit circle), the power series of $\phi(B)^{-1}$ will be in both negative and positive powers, and the solution $X_t = \phi(B)^{-1} \varepsilon_t$ will be in terms of both past and future values of $\{\varepsilon_t\}$. We see that the location of the roots of the characteristic polynomial $\phi(B)$ determines the solution of the AR process.
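The root condition summarised above can be sketched in a few lines. This is an illustrative helper rather than anything from the notes: the names `ar_roots` and `classify` and the AR(2) coefficients used in the example are hypothetical, and for simplicity only orders one and two are handled, using the quadratic formula.

```python
import cmath

def ar_roots(phi):
    """Roots of phi(z) = 1 - phi[0]*z - phi[1]*z**2 (orders 1 and 2 only)."""
    if len(phi) == 2 and phi[1] != 0:
        # Solve -phi[1]*z**2 - phi[0]*z + 1 = 0 with the quadratic formula.
        disc = cmath.sqrt(phi[0] ** 2 + 4 * phi[1])
        return [(phi[0] + disc) / (-2 * phi[1]), (phi[0] - disc) / (-2 * phi[1])]
    if phi[0] != 0:
        return [1 / phi[0]]
    return []

def classify(phi, tol=1e-12):
    """Say which innovations the stationary solution depends on."""
    moduli = [abs(z) for z in ar_roots(phi)]
    if any(abs(m - 1) <= tol for m in moduli):
        return "unit root"        # no convergent expansion on |z| = 1
    if all(m > 1 for m in moduli):
        return "past only"        # expansion in positive powers of B
    if all(m < 1 for m in moduli):
        return "future only"      # expansion in negative powers of B
    return "past and future"      # two-sided expansion

for coeffs in ([0.5], [2.0], [2.5, -1.0], [1.0]):
    print(coeffs, classify(coeffs))
```

The hypothetical AR(2) polynomial $1 - 2.5z + z^2 = (1 - 0.5z)(1 - 2z)$ has roots $2$ and $0.5$, one on each side of the unit circle, so its solution involves both past and future innovations.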
We will show in Section 2.3.1 that it not only defines the solution but also determines some of the characteristics of the time series.

Example 2.1.1 Suppose $\{X_t\}$ satisfies
$$X_t = 0.75 X_{t-1} - 0.125 X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables. We want to obtain a solution for the above equations. It is not easy to use the backward (or forward) iterating technique for AR processes beyond order one. This is where using the backshift operator becomes useful. We start by writing $X_t = 0.75 X_{t-1} - 0.125 X_{t-2} + \varepsilon_t$ as $\phi(B) X_t = \varepsilon_t$, where $\phi(B) = 1 - 0.75B + 0.125B^2$, which leads to what is commonly known as the characteristic polynomial $\phi(z) = 1 - 0.75z + 0.125z^2$. The solution is $X_t = \phi(B)^{-1} \varepsilon_t$, if we can find a power series expansion of $\phi(B)^{-1}$ which is valid for $|B| = 1$.

We first observe that $\phi(z) = 1 - 0.75z + 0.125z^2 = (1 - 0.5z)(1 - 0.25z)$. Therefore, by using partial fractions, we have
$$\frac{1}{\phi(z)} = \frac{1}{(1 - 0.5z)(1 - 0.25z)} = \frac{2}{1 - 0.5z} - \frac{1}{1 - 0.25z}.$$
We recall from geometric expansions that
$$\frac{2}{1 - 0.5z} = 2\sum_{j=0}^{\infty} (0.5)^j z^j \quad (|z| < 2), \qquad \frac{-1}{1 - 0.25z} = -\sum_{j=0}^{\infty} (0.25)^j z^j \quad (|z| < 4).$$
Putting the above together gives
$$\frac{1}{(1 - 0.5z)(1 - 0.25z)} = \sum_{j=0}^{\infty} \{2(0.5)^j - (0.25)^j\} z^j, \qquad |z| < 2.$$
As a check, the coefficient of $z$ is $2(0.5) - 0.25 = 0.75$, which agrees with $1/\phi(z) = 1 + 0.75z + \ldots$. Since the above expansion is valid for $|z| = 1$, we have $\sum_{j=0}^{\infty} |2(0.5)^j - (0.25)^j| < \infty$ (see Lemma 2.1.1; this is also clear to see). Hence
$$X_t = \{(1 - 0.5B)(1 - 0.25B)\}^{-1} \varepsilon_t = \sum_{j=0}^{\infty} \{2(0.5)^j - (0.25)^j\} B^j \varepsilon_t = \sum_{j=0}^{\infty} \{2(0.5)^j - (0.25)^j\} \varepsilon_{t-j},$$
which gives a stationary solution to the AR(2) process (see Lemma 2.0.1).

The discussion above motivates how the backshift operator can be applied and how it can be used to obtain solutions to difference equations. We formalise this below. It is worth noting that if you understand the discussion above, you do not have to worry much about the formal setting.

Definition 2.1.1 (Analytic functions) Suppose that $z \in \mathbb{C}$. $\phi(z)$ is an analytic complex function in the region $\Omega$ if it has a power series expansion which converges in $\Omega$, that is, $\phi(z) = \sum_{j=-\infty}^{\infty} \phi_j z^j$.
If there exists a function $\tilde{\phi}(z) = \sum_{j=-\infty}^{\infty} \tilde{\phi}_j z^j$ such that $\tilde{\phi}(z)\phi(z) = 1$ for all $z \in \Omega$, then $\tilde{\phi}(z)$ is the inverse of $\phi(z)$ in the region $\Omega$.

Well known examples of analytic functions include polynomials such as $\phi(z) = 1 + \phi_1 z + \phi_2 z^2$ (for all $z \in \mathbb{C}$) and $(1 - 0.5z)^{-1} = \sum_{j=0}^{\infty} (0.5z)^j$ for $|z| < 2$.

We observe that for AR processes we can represent the equation as $\phi(B) X_t = \varepsilon_t$, which formally gives the solution $X_t = \phi(B)^{-1} \varepsilon_t$. This raises the question: under what conditions on $\phi(B)^{-1}$ is $\phi(B)^{-1} \varepsilon_t$ valid? For $\phi(B)^{-1} \varepsilon_t$ to make sense, $\phi(B)^{-1}$ should be represented as a power series expansion; we show below what conditions on the power series expansion give the solution. It is worth noting this is closely related to Lemma 2.0.1.

Lemma 2.1.1 Suppose that $\psi(z) = \sum_{j=-\infty}^{\infty} \psi_j z^j$ is finite on a region that includes $|z| = 1$ (hence it is analytic) and $\{X_t\}$ is a strictly stationary process with $E|X_t| < \infty$. Then $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ and $Y_t = \psi(B) X_t = \sum_{j=-\infty}^{\infty} \psi_j X_{t-j}$ is an almost surely finite and strictly stationary time series.

PROOF. It can be shown that if $\sup_{|z|=1} |\psi(z)| < \infty$, in other words on the unit circle $|\sum_{j=-\infty}^{\infty} \psi_j z^j| < \infty$, then $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$. Since the coefficients are absolutely summable, by Lemma 2.0.1 we have that $Y_t = \psi(B) X_t = \sum_{j=-\infty}^{\infty} \psi_j X_{t-j}$ is almost surely finite and strictly stationary.

Rules of the back shift operator:

(i) If $a(z)$ is analytic in a region $\Omega$ which includes the unit circle $|z| = 1$, and the unit circle is not on the boundary of $\Omega$, then $a(B) X_t$ is a well defined random variable.

(ii) The operator is commutative and associative, that is, $[a(B)b(B)] X_t = a(B)[b(B) X_t] = [b(B)a(B)] X_t$ (the square brackets are used to indicate which parts to multiply first). This may seem obvious, but remember matrices are not commutative!

(iii) Suppose that $a(z)$ and its inverse $\frac{1}{a(z)}$ are both finite in the region $\Omega$ which includes the unit circle $|z| = 1$. If $a(B) X_t = Z_t$, then $X_t = \frac{1}{a(B)} Z_t$.

Example 2.1.2 (Useful analytic functions)

(i) Clearly $a(z) = 1 - 0.5z$ is analytic for all $z \in \mathbb{C}$, and has no zeros for $|z| < 2$.
The inverse $\frac{1}{a(z)} = \sum_{j=0}^{\infty} (0.5z)^j$ is well defined in the region $|z| < 2$.

(ii) Clearly $a(z) = 1 - 2z$ is analytic for all $z \in \mathbb{C}$, and has no zeros for $|z| > 1/2$. The inverse
$$\frac{1}{a(z)} = (-2z)^{-1}\Big(1 - \frac{1}{2z}\Big)^{-1} = (-2z)^{-1} \sum_{j=0}^{\infty} \Big(\frac{1}{2z}\Big)^j$$
is well defined in the region $|z| > 1/2$.

(iii) The function $a(z) = \frac{1}{(1 - 0.5z)(1 - 2z)}$ is analytic in the region $0.5 < |z| < 2$.

(iv) $a(z) = 1 - z$ is analytic for all $z \in \mathbb{C}$, but is zero for $z = 1$. Hence its inverse is not well defined for regions which include $|z| = 1$.

The above is quite technical, but it allows us to obtain solutions for ARMA processes and to derive conditions under which they are 'causal and invertible'.

2.2 The ARMA model

We start by defining the ARMA process and then show that it has (under certain conditions) an MA($\infty$) representation.

Definition 2.2.1 (The AR, ARMA and MA processes)

(i) The autoregressive AR($p$) model: $\{X_t\}$ satisfies
$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t. \qquad (2.3)$$
Observe we can write $\phi(B) X_t = \varepsilon_t$.

(ii) The moving average MA($q$) model: $\{X_t\}$ satisfies
$$X_t = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}. \qquad (2.4)$$
Observe we can write $X_t = \theta(B) \varepsilon_t$.

(iii) The autoregressive moving average ARMA($p, q$) model: $\{X_t\}$ satisfies
$$X_t - \sum_{i=1}^{p} \phi_i X_{t-i} = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}. \qquad (2.5)$$
We observe that we can write this as $\phi(B) X_t = \theta(B) \varepsilon_t$.

Example 2.2.1 (The AR(1) model) Consider the AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$, where $|\phi| < 1$. It has almost surely the well defined, unique, stationary, causal solution $X_t = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}$.

By iterating the difference equation, it is clear that $X_t = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}$ is a solution of $X_t = \phi X_{t-1} + \varepsilon_t$. We first need to show that it is well defined (that it is almost surely finite). We note that $|X_t| \leq \sum_{j=0}^{\infty} |\phi^j| |\varepsilon_{t-j}|$, so showing that $\sum_{j=0}^{\infty} |\phi^j| |\varepsilon_{t-j}|$ is almost surely finite will imply that $|X_t|$ is almost surely finite. By monotone convergence we can exchange sum and expectation, and we have
$$E(|X_t|) \leq E\Big(\lim_{n \to \infty} \sum_{j=0}^{n} |\phi^j \varepsilon_{t-j}|\Big) = \lim_{n \to \infty} \sum_{j=0}^{n} |\phi^j| E|\varepsilon_{t-j}| = E(|\varepsilon_0|) \sum_{j=0}^{\infty} |\phi^j| < \infty.$$
Therefore, since $E|X_t| < \infty$, $\sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}$ is a well defined solution of $X_t = \phi X_{t-1} + \varepsilon_t$.
To show that it is the unique (causal) solution, let us suppose there is another (causal) solution, call it $Y_t$ (note that this part of the proof is useful to know as such methods are often used when obtaining solutions of time series models). Clearly, by recursively applying the difference equation to $Y_t$, for every $s$ we have
\[ Y_t = \sum_{j=0}^{s} \phi^j \varepsilon_{t-j} + \phi^{s+1} Y_{t-s-1}. \]
Evaluating the difference between the two solutions gives $Y_t - X_t = A_s - B_s$, where $A_s = \phi^{s+1} Y_{t-s-1}$ and $B_s = \sum_{j=s+1}^{\infty} \phi^j \varepsilon_{t-j}$, for all $s$. Now to show that $Y_t$ and $X_t$ coincide almost surely we show that for every $\eta > 0$, $\sum_{s=1}^{\infty} P(|A_s - B_s| > \eta) < \infty$. By the Borel-Cantelli lemma, this would imply that the event $\{|A_s - B_s| > \eta\}$ happens almost surely only finitely often. Since $A_s - B_s = Y_t - X_t$ does not depend on $s$, and for every $\eta > 0$ the event $\{|A_s - B_s| > \eta\}$ occurs (almost surely) only finitely often, we have $Y_t = X_t$ almost surely.

We now show that $\sum_{s=1}^{\infty} P(|A_s - B_s| > \eta) < \infty$. We note that if $|A_s - B_s| > \eta$, then either $|A_s| > \eta/2$ or $|B_s| > \eta/2$. Therefore $P(|A_s - B_s| > \eta) \le P(|A_s| > \eta/2) + P(|B_s| > \eta/2)$, and by using Markov's inequality we have $P(|A_s - B_s| > \eta) \le C|\phi|^s/\eta$ for some finite constant $C$ (note that since $Y_t$ is assumed stationary, $E|Y_t| \le E|\varepsilon_t|/(1 - |\phi|) < \infty$). Hence $\sum_{s=1}^{\infty} P(|A_s - B_s| > \eta) \le \sum_{s=1}^{\infty} C|\phi|^s/\eta < \infty$, thus $X_t = Y_t$ almost surely. Hence $X_t = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}$ is (almost surely) the unique causal solution.

We now consider a generalisation of the above example to ARMA processes.

Lemma 2.2.1 Let us suppose $X_t$ is an ARMA($p, q$) process. If the roots of the polynomial $\phi(z)$ lie outside the unit circle and have absolute value greater than $1 + \delta$, then $X_t$ almost surely has the solution
\[ X_t = \sum_{j=0}^{\infty} a_j \varepsilon_{t-j}, \tag{2.6} \]
where for $j > q$, $a_j = [A^j]_{1,1} + \sum_{i=1}^{q} \theta_i [A^{j-i}]_{1,1}$, with
\[ A = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}, \]
and where $\sum_j |a_j| < \infty$ (we note that really $a_j = a_j(\phi, \theta)$ since it is a function of $\{\phi_i\}$ and $\{\theta_i\}$). Moreover for all $j$,
\[ |a_j| \le K\rho^j \tag{2.7} \]
for some finite constant $K$ and $1/(1 + \delta) < \rho < 1$.

If the roots of $\theta(z)$ have absolute value greater than $1 + \delta$, then (2.5) can be written as
\[ X_t = \sum_{j=1}^{\infty} b_j X_{t-j} + \varepsilon_t, \tag{2.8} \]
where
\[ |b_j| \le K\rho^j \tag{2.9} \]
for some finite constant $K$ and $\rho = 1 - \delta/2$.
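As a numerical sanity check on the coefficient formula in Lemma 2.2.1, the following Python sketch (not part of the original notes) builds the matrix $A$ and evaluates $a_j = \sum_{i=0}^{\min(j,q)} \theta_i [A^{j-i}]_{1,1}$ with $\theta_0 = 1$. The function names `companion` and `ma_inf_coeffs` and the use of NumPy are assumptions made for illustration. For $j \le q$ the sum is truncated at $i = j$, which is what iterating the vector form of the ARMA recursion gives.

```python
import numpy as np

def companion(phi):
    """Matrix A of Lemma 2.2.1 for AR coefficients phi_1..phi_p (requires p >= 1)."""
    p = len(phi)
    A = np.zeros((p, p))
    A[0, :] = phi                      # first row holds phi_1, ..., phi_p
    if p > 1:
        A[1:, :-1] = np.eye(p - 1)     # ones on the sub-diagonal
    return A

def ma_inf_coeffs(phi, theta, n):
    """First n MA(infinity) coefficients a_0, ..., a_{n-1} of a causal ARMA(p, q)."""
    A = companion(phi)
    tops = []                          # tops[k] = [A^k]_{1,1}
    Ak = np.eye(len(phi))
    for _ in range(n):
        tops.append(Ak[0, 0])
        Ak = Ak @ A
    th = np.concatenate(([1.0], theta))   # theta_0 = 1
    q = len(theta)
    return np.array([
        sum(th[i] * tops[j - i] for i in range(min(j, q) + 1))
        for j in range(n)
    ])

# Illustrative ARMA(2, 1): X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + e_t + 0.4 e_{t-1}
a = ma_inf_coeffs([1.5, -0.75], [0.4], 20)
```

For an AR(1) with coefficient $\phi$ the sketch reduces to $a_j = \phi^j$, in agreement with Example 2.2.1.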
PROOF. We first show that if $X_t$ comes from an ARMA process where the roots lie outside the unit circle then it has the representation (2.6). There are several ways to prove the result. The proof we consider here is similar to the proof given in Example 2.2.1. We write the ARMA process as a vector difference equation
\[ \underline{X}_t = A\underline{X}_{t-1} + \underline{\varepsilon}_t, \tag{2.10} \]
where $\underline{X}_t = (X_t, \ldots, X_{t-p+1})'$ and $\underline{\varepsilon}_t = (\varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}, 0, \ldots, 0)'$. Now iterating (2.10), we have
\[ \underline{X}_t = \sum_{j=0}^{\infty} A^j \underline{\varepsilon}_{t-j}. \tag{2.11} \]
Concentrating on the first element of the vector $\underline{X}_t$ we see that
\[ X_t = \sum_{i=0}^{\infty} [A^i]_{1,1} \Big(\varepsilon_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-i-j}\Big). \]
Comparing (2.6) and the above it is clear that for $j > q$, $a_j = [A^j]_{1,1} + \sum_{i=1}^{q} \theta_i [A^{j-i}]_{1,1}$. Observe that the above representation is very similar to the AR(1) given in Example 2.2.1. Indeed, as we will show below, $A^j$ behaves in much the same way as $\phi^j$ in Example 2.2.1. As with $\phi^j$, we will show that $A^j$ converges to zero as $j \to \infty$ (because the eigenvalues of $A$ are less than one in absolute value). We now show that $|X_t| \le K \sum_{j=0}^{\infty} \rho^j |\varepsilon_{t-j}|$ for some $0 < \rho < 1$; this will mean that $|a_j| \le K\rho^j$. To bound $|X_t|$ we will bound $\|\underline{X}_t\|_2$ (since $|X_t| \le \|\underline{X}_t\|_2$). Now we note that using (2.11) gives
\[ \|\underline{X}_t\|_2 \le \sum_{j=0}^{\infty} \|A^j\|_{spec} \|\underline{\varepsilon}_{t-j}\|_2. \]
Hence a bound for $\|A^j\|_{spec}$ gives a bound for $|a_j|$ (note that $\|A\|_{spec}$ is the spectral norm of $A$, which is the square root of the largest eigenvalue of the symmetric matrix $AA'$). To get this bound we use a few tricks. Below we will show that the largest absolute eigenvalue of $A$ is less than 1; this means that the largest absolute eigenvalue of $A^j$ gets smaller as $j$ grows, hence $A^j$ is contracting. We formalise this now. To show that the largest absolute eigenvalue of $A$ is less than one, we consider $\det(A - zI)$ (whose zeros are the eigenvalues of $A$):
\[ \det(A - zI) = (-1)^p\Big(z^p - \sum_{i=1}^{p} \phi_i z^{p-i}\Big) = (-1)^p z^p \Big(1 - \sum_{i=1}^{p} \phi_i z^{-i}\Big) = (-1)^p z^p \phi(z^{-1}), \]
where $\phi(z) = 1 - \sum_{i=1}^{p} \phi_i z^i$ is the characteristic polynomial of the AR part of the ARMA process.
Since the roots of $\phi(z)$ lie outside of the unit circle, the roots of $\phi(z^{-1})$ lie inside the unit circle and the eigenvalues of $A$ are less than one in absolute value. Clearly if the absolute value of the smallest root of $\phi(z)$ is greater than $1 + \delta$, then the largest absolute eigenvalue of $A$ is less than $1/(1 + \delta)$ and the largest absolute eigenvalue of $A^j$ is less than $1/(1 + \delta)^j$.

We now show that $\|A^j\|_{spec}$ also decays at a geometric rate. It can be shown that if the largest absolute eigenvalue of $A$, denoted $\lambda_{\max}(A)$, is such that $\lambda_{\max}(A) \le 1/(1 + \delta)$, then there exists a $\rho$, where $1/(1 + \delta) < \rho < 1$, such that $\|A^j\|_{spec} \le K\rho^j$ for all $j > 0$ (c.f. Moulines et al. (2005), Lemma 12). Therefore we have
\[ \|\underline{X}_t\|_2 \le K \sum_{j=0}^{\infty} \rho^j \|\underline{\varepsilon}_{t-j}\|_2 \]
and $|a_j| \le K\rho^j$. To show that the solution is unique we use the same method given in Example 2.2.1. To show (2.8) we use a similar proof, and omit the details.

Remark 2.2.1 As we mentioned in the proof of Lemma 2.2.1, there are several methods to prove the result. Another method uses that the roots of $\phi(z)$ lie outside the unit circle, so that a power series expansion of $\frac{1}{\phi(z)}$ can be made. Therefore we can obtain the coefficients $\{a_j\}$ by considering the coefficients of the power series $\frac{\theta(z)}{\phi(z)}$. Using this method it may not be immediately obvious that the coefficients in the MA($\infty$) expansion decay exponentially. We will clarify this here.

Let us denote the power series expansion as $\frac{1}{\phi(z)} = \sum_{j=0}^{\infty} \tilde{\phi}_j z^j$. We note that in the case that the roots of the characteristic polynomial $\phi(z)$ are $\lambda_1, \ldots, \lambda_p$ and are distinct, then
\[ \frac{1}{\phi(z)} = \sum_{j=0}^{\infty} \Big(\sum_{k=1}^{p} C_k \lambda_k^{-j}\Big) z^j \]
for some constants $\{C_k\}$. It is clear in this case that the coefficients of $\frac{1}{\phi(z)}$ decay exponentially fast, that is for some constant $C$ we have $|\tilde{\phi}_j| \le C(\min_k |\lambda_k|)^{-j}$.

However in the case that the roots of $\phi(z)$ are not necessarily distinct, let us say they are $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$ ($\sum_{k} m_k = p$). Then
\[ \frac{1}{\phi(z)} = \sum_{j=0}^{\infty} \Big(\sum_{k=1}^{s} \lambda_k^{-j} P_{m_k}(j)\Big) z^j, \]
where $P_{m_k}(j)$ is a polynomial of order $m_k$. Despite the appearance of the polynomial term in the expansion the coefficients $\tilde{\phi}_j$ still decay exponentially fast.
It can be shown that for any $\rho > (\min_k |\lambda_k|)^{-1}$ there exists a constant $C$ such that $|\tilde{\phi}_j| \le C\rho^j$ (we can see this if we make an expansion of $(\lambda_k + \epsilon)^j$, where $\epsilon$ is any small quantity). Hence the influence of the polynomial terms $P_{m_k}(j)$ in the power series expansion is minimal.

Remark 2.2.2 Suppose that no root of $\phi(z)$ lies on the unit circle, the smallest root outside the unit circle has absolute value greater than $(1 + \delta_1)$ and the largest root inside the unit circle has absolute value less than $(1 - \delta_2)$. Then $1/\phi(z)$ has a Laurent series expansion $\frac{1}{\phi(z)} = \sum_{j=-\infty}^{\infty} \tilde{\phi}_j z^j$ which converges for $1 - \delta_2 \le |z| \le 1 + \delta_1$ (an annulus which contains the unit circle and none of the roots). Hence $X_t$ has the solution $X_t = \phi(B)^{-1}\theta(B)\varepsilon_t = \sum_{j=-\infty}^{\infty} a_j \varepsilon_{t-j}$, where the coefficients $a_j$ are obtained from the expansion of $\theta(z)/\phi(z)$.

We note that in Lemma 2.2.1 we assumed that the roots of the characteristic polynomial $\phi(z)$ lay outside the unit circle $|z| = 1$. This basically imposed a causality condition on the solution. When the roots do not necessarily lie outside the unit circle the solution is no longer $\sum_{j=0}^{\infty} a_j \varepsilon_{t-j}$ but $\sum_{j=-\infty}^{\infty} a_j \varepsilon_{t-j}$, hence we go from an MA($\infty$) to the more general linear process. We define below causality and a closely related concept called invertibility.

Definition 2.2.2 (Causality and Invertibility)

(i) Causality: A process is called causal if it can be written as the MA($\infty$) process $X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$.

(ii) Invertibility: A process is called invertible if it can be written as an AR($\infty$) process, that is $X_t = \sum_{j=1}^{\infty} \phi_j X_{t-j} + \varepsilon_t$, where $\sum_{j=1}^{\infty} |\phi_j| < \infty$.

Typically we will consider processes which are causal. Invertibility is a closely related concept which says that $X_t$ can be represented in terms of previous values of $X_t$ and an innovation which is independent of the past. The following result states when an ARMA process is invertible.

Lemma 2.2.2 An ARMA process is invertible if the roots of $\theta(z)$ lie outside the unit circle and causal if the roots of $\phi(z)$ lie outside the unit circle.
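The root conditions in Lemma 2.2.2 are easy to check numerically. The sketch below is an illustration only: the helper names `poly_roots`, `is_causal` and `is_invertible` are hypothetical, and it assumes the sign conventions $\phi(z) = 1 - \sum_i \phi_i z^i$ and $\theta(z) = 1 + \sum_j \theta_j z^j$. It uses `numpy.roots`, which expects the coefficient of the highest power first.

```python
import numpy as np

def poly_roots(coeffs_low_to_high):
    """Roots of c_0 + c_1 z + ... + c_n z^n (np.roots wants the highest power first)."""
    return np.roots(np.asarray(coeffs_low_to_high, dtype=float)[::-1])

def is_causal(phi):
    """True if every root of phi(z) = 1 - phi_1 z - ... - phi_p z^p has modulus > 1."""
    roots = poly_roots(np.concatenate(([1.0], -np.asarray(phi, dtype=float))))
    return bool(np.all(np.abs(roots) > 1.0))

def is_invertible(theta):
    """True if every root of theta(z) = 1 + theta_1 z + ... + theta_q z^q has modulus > 1."""
    roots = poly_roots(np.concatenate(([1.0], np.asarray(theta, dtype=float))))
    return bool(np.all(np.abs(roots) > 1.0))

# Illustrative values: an AR(2) part with coefficients (1.5, -0.75)
# and two candidate MA(1) parts.
print(is_causal([1.5, -0.75]))   # AR roots are 1 +/- i/sqrt(3)
print(is_invertible([0.5]))      # MA root at z = -2
print(is_invertible([2.0]))      # MA root at z = -0.5
```

Roots that lie numerically very close to the unit circle should be treated with care, since the comparison with 1 is then sensitive to rounding error.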
One of the main advantages of the invertibility property is in prediction and estimation. We will consider this in detail below. It is worth noting that even if an ARMA process is not invertible, one can generate a time series which has identical correlation structure but is invertible (see Section 2.4).

2.3 The autocovariance function

The autocovariance function (ACF) is defined as the sequence of covariances of a stationary process. That is, suppose that $\{X_t\}$ is a stationary process; then $\{c(k) : k \in \mathbb{Z}\}$ is the ACF of $\{X_t\}$, where $c(k) = \mathrm{cov}(X_0, X_k)$ (which equals $E(X_0 X_k)$ when the mean is zero). Clearly different time series give rise to different features in the ACF. We will explore some of these features below.

First we consider a general result on the covariance of a causal ARMA process. We evaluate the covariance of an ARMA process using its MA($\infty$) representation. Let us suppose that $\{X_t\}$ is a causal ARMA process, then it has the representation in (2.6) (where the roots of $\phi(z)$ have absolute value greater than $1 + \delta$). Using (2.6) and the independence of $\{\varepsilon_t\}$ we have
\[ \mathrm{cov}(X_t, X_\tau) = \mathrm{cov}\Big(\sum_{j=0}^{\infty} a_j \varepsilon_{t-j}, \sum_{j=0}^{\infty} a_j \varepsilon_{\tau-j}\Big) \tag{2.12} \]
\[ = \sum_{j=0}^{\infty} a_j a_{j+|t-\tau|} \mathrm{var}(\varepsilon_t). \tag{2.13} \]
Using (2.7) we have
\[ |\mathrm{cov}(X_t, X_\tau)| \le \mathrm{var}(\varepsilon_t) K^2 \sum_{j=0}^{\infty} \rho^{j} \rho^{j+|t-\tau|} = \mathrm{var}(\varepsilon_t) K^2 \rho^{|t-\tau|} \sum_{j=0}^{\infty} \rho^{2j} = \mathrm{var}(\varepsilon_t) K^2 \frac{\rho^{|t-\tau|}}{1 - \rho^2}, \tag{2.14} \]
for any $1/(1 + \delta) < \rho < 1$.

The above bound is useful and will be used in several proofs below. However, other than telling us that the ACF decays exponentially fast, it is not very enlightening about the features of the process. In the following we consider the ACF of an autoregressive process. So far we have used the characteristic polynomial associated with an AR process to determine whether it was causal. Now we show that the roots of the characteristic polynomial also give information about the ACF and what a `typical' realisation of an autoregressive process could look like.

2.3.1 The autocovariance of an autoregressive process

Let us consider the zero mean causal AR($p$) process $\{X_t\}$ where
\[ X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t. \tag{2.15} \]
Now given that $\{X_t\}$ is causal we can derive a recursion for the covariances. Multiplying both sides of the above equation by $X_{t-k}$ ($k \ge 1$) and taking expectations gives the equation
\[ E(X_t X_{t-k}) = \sum_{j=1}^{p} \phi_j E(X_{t-j} X_{t-k}) + E(\varepsilon_t X_{t-k}) = \sum_{j=1}^{p} \phi_j E(X_{t-j} X_{t-k}), \]
where $E(\varepsilon_t X_{t-k}) = 0$ because, by causality, $X_{t-k}$ depends only on $\varepsilon_{t-k}, \varepsilon_{t-k-1}, \ldots$. These are the Yule-Walker equations; we will discuss them in detail when we consider estimation. For now, letting $c(k) = E(X_0 X_k)$ and using the above we see that the autocovariance satisfies the homogeneous difference equation
\[ c(k) - \sum_{j=1}^{p} \phi_j c(k - j) = 0, \tag{2.16} \]
for $k \ge 1$. In other words the autocovariance function of $\{X_t\}$ is the solution of this difference equation. The study of difference equations is an entire field of research, however we will now scratch the surface to obtain a solution for (2.16). Solving (2.16) is very similar to solving homogeneous differential equations, which some of you may be familiar with (do not worry if you are not). Now consider the characteristic polynomial of the AR process
\[ 1 - \sum_{j=1}^{p} \phi_j z^j = 0, \]
which has the roots $\lambda_1, \ldots, \lambda_p$. The roots of the characteristic polynomial give the solution to (2.16). It can be shown that if the roots are distinct (not the same) the solution of (2.16) is
\[ c(k) = \sum_{j=1}^{p} C_j \lambda_j^{-k}, \tag{2.17} \]
where the constants $\{C_j\}$ are chosen depending on the initial values $\{c(k) : 1 \le k \le p\}$ and ensure that $c(k)$ is real (recalling that $\lambda_j$ can be complex). In the case that the roots are not distinct let the roots be $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$ ($\sum_{k} m_k = p$). In this case the solution is
\[ c(k) = \sum_{j=1}^{s} \lambda_j^{-k} P_{m_j}(k), \tag{2.18} \]
where $P_{m_j}(k)$ is an $m_j$th order polynomial and the coefficients $\{C_j\}$ are now `hidden' in $P_{m_j}(k)$.

[Figure 2.1: The ACF of the time series $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ (horizontal axis: lag; vertical axis: acf).]

We now study the covariance in greater detail and see what it tells us about a realisation. As a motivation consider the following example.
Example 2.3.1 Consider the AR(2) process
\[ X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t, \tag{2.19} \]
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. The corresponding characteristic polynomial is $1 - 1.5z + 0.75z^2$, which has roots $1 \pm i3^{-1/2} = \sqrt{4/3}\exp(\pm i\pi/6)$. Using the discussion above we see that the autocovariance function of $\{X_t\}$ is
\[ c(k) = (\sqrt{4/3})^{-k}\big(C_1 \exp(-ik\pi/6) + \bar{C}_1 \exp(ik\pi/6)\big), \]
for a particular value of $C_1$. Now write $C_1 = \frac{a}{2}\exp(-ib)$, then the above can be written as
\[ c(k) = a(\sqrt{4/3})^{-k} \cos\Big(\frac{k\pi}{6} + b\Big). \]
We see that the covariance decays at an exponential rate, but there is a periodicity in this decay. This means that observations separated by a lag $k = 12$ are closely correlated (similar-ish in value), which suggests a quasi-periodicity in the time series. The ACF of the process is given in Figure 2.1; notice that it decays to zero but also observe that it undulates. A plot of a realisation of the time series is given in Figure 2.2; notice the quasi-periodicity, with frequency about $\pi/6$ (that is, a period of about 12).

[Figure 2.2: A simulation of the time series $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ (horizontal axis: Time; vertical axis: ar2).]

We now generalise the above example. Let us consider the general AR($p$) process defined in (2.15). Suppose the roots of the corresponding characteristic polynomial are distinct and let us split them into real and complex roots. Because the characteristic polynomial is comprised of real coefficients, the complex roots come in complex conjugate pairs. Hence let us suppose the real roots are $\{\lambda_j\}_{j=1}^{r}$ and the complex roots are $\{\lambda_j, \bar{\lambda}_j\}_{j=r+1}^{r+(p-r)/2}$. The covariance in (2.17) can be written as
\[ c(k) = \sum_{j=1}^{r} C_j \lambda_j^{-k} + \sum_{j=r+1}^{r+(p-r)/2} a_j |\lambda_j|^{-k} \cos(k\theta_j + b_j), \tag{2.20} \]
where for $j > r$ we write $\lambda_j = |\lambda_j|\exp(i\theta_j)$, and $a_j$ and $b_j$ are real constants. Notice that, as in the example above, the covariance decays exponentially with lag, but there is undulation. A typical realisation from such a process will be quasi-periodic with frequencies $\theta_{r+1}, \ldots, \theta_{r+(p-r)/2}$ (that is, periods $2\pi/\theta_j$), though the magnitude of each periodic component will vary.

An interesting discussion on covariances of an AR process and realisations of an AR process is given in Shumway and Stoffer (2006), Chapter 3.3 (it uses the example above). A discussion of difference equations is also given in Brockwell and Davis (1998), Sections 3.3 and 3.6 and Fuller (1995), Section 2.4.

2.3.2 The autocovariance of a moving average process

Suppose that $\{X_t\}$ satisfies
\[ X_t = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}. \]
The covariance is
\[ \mathrm{cov}(X_t, X_{t-k}) = \begin{cases} \mathrm{var}(\varepsilon_t)\sum_{i=0}^{q} \theta_i \theta_{i-k} & k = -q, \ldots, q \\ 0 & \text{otherwise,} \end{cases} \]
where $\theta_0 = 1$ and $\theta_i = 0$ for $i < 0$ and $i > q$. Therefore we see that there is no correlation when the lag between $X_t$ and $X_{t-k}$ is greater than $q$.

2.3.3 The autocovariance of an autoregressive moving average process

We see from the above that an MA($q$) model is only really suitable when we believe that there is no correlation between two random variables separated by more than a certain distance. Often autoregressive models are fitted. However in several applications we find that autoregressive models of a very high order are needed to fit the data. If a very `long' autoregressive model is required, a more suitable model may be the autoregressive moving average process. It has several of the properties of an autoregressive process, but can be more parsimonious than a `long' autoregressive process. In this section we consider the ACF of an ARMA process.

Let us suppose that the causal time series $\{X_t\}$ satisfies the equations
\[ X_t - \sum_{i=1}^{p} \phi_i X_{t-i} = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}. \]
We now define a recursion for the ACF, which is similar to the ACF recursion for AR processes.
Let us suppose that $k > q$, then it can be shown that the autocovariance function of the ARMA process satisfies
\[ E(X_t X_{t-k}) - \sum_{i=1}^{p} \phi_i E(X_{t-i} X_{t-k}) = 0. \]
Now when $1 \le k \le q$ we have
\[ E(X_t X_{t-k}) - \sum_{i=1}^{p} \phi_i E(X_{t-i} X_{t-k}) = \sum_{j=1}^{q} \theta_j E(\varepsilon_{t-j} X_{t-k}) = \sum_{j=k}^{q} \theta_j E(\varepsilon_{t-j} X_{t-k}). \]
We recall that $X_t$ has the MA($\infty$) representation $X_t = \sum_{j=0}^{\infty} a_j \varepsilon_{t-j}$ (see (2.6)), therefore for $k \le j \le q$ we have $E(\varepsilon_{t-j} X_{t-k}) = a_{j-k}\mathrm{var}(\varepsilon_t)$ (where $a(z) = \theta(z)\phi(z)^{-1}$). Altogether the above gives the difference equations
\[ c(k) - \sum_{i=1}^{p} \phi_i c(k - i) = \mathrm{var}(\varepsilon_t) \sum_{j=k}^{q} \theta_j a_{j-k} \quad \text{for } 1 \le k \le q, \tag{2.21} \]
\[ c(k) - \sum_{i=1}^{p} \phi_i c(k - i) = 0 \quad \text{for } q < k, \]
where $c(k) = E(X_0 X_k)$. Since for $k > q$ this is a homogeneous difference equation, its solution is (as in (2.18))
\[ c(k) = \sum_{j=1}^{s} \lambda_j^{-k} P_{m_j}(k), \]
where $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$ ($\sum_{k} m_k = p$) are the roots of the characteristic polynomial $1 - \sum_{j=1}^{p} \phi_j z^j$. The coefficients in the polynomials $P_{m_j}$ are determined by the initial condition given in (2.21). You can also look at Brockwell and Davis (1998), Chapter 3.3 and Shumway and Stoffer (2006), Chapter 3.4.

2.3.4 The partial covariance

We see that by using the autocovariance function we are able to identify the order of an MA($q$) process: recall that when the covariance lag is greater than $q$ the covariance is zero. However the same is not true for AR($p$) processes. The autocovariances do not enlighten us on the order $p$. However a variant of the autocovariance, called the partial autocovariance, is quite informative about the order of an AR($p$). We will consider the partial autocovariance in this section.

In order to define the partial correlation we need to introduce the idea of projection onto a subspace. We will investigate the idea of projections quite thoroughly in Section 3, however we will briefly introduce the concept here. The projection of $X_t$ onto the space spanned by $X_s, X_{s+1}, \ldots, X_{s+k}$ is the best linear predictor of $X_t$ given $X_s, \ldots, X_{s+k}$. We will denote the projection of $X_t$ onto the space spanned by $X_s, X_{s+1}, \ldots, X_{s+k}$ by
\[ P_{sp(X_s, \ldots, X_{s+k})} X_t = \sum_{j=0}^{k} a_j X_{s+k-j}, \]
where $\{a_j\}$ minimises the mean squared error $E(X_t - \sum_{j=0}^{k} a_j X_{s+k-j})^2$. Having defined the notion of projection we can now define the partial correlation. The partial correlation between $X_t$ and $X_{t+k}$ (where $k > 0$) is the correlation between $X_t$ and $X_{t+k}$, `conditioning out' all the random variables between $X_t$ and $X_{t+k}$. More precisely, the partial covariance is defined as
\[ \mathrm{cov}\big(X_{t+k} - P_{sp(X_{t+k-1}, \ldots, X_{t+1})} X_{t+k},\; X_t - P_{sp(X_{t+k-1}, \ldots, X_{t+1})} X_t\big), \]
and the partial correlation is this quantity normalised by the standard deviations of the two residuals. We first consider an example.

Example 2.3.2 Consider the causal AR(1) process $X_t = 0.5X_{t-1} + \varepsilon_t$ where $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) = 1$. Using (2.12) it can be shown that $\mathrm{cov}(X_t, X_{t-2}) = 0.5^2/(1 - 0.5^2) = 1/3$ (compare with the MA(1) process $X_t = \varepsilon_t + 0.5\varepsilon_{t-1}$, where the covariance $\mathrm{cov}(X_t, X_{t-2}) = 0$). Now let us consider the partial covariance between $X_t$ and $X_{t-2}$. Remember we have to `condition out' the random variables in between, which in this case is $X_{t-1}$. It is clear that the projection of $X_t$ onto $X_{t-1}$ is $0.5X_{t-1}$ (since $X_t = 0.5X_{t-1} + \varepsilon_t$). Therefore $X_t - P_{sp(X_{t-1})} X_t = X_t - 0.5X_{t-1} = \varepsilon_t$. The projection of $X_{t-2}$ onto $X_{t-1}$ is a little more complicated, it is
\[ P_{sp(X_{t-1})} X_{t-2} = \frac{E(X_{t-1} X_{t-2})}{E(X_{t-1}^2)} X_{t-1}. \]
Therefore the partial covariance between $X_t$ and $X_{t-2}$ is
\[ \mathrm{cov}\big(X_t - P_{sp(X_{t-1})} X_t,\; X_{t-2} - P_{sp(X_{t-1})} X_{t-2}\big) = \mathrm{cov}\Big(\varepsilon_t,\; X_{t-2} - \frac{E(X_{t-1} X_{t-2})}{E(X_{t-1}^2)} X_{t-1}\Big) = 0, \]
since $\varepsilon_t$ is uncorrelated with $X_{t-1}$ and $X_{t-2}$. In fact the above is true for the partial covariance between $X_t$ and $X_{t-k}$, for all $k \ge 2$. Hence we see that although the autocovariance of an AR(1) process is not zero at lags two and greater, the partial covariance is zero for all lags greater than or equal to two.

Using the same argument as above, it is easy to show that the partial covariance of an AR($p$) for lags greater than $p$ is zero. Hence in many respects the partial covariance can be considered as an analogue of the autocovariance. It should be noted that though the covariance of an MA($q$) is zero for lags greater than $q$, the same is not true for the partial covariance.
Whereas the partial covariance removes correlation for autoregressive processes, it seems to `add' correlation for moving average processes!

If the autocovariances after a certain lag $q$ are zero, it may be appropriate to fit an MA($q$) model to the time series. The autocovariances of any AR($p$) process will decay but not be zero.

If the partial autocovariances after a certain lag $p$ are zero, it may be appropriate to fit an AR($p$) model to the time series. The partial autocovariances of any MA($q$) process will decay but not be zero.

It is interesting to note that the partial covariance is closely related to the coefficients in linear prediction. Suppose that $\{X_t\}$ is a stationary time series, and we consider the projection of $X_{t+1}$ onto the space spanned by $X_t, \ldots, X_1$ (the best linear predictor). The projection is
\[ P_{sp(X_t, \ldots, X_1)} X_{t+1} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j}. \]
Then from the proof of the Durbin-Levinson algorithm in Section 3.2 it can be shown that
\[ \phi_{t,t} = \frac{\mathrm{cov}\big(X_{t+1} - P_{sp(X_t, \ldots, X_2)} X_{t+1},\; X_1 - P_{sp(X_t, \ldots, X_2)} X_1\big)}{E\big(X_{t+1} - P_{sp(X_t, \ldots, X_2)} X_{t+1}\big)^2}. \]
Hence the last coefficient in the prediction is the (normalised) partial covariance. For further reading see Shumway and Stoffer (2006), Section 3.4 and Brockwell and Davis (1998), Section 3.4.

It is worth noting that the partial covariance (correlation) is often used to decide whether there is direct (linear) dependence between random variables. It has several applications, for example in MRI data, where the partial coherence density (a closely related concept) is often investigated.

2.4 The autocovariance function, invertibility and causality

Here we demonstrate that it is very difficult to identify whether a process is noninvertible/noncausal or not from its covariance structure. Hence for most purposes one can usually suppose a process is both causal and invertible (though in a series of papers Richard Davis and coauthors have discussed the advantages of fitting noncausal processes).
To show this we will require the definition of the spectral density function. We will briefly introduce it here but return to it and consider it in depth in later sections.

Definition 2.4.1 (The spectral density) Given the covariances $c(k)$, the spectral density function is defined as
\[ f(\omega) = \sum_{k} c(k) \exp(ik\omega). \]
The covariances can be obtained from the spectral density by using the inverse Fourier transform
\[ c(k) = \frac{1}{2\pi}\int_{0}^{2\pi} f(\omega) \exp(-ik\omega)\, d\omega. \]
Hence the covariance yields the spectral density and vice versa.

We will show later in the course that the spectral density of the ARMA process which satisfies
\[ X_t - \sum_{i=1}^{p} \phi_i X_{t-i} = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \]
and does not have any roots on the unit circle is (with the innovation variance normalised to one)
\[ f(\omega) = \frac{|1 + \sum_{j=1}^{q} \theta_j \exp(ij\omega)|^2}{|1 - \sum_{j=1}^{p} \phi_j \exp(ij\omega)|^2}. \tag{2.22} \]
Now let us suppose the roots of the characteristic polynomial $1 + \sum_{j=1}^{q} \theta_j z^j$ are $\{\lambda_j\}_{j=1}^{q}$ and the roots of $1 - \sum_{j=1}^{p} \phi_j z^j$ are $\{\mu_j\}_{j=1}^{p}$, hence $1 + \sum_{j=1}^{q} \theta_j z^j = \prod_{j=1}^{q} (1 - \lambda_j^{-1} z)$ and $1 - \sum_{j=1}^{p} \phi_j z^j = \prod_{j=1}^{p} (1 - \mu_j^{-1} z)$. Then (2.22) can be written as
\[ f(\omega) = \frac{\prod_{j=1}^{q} |1 - \lambda_j^{-1} \exp(i\omega)|^2}{\prod_{j=1}^{p} |1 - \mu_j^{-1} \exp(i\omega)|^2}. \tag{2.23} \]
Suppose that the roots $\lambda_1, \ldots, \lambda_r$ lie outside the unit circle and the roots $\lambda_{r+1}, \ldots, \lambda_q$ lie inside the unit circle. Similarly, suppose $\mu_1, \ldots, \mu_s$ lie outside the unit circle and the roots $\mu_{s+1}, \ldots, \mu_p$ lie inside the unit circle. Clearly if $r < q$ the process $\{X_t\}$ is not invertible (roots of the MA part lie inside the unit circle) and if $s < p$ the process is not causal (roots of the AR part lie inside the unit circle). We now construct a new process, based on $\{X_t\}$, which is both causal and invertible and has the spectrum (2.23) (up to a multiplicative constant). Define the polynomials
\[ \tilde{\theta}(z) = \Big[\prod_{j=1}^{r} (1 - \lambda_j^{-1} z)\Big]\Big[\prod_{j=r+1}^{q} (1 - \lambda_j z)\Big], \qquad \tilde{\phi}(z) = \Big[\prod_{j=1}^{s} (1 - \mu_j^{-1} z)\Big]\Big[\prod_{j=s+1}^{p} (1 - \mu_j z)\Big]. \]
The roots of $\tilde{\theta}(z)$ are $\{\lambda_j\}_{j=1}^{r}$ and $\{\lambda_j^{-1}\}_{j=r+1}^{q}$, and the roots of $\tilde{\phi}(z)$ are $\{\mu_j\}_{j=1}^{s}$ and $\{\mu_j^{-1}\}_{j=s+1}^{p}$. Clearly the roots of both $\tilde{\theta}(z)$ and $\tilde{\phi}(z)$ lie outside the unit circle. Now define the process
\[ \tilde{\phi}(B)\tilde{X}_t = \tilde{\theta}(B)\varepsilon_t. \]
Clearly $\tilde{X}_t$ is an ARMA process which is both causal and invertible. Let us consider the spectral density of $\tilde{X}_t$; using (2.22) we have
\[ \tilde{f}(\omega) = \frac{\prod_{j=1}^{r} |1 - \lambda_j^{-1} \exp(i\omega)|^2 \prod_{j=r+1}^{q} |1 - \lambda_j \exp(i\omega)|^2}{\prod_{j=1}^{s} |1 - \mu_j^{-1} \exp(i\omega)|^2 \prod_{j=s+1}^{p} |1 - \mu_j \exp(i\omega)|^2}. \]
We observe that
\[ |1 - \lambda_j \exp(i\omega)|^2 = |1 - \bar{\lambda}_j \exp(-i\omega)|^2 = |\bar{\lambda}_j \exp(-i\omega)(\bar{\lambda}_j^{-1} \exp(i\omega) - 1)|^2 = |\lambda_j|^2 |1 - \bar{\lambda}_j^{-1} \exp(i\omega)|^2. \]
Since the polynomials have real coefficients, complex roots occur in conjugate pairs, so taking the product over $j = r+1, \ldots, q$ the conjugates can be dropped: $\prod_{j=r+1}^{q} |1 - \bar{\lambda}_j^{-1} \exp(i\omega)|^2 = \prod_{j=r+1}^{q} |1 - \lambda_j^{-1} \exp(i\omega)|^2$. The same argument applies to the $\mu_j$. Therefore we can rewrite $\tilde{f}(\omega)$ as
\[ \tilde{f}(\omega) = \frac{\prod_{i=r+1}^{q} |\lambda_i|^2}{\prod_{i=s+1}^{p} |\mu_i|^2} \cdot \frac{\prod_{j=1}^{q} |1 - \lambda_j^{-1} \exp(i\omega)|^2}{\prod_{j=1}^{p} |1 - \mu_j^{-1} \exp(i\omega)|^2} = \frac{\prod_{i=r+1}^{q} |\lambda_i|^2}{\prod_{i=s+1}^{p} |\mu_i|^2} f(\omega). \]
Hence $\{X_t\}$ and $\{\tilde{X}_t\}$ have the same spectral density up to a multiplicative constant. The multiplicative constant can be treated as the variance of the innovation, which we now normalise. Hence we define a new process $\{\tilde{\tilde{X}}_t\}$ where
\[ \tilde{\phi}(B)\tilde{\tilde{X}}_t = \tilde{\theta}(B)\, \frac{\prod_{i=s+1}^{p} |\mu_i|}{\prod_{i=r+1}^{q} |\lambda_i|}\, \varepsilon_t; \]
then the processes $\{\tilde{\tilde{X}}_t\}$ and $\{X_t\}$ have identical spectral densities. As was mentioned above, since the spectral density gives the covariance, the covariances of $\{X_t\}$ and $\{\tilde{\tilde{X}}_t\}$ are also the same. This means that based only on the covariances it is not possible to distinguish between a causal (invertible) process and a noncausal (noninvertible) process. In this course we will always assume the ARMA process is invertible and causal, however it is worth bearing in mind that there can arise situations where noncausal (noninvertible) processes may be more appropriate.

Definition 2.4.2 An ARMA process is said to have minimum phase when the roots of $\phi(z)$ and $\theta(z)$ both lie outside of the unit circle. In the case that the roots of $\phi(z)$ lie on the unit circle, $\{X_t\}$ is known as a unit root process.

Chapter 3 Prediction

In this chapter we will consider prediction for stationary time series. The idea is to find the best linear predictor of $X_t$ given the previous observations $X_{t-1}, \ldots, X_1$. This is known as the one-step ahead predictor, as we are predicting only one step ahead of the known observations.
The one-step ahead predictor also has several useful applications in estimation, which we will consider later. A rather simple generalisation of the one-step ahead predictor is the $n$-step ahead predictor. Once we have established the one-step ahead predictor, it is easy to generalise to $n$ steps. First some notation, we use
\[ X_{t+1|t} = \mathrm{BestLin}(X_{t+1} | X_t, \ldots, X_1) = X_{t+1|t,\ldots,1} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j}, \tag{3.1} \]
where $\{\phi_{t,j}\}$ are the coefficients $\{a_{t,j}\}$ which minimise the mean squared error $E(X_{t+1} - \sum_{j=1}^{t} a_{t,j} X_{t+1-j})^2$. The minimised mean squared error $E(X_{t+1} - \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j})^2$ is known as the one-step ahead prediction error.

The predictors we will consider are for stationary time series. The first is for any general stationary time series (but it has interesting applications for AR processes) and the second is for ARMA processes. A general prediction scheme for any type of time series (not necessarily stationary) called the Innovations Algorithm is considered in Brockwell and Davis (1998), Chapter 5.

3.1 Basis and linear vector spaces

Before we continue we first discuss briefly the idea of vector spaces, spans and bases. A more rigorous approach is given in Brockwell and Davis (1998), Chapter 2, and any good linear algebra book. However what is outlined here should be sufficient for the course.

First a quick definition of a vector space. $\mathcal{X}$ is a vector space if for every $x, y \in \mathcal{X}$ and $a, b \in \mathbb{R}$, $ax + by \in \mathcal{X}$ (an ideal example of a vector space is $\mathbb{R}^n$). A normed linear vector space is a vector space equipped with a norm; when the norm comes from an inner product and the space is complete it is called a Hilbert space. The norm satisfies a set of conditions, which I won't give here, but a good example of a normed vector space is $\mathbb{R}^n$, where the inner product between two vectors $x, y \in \mathbb{R}^n$ is $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$.
In this course, the normed vector spaces we will be considering are vector spaces comprising random variables, and the inner product between two random variables in the space is the covariance. From now on we will concentrate on spaces of random variables which have a finite variance.

We say that the random variables $\{X_t, X_{t-1}, \ldots, X_1\}$ span the space $\mathcal{X}_t^1$ if for any $Y \in \mathcal{X}_t^1$ there exist coefficients $\{a_j\}$ such that
\[ Y = \sum_{j=1}^{t} a_j X_{t+1-j}. \tag{3.2} \]
Conversely, the random variables $\{X_t, \ldots, X_1\}$ can be used to define a vector space. That is, we define the space $\mathcal{X}_t^1$, where $Y \in \mathcal{X}_t^1$ if and only if there exist coefficients $\{a_j\}$ with $\sum_j a_j^2 < \infty$ such that $Y = \sum_{j=1}^{t} a_j X_{t+1-j}$. We often write $\mathcal{X}_t^1 = sp(X_t, \ldots, X_1)$ to denote the space spanned by $\{X_t, X_{t-1}, \ldots, X_1\}$.

The basis of a vector space is closely related to a span. $\{X_t, \ldots, X_1\}$ is a basis of $\mathcal{X}_t^1$ if (3.2) is true and, in addition, the representation is unique. That is, there does not exist another set of coefficients $\{b_j\}$ such that $Y = \sum_{j=1}^{t} b_j X_{t+1-j}$. For this reason one can consider a basis as the minimal span, that is the smallest set of elements which can span a space.

Definition 3.1.1 (Projections) The projection of $Y$ onto the space spanned by $(X_t, X_{t-1}, \ldots, X_1)$ is $P_{(X_t, X_{t-1}, \ldots, X_1)} Y = \sum_{j=1}^{t} c_j X_{t+1-j}$, where $\{c_j\}$ is chosen such that the difference $Y - P_{(X_t, X_{t-1}, \ldots, X_1)} Y$ is uncorrelated with (orthogonal to) any element in $sp(X_t, X_{t-1}, \ldots, X_1)$.

3.1.1 Orthogonal basis

In the context of what we will be doing, the most interesting example of an orthogonal basis is related to the best linear predictor. We recall that $X_{t+1|t}$ is the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$ (it is the projection of $X_{t+1}$ onto $\mathcal{X}_t^1$, the space spanned by $X_t, \ldots, X_1$). Therefore no (linear) information about $X_t, \ldots, X_1$ is contained in the difference $X_{t+1} - X_{t+1|t}$.
In other words $X_{t+1} - X_{t+1|t}$ and $X_s$ ($1 \le s \le t$) are orthogonal ($\mathrm{cov}(X_s, X_{t+1} - X_{t+1|t}) = 0$). That is, the spaces $sp(X_{t+1} - X_{t+1|t})$ and $sp(X_t, \ldots, X_1)$ are orthogonal (think perpendicular). Continuing this argument we see that $\{(X_t - X_{t|t-1}), \ldots, (X_2 - X_{2|1}), X_1\}$ are orthogonal random variables ($E((X_t - X_{t|t-1})(X_s - X_{s|s-1})) = 0$ if $s \ne t$). To see that $\{(X_t - X_{t|t-1}), \ldots, (X_2 - X_{2|1}), X_1\}$ and $X_t, \ldots, X_1$ span the same space, we now define the sum of spaces. If $U$ and $V$ are two orthogonal vector spaces (which share the same norm), then $y \in U \oplus V$ if there exist $u \in U$ and $v \in V$ such that $y = u + v$. By the definition of $\mathcal{X}_t^1$, it is clear that $(X_t - X_{t|t-1}) \in \mathcal{X}_t^1$, but $(X_t - X_{t|t-1}) \notin \mathcal{X}_{t-1}^1$. Hence $\mathcal{X}_t^1 = sp(X_t - X_{t|t-1}) \oplus \mathcal{X}_{t-1}^1$. Continuing this argument we see that $\mathcal{X}_t^1 = sp(X_t - X_{t|t-1}) \oplus sp(X_{t-1} - X_{t-1|t-2}) \oplus \ldots \oplus sp(X_1)$. Hence $sp(X_t, \ldots, X_1) = sp(X_t - X_{t|t-1}, \ldots, X_2 - X_{2|1}, X_1)$ (the spaces spanned by $\{(X_t - X_{t|t-1}), \ldots, (X_2 - X_{2|1}), X_1\}$ and $X_t, \ldots, X_1$ are the same). That is, there exist coefficients $\{b_j\}$ such that
\[ Y = \sum_{j=1}^{t} a_j X_{t+1-j} = \sum_{j=1}^{t-1} b_j (X_{t+1-j} - X_{t+1-j|t-j}) + b_t X_1. \]
A useful application of an orthogonal basis is the ease of obtaining the coefficients $b_j$. In other words if $Y = \sum_{j=1}^{t-1} b_j (X_{t+1-j} - X_{t+1-j|t-j}) + b_t X_1$, then $b_j$ can be immediately obtained as
\[ b_j = \frac{E\big(Y (X_{t+1-j} - X_{t+1-j|t-j})\big)}{E\big(X_{t+1-j} - X_{t+1-j|t-j}\big)^2}, \qquad 1 \le j \le t-1. \]
The problem with using the orthogonal representation $\{(X_t - X_{t|t-1}), \ldots, (X_2 - X_{2|1}), X_1\}$ is that it is not easy to obtain $E(Y(X_j - X_{j|j-1}))$ and $E(X_j - X_{j|j-1})^2$. Note that this is not necessarily the case for obtaining the coefficients $\{a_j\}$, and that
\[ (a_1, \ldots, a_t)' = \Sigma_t^{-1} r_t, \tag{3.3} \]
where $(\Sigma_t)_{i,j} = E(X_{t+1-i} X_{t+1-j})$ and $(r_t)_i = E(X_{t+1-i} Y)$.

3.1.2 Spaces spanned by infinite number of elements

These ideas can be generalised to spaces which have an infinite number of elements (random variables) in their basis. Let us now construct the space spanned by an infinite number of random variables $\{X_t, X_{t-1}, \ldots\}$. As always we need to define precisely what we mean by an infinite basis. To do this we construct a sequence of subspaces, each with an increasing number of elements in its basis, and consider the limit of these spaces. Let $\mathcal{X}_t^{-n} = sp(X_t, \ldots, X_{-n})$. Clearly if $m > n$, then $\mathcal{X}_t^{-n} \subset \mathcal{X}_t^{-m}$. Now we define $\mathcal{X}_t^{-\infty} = \bigcup_{n=1}^{\infty} \mathcal{X}_t^{-n}$. However we need to close this space: the space needs to be complete, that is, the limit of a converging sequence must also belong to the space. To make this precise suppose the sequence of random variables $\{Y_s\}$ is such that $Y_s \in \mathcal{X}_t^{-s}$ and $E(Y_{s_1} - Y_{s_2})^2 \to 0$ as $s_1, s_2 \to \infty$. It is clear that $Y_s \in \mathcal{X}_t^{-\infty}$. Since the sequence $\{Y_s\}$ is a Cauchy sequence there exists a limit, that is a random variable $Y$ such that $E(Y_s - Y)^2 \to 0$ as $s \to \infty$. The closure of the space $\mathcal{X}_t^{-\infty}$, denoted $\overline{\mathcal{X}}_t^{-\infty}$, contains the set $\mathcal{X}_t^{-\infty}$ and all the limits of the Cauchy sequences in this set. We often use $\overline{sp}(X_t, X_{t-1}, \ldots)$ to denote $\overline{\mathcal{X}}_t^{-\infty}$.

You really do not have to worry too much about the above; basically $Y \in \overline{sp}(X_t, X_{t-1}, \ldots)$ if $E(Y^2) < \infty$ and we can represent $Y$ (almost surely) as $Y = \sum_{j=1}^{\infty} a_j X_{t+1-j}$, for some coefficients $\{a_j\}$.

The orthogonal basis of $\overline{sp}(X_t, X_{t-1}, \ldots)$

An orthogonal basis of $\overline{sp}(X_t, X_{t-1}, \ldots)$ can be constructed in the same way as an orthogonal basis of $sp(X_t, X_{t-1}, \ldots, X_1)$. The main difference is how to deal with the initial value, which in the case of $sp(X_t, X_{t-1}, \ldots, X_1)$ is $X_1$ and in the case of $\overline{sp}(X_t, X_{t-1}, \ldots)$ is in some sense $X_{-\infty}$, but this is not really a well defined quantity (again we have to be careful with these infinities). Let $X_{t|t-1,\ldots}$ denote the best linear predictor of $X_t$ given $X_{t-1}, X_{t-2}, \ldots$. As in Section 3.1.1 it is clear that $(X_t - X_{t|t-1,\ldots})$ and $X_s$ for $s \le t - 1$ are uncorrelated and $\overline{\mathcal{X}}_t^{-\infty} = sp(X_t - X_{t|t-1,\ldots}) \oplus \overline{\mathcal{X}}_{t-1}^{-\infty}$, where $\overline{\mathcal{X}}_t^{-\infty} = \overline{sp}(X_t, X_{t-1}, \ldots)$. Now let us consider the space $\overline{sp}((X_t - X_{t|t-1,\ldots}), (X_{t-1} - X_{t-1|t-2,\ldots}), \ldots)$. Comparing with the construction in Section 3.1.1, we see that $\overline{sp}((X_t - X_{t|t-1,\ldots}), (X_{t-1} - X_{t-1|t-2,\ldots}), \ldots)$ does not necessarily equal $\overline{sp}(X_t, X_{t-1}, \ldots)$, because it lacks the initial value $X_{-\infty}$. Of course the time $-\infty$ in the past is not really a well defined quantity. Instead we define the initial starting space as the intersection of the subspaces $\overline{\mathcal{X}}_n^{-\infty}$, hence let $\mathcal{X}_{-\infty} = \bigcap_{n=-\infty}^{\infty} \overline{\mathcal{X}}_n^{-\infty}$. Now we note that since $X_n - X_{n|n-1,\ldots}$ and $X_s$ (for any $s \le n - 1$) are orthogonal, $\overline{sp}((X_t - X_{t|t-1,\ldots}), (X_{t-1} - X_{t-1|t-2,\ldots}), \ldots)$ and $\mathcal{X}_{-\infty}$ are orthogonal spaces and
\[ \overline{sp}((X_t - X_{t|t-1,\ldots}), (X_{t-1} - X_{t-1|t-2,\ldots}), \ldots) \oplus \mathcal{X}_{-\infty} = \overline{sp}(X_t, X_{t-1}, \ldots). \]
We will use this discussion when we prove the Wold decomposition theorem.

3.2 Durbin-Levinson algorithm

The Durbin-Levinson algorithm is a simple method for obtaining the coefficients of the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$. It was first proposed in the 40s by Norman Levinson and improved (and adapted to time series) in the early 60s by Jim Durbin. We recall we want to obtain the coefficients $\phi_{t,j}$, where
\[ X_{t+1|t} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j} \tag{3.4} \]
minimises the mean squared error $E(X_{t+1} - \sum_{j=1}^{t} a_j X_{t+1-j})^2$ over all coefficients $\{a_j\}$. Of course the coefficients $\{\phi_{t,j}\}$ can be obtained using (3.3), however if $t$ is large this can be computationally quite intensive. Instead we consider an algorithm which obtains $\{\phi_{t,j}\}$ using $\{\phi_{t-1,j}\}$ and without the need to invert the matrix $\Sigma_t$. This algorithm can only be applied to stationary time series (as it is derived under the assumption of stationarity). We will show later that it is also useful for estimating the parameters of an autoregressive time series.

Let us suppose $\{X_t\}$ is a zero mean stationary time series and $c(k) = E(X_k X_0)$. Let $X_{1|t,\ldots,2}$ denote the best linear predictor of $X_1$ given $X_t, \ldots, X_2$. We first note that by construction $\{X_t, \ldots, X_2\}$ and $X_1 - X_{1|t,\ldots,2}$ are orthogonal, and $\{X_t, \ldots, X_2, X_1\}$ and $\{X_t, \ldots, X_2, X_1 - X_{1|t,\ldots,2}\}$ span the same space.
Hence we can rewrite $X_{t+1|t}$ as
\[ X_{t+1|t} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j} = \sum_{j=1}^{t-1} \tilde{\phi}_{t,j} X_{t+1-j} + a_t (X_1 - X_{1|t,\ldots,2}). \]
We first note that by the orthogonality of $\{X_t, \ldots, X_2\}$ and $X_1 - X_{1|t,\ldots,2}$ we can `aggregate' the predictions, that is
\[ X_{t+1|t} = X_{t+1|t,\ldots,2} + a_t (X_1 - X_{1|t,\ldots,2}). \tag{3.5} \]
We can rewrite the above using the orthogonality of $\{X_t, \ldots, X_2\}$ and $X_1 - X_{1|t,\ldots,2}$ to obtain
\[ a_t = \frac{E\big(X_{t+1|t} (X_1 - X_{1|t,\ldots,2})\big)}{E(X_1 - X_{1|t,\ldots,2})^2}. \]
Furthermore, since $X_{t+1} = X_{t+1|t} + (X_{t+1} - X_{t+1|t})$, and $(X_{t+1} - X_{t+1|t})$ and $\{X_t, \ldots, X_1\}$ are orthogonal, we observe that
\[ a_t = \frac{E\big(X_{t+1} (X_1 - X_{1|t,\ldots,2})\big)}{E(X_1 - X_{1|t,\ldots,2})^2}. \]

So far we have not used the stationarity of the time series $\{X_t\}$, but we do now. We observe that $X_{1|t,\ldots,2}$ is the best linear predictor of $X_1$ given $\{X_2, \ldots, X_t\}$. By stationarity the coefficients of the best linear predictor of $X_t$ given $X_{t-1}, \ldots, X_1$ are the same as those of the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_2$ (due to shift invariance). But the same is also true if we flip the time series around. That is, the coefficients which give the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_2$ are the same (but in reverse order) as those of the best linear predictor of $X_1$ given $X_2, \ldots, X_t$. In other words under the assumption of stationarity we have
\[ X_{t|t-1} = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t-j}, \qquad X_{t+1|t,\ldots,2} = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} \qquad \text{and} \qquad X_{1|t,\ldots,2} = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}. \]
Therefore substituting the above into (3.5) we have
\[ X_{t+1|t} = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} + a_t \Big(X_1 - \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}\Big) = \sum_{j=1}^{t-1} (\phi_{t-1,j} - a_t \phi_{t-1,t-j}) X_{t+1-j} + a_t X_1, \]
where
\[ a_t = \frac{E\big((X_1 - \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}) X_{t+1}\big)}{E(X_1 - X_{1|t,\ldots,2})^2} = \frac{c(t) - \sum_{j=1}^{t-1} \phi_{t-1,j} c(t-j)}{r(t)} \tag{3.6} \]
and $r(t) = E(X_t - X_{t|t-1})^2$. Now we note that $X_{t+1|t}$ satisfies (3.4), therefore by comparing coefficients (the linear predictor is unique) we have
\[ \phi_{t,t} = a_t, \qquad \phi_{t,j} = \phi_{t-1,j} - a_t \phi_{t-1,t-j} \quad \text{for } j < t. \]
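The recursion above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `durbin_levinson` and its return layout are hypothetical, NumPy is assumed to be available, and the sketch also uses the prediction error update $r(t+1) = r(t)(1 - a_t^2)$ with initial value $r(1) = c(0)$ from this section.

```python
import numpy as np

def durbin_levinson(c, t_max):
    """Hypothetical helper: run the Durbin-Levinson recursion.

    c      : autocovariances c[0], ..., c[t_max]
    returns: (phi, r) where phi[t] is the array (phi_{t,1}, ..., phi_{t,t})
             and r[t] is the one-step prediction error r(t), t = 1..t_max+1.
    """
    c = np.asarray(c, dtype=float)
    phi = {}
    r = np.zeros(t_max + 2)
    r[1] = c[0]                       # r(1) = c(0)
    prev = np.zeros(0)                # phi_{0,.} is empty
    for t in range(1, t_max + 1):
        # a_t = (c(t) - sum_{j=1}^{t-1} phi_{t-1,j} c(t-j)) / r(t), as in (3.6)
        a = (c[t] - sum(prev[j - 1] * c[t - j] for j in range(1, t))) / r[t]
        cur = np.empty(t)
        # phi_{t,j} = phi_{t-1,j} - a_t * phi_{t-1,t-j} for j < t
        cur[: t - 1] = prev - a * prev[::-1]
        cur[t - 1] = a                # phi_{t,t} = a_t
        r[t + 1] = r[t] * (1.0 - a ** 2)
        phi[t] = cur
        prev = cur
    return phi, r

# Illustration: autocovariance of an AR(1) with coefficient 0.5 and unit
# innovation variance, c(k) = 0.5**k / (1 - 0.25). In theory the predictor
# only uses the most recent observation, with coefficient 0.5.
c_ar1 = np.array([0.5 ** k / 0.75 for k in range(6)])
phi, r = durbin_levinson(c_ar1, 5)
print(phi[5], r[1:])
```

Each step costs $O(t)$ operations, so obtaining all predictors up to order $t$ costs $O(t^2)$, compared with solving (3.3) from scratch at every order.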
Finally, to recursively obtain the one-step ahead prediction error $r(t+1)$ we note that by orthogonality of $\{X_t, \ldots, X_2\}$ and $X_1 - X_{1|t,\ldots,2}$
\begin{align*}
r(t+1) = E(X_{t+1} - X_{t+1|t})^2 &= E\big(X_{t+1} - \mathrm{BestLin}(X_{t+1}|X_t, \ldots, X_2) - a_t (X_1 - X_{1|t,\ldots,2})\big)^2 \\
&= E\big(X_{t+1} - \mathrm{BestLin}(X_{t+1}|X_t, \ldots, X_2)\big)^2 + a_t^2 E(X_1 - X_{1|t,\ldots,2})^2 - 2 a_t E\big(X_{t+1} (X_1 - X_{1|t,\ldots,2})\big) \\
&= r(t) + a_t^2 r(t) - 2 a_t E\big(X_{t+1} (X_1 - X_{1|t,\ldots,2})\big).
\end{align*}
We note that $r(t) = E(X_1 - X_{1|t,\ldots,2})^2$, so by the construction of $a_t$ in (3.6) we have $E(X_{t+1} (X_1 - X_{1|t,\ldots,2})) = a_t r(t)$. Substituting this into the above gives
\[ r(t+1) = r(t) + a_t^2 r(t) - 2 a_t^2 r(t) = r(t)(1 - a_t^2). \]
Hence we have the recursion. Note that the initial values are $\phi_{1,1} = c(1)/c(0)$ and $r(1) = c(0)$ (which is straightforward to prove).

Further references: Brockwell and Davis (1998), Chapter 5 and Fuller (1995), page 82.

3.3 Prediction for ARMA processes

Given the autocovariance of any stationary process the Durbin-Levinson algorithm allows us to systematically obtain one-step predictors without too much computational burden. This includes ARMA processes, where as we have shown in Section 2.3.3 the covariances can be obtained from the ARMA parameters. However, for ARMA processes there are easier methods for doing the prediction, which we describe below.

Let us suppose the ARMA process is both causal and invertible, that is $X_t$ satisfies (2.5) ($X_t - \sum_{i=1}^{p} \phi_i X_{t-i} = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$). Then by using Lemma 2.2.1, $X_t$ can be written as
\[ X_t = \sum_{j=0}^{\infty} a_j(\phi, \theta) \varepsilon_{t-j} \qquad \text{and} \qquad X_t = \sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t-j} + \varepsilon_t, \tag{3.7} \]
where we use $a_j(\phi, \theta)$ and $b_j(\phi, \theta)$ to emphasise that the MA($\infty$) and AR($\infty$) coefficients are functions of $\phi_1, \ldots, \phi_p$ and $\theta_1, \ldots, \theta_q$. The above means that given $\{\varepsilon_k\}_{k=-\infty}^{t}$ we can construct $X_t$, and given $(\{X_k\}_{k=-\infty}^{t-1}, \varepsilon_t)$ we can construct $X_t$ (in other words the sigma-algebras $\sigma(\{\varepsilon_k\}_{k=-\infty}^{t}) = \sigma(\{X_k\}_{k=-\infty}^{t-1}, \varepsilon_t)$).

We recall that $X_{t+1|t}$ is the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$ and the one-step ahead prediction error is $E(X_{t+1} - X_{t+1|t})^2$.
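We now define the best linear predictor of $X_{t+1}$ given the infinite past $X_t, X_{t-1}, \ldots$ as $X_{t+1|t,\ldots}$. In practice $X_{t+1|t}$ can be evaluated but not $X_{t+1|t,\ldots}$, since we do not observe the entire past $\{X_0, X_{-1}, \ldots\}$. However, if $t$ is large, then most of the information in $X_{t+1|t,\ldots}$ will be contained in the first $t$ terms. Since $X_{t+1|t,\ldots}$ is easy to obtain (if $X_0, X_{-1}, \ldots$ were known) we will often use it; we will discuss later its relationship to $X_{t+1|t}$. It is clear from (3.7) that
\[ X_{t+1|t,\ldots} = \sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t+1-j}, \tag{3.8} \]
since $\varepsilon_{t+1}$ is orthogonal to $\{X_t, X_{t-1}, \ldots\}$. Of course $X_0, X_{-1}, \ldots$ are unknown so we approximate $X_{t+1|t,\ldots}$ with a truncated version
\[ \tilde{X}_{t+1|t,\ldots} = \sum_{j=1}^{t} b_j(\phi, \theta) X_{t+1-j}. \tag{3.9} \]
It is clear that $\tilde{X}_{t+1|t,\ldots} \in \mathrm{sp}(X_t, \ldots, X_1)$, but it is not necessarily the best linear predictor (in other words it may not be $X_{t+1|t}$, or equivalently, $X_{t+1} - \tilde{X}_{t+1|t,\ldots}$ may not be orthogonal to $X_k$ for $1 \le k \le t$). However in Proposition 3.3.1 we will show that for large $t$ (a long past $X_t, \ldots, X_1$ is observed), $\tilde{X}_{t+1|t,\ldots} \approx X_{t+1|t}$.

Of course, since $\{b_j(\phi, \theta)\}$ is not easy to evaluate from $\phi_j$ and $\theta_i$, using (3.9) to obtain $\tilde{X}_{t+1|t,\ldots}$ is not straightforward. But we now use the ARMA structure (which we have not used previously) to derive a simple way to calculate $\tilde{X}_{t+1|t,\ldots}$.

To do this we consider again $X_{t+1|t,\ldots}$. We recall that $\varepsilon_t = X_t - X_{t|t-1,\ldots}$; using this we have
\begin{align*}
X_{t+1} &= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \varepsilon_{t+1-i} + \varepsilon_{t+1} \\
&= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \big(X_{t+1-i} - X_{t+1-i|t-i,\ldots}\big) + \varepsilon_{t+1}.
\end{align*}
Therefore, writing $X_s(1) = X_{s+1|s,\ldots}$ for the one-step ahead predictor made at time $s$,
\begin{align*}
X_{t+1|t,\ldots} = \sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t+1-j} &= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \varepsilon_{t+1-i} \\
&= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \big(X_{t+1-i} - X_{t-i}(1)\big).
\end{align*}
We now return to $\tilde{X}_{t+1|t,\ldots}$. Set $Z_t = X_t$ for $1 \le t \le \max(p, q)$, and define the recursion for $t > \max(p, q)$
\[ Z_t = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \big(X_{t+1-i} - Z_{t-i}\big). \]
It is straightforward to show that for $t > \max(p, q)$, $Z_t = \sum_{j=1}^{t} b_j(\phi, \theta) X_{t+1-j}$ (up to the effect of the initial values, which decays geometrically for an invertible process), hence $\tilde{X}_{t+1|t,\ldots} = Z_t$. Hence given the parameters $\{\phi_j\}$ and $\{\theta_j\}$, it is easy to evaluate $\tilde{X}_{t+1|t,\ldots}$ recursively.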
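To make the recursion concrete, here is a minimal Python sketch. The function name `arma_one_step_predictions` is hypothetical, and the sketch makes one assumption that differs from the initialisation $Z_t = X_t$ in the text: all pre-sample observations and innovations are set to zero. With that convention the recursion reproduces the truncated sum (3.9) exactly, which the usage example checks for an ARMA(1,1), where $b_j = (\phi + \theta)(-\theta)^{j-1}$.

```python
import numpy as np

def arma_one_step_predictions(x, phi, theta):
    """Hypothetical helper: recursive one-step predictions for an ARMA(p, q).

    x is the observed series (zero mean assumed), stored 0-based.
    pred[t] predicts x[t] from x[0..t-1]; pred[len(x)] is the forecast of
    the next, unobserved value. Pre-sample values are taken to be zero.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    pred = np.zeros(n + 1)
    eps = np.zeros(n)                 # estimated innovations x[t] - pred[t]
    for t in range(n + 1):
        # AR part: sum_j phi_j X_{t-j}, using only observed values
        ar = sum(phi[j] * x[t - 1 - j]
                 for j in range(len(phi)) if t - 1 - j >= 0)
        # MA part: sum_i theta_i eps_{t-i}, using estimated innovations
        ma = sum(theta[i] * eps[t - 1 - i]
                 for i in range(len(theta)) if t - 1 - i >= 0)
        pred[t] = ar + ma
        if t < n:
            eps[t] = x[t] - pred[t]
    return pred, eps

# Usage: compare with the truncated AR(infinity) sum for an ARMA(1,1)
rng = np.random.default_rng(1)
x = rng.standard_normal(30)
phi1, theta1 = 0.5, 0.4
pred, _ = arma_one_step_predictions(x, [phi1], [theta1])
b = [(phi1 + theta1) * (-theta1) ** (j - 1) for j in range(1, 31)]
direct = np.array([sum(b[j - 1] * x[t - j] for j in range(1, t + 1))
                   for t in range(31)])
print(np.max(np.abs(pred - direct)))
```

The recursion needs only $p + q$ multiplications per time point and never forms the coefficients $b_j(\phi, \theta)$ explicitly.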
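We show in the following proposition that $\tilde{X}_{t+1|t,\ldots}$ and $X_{t+1|t}$ are close when $t$ is large (giving some justification for using $\tilde{X}_{t+1|t,\ldots}$). To prove the result we need a result that we will prove in a later chapter.

Lemma 3.3.1 Suppose $\{X_t\}$ is a stationary time series with spectral density $f(\omega)$. Let $\underline{X}_t = (X_1, \ldots, X_t)$ and $\Sigma_t = \mathrm{var}(\underline{X}_t)$. If the spectral density function is bounded away from zero (there is some $\delta > 0$ such that $\inf_\omega f(\omega) \ge \delta > 0$), then for any $t$, $\lambda_{\min}(\Sigma_t) \ge \delta$ (where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest absolute eigenvalues of the matrix $\Sigma_t$). Hence $\lambda_{\max}(\Sigma_t^{-1}) \le \delta^{-1}$. Since for symmetric matrices the spectral norm and the largest eigenvalue are the same, $\|\Sigma_t^{-1}\|_{spec} \le \delta^{-1}$. Furthermore if $\sup_\omega f(\omega) \le M < \infty$, then $\lambda_{\max}(\Sigma_t) \le M$ (hence $\|\Sigma_t\|_{spec} \le M$).

PROOF. Later in the course.

Remark 3.3.1 For an ARMA process where the roots of $\phi(z)$ and $\theta(z)$ have absolute value greater than one, the corresponding spectral density is bounded away from zero. Moreover, the spectral density of such an ARMA process is always bounded from above. In other words if $f$ is the spectral density of an ARMA process where the roots of $\phi(z)$ and $\theta(z)$ have absolute value greater than $1 + \delta_1$ and less than $\delta_2$, then $f(\omega)$ is bounded above and below by positive constants of the form $\mathrm{var}(\varepsilon_t)\, C(\delta_1, \delta_2, p)$, where the constants are ratios of the factors $(1 - \frac{1}{1+\delta_1})^{2p}$ and $(1 - \frac{1}{\delta_2})^{2p}$. This can be proved by using the spectral density of an ARMA process given in (2.22).

Proposition 3.3.1 Suppose $\{X_t\}$ is an ARMA process where the roots of $\phi(z)$ and $\theta(z)$ are greater in absolute value than $1 + \delta$. Let $\tilde{X}_{t+1|t,\ldots}$, $X_{t+1|t}$ and $X_{t+1|t,\ldots}$ be defined as in (3.9), (3.1) and (3.8) respectively. Then
\begin{align}
E(\tilde{X}_{t+1|t,\ldots} - X_{t+1|t})^2 &\le K \rho^t, \tag{3.10} \\
\big| E(X_{t+1} - X_{t+1|t})^2 - \sigma^2 \big| &\le K \rho^t, \tag{3.11} \\
E(\tilde{X}_{t+1|t,\ldots} - X_{t+1|t,\ldots})^2 &\le K \rho^t, \tag{3.12}
\end{align}
where $\frac{1}{1+\delta} < \rho < 1$, $\mathrm{var}(\varepsilon_t) = \sigma^2$ and $K$ denotes a generic finite constant.

PROOF. The proof of (3.10) becomes clear when we use the expansion $X_{t+1} = \sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t+1-j} + \varepsilon_{t+1}$. Evaluating the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$ gives
\begin{align*}
X_{t+1|t} &= \sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t+1-j|t,\ldots,1} + \mathrm{BestLin}(\varepsilon_{t+1}|X_t, \ldots, X_1) \\
&= \underbrace{\sum_{j=1}^{t} b_j(\phi, \theta) X_{t+1-j}}_{\tilde{X}_{t+1|t,\ldots}} + \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta) X_{-j|t,\ldots,1},
\end{align*}
since $\mathrm{BestLin}(\varepsilon_{t+1}|X_t, \ldots, X_1) = 0$ (to see this consider the Gaussian case where $E(X_{t+1}|X_t, \ldots, X_1) = E(\sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t+1-j} + \varepsilon_{t+1}|X_t, \ldots, X_1) = \sum_{j=1}^{\infty} b_j(\phi, \theta) E(X_{t+1-j}|X_t, \ldots, X_1)$). Therefore the difference between the best linear predictor and $\tilde{X}_{t+1|t,\ldots}$ is
\[ X_{t+1|t} - \tilde{X}_{t+1|t,\ldots} = \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta) X_{-j|t,\ldots,1}. \]
Intuitively it is clear that when $t$ is large the difference $|X_{t+1|t} - \tilde{X}_{t+1|t,\ldots}|$ decays geometrically because the coefficients $b_{t+1+j}(\phi, \theta)$ decay geometrically. We formalise these ideas now. To obtain a bound for this difference we need to obtain bounds for $X_{-j|t,\ldots,1}$ (the best linear predictor of the unobserved past terms $X_{-j}$ given the `future' terms $X_t, \ldots, X_1$). For $j \ge 0$ we have
\[ X_{-j|t,\ldots,1} = \sum_{i=1}^{t} \phi_{i,j,t} X_i, \tag{3.13} \]
where
\[ \underline{\phi}_{j,t} = \Sigma_t^{-1} \underline{r}_{t,j}, \tag{3.14} \]
with $\underline{\phi}_{j,t} = (\phi_{1,j,t}, \ldots, \phi_{t,j,t})$, $\underline{X}_t = (X_1, \ldots, X_t)$, $\Sigma_t = E(\underline{X}_t \underline{X}_t')$ and $\underline{r}_{t,j} = E(\underline{X}_t X_{-j})$. This gives
\[ X_{t+1|t} - \tilde{X}_{t+1|t,\ldots} = \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta)\, \underline{\phi}_{j,t}' \underline{X}_t = \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta)\, \underline{r}_{t,j}' \Sigma_t^{-1} \underline{X}_t. \tag{3.15} \]
Taking expectations, we have
\[ E(X_{t+1|t} - \tilde{X}_{t+1|t,\ldots})^2 = \Big(\sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta)\, \underline{r}_{t,j}\Big)' \Sigma_t^{-1} \Big(\sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta)\, \underline{r}_{t,j}\Big). \]
By using the Cauchy-Schwarz inequality ($|a'Bb| \le \|a\|_2 \|Bb\|_2$), the spectral norm inequality ($\|a\|_2 \|Bb\|_2 \le \|a\|_2 \|B\|_{spec} \|b\|_2$) and Minkowski's inequality ($\|\sum_{j=1}^{n} a_j\|_2 \le \sum_{j=1}^{n} \|a_j\|_2$) we have
\[ E(X_{t+1|t} - \tilde{X}_{t+1|t,\ldots})^2 \le \Big\| \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta)\, \underline{r}_{t,j} \Big\|_2^2 \|\Sigma_t^{-1}\|_{spec} \le \Big( \sum_{j=0}^{\infty} |b_{t+1+j}(\phi, \theta)| \, \|\underline{r}_{t,j}\|_2 \Big)^2 \|\Sigma_t^{-1}\|_{spec}. \tag{3.16} \]
We start by bounding each of the terms on the right hand side of the above. We note that for all $t$, using Remark 3.3.1, $\|\Sigma_t^{-1}\|_{spec} \le K (1 - \frac{1}{1+\delta})^{-2p}$. We now consider $\underline{r}_{t,j} = (E(X_1 X_{-j}), \ldots, E(X_t X_{-j}))$. By using (2.14) we have $|E(X_1 X_{-j})| \le K \rho^{j+1}$ etc. Therefore
\[ \|\underline{r}_{t,j}\|_2 \le K \Big( \sum_{r=1}^{t} \rho^{2(j+r)} \Big)^{1/2} \le \frac{K \rho^j}{(1 - \rho^2)^2}. \]
Substituting the above bounds into (3.16) gives
\[ E(X_{t+1|t} - \tilde{X}_{t+1|t,\ldots})^2 \le K \Big(1 - \frac{1}{1+\delta}\Big)^{-2p} \Big( \sum_{j=0}^{\infty} |b_{t+1+j}(\phi, \theta)| \frac{\rho^j}{(1 - \rho^2)^2} \Big)^2. \]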
Now we note that by using Lemma 2.2.1, $|b_j(\phi, \theta)| \le K \rho^j$, and this gives
\[ E(X_{t+1|t} - \tilde{X}_{t+1|t,\ldots})^2 \le K \Big( \sum_{j=0}^{\infty} \frac{\rho^{t+1+j} \rho^j}{(1 - \rho^2)^2} \Big)^2 \le K \rho^t, \]
where $K$ denotes a generic finite constant, thus proving (3.10).

To prove (3.11) we use $X_{t+1} = \sum_{j=1}^{\infty} b_j(\phi, \theta) X_{t+1-j} + \varepsilon_{t+1}$, $X_{t+1|t} = \sum_{j=1}^{t} b_j(\phi, \theta) X_{t+1-j} + \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta) X_{-j|t,\ldots,1}$ and (3.13) to obtain
\[ X_{t+1} - X_{t+1|t} = \varepsilon_{t+1} + \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta) \big(X_{-j} - \underline{r}_{t,j}' \Sigma_t^{-1} \underline{X}_t\big). \]
Hence, since $\varepsilon_{t+1}$ is orthogonal to $X_s$ for all $s \le t$,
\[ E(X_{t+1} - X_{t+1|t})^2 = \mathrm{var}(\varepsilon_{t+1}) + E\Big( \sum_{j=0}^{\infty} b_{t+1+j}(\phi, \theta) \big(X_{-j} - \underline{r}_{t,j}' \Sigma_t^{-1} \underline{X}_t\big) \Big)^2. \]
Therefore using Minkowski's inequality and $|b_j(\phi, \theta)| \le K \rho^j$ we have
\[ E(X_{t+1} - X_{t+1|t})^2 \le \mathrm{var}(\varepsilon_{t+1}) + \Big( \sum_{j=0}^{\infty} |b_{t+1+j}(\phi, \theta)| \big\{ E(X_{-j} - \underline{r}_{t,j}' \Sigma_t^{-1} \underline{X}_t)^2 \big\}^{1/2} \Big)^2 \le \sigma^2 + K \rho^t, \]
thus proving (3.11). To prove (3.12) we note that
\[ E(X_{t+1|t,\ldots} - \tilde{X}_{t+1|t,\ldots})^2 = E\Big( \sum_{j=t+1}^{\infty} b_j(\phi, \theta) X_{t+1-j} \Big)^2; \]
now by using (2.9), it is straightforward to prove the result.

Remark 3.3.2 We note that the one-step ahead predictor depends on the parameters $\phi$, $\theta$ which are used to do the prediction and on the previous observations. On the other hand, the one-step ahead prediction error $E(X_{t+1} - X_{t+1|t})^2$ depends only on the parameters $\phi$, $\theta$ and $\sigma^2$.

3.4 The Wold Decomposition

The above discussion on prediction and Section 3.1.2 lead very nicely to the Wold decomposition. It states that any stationary process, almost, has an MA($\infty$) representation. We state the theorem below and use some of the notation introduced in Section 3.1.2.

Theorem 3.4.1 Suppose that $\{X_t\}$ is a second order stationary time series with a finite variance (we shall assume that it has mean zero, though this is not necessary). Then $X_t$ can be uniquely expressed as
\[ X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t, \tag{3.17} \]
where $\{Z_t\}$ are uncorrelated random variables, with $\mathrm{var}(Z_t) = E(X_t - X_{t|t-1,\ldots})^2$ ($X_{t|t-1,\ldots}$ is the best linear predictor of $X_t$ given $X_{t-1}, X_{t-2}, \ldots$) and $V_t \in X_{-\infty} = \cap_{n=-\infty}^{\infty} X_n^{-\infty}$.

PROOF. First let us consider the one-step ahead predictor $X_{t|t-1,\ldots}$. Since $\{X_t\}$ is a second order stationary process it is clear that $X_{t|t-1,\ldots} = \sum_{j=1}^{\infty} b_j X_{t-j}$, where the coefficients $\{b_j\}$ do not vary with $t$. For this reason $\{X_{t|t-1,\ldots}\}$ and $\{X_t - X_{t|t-1,\ldots}\}$ are second order stationary random variables. Furthermore, since $X_t - X_{t|t-1,\ldots}$ is uncorrelated with $X_s$ for any $s \le t-1$, the $\{X_t - X_{t|t-1,\ldots}\}$ are also uncorrelated random variables. Let $Z_t = X_t - X_{t|t-1,\ldots}$, hence $Z_t$ is the one-step ahead prediction error. We recall from Section 3.1.2 that $X_t \in \overline{\mathrm{sp}}((X_t - X_{t|t-1,\ldots}), (X_{t-1} - X_{t-1|t-2,\ldots}), \ldots) \oplus \mathrm{sp}(X_{-\infty}) = \overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots) \oplus \mathrm{sp}(X_{-\infty})$. Since the spaces $\overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots)$ and $\mathrm{sp}(X_{-\infty})$ are orthogonal, we shall first project $X_t$ onto $\overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots)$; due to orthogonality the difference between $X_t$ and its projection will be in $\mathrm{sp}(X_{-\infty})$. This will lead to the Wold decomposition.

First we consider the projection of $X_t$ onto the space $\overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots)$, which is
\[ P_{\overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots)} X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \]
where due to orthogonality $\psi_j = \mathrm{cov}(X_t, (X_{t-j} - X_{t-j|t-j-1,\ldots}))/\mathrm{var}(X_{t-j} - X_{t-j|t-j-1,\ldots})$. Since $X_t \in \overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots) \oplus \mathrm{sp}(X_{-\infty})$, the difference $X_t - P_{\overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots)} X_t$ is orthogonal to $\{Z_t\}$ and belongs to $\mathrm{sp}(X_{-\infty})$. Hence we have
\[ X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t, \]
where $V_t = X_t - \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ is uncorrelated with $\{Z_t\}$. Hence we have shown (3.17). To show that the representation is unique we note that $Z_t, Z_{t-1}, \ldots$ are an orthogonal basis of $\overline{\mathrm{sp}}(Z_t, Z_{t-1}, \ldots)$, which pretty much leads to uniqueness.

It is worth noting that variants on the proof can be found in Brockwell and Davis (1998), Section 5.7 and Fuller (1995), page 94.

Remark 3.4.1 Notice that the representation in (3.17) looks like an MA($\infty$) process. There is, however, a significant difference. The random variables $\{Z_t\}$ of an MA($\infty$) process are iid random variables and not just uncorrelated. There are several examples of time series which are uncorrelated but not independent; one such example is the ARCH process which we will consider later in the course.

Chapter 4 Estimation for Linear models

We now consider various methods for estimating the parameters in a stationary time series.
We first consider estimation of the mean and covariance and then look at estimation of the parameters of an AR and ARMA process.

4.1 Estimation of the mean and autocovariance function

Let us suppose the stationary time series $Y_t$ satisfies
\[ Y_t = \mu + X_t, \]
where $\mu$ is the finite mean and $\{X_t\}$ is a zero mean stationary time series with absolutely summable covariances ($\sum_k |\mathrm{cov}(X_0, X_k)| < \infty$). Below we consider methods to estimate the mean and autocovariance function.

4.1.1 Estimating the mean

Suppose we observe $\{Y_t\}_{t=1}^{n}$, and we want to estimate the mean $\mu$. In an ideal world we would observe independent replications of $Y_t$. We would then use the average, that is $\bar{Y}_n = n^{-1} \sum_{t=1}^{n} Y_t$, as an estimator of $\mu$. If the variance of $Y_t$ is finite, then $\bar{Y}_n$ is a `good' estimator of the mean whose variance converges at the rate $O(n^{-1})$ (that is $\mathrm{var}(\bar{Y}_n) = n^{-1} \mathrm{var}(Y_1)$). However in the case that $\{Y_t\}$ are not independent and we observe a time series, we can still use $\bar{Y}_n$ as an estimator of $\mu$. The only drawback is that the dependency means that one observation will influence the next and the resulting estimator will not be so reliable. But it is easy to show that
\[ \mathrm{var}(\bar{Y}_n) \le \frac{2}{n} \sum_k |\mathrm{cov}(X_0, X_k)|. \]
Hence if $\sum_k |\mathrm{cov}(X_0, X_k)| < \infty$, then $E(\bar{Y}_n - \mu)^2 \le K/n$, where $K$ is a finite constant. This means that, despite the estimator not being as good as an estimator constructed from independent observations, $\bar{Y}_n$ is still $\sqrt{n}$-consistent.

4.1.2 Estimating the covariance

Suppose we observe $\{Y_t\}_{t=1}^{n}$. To estimate the covariance $c(k) = E(X_0 X_k)$ from the observations, a plausible estimator is
\[ \hat{c}_n(k) = \frac{1}{n} \sum_{t=1}^{n-|k|} (Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n), \tag{4.1} \]
since $E((Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n)) \approx c(k)$. Of course if the mean of $Y_t$ were zero ($Y_t = X_t$), then the covariance estimator is
\[ \hat{c}_n(k) = \frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|}. \tag{4.2} \]
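As a small illustration, the estimator (4.1) can be computed directly with NumPy. The function name `sample_autocovariance` is a hypothetical choice for this sketch, and it covers lags $0 \le k \le$ `max_lag` only (negative lags follow by symmetry).

```python
import numpy as np

def sample_autocovariance(y, max_lag):
    """Hypothetical helper implementing (4.1): mean-corrected products
    summed over t = 1..n-k and divided by n (not by n - k)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()                  # Y_t - bar{Y}_n
    return np.array([np.dot(d[: n - k], d[k:]) / n
                     for k in range(max_lag + 1)])

# Usage on a tiny series so the arithmetic can be followed by hand
print(sample_autocovariance([1.0, 2.0, 3.0, 4.0], 3))
```

Note that the divisor is $n$ at every lag, so the estimates at large lags are shrunk towards zero.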
The eagle-eyed amongst you may wonder why we do not use $\frac{1}{n-|k|} \sum_{t=1}^{n-|k|} X_t X_{t+|k|}$ in place of the estimator above, given that $\hat{c}_n(k)$ is more biased than $\frac{1}{n-|k|} \sum_{t=1}^{n-|k|} X_t X_{t+|k|}$. However $\hat{c}_n(k)$ has some very nice properties which are discussed in the remark below.

Remark 4.1.1 Suppose we define the empirical covariances
\[ \hat{c}_n(k) = \begin{cases} \frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|} & |k| \le n-1 \\ 0 & \text{otherwise,} \end{cases} \]
then $\{\hat{c}_n(k)\}$ is a positive definite sequence. Therefore, using Lemma 1.1.1 there exists a stationary time series $\{Z_t\}$ which has the covariance $\hat{c}_n(k)$.

There are various ways to show that $\{\hat{c}_n(k)\}$ is a positive definite sequence. One method uses the fact that the corresponding spectral density is positive. We recall that the spectral density was defined in Definition 2.4.1, but we have yet to discuss its properties. One of the properties is that it is positive. In other words if $\{c(k)\}$ is a positive definite sequence its Fourier transform is positive, and if $f$ is positive, then its Fourier coefficients are positive definite (we will look in detail at these properties in a later chapter). Using this property we will show that $\{\hat{c}_n(k)\}$ is positive definite. Here I briefly describe the proof. The spectral density is the Fourier transform of a positive definite sequence. The Fourier transform of $\{\hat{c}_n(k)\}$ is
\[ \sum_{k=-(n-1)}^{n-1} \exp(ik\omega) \hat{c}_n(k) = \sum_{k=-(n-1)}^{n-1} \exp(ik\omega) \frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|} = \frac{1}{n} \Big| \sum_{t=1}^{n} X_t \exp(it\omega) \Big|^2 \ge 0. \]
Since it is non-negative, this means that $\{\hat{c}_n(k)\}$ is a positive definite sequence.

4.1.3 Some asymptotic results on the covariance estimator

The following theorem gives the asymptotic sampling properties of the covariance estimator (4.1). The proof of the result can be found in Brockwell and Davis (1998), Chapter 8, and Fuller (1995), but it goes pretty much back to Bartlett (1981) (indeed it is called Bartlett's formula).

Theorem 4.1.1 Suppose $\{X_t\}$ is a stationary time series where
\[ X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \]
where $\sum_j |\psi_j| < \infty$ and $\{Z_t\}$ are iid random variables with $E(Z_t^4) < \infty$. Suppose we observe $\{X_t : t = 1, \ldots, n\}$ and use (4.1) as an estimator of the covariance $c(k) = \mathrm{cov}(X_0, X_k)$. Then for each $h \in \{1, \ldots, n\}$
\[ \sqrt{n}\big(\hat{\underline{c}}_n(h) - \underline{c}(h)\big) \overset{D}{\to} N(0, W_h), \tag{4.3} \]
where $\hat{\underline{c}}_n(h) = (\hat{c}_n(1), \ldots, \hat{c}_n(h))$, $\underline{c}(h) = (c(1), \ldots, c(h))$ and
\[ (W_h)_{ij} = \sum_{k=-\infty}^{\infty} \{c(k+i) + c(k-i) - 2c(i)c(k)\}\{c(k+j) + c(k-j) - 2c(j)c(k)\}. \]

Example 4.1.1 This example is quite an important application of the above theorem. It is used to check by `eye' whether a time series is uncorrelated (there are more sensitive tests, but this one is often used to construct confidence intervals (CI) for the sample autocovariances in several statistical packages). Suppose $\{X_t\}$ are iid random variables, and we use (4.1) as an estimator of the autocovariances. Recalling that if $\{X_t\}$ are iid then $c(k) = 0$ for $k \neq 0$, using this and (4.3) we see that the asymptotic distribution of $\hat{\underline{c}}_n(h)$ in this case is
\[ \sqrt{n}\big(\hat{\underline{c}}_n(h) - \underline{c}(h)\big) \overset{D}{\to} N(0, W_h), \qquad \text{where} \quad (W_h)_{ij} = \begin{cases} \mathrm{var}(X_t) & i = j \\ 0 & i \neq j. \end{cases} \]
In other words $\sqrt{n}(\hat{\underline{c}}_n(h) - \underline{c}(h)) \overset{D}{\to} N(0, \mathrm{var}(X_t) I)$. Hence the sample autocovariances at different lags are asymptotically uncorrelated. This allows us to easily construct confidence intervals for the autocovariances under the assumption that the observations are independent. If the vast majority of the sample autocovariances lie inside the confidence interval, there is not enough evidence to suggest that the data is not a realisation of iid random variables (often called a white noise process). An example of the empirical ACF and the CI constructed under the assumption of independence is given in Figure 4.1. We see that the empirical autocorrelations of the realisation from iid random variables all lie within the CI. The same cannot be said for the empirical correlations of a dependent time series.

Remark 4.1.2 (Long range dependence versus changes in the mean) We first note that a process is said to have long range dependence if the covariances are not absolutely summable, that is $\sum_k |c(k)| = \infty$.
From a practical point of view data is said to exhibit long range dependence if the autocovariances do not decay very fast to zero as the lag increases. We now demonstrate that one must be careful in the diagnosis of long range dependence, because a slow decay of the autocovariance could also imply a change in mean if this has not been corrected for. This was shown in Bhattacharya et al. (1983), and applied to econometric data in Mikosch and Stărică (2000) and Mikosch and Stărică (2003). A test for distinguishing between long range dependence and change points is proposed in Berkes et al. (2006). Suppose that $Y_t$ satisfies
\[ Y_t = \mu_t + \varepsilon_t, \]
where $\{\varepsilon_t\}$ are iid random variables and the mean $\mu_t$ depends on $t$. We observe $\{Y_t\}$ but do not know the mean is changing. We want to evaluate the autocovariance function, hence estimate the autocovariance at lag $k$ using
\[ \hat{c}_n(k) = \frac{1}{n} \sum_{t=1}^{n-|k|} (Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n). \]

[Figure 4.1: The top plot is the empirical ACF taken from iid data and the lower plot is the empirical ACF of a realisation from the AR(2) model defined in (2.19). Both panels plot the ACF against lags 0 to 20.]

Observe that $\bar{Y}_n$ is not really estimating the mean but the average mean! If we plotted the empirical ACF $\{\hat{c}_n(k)\}$ we would see that the covariances do not decay with the lag. However the true ACF would be zero at all lags but zero. The reason the empirical ACF does not decay to zero is that we have not corrected for the correct mean. Indeed it can be shown that for large lags $\hat{c}_n(k) \approx \sum_{s<t} (\mu_s - \mu_t)^2$. Hence because we are not correcting for the mean in the autocovariance, it remains.

4.2 Estimation for AR models

Let us suppose that $\{X_t\}$ is a zero mean stationary time series which satisfies the AR($p$) representation
\[ X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t, \]
where $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2$ and the roots of the characteristic polynomial $1 - \sum_{j=1}^{p} \phi_j z^j$ lie outside the unit circle.
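The root condition on the characteristic polynomial can be checked numerically. The following is a minimal sketch assuming NumPy; the helper names `ar_roots` and `is_causal` are hypothetical and introduced only for illustration.

```python
import numpy as np

def ar_roots(phi):
    """Roots of the characteristic polynomial 1 - phi_1 z - ... - phi_p z^p."""
    # np.roots expects coefficients ordered from the highest power down,
    # so build (1, -phi_1, ..., -phi_p) and reverse it.
    coeffs = np.concatenate(([1.0], -np.asarray(phi, dtype=float)))[::-1]
    return np.roots(coeffs)

def is_causal(phi):
    """True if every root lies strictly outside the unit circle."""
    return bool(np.all(np.abs(ar_roots(phi)) > 1.0))

# Usage: an AR(2) with complex roots of modulus about 1.15, and an AR(2)
# whose characteristic polynomial has a real root inside the unit circle.
print(is_causal([1.5, -0.75]), np.abs(ar_roots([1.5, -0.75])))
print(is_causal([0.5, 0.6]))
```

Roots very close to the unit circle are sensitive to rounding, so in practice a small tolerance around one may be preferable to the strict inequality used here.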
Our aim in this section is to construct estimators of the AR parameters $\{\phi_j\}$. We will show that in the case that $\{X_t\}$ has an AR($p$) representation the estimation is relatively straightforward, and the estimation methods all have properties which are asymptotically equivalent to the Gaussian maximum likelihood estimator.

The following estimation schemes stem from the following observation. Suppose the AR($p$) time series $\{X_t\}$ is causal (that is, the roots of the characteristic polynomial lie outside the unit circle, hence it satisfies an MA($\infty$) representation). Then we can multiply $X_t$ by $X_{t-i}$ for $1 \le i \le p$; since the process is causal, $\varepsilon_t$ and $X_{t-i}$ are uncorrelated. Therefore taking expectations we have for all $i > 0$
\[ E(X_t X_{t-i}) = \sum_{j=1}^{p} \phi_j E(X_{t-j} X_{t-i}), \qquad \text{that is} \qquad c(i) = \sum_{j=1}^{p} \phi_j c(i-j). \tag{4.4} \]
Recall these are the Yule-Walker equations we considered in Section 2.3.3. Putting the cases $1 \le i \le p$ together we can write the above as
\[ \underline{r}_p = \Sigma_p \underline{\phi}_p, \tag{4.5} \]
where $(\Sigma_p)_{i,j} = c(i-j)$, $(\underline{r}_p)_i = c(i)$ and $\underline{\phi}_p = (\phi_1, \ldots, \phi_p)$.

4.2.1 The Yule-Walker estimator

The Yule-Walker equations inspire the method of moments estimator often called the Yule-Walker estimator. We use (4.5) as the basis of the estimator. It is clear that $\hat{\Sigma}_p$ and $\hat{\underline{r}}_p$ are estimators of $\Sigma_p$ and $\underline{r}_p$, where $(\hat{\Sigma}_p)_{i,j} = \hat{c}_n(i-j)$ and $(\hat{\underline{r}}_p)_i = \hat{c}_n(i)$. Therefore we can use
\[ \hat{\underline{\phi}}_p = \hat{\Sigma}_p^{-1} \hat{\underline{r}}_p \tag{4.6} \]
as an estimator of the AR parameters $\underline{\phi}_p = (\phi_1, \ldots, \phi_p)$. We observe that if $p$ is large this involves inverting a large matrix. However, we can use the Durbin-Levinson algorithm to estimate $\underline{\phi}_p$ by fitting lower order AR processes to the observations and increasing the order. This way an explicit inversion can be avoided. We detail how the Durbin-Levinson algorithm can be used to estimate the AR parameters below.

Using Remark 4.1.1 there exists a process $\{Z_t\}$ which has the autocovariance function $\{\hat{c}_n(k)\}$. This means the best linear predictor of $Z_{m+1}$ given $Z_m, \ldots, Z_1$ is
\[ Z_{m+1|m} = \sum_{j=1}^{m} \hat{\phi}_{m,j} Z_{m+1-j}, \]
where $\hat{\underline{\phi}}_m = (\hat{\phi}_{m,1}, \ldots, \hat{\phi}_{m,m}) = \hat{\Sigma}_m^{-1} \hat{\underline{r}}_m$ with $(\hat{\Sigma}_m)_{i,j} = \hat{c}_n(i-j)$ and $(\hat{\underline{r}}_m)_i = \hat{c}_n(i)$. Hence $\hat{\underline{\phi}}_m$ are in fact the estimators of the AR($m$) parameters. Now recall from Section 3.2 that for $m \ge 2$, $\hat{\underline{\phi}}_m$ can be obtained from $\hat{\underline{\phi}}_{m-1}$ and the empirical covariances $\hat{c}_n(k)$. Hence we can use the Durbin-Levinson algorithm to estimate the parameters $\hat{\underline{\phi}}_p$, by first fitting an AR(1) model to the time series, then iterating the Durbin-Levinson algorithm to fit higher order AR models, until we finally fit the AR($p$) model to the time series.

In the previous sections we estimated the covariance $E(X_0 X_k)$ using $\frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|}$, which led to the Yule-Walker estimators. In the following section we estimate the covariance in a slightly different way. This will lead to the least squares estimator (or maximum likelihood estimator). The two estimators are different but are asymptotically equivalent.

4.2.2 The Gaussian maximum likelihood (least squares estimator)

Our object here is to obtain the maximum likelihood estimator of the AR($p$) parameters. It turns out that this is the same as the least squares estimator. We recall that the maximum likelihood estimator is the parameter which maximises the joint density of the observations. Since the log-likelihood often has a simpler form, we often maximise the log density rather than the density (both yield the same estimator). We note that the Gaussian MLE is constructed as if the observations $\{X_t\}$ were Gaussian, though it is not necessary that $\{X_t\}$ is Gaussian when doing the estimation. They can have a different distribution; the only difference is that the estimator may be less efficient (it will not attain the Cramér-Rao lower bound).

Suppose we observe $\{X_t; t = 1, \ldots, n\}$ where $X_t$ are observations from an AR($p$) process. To construct the MLE, we use that the joint distribution of $\{X_t\}$ is the product of the conditional distributions.
Hence we need an expression for the conditional distribution (in terms of the densities). Let $F_\varepsilon$ and $f_\varepsilon$ be the distribution function and the density function of $\varepsilon_t$ respectively. We first note that the AR($p$) process is $p$-Markovian, that is
\[ P(X_t \le x | X_{t-1}, X_{t-2}, \ldots) = P(X_t \le x | X_{t-1}, \ldots, X_{t-p}), \qquad f_a(X_t | X_{t-1}, \ldots) = f_a(X_t | X_{t-1}, \ldots, X_{t-p}), \tag{4.7} \]
where $f_a$ is the conditional density of $X_t$ given the past, and the distribution function is derived as if $a$ were the true AR($p$) parameter.

Remark 4.2.1 To understand why (4.7) is true consider the simple case that $p = 1$ (AR(1)). Studying the conditional probability gives
\begin{align*}
P(X_t \le x_t | X_{t-1} = x_{t-1}, \ldots) &= P(a X_{t-1} + \varepsilon_t \le x_t | X_{t-1} = x_{t-1}, \ldots) \\
&= P(\varepsilon_t \le x_t - a x_{t-1}) = P(X_t \le x_t | X_{t-1} = x_{t-1}).
\end{align*}

By using (4.7) we have
\[ P(X_t \le x | X_{t-1}, \ldots) = P\Big(\varepsilon_t \le x - \sum_{j=1}^{p} a_j X_{t-j}\Big) = F_\varepsilon\Big(x - \sum_{j=1}^{p} a_j X_{t-j}\Big), \]
hence
\[ f_a(X_t | X_{t-1}, \ldots) = f_\varepsilon\Big(X_t - \sum_{j=1}^{p} a_j X_{t-j}\Big). \tag{4.8} \]
Therefore the joint density of $\{X_t\}_{t=1}^{n}$ is
\begin{align*}
f_a(X_1, X_2, \ldots, X_n) &= f_a(X_1, \ldots, X_p) \prod_{t=p+1}^{n} f_a(X_t | X_{t-1}, \ldots, X_1) && \text{(by Bayes theorem)} \\
&= f_a(X_1, \ldots, X_p) \prod_{t=p+1}^{n} f_a(X_t | X_{t-1}, \ldots, X_{t-p}) && \text{(by the Markov property)} \\
&= f_a(X_1, \ldots, X_p) \prod_{t=p+1}^{n} f_\varepsilon\Big(X_t - \sum_{j=1}^{p} a_j X_{t-j}\Big) && \text{(by (4.8))}.
\end{align*}
Therefore the log likelihood is
\[ \log f_a(X_1, X_2, \ldots, X_n) = \underbrace{\log f_a(X_1, \ldots, X_p)}_{\text{often ignored}} + \underbrace{\sum_{t=p+1}^{n} \log f_\varepsilon\Big(X_t - \sum_{j=1}^{p} a_j X_{t-j}\Big)}_{\text{conditional likelihood}}. \]
Usually we ignore the initial distribution $\log f_a(X_1, \ldots, X_p)$ and maximise the conditional likelihood to obtain the estimator. In the case that the sample size is large, $n \gg p$, the contribution of $\log f_a(X_1, \ldots, X_p)$ is minimal and the conditional likelihood and likelihood are asymptotically equivalent.

We note that in the case that $f_\varepsilon$ is Gaussian, the conditional log-likelihood is, up to multiplicative and additive constants, $-n L_n(a)$, where
\[ L_n(a) = \log \sigma^2 + \frac{1}{n \sigma^2} \sum_{t=p+1}^{n} \Big(X_t - \sum_{j=1}^{p} a_j X_{t-j}\Big)^2. \]
Therefore the estimator of the AR($p$) parameters is $\tilde{\underline{\phi}}_p = \arg\min_a L_n(a)$.
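For fixed $\sigma^2$, minimising $L_n(a)$ over $a$ is an ordinary least squares regression of $X_t$ on $(X_{t-1}, \ldots, X_{t-p})$, and the result can be compared numerically with the Yule-Walker estimator (4.6). The sketch below is illustrative only: the function names `simulate_ar`, `yule_walker` and `least_squares_ar` are hypothetical, the data are assumed to have mean zero, and the simulated AR(2) coefficients are an arbitrary causal choice.

```python
import numpy as np

def simulate_ar(phi, n, seed=0, burn=500):
    """Simulate a zero mean AR(p) with standard normal innovations."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    eps = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(p, n + burn):
        # x[t-p:t][::-1] is (X_{t-1}, ..., X_{t-p})
        x[t] = np.dot(phi, x[t - p:t][::-1]) + eps[t]
    return x[burn:]                   # drop the burn-in period

def yule_walker(x, p):
    """Solve the empirical Yule-Walker equations (4.6)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    c = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(p + 1)])
    Sigma = np.array([[c[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Sigma, c[1:])

def least_squares_ar(x, p):
    """Regress X_t on (X_{t-1}, ..., X_{t-p}) for t = p+1, ..., n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # column j holds the lag-(j+1) values aligned with the targets x[p:]
    design = np.column_stack([x[p - 1 - j : n - 1 - j] for j in range(p)])
    coef, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coef

x = simulate_ar([1.5, -0.75], n=5000)
print(yule_walker(x, 2), least_squares_ar(x, 2))
```

With a long series the two estimates are typically very close, which is consistent with the two estimators being asymptotically equivalent.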
It is clear that $\tilde{\underline{\phi}}_p$ is the least squares estimator and can be explicitly obtained using
\[ \tilde{\underline{\phi}}_p = \tilde{\Sigma}_p^{-1} \tilde{\underline{r}}_p, \]
where $(\tilde{\Sigma}_p)_{i,j} = \frac{1}{n-p} \sum_{t=p+1}^{n} X_{t-i} X_{t-j}$ and $(\tilde{\underline{r}}_p)_i = \frac{1}{n-p} \sum_{t=p+1}^{n} X_t X_{t-i}$.

Remark 4.2.2 (A comparison of the Yule-Walker and least squares estimators) If we compare the least squares (Gaussian conditional likelihood) estimator $\tilde{\underline{\phi}}_p$ with the Yule-Walker estimator $\hat{\underline{\phi}}_p$, then we see that they are very similar. The difference lies in the way the covariances are estimated. We see that for the Yule-Walker estimator $\frac{1}{n} \sum_{t=1}^{n-i} X_t X_{t+i}$ is used exclusively to estimate the covariance $c(i)$. Whereas for the least squares estimator the sums $\frac{1}{n-p} \sum_{t=p+1}^{n} X_{t-i} X_{t-j}$ with $|i - j| = k$ are all used as estimators of $c(k)$. There is very little difference between these two covariance estimates; indeed the Yule-Walker estimates and the least squares estimates have asymptotically the same properties. There are however subtle differences in the actual estimators. Because the Yule-Walker estimator is constructed from a positive definite sequence, using the parameter estimates $\hat{\underline{\phi}}_p$ one can construct a stationary AR($p$) process. The same is not necessarily true of the least squares estimator, which does not necessarily yield a stationary AR($p$) process. Moreover, because $\hat{\underline{\phi}}_p$ can be used to construct a stationary AR($p$) process, it can be shown that $\|\hat{\underline{\phi}}_p\|_2 \le 2^p$; the same does not necessarily hold for the least squares estimate $\tilde{\underline{\phi}}_p$.

4.3 Estimation for ARMA models

Let us suppose that $\{X_t\}$ satisfies the ARMA representation
\[ X_t - \sum_{i=1}^{p} \phi_i^{(0)} X_{t-i} = \varepsilon_t + \sum_{j=1}^{q} \theta_j^{(0)} \varepsilon_{t-j}, \]
and $\underline{\theta}_0 = (\theta_1^{(0)}, \ldots, \theta_q^{(0)})$, $\underline{\phi}_0 = (\phi_1^{(0)}, \ldots, \phi_p^{(0)})$ and $\sigma_0^2 = \mathrm{var}(\varepsilon_t)$. We will suppose for now that $p$ and $q$ are known. In the following sections we consider different methods for estimating $\underline{\phi}_0$ and $\underline{\theta}_0$.

4.3.1 The Hannan and Rissanen AR($\infty$) expansion method

We first describe an easy method to estimate the parameters of an ARMA process.
These estimates may not necessarily be `efficient' (we define this term later) but they have an explicit form and can be easily obtained. Therefore they are a good starting point, and can be used as the initial value when using the Gaussian maximum likelihood to estimate the parameters (as described below). The method was first proposed in Hannan and Rissanen (1982) and An et al. (1982) and we describe it below. It is worth bearing in mind that currently the `large p small n problem' is a hot topic. These are generally regression problems where the sample size $n$ is quite small but the number of regressors $p$ is quite large (usually model selection is of importance in this context). The method proposed by Hannan involves expanding the ARMA process (assuming invertibility) as an AR($\infty$) process and estimating the parameters of the AR($\infty$) process. In some sense this can be considered as a regression problem with an infinite number of regressors. Hence there are some parallels between the estimation described below and the `large p small n problem'.

As we mentioned in Lemma 2.2.2, if an ARMA process is invertible it can be written as
\[ X_t = \sum_{j=1}^{\infty} b_j X_{t-j} + \varepsilon_t. \tag{4.9} \]
The idea behind Hannan's method is to estimate the parameters $\{b_j\}$, then estimate the innovations $\varepsilon_t$, and use the estimated innovations to construct a multiple linear regression estimator of the ARMA parameters $\{\theta_i\}$ and $\{\phi_j\}$. Of course in practice we cannot estimate all parameters $\{b_j\}$ as there are an infinite number of them. So instead we do a type of sieve estimation where we only estimate a finite number and let the number of parameters to be estimated grow as the sample size increases. We describe the estimation steps below:

(i) Suppose we observe $\{X_t\}_{t=1}^{n}$. Recalling (4.9), we will estimate the parameters $\{b_j\}_{j=1}^{p_n}$. We will suppose that $p_n \to \infty$ as $n \to \infty$ and $p_n \ll n$ (we will state the rate below).
We use least squares to estimate $\{b_j\}_{j=1}^{p_n}$ and define
\[ \hat b_n = \hat R_n^{-1}\hat r_n, \]
where
\[ \hat R_n = \sum_{t=p_n+1}^{n} \underline{X}_{t-1}\underline{X}_{t-1}', \qquad \hat r_n = \sum_{t=p_n+1}^{n} X_t\underline{X}_{t-1}, \]
and $\underline{X}_{t-1}' = (X_{t-1},\ldots,X_{t-p_n})$.

(ii) Having estimated the first $p_n$ coefficients $\{b_j\}_{j=1}^{p_n}$, we estimate the residuals with
\[ \tilde\varepsilon_t = X_t - \sum_{j=1}^{p_n}\hat b_{j,n}X_{t-j}. \]

(iii) Now use as estimates of $\phi_0$ and $\theta_0$ the pair $\tilde\phi_n, \tilde\theta_n$, where
\[ (\tilde\phi_n,\tilde\theta_n) = \arg\min_{\phi,\theta}\sum_{t=p_n+1}^{n}\Big(X_t - \sum_{j=1}^{p}\phi_jX_{t-j} - \sum_{i=1}^{q}\theta_i\tilde\varepsilon_{t-i}\Big)^2. \]
We note that the above can easily be minimised. In fact
\[ (\tilde\phi_n,\tilde\theta_n) = \tilde R_n^{-1}\tilde s_n, \]
where
\[ \tilde R_n = \frac{1}{n}\sum_{t=\max(p,q)}^{n}\tilde Y_t\tilde Y_t', \qquad \tilde s_n = \frac{1}{n}\sum_{t=\max(p,q)}^{n}\tilde Y_tX_t, \]
and $\tilde Y_t' = (X_{t-1},\ldots,X_{t-p},\tilde\varepsilon_{t-1},\ldots,\tilde\varepsilon_{t-q})$.

4.3.2 The Gaussian maximum likelihood estimator

We now consider the Gaussian maximum likelihood estimator (GMLE) to estimate the parameters $\phi_0$ and $\theta_0$. Let $\underline{X}_n' = (X_1,\ldots,X_n)$. We note that despite calling the estimate the GMLE, it does not assume that the time series $\{X_t\}$ is Gaussian. The criterion (the GMLE) is constructed as if $\{X_t\}$ were Gaussian, but this need not be the case. The negative Gaussian likelihood of $\{X_t\}_{t=1}^{n}$, assuming that it is a realisation from an ARMA process, is
\[ \frac{1}{n}L_n(\phi,\theta,\sigma) = \frac{1}{n}\log|\Sigma(\phi,\theta,\sigma)| + \frac{1}{n}\underline{X}_n'\Sigma(\phi,\theta,\sigma)^{-1}\underline{X}_n, \tag{4.10} \]
where $\Sigma(\phi,\theta,\sigma)$ is the variance covariance matrix of $\underline{X}_n$ constructed as if $\underline{X}_n$ came from an ARMA process with parameters $\phi$ and $\theta$. Directly evaluating the above for each $(\phi,\theta)$, never mind minimising over all $(\phi,\theta)$, can be a daunting task; it is computationally very demanding even for moderately large sample sizes. However, there exists a simple solution, which uses the one-step predictions considered in the prediction section. Let
\[ X_{t+1|t}^{(\phi,\theta)} = \mathrm{BestLin}^{(\phi,\theta)}(X_{t+1}\,|\,X_t,\ldots,X_1) \tag{4.11} \]
be the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$, where the ARMA parameters $\phi$ and $\theta$ are used to calculate the covariances in the prediction. Let $r_{t+1}(\phi,\theta,\sigma)$ be the one-step ahead mean squared error $E(X_{t+1} - X_{t+1|t}^{(\phi,\theta)})^2$.
By using the Cholesky decomposition it can be shown that
\[ \frac{1}{n}L_n(\phi,\theta,\sigma) = \frac{1}{n}\sum_{t=1}^{n-1}\log r_{t+1}(\phi,\theta,\sigma) + \frac{1}{n}\sum_{t=1}^{n-1}\frac{(X_{t+1} - X_{t+1|t}^{(\phi,\theta)})^2}{r_{t+1}(\phi,\theta,\sigma)}. \]
We see that we have avoided inverting the matrix $\Sigma(\phi,\theta,\sigma)$. The GMLE is the parameter $(\hat\phi_n,\hat\theta_n)$ which minimises $L_n(\phi,\theta,\sigma)$. We note that the one-step ahead predictor $X_{t+1|t}^{(\phi,\theta)}$ can be obtained using the Durbin-Levinson algorithm.

It is possible to obtain an approximation of $L_n(\phi,\theta,\sigma)$ which is simple to evaluate. However this approximation only really makes sense when the sample size $n$ is large. It is, however, useful when obtaining the asymptotic sampling properties of the GMLE.

To motivate the approximation consider the one-step ahead prediction error considered in Section 3.3. We have shown in Proposition 3.3.1 that for large $t$, $\tilde X_{t+1|t,\ldots} \approx X_{t+1|t}$ and $\sigma^2 \approx E(X_{t+1} - X_{t+1|t})^2$. Now define
\[ \tilde X_{t+1|t,\ldots}^{(\phi,\theta)} = \sum_{j=1}^{t} b_j(\phi,\theta)X_{t+1-j}. \tag{4.12} \]
We now replace in $L_n(\phi,\theta,\sigma)$ the predictor $X_{t+1|t}^{(\phi,\theta)}$ with $\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}$ and $r_{t+1}(\phi,\theta,\sigma)$ with $\sigma^2$ to obtain
\[ \frac{1}{n}\tilde L_n(\phi,\theta,\sigma) = \log\sigma^2 + \frac{1}{n\sigma^2}\sum_{t=1}^{n-1}\big(X_{t+1} - \tilde X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2. \]
We show in Section 6 that $\frac{1}{n}L_n(\phi,\theta,\sigma)$ and $\frac{1}{n}\tilde L_n(\phi,\theta,\sigma)$ are asymptotically equivalent.

Chapter 5
Almost sure convergence, convergence in probability and asymptotic normality

In the previous chapter we considered estimators of several different parameters. The hope is that as the sample size increases the estimator should get `closer' to the parameter of interest. When we say closer we mean to converge. In the classical sense the sequence $\{x_k\}$ converges to $x$ ($x_k \to x$), if $|x_k - x| \to 0$ as $k \to \infty$ (or for every $\varepsilon > 0$, there exists an $n$ where for all $k > n$, $|x_k - x| < \varepsilon$). Of course the estimators we have considered are random, that is for every $\omega \in \Omega$ (the set of all outcomes) we have a different estimate. The natural question to ask is what convergence means for random sequences.

5.1 Modes of convergence

We start by defining different modes of convergence.
Definition 5.1.1 (Convergence)

Almost sure convergence. We say that the sequence $\{X_t\}$ converges almost surely to $\mu$, if there exists a set $M \subset \Omega$, such that $P(M) = 1$ and for every $\omega \in M$ we have $X_t(\omega) \to \mu$. In other words for every $\varepsilon > 0$, there exists an $N(\omega)$ such that
\[ |X_t(\omega) - \mu| < \varepsilon, \tag{5.1} \]
for all $t > N(\omega)$. Note that the above definition is very close to classical convergence. We denote $X_t \to \mu$ almost surely as $X_t \overset{a.s.}{\to} \mu$.

An equivalent definition, in terms of probabilities, is that $X_t \overset{a.s.}{\to} \mu$ if for every $\varepsilon > 0$
\[ P\Big(\omega;\ \bigcap_{m=1}^{\infty}\bigcup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\}\Big) = 0. \]
It is worth considering briefly what $\bigcap_{m=1}^{\infty}\bigcup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\}$ means. If $\bigcap_{m=1}^{\infty}\bigcup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\} \neq \emptyset$, then there exists an $\omega^* \in \bigcap_{m=1}^{\infty}\bigcup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\}$ such that for some infinite sequence $\{k_j\}$ we have $|X_{k_j}(\omega^*) - \mu| > \varepsilon$; this means $X_t(\omega^*)$ does not converge to $\mu$. Now let $\bigcap_{m=1}^{\infty}\bigcup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\} = A$. If $P(A) = 0$, then for `most' $\omega$ the sequence $\{X_t(\omega)\}$ converges.

Convergence in mean square. We say $X_t \to \mu$ in mean square (or $L_2$ convergence), if $E(X_t - \mu)^2 \to 0$ as $t \to \infty$.

Convergence in probability. Convergence in probability cannot be stated in terms of realisations $X_t(\omega)$ but only in terms of probabilities. $X_t$ is said to converge to $\mu$ in probability (written $X_t \overset{P}{\to} \mu$) if for every $\varepsilon > 0$
\[ P(|X_t - \mu| > \varepsilon) \to 0, \qquad t \to \infty. \]
Often we write this as $|X_t - \mu| = o_p(1)$. If for any $\gamma \ge 1$ we have
\[ E|X_t - \mu|^{\gamma} \to 0, \qquad t \to \infty, \]
then it implies convergence in probability (to see this, use Markov's inequality).

Rates of convergence:

(i) We say the stochastic process $\{X_t\}$ is $|X_t - \mu| = O_p(a_t)$, if the sequence $\{a_t^{-1}|X_t - \mu|\}$ is bounded in probability (this is defined below). We see from the definition of boundedness that, for all $t$, the distribution of $a_t^{-1}|X_t - \mu|$ should mainly lie within a certain interval. In general $a_t \to 0$ as $t \to \infty$ (for example $a_t = t^{-1/2}$).

(ii) We say the stochastic process $\{X_t\}$ is $|X_t - \mu| = o_p(a_t)$, if the sequence $\{a_t^{-1}|X_t - \mu|\}$ converges in probability to zero.
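The definitions of convergence in probability and of the $O_p$ rate above can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the data are iid N(0, 1) (so $\mu = 0$), the estimator is the sample mean, and the function names (`sample_means`, `tail_probability`, `scaled_quantile`) are hypothetical helpers, not part of these notes. The tail probability $P(|\bar X_n - \mu| > \varepsilon)$ should shrink with $n$, while a high quantile of $\sqrt{n}\,|\bar X_n - \mu|$ should stay roughly constant, consistent with $|\bar X_n - \mu| = O_p(n^{-1/2})$.

```python
import math
import random


def sample_means(n, n_rep, seed=0):
    """Return n_rep independent sample means of n iid N(0, 1) variables (mu = 0)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(n_rep)]


def tail_probability(n, eps, n_rep=500, seed=0):
    """Monte Carlo estimate of P(|mean_n - mu| > eps)."""
    means = sample_means(n, n_rep, seed)
    return sum(abs(m) > eps for m in means) / n_rep


def scaled_quantile(n, q=0.95, n_rep=500, seed=0):
    """Empirical q-quantile of sqrt(n) * |mean_n - mu|.

    If the error is O_p(n^{-1/2}) this quantity should not grow with n.
    """
    scaled = sorted(math.sqrt(n) * abs(m) for m in sample_means(n, n_rep, seed))
    return scaled[int(q * (n_rep - 1))]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, tail_probability(n, eps=0.2), round(scaled_quantile(n), 2))
```

For Gaussian data the scaled quantile should settle near the 95% point of $|N(0,1)|$, which is about 1.96; the simulation is only a sanity check of the definitions, not a proof.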
Definition 5.1.2 (Boundedness)

(i) Almost surely bounded. If the random variable $X$ is almost surely bounded, then for a positive sequence $\{e_k\}$, such that $e_k \to \infty$ as $k \to \infty$ (typically $e_k = 2^k$ is used), we have
\[ P\Big(\omega;\ \bigcup_{k=1}^{\infty}\{|X(\omega)| \le e_k\}\Big) = 1. \]
Usually to prove the above we show that
\[ P\Big(\Big(\omega;\ \bigcup_{k=1}^{\infty}\{|X| \le e_k\}\Big)^c\Big) = 0. \]
Since $\big(\bigcup_{k=1}^{\infty}\{|X| \le e_k\}\big)^c = \bigcap_{k=1}^{\infty}\{|X| > e_k\} \subset \bigcap_{k=1}^{\infty}\bigcup_{m=k}^{\infty}\{|X| > e_m\}$, to show the above we show
\[ P\Big(\omega:\ \bigcap_{k=1}^{\infty}\bigcup_{m=k}^{\infty}\{|X(\omega)| > e_m\}\Big) = 0. \tag{5.2} \]
We note that if $\big(\omega: \bigcap_{k=1}^{\infty}\bigcup_{m=k}^{\infty}\{|X(\omega)| > e_m\}\big) \neq \emptyset$, then there exists an $\omega^* \in \Omega$ and an infinite subsequence $k_j$, where $|X(\omega^*)| > e_{k_j}$, hence $X(\omega^*)$ is not bounded (since $e_k \to \infty$). To prove (5.2) we usually use the Borel-Cantelli Lemma. This states that if $\sum_{k=1}^{\infty}P(A_k) < \infty$, the events $\{A_k\}$ occur only finitely often with probability one. Applying this to our case, if we can show that $\sum_{m=1}^{\infty}P(\omega: \{|X(\omega)| > e_m\}) < \infty$, then $\{|X(\omega)| > e_m\}$ happens only finitely often with probability one. Hence if $\sum_{m=1}^{\infty}P(\omega: \{|X(\omega)| > e_m\}) < \infty$, then $P(\omega: \bigcap_{k=1}^{\infty}\bigcup_{m=k}^{\infty}\{|X(\omega)| > e_m\}) = 0$ and $X$ is a bounded random variable.

It is worth noting that often we choose the sequence $e_k = 2^k$; in this case
\[ \sum_{m=1}^{\infty}P(\omega: \{|X(\omega)| > e_m\}) = \sum_{m=1}^{\infty}P(\omega: \{\log|X(\omega)| > m\log 2\}) \le C\,E(\log^{+}|X|), \]
where $\log^{+}x = \max(\log x, 0)$. Hence if we can show that $E(\log^{+}|X|) < \infty$, then $X$ is bounded almost surely.

(ii) Sequences which are bounded in probability. A sequence is bounded in probability, written $X_t = O_p(1)$, if for every $\varepsilon > 0$ there exists a $\delta(\varepsilon) < \infty$ such that $P(|X_t| \ge \delta(\varepsilon)) < \varepsilon$ for all $t$. Roughly speaking this means that the sequence is only extremely large with a very small probability, and as the `largeness' grows the probability declines.

5.2 Ergodicity

To motivate the notion of ergodicity we recall the strong law of large numbers (SLLN). Suppose $\{X_t\}_t$ is an iid random sequence, and $E(|X_0|) < \infty$; then by the SLLN we have that
\[ \frac{1}{n}\sum_{t=1}^{n}X_t \overset{a.s.}{\to} E(X_0), \]
for the proof see, for example, Grimmett and Stirzaker (1994). It would be useful to generalise this result and find weaker conditions on $\{X_t\}$ for this result to still hold true.
A simple application is when we want to estimate the mean $\mu$, and we use $\frac{1}{n}\sum_{t=1}^{n}X_t$ as an estimator of the mean. It can be shown that if $\{X_t\}$ is an ergodic process then the above result holds. That is, if $\{X_t\}$ is an ergodic process then for any function $h$ such that $E|h(X_0)| < \infty$ we have
\[ \frac{1}{n}\sum_{t=1}^{n}h(X_t) \overset{a.s.}{\to} E(h(X_0)). \]
Note that the result does not state anything about the rate of convergence.

Ergodicity is normally defined in terms of measure preserving transformations. We do not formally define ergodicity here, but note that all ergodic processes are stationary. For the definition of ergodicity and a full treatment see, for example, Billingsley (1995). However, below we do state a result which characterises a general class of ergodic processes.

Theorem 5.2.1 Suppose $\{Z_t\}$ is an ergodic sequence (for example iid random variables) and $g : \mathbb{R}^{\infty} \to \mathbb{R}$ is a measurable function (it is really hard to think up nonmeasurable functions). Then the sequence $\{Y_t\}_t$, where
\[ Y_t = g(Z_t, Z_{t-1}, \ldots), \]
is an ergodic process.

PROOF. See Stout (1974), Theorem 3.5.8.

Example 5.2.1 (i) The process $\{Z_t\}_t$, where $\{Z_t\}$ are iid random variables, is probably the simplest example of an ergodic sequence.

(ii) A simple example of a time series $\{X_t\}$ which is not independent but is ergodic is the AR(1) process. We recall that the AR(1) process satisfies the representation
\[ X_t = \phi X_{t-1} + \varepsilon_t, \tag{5.3} \]
where $\{\varepsilon_t\}_t$ are iid random variables with $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = 1$ and $|\phi| < 1$. It has the unique causal solution
\[ X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}. \]
The solution motivates us to define the function
\[ g(x_0, x_1, \ldots) = \sum_{j=0}^{\infty}\phi^jx_j. \]
Since $g(\cdot)$ is a limit of finite linear combinations of its arguments, it is sufficiently well behaved (thus measurable). This implies, by using Theorem 5.2.1, that $\{X_t\}$ is an ergodic process. We note that if $E(\varepsilon_t^2) < \infty$, then $E(X_t^2) < \infty$.

(iii) The ARCH(p) process $\{X_t\}$ defined by $X_t = Z_t\sigma_t$, where $\sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2$ with $\sum_{j=1}^{p}a_j < 1$, is an ergodic stochastic process (we look at this model in a later chapter).
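The ergodic theorem for the AR(1) process of Example 5.2.1(ii) can be checked numerically. The following is a minimal sketch, assuming Gaussian innovations with unit variance and a burn-in period to approximate the stationary solution; the helper names `simulate_ar1` and `ergodic_average` are hypothetical. With $h(x) = x^2$ the time average $\frac{1}{n}\sum_t h(X_t)$ should approach $E(X_0^2) = 1/(1-\phi^2)$.

```python
import random


def simulate_ar1(phi, n, burn_in=500, seed=0):
    """Simulate X_t = phi * X_{t-1} + eps_t with iid N(0, 1) innovations.

    The first burn_in values are discarded so that the returned path is
    approximately a draw from the stationary (causal) solution.
    """
    rng = random.Random(seed)
    x, path = 0.0, []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            path.append(x)
    return path


def ergodic_average(path, h):
    """Time average (1/n) * sum_t h(X_t)."""
    return sum(h(x) for x in path) / len(path)


if __name__ == "__main__":
    phi = 0.5
    target = 1.0 / (1.0 - phi ** 2)  # E(X_0^2) when var(eps_t) = 1
    for n in (100, 10_000, 200_000):
        avg = ergodic_average(simulate_ar1(phi, n), lambda x: x * x)
        print(n, round(avg, 4), "target", round(target, 4))
```

As the theorem itself says nothing about rates, the simulation should be read only as a qualitative check that the time average settles near the expectation.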
Example 5.2.2 (Application) If $\{X_t\}$ is an AR(1) process with $|\phi| < 1$ and $E(\varepsilon_t^2) < \infty$, then by using the ergodic theorem we have
\[ \frac{1}{n}\sum_{t=1}^{n}X_tX_{t+k} \overset{a.s.}{\to} E(X_0X_k). \]

5.3 Sampling properties

Often we will estimate the parameters by maximising (or minimising) a criterion. Suppose we have the criterion $L_n(a)$ (e.g. likelihood, quasi-likelihood, Kullback-Leibler etc.). We use as an estimator of $a_0$ the value $\hat a_n$, where
\[ \hat a_n = \arg\max_{a\in\Theta}L_n(a) \]
and $\Theta$ is the parameter space we do the maximisation (minimisation) over. Typically the true parameter $a_0$ should maximise (minimise) the `limiting' criterion $L$.

If this is to be a good estimator, as the sample size grows the estimator should converge (in some sense) to the parameter we are interested in estimating. As we discussed above, there are various modes in which we can measure this convergence: (i) almost surely, (ii) in probability and (iii) in mean squared error. Usually we show either (i) or (ii) (noting that (i) implies (ii)); in time series it is usually quite difficult to show (iii).

Definition 5.3.1 (i) An estimator $\hat a_n$ is said to be an almost surely consistent estimator of $a_0$, if there exists a set $M \subset \Omega$, where $P(M) = 1$ and for all $\omega \in M$ we have $\hat a_n(\omega) \to a_0$.

(ii) An estimator $\hat a_n$ is said to converge in probability to $a_0$, if for every $\delta > 0$
\[ P(|\hat a_n - a_0| > \delta) \to 0, \qquad n \to \infty. \]

To prove either (i) or (ii) usually involves verifying two main things, pointwise convergence and equicontinuity.

5.4 Showing almost sure convergence of an estimator

We now consider the general case where $L_n(a)$ is a `criterion' which we maximise. Let us suppose we can write $L_n$ as
\[ L_n(a) = \frac{1}{n}\sum_{t=1}^{n}\ell_t(a), \tag{5.4} \]
where for each $a \in \Theta$, $\{\ell_t(a)\}_t$ is an ergodic sequence. Let
\[ L(a) = E(\ell_t(a)), \tag{5.5} \]
and assume that $L(a)$ is continuous and has a unique maximum in $\Theta$. We define the estimator $\hat a_n$ where $\hat a_n = \arg\max_{a\in\Theta}L_n(a)$.

Definition 5.4.1 (Uniform convergence) $L_n(a)$ is said to almost surely converge uniformly to $L(a)$, if
\[ \sup_{a\in\Theta}|L_n(a) - L(a)| \overset{a.s.}{\to} 0. \]
In other words there exists a set $M \subset \Omega$ where $P(M) = 1$ and for every $\omega \in M$,
\[ \sup_{a\in\Theta}|L_n(\omega, a) - L(a)| \to 0. \]

Theorem 5.4.1 (Consistency) Suppose that $\hat a_n = \arg\max_{a\in\Theta}L_n(a)$ and $a_0 = \arg\max_{a\in\Theta}L(a)$ is the unique maximum. If $\sup_{a\in\Theta}|L_n(a) - L(a)| \overset{a.s.}{\to} 0$ as $n \to \infty$, then $\hat a_n \overset{a.s.}{\to} a_0$ as $n \to \infty$.

PROOF. We note that by definition we have $L_n(a_0) \le L_n(\hat a_n)$ and $L(\hat a_n) \le L(a_0)$. Using these inequalities we have
\[ L_n(a_0) - L(a_0) \le L_n(\hat a_n) - L(a_0) \le L_n(\hat a_n) - L(\hat a_n). \]
Therefore from the above we have
\[ |L_n(\hat a_n) - L(a_0)| \le \max\{|L_n(a_0) - L(a_0)|,\ |L_n(\hat a_n) - L(\hat a_n)|\} \le \sup_{a\in\Theta}|L_n(a) - L(a)|. \]
Hence since we have uniform convergence we have $|L_n(\hat a_n) - L(a_0)| \overset{a.s.}{\to} 0$ as $n \to \infty$. Now since $L(a)$ has a unique maximum, we see that $|L_n(\hat a_n) - L(a_0)| \overset{a.s.}{\to} 0$ implies $\hat a_n \overset{a.s.}{\to} a_0$.

We note that directly establishing uniform convergence is not easy. Usually it is done by assuming the parameter space is compact and showing pointwise convergence and stochastic equicontinuity; these three facts imply uniform convergence. Below we define stochastic equicontinuity and show consistency under these conditions.

Definition 5.4.2 The sequence of stochastic functions $\{f_n(a)\}_n$ is said to be stochastically equicontinuous if there exists a set $M \subset \Omega$ where $P(M) = 1$ and for every $\omega \in M$ and $\varepsilon > 0$, there exists a $\delta$ and an $N(\omega)$ such that
\[ \sup_{|a_1 - a_2| \le \delta}|f_n(\omega, a_1) - f_n(\omega, a_2)| \le \varepsilon, \]
for all $n > N(\omega)$.

A sufficient condition for stochastic equicontinuity of $f_n(a)$ (which is usually used to prove equicontinuity) is that $f_n(a)$ is in some sense Lipschitz continuous. In other words, for all $a_1, a_2 \in \Theta$,
\[ |f_n(a_1) - f_n(a_2)| < K_n\|a_1 - a_2\|, \]
where $K_n$ is a random variable which converges almost surely to a finite constant as $n \to \infty$ ($K_n \overset{a.s.}{\to} K_0$ as $n \to \infty$). To show that this implies equicontinuity we note that $K_n \overset{a.s.}{\to} K_0$ means that for every $\omega \in M$ ($P(M) = 1$) and $\gamma > 0$, we have $|K_n(\omega) - K_0| < \gamma$ for all $n > N(\omega)$. Therefore if we choose $\delta = \varepsilon/(K_0 + \gamma)$ we have
\[ \sup_{|a_1 - a_2| \le \varepsilon/(K_0+\gamma)}|f_n(\omega, a_1) - f_n(\omega, a_2)| < \varepsilon, \]
for all $n > N(\omega)$.
In the following theorem we state sufficient conditions for almost sure uniform convergence. It is worth noting this is the Arzela-Ascoli theorem for random variables.

Theorem 5.4.2 (The stochastic Ascoli Lemma) Suppose the parameter space $\Theta$ is compact, for every $a \in \Theta$ we have $L_n(a) \overset{a.s.}{\to} L(a)$ and $L_n(a)$ is stochastically equicontinuous. Then $\sup_{a\in\Theta}|L_n(a) - L(a)| \overset{a.s.}{\to} 0$ as $n \to \infty$.

We use the theorem in the corollary below.

Corollary 5.4.1 Suppose that $\hat a_n = \arg\max_{a\in\Theta}L_n(a)$ and $a_0 = \arg\max_{a\in\Theta}L(a)$, moreover $L(a)$ has a unique maximum. If

(i) we have pointwise convergence, that is for every $a \in \Theta$ we have $L_n(a) \overset{a.s.}{\to} L(a)$,

(ii) the parameter space $\Theta$ is compact,

(iii) $L_n(a)$ is stochastically equicontinuous,

then $\hat a_n \overset{a.s.}{\to} a_0$ as $n \to \infty$.

We prove Theorem 5.4.2 in the section below, but it can be omitted on first reading.

5.4.1 Proof of Theorem 5.4.2 (The stochastic Ascoli theorem)

We now show that stochastic equicontinuity and almost sure pointwise convergence imply uniform convergence. We note that on its own, pointwise convergence is a much weaker condition than uniform convergence, since for pointwise convergence the rate of convergence can be different for each parameter.

Before we continue, a few technical points. We recall that we are assuming almost sure pointwise convergence. This means for each parameter $a \in \Theta$ there exists a set $N_a \subset \Omega$ (with $P(N_a) = 1$) such that for all $\omega \in N_a$, $L_n(\omega, a) \to L(a)$. In the following lemma we unify this set. That is, we show (using stochastic equicontinuity) that there exists a set $N \subset \Omega$ (with $P(N) = 1$) such that for all $\omega \in N$ and all $a \in \Theta$, $L_n(\omega, a) \to L(a)$.

Lemma 5.4.1 Suppose the sequence $\{L_n(a)\}_n$ is stochastically equicontinuous and also pointwise convergent (that is $L_n(a)$ converges almost surely to $L(a)$), then there exists a set $M \subset \Omega$ where $P(M) = 1$ and for every $\omega \in M$ and $a \in \Theta$ we have
\[ |L_n(\omega, a) - L(a)| \to 0. \]

PROOF. Enumerate all the rationals in the set $\Theta$ and call this sequence $\{a_i\}_i$. Then for every $a_i$ there exists a set $M_{a_i}$ where $P(M_{a_i}) = 1$, such that for every $\omega \in M_{a_i}$ we have $|L_n(\omega, a_i) - L(a_i)| \to 0$.
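Define $M = \bigcap_i M_{a_i}$; since the number of sets is countable, $P(M) = 1$ and for every $\omega \in M$ and $a_i$ we have $L_n(\omega, a_i) \to L(a_i)$.

Since we have stochastic equicontinuity, there exists a set $\tilde M$ (with $P(\tilde M) = 1$), such that for every $\omega \in \tilde M$, $\{L_n(\omega, \cdot)\}$ is equicontinuous. Let $\bar M = \tilde M \cap \{\bigcap_i M_{a_i}\}$; we will show that for all $a \in \Theta$ and $\omega \in \bar M$ we have $L_n(\omega, a) \to L(a)$. By stochastic equicontinuity, for every $\omega \in \bar M$ and $\varepsilon/3 > 0$ there exists a $\delta > 0$ such that
\[ \sup_{|b_1 - b_2| \le \delta}|L_n(\omega, b_1) - L_n(\omega, b_2)| \le \varepsilon/3, \tag{5.6} \]
for all $n > N(\omega)$. Furthermore, by definition of $\bar M$, for every rational $a_i \in \Theta$ and $\omega \in \bar M$ we have
\[ |L_n(\omega, a_i) - L(a_i)| \le \varepsilon/3, \tag{5.7} \]
where $n > N'(\omega)$. Now for any given $a \in \Theta$, there exists a rational $a_i$ such that $\|a - a_i\| \le \delta$ (and, by continuity of $L$, with $|L(a) - L(a_i)| \le \varepsilon/3$). Using this, (5.6) and (5.7) we have
\[ |L_n(\omega, a) - L(a)| \le |L_n(\omega, a) - L_n(\omega, a_i)| + |L_n(\omega, a_i) - L(a_i)| + |L(a) - L(a_i)| \le \varepsilon, \]
for $n > \max(N(\omega), N'(\omega))$. To summarise, for every $\omega \in \bar M$ and $a \in \Theta$, we have $|L_n(\omega, a) - L(a)| \to 0$. Hence we have pointwise convergence for every realisation in $\bar M$.

We now show that equicontinuity implies uniform convergence.

Proof of Theorem 5.4.2. Using Lemma 5.4.1 we see that there exists a set $\bar M \subset \Omega$ with $P(\bar M) = 1$, where $L_n$ is equicontinuous and also pointwise convergent. We now show uniform convergence on this set. Choose $\varepsilon/3 > 0$ and let $\delta$ be such that for every $\omega \in \bar M$ we have
\[ \sup_{|a_1 - a_2| \le \delta}|L_n(\omega, a_1) - L_n(\omega, a_2)| \le \varepsilon/3, \tag{5.8} \]
for all $n > n(\omega)$. Since $\Theta$ is compact it can be covered by a finite number of open sets. Construct the sets $\{O_i\}_{i=1}^{p}$, such that $\Theta \subset \bigcup_{i=1}^{p}O_i$ and $\sup_{i}\sup_{x,y\in O_i}\|x - y\| \le \delta$. Let $\{a_i\}_{i=1}^{p}$ be such that $a_i \in O_i$. We note that for every $\omega \in \bar M$ we have $L_n(\omega, a_i) \to L(a_i)$, hence for every $\varepsilon/3$ there exists an $n_i(\omega)$ such that for all $n > n_i(\omega)$ we have $|L_n(\omega, a_i) - L(a_i)| \le \varepsilon/3$. Therefore, since $p$ is finite (due to compactness), there exists an $\tilde n(\omega)$ such that
\[ \max_{1\le i\le p}|L_n(\omega, a_i) - L(a_i)| \le \varepsilon/3, \]
for all $n > \tilde n(\omega) = \max_{1\le i\le p}(n_i(\omega))$. For any $a \in \Theta$, choose $i$ such that the open set $O_i$ satisfies $a \in O_i$. Using (5.8) we have
\[ |L_n(\omega, a) - L_n(\omega, a_i)| \le \varepsilon/3, \]
for all $n > n(\omega)$.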
Altogether this gives
\[ |L_n(\omega, a) - L(a)| \le |L_n(\omega, a) - L_n(\omega, a_i)| + |L_n(\omega, a_i) - L(a_i)| + |L(a) - L(a_i)| \le \varepsilon, \]
for all $n \ge \max(n(\omega), \tilde n(\omega))$. We observe that $\max(n(\omega), \tilde n(\omega))$ and $\varepsilon/3$ do not depend on $a$, therefore for all $n \ge \max(n(\omega), \tilde n(\omega))$ we have $\sup_{a}|L_n(\omega, a) - L(a)| < \varepsilon$. This gives, for every $\omega \in \bar M$ ($P(\bar M) = 1$), $\sup_{a}|L_n(\omega, a) - L(a)| \to 0$; thus we have almost sure uniform convergence.

5.5 Almost sure convergence of the least squares estimator for an AR(p) process

In Chapter 6 we will consider the sampling properties of many of the estimators defined in Chapter 4. However, to illustrate the consistency result above we apply it to the least squares estimator of the autoregressive parameters.

To simplify notation we only consider the estimator for AR(1) models. Suppose that $X_t$ satisfies $X_t = \phi X_{t-1} + \varepsilon_t$ (where $|\phi| < 1$). To estimate $\phi$ we use the least squares estimator defined below. Let
\[ L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}(X_t - aX_{t-1})^2, \tag{5.9} \]
we use $\hat\phi_n$ as an estimator of $\phi$, where
\[ \hat\phi_n = \arg\min_{a\in\Theta}L_n(a), \tag{5.10} \]
where $\Theta = [-1, 1]$. (Minimising $L_n$ is the same as maximising $-L_n$, so the results of Section 5.4 apply to $-L_n$.)

How can we show that this is consistent?

In the case of least squares for AR processes, $\hat\phi_n$ has the explicit form
\[ \hat\phi_n = \frac{\frac{1}{n-1}\sum_{t=2}^{n}X_tX_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1}X_t^2}. \]
Now by just applying the ergodic theorem to the numerator and denominator we get $\hat\phi_n \overset{a.s.}{\to} \phi$. It is worth noting that
\[ \left|\frac{\frac{1}{n-1}\sum_{t=2}^{n}X_tX_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1}X_t^2}\right| < 1 \]
is not necessarily true.

However we will tackle the problem in a rather artificial way and assume that it does not have an explicit form and instead assume that $\hat\phi_n$ is obtained by minimising $L_n(a)$ using a numerical routine. In general this is the most common way of minimising a likelihood function (usually explicit solutions do not exist).

In order to derive the sampling properties of $\hat\phi_n$ we need to directly study the likelihood function $L_n(a)$. We will do this now in the least squares case.

We will first show almost sure convergence, which will involve repeated use of the ergodic theorem. We will then demonstrate how to show convergence in probability.
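For the AR(1) least squares criterion (5.9), the explicit estimator and a numerical minimiser of $L_n(a)$ over $[-1, 1]$ can be compared in a small simulation. The sketch below is only an illustration: it assumes Gaussian innovations, uses a golden-section search as a stand-in for "a numerical routine" (the notes do not specify one), and all function names are hypothetical. Because $L_n(a)$ is a quadratic in $a$, both approaches should agree whenever the explicit estimator falls inside $[-1, 1]$.

```python
import random


def simulate_ar1(phi, n, burn_in=500, seed=0):
    """Simulate X_t = phi * X_{t-1} + eps_t with iid N(0, 1) innovations."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            path.append(x)
    return path


def criterion(a, x):
    """L_n(a) = (n-1)^{-1} * sum_{t=2}^n (X_t - a X_{t-1})^2, as in (5.9)."""
    n = len(x)
    return sum((x[t] - a * x[t - 1]) ** 2 for t in range(1, n)) / (n - 1)


def explicit_ls(x):
    """Closed-form least squares estimator: sum X_t X_{t-1} / sum X_{t-1}^2."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


def golden_section_min(f, lo=-1.0, hi=1.0, tol=1e-7):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (5 ** 0.5 - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d  # the minimiser lies in [lo, d]
        else:
            lo = c  # the minimiser lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2


if __name__ == "__main__":
    x = simulate_ar1(phi=0.6, n=2000)
    print("explicit :", explicit_ls(x))
    print("numerical:", golden_section_min(lambda a: criterion(a, x)))
```

The numerical search is restricted to the parameter space $[-1, 1]$, so if the explicit ratio happened to fall outside that interval the two answers would differ, in line with the remark that the ratio need not be smaller than one in absolute value.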
We look at almost sure convergence first as it is easier to follow. Note that almost sure convergence implies convergence in probability (but the converse is not necessarily true).

The first thing to do is let
\[ \ell_t(a) = (X_t - aX_{t-1})^2. \]
Since $\{X_t\}$ is an ergodic process (recall Example 5.2.1(ii)), by using Theorem 5.2.1 we have for every $a$ that $\{\ell_t(a)\}_t$ is an ergodic process. Therefore by using the ergodic theorem we have
\[ L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}\ell_t(a) \overset{a.s.}{\to} E(\ell_0(a)). \]
In other words for every $a \in [-1, 1]$ we have that $L_n(a) \overset{a.s.}{\to} E(\ell_0(a))$ (almost sure pointwise convergence).

Since the parameter space $\Theta = [-1, 1]$ is compact and $\phi$ is the unique minimum of $L(\cdot) = E(\ell_0(\cdot))$ in the parameter space, all that remains is to show stochastic equicontinuity; from this we deduce almost sure uniform convergence.

To show stochastic equicontinuity we expand $L_n(a)$ and use the mean value theorem to obtain
\[ L_n(a_1) - L_n(a_2) = \nabla L_n(\bar a)(a_1 - a_2), \tag{5.11} \]
where $\bar a \in [\min[a_1, a_2], \max[a_1, a_2]]$ and
\[ \nabla L_n(\bar a) = \frac{-2}{n-1}\sum_{t=2}^{n}X_{t-1}(X_t - \bar aX_{t-1}). \]
Because $\bar a \in [-1, 1]$ we have
\[ |\nabla L_n(\bar a)| \le D_n, \qquad\text{where}\qquad D_n = \frac{2}{n-1}\sum_{t=2}^{n}(|X_{t-1}X_t| + X_{t-1}^2). \]
Since $\{X_t\}_t$ is an ergodic process, $\{|X_{t-1}X_t| + X_{t-1}^2\}$ is an ergodic process. Therefore, if $\mathrm{var}(\varepsilon_0) < \infty$, by using the ergodic theorem we have
\[ D_n \overset{a.s.}{\to} 2E(|X_{t-1}X_t| + X_{t-1}^2). \]
Let $D := 2E(|X_{t-1}X_t| + X_{t-1}^2)$. Therefore there exists a set $M \subset \Omega$, where $P(M) = 1$ and for every $\omega \in M$ and $\delta^* > 0$ we have
\[ |D_n(\omega) - D| \le \delta^*, \]
for all $n > N(\omega)$. Substituting the above into (5.11) we have
\[ |L_n(\omega, a_1) - L_n(\omega, a_2)| \le D_n(\omega)|a_1 - a_2| \le (D + \delta^*)|a_1 - a_2|, \]
for all $n \ge N(\omega)$. Therefore for every $\varepsilon > 0$, there exists a $\delta := \varepsilon/(D + \delta^*)$ such that
\[ \sup_{|a_1 - a_2| \le \varepsilon/(D+\delta^*)}|L_n(\omega, a_1) - L_n(\omega, a_2)| \le \varepsilon, \]
for all $n \ge N(\omega)$. Since this is true for all $\omega \in M$ we see that $\{L_n(a)\}$ is stochastically equicontinuous.

Theorem 5.5.1 Let $\hat\phi_n$ be defined as in (5.10). Then we have $\hat\phi_n \overset{a.s.}{\to} \phi$.

PROOF. Since $\{L_n(a)\}$ is almost surely equicontinuous, the parameter space $[-1, 1]$ is compact
and we have pointwise convergence $L_n(a) \overset{a.s.}{\to} L(a)$, by using Theorem 5.4.1 (applied to $-L_n$, since minimising $L_n$ is the same as maximising $-L_n$) we have that $\hat\phi_n \overset{a.s.}{\to} a^*$, where $a^* = \arg\min_{a\in\Theta}L(a)$. Finally we need to show that $a^* = \phi$. Since
\[ L(a) = E(\ell_0(a)) = E(X_1 - aX_0)^2, \]
we see by differentiating $L(a)$ with respect to $a$ that it is minimised at $a = E(X_0X_1)/E(X_0^2)$, hence $a^* = E(X_0X_1)/E(X_0^2)$. To show that this is $\phi$, we note that by the Yule-Walker equations
\[ X_t = \phi X_{t-1} + \varepsilon_t \quad\Rightarrow\quad E(X_tX_{t-1}) = \phi E(X_{t-1}^2) + \underbrace{E(\varepsilon_tX_{t-1})}_{=0}. \]
Therefore $\phi = E(X_0X_1)/E(X_0^2)$, hence $\hat\phi_n \overset{a.s.}{\to} \phi$.

We note that by using a very similar method we can show strong consistency of the least squares estimator of the parameters in an AR(p) model.

5.6 Convergence in probability of an estimator

We described above almost sure (strong) consistency, where we showed $\hat a_n \overset{a.s.}{\to} a_0$. Sometimes it is not possible to show strong consistency (for example when ergodicity cannot be verified). Often, as an alternative, weak consistency, where $\hat a_n \overset{P}{\to} a_0$ (convergence in probability), is shown. This requires a weaker set of conditions, which we now describe:

(i) The parameter space $\Theta$ should be compact.

(ii) We have pointwise convergence in probability: for every $a \in \Theta$, $L_n(a) \overset{P}{\to} L(a)$.

(iii) The sequence $\{L_n(a)\}$ is equicontinuous in probability. That is, for every $\varepsilon > 0$ and $\eta > 0$ there exists a $\delta$ such that
\[ \lim_{n\to\infty}P\Big(\sup_{|a_1 - a_2| \le \delta}|L_n(a_1) - L_n(a_2)| > \varepsilon\Big) < \eta. \tag{5.12} \]

If the above conditions are satisfied we have $\hat a_n \overset{P}{\to} a_0$.

Verifying conditions (ii) and (iii) may look a little daunting, but with the use of Chebyshev's (or Markov's) inequality it can be quite straightforward. For example, suppose we can show that for every $a \in \Theta$
\[ E(L_n(a) - L(a))^2 \to 0, \qquad n \to \infty. \]
Then by applying Chebyshev's inequality we have for every $\varepsilon > 0$ that
\[ P(|L_n(a) - L(a)| > \varepsilon) \le \frac{E(L_n(a) - L(a))^2}{\varepsilon^2} \to 0, \qquad n \to \infty. \]
Thus for every $a \in \Theta$ we have $L_n(a) \overset{P}{\to} L(a)$.

To show (iii) we often use the mean value theorem applied to $L_n(a)$. Using the mean value theorem we have
\[ |L_n(a_1) - L_n(a_2)| \le \sup_{a}\|\nabla_aL_n(a)\|_2\,\|a_1 - a_2\|. \]
Now if we can show that $\sup_n E\sup_a\|\nabla_aL_n(a)\|_2 < \infty$ (in other words it is uniformly bounded in probability over $n$) then we have the result. To see this observe that, by Markov's inequality,
\[ P\Big(\sup_{|a_1 - a_2| \le \delta}|L_n(a_1) - L_n(a_2)| > \varepsilon\Big) \le P\Big(\delta\sup_{a}\|\nabla_aL_n(a)\|_2 > \varepsilon\Big) \le \frac{\delta\,\sup_nE\big(\sup_a\|\nabla_aL_n(a)\|_2\big)}{\varepsilon}. \]
Therefore by a careful choice of $\delta > 0$ we see that (5.12) is satisfied (and we have equicontinuity in probability).

5.7 Asymptotic normality of an estimator

The first central limit theorem goes back to the asymptotic distribution of sums of binary random variables (these have a binomial distribution and de Moivre showed that they could be approximated by a normal distribution). This result was later generalised to sums of iid random variables. From the mid 20th century to the late 20th century several advances were made in generalising the results to dependent random variables. These include generalisations to random variables which have m-dependence, mixing properties, cumulant properties, near-epoch dependence etc. (see, for example, Billingsley (1995) and Davidson (1994)). In this section we will concentrate on a central limit theorem for martingales. Our reason for choosing this flavour of CLT is that it can be applied in various estimation settings, as it can often be shown that the derivative of a criterion at the true parameter is a martingale.

Let us suppose that
\[ \hat a_n = \arg\max_{a\in\Theta}L_n(a), \qquad\text{where}\qquad L_n(a) = \frac{1}{n}\sum_{t=1}^{n}\ell_t(a), \]
and for each $a \in \Theta$, $\{\ell_t(a)\}_t$ are identically distributed random variables.

In this section we shall show asymptotic normality of $\sqrt n(\hat a_n - a_0)$. The reason for normalising by $\sqrt n$ is that $(\hat a_n - a_0) \overset{a.s.}{\to} 0$ as $n \to \infty$, hence in terms of distributions it converges towards the point mass at zero. Therefore we need to increase the magnitude of the difference $\hat a_n - a_0$. We can show that $(\hat a_n - a_0) = O_p(n^{-1/2})$, therefore $\sqrt n(\hat a_n - a_0) = O_p(1)$.

We often use $\nabla L_n(a)$ to denote the vector of partial derivatives of $L_n(a)$ with respect to $a$, that is $\nabla L_n(a) = \big(\frac{\partial L_n(a)}{\partial a_1},\ldots,\frac{\partial L_n(a)}{\partial a_p}\big)$. Since $\hat a_n = \arg\max L_n(a)$, we observe that $\nabla L_n(\hat a_n) = 0$. Now expanding $\nabla L_n(\hat a_n)$ about $a_0$ (the true parameter) we have
\[ \nabla L_n(\hat a_n) = \nabla L_n(a_0) + \nabla^2L_n(\bar a)(\hat a_n - a_0) \quad\Rightarrow\quad (\hat a_n - a_0) = -\{\nabla^2L_n(\bar a)\}^{-1}\nabla L_n(a_0), \tag{5.13} \]
where $\bar a$ lies between $\hat a_n$ and $a_0$. To show asymptotic normality of $\sqrt n(\hat a_n - a_0)$, first asymptotic normality of $\sqrt n\nabla L_n(a_0)$ is shown, second it is shown that $\nabla^2L_n(\bar a) \overset{P}{\to} E(\nabla^2\ell_0(a_0))$; together they yield asymptotic normality of $\sqrt n(\hat a_n - a_0)$. In many cases $\nabla L_n(a_0)$ is (up to normalisation) a martingale, hence the martingale central limit theorem is usually applied to show asymptotic normality of $\sqrt n\nabla L_n(a_0)$. We start by defining a martingale and stating the martingale central limit theorem.

Definition 5.7.1 The random variables $\{Z_t\}$ are called martingale differences if
\[ E(Z_t\,|\,Z_{t-1}, Z_{t-2}, \ldots) = 0. \]
The sequence $\{S_T\}_T$, where
\[ S_T = \sum_{t=1}^{T}Z_t, \]
is called a martingale if $\{Z_t\}$ are martingale differences.

Remark 5.7.1 (Martingales and covariances) We observe that if $\{Z_t\}$ are martingale differences, then for $t > s$ and $\mathcal F_s = \sigma(Z_s, Z_{s-1}, \ldots)$
\[ \mathrm{cov}(Z_s, Z_t) = E(Z_sZ_t) = E\big(E(Z_sZ_t\,|\,\mathcal F_s)\big) = E\big(Z_sE(Z_t\,|\,\mathcal F_s)\big) = E(Z_s \times 0) = 0. \]
Hence martingale differences are uncorrelated.

Example 5.7.1 Suppose that $X_t = \phi X_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $|\phi| < 1$. Then $\{\varepsilon_tX_{t-1}\}_t$ are martingale differences.

Let us define $S_T$ as
\[ S_T = \sum_{t=1}^{T}Z_t, \tag{5.14} \]
where $\mathcal F_t = \sigma(Z_t, Z_{t-1}, \ldots)$, $E(Z_t\,|\,\mathcal F_{t-1}) = 0$ and $E(Z_t^2) < \infty$. In the following theorem, adapted from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that $S_T$ is asymptotically normal.

Theorem 5.7.1 Let $\{S_T\}_T$ be defined as in (5.14). Further suppose
\[ \frac{1}{T}\sum_{t=1}^{T}Z_t^2 \overset{P}{\to} \sigma^2, \tag{5.15} \]
where $\sigma^2$ is a finite constant, for all $\eta > 0$,
\[ \frac{1}{T}\sum_{t=1}^{T}E\big(Z_t^2I(|Z_t| > \eta\sqrt T)\,|\,\mathcal F_{t-1}\big) \overset{P}{\to} 0 \tag{5.16} \]
(this is known as the conditional Lindeberg condition) and
\[ \frac{1}{T}\sum_{t=1}^{T}E(Z_t^2\,|\,\mathcal F_{t-1}) \overset{P}{\to} \sigma^2. \tag{5.17} \]
Then we have
\[ T^{-1/2}S_T \overset{D}{\to} N(0, \sigma^2). \tag{5.18} \]

5.8 Asymptotic normality of the least squares estimator

In this section we show asymptotic normality of the least squares estimator of the AR(1) process ($X_t = \phi X_{t-1} + \varepsilon_t$, with $\mathrm{var}(\varepsilon_t) = \sigma^2$) defined in (5.9).

We recall that the least squares estimator is $\hat\phi_n = \arg\min_{a\in[-1,1]}L_n(a)$. Recalling the criterion
\[ L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}(X_t - aX_{t-1})^2, \]
the first and second derivatives are
\[ \nabla L_n(a) = \frac{-2}{n-1}\sum_{t=2}^{n}X_{t-1}(X_t - aX_{t-1}), \qquad\text{so that}\qquad \nabla L_n(\phi) = \frac{-2}{n-1}\sum_{t=2}^{n}X_{t-1}\varepsilon_t, \]
and
\[ \nabla^2L_n(a) = \frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}^2. \]
Therefore by using (5.13) we have
\[ (\hat\phi_n - \phi) = -\big(\nabla^2L_n\big)^{-1}\nabla L_n(\phi). \tag{5.19} \]
Since $\{X_t^2\}$ are ergodic random variables, by using the ergodic theorem we have $\nabla^2L_n \overset{a.s.}{\to} 2E(X_0^2)$. This with (5.19) implies
\[ \sqrt n(\hat\phi_n - \phi) = -\underbrace{\big(\nabla^2L_n\big)^{-1}}_{\overset{a.s.}{\to}(2E(X_0^2))^{-1}}\sqrt n\,\nabla L_n(\phi). \tag{5.20} \]
To show asymptotic normality of $\sqrt n(\hat\phi_n - \phi)$, we will show asymptotic normality of $\sqrt n\nabla L_n(\phi)$. We observe that
\[ \nabla L_n(\phi) = \frac{-2}{n-1}\sum_{t=2}^{n}X_{t-1}\varepsilon_t \]
is a multiple of a sum of martingale differences, since $E(X_{t-1}\varepsilon_t\,|\,X_{t-1}) = X_{t-1}E(\varepsilon_t\,|\,X_{t-1}) = X_{t-1}E(\varepsilon_t) = 0$ (here we used Definition 5.7.1). In order to show asymptotic normality of $\sqrt n\nabla L_n(\phi)$ we will use the martingale central limit theorem.

We now use Theorem 5.7.1 to show that $\sqrt n\nabla L_n(\phi)$ is asymptotically normal, which means we have to verify conditions (5.15)-(5.17). We note in our example that $Z_t := X_{t-1}\varepsilon_t$, and that the series $\{X_{t-1}\varepsilon_t\}_t$ is an ergodic process. Furthermore, since for any function $g$, $E(g(X_{t-1}\varepsilon_t)\,|\,\mathcal F_{t-1}) = E(g(X_{t-1}\varepsilon_t)\,|\,X_{t-1})$, where $\mathcal F_t = \sigma(X_t, X_{t-1}, \ldots)$, we need only condition on $X_{t-1}$ rather than the entire sigma-algebra $\mathcal F_{t-1}$.

C1: By using the ergodicity of $\{X_{t-1}\varepsilon_t\}_t$ we have
\[ \frac{1}{n}\sum_{t=1}^{n}Z_t^2 = \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2\varepsilon_t^2 \overset{P}{\to} E(X_{t-1}^2)E(\varepsilon_t^2) = \sigma^2c(0). \]

C2: We now verify the conditional Lindeberg condition. We have
\[ \frac{1}{n}\sum_{t=1}^{n}E\big(Z_t^2I(|Z_t| > \eta\sqrt n)\,|\,\mathcal F_{t-1}\big) = \frac{1}{n}\sum_{t=1}^{n}E\big(X_{t-1}^2\varepsilon_t^2I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)\,|\,X_{t-1}\big). \]
We now use the Cauchy-Schwarz inequality for conditional expectations to split $X_{t-1}^2\varepsilon_t^2$ and $I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)$.
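We recall that the Cauchy-Schwarz inequality for conditional expectations is $E(X_tZ_t\,|\,\mathcal G) \le [E(X_t^2\,|\,\mathcal G)E(Z_t^2\,|\,\mathcal G)]^{1/2}$ almost surely. Therefore
\[ \frac{1}{n}\sum_{t=1}^{n}E\big(Z_t^2I(|Z_t| > \eta\sqrt n)\,|\,\mathcal F_{t-1}\big) \le \frac{1}{n}\sum_{t=1}^{n}\big[E(X_{t-1}^4\varepsilon_t^4\,|\,X_{t-1})\,E(I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)^2\,|\,X_{t-1})\big]^{1/2} \]
\[ \le \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2E(\varepsilon_t^4)^{1/2}\big[E(I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)^2\,|\,X_{t-1})\big]^{1/2}. \tag{5.21} \]
We note that rather than use the Cauchy-Schwarz inequality we can use a generalisation of it called the Hölder inequality. The Hölder inequality states that if $p^{-1} + q^{-1} = 1$, then $E(XY) \le \{E(|X|^p)\}^{1/p}\{E(|Y|^q)\}^{1/q}$ (the conditional version also exists). The advantage of using this inequality is that one can reduce the moment assumptions on $X_t$.

Returning to (5.21), and studying $E(I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)^2\,|\,X_{t-1})$, we use that $E(I(A)) = P(A)$ and the Chebyshev inequality to show
\[ E\big(I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)^2\,|\,X_{t-1}\big) = E\big(I(|X_{t-1}\varepsilon_t| > \eta\sqrt n)\,|\,X_{t-1}\big) = E\big(I(|\varepsilon_t| > \eta\sqrt n/|X_{t-1}|)\,|\,X_{t-1}\big) \]
\[ = P\Big(|\varepsilon_t| > \frac{\eta\sqrt n}{|X_{t-1}|}\,\Big|\,X_{t-1}\Big) \le \frac{X_{t-1}^2\mathrm{var}(\varepsilon_t)}{\eta^2n}. \tag{5.22} \]
Substituting (5.22) into (5.21) we have
\[ \frac{1}{n}\sum_{t=1}^{n}E\big(Z_t^2I(|Z_t| > \eta\sqrt n)\,|\,\mathcal F_{t-1}\big) \le \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2E(\varepsilon_t^4)^{1/2}\Big[\frac{X_{t-1}^2\mathrm{var}(\varepsilon_t)}{\eta^2n}\Big]^{1/2} \]
\[ = \frac{E(\varepsilon_t^4)^{1/2}E(\varepsilon_t^2)^{1/2}}{\eta\,n^{3/2}}\sum_{t=1}^{n}|X_{t-1}|^3 = \frac{E(\varepsilon_t^4)^{1/2}E(\varepsilon_t^2)^{1/2}}{\eta\,n^{1/2}}\cdot\frac{1}{n}\sum_{t=1}^{n}|X_{t-1}|^3. \]
If $E(\varepsilon_t^4) < \infty$, then $E(X_t^4) < \infty$, therefore by using the ergodic theorem we have $\frac{1}{n}\sum_{t=1}^{n}|X_{t-1}|^3 \overset{a.s.}{\to} E(|X_0|^3)$. Since almost sure convergence implies convergence in probability we have
\[ \frac{1}{n}\sum_{t=1}^{n}E\big(Z_t^2I(|Z_t| > \eta\sqrt n)\,|\,\mathcal F_{t-1}\big) \le \underbrace{\frac{E(\varepsilon_t^4)^{1/2}E(\varepsilon_t^2)^{1/2}}{\eta\,n^{1/2}}}_{\to 0}\ \underbrace{\frac{1}{n}\sum_{t=1}^{n}|X_{t-1}|^3}_{\overset{P}{\to}E(|X_0|^3)} \overset{P}{\to} 0. \]
Hence condition (5.16) is satisfied.

C3: We need to verify that
\[ \frac{1}{n}\sum_{t=1}^{n}E(Z_t^2\,|\,\mathcal F_{t-1}) \overset{P}{\to} \sigma^2c(0). \]
Since $\{X_t\}_t$ is an ergodic sequence we have
\[ \frac{1}{n}\sum_{t=1}^{n}E(Z_t^2\,|\,\mathcal F_{t-1}) = \frac{1}{n}\sum_{t=1}^{n}E(X_{t-1}^2\varepsilon_t^2\,|\,X_{t-1}) = \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2E(\varepsilon_t^2\,|\,X_{t-1}) = E(\varepsilon_t^2)\underbrace{\frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2}_{\overset{a.s.}{\to}E(X_0^2)} \overset{P}{\to} E(\varepsilon_t^2)E(X_0^2) = \sigma^2c(0), \]
hence we have verified condition (5.17).

Altogether conditions C1-C3 imply that
\[ \frac{1}{\sqrt n}\sum_{t=1}^{n}X_{t-1}\varepsilon_t \overset{D}{\to} N(0, \sigma^2c(0)). \tag{5.23} \]
Since $\sqrt n\nabla L_n(\phi)$ is, up to a factor tending to one, $-2$ times the left hand side of (5.23), we have $\sqrt n\nabla L_n(\phi) \overset{D}{\to} N(0, 4\sigma^2c(0))$. Recalling (5.20) we therefore have
\[ \sqrt n(\hat\phi_n - \phi) = -\underbrace{\big(\nabla^2L_n\big)^{-1}}_{\overset{a.s.}{\to}(2E(X_0^2))^{-1}}\ \underbrace{\sqrt n\,\nabla L_n(\phi)}_{\overset{D}{\to}N(0,\,4\sigma^2c(0))}. \tag{5.24} \]
Using that $E(X_0^2) = c(0)$, this implies that
\[ \sqrt n(\hat\phi_n - \phi) \overset{D}{\to} N(0, \sigma^2c(0)^{-1}). \tag{5.25} \]
Thus we have derived the limiting distribution of $\hat\phi_n$. (For the AR(1) model $c(0) = \sigma^2/(1-\phi^2)$, so the limiting variance is $1 - \phi^2$.)

Remark 5.8.1 We recall that
\[ (\hat\phi_n - \phi) = -\big(\nabla^2L_n\big)^{-1}\nabla L_n(\phi) = \frac{\frac{2}{n-1}\sum_{t=2}^{n}\varepsilon_tX_{t-1}}{\frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}^2}, \tag{5.26} \]
and that $\mathrm{var}\big(\frac{-2}{n-1}\sum_{t=2}^{n}\varepsilon_tX_{t-1}\big) = \frac{4}{(n-1)^2}\sum_{t=2}^{n}\mathrm{var}(\varepsilon_tX_{t-1}) = O(\frac{1}{n})$. This implies
\[ (\hat\phi_n - \phi) = O_p(n^{-1/2}). \tag{5.27} \]
An almost sure bound can also be obtained, although the almost sure rate is slightly slower than $n^{-1/2}$ (compare the discussion of the law of the iterated logarithm in Section 6.1).

The same result is true for autoregressive processes of arbitrary finite order. That is
\[ \sqrt n(\hat\phi_n - \phi) \overset{D}{\to} N(0, E(\Gamma_p)^{-1}\sigma^2). \tag{5.28} \]

Chapter 6
Sampling properties of ARMA parameter estimators

In this section we obtain the sampling properties of estimates of the parameters in an ARMA process
\[ X_t - \sum_{i=1}^{p}\phi_i^{(0)}X_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j^{(0)}\varepsilon_{t-j}, \tag{6.1} \]
where $\{\varepsilon_t\}$ are iid random variables with mean zero and $\mathrm{var}(\varepsilon_t) = \sigma^2$. Let $\phi_0 = (\phi_1^{(0)},\ldots,\phi_p^{(0)})$ and $\theta_0 = (\theta_1^{(0)},\ldots,\theta_q^{(0)})$ and $\vartheta_0 = (\phi_0, \theta_0)$.

6.1 Asymptotic properties of the Hannan and Rissanen estimation method

In this section we will derive the sampling properties of the Hannan-Rissanen estimator. We will obtain an almost sure rate of convergence (this will be the only estimator where we obtain an almost sure rate). Typically obtaining almost sure rates can be more difficult than obtaining probabilistic rates, moreover the rates can be different (worse in the almost sure case). We now illustrate why that is with a small example. Suppose $\{X_t\}$ are iid random variables with mean zero and variance one. Let $S_n = \sum_{t=1}^{n}X_t$. It can easily be shown that
\[ \mathrm{var}\Big(\frac{1}{n}S_n\Big) = \frac{1}{n}, \qquad\text{therefore}\qquad \frac{1}{n}S_n = O_p\Big(\frac{1}{\sqrt n}\Big). \tag{6.2} \]
However, from the law of the iterated logarithm we have for any $\varepsilon > 0$
\[ P\big(S_n \ge (1+\varepsilon)\sqrt{2n\log\log n}\ \text{infinitely often}\big) = 0, \qquad P\big(S_n \ge (1-\varepsilon)\sqrt{2n\log\log n}\ \text{infinitely often}\big) = 1. \tag{6.3} \]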
6.3) ( Comparing (6.2) and (6.3) we see that for any given trajectory (realisation) most of the time log log n 1 1 n Sn will be within the O( n ) bound but there will be excursions above when it to the O( n 1 1 bound. In other words we cannot say that n Sn = ( n ) almost surely, but we can say that This basically means that 2 log log n 1 ) almost surely. Sn = O( n n 66 Hence the probabilistic and the almost sure rates are (slightly) different. Given this result is true for the average of iid random variables, it is likely that similar results will hold true for various estimators. In this section we derive an almost sure rate for Hannan-Rissanen estimator, this rate will be determined by a few factors (a) an almost sure bound similar to the one derived above (b) the increasing number of parameters pn (c) the bias due to estimating only a finite number of parameters when there are an infinite number in the model. We first recall the algorithm: (i) Use least squares to estimate {bj }pn and define j=1 ^ ^ -1 r bn = Rn ^n , ^ where bn = (^1,n , . . . , ^pn ,n ), b b n T (6.4) ^ Rn = t=pn +1 Xt-1 Xt-1 ^n = r t=pn +1 Xt Xt-1 and Xt-1 = (Xt-1 , . . . , Xt-pn ). (ii) Estimate the residuals with pn t = Xt - ~ ^j,n Xt-j . b j=1 ~ ~ (iii) Now use as estimates of 0 and 0 n , n where n p q ~ ~ n , n = arg min t=pn +1 (Xt - j=1 j Xt-j - i t-i )2 . ~ i=1 (6.5) We note that the above can easily be minimised. In fact ~ ~ ~n s (n , n ) = R-1~n where 1 ~ Rn = n n ~ ~ Yt Yt t=pn +1 1 ~n = s T n ~ Y t Xt , t=pn +1 ~ ~ ~ Yt = (Xt-1 , . . . , Xt-p , t-1 , . . . , t-q ). Let n = (n , n ). ~ ~ ^ We observe that in the second stage of the scheme where the estimation of the ARMA parameters are done, it is important to show that the empirical residuals are close to the true residuals. That is t = t + o(1). 
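As a rough numerical illustration of steps (i)-(iii), the following Python sketch (not part of the original notes) carries out the three stages with NumPy least squares. The function name `hannan_rissanen`, the argument `p_n` and the simulated ARMA(1,1) example are assumptions made purely for illustration. The stage (iii) regression starts at $t=p_n+\max(p,q)+1$ rather than $t=p_n+1$, so that every lagged residual it uses has actually been estimated, and the sketch assumes $p+q\ge 1$.

```python
import numpy as np


def hannan_rissanen(x, p, q, p_n):
    """Sketch of the three-stage Hannan-Rissanen scheme (illustrative only).

    x   : observed series X_1, ..., X_n
    p,q : AR and MA orders of the target ARMA(p, q) model (p + q >= 1 assumed)
    p_n : order of the long autoregression used in stage (i)
    """
    x = np.asarray(x, dtype=float)
    n = len(x)

    # Stage (i): least squares fit of a long AR(p_n) model.
    rows = np.arange(p_n, n)
    lagged = np.column_stack([x[rows - j] for j in range(1, p_n + 1)])
    b_hat, *_ = np.linalg.lstsq(lagged, x[rows], rcond=None)

    # Stage (ii): estimated residuals; undefined (NaN) for the first p_n points.
    eps = np.full(n, np.nan)
    eps[rows] = x[rows] - lagged @ b_hat

    # Stage (iii): regress X_t on (X_{t-1..t-p}, eps_{t-1..t-q}).
    rows2 = np.arange(p_n + max(p, q), n)
    columns = [x[rows2 - j] for j in range(1, p + 1)]
    columns += [eps[rows2 - i] for i in range(1, q + 1)]
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, x[rows2], rcond=None)
    return coef[:p], coef[p:]


# Simulated ARMA(1,1): X_t - phi X_{t-1} = eps_t + theta eps_{t-1}.
rng = np.random.default_rng(0)
n, phi, theta = 20000, 0.5, 0.4
e = rng.standard_normal(n + 1)
x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = phi * x[t - 1] + e[t] + theta * e[t - 1]

phi_hat, theta_hat = hannan_rissanen(x[1:], p=1, q=1, p_n=20)
print(phi_hat, theta_hat)
```

With a long series and a moderate `p_n` the two estimates should land close to the simulated values; the choice `p_n=20` here is arbitrary and is not the growth rate required by the theory.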
We observe that from the definition of $\tilde\varepsilon_t$, this depends on the rate of convergence of the AR estimators $\hat b_{j,n}$:
\[
\tilde\varepsilon_t = X_t-\sum_{j=1}^{p_n}\hat b_{j,n}X_{t-j}
= \varepsilon_t-\sum_{j=1}^{p_n}(\hat b_{j,n}-b_j)X_{t-j}+\sum_{j=p_n+1}^{\infty}b_jX_{t-j}. \tag{6.6}
\]
Hence
\[
|\tilde\varepsilon_t-\varepsilon_t| \le \Big|\sum_{j=1}^{p_n}(\hat b_{j,n}-b_j)X_{t-j}\Big|+\Big|\sum_{j=p_n+1}^{\infty}b_jX_{t-j}\Big|. \tag{6.7}
\]
Therefore to study the asymptotic properties of $\tilde\vartheta_n=(\tilde\phi_n,\tilde\theta_n)$ we need to

(a) obtain a rate of convergence for $\sup_j|\hat b_{j,n}-b_j|$;

(b) obtain a rate for $|\tilde\varepsilon_t-\varepsilon_t|$;

(c) use the above to obtain a rate for $\tilde\vartheta_n=(\tilde\phi_n,\tilde\theta_n)$.

We first want to obtain the uniform rate of convergence for $\sup_j|\hat b_{j,n}-b_j|$. Deriving this is technically quite challenging. We state the rate in the following theorem; an outline of the proof can be found in Section 6.1.1. The proof uses results from mixingale theory, which can be found in Chapter 10.

Theorem 6.1.1 Suppose that $\{X_t\}$ is from an ARMA process where the roots of the true characteristic polynomials $\phi(z)$ and $\theta(z)$ both have absolute value greater than $1+\delta$. Let $\hat b_n$ be defined as in (6.4). Then we have almost surely
\[
\|\hat b_n-b_n\|_2 = O\Big(p_n^2\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\frac{p_n^3}{n}+p_n\rho^{p_n}\Big)
\]
for any $\gamma>0$.

PROOF. See Section 6.1.1.

Corollary 6.1.1 Suppose the conditions in Theorem 6.1.1 are satisfied. Then we have
\[
|\tilde\varepsilon_t-\varepsilon_t| \le p_n\max_{1\le j\le p_n}|\hat b_{j,n}-b_j|\,Z_{t,p_n}+K\rho^{p_n}Y_{t-p_n}, \tag{6.8}
\]
where $Z_{t,p_n}=\frac{1}{p_n}\sum_{j=1}^{p_n}|X_{t-j}|$ and $Y_t=\sum_{j=1}^{\infty}\rho^{j}|X_{t-j}|$,
\[
\frac{1}{n}\Big|\sum_{t=p_n+1}^{n}\big(\tilde\varepsilon_{t-i}X_{t-j}-\varepsilon_{t-i}X_{t-j}\big)\Big| = O(p_nQ(n)+\rho^{p_n}), \tag{6.9}
\]
\[
\frac{1}{n}\Big|\sum_{t=p_n+1}^{n}\big(\tilde\varepsilon_{t-i}\tilde\varepsilon_{t-j}-\varepsilon_{t-i}\varepsilon_{t-j}\big)\Big| = O(p_nQ(n)+\rho^{p_n}), \tag{6.10}
\]
where $Q(n)=p_n^2\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\frac{p_n^3}{n}+p_n\rho^{p_n}$.

PROOF. Using (6.7) we immediately obtain (6.8). To obtain (6.9) we use (6.7) to obtain
\[
\frac{1}{n}\Big|\sum_{t=p_n+1}^{n}\big(\tilde\varepsilon_{t-i}X_{t-j}-\varepsilon_{t-i}X_{t-j}\big)\Big|
\le \frac{1}{n}\sum_{t=p_n+1}^{n}|X_{t-j}|\,|\tilde\varepsilon_{t-i}-\varepsilon_{t-i}|
\le O(p_nQ(n))\frac{1}{n}\sum_{t=p_n+1}^{n}|X_t||Z_{t,p_n}|+O(\rho^{p_n})\frac{1}{n}\sum_{t=p_n+1}^{n}|X_t||Y_{t-p_n}|
= O(p_nQ(n)+\rho^{p_n}).
\]
To prove (6.10) we use a similar method, hence we omit the details.

We apply the above result in the theorem below.

Theorem 6.1.2 Suppose the assumptions in Theorem 6.1.1 are satisfied. Then
\[
\|\tilde\vartheta_n-\vartheta_0\|_2 = O\Big(p_n^3\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\frac{p_n^4}{n}+p_n^2\rho^{p_n}\Big)
\]
for any $\gamma>0$, where $\tilde\vartheta_n=(\tilde\phi_n,\tilde\theta_n)$ and $\vartheta_0=(\phi_0,\theta_0)$.

PROOF. We note from the definition of $\tilde\vartheta_n$ that
\[
\tilde\vartheta_n-\vartheta_0 = \tilde R_n^{-1}\big(\tilde s_n-\tilde R_n\vartheta_0\big).
\]
Now in $\tilde R_n$ and $\tilde s_n$ we replace the estimated residuals $\tilde\varepsilon_t$ with the true unobserved residuals. This gives us
\[
\tilde\vartheta_n-\vartheta_0 = R_n^{-1}\big(s_n-R_n\vartheta_0\big)+\big(\tilde R_n^{-1}\tilde s_n-R_n^{-1}s_n\big), \tag{6.11}
\]
where
\[
R_n=\frac{1}{n}\sum_{t=\max(p,q)}^{n}Y_tY_t', \qquad s_n=\frac{1}{n}\sum_{t=\max(p,q)}^{n}Y_tX_t, \qquad Y_t'=(X_{t-1},\ldots,X_{t-p},\varepsilon_{t-1},\ldots,\varepsilon_{t-q})
\]
(recalling that $\tilde Y_t'=(X_{t-1},\ldots,X_{t-p},\tilde\varepsilon_{t-1},\ldots,\tilde\varepsilon_{t-q})$). The error term is
\[
\tilde R_n^{-1}\tilde s_n-R_n^{-1}s_n = \tilde R_n^{-1}(R_n-\tilde R_n)R_n^{-1}s_n+\tilde R_n^{-1}(\tilde s_n-s_n).
\]
Now, almost surely $R_n^{-1},\tilde R_n^{-1}=O(1)$ (if $E(R_n)$ is non-singular). Hence we only need to obtain a bound for $\tilde R_n-R_n$ and $\tilde s_n-s_n$. We recall that
\[
\tilde R_n-R_n=\frac{1}{n}\sum_{t=p_n+1}^{n}\big(\tilde Y_t\tilde Y_t'-Y_tY_t'\big),
\]
hence the terms differ only where we replace the estimated $\tilde\varepsilon_t$ with the true $\varepsilon_t$. By using (6.9) and (6.10) we have almost surely
\[
|\tilde R_n-R_n|=O(p_nQ(n)+\rho^{p_n}) \quad\text{and}\quad |\tilde s_n-s_n|=O(p_nQ(n)+\rho^{p_n}).
\]
Therefore by substituting the above into (6.11) we obtain
\[
\tilde\vartheta_n-\vartheta_0 = R_n^{-1}\big(s_n-R_n\vartheta_0\big)+O(p_nQ(n)+\rho^{p_n}). \tag{6.12}
\]
Finally using straightforward algebra it can be shown that
\[
s_n-R_n\vartheta_0=\frac{1}{n}\sum_{t=\max(p,q)}^{n}\varepsilon_tY_t.
\]
By using Theorem 6.1.3, below, we have $\|s_n-R_n\vartheta_0\|=O\big((p+q)\sqrt{(\log\log n)^{1+\gamma}\log n/n}\big)$. Substituting the above bound into (??), and noting that $O(Q(n))$ dominates $O\big(\sqrt{(\log\log n)^{1+\gamma}\log n/n}\big)$, gives
\[
\|\tilde\vartheta_n-\vartheta_0\|_2 = O\Big(p_n^3\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\frac{p_n^4}{n}+p_n^2\rho^{p_n}\Big)
\]
and the required result.

6.1.1 Proof of Theorem 6.1.1 (A rate for $\|\hat b_n-b_n\|_2$)

We observe that
\[
\hat b_n-b_n = R_n^{-1}\big(\hat r_n-\hat R_nb_n\big)+\big(\hat R_n^{-1}-R_n^{-1}\big)\big(\hat r_n-\hat R_nb_n\big),
\]
where $b_n$, $R_n$ and $r_n$ are deterministic, with $b_n=(b_1,\ldots,b_{p_n})$, $(R_n)_{i,j}=E(X_iX_j)$ and $(r_n)_i=E(X_0X_{-i})$.
Evaluating the Euclidean distance we have
\[
\|\hat b_n-b_n\|_2 \le \|R_n^{-1}\|_{spec}\,\|\hat r_n-\hat R_nb_n\|_2+\|R_n^{-1}\|_{spec}\,\|\hat R_n^{-1}\|_{spec}\,\|\hat R_n-R_n\|_2\,\|\hat r_n-\hat R_nb_n\|_2, \tag{6.13}
\]
where we used that $\hat R_n^{-1}-R_n^{-1}=\hat R_n^{-1}(R_n-\hat R_n)R_n^{-1}$ and the norm inequalities. Now by using Lemma 3.3.1 we have $\lambda_{\min}(R_n)>\delta/2$ for all $n$, so that $\|R_n^{-1}\|_{spec}$ is bounded. Thus our aim is to obtain almost sure bounds for $\|\hat r_n-\hat R_nb_n\|_2$ and $\|\hat R_n-R_n\|_2$, which requires the result below.

Theorem 6.1.3 Let us suppose that $\{X_t\}$ has an ARMA representation where the roots of the characteristic polynomials $\phi(z)$ and $\theta(z)$ have absolute value greater than $1+\delta$. Then

(i)
\[
\frac{1}{n}\sum_{t=r+1}^{n}\varepsilon_tX_{t-r}=O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big), \tag{6.14}
\]
(ii)
\[
\frac{1}{n}\sum_{t=\max(i,j)}^{n}\big[X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big]=O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big), \tag{6.15}
\]
almost surely, for any $\gamma>0$.

PROOF. The result is proved in Chapter 10.2.

To obtain the bounds we first note that if there were no MA component in the ARMA process, in other words if $\{X_t\}$ were an AR($p$) process with $p_n\ge p$, then $(\hat r_n-\hat R_nb_n)_r=\frac{1}{n}\sum_{t=p_n+1}^{n}\varepsilon_tX_{t-r}$, which has mean zero. However, because an ARMA process has an AR($\infty$) representation and we are only estimating the first $p_n$ parameters, there exists a `bias' in $\hat r_n-\hat R_nb_n$. Therefore we obtain the decomposition
\[
(\hat r_n-\hat R_nb_n)_r = \frac{1}{n}\sum_{t=p_n+1}^{n}\Big(X_t-\sum_{j=1}^{\infty}b_jX_{t-j}\Big)X_{t-r}+\frac{1}{n}\sum_{t=p_n+1}^{n}\sum_{j=p_n+1}^{\infty}b_jX_{t-j}X_{t-r} \tag{6.16}
\]
\[
= \underbrace{\frac{1}{n}\sum_{t=p_n+1}^{n}\varepsilon_tX_{t-r}}_{\text{stochastic term}}+\underbrace{\frac{1}{n}\sum_{t=p_n+1}^{n}\sum_{j=p_n+1}^{\infty}b_jX_{t-j}X_{t-r}}_{\text{bias}}. \tag{6.17}
\]
Therefore we can bound the bias with
\[
\Big|(\hat r_n-\hat R_nb_n)_r-\frac{1}{n}\sum_{t=p_n+1}^{n}\varepsilon_tX_{t-r}\Big| \le K\rho^{p_n}\frac{1}{n}\sum_{t=1}^{n}|X_{t-r}|\sum_{j=1}^{\infty}\rho^{j}|X_{t-p_n-j}|. \tag{6.18}
\]
Let $Y_t=\sum_{j=1}^{\infty}\rho^{j}|X_{t-j}|$ and $S_{n,k,r}=\frac{1}{n}\sum_{t=1}^{n}|X_{t-r}|\sum_{j=1}^{\infty}\rho^{j}|X_{t-k-j}|$. We note that $\{Y_t\}$ and $\{X_t\}$ are ergodic sequences. By applying the ergodic theorem we can show that for fixed $k$ and $r$, $S_{n,k,r}\to E(|X_{t-r}|Y_{t-k})$ almost surely. Hence $S_{n,k,r}$ are almost surely bounded sequences and
\[
\rho^{p_n}\frac{1}{n}\sum_{t=1}^{n}|X_{t-r}|\sum_{j=1}^{\infty}\rho^{j}|X_{t-p_n-j}|=O(\rho^{p_n}).
\]
Therefore almost surely we have
\[
\|\hat r_n-\hat R_nb_n\|_2 = \Big\|\frac{1}{n}\sum_{t=p_n+1}^{n}\varepsilon_tX_{t-1}\Big\|_2+O(p_n\rho^{p_n}).
\]
Now by using (6.14) we have
\[
\|\hat r_n-\hat R_nb_n\|_2 = O\Big(p_n\Big\{\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\rho^{p_n}\Big\}\Big). \tag{6.19}
\]
This gives us a rate for $\hat r_n-\hat R_nb_n$. Next we consider $\hat R_n$. It is clear from the definition of $\hat R_n$ that almost surely we have
\[
(\hat R_n)_{i,j}-E(X_iX_j) = \frac{1}{n}\sum_{t=p_n+1}^{n}X_{t-i}X_{t-j}-E(X_iX_j)
= \frac{1}{n}\sum_{t=\min(i,j)}^{n}\big[X_{t-i}X_{t-j}-E(X_iX_j)\big]-\frac{1}{n}\sum_{t=\min(i,j)}^{p_n}X_{t-i}X_{t-j}+\frac{\min(i,j)}{n}E(X_iX_j)
= \frac{1}{n}\sum_{t=\min(i,j)}^{n}\big[X_{t-i}X_{t-j}-E(X_iX_j)\big]+O\Big(\frac{p_n}{n}\Big).
\]
Now by using (6.15) we have almost surely
\[
|(\hat R_n)_{i,j}-E(X_iX_j)| = O\Big(\frac{p_n}{n}+\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big).
\]
Therefore we have almost surely
\[
\|\hat R_n-R_n\|_2 = O\Big(p_n^2\Big\{\frac{p_n}{n}+\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big\}\Big). \tag{6.20}
\]
We note that by using (6.13), (6.19) and (6.20) we have
\[
\|\hat b_n-b_n\|_2 \le \|R_n^{-1}\|_{spec}\,\|\hat R_n^{-1}\|_{spec}\;O\Big(p_n\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\frac{p_n^2}{n}+p_n\rho^{p_n}\Big).
\]
As we mentioned previously, because the spectrum of $X_t$ is bounded away from zero, $\lambda_{\min}(R_n)$ is bounded away from zero for all $n$. Moreover, $\lambda_{\min}(\hat R_n)\ge\lambda_{\min}(R_n)-\lambda_{\max}(\hat R_n-R_n)\ge\lambda_{\min}(R_n)-\mathrm{tr}\big((\hat R_n-R_n)^2\big)^{1/2}$, which for a large enough $n$ is bounded away from zero. Hence we obtain almost surely
\[
\|\hat b_n-b_n\|_2 = O\Big(p_n^2\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}+\frac{p_n^3}{n}+p_n\rho^{p_n}\Big), \tag{6.21}
\]
thus proving Theorem 6.1.1 for any $\gamma>0$.

6.2 Asymptotic properties of the GMLE

Let us suppose that $\{X_t\}$ satisfies the ARMA representation
\[
X_t-\sum_{i=1}^{p}\phi_i^{(0)}X_{t-i} = \varepsilon_t+\sum_{j=1}^{q}\theta_j^{(0)}\varepsilon_{t-j}, \tag{6.22}
\]
and $\theta_0=(\theta_1^{(0)},\ldots,\theta_q^{(0)})$, $\phi_0=(\phi_1^{(0)},\ldots,\phi_p^{(0)})$ and $\sigma_0^2=\mathrm{var}(\varepsilon_t)$. In this section we consider the sampling properties of the GML estimator, defined in Section 4.3.2. We first recall the estimator. We use as an estimator of $(\phi_0,\theta_0)$, $(\hat\phi_n,\hat\theta_n,\hat\sigma_n)=\arg\min_{(\phi,\theta)\in\Theta}L_n(\phi,\theta,\sigma)$, where
\[
\frac{1}{n}L_n(\phi,\theta,\sigma) = \frac{1}{n}\sum_{t=1}^{n-1}\log r_{t+1}(\sigma,\phi,\theta)+\frac{1}{n}\sum_{t=1}^{n-1}\frac{\big(X_{t+1}-X_{t+1|t}^{(\phi,\theta)}\big)^2}{r_{t+1}(\sigma,\phi,\theta)}. \tag{6.23}
\]
To show consistency and asymptotic normality we will use the following assumptions.

Assumption 6.2.1 (i) $X_t$ is both invertible and causal.

(ii) The parameter space $\Theta$ should be such that all $\phi(z)$ and $\theta(z)$ in the parameter space have roots whose absolute value is greater than $1+\delta$. $\phi_0(z)$ and $\theta_0(z)$ belong to this space.
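Assumption 6.2.1 means that for some finite constant $K$ and $\frac{1}{1+\delta}\le\rho<1$, we have $|\phi(z)^{-1}|\le K\sum_{j=0}^{\infty}\rho^{j}|z^{j}|$ and $|\theta(z)^{-1}|\le K\sum_{j=0}^{\infty}\rho^{j}|z^{j}|$.

To prove the result, we require the following approximations of the GML. Let
\[
\tilde X_{t+1|t,\ldots}^{(\phi,\theta)} = \sum_{j=1}^{t}b_j(\phi,\theta)X_{t+1-j}. \tag{6.24}
\]
This is an approximation of the one-step ahead predictor. Since the likelihood is constructed from the one-step ahead predictors, we can approximate the likelihood $\frac{1}{n}L_n(\phi,\theta,\sigma)$ with the above and define
\[
\frac{1}{n}\tilde L_n(\phi,\theta,\sigma) = \log\sigma^2+\frac{1}{n\sigma^2}\sum_{t=1}^{n-1}\big(X_{t+1}-\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2. \tag{6.25}
\]
We recall that $\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}$ was derived from $X_{t+1|t,\ldots}^{(\phi,\theta)}$, which is the one-step ahead predictor of $X_{t+1}$ given $X_t,X_{t-1},\ldots$; this is
\[
X_{t+1|t,\ldots}^{(\phi,\theta)} = \sum_{j=1}^{\infty}b_j(\phi,\theta)X_{t+1-j}. \tag{6.26}
\]
Using the above we define an approximation of $\frac{1}{n}L_n(\phi,\theta,\sigma)$ which in practice cannot be obtained (since the infinite past of $\{X_t\}$ is not observed). Let us define the criterion
\[
\frac{1}{n}\mathcal{L}_n(\phi,\theta,\sigma) = \log\sigma^2+\frac{1}{n\sigma^2}\sum_{t=1}^{n-1}\big(X_{t+1}-X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2. \tag{6.27}
\]
In practice $\frac{1}{n}\mathcal{L}_n(\phi,\theta,\sigma)$ cannot be evaluated, but it proves to be a convenient tool in obtaining the sampling properties of $\hat\phi_n$. The main reason is that $\frac{1}{n}\mathcal{L}_n(\phi,\theta,\sigma)$ is a function of $\{X_t\}$ and $\{X_{t+1|t,\ldots}^{(\phi,\theta)}=\sum_{j=1}^{\infty}b_j(\phi,\theta)X_{t+1-j}\}$, and both of these are ergodic (since the ARMA process is ergodic when its roots lie outside the unit circle, and the roots of $\phi$, $\theta$ are such that they lie outside the unit circle). In contrast, $L_n(\phi,\theta,\sigma)$ is comprised of $\{X_{t+1|t}\}$, which is not an ergodic random variable because $X_{t+1|t}$ is the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$ (the number of elements in the prediction changes with $t$). Using this approximation really simplifies the proof, though it is possible to prove the result without using these approximations.

First we obtain the result for the estimators $(\hat\phi_n^{*},\hat\theta_n^{*},\hat\sigma_n^{*})=\arg\min_{(\phi,\theta)\in\Theta}\mathcal{L}_n(\phi,\theta,\sigma)$ and then show the same result can be applied to $\hat\phi_n$.

Proposition 6.2.1 Suppose $\{X_t\}$ is an ARMA process which satisfies (6.22), and Assumption 6.2.1 is satisfied. Let $X_{t+1|t}^{(\phi,\theta)}$, $\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}$ and $X_{t+1|t,\ldots}^{(\phi,\theta)}$ be the predictors defined in (4.11), (6.24) and (6.26), obtained using the parameters $\phi=\{\phi_j\}$ and $\theta=\{\theta_i\}$, where the roots of the corresponding characteristic polynomials $\phi(z)$ and $\theta(z)$ have absolute value greater than $1+\delta$. Then
\[
\big|X_{t+1|t}^{(\phi,\theta)}-\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}\big| \le \frac{\rho^{t}}{1-\rho}\sum_{i=1}^{t}\rho^{i}|X_i|, \tag{6.28}
\]
\[
E\big(X_{t+1|t}^{(\phi,\theta)}-\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2 \le K\rho^{t}, \tag{6.29}
\]
\[
\big|\tilde X_{t+1|t,\ldots}(1)-X_{t+1|t,\ldots}\big| = \Big|\sum_{j=t+1}^{\infty}b_j(\phi,\theta)X_{t+1-j}\Big| \le K\rho^{t}\sum_{j=0}^{\infty}\rho^{j}|X_{-j}|, \tag{6.30}
\]
\[
E\big(\tilde X_{t+1|t,\ldots}-X_{t+1|t,\ldots}\big)^2 \le K\rho^{t} \tag{6.31}
\]
and
\[
|r_t(\sigma,\phi,\theta)-\sigma^2| \le K\rho^{t} \tag{6.32}
\]
for any $1/(1+\delta)<\rho<1$, where $K$ is some finite constant.

PROOF. The proof follows closely the proof of Proposition 6.2.1. First we define a separate ARMA process $\{Y_t\}$, which is driven by the parameters $\phi$ and $\theta$ (recall that $\{X_t\}$ is driven by the parameters $\phi_0$ and $\theta_0$). That is, $Y_t$ satisfies $Y_t-\sum_{j=1}^{p}\phi_jY_{t-j}=\varepsilon_t+\sum_{j=1}^{q}\theta_j\varepsilon_{t-j}$. Recalling that $X_{t+1|t}^{\phi,\theta}$ is the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$ computed using the covariances of $\{Y_t\}$ (noting that it is the process driven by $\phi$ and $\theta$), we have
\[
X_{t+1|t}^{\phi,\theta} = \sum_{j=1}^{t}b_j(\phi,\theta)X_{t+1-j}+\sum_{j=t+1}^{\infty}b_j(\phi,\theta)\,r_{t,j}(\phi,\theta)'\,\Sigma_t(\phi,\theta)^{-1}X_t, \tag{6.33}
\]
where $\Sigma_t(\phi,\theta)_{s,t}=E(Y_sY_t)$, $(r_{t,j})_i=E(Y_{t-i}Y_{-j})$ and $X_t'=(X_t,\ldots,X_1)$. Therefore
\[
X_{t+1|t}^{\phi,\theta}-\tilde X_{t+1|t,\ldots} = \sum_{j=t+1}^{\infty}b_j\,r_{t,j}'\,\Sigma_t(\phi,\theta)^{-1}X_t.
\]
Since the largest eigenvalue of $\Sigma_t(\phi,\theta)^{-1}$ is bounded (see Lemma 3.3.1) and $|(r_{t,j})_i|=|E(Y_{t-i}Y_{-j})|\le K\rho^{|t-i+j|}$, we obtain the bound in (6.28). Taking expectations, we have
\[
E\big(X_{t+1|t}^{\phi,\theta}-\tilde X_{t+1|t,\ldots}^{\phi,\theta}\big)^2 = \Big(\sum_{j=t+1}^{\infty}b_jr_{t,j}\Big)'\Sigma_t(\phi,\theta)^{-1}\Sigma_t(\phi_0,\theta_0)\Sigma_t(\phi,\theta)^{-1}\Big(\sum_{j=t+1}^{\infty}b_jr_{t,j}\Big).
\]
Now by using the same arguments given in the proof of (3.10) we obtain (6.29). To prove (6.31) we note that
\[
E\big(\tilde X_{t+1|t,\ldots}(1)-X_{t+1|t,\ldots}\big)^2 = E\Big(\sum_{j=t+1}^{\infty}b_j(\phi,\theta)X_{t+1-j}\Big)^2 = E\Big(\sum_{j=1}^{\infty}b_{t+j}(\phi,\theta)X_{-j}\Big)^2;
\]
now by using (2.9) we have $|b_{t+j}(\phi,\theta)|\le K\rho^{t+j}$ for $\frac{1}{1+\delta}<\rho<1$, which also gives the bound in (6.30). Using this we have $E(\tilde X_{t+1|t,\ldots}(1)-X_{t+1|t,\ldots})^2\le K\rho^{t}$, which proves the result.

Using $\varepsilon_t=X_t-\sum_{j=1}^{\infty}b_j(\phi_0,\theta_0)X_{t-j}$ and substituting this into $\mathcal{L}_n(\phi,\theta,\sigma)$ gives
\[
\frac{1}{n}\mathcal{L}_n(\phi,\theta,\sigma) = \log\sigma^2+\frac{1}{n\sigma^2}\sum_{t=1}^{n}\Big(X_t-\sum_{j=1}^{\infty}b_j(\phi,\theta)X_{t-j}\Big)^2 = \log\sigma^2+\frac{1}{n\sigma^2}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)^2
\]
\[
= \log\sigma^2+\frac{1}{n\sigma^2}\sum_{t=1}^{n}\varepsilon_t^2-\frac{2}{n\sigma^2}\sum_{t=1}^{n}\varepsilon_t\sum_{j=1}^{\infty}\big(b_j(\phi,\theta)-b_j(\phi_0,\theta_0)\big)X_{t-j}+\frac{1}{n\sigma^2}\sum_{t=1}^{n}\Big(\sum_{j=1}^{\infty}\big(b_j(\phi,\theta)-b_j(\phi_0,\theta_0)\big)X_{t-j}\Big)^2.
\]

Remark 6.2.1 (Derivatives involving the Backshift operator) Consider the transformation
\[
\frac{1}{1-\theta B}X_t = \sum_{j=0}^{\infty}\theta^{j}B^{j}X_t = \sum_{j=0}^{\infty}\theta^{j}X_{t-j}.
\]
Suppose we want to differentiate the above with respect to $\theta$; there are two ways this can be done. Either differentiate $\sum_{j=0}^{\infty}\theta^{j}X_{t-j}$ with respect to $\theta$, or differentiate $\frac{1}{1-\theta B}$ with respect to $\theta$. In other words
\[
\frac{d}{d\theta}\frac{1}{1-\theta B}X_t = \frac{B}{(1-\theta B)^2}X_t = \sum_{j=0}^{\infty}j\theta^{j-1}X_{t-j}.
\]
Often it is easier to differentiate the operator. Suppose that $\theta(B)=1+\sum_{j=1}^{q}\theta_jB^{j}$ and $\phi(B)=1-\sum_{j=1}^{p}\phi_jB^{j}$; then we have
\[
\frac{d}{d\theta_j}\frac{\phi(B)}{\theta(B)}X_t = -\frac{B^{j}\phi(B)}{\theta(B)^2}X_t = -\frac{\phi(B)}{\theta(B)^2}X_{t-j}, \qquad
\frac{d}{d\phi_j}\frac{\phi(B)}{\theta(B)}X_t = -\frac{B^{j}}{\theta(B)}X_t = -\frac{1}{\theta(B)}X_{t-j}.
\]
Moreover in the case of squares we have
\[
\frac{d}{d\theta_j}\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)^2 = -2\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)\Big(\frac{\phi(B)}{\theta(B)^2}X_{t-j}\Big), \qquad
\frac{d}{d\phi_j}\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)^2 = -2\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)\Big(\frac{1}{\theta(B)}X_{t-j}\Big).
\]
Using the above we can easily evaluate the gradient of $\frac{1}{n}\mathcal{L}_n$:
\[
\frac{1}{n}\nabla_{\theta_i}\mathcal{L}_n(\phi,\theta,\sigma) = -\frac{2}{n\sigma^2}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)\frac{\phi(B)}{\theta(B)^2}X_{t-i},
\]
\[
\frac{1}{n}\nabla_{\phi_j}\mathcal{L}_n(\phi,\theta,\sigma) = -\frac{2}{n\sigma^2}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)\frac{1}{\theta(B)}X_{t-j},
\]
\[
\frac{1}{n}\nabla_{\sigma^2}\mathcal{L}_n(\phi,\theta,\sigma) = \frac{1}{\sigma^2}-\frac{1}{n\sigma^4}\sum_{t=1}^{n}\Big(X_t-\sum_{j=1}^{\infty}b_j(\phi,\theta)X_{t-j}\Big)^2. \tag{6.34}
\]
Let $\nabla=(\nabla_{\phi_i},\nabla_{\theta_j},\nabla_{\sigma^2})$. We note that the second derivative $\nabla^2\mathcal{L}_n$ can be defined similarly.

Lemma 6.2.1 Suppose Assumption 6.2.1 holds. Then
\[
\sup_{\phi,\theta\in\Theta}\Big\|\frac{1}{n}\nabla\mathcal{L}_n\Big\|_2 \le KS_n, \qquad \sup_{\phi,\theta\in\Theta}\Big\|\frac{1}{n}\nabla^3\mathcal{L}_n\Big\|_2 \le KS_n \tag{6.35}
\]
for some constant $K$, where
\[
S_n = \frac{1}{n}\sum_{r_1,r_2=0}^{\max(p,q)}\sum_{t=1}^{n}Y_{t-r_1}Y_{t-r_2} \tag{6.36}
\]
and $Y_t=K\sum_{j=0}^{\infty}\rho^{j}|X_{t-j}|$, for any $\frac{1}{1+\delta}<\rho<1$.

PROOF. The proof follows from the roots of $\phi(z)$ and $\theta(z)$ having absolute value greater than $1+\delta$.

Define the expectation of the likelihood $\mathcal{L}(\phi,\theta,\sigma)=E\big(\frac{1}{n}\mathcal{L}_n(\phi,\theta,\sigma)\big)$.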
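We observe
\[
\mathcal{L}(\phi,\theta,\sigma) = \log\sigma^2+\frac{\sigma_0^2}{\sigma^2}+\frac{1}{\sigma^2}E\big(Z_t(\phi,\theta)^2\big), \qquad\text{where}\quad Z_t(\phi,\theta)=\sum_{j=1}^{\infty}\big(b_j(\phi,\theta)-b_j(\phi_0,\theta_0)\big)X_{t-j}.
\]

Lemma 6.2.2 Suppose that Assumption 6.2.1 is satisfied. Then for all $\theta,\phi,\sigma\in\Theta$ we have

(i) $\frac{1}{n}\nabla^{i}\mathcal{L}_n(\phi,\theta,\sigma)\to\nabla^{i}\mathcal{L}(\phi,\theta,\sigma)$ almost surely, for $i=0,1,2,3$.

(ii) Let $S_n$ be defined as in (6.36); then $S_n\to\sum_{r_1,r_2=0}^{\max(p,q)}E(Y_{t-r_1}Y_{t-r_2})$ almost surely.

PROOF. Noting that the ARMA process $\{X_t\}$ is ergodic, $\{Z_t(\phi,\theta)\}$ and $\{Y_t\}$ are also ergodic, and the result follows immediately from the ergodic theorem.

We use these results in the proofs below.

Theorem 6.2.1 Suppose that Assumption 6.2.1 is satisfied. Let $(\hat\phi_n^{*},\hat\theta_n^{*},\hat\sigma_n^{*})=\arg\min\mathcal{L}_n(\phi,\theta,\sigma)$ (noting that in practice this cannot be evaluated). Then we have

(i) $(\hat\phi_n^{*},\hat\theta_n^{*},\hat\sigma_n^{*})\to(\phi_0,\theta_0,\sigma_0)$ almost surely.

(ii) $\sqrt{n}(\hat\phi_n^{*}-\phi_0,\hat\theta_n^{*}-\theta_0)\stackrel{D}{\to}N(0,\sigma_0^2\Lambda^{-1})$, where
\[
\Lambda = \begin{pmatrix} E(U_tU_t') & E(V_tU_t') \\ E(U_tV_t') & E(V_tV_t') \end{pmatrix}
\]
and $\{U_t\}$ and $\{V_t\}$ are autoregressive processes which satisfy $\phi_0(B)U_t=\varepsilon_t$ and $\theta_0(B)V_t=\varepsilon_t$.

PROOF. We prove the result in two stages below.

PROOF of Theorem 6.2.1(i) We will first prove Theorem 6.2.1(i). Noting the results in Section 5.4, to prove consistency we recall that we must show (a) that $(\phi_0,\theta_0,\sigma_0)$ is the unique minimum of $\mathcal{L}(\cdot)$, (b) pointwise convergence $\frac{1}{n}\mathcal{L}_n(\phi,\theta,\sigma)\to\mathcal{L}(\phi,\theta,\sigma)$ almost surely, and (c) stochastic equicontinuity (as defined in Definition 5.4.2). To show that $(\phi_0,\theta_0,\sigma_0)$ is the minimum we note that
\[
\mathcal{L}(\phi,\theta,\sigma)-\mathcal{L}(\phi_0,\theta_0,\sigma_0) = \log\Big(\frac{\sigma^2}{\sigma_0^2}\Big)+\frac{\sigma_0^2}{\sigma^2}-1+\frac{1}{\sigma^2}E\big(Z_t(\phi,\theta)^2\big).
\]
Since for all positive $x$ the function $x-1-\log x$ is non-negative (apply this with $x=\sigma_0^2/\sigma^2$), and $E(Z_t(\phi,\theta)^2)=E\big(\sum_{j=1}^{\infty}(b_j(\phi,\theta)-b_j(\phi_0,\theta_0))X_{t-j}\big)^2$ is non-negative and zero at $(\phi_0,\theta_0,\sigma_0)$, it is clear that $(\phi_0,\theta_0,\sigma_0)$ is a minimum of $\mathcal{L}$. We will assume for now that it is the unique minimum. Pointwise convergence is an immediate consequence of Lemma 6.2.2(i). To show stochastic equicontinuity we note that for any $\vartheta_1=(\phi_1,\theta_1,\sigma_1)$ and $\vartheta_2=(\phi_2,\theta_2,\sigma_2)$ we have by the mean value theorem
\[
\mathcal{L}_n(\phi_1,\theta_1,\sigma_1)-\mathcal{L}_n(\phi_2,\theta_2,\sigma_2) = (\vartheta_1-\vartheta_2)'\nabla\mathcal{L}_n(\bar\phi,\bar\theta,\bar\sigma).
\]
Now by using (6.35) we have
\[
\frac{1}{n}\big|\mathcal{L}_n(\phi_1,\theta_1,\sigma_1)-\mathcal{L}_n(\phi_2,\theta_2,\sigma_2)\big| \le KS_n\big\|(\phi_1-\phi_2),(\theta_1-\theta_2),(\sigma_1-\sigma_2)\big\|_2.
\]
By using Lemma 6.2.2(ii) we have $S_n\to\sum_{r_1,r_2=0}^{\max(p,q)}E(Y_{t-r_1}Y_{t-r_2})$ almost surely, hence $\{S_n\}$ is almost surely bounded. This implies that $\frac{1}{n}\mathcal{L}_n$ is equicontinuous. Since we have shown pointwise convergence and equicontinuity of $\frac{1}{n}\mathcal{L}_n$, by using Corollary 5.4.1 we have almost sure convergence of the estimator, thus proving (i).

PROOF of Theorem 6.2.1(ii) We now prove Theorem 6.2.1(ii) using the martingale central limit theorem (see Billingsley (1995) and Hall and Heyde (1980)) in conjunction with the Cramer-Wold device (see Theorem 5.7.1). Using the mean value theorem we have
\[
\hat\vartheta_n^{*}-\vartheta_0 = -\nabla^2\mathcal{L}_n(\bar\vartheta_n)^{-1}\nabla\mathcal{L}_n(\phi_0,\theta_0,\sigma_0),
\]
where $\hat\vartheta_n^{*}=(\hat\phi_n^{*},\hat\theta_n^{*},\hat\sigma_n^{*})$, $\vartheta_0=(\phi_0,\theta_0,\sigma_0)$ and $\bar\vartheta_n=(\bar\phi_n,\bar\theta_n,\bar\sigma_n)$ lies between $\hat\vartheta_n^{*}$ and $\vartheta_0$. Using the same techniques given in Theorem 6.2.1(i) and Lemma 6.2.2 we have pointwise convergence and equicontinuity of $\frac{1}{n}\nabla^2\mathcal{L}_n$. This means that $\frac{1}{n}\nabla^2\mathcal{L}_n(\bar\vartheta_n)\to E\big(\frac{1}{n}\nabla^2\mathcal{L}_n(\phi_0,\theta_0,\sigma_0)\big)=\frac{1}{\sigma^2}\Lambda$ almost surely (since by definition $\bar\vartheta_n\to\vartheta_0$ almost surely). Therefore by applying Slutsky's theorem (since $\Lambda$ is nonsingular) we have
\[
\Big(\frac{1}{n}\nabla^2\mathcal{L}_n(\bar\vartheta_n)\Big)^{-1}\to\sigma^2\Lambda^{-1} \quad\text{almost surely.} \tag{6.37}
\]
Now we show that $\nabla\mathcal{L}_n(\vartheta_0)$ is asymptotically normal. By using (6.34) and replacing $X_{t-i}=\phi_0(B)^{-1}\theta_0(B)\varepsilon_{t-i}$ we have
\[
\frac{1}{n}\nabla_{\theta_i}\mathcal{L}_n(\phi_0,\theta_0,\sigma_0) = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_t\frac{1}{\theta_0(B)}\varepsilon_{t-i} = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_tV_{t-i}, \qquad i=1,\ldots,q,
\]
\[
\frac{1}{n}\nabla_{\phi_j}\mathcal{L}_n(\phi_0,\theta_0,\sigma_0) = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_t\frac{1}{\phi_0(B)}\varepsilon_{t-j} = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_tU_{t-j}, \qquad j=1,\ldots,p,
\]
\[
\frac{1}{n}\nabla_{\sigma^2}\mathcal{L}_n(\phi_0,\theta_0,\sigma_0) = \frac{1}{\sigma^2}-\frac{1}{\sigma^4n}\sum_{t=1}^{n}\varepsilon_t^2 = \frac{1}{\sigma^4n}\sum_{t=1}^{n}(\sigma^2-\varepsilon_t^2),
\]
where $U_t=\frac{1}{\phi_0(B)}\varepsilon_t$ and $V_t=\frac{1}{\theta_0(B)}\varepsilon_t$. We observe that $\frac{1}{n}\nabla\mathcal{L}_n$ is the sum of vector martingale differences. If $E(\varepsilon_t^4)<\infty$, it is clear that $E((\varepsilon_tU_{t-j})^4)=E(\varepsilon_t^4)E(U_{t-j}^4)<\infty$, $E((\varepsilon_tV_{t-i})^4)=E(\varepsilon_t^4)E(V_{t-i}^4)<\infty$ and $E((\sigma^2-\varepsilon_t^2)^2)<\infty$. Hence Lindeberg's condition is satisfied (see the proof given in Section 5.8 for why this is true). Hence we have
\[
\frac{1}{\sqrt{n}}\nabla\mathcal{L}_n(\phi_0,\theta_0,\sigma_0)\stackrel{D}{\to}N(0,\Lambda).
\]
Now by using the above and (6.37) we have
\[
\sqrt{n}\big(\hat\vartheta_n^{*}-\vartheta_0\big) = -\sqrt{n}\,\nabla^2\mathcal{L}_n(\bar\vartheta_n)^{-1}\nabla\mathcal{L}_n(\vartheta_0) \quad\Rightarrow\quad \sqrt{n}\big(\hat\vartheta_n^{*}-\vartheta_0\big)\stackrel{D}{\to}N(0,\sigma^4\Lambda^{-1}).
\]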
Thus we obtain the required result.

The above result proves consistency and asymptotic normality of $(\hat\phi_n^{*},\hat\theta_n^{*},\hat\sigma_n^{*})$, which is based on $\mathcal{L}_n(\phi,\theta,\sigma)$, which in practice is impossible to evaluate. However we will show below that the Gaussian likelihood $L_n(\phi,\theta,\sigma)$ and its derivatives are sufficiently close to $\mathcal{L}_n(\phi,\theta,\sigma)$ such that the estimators $(\hat\phi_n^{*},\hat\theta_n^{*},\hat\sigma_n^{*})$ and the GMLE, $(\hat\phi_n,\hat\theta_n,\hat\sigma_n)=\arg\min L_n(\phi,\theta,\sigma)$, are asymptotically equivalent. We use Lemma 6.2.1 to prove the result below.

Proposition 6.2.2 Suppose that Assumption 6.2.1 holds and $L_n(\phi,\theta,\sigma)$, $\tilde L_n(\phi,\theta,\sigma)$ and $\mathcal{L}_n(\phi,\theta,\sigma)$ are defined as in (6.23), (6.25) and (6.27) respectively. Then for all $(\phi,\theta)\in\Theta$ we have almost surely
\[
\sup_{(\phi,\theta,\sigma)}\frac{1}{n}\big|\nabla^{(k)}\tilde L_n(\phi,\theta,\sigma)-\nabla^{k}\mathcal{L}_n(\phi,\theta,\sigma)\big| = O\Big(\frac{1}{n}\Big), \qquad
\sup_{(\phi,\theta,\sigma)}\frac{1}{n}\big|\tilde L_n(\phi,\theta,\sigma)-L_n(\phi,\theta,\sigma)\big| = O\Big(\frac{1}{n}\Big),
\]
for $k=0,1,2,3$.

PROOF. The proof of the result follows from (6.28) and (6.30). We show the result for $\sup_{(\phi,\theta,\sigma)}\frac{1}{n}|\tilde L_n(\phi,\theta,\sigma)-L_n(\phi,\theta,\sigma)|$; a similar proof can be used for the rest of the result. Let us consider the difference
\[
L_n(\phi,\theta)-\tilde L_n(\phi,\theta) = \frac{1}{n}\big(I_n+II_n+III_n\big),
\]
where
\[
I_n = \sum_{t=1}^{n-1}\big(r_t(\sigma,\phi,\theta)-\sigma^2\big), \qquad
II_n = \sum_{t=1}^{n-1}\frac{1}{r_t(\sigma,\phi,\theta)}\big(X_{t+1}-X_{t+1|t}^{(\phi,\theta)}\big)^2,
\]
\[
III_n = \sum_{t=1}^{n-1}\frac{1}{\sigma^2}\Big[2X_{t+1}\big(X_{t+1|t}^{(\phi,\theta)}-\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}\big)+\big((X_{t+1|t}^{(\phi,\theta)})^2-(\tilde X_{t+1|t,\ldots}^{(\phi,\theta)})^2\big)\Big].
\]
Now we recall from Proposition 6.2.1 that
\[
\big|X_{t+1|t}^{(\phi,\theta)}-\tilde X_{t+1|t,\ldots}^{(\phi,\theta)}\big| \le K\,V_t\frac{\rho^{t}}{1-\rho},
\]
where $V_t=\sum_{i=1}^{t}\rho^{i}|X_i|$. Hence, since $E(X_t^2)<\infty$ and $E(V_t^2)<\infty$, we have that $\sup_nE|I_n|<\infty$, $\sup_nE|II_n|<\infty$ and $\sup_nE|III_n|<\infty$. Hence the sequence $\{|I_n+II_n+III_n|\}_n$ is almost surely bounded. This means that almost surely
\[
\sup_{\phi,\theta,\sigma}\big|L_n(\phi,\theta)-\tilde L_n(\phi,\theta)\big| = O\Big(\frac{1}{n}\Big),
\]
thus giving the required result.

Now by using the above proposition the result below immediately follows.

Theorem 6.2.2 Let $(\hat\phi,\hat\theta)=\arg\min L_n(\phi,\theta,\sigma)$ and $(\tilde\phi,\tilde\theta)=\arg\min\tilde L_n(\phi,\theta,\sigma)$. Then

(i) $(\hat\phi,\hat\theta)\to(\phi_0,\theta_0)$ and $(\tilde\phi,\tilde\theta)\to(\phi_0,\theta_0)$ almost surely.

(ii) $\sqrt{n}(\hat\phi_n-\phi_0,\hat\theta_n-\theta_0)\stackrel{D}{\to}N(0,\sigma_0^4\Lambda^{-1})$ and $\sqrt{n}(\tilde\phi_n-\phi_0,\tilde\theta_n-\theta_0)\stackrel{D}{\to}N(0,\sigma_0^4\Lambda^{-1})$.

PROOF. The proof follows immediately from Proposition 6.2.2.

Chapter 7

Residual Bootstrap for estimation in autoregressive processes

In Chapter 6 we considered the asymptotic sampling properties of several estimators, including the least squares estimator of the autoregressive parameters and the Gaussian maximum likelihood estimator used to estimate the parameters of an ARMA process. The asymptotic distributions are often used for statistical testing and constructing confidence intervals. However the results are asymptotic, and only hold (approximately) when the sample size is relatively large. When the sample size is smaller, the normal approximation may not be accurate and better approximations are sought. Even in the case where we are willing to use the asymptotic distribution, often we need to obtain expressions for the variance or bias. Sometimes this may not be possible, or only possible with excessive effort. The bootstrap is a powerful tool which allows one to approximate certain characteristics. To quote from Wikipedia, `Bootstrap is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution'. The bootstrap essentially samples from the sample. Each subsample is treated like a new sample from a population. Using these `new' multiple realisations one can obtain approximations for confidence intervals and variance estimates for the parameter estimates. Of course in reality we do not have multiple realisations; we are sampling from the sample. Thus we are not gaining more information as we subsample more. But we do gain some insight into the finite sample distribution. In this chapter we will detail the residual bootstrap method, and then show that asymptotically the bootstrap distribution coincides with the asymptotic distribution.

The residual bootstrap method was first proposed by J. P.
Kreiss (Kreiss (1997) is a very nice review paper on the subject; see also Franke and Kreiss (1992), where an extension to AR($\infty$) processes is also given). One of the first theoretical papers on the bootstrap is Bickel and Freedman (1981). There are several other bootstrapping methods for time series; these include bootstrapping the periodogram, the block bootstrap and bootstrapping the Kalman filter (Stoffer and Wall (1991), Stoffer and Wall (2004) and Shumway and Stoffer (2006)). These methods have not only been used for variance estimation but also for determining orders etc. At this point it is worth mentioning that frequency domain approaches are considered in Dahlhaus and Janas (1996) and Franke and Härdle (1992) (a review of subsampling methods can be found in Politis et al. (1999)).

7.1 The residual bootstrap

Suppose that the time series $\{X_t\}$ satisfies the stationary, causal AR process
\[
X_t = \sum_{j=1}^{p}\phi_jX_{t-j}+\varepsilon_t,
\]
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one, and the roots of the characteristic polynomial have absolute value greater than $(1+\delta)$. We will suppose that the order $p$ is known.

The residual bootstrap for autoregressive processes

(i) Let
\[
\hat\Gamma_p = \frac{1}{n}\sum_{t=p+1}^{n}X_{t-1}X_{t-1}' \quad\text{and}\quad \hat\gamma_p = \frac{1}{n}\sum_{t=p+1}^{n}X_{t-1}X_t, \tag{7.1}
\]
where $X_t'=(X_t,\ldots,X_{t-p+1})$. We use $\hat\phi_n=(\hat\phi_1,\ldots,\hat\phi_p)'=\hat\Gamma_p^{-1}\hat\gamma_p$ as an estimator of $\phi=(\phi_1,\ldots,\phi_p)$.

(ii) We create the bootstrap sample by first estimating the residuals $\{\varepsilon_t\}$ and sampling from the residuals. Let
\[
\hat\varepsilon_t = X_t-\sum_{j=1}^{p}\hat\phi_jX_{t-j}.
\]

(iii) Now create the empirical distribution function based on $\hat\varepsilon_t$. Let
\[
\hat F_n(x) = \frac{1}{n-p}\sum_{t=p+1}^{n}I(\hat\varepsilon_t\le x).
\]

(iv) Sample independently from the distribution $\hat F_n(x)$ $n$ times. Label this sample as $\{\varepsilon_k^{+}\}$. We notice that sampling from the distribution $\hat F_n(x)$ means observing $\hat\varepsilon_t$ with probability $(n-p)^{-1}$.

(v) Let $X_k^{+}=\varepsilon_k^{+}$ for $1\le k\le p$ and
\[
X_k^{+} = \sum_{j=1}^{p}\hat\phi_jX_{k-j}^{+}+\varepsilon_k^{+}, \qquad p<k\le n.
\]

(vi) We call $\{X_k^{+}\}$ the bootstrap sample. Repeating steps (iv) and (v) $N$ times gives us $N$ bootstrap samples. To distinguish each sample we can label each bootstrap sample as $\{(X_k^{+})^{(i)}\}$, $i=1,\ldots,N$.

(vii) For each bootstrap sample we can construct a bootstrap matrix, vector and estimator $(\Gamma_p^{+})^{(i)}$, $(\gamma_p^{+})^{(i)}$ and $(\hat\phi_n^{+})^{(i)}=((\Gamma_p^{+})^{(i)})^{-1}(\gamma_p^{+})^{(i)}$.

(viii) Using $(\hat\phi_n^{+})^{(i)}$ we can estimate the variance of $\hat\phi_n-\phi$ with $\frac{1}{N}\sum_{i=1}^{N}\big((\hat\phi_n^{+})^{(i)}-\hat\phi_n\big)\big((\hat\phi_n^{+})^{(i)}-\hat\phi_n\big)'$, and the distribution function of $\hat\phi_n-\phi$ with the empirical distribution of $\{(\hat\phi_n^{+})^{(i)}-\hat\phi_n\}_{i=1}^{N}$.

7.2 The sampling properties of the residual bootstrap estimator

In this section we show that the distributions of $\sqrt{n}(\hat\phi_n^{+}-\hat\phi_n)$ and $\sqrt{n}(\hat\phi_n-\phi)$ asymptotically coincide. This means that using the bootstrap distribution is no worse than using the asymptotic normal approximation. However it does not say that the bootstrap distribution better approximates the finite sample distribution of $(\hat\phi_n-\phi)$; to show this one would have to use Edgeworth expansion methods.

In order to show that the distribution of the bootstrap sample $\sqrt{n}(\hat\phi_n^{+}-\hat\phi_n)$ asymptotically coincides with the asymptotic distribution of $\sqrt{n}(\hat\phi_n-\phi)$, we will show convergence of the distributions under the following distance
\[
d_p(H,G) = \inf_{X\sim H,\,Y\sim G}\{E|X-Y|^{p}\}^{1/p},
\]
where $p\ge1$. Roughly speaking, if $d_p(F_n,G_n)\to0$, then the limiting distributions of $F_n$ and $G_n$ are the same (see Bickel and Freedman (1981)). The case $p=2$ is the most commonly used, and for $p=2$ this is called the Mallows distance. The Mallows distance between the distributions $H$ and $G$ is defined as
\[
d_2(H,G) = \inf_{X\sim H,\,Y\sim G}\{E(X-Y)^2\}^{1/2};
\]
we will use the Mallows distance to prove the results below. It is worth mentioning that the distance is zero when $H$ and $G$ are the same (as a distance should be). To see this, take the joint distribution of $X$ and $Y$ to be concentrated on the diagonal, that is $Y=X$ with $X\sim H=G$; then it is clear that $d_2(H,G)=0$. To reduce notation, rather than specify the distributions $H$ and $G$, we let $d_p(X,Y)=d_p(H,G)$, where the random variables $X$ and $Y$ have the marginal distributions $H$ and $G$, respectively.
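To make the Mallows distance concrete, the following Python sketch (an illustration added here, not part of the original notes) evaluates $d_2$ between two empirical distributions with the same number of atoms. It relies on the standard fact that, for distributions on the real line, the infimum over joint distributions is attained by pairing the order statistics, so that $d_2^2=\frac{1}{n}\sum_i(x_{(i)}-y_{(i)})^2$. The function name `mallows_d2` and the equal-sample-size restriction are assumptions of this sketch.

```python
import numpy as np


def mallows_d2(x, y):
    """Mallows (d_2) distance between the empirical distributions of x and y.

    Assumes both samples have the same size, so each atom has mass 1/n and
    the optimal coupling simply matches the sorted values.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.sqrt(np.mean((x - y) ** 2)))


# The distance depends only on the empirical distributions, not on the
# ordering of the samples, and a pure shift by c gives distance |c|.
same = mallows_d2([3.0, 1.0, 2.0], [2.0, 3.0, 1.0])
shifted = mallows_d2([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(same, shifted)
```

In the bootstrap setting this is the quantity that the coupling argument bounds: pairing the estimated residuals with the true residuals gives one particular joint distribution and hence an upper bound on $d_2$.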
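We mention that the distance $d_p$ satisfies the triangle inequality. The main application of showing that $d_p(F_n,G_n)\to0$ is stated in the following lemma, which is a version of Lemma 8.3, Bickel and Freedman (1981).

Lemma 7.2.1 Let $\alpha$, $\alpha_n$ be probability measures. Then $d_p(\alpha_n,\alpha)\to0$ if and only if
\[
E_{\alpha_n}(|X|^{p}) = \int|x|^{p}\alpha_n(dx)\to E_{\alpha}(|X|^{p}) = \int|x|^{p}\alpha(dx) \qquad\text{as } n\to\infty,
\]
and the distribution $\alpha_n$ converges weakly to the distribution $\alpha$.

Our aim is to show that
\[
d_2\big(\sqrt{n}(\hat\phi_n^{+}-\hat\phi_n),\ \sqrt{n}(\hat\phi_n-\phi)\big)\to0,
\]
which implies that their distributions asymptotically coincide. To do this we use
\[
\sqrt{n}(\hat\phi_n-\phi) = \sqrt{n}\,\hat\Gamma_p^{-1}(\hat\gamma_p-\hat\Gamma_p\phi), \qquad
\sqrt{n}(\hat\phi_n^{+}-\hat\phi_n) = \sqrt{n}\,(\Gamma_p^{+})^{-1}(\gamma_p^{+}-\Gamma_p^{+}\hat\phi_n).
\]
Studying how $\hat\Gamma_p$, $\hat\gamma_p$, $\Gamma_p^{+}$ and $\gamma_p^{+}$ are constructed, we see that as a starting point we need to show
\[
d_2(X_t,X_t^{+})\to0 \qquad\text{as } t,n\to\infty.
\]
We start by showing that
\[
d_2(Z_t,Z_t^{+})\to0 \qquad\text{as } n\to\infty.
\]

Lemma 7.2.2 Suppose $\varepsilon_t^{+}$ are the bootstrap residuals and $\varepsilon_t$ are the true residuals. Define the discrete random variable $J$ taking values in $\{p+1,\ldots,n\}$ with $P(J=k)=\frac{1}{n-p}$. Then
\[
E\big((\hat\varepsilon_J-\varepsilon_J)^2\,|\,X_1,\ldots,X_n\big) = O_p\Big(\frac{1}{n}\Big) \tag{7.2}
\]
and
\[
d_2(\hat F_n,F) \le d_2(\hat F_n,F_n)+d_2(F_n,F)\stackrel{P}{\to}0 \qquad\text{as } n\to\infty, \tag{7.3}
\]
where $F_n(x)=\frac{1}{n-p}\sum_{t=p+1}^{n}I(\varepsilon_t\le x)$ and $\hat F_n(x)=\frac{1}{n-p}\sum_{t=p+1}^{n}I(\hat\varepsilon_t\le x)$ are the empirical distribution functions based on the residuals $\{\varepsilon_t\}_{t=p+1}^{n}$ and the estimated residuals $\{\hat\varepsilon_t\}_{t=p+1}^{n}$, and $F$ is the distribution function of the residual $\varepsilon_t$.

PROOF. We first show (7.2). From the definition of $\hat\varepsilon_J$ and $\varepsilon_J$ we have
\[
E\big(|\hat\varepsilon_J-\varepsilon_J|^2\,|\,X_1,\ldots,X_n\big) = \frac{1}{n-p}\sum_{t=p+1}^{n}(\hat\varepsilon_t-\varepsilon_t)^2
= \frac{1}{n-p}\sum_{t=p+1}^{n}\Big(\sum_{j=1}^{p}[\hat\phi_j-\phi_j]X_{t-j}\Big)^2
= \sum_{j_1,j_2=1}^{p}[\hat\phi_{j_1}-\phi_{j_1}][\hat\phi_{j_2}-\phi_{j_2}]\frac{1}{n-p}\sum_{t=p+1}^{n}X_{t-j_1}X_{t-j_2}.
\]
Now by using (5.27) we have $\sup_{1\le j\le p}|\hat\phi_j-\phi_j|=O_p(n^{-1/2})$, therefore we have $E(|\hat\varepsilon_J-\varepsilon_J|^2\,|\,X_1,\ldots,X_n)=O_p(n^{-1})$.

We now prove (7.3). We first note that by the triangle inequality we have
\[
d_2(F,\hat F_n) \le d_2(F,F_n)+d_2(\hat F_n,F_n).
\]
By using Lemma 8.4, Bickel and Freedman (1981), we have that $d_2(F_n,F)\to0$. Therefore we need to show that $d_2(\hat F_n,F_n)\to0$.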
It is clear by definition that d2 (Fn , Fn ) = d2 (+ , t ), where + t ~ t n 1 ^n = 1 I(-,^t ) (x) and t is sampled from Fn = n-1 n ~ I(-,t ) (x). is sampled from F t=p+1 t=p+1 n-1 Hence, t + have the same distribution as J and J . We now evaluate d2 (+ , t ). To evaluate ~ ~t ^ t ~ ^ ~ d2 (+ , t ) = inf + Fn ,~t Fn E|+ - t | we need that the marginal distributions of (+ , t ) are Fn ^ t t ~ t ~ t and Fn , but the infimum is over all joint distributions. It is best to choose a joint distribution which is highly dependent (because this minimises the distance between the two random variables). An ideal candidate is to suppose that + = J and t = J , since these have the ^ ~ t ^ marginals Fn and Fn respectively. Therefore ^ d2 (Fn , Fn )2 = 1 E|+ - t |2 E (^J - J )2 |X1 , . . . , Xn = Op ( ), ~ t ^ n + Fn ,~t Fn t inf P ^ where the above rate comes from (7.2). This means that d2 (Fn , Fn ) 0, hence we obtain (7.3). Corollary 7.2.1 Suppose + is the bootstrapped residual. Then we have t EFn ((+ )2 |X1 , . . . , Xn ) EF (2 ) ^ t t 83 P PROOF. The proof follows from Lemma 7.2.1 and Lemma 7.2.2. We recall that since Xt is a causal autoregressive process, there exists some coefficients {a j } such that Xt = j=0 aj t-j , where aj = aj () = [A()j ]1,1 = [Aj ]1,1 (see Lemma 2.2.1). Similarly using the estimated + ^ parameters n we can write Xt as t + Xt = j=0 + ^ ^ where aj (n ) = [A(n )j ]1,1 . We now show that d2 (Xt , Xt ) 0 as n and t . ^ aj (n )+ , t-j Lemma 7.2.3 Let Jp+1 , . . . , Jn be independent samples from {n - p + 1, . . . , n} with P (Ji = 1 k) = n-p . Define t t t Yt+ = j=p+1 ^ aj (n )+t-j , J ~ Yt+ = j=p+1 ^ aj (n )+t-j , J ~ Yt = j=p+1 aj Jt-j , ~ Yt = Yt + j=t+p+1 aj t-j , ^ ^ where Jj is a sample from {p+1 , . . . , n } and J is a sample from {^p+1 , . . . , n }. Then we have 1 ~ E (Yt+ - Yt+ )2 |X1 , . . . , Xn = Op ( ), n 1 ~ ~ E (Yt+ - Yt )2 |X1 , . . . , Xn = Op ( ), n and ~ E (Yt - Yt )2 |X1 , . . . , Xn Kt , ~ d2 (Y t , Y t ) 0 n . 
(7.6) ~ d2 (Yt+ , Yt+ ) 0 ~ ~ d2 (Yt+ , Yt ) 0 n , n , (7.4) (7.5) PROOF. We first prove (7.4). It is clear from the definitions that t E (Yt+ ~ - Yt+ )2 | X1 , . . . , Xn j=0 ^ ([A()j ]1,1 - [A(n )j ]1,1 )2 E((+ )2 |X1 , . . . , Xn ). j (7.7) P Using Lemma 7.2.1 we have that E((+ )2 |X1 , . . . , Xn ) is the same for all j and E((+ )2 |X1 , . . . , Xn ) j j ^ ^ E(2 ), hence we will consider for now ([A()j ]1,1 -[A(n )j ]1,1 )2 . Using (5.27) we have (n -) = t ^ ^ Op (n-1/2 ), therefore by the mean value theorem we have [A() - A(n )| = (n - )D K D n (for some random matrix D). Hence ^ A(n )j = (A() + K j K D) = A()j 1 + A()-1 n n 84 j (note these are heuristic bounds, and this argument needs to be made precise). Applying the mean value theorem again we have A()j 1 + A()-1 where B is such that B spec K D n j = A()j + K K D A()j (1 + A()-1 B)j , n n K nD . Altogether this gives K K D A()j (1 + A()-1 B)j . n n ^ |[A()j - A(n )j ]1,1 | Notice that for large enough n, (1 + A()-1 K B)j is increasing slower (as n ) than A()j n is contracting. Therefore for a large enough n we have K ^ [A()j - A(n )j ]1,1 1/2 j , n for any 1 1+ < < 1. Subsituting this into (7.7) gives 1 K j = Op ( ) 0 n . E((+ )2 ) t n n1/2 j=0 t ~ E (Yt+ - Yt+ )2 |X1 , . . . , Xn ~ hence d2 (Yt+ , Yt+ ) 0 as n . We now prove (7.5). We see that t t ~ ~ E (Yt+ - Yt )2 |X1 , . . . , Xn = a2 E(^Jt-j j j=0 - Jt-j ) = E(^Jt-j - Jt-j ) 2 2 j=0 a2 . j (7.8) ~ ~ Now by substituting (7.2) into the above we have E(Yt+ - Yt )2 = O(n-1 ), as required. This + ~ ~ means that d2 (Yt , Yt ) 0. Finally we prove (7.6). We see that ~ E (Yt - Yt )2 |X1 , . . . , Xn = j=t+1 a2 E(2 ). j t (7.9) ~ Using (2.7) we have E(Yt - Yt )2 Kt , thus giving us (7.6). We can now almost prove the result. To do this we note that ^ (^p - p ) = 1 n-p n n t Xt-1 , t=p+1 + (p - ^ + n ) p 1 = n-p + X + . t t-1 t=p+1 (7.10) ~ ~ p Lemma 7.2.4 Let Yt , Yt+ , Yt+ and Yt , be defined as in Lemma 7.2.3. 
Define $\tilde{\Gamma}_p$ and $\tilde{\Gamma}_p^+$, $\tilde{\phi}_p$ and $\tilde{\phi}_p^+$ in the same way as $\hat{\Gamma}_p$ and $\hat{\Gamma}_p^+$, $\hat{\phi}_p$ and $\hat{\phi}_p^+$ defined in (7.1), but using $Y_t$ and $Y_t^+$ defined in Lemma 7.2.3, respectively, rather than $X_t$ and $X_t^+$. We have that

$$d_2(Y_t, Y_t^+) \le \{E(Y_t - Y_t^+)^2\}^{1/2} = O_p\big(K(n^{-1/2} + \rho^t)\big), \tag{7.11}$$

$$d_2(Y_t, X_t) \to 0, \tag{7.12}$$

and

$$d_2\Big(\sqrt{n}\,\tilde{\Gamma}_p(\tilde{\phi}_p - \phi_p),\ \sqrt{n}\,\tilde{\Gamma}_p^+(\tilde{\phi}_p^+ - \hat{\phi}_n)\Big)^2 \le n E\Big(\tilde{\Gamma}_p(\tilde{\phi}_p - \phi_p) - \tilde{\Gamma}_p^+(\tilde{\phi}_p^+ - \hat{\phi}_n)\Big)^2 \to 0 \quad n \to \infty. \tag{7.13}$$

Furthermore we have

$$E|\tilde{\Gamma}_p^+ - \tilde{\Gamma}_p| \to 0, \tag{7.14}$$

$$d_2\big((\tilde{\phi}_p - \phi_p), (\hat{\phi}_p - \phi_p)\big) \to 0, \qquad E|\tilde{\Gamma}_p - \hat{\Gamma}_p| \to 0 \quad n \to \infty. \tag{7.15}$$

PROOF. We first prove (7.11). Using the triangle inequality we have

$$\{E((Y_t - Y_t^+)^2 | X_1, \ldots, X_n)\}^{1/2} \le \{E((Y_t - \tilde{Y}_t)^2 | X_1, \ldots, X_n)\}^{1/2} + \{E((\tilde{Y}_t - \tilde{Y}_t^+)^2 | X_1, \ldots, X_n)\}^{1/2} + \{E((\tilde{Y}_t^+ - Y_t^+)^2 | X_1, \ldots, X_n)\}^{1/2} = O(n^{-1/2} + \rho^t),$$

where we use Lemma 7.2.3 to get the final bound. Therefore by the definition of $d_2$ we have (7.11).

To prove (7.12) we note that the only difference between $Y_t$ and $X_t$ is that the $\{\varepsilon_{J_k}\}$ in $Y_t$ are sampled from $\{\varepsilon_{p+1}, \ldots, \varepsilon_n\}$, hence sampled from $F_n$, whereas the $\{\varepsilon_t\}_{t=p+1}^{n}$ in $X_t$ are iid random variables with distribution $F$. Since $d_2(F_n, F) \to 0$ (Bickel and Freedman (1981), Lemma 8.4) it follows that $d_2(Y_t, X_t) \to 0$, thus proving (7.12).

To prove (7.13) we consider the difference $\tilde{\Gamma}_p(\tilde{\phi}_p - \phi_p) - \tilde{\Gamma}_p^+(\tilde{\phi}_p^+ - \hat{\phi}_n)$ and use (7.10) to get

$$\frac{1}{n} \sum_{t=p+1}^{n} \big(\varepsilon_t Y_{t-1} - \varepsilon_t^+ Y_{t-1}^+\big) = \frac{1}{n} \sum_{t=p+1}^{n} \big((\varepsilon_t - \varepsilon_t^+) Y_{t-1} + \varepsilon_t^+ (Y_{t-1} - Y_{t-1}^+)\big),$$

where we note that $Y_{t-1} = (Y_{t-1}, \ldots, Y_{t-p})'$ and $Y_{t-1}^+ = (Y_{t-1}^+, \ldots, Y_{t-p}^+)'$. Using the above, taking conditional expectations with respect to $\{X_1, \ldots, X_n\}$ and noting that, conditioned on $\{X_1, \ldots, X_n\}$, $(\varepsilon_t - \varepsilon_t^+)$ are independent of $X_k$ and $X_k^+$ for $k < t$, we have

$$\Big\{E\Big[\Big(\frac{1}{n} \sum_{t=p+1}^{n} \big(\varepsilon_t Y_{t-1} - \varepsilon_t^+ Y_{t-1}^+\big)\Big)^2 \Big| X_1, \ldots, X_n\Big]\Big\}^{1/2} \le I + II,$$

where

$$I = \frac{1}{n} \sum_{t=p+1}^{n} \{E((\varepsilon_t - \varepsilon_t^+)^2 | X_1, \ldots, X_n)\}^{1/2} \{E(Y_{t-1}^2 | X_1, \ldots, X_n)\}^{1/2} = \{E((\varepsilon_t - \varepsilon_t^+)^2 | X_1, \ldots, X_n)\}^{1/2}\, \frac{1}{n} \sum_{t=p+1}^{n} \{E(Y_{t-1}^2 | X_1, \ldots, X_n)\}^{1/2},$$

$$II = \frac{1}{n} \sum_{t=p+1}^{n} \{E((\varepsilon_t^+)^2 | X_1, \ldots, X_n)\}^{1/2} \{E((Y_{t-1} - Y_{t-1}^+)^2 | X_1, \ldots, X_n)\}^{1/2} = \{E((\varepsilon_t^+)^2 | X_1, \ldots, X_n)\}^{1/2}\, \frac{1}{n} \sum_{t=p+1}^{n} \{E((Y_{t-1} - Y_{t-1}^+)^2 | X_1, \ldots, X_n)\}^{1/2}.$$

Now by using (7.2) we have $I \le Kn^{-1/2}$, and by using (7.11) and Corollary 7.2.1 we obtain $II \le Kn^{-1/2}$, hence we have (7.13). Using a similar technique to that given above we can prove (7.14). (7.15) follows from (7.13), (7.14) and (7.12).

Corollary 7.2.2 Let $\hat{\Gamma}_p^+$, $\hat{\Gamma}_p$, $\hat{\phi}_p$ and $\hat{\phi}_p^+$ be defined in (7.1). Then we have

$$d_2\Big(\sqrt{n}\,\hat{\Gamma}_p(\hat{\phi}_p - \phi_p),\ \sqrt{n}\,\hat{\Gamma}_p^+(\hat{\phi}_p^+ - \hat{\phi}_n)\Big) \to 0, \tag{7.16}$$

$$d_1(\hat{\Gamma}_p^+, \hat{\Gamma}_p) \to 0, \tag{7.17}$$

as $n \to \infty$.

PROOF. We first prove (7.16). Using (7.13), (7.15) and the triangle inequality gives (7.16). To prove (7.17) we use (7.14), (7.15) and the triangle inequality, and (7.17) immediately follows.

Now by using (7.17) and Lemma 7.2.1 we have $\hat{\Gamma}_p^+ \stackrel{P}{\to} E(\Gamma_p)$, and by using (7.16), the distribution of $\sqrt{n}\,\hat{\Gamma}_p^+(\hat{\phi}_p^+ - \hat{\phi}_n)$ converges weakly to the distribution of $\sqrt{n}\,\hat{\Gamma}_p(\hat{\phi}_p - \phi_p)$. Therefore

$$\sqrt{n}(\hat{\phi}_n^+ - \hat{\phi}_n) \stackrel{D}{\to} N(0, \sigma^2 \Gamma_p^{-1}),$$

hence the distributions of $\sqrt{n}\,\hat{\Gamma}_p(\hat{\phi}_p - \phi_p)$ and $\sqrt{n}\,\hat{\Gamma}_p^+(\hat{\phi}_p^+ - \hat{\phi}_n)$ asymptotically coincide. From (5.28) we have $\sqrt{n}(\hat{\phi}_n - \phi) \stackrel{D}{\to} N(0, \sigma^2 \Gamma_p^{-1})$. Thus we see that the distributions of $\sqrt{n}(\hat{\phi}_n - \phi)$ and $\sqrt{n}(\hat{\phi}_n^+ - \hat{\phi}_n)$ asymptotically coincide.

Chapter 8 Spectral Analysis

8.1 Some Fourier background

The background given here is extremely sketchy (to say the least); for a more thorough background the reader is referred, for example, to Priestley (1983), Chapter 4 and Fuller (1995), Chapter 3.

(i) Fourier transforms of finite sequences. It is straightforward to show (by using that $\sum_{j=1}^{n} \exp(i2\pi jk/n) = 0$ whenever $k$ is not a multiple of $n$) that if

$$d_k = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} x_j \exp(i2\pi jk/n),$$

then $\{x_r\}$ can be recovered by inverting this transformation:

$$x_r = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} d_k \exp(-i2\pi rk/n).$$

(ii) Fourier sums and integrals. Of course the above only has meaning when $\{x_k\}$ is a finite sequence. However suppose that $\{x_k\}$ is a sequence which belongs to $\ell_2$ (that is $\sum_k x_k^2 < \infty$), then we can define the function

$$f(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{k=-\infty}^{\infty} x_k \exp(ik\omega),$$

where $\int_0^{2\pi} |f(\omega)|^2 d\omega = \sum_k x_k^2$, and we can recover $\{x_k\}$ from $f(\omega)$. That is

$$x_k = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} f(\omega) \exp(-ik\omega)\, d\omega.$$

(iii) Convolutions. Let us suppose that the Fourier transform of the sequence $\{a_k\}$ is $A(\omega) = \frac{1}{\sqrt{2\pi}} \sum_k a_k \exp(ik\omega)$ and the Fourier transform of the sequence $\{b_k\}$ is $B(\omega) = \frac{1}{\sqrt{2\pi}} \sum_k b_k \exp(ik\omega)$. Then

$$\sum_{j=-\infty}^{\infty} a_j b_{k-j} = \int_0^{2\pi} A(\omega) B(\omega) \exp(-ik\omega)\, d\omega, \qquad \sum_{j=-\infty}^{\infty} a_j b_j \exp(ij\omega) = \int_0^{2\pi} A(\lambda) B(\omega - \lambda)\, d\lambda. \tag{8.1}$$

8.2 Motivation

To give a taster of the spectral representations below let us consider the following example. Suppose that $\{X_t\}_{t=1}^{n}$ is a stationary time series. The Fourier transform of this sequence is

$$J_n(\omega_j) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} X_t \exp(it\omega_j),$$

where $\omega_j = 2\pi j/n$ (these are often called the fundamental frequencies). Using (i) above we see that

$$X_t = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} J_n(\omega_k) \exp(-it\omega_k). \tag{8.2}$$

This is just the inverse Fourier transform; however, $\{J_n(\omega_j)\}$ has some interesting properties. Under certain conditions it can be shown that $\mathrm{cov}(J_n(\omega_s), J_n(\omega_t)) \approx 0$ if $s \ne t$. So in some sense (8.2) can be considered as the decomposition of $X_t$ in terms of frequencies whose amplitudes are uncorrelated. Now if we let $f_n(\omega) = n^{-1} E(|J_n(\omega)|^2)$, and take the above argument further, we see that

$$c(k) = \mathrm{cov}(X_t, X_{t+k}) \approx \frac{1}{n} \sum_{s=1}^{n} E(|J_n(\omega_s)|^2) \exp(it\omega_s - i(t+k)\omega_s) = \sum_{s=1}^{n} f_n(\omega_s) \exp(-ik\omega_s). \tag{8.3}$$

For more details on this see Priestley (1983), Section 4.11 (pages 259-261). Note that the above can be considered as the eigen decomposition of the stationary covariance function, since

$$c(u, v) = c(u - v) = \sum_{s=1}^{n} f_n(\omega_s) \exp(iu\omega_s) \exp(-iv\omega_s),$$

where $\{\exp(it\omega_s)\}$ are the eigenfunctions and $f_n(\omega_s)$ the eigenvalues.

Of course the entire time series $\{X_t\}$ will have infinite length (and in general it will not belong to $\ell_2$), so it is natural to ask whether the above results can be generalised to $\{X_t\}$.
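The finite-sequence transform in (i), which is also what (8.2) uses, can be checked numerically. The following is a minimal sketch, assuming the symmetric $1/\sqrt{n}$ normalisation in both directions; the function names `dft` and `inverse_dft` and the test sequence are illustrative choices, not part of the notes. It confirms that the inverse transform recovers the sequence and that the sum of squares is preserved.

```python
import cmath
import math


def dft(x):
    """d_k = n^{-1/2} * sum_{t=1}^{n} x_t exp(i 2 pi t k / n), for k = 1, ..., n."""
    n = len(x)
    return [
        sum(x[t - 1] * cmath.exp(2j * math.pi * t * k / n) for t in range(1, n + 1))
        / math.sqrt(n)
        for k in range(1, n + 1)
    ]


def inverse_dft(d):
    """x_r = n^{-1/2} * sum_{k=1}^{n} d_k exp(-i 2 pi r k / n), for r = 1, ..., n."""
    n = len(d)
    return [
        sum(d[k - 1] * cmath.exp(-2j * math.pi * r * k / n) for k in range(1, n + 1))
        / math.sqrt(n)
        for r in range(1, n + 1)
    ]


x = [1.0, -2.0, 0.5, 3.0, 0.0, -1.5]   # arbitrary illustrative sequence
d = dft(x)
x_back = inverse_dft(d)

# Largest reconstruction error; should be at floating point rounding level.
max_err = max(abs(a - b) for a, b in zip(x, x_back))

# With the symmetric normalisation the transform preserves the sum of squares.
energy_time = sum(v * v for v in x)
energy_freq = sum(abs(v) ** 2 for v in d)
print(max_err, energy_time, energy_freq)
```

The direct double sum costs $O(n^2)$ operations, which is fine for illustration; in practice a fast Fourier transform would be used.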
The results above can indeed be generalised to the entire time series, by replacing the sum in (8.3) by an integral to obtain

$$c(k) = \int_0^{2\pi} \exp(ik\omega)\, dF(\omega),$$

where $F(\omega)$ is a positive nondecreasing function. Comparing with (8.3), we observe that $f_n(\omega_k)$ is a positive function, thus its integral (the equivalent of $F(\omega)$) is positive and nondecreasing. Therefore heuristically we can suppose that $F(\omega) \approx \int_0^{\omega} f_n(\lambda)\, d\lambda$. Moreover the analogue of (8.2) is

$$X_t = \int \exp(it\omega)\, dZ(\omega),$$

where $Z(\omega)$ is a right continuous orthogonal increment process (that is $E\big((Z(\omega_1) - Z(\omega_2))(Z(\omega_3) - Z(\omega_4))\big) = 0$ when the intervals $[\omega_1, \omega_2]$ and $[\omega_3, \omega_4]$ do not overlap) and $E(|Z(\omega)|^2) = F(\omega)$. We give the proof of these results in the following section.

We mention that a more detailed discussion of spectral analysis in time series is given in Priestley (1983), Chapters 4 and 6, Brockwell and Davis (1998), Chapters 4 and 10, Fuller (1995), Chapter 3, and Shumway and Stoffer (2006), Chapter 4. Many of these references also discuss tests for periodicity etc. (see also Quinn and Hannan (2001) for estimation of frequencies etc.).

8.3 Spectral representations

8.3.1 The spectral distribution

We first state a theorem which is very useful for checking positive definiteness of a sequence. See Brockwell and Davis (1998), Corollary 4.3.2 or Fuller (1995), Theorem 3.1.9. To prove part of the result we use the fact that if a sequence $\{a_k\} \in \ell_2$, then $g(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} a_k \exp(ik\omega) \in L_2$ (by Parseval's theorem) and $a_k = \int_0^{2\pi} g(\omega) \exp(-ik\omega)\, d\omega$, together with the following result.

Lemma 8.3.1 Suppose $\sum_{k=-\infty}^{\infty} |c(k)| < \infty$, then we have

$$\frac{1}{n} \sum_{k=-(n-1)}^{(n-1)} |k c(k)| \to 0$$

as $n \to \infty$.

PROOF. The proof is straightforward in the case that $\sum_{k=-\infty}^{\infty} |k c(k)| < \infty$, since in this case $\frac{1}{n} \sum_{k=-(n-1)}^{(n-1)} |k c(k)| = O(\frac{1}{n})$. The proof is slightly more tricky in the case that we only have $\sum_{k=-\infty}^{\infty} |c(k)| < \infty$. First we note that since $\sum_{k=-\infty}^{\infty} |c(k)| < \infty$, for every $\varepsilon > 0$ there exists an $N_\varepsilon$ such that for all $n \ge N_\varepsilon$, $\sum_{|k| \ge n} |c(k)| < \varepsilon$. Let us suppose that $n > N_\varepsilon$, then we have the bound

$$\frac{1}{n} \sum_{k=-(n-1)}^{(n-1)} |k c(k)| = \sum_{k=-(n-1)}^{(n-1)} \frac{|k|}{n} |c(k)| \le \frac{1}{n} \sum_{k=-(N_\varepsilon - 1)}^{(N_\varepsilon - 1)} |k c(k)| + \sum_{N_\varepsilon \le |k| \le n-1} \frac{|k|}{n} |c(k)| \le \frac{1}{n} \sum_{k=-(N_\varepsilon - 1)}^{(N_\varepsilon - 1)} |k c(k)| + \varepsilon.$$
Hence if we keep $N_\varepsilon$ fixed we see that $\frac{1}{n} \sum_{k=-(N_\varepsilon - 1)}^{(N_\varepsilon - 1)} |k c(k)| \to 0$ as $n \to \infty$. Since this is true for all $\varepsilon$ (for different thresholds $N_\varepsilon$) we obtain the required result.

Theorem 8.3.1 (The spectral density) Suppose the coefficients $\{c(k)\}$ are absolutely summable (that is $\sum_k |c(k)| < \infty$). Then the sequence $\{c(k)\}$ is nonnegative definite if and only if the function $f(\omega)$, where

$$f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} c(k) \exp(ik\omega),$$

is nonnegative. Moreover

$$c(k) = \int_0^{2\pi} \exp(ik\omega) f(\omega)\, d\omega. \tag{8.4}$$

It is worth noting that $f$ is called the spectral density corresponding to the covariances $\{c(k)\}$.

PROOF. We first show that if $\{c(k)\}$ is a nonnegative definite sequence, then $f(\omega)$ is a nonnegative function. We recall that since $\{c(k)\}$ is nonnegative definite, for any sequence $x = (x_1, \ldots, x_n)$ (real or complex) we have $\sum_{s,t=1}^{n} x_s c(s-t) \bar{x}_t \ge 0$ (where $\bar{x}_t$ is the complex conjugate of $x_t$). Now we consider the above for the particular case $x = (\exp(i\omega), \ldots, \exp(in\omega))$. Define the function

$$f_n(\omega) = \frac{1}{2\pi n} \sum_{s,t=1}^{n} \exp(is\omega)\, c(s-t) \exp(-it\omega).$$

Clearly $f_n(\omega) \ge 0$. We note that $f_n(\omega)$ can be rewritten as

$$f_n(\omega) = \frac{1}{2\pi} \sum_{k=-(n-1)}^{(n-1)} \frac{n - |k|}{n}\, c(k) \exp(ik\omega).$$

Comparing $f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} c(k) \exp(ik\omega)$ with $f_n(\omega)$ we see that

$$|f(\omega) - f_n(\omega)| \le \frac{1}{2\pi} \Big| \sum_{|k| \ge n} c(k) \exp(ik\omega) \Big| + \frac{1}{2\pi} \Big| \sum_{k=-(n-1)}^{(n-1)} \frac{|k|}{n}\, c(k) \exp(ik\omega) \Big| := I_n + II_n.$$

Now since $\sum_{k=-\infty}^{\infty} |c(k)| < \infty$ it is clear that $I_n \to 0$ as $n \to \infty$. Using Lemma 8.3.1 we have $II_n \to 0$ as $n \to \infty$. Altogether the above implies

$$|f(\omega) - f_n(\omega)| \to 0 \quad n \to \infty. \tag{8.5}$$

Now it is clear that since for all $n$, $f_n(\omega)$ are nonnegative functions, the limit $f$ must be nonnegative (if we suppose the contrary, then there must exist a subsequence of functions $\{f_{n_k}(\omega)\}$ which are not all nonnegative, which is not true). Therefore we have shown that if $\{c(k)\}$ is a nonnegative definite sequence, then $f(\omega)$ is a nonnegative function.

We now show that if $f(\omega)$, defined by $\frac{1}{2\pi} \sum_{k=-\infty}^{\infty} c(k) \exp(ik\omega)$, is a nonnegative function then $\{c(k)\}$ is a nonnegative definite sequence. We first note that because $\{c(k)\} \in \ell_1$ it is also in $\ell_2$, hence we have that $c(k) = \int_0^{2\pi} f(\omega) \exp(ik\omega)\, d\omega$.
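Now we have

$$\sum_{s,t=1}^{n} x_s c(s-t) \bar{x}_t = \int_0^{2\pi} f(\omega) \sum_{s,t=1}^{n} x_s \exp(i(s-t)\omega)\, \bar{x}_t\, d\omega = \int_0^{2\pi} f(\omega) \Big| \sum_{s=1}^{n} x_s \exp(is\omega) \Big|^2 d\omega \ge 0.$$

Hence we obtain the desired result.

The above theorem is very useful. It gives a simple way to check whether a sequence $\{c(k)\}$ is nonnegative definite or not (hence whether it is a covariance function, recall Theorem 1.1.1).

Example 8.3.1 Suppose we define the empirical covariances

$$\hat{c}_n(k) = \begin{cases} \frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|} & |k| \le n-1 \\ 0 & \text{otherwise,} \end{cases}$$

then $\{\hat{c}_n(k)\}$ is a nonnegative definite sequence. Therefore, using Lemma 1.1.1 there exists a stationary time series $\{Z_t\}$ which has the covariance $\hat{c}_n(k)$.

To show that the sequence is nonnegative definite we consider the Fourier transform of the sequence (the spectral density) and show that it is nonnegative. The Fourier transform of $\{\hat{c}_n(k)\}$ is

$$\sum_{k=-(n-1)}^{(n-1)} \exp(ik\omega)\, \hat{c}_n(k) = \sum_{k=-(n-1)}^{(n-1)} \exp(ik\omega)\, \frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|} = \frac{1}{n} \Big| \sum_{t=1}^{n} X_t \exp(it\omega) \Big|^2 \ge 0.$$

Since it is nonnegative, this means that $\{\hat{c}_n(k)\}$ is a nonnegative definite sequence.

We now state a useful result which relates the largest and smallest eigenvalues of the variance matrix of a stationary process to the smallest and largest values of the spectral density.

Lemma 8.3.2 Suppose that $\{X_k\}$ is a stationary process with covariance function $\{c(k)\}$ and spectral density $f(\omega)$. Let $\Sigma_n = \mathrm{var}(X_n)$, where $X_n = (X_1, \ldots, X_n)$. Suppose $\inf_\omega f(\omega) \ge m > 0$ and $\sup_\omega f(\omega) \le M < \infty$. Then for all $n$ we have

$$\lambda_{\min}(\Sigma_n) \ge 2\pi \inf_\omega f(\omega) \quad \text{and} \quad \lambda_{\max}(\Sigma_n) \le 2\pi \sup_\omega f(\omega),$$

where the factor $2\pi$ comes from the normalisation $f(\omega) = \frac{1}{2\pi} \sum_k c(k) \exp(ik\omega)$ used in Theorem 8.3.1.

PROOF. Let $e_1 = (e_{1,1}, \ldots, e_{n,1})$ be the eigenvector of $\Sigma_n$ with the smallest eigenvalue $\lambda_1$, normalised so that $\sum_{s=1}^{n} e_{s,1}^2 = 1$. Then using $c(s-t) = \int_0^{2\pi} f(\omega) \exp(i(s-t)\omega)\, d\omega$ we have

$$\lambda_{\min}(\Sigma_n) = e_1' \Sigma_n e_1 = \sum_{s,t=1}^{n} e_{s,1}\, c(s-t)\, e_{t,1} = \int_0^{2\pi} f(\omega) \sum_{s,t=1}^{n} e_{s,1} \exp(i(s-t)\omega)\, e_{t,1}\, d\omega = \int_0^{2\pi} f(\omega) \Big| \sum_{s=1}^{n} e_{s,1} \exp(is\omega) \Big|^2 d\omega \ge \inf_\omega f(\omega) \int_0^{2\pi} \Big| \sum_{s=1}^{n} e_{s,1} \exp(is\omega) \Big|^2 d\omega = 2\pi \inf_\omega f(\omega),$$

since $\int_0^{2\pi} | \sum_{s=1}^{n} e_{s,1} \exp(is\omega) |^2 d\omega = 2\pi \sum_{s=1}^{n} e_{s,1}^2 = 2\pi$.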
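The statement of Lemma 8.3.2 can be illustrated numerically. The sketch below is not from the notes: it assumes an MA(1) process $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$ with unit innovation variance, for which $2\pi f(\omega) = 1 + \theta^2 + 2\theta\cos\omega$ ranges over $[(1-|\theta|)^2, (1+|\theta|)^2]$, and it uses plain power iteration (helper names are hypothetical) to approximate the extreme eigenvalues of $\Sigma_n$.

```python
import math
import random


def ma1_covariance_matrix(theta, n):
    """Covariance matrix of (X_1, ..., X_n) for X_t = e_t + theta * e_{t-1}, var(e_t) = 1."""
    c = {0: 1.0 + theta ** 2, 1: theta}   # c(0) and c(1); all other lags are zero
    return [[c.get(abs(s - t), 0.0) for t in range(n)] for s in range(n)]


def top_eigenvalue(matrix, iterations=2000, seed=0):
    """Largest eigenvalue of a symmetric nonnegative definite matrix via power iteration."""
    n = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(n)]
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(val * val for val in w))
        v = [val / norm for val in w]
    w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
    return sum(v[i] * w[i] for i in range(n))   # Rayleigh quotient of the unit vector v


theta, n = 0.5, 20
sigma_n = ma1_covariance_matrix(theta, n)
upper = (1 + abs(theta)) ** 2   # 2 * pi * sup f for the MA(1) spectral density
lower = (1 - abs(theta)) ** 2   # 2 * pi * inf f for the MA(1) spectral density

lam_max = top_eigenvalue(sigma_n)
# upper * I - Sigma_n is nonnegative definite, and its largest eigenvalue is upper - lam_min.
shifted = [[(upper if s == t else 0.0) - sigma_n[s][t] for t in range(n)] for s in range(n)]
lam_min = upper - top_eigenvalue(shifted)
print(lower, lam_min, lam_max, upper)
```

The printed values should be ordered as lower bound, smallest eigenvalue, largest eigenvalue, upper bound. As $n$ grows the extreme eigenvalues approach the two bounds without crossing them.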
Using a similar method we can show the corresponding upper bound for $\lambda_{\max}(\Sigma_n)$ in terms of $\sup_\omega f(\omega)$.

A consequence of the above result is that if a spectral density is bounded from above and bounded away from zero, then the variance matrix $\Sigma_n$ is non-singular and has a bounded spectral norm.

Lemma 8.3.3 Suppose the covariance $\{c(k)\}$ decays to zero as $k \to \infty$, then for all $n$, $\Sigma_n = \mathrm{var}(X_n)$ is a non-singular matrix (note we do not require the covariances to be absolutely summable).

PROOF. See Brockwell and Davis (1998), Proposition 5.1.1.

Theorem 8.3.1 only holds when the sequence $\{c(k)\}$ is absolutely summable. Of course this may not always be the case. An example of an `extreme' case is the time series $X_t = Z$. Clearly this is a stationary time series and its covariance is $c(k) = \mathrm{var}(Z)$ for all $k$. In this case the autocovariance sequence (for example $\{c(k) = 1\}$ when $\mathrm{var}(Z) = 1$) is not absolutely summable, hence the representation of the covariance in Theorem 8.3.1 cannot be applied. The reason is that the Fourier transform of the infinite sequence $\{1\}$ is not well defined (since $\{1\}$ belongs to neither $\ell_1$ nor $\ell_2$).

However, we now show that Theorem 8.3.1 can be generalised to include all nonnegative definite sequences and stationary processes, by considering the spectral distribution rather than the spectral density (we use the integral $\int g(x)\, dF(x)$; a definition is given in the Appendix).

Theorem 8.3.2 A sequence $\{c(k)\}$ is a nonnegative definite sequence if and only if

$$c(k) = \int_0^{2\pi} \exp(ik\omega)\, dF(\omega), \tag{8.6}$$

where $F(\omega)$ is a right-continuous (this means that $F(x + h) \to F(x)$ as $0 < h \to 0$), nondecreasing, nonnegative bounded function on $[0, 2\pi]$ (hence it has all the properties of a distribution function and can be considered as one; it is usually called the spectral distribution). This representation is unique.

PROOF. We first show that if $\{c(k)\}$ is a nonnegative definite sequence, then we can write $c(k) = \int_0^{2\pi} \exp(ik\omega)\, dF(\omega)$, where $F(\omega)$ is a distribution function.
Had $\{c(k)\}$ been absolutely summable, then we could use Theorem 8.3.1 to write $c(k) = \int_0^{2\pi} \exp(ik\omega)\, dF(\omega)$, where $F(\omega) = \int_0^{\omega} f(\lambda)\, d\lambda$ and $f(\lambda) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} c(k) \exp(ik\lambda)$. By Theorem 8.3.1 we know that $f(\lambda)$ is nonnegative, hence $F(\omega)$ is a distribution, and we have the result.

In the case that $\{c(k)\}$ is not absolutely summable we cannot use this approach, but we adapt some of the ideas used to prove Theorem 8.3.1. As in the proof of Theorem 8.3.1 define the nonnegative function

$$f_n(\omega) = \frac{1}{2\pi n} \sum_{s,t=1}^{n} \exp(is\omega)\, c(s-t) \exp(-it\omega) = \frac{1}{2\pi} \sum_{k=-(n-1)}^{(n-1)} \frac{n - |k|}{n}\, c(k) \exp(ik\omega).$$

When $\{c(k)\}$ is not absolutely summable, the limit of $f_n(\omega)$ may no longer be well defined. To avoid dealing with functions which may have awkward limits, we consider instead their integral, which we will show is always a distribution function. Let us define the function $F_n(\omega)$ whose derivative is $f_n(\omega)$, that is

$$F_n(\omega) = \int_0^{\omega} f_n(\lambda)\, d\lambda, \qquad 0 \le \omega \le 2\pi.$$

Since $f_n(\lambda)$ is nonnegative, $F_n$ is a nondecreasing function, and it is bounded ($F_n(\omega) \le F_n(2\pi) = \int_0^{2\pi} f_n(\lambda)\, d\lambda = c(0)$). Hence $F_n$ satisfies all the properties of a distribution and can be treated as a distribution function. Now it is clear that for every $k$ we have

$$\int_0^{2\pi} \exp(ik\omega)\, dF_n(\omega) = \begin{cases} (1 - \frac{|k|}{n})\, c(k) & |k| \le n \\ 0 & \text{otherwise.} \end{cases} \tag{8.7}$$

If we let $d_{n,k} = \int_0^{2\pi} \exp(ik\omega)\, dF_n(\omega)$, we see that for every $k$, $d_{n,k} \to d_k := c(k)$ as $n \to \infty$. But we should ask what this tells us about the limit of the distributions $\{F_n\}$. Intuitively, the distributions $\{F_n\}$ should converge (weakly) to a function $F$, and this function should also be a distribution function (if $\{F_n\}$ are all nondecreasing functions, then their limit must be nondecreasing). In fact this turns out to be the case by applying Helly's theorem (see Appendix). Roughly speaking, it states that given a sequence of distributions $\{G_k\}$ which are all bounded, there exists a distribution $G$ which is the limit of a subsequence $\{G_{k_i}\}$ (effectively this gives conditions for a sequence of functions to be compact), hence for every $h \in L_2$ we have

$$\int h(\omega)\, dG_{k_i}(\omega) \to \int h(\omega)\, dG(\omega).$$
We now apply this result to the sequence $\{F_n\}$. We observe that the distributions $\{F_n\}$ are all uniformly bounded (by $c(0)$). Therefore applying Helly's theorem there must exist a distribution $F$ which is the limit of a subsequence of $\{F_n\}$, that is for every $h \in L_2$ we have

$$\int h(\omega)\, dF_{n_i}(\omega) \to \int h(\omega)\, dF(\omega), \quad i \to \infty,$$

for some subsequence $\{F_{n_i}\}$. We now show that the above is true not only for a subsequence but for the actual sequence $\{F_n\}$. We observe that $\{\exp(ik\omega)\}$ is a basis of $L_2$, and that the sequence $\{\int_0^{2\pi} \exp(ik\omega)\, dF_n(\omega)\}_n$ converges for all $k$ to $c(k)$. Therefore for all $h \in L_2$ we have

$$\int h(\omega)\, dF_n(\omega) \to \int h(\omega)\, dF(\omega), \quad n \to \infty,$$

for some distribution function $F$. Looking at the case $h(\omega) = \exp(ik\omega)$ we have

$$\int \exp(ik\omega)\, dF_n(\omega) \to \int \exp(ik\omega)\, dF(\omega), \quad n \to \infty.$$

Since $\int \exp(ik\omega)\, dF_n(\omega) \to c(k)$ and $\int \exp(ik\omega)\, dF_n(\omega) \to \int \exp(ik\omega)\, dF(\omega)$, we have $c(k) = \int \exp(ik\omega)\, dF(\omega)$, where $F$ is a distribution.

To show that $\{c(k)\}$ is a nonnegative definite sequence when $c(k)$ is defined as $c(k) = \int \exp(ik\omega)\, dF(\omega)$, we use the same method given in the proof of Theorem 8.3.1.

Example 8.3.2 We now construct the spectral distribution for the time series $X_t = Z$. Let $F(\omega) = 0$ for $\omega < 0$ and $F(\omega) = \mathrm{var}(Z)$ for $\omega \ge 0$ (hence $F$ is a step function). Then we have

$$\mathrm{cov}(X_0, X_k) = \mathrm{var}(Z) = \int \exp(ik\omega)\, dF(\omega).$$

8.3.2 The spectral representation theorem

We now state the spectral representation theorem and give a rough outline of the proof.

Theorem 8.3.3 If $\{X_t\}$ is a second order stationary time series with mean zero and spectral distribution function $F(\omega)$, then there exists a right continuous, orthogonal increment process $\{Z(\omega)\}$ (that is $E\big((Z(\omega_1) - Z(\omega_2))(Z(\omega_3) - Z(\omega_4))\big) = 0$ when the intervals $[\omega_1, \omega_2]$ and $[\omega_3, \omega_4]$ do not overlap) such that

$$X_t = \int_0^{2\pi} \exp(it\omega)\, dZ(\omega), \tag{8.8}$$

where for $\omega_1 \ge \omega_2$, $E(Z(\omega_1) - Z(\omega_2))^2 = F(\omega_1) - F(\omega_2)$ (noting that $F(0) = 0$). (One example of a right continuous, orthogonal increment process is Brownian motion, though this is just one example, and usually $Z(\omega)$ will be far more general than Brownian motion.)
Heuristically we see that (8.8) is the decomposition of $X_t$ in terms of frequencies whose amplitudes are orthogonal. In other words $X_t$ is decomposed in terms of the frequencies $\exp(it\omega)$, which have the orthogonal amplitudes $dZ(\omega) \approx (Z(\omega + \delta) - Z(\omega))$.

Remark 8.3.1 Note that so far we have not defined the integral on the right hand side of (8.8); this is known as a stochastic integral. Unlike many deterministic functions (functions whose derivative exists), one cannot really suppose $dZ(\omega) \approx Z'(\omega)\, d\omega$, because usually a typical realisation of $Z(\omega)$ will not be smooth enough to differentiate. For example, it is well known that Brownian motion is quite `rough': a typical realisation $B(\cdot)$ of Brownian motion satisfies $|B(t_1) - B(t_2)| \le K |t_1 - t_2|^{\gamma}$, where the constant $K$ depends on the realisation and $\gamma < 1/2$, but in general $\gamma$ cannot be taken larger. The integral $\int g(\omega)\, dZ(\omega)$ is well defined if it is defined as the limit (in the mean squared sense) of discrete sums. In other words let $Z_n(\omega) = \sum_{k=1}^{n} Z(\omega_k) I_{[\omega_{k-1}, \omega_k)}(\omega)$ and

$$\int g(\omega)\, dZ_n(\omega) = \sum_{k=1}^{n} g(\omega_k) \{Z(\omega_k) - Z(\omega_{k-1})\},$$

then $\int g(\omega)\, dZ(\omega)$ is the mean squared limit of $\{\int g(\omega)\, dZ_n(\omega)\}_n$, that is $E[\int g(\omega)\, dZ(\omega) - \int g(\omega)\, dZ_n(\omega)]^2 \to 0$. For a more precise explanation, see Priestley (1983), Section 3.6.3 and Section 4.11, and Brockwell and Davis (1998), Section 4.7.

A very elegant explanation of the different proofs of the spectral representation theorem is given in Priestley (1983), Section 4.11. We now give a rough outline of the proof using the functional theory approach.

PROOF of the Spectral Representation Theorem. To prove the result we define two Hilbert spaces $H_1$ and $H_2$, where $H_1$ contains deterministic functions and $H_2$ contains random variables. We will define what is known as an isomorphism (a one-to-one mapping which preserves the norm and is linear) between these two spaces.

Let $H_1$ be the set of all functions $f$ with $\int_0^{2\pi} |f(\omega)|^2\, dF(\omega) < \infty$, and define the inner product on $H_1$ to be

$$\langle f, g \rangle = \int_0^{2\pi} f(x) \overline{g(x)}\, dF(x). \tag{8.9}$$

We first note that the functions $\{\exp(ik\omega)\}$ belong to $H_1$; moreover they also span the space $H_1$. Hence if $f \in H_1$, then there exist coefficients $\{a_j\}$ such that $f(\omega) = \sum_j a_j \exp(ij\omega)$. Let $H_2$ be the space spanned by $\{X_t\}$, hence $H_2 = \overline{sp}(\{X_t\})$ (it is necessary to define the closure of this space, but we won't do so here), with the covariance $\mathrm{cov}(Y, X)$ as inner product. Now let us define the mapping $T : H_1 \to H_2$,

$$T\Big(\sum_{j=1}^{n} a_j \exp(ij\omega)\Big) = \sum_{j=1}^{n} a_j X_j, \tag{8.10}$$

for any $n$ (it is necessary to show that this can be extended to infinite $n$, but we won't do so here). We need to show that $T$ defines an isomorphism. We first observe that this mapping preserves the inner product. That is, suppose $f, g \in H_1$, then there exist $\{f_j\}$ and $\{g_j\}$ such that $f(\omega) = \sum_j f_j \exp(ij\omega)$ and $g(\omega) = \sum_j g_j \exp(ij\omega)$. Hence by the definition of $T$ in (8.10) we have

$$\langle Tf, Tg \rangle = \mathrm{cov}\Big(\sum_j f_j X_j, \sum_j g_j X_j\Big) = \sum_{j_1, j_2} f_{j_1} \bar{g}_{j_2}\, \mathrm{cov}(X_{j_1}, X_{j_2}) = \int_0^{2\pi} \sum_{j_1, j_2} f_{j_1} \bar{g}_{j_2} \exp(i(j_1 - j_2)\omega)\, dF(\omega) = \int_0^{2\pi} f(x) \overline{g(x)}\, dF(x) = \langle f, g \rangle.$$

Hence $\langle Tf, Tg \rangle = \langle f, g \rangle$, so the inner product is preserved. To show that it is a one-to-one mapping see Brockwell and Davis (1998), Section 4.7. Altogether this means that $T$ defines an isomorphism between $H_1$ and $H_2$. Therefore every function in $H_1$ has a corresponding random variable in $H_2$ which displays many similar properties.

Since for all $\omega \in [0, 2\pi]$ the indicator functions $I_{[0,\omega]}(x) \in H_1$, we can define the random function $\{Z(\omega); 0 \le \omega \le 2\pi\}$ by $T(I_{[0,\omega]}) = Z(\omega)$. Now since the mapping $T$ is linear we observe that for $\omega_2 \le \omega_1$

$$T(I_{(\omega_2, \omega_1]}) = Z(\omega_1) - Z(\omega_2).$$

Moreover, since $T$ preserves the inner product, for any non-intersecting intervals $(\omega_2, \omega_1]$ and $(\omega_4, \omega_3]$ we have

$$E\big((Z(\omega_1) - Z(\omega_2))(Z(\omega_3) - Z(\omega_4))\big) = \langle T(I_{(\omega_2, \omega_1]}), T(I_{(\omega_4, \omega_3]}) \rangle = \langle I_{(\omega_2, \omega_1]}, I_{(\omega_4, \omega_3]} \rangle = \int I_{(\omega_2, \omega_1]}(\omega)\, I_{(\omega_4, \omega_3]}(\omega)\, dF(\omega) = 0.$$

Therefore by construction $\{Z(\omega); 0 \le \omega \le 2\pi\}$ is an orthogonal increment process, with

$$E\big((Z(\omega_1) - Z(\omega_2))^2\big) = \langle T(I_{(\omega_2, \omega_1]}), T(I_{(\omega_2, \omega_1]}) \rangle = \langle I_{(\omega_2, \omega_1]}, I_{(\omega_2, \omega_1]} \rangle = \int_{\omega_2}^{\omega_1} dF(\omega) = F(\omega_1) - F(\omega_2).$$
We have now defined two isomorphic spaces, together with the random function $\{Z(\omega); 0 \le \omega \le 2\pi\}$, which has orthogonal increments and is the image of the indicator functions $I_{[0,\omega]}(x)$. We can now prove the result. We note that for any function $g \in L_2$ we can write

$$g(\omega) = \int_0^{2\pi} g(s)\, dI(s - \omega),$$

where $I(s)$ is the step function with $I(s) = 0$ for $s < 0$ and $I(s) = 1$ for $s \ge 0$ (hence $dI(s - \omega) = \delta(s - \omega)\, ds$, where $\delta$ is the Dirac delta function). We now consider the special case $g(\omega) = \exp(it\omega)$, and apply the isomorphism $T$ to this:

$$T(\exp(it\omega)) = \int_0^{2\pi} \exp(its)\, dT(I(s - \omega)),$$

where the mapping goes inside the integral due to the linearity of the isomorphism. Now we observe that $I(s - \omega) = I_{[0,s]}(\omega)$, and by the definition of $\{Z(\omega); 0 \le \omega \le 2\pi\}$ we have $T(I_{[0,s]}(\omega)) = Z(s)$. Substituting this into the above gives

$$X_t = \int_0^{2\pi} \exp(its)\, dZ(s),$$

which gives the required result.

8.3.3 The spectral densities of MA, AR and ARMA models

We obtain the spectral density function for MA($\infty$) processes. Using this we can easily obtain the spectral density for ARMA processes. Let us suppose that $\{X_t\}$ satisfies the representation

$$X_t = \sum_{j=-\infty}^{\infty} \psi_j \varepsilon_{t-j}, \tag{8.11}$$

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$ and $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$. We recall that the covariance of the above is

$$c(k) = E(X_t X_{t+k}) = \sigma^2 \sum_{j=-\infty}^{\infty} \psi_j \psi_{j+k}. \tag{8.12}$$

Since $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$, it can be seen that

$$\sum_k |c(k)| \le \sigma^2 \sum_k \sum_{j=-\infty}^{\infty} |\psi_j| \cdot |\psi_{j+k}| < \infty.$$

Hence by Theorem 8.3.1 the spectral density function of $\{X_t\}$ is well defined. There are several ways to derive the spectral density of $\{X_t\}$: we can either use (8.12) and $f(\omega) = \frac{1}{2\pi} \sum_k c(k) \exp(ik\omega)$, or obtain the spectral representation of $\{X_t\}$ and derive $f(\omega)$ from the spectral representation. We prove the results using the latter method.

Since $\{\varepsilon_t\}$ are iid random variables, using Theorem 8.3.3 there exists an orthogonal random function $\{Z(\omega)\}$ such that

$$\varepsilon_t = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} \exp(it\omega)\, dZ(\omega).$$

Since $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$, multiplying the above by $\varepsilon_t$, taking expectations and noting that due to the orthogonality of $\{Z(\omega)\}$ we have $E(dZ(\omega_1) \overline{dZ(\omega_2)}) = 0$ unless $\omega_1 = \omega_2$, we have that $E(|dZ(\omega)|^2) = \sigma^2 d\omega$.

Using the above we can obtain the spectral representation for $\{X_t\}$:

$$X_t = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} \Big\{ \sum_{j=-\infty}^{\infty} \psi_j \exp(-ij\omega) \Big\} \exp(it\omega)\, dZ(\omega).$$

Hence

$$X_t = \int_0^{2\pi} A(\omega) \exp(it\omega)\, dZ(\omega),$$

where $A(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{j=-\infty}^{\infty} \psi_j \exp(-ij\omega)$, noting that this is the unique spectral representation of $X_t$.

Now multiplying the above by $X_{t+k}$ and taking expectations gives

$$E(X_t X_{t+k}) = c(k) = \int_0^{2\pi} \int_0^{2\pi} A(\omega_1) A(-\omega_2) \exp(it\omega_1 - i(t+k)\omega_2)\, E(dZ(\omega_1) \overline{dZ(\omega_2)}).$$

Due to the orthogonality of $\{Z(\omega)\}$ we have $E(dZ(\omega_1) \overline{dZ(\omega_2)}) = 0$ unless $\omega_1 = \omega_2$; altogether this gives

$$E(X_t X_{t+k}) = c(k) = \int_0^{2\pi} |A(\omega)|^2 \exp(-ik\omega)\, E(|dZ(\omega)|^2) = \sigma^2 \int_0^{2\pi} |A(\omega)|^2 \exp(-ik\omega)\, d\omega.$$

Comparing the above with (8.4) we see that the spectral density function corresponding to the MA($\infty$) process defined in (8.11) is

$$f(\omega) = \sigma^2 |A(\omega)|^2 = \frac{\sigma^2}{2\pi} \Big| \sum_{j=-\infty}^{\infty} \psi_j \exp(-ij\omega) \Big|^2.$$

Example 8.3.3 Let us suppose that $\{X_t\}$ is a stationary ARMA$(p, q)$ time series (not necessarily invertible or causal), where

$$X_t - \sum_{j=1}^{p} \phi_j X_{t-j} = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},$$

and $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$. Then the spectral density of $\{X_t\}$ is

$$f(\omega) = \frac{\sigma^2}{2\pi} \frac{|1 + \sum_{j=1}^{q} \theta_j \exp(ij\omega)|^2}{|1 - \sum_{j=1}^{p} \phi_j \exp(ij\omega)|^2}.$$

We note that because the ARMA spectral density is a ratio of trigonometric polynomials, it is known as a rational spectral density.

8.3.4 Higher order spectrums

We recall that the covariance is a measure of linear dependence between two random variables. Higher order cumulants are a measure of higher order dependence. For example, the third order cumulant for the zero mean random variables $X_1, X_2, X_3$ is

$$\mathrm{cum}(X_1, X_2, X_3) = E(X_1 X_2 X_3)$$

and the fourth order cumulant for the zero mean random variables $X_1, X_2, X_3, X_4$ is

$$\mathrm{cum}(X_1, X_2, X_3, X_4) = E(X_1 X_2 X_3 X_4) - E(X_1 X_2) E(X_3 X_4) - E(X_1 X_3) E(X_2 X_4) - E(X_1 X_4) E(X_2 X_3).$$

From the definition we see that if $X_1, X_2, X_3, X_4$ are independent then $\mathrm{cum}(X_1, X_2, X_3) = 0$ and $\mathrm{cum}(X_1, X_2, X_3, X_4) = 0$.
Moreover, if $X_1, X_2, X_3, X_4$ are Gaussian random variables then $\mathrm{cum}(X_1, X_2, X_3) = 0$ and $\mathrm{cum}(X_1, X_2, X_3, X_4) = 0$. Indeed all cumulants of order higher than two are zero. This comes from the fact that cumulants are the coefficients of the power series expansion of the logarithm of the moment generating function of $\{X_t\}$.

Since the spectral density is the Fourier transform of the covariance, it is natural to ask whether one can define the higher order spectra as the Fourier transform of the higher order cumulants. This turns out to be the case, and the higher order spectra have several interesting properties. Let us suppose that $\{X_t\}$ is a stationary time series (notice that we are assuming it is strictly stationary and not just second order stationary). Let $\mathrm{cum}(t, s) = \mathrm{cum}(X_0, X_t, X_s)$ and $\mathrm{cum}(t, s, r) = \mathrm{cum}(X_0, X_t, X_s, X_r)$ (noting that, like the covariance, the higher order cumulants are invariant to shift). The third and fourth order spectra are defined as

$$f_3(\omega_1, \omega_2) = \sum_{s=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} \mathrm{cum}(s, t) \exp(is\omega_1 + it\omega_2),$$

$$f_4(\omega_1, \omega_2, \omega_3) = \sum_{s=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} \mathrm{cum}(s, t, r) \exp(is\omega_1 + it\omega_2 + ir\omega_3).$$

Example 8.3.4 (Third and fourth order spectra of a linear process) Let us suppose that $\{X_t\}$ satisfies

$$X_t = \sum_{j=-\infty}^{\infty} \psi_j \varepsilon_{t-j},$$

where $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$, $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^4) < \infty$. Let $A(\omega) = \sum_{j=-\infty}^{\infty} \psi_j \exp(ij\omega)$. Then it is straightforward to show that

$$f_3(\omega_1, \omega_2) = \kappa_3 A(\omega_1) A(\omega_2) A(-\omega_1 - \omega_2),$$

$$f_4(\omega_1, \omega_2, \omega_3) = \kappa_4 A(\omega_1) A(\omega_2) A(\omega_3) A(-\omega_1 - \omega_2 - \omega_3),$$

where $\kappa_3 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t)$ and $\kappa_4 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t, \varepsilon_t)$.

We see from the example that, unlike the spectral density, the higher order spectra are not necessarily positive or even real. A review of higher order spectra can be found in Brillinger (2001). Higher order spectra have several applications, especially in nonlinear processes; see Subba Rao and Gabr (1984). We will consider one such application in a later chapter.

8.4 The Periodogram and the spectral density function

Our aim is to construct an estimator of the spectral density function $f(\omega)$ associated with the second order stationary process $\{X_t\}$.

8.4.1 The periodogram and its properties

Let us suppose we observe $\{X_t\}_{t=1}^{n}$, which are observations from a zero mean, second order stationary time series. Let us suppose that the autocovariance function is $\{c(k)\}$, where $c(k) = E(X_t X_{t+k})$ and $\sum_k |c(k)| < \infty$. We recall that we can estimate $c(k)$ using

$$\hat{c}_n(k) = \frac{1}{n} \sum_{t=1}^{n-|k|} X_t X_{t+|k|}.$$

Given that the spectral density is $f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} c(k) \exp(ik\omega)$, a natural estimator of $f(\omega)$ is

$$I_X(\omega) = \frac{1}{2\pi} \sum_{k=-(n-1)}^{n-1} \hat{c}_n(k) \exp(ik\omega). \tag{8.13}$$

We usually call $I_X(\omega)$ the periodogram. We will show that the periodogram has several nice properties that make it a suitable starting point for estimating the spectral density. The only problem is that the raw periodogram turns out to be an inconsistent estimator. However, with some modifications of the periodogram we can construct a good estimator of the spectral density.

Lemma 8.4.1 Suppose that $\{X_t\}$ is a second order stationary time series with $\sum_k |c(k)| < \infty$ and $I_X(\omega)$ is defined in (8.13). Then we have

$$I_X(\omega) = \frac{1}{2\pi} \sum_{k=-(n-1)}^{n-1} \hat{c}_n(k) \exp(ik\omega) = \frac{1}{2\pi n} \Big| \sum_{t=1}^{n} X_t \exp(it\omega) \Big|^2 \tag{8.14}$$

and

$$|E(I_X(\omega)) - f(\omega)| \le \frac{1}{2\pi} \Big( \sum_{|k| \ge n} |c(k)| + \sum_{|k| \le n} \frac{|k|}{n} |c(k)| \Big) \to 0 \tag{8.15}$$

as $n \to \infty$. Hence in the case that $\sum_{k=-\infty}^{\infty} |k c(k)| < \infty$ we have $|E(I_X(\omega)) - f(\omega)| = O(\frac{1}{n})$. Moreover $\mathrm{var}\big(\frac{1}{\sqrt{2\pi n}} \sum_{t=1}^{n} X_t \exp(it\omega)\big) = E(I_X(\omega))$.

PROOF. (8.14) follows immediately from the definition of the periodogram. The first inequality in (8.15) is straightforward. To show that the bound converges to zero, we note that since $\sum_k |c(k)| < \infty$ we have $\sum_{|k| \ge n} |c(k)| \to 0$ as $n \to \infty$, and by Lemma 8.3.1 we have $\sum_{|k| \le n} \frac{|k|}{n} |c(k)| \to 0$ as $n \to \infty$; thus we obtain the desired result.

We see from the above that the periodogram is both a nonnegative function and an asymptotically unbiased estimator of the spectral density.
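The two expressions for the periodogram in (8.14) can be compared numerically. The sketch below is illustrative only: the data are arbitrary simulated numbers, the function names are hypothetical, and the sample autocovariance uses the divisor $n$ as in the definition of $\hat{c}_n(k)$.

```python
import cmath
import math
import random


def sample_autocovariance(x, k):
    """c_hat_n(k) = (1/n) * sum_{t=1}^{n-|k|} x_t x_{t+|k|} (zero mean assumed, divisor n)."""
    n, k = len(x), abs(k)
    return sum(x[t] * x[t + k] for t in range(n - k)) / n


def periodogram_from_autocovariances(x, omega):
    """(1/(2 pi)) * sum_{|k| <= n-1} c_hat_n(k) exp(i k omega), as in (8.13)."""
    n = len(x)
    total = sum(
        sample_autocovariance(x, k) * cmath.exp(1j * k * omega) for k in range(-(n - 1), n)
    )
    return total.real / (2 * math.pi)   # the imaginary parts cancel by symmetry in k


def periodogram_from_dft(x, omega):
    """(1/(2 pi n)) * |sum_{t=1}^{n} x_t exp(i t omega)|^2, the right hand side of (8.14)."""
    n = len(x)
    j = sum(x[t - 1] * cmath.exp(1j * t * omega) for t in range(1, n + 1))
    return abs(j) ** 2 / (2 * math.pi * n)


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(12)]   # arbitrary illustrative data
frequencies = [0.3, 1.0, 2.5]
pairs = [
    (periodogram_from_autocovariances(x, w), periodogram_from_dft(x, w)) for w in frequencies
]
for w, (a, b) in zip(frequencies, pairs):
    print(w, a, b)
```

Both columns should agree up to rounding error and are never negative, in line with the nonnegativity of the periodogram noted above.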
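Hence it has inherited several of the characteristics of the spectral density. However, a problem with the periodogram is that it is extremely erratic in its behaviour; in fact in the limit it does not converge to the spectral density. Hence as an estimator of the spectral density it is inappropriate. We will demonstrate this in the following two propositions and later discuss why this is so and how it can be made into a consistent estimator.

We start by considering the periodogram of iid random variables.

Proposition 8.4.1 Suppose $\{\varepsilon_t\}_{t=1}^{n}$ are iid random variables with mean zero and variance $\sigma^2$. We define $J_\varepsilon(\omega) = \frac{1}{\sqrt{2\pi n}} \sum_{t=1}^{n} \varepsilon_t \exp(it\omega)$ and $I_\varepsilon(\omega) = \frac{1}{2\pi n} | \sum_{t=1}^{n} \varepsilon_t \exp(it\omega) |^2$. Then we have

$$\begin{pmatrix} \Re J_\varepsilon(\omega) \\ \Im J_\varepsilon(\omega) \end{pmatrix} \stackrel{D}{\to} N\Big(0, \frac{\sigma^2}{4\pi} I_2\Big), \tag{8.16}$$

for any finite $m$

$$\big(\Re J_\varepsilon(\omega_{k_1}), \Im J_\varepsilon(\omega_{k_1}), \ldots, \Re J_\varepsilon(\omega_{k_m}), \Im J_\varepsilon(\omega_{k_m})\big)' \stackrel{D}{\to} N\Big(0, \frac{\sigma^2}{4\pi} I_{2m}\Big), \tag{8.17}$$

$4\pi I_\varepsilon(\omega)/\sigma^2 \stackrel{D}{\to} \chi^2(2)$ (equivalently, $2\pi I_\varepsilon(\omega)/\sigma^2$ is asymptotically exponentially distributed with mean one), $(I_\varepsilon(\omega_{k_1}), \ldots, I_\varepsilon(\omega_{k_m}))$ converges in distribution to $(|J(\omega_{k_1})|^2, \ldots, |J(\omega_{k_m})|^2)$, where the $J(\omega_{k_i})$ denote the limiting Gaussian variables in (8.17), and

$$\mathrm{cov}(I_\varepsilon(\omega_j), I_\varepsilon(\omega_k)) = \begin{cases} \frac{\kappa_4}{(2\pi)^2 n} & j \ne k \\ \frac{\kappa_4}{(2\pi)^2 n} + \frac{\sigma^4}{(2\pi)^2} & j = k, \end{cases} \tag{8.18}$$

where $\omega_j = 2\pi j/n$ and $\omega_k = 2\pi k/n$ with $0 < j, k < n/2$, and $\kappa_4 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t, \varepsilon_t)$.

PROOF. We first show (8.16). We note that $\Re(J_\varepsilon(\omega_k)) = \frac{1}{\sqrt{2\pi n}} \sum_{t=1}^{n} \alpha_{t,n}$ and $\Im(J_\varepsilon(\omega_k)) = \frac{1}{\sqrt{2\pi n}} \sum_{t=1}^{n} \beta_{t,n}$, where $\alpha_{t,n} = \varepsilon_t \cos(2\pi kt/n)$ and $\beta_{t,n} = \varepsilon_t \sin(2\pi kt/n)$. These are weighted sums of iid random variables, hence $\{\alpha_{t,n}\}$ and $\{\beta_{t,n}\}$ are martingale differences. Therefore, to show asymptotic normality, we use the martingale central limit theorem together with the Cramer-Wold device. To show the result we need to verify the three conditions of the martingale CLT. First we consider the conditional variances:

$$\frac{1}{2\pi n} \sum_{t=1}^{n} E\big(|\alpha_{t,n}|^2 \,\big|\, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\big) = \frac{1}{2\pi n} \sum_{t=1}^{n} \cos(2\pi kt/n)^2 \sigma^2 \to \frac{\sigma^2}{4\pi},$$

$$\frac{1}{2\pi n} \sum_{t=1}^{n} E\big(|\beta_{t,n}|^2 \,\big|\, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\big) = \frac{1}{2\pi n} \sum_{t=1}^{n} \sin(2\pi kt/n)^2 \sigma^2 \to \frac{\sigma^2}{4\pi},$$

$$\frac{1}{2\pi n} \sum_{t=1}^{n} E\big(\alpha_{t,n} \beta_{t,n} \,\big|\, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\big) = \frac{\sigma^2}{2\pi n} \sum_{t=1}^{n} \cos(2\pi kt/n) \sin(2\pi kt/n) = \frac{\sigma^2}{4\pi n} \sum_{t=1}^{n} \sin(2 \cdot 2\pi kt/n) = 0.$$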
Finally we need to verify the Lindeberg condition, we only verify it for argument holds true for 1 2n 1 2n n t=1 n t=1 1 2n n t=1 t,n . n t=1 t,n , the same We note that for every n > 0 we have E |t,n |2 I(|t,n | 2 n ) 1 E |t,n |2 I(|t,n | 2 n ) t-1 , t-2 , . . . = 2n t=1 P E |t |2 I(|t | 2 n ) = E |t |2 I(|t | 2 n ) 0 101 as n , noting that the second to last inequality is because |t,n | = | cos(2t/n)t | t . Hence we have verified Lindeberg condition and we obtain (8.16). The proof of (8.17) is similar, hence we omit the details. We can show that I () 2 (2), because I () = (J ())2 + (J ())2 , hence from (8.16) we have I ()/ 2 (2). To prove (8.18) we note that cov(I (j ), I (k )) = We recall that cov(t1 t1 +k1 , t2 t2 +k2 ) = cov(t1 , t2 +k2 )cov(t2 , t1 +k1 ) + cov(t1 , t2 )cov(t1 +k1 , t2 +k2 ) + cum(t1 , t1 +k1 , t2 , t2 +k2 ). We note that since {t } are iid random variables, then for most t1 , t2 , k1 and k2 the above covariance is zero. The exceptions are when t1 = t2 and k1 = k2 or t1 = t2 and k1 = k2 = 0 or t1 - t2 = k1 = -k2 . Counting all these combinations we have cov(I (j ), I (k )) = 2 2n2 k t t 1 2n2 cov(Xt1 Xt1 +k1 , Xt2 Xt2 +k2 ). k1 k2 t1 t2 exp(ik(j - k ))2 + 2 1 2n2 4 t where 2 = var(t ) and 4 = cum(t , t , t , t ). We note that for j = k, t exp(ik(j - k )) = 0 and for j = k, t exp(ik(j - k )) = n, substutiting this into cov(I (j ), I (k )) gives us the desired result. We have seen that the periodogram for iid random variables does not converge to a constant and indeed its distribution is asymptotically exponential. This suggests that something similar holds true for linear processes. This is the case. In the following lemma we show that the periodogram of a general linear process Xt = j=- j t is IX () = | where f () = | j j exp(ij)|2 I () + op (1) = f ()I () + op (1), j j exp(ij)|2 is the spectral density of {Xt }. Lemma 8.4.2 Let us suppose that {Xt } satisfy Xt = j=- j t , where j=- |j | < , 2 . 
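and $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$. Then we have

\[
J_X(\omega) = \sum_{j}\psi_j\exp(ij\omega)J_\varepsilon(\omega) + Y_n(\omega), \tag{8.19}
\]

where $Y_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{j}\psi_j\exp(ij\omega)U_{n,j}$, $U_{n,j} = \sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t$ and

\[
E|Y_n(\omega)|^2 \le \frac{\sigma^2}{\pi}\Big(\frac{1}{n^{1/2}}\sum_{j=-\infty}^{\infty}|\psi_j|\min(|j|, n)^{1/2}\Big)^2. \tag{8.20}
\]

Furthermore

\[
I_X(\omega) = \Big|\sum_{j}\psi_j\exp(ij\omega)\Big|^2|J_\varepsilon(\omega)|^2 + R_n(\omega),
\]

where $\sup_\omega E|R_n(\omega)| \to 0$ as $n \to \infty$. If in addition $E(\varepsilon_t^4) < \infty$ and $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, then $\sup_\omega E|R_n(\omega)|^2 = O(n^{-1})$.

PROOF. See Priestley (1983), Theorem 6.2.1 or Brockwell and Davis (1998), Theorem 10.3.1.

Using the above we see that $I_X(\omega) \approx \frac{2\pi}{\sigma^2}f(\omega)I_\varepsilon(\omega)$. This suggests that most of the properties which apply to $I_\varepsilon(\omega)$ also apply to $I_X(\omega)$. Indeed in the following theorem we show that the asymptotic distribution of $I_X(\omega)$ is exponential with mean $f(\omega)$ and variance $f(\omega)^2$. By using the above result we now generalise Proposition 8.4.1 to linear processes.

Theorem 8.4.1 Suppose $\{X_t\}$ satisfies $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$. Let $I_n(\omega)$ denote the periodogram associated with $\{X_1, \ldots, X_n\}$ and $f(\cdot)$ be the spectral density. Then

(i) If $f(\omega) > 0$ for all $\omega \in [0, 2\pi]$ and $0 < \omega_1, \ldots, \omega_m < \pi$, then $\big(I_n(\omega_1)/f(\omega_1), \ldots, I_n(\omega_m)/f(\omega_m)\big)$ converges in distribution (as $n \to \infty$) to a vector of independent exponential random variables with mean one.

(ii) If in addition $E(\varepsilon_t^4) < \infty$ and $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, then for $\omega_j = 2\pi j/n$ and $\omega_k = 2\pi k/n$ we have

\[
\mathrm{cov}\big(I(\omega_k), I(\omega_j)\big) =
\begin{cases}
2f(\omega_k)^2 + O(n^{-1/2}) & \omega_j = \omega_k = 0 \text{ or } \pi, \\
f(\omega_k)^2 + O(n^{-1/2}) & 0 < \omega_j = \omega_k < \pi, \\
O(n^{-1}) & \omega_j \neq \omega_k,
\end{cases}
\]

where the bound is uniform in $\omega_j$ and $\omega_k$.

PROOF. See Brockwell and Davis (1998), Theorem 10.3.2.

Remark 8.4.1 (Summary of properties of the periodogram for linear processes)
(i) The periodogram is nonnegative and is an asymptotically unbiased estimator of the spectral density (when $\sum_j|\psi_j| < \infty$).
(ii) Like the spectral density it is symmetric about zero: $I_n(\omega) = I_n(-\omega)$.
(iii) At the fundamental frequencies $\{I(\omega_j)\}$ are asymptotically uncorrelated.
(iv) If $0 < \omega < \pi$, $I(\omega)$ is asymptotically exponentially distributed with mean $f(\omega)$.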
We see that the periodogram is extremely erratic and does not converge (in any sense) to the spectral density as $n \to \infty$. In the following section we discuss this further and consider modifications of the periodogram which lead to a consistent estimate.

8.4.2 Estimating the spectral density

There are several (pretty much equivalent) explanations as to why the raw periodogram is not a good estimator of the spectrum. Intuitively, the simplest explanation is that we have included too many covariance estimators in the estimation of $f(\omega)$. We see from (8.13) that the periodogram is the Fourier transform of the estimated covariances at $n$ different lags. Typically the variance of each covariance estimator $\hat{c}_n(k)$ will be about $O(n^{-1})$, hence roughly speaking the variance of $I_n(\omega)$ will be the sum of these $n$ $O(n^{-1})$ variances, which leads to a variance of $O(1)$, which clearly does not converge to zero. This suggests that if we use $m$ ($m \ll n$) covariances in the estimation of $f(\omega)$ rather than all $n$ (where we let $m \to \infty$), we may reduce the variance of the estimator (at the cost of introducing some bias) and obtain a good estimator of the spectral density. This indeed turns out to be the case.

Another way to approach the problem is from a nonparametric angle. We note from Theorem 8.4.1 that at the fundamental frequencies $\{I(\omega_k)\}$ can be treated as uncorrelated random variables with mean $f(\omega_k)$ and variance $f(\omega_k)^2$. Therefore we can rewrite $I(\omega_k)$ as

\[
I(\omega_k) = E(I(\omega_k)) + \big(I(\omega_k) - E(I(\omega_k))\big) \approx f(\omega_k) + f(\omega_k)U_k, \qquad k = 1, \ldots, n, \tag{8.21}
\]

where $\{U_k\}$ is approximately a mean zero, variance one sequence of uncorrelated random variables and $\omega_k = 2\pi k/n$. We note that (8.21) resembles the usual function-plus-additive-noise model often considered in nonparametric statistics. This suggests that another way to estimate the spectral density is to use a locally weighted average of $\{I(\omega_k)\}$. Interestingly, both the estimation methods mentioned above are practically the same method.
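The claim that the periodogram does not settle down as $n$ grows can be checked numerically. The sketch below is only an illustration under stated assumptions: the innovations are taken to be Gaussian white noise, the function names `periodogram` and `normalised_moments` are hypothetical, and the DFT is computed directly with the standard library rather than with an FFT. It computes $I_\varepsilon(\omega_k)/f$ with $f = \sigma^2/2\pi$ over $0 < \omega_k < \pi$ and reports the sample mean and variance of these ratios, which should both stay close to one (as for an exponential distribution with mean one) however large $n$ is.

```python
import cmath
import math
import random


def periodogram(x):
    """I(omega_k) = |sum_t x_t exp(i t omega_k)|^2 / (2 pi n) for k = 0, ..., n-1."""
    n = len(x)
    values = []
    for k in range(n):
        omega = 2.0 * math.pi * k / n
        # Time index runs over t = 1, ..., n as in the notes.
        dft = sum(x_t * cmath.exp(1j * omega * (t + 1)) for t, x_t in enumerate(x))
        values.append(abs(dft) ** 2 / (2.0 * math.pi * n))
    return values


def normalised_moments(n, sigma=1.0, seed=0):
    """Sample mean and variance of I(omega_k)/f over 0 < omega_k < pi.

    Assumption: iid N(0, sigma^2) noise, whose spectral density is sigma^2/(2 pi).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigma) for _ in range(n)]
    f = sigma ** 2 / (2.0 * math.pi)
    ratios = [value / f for value in periodogram(x)[1:n // 2]]
    mean = sum(ratios) / len(ratios)
    var = sum((r - mean) ** 2 for r in ratios) / (len(ratios) - 1)
    return mean, var


if __name__ == "__main__":
    for n in (64, 256, 512):
        mean, var = normalised_moments(n)
        # The variance of the ratios does not shrink as n increases.
        print(n, round(mean, 2), round(var, 2))
```

The point of the experiment is the second printed column: increasing $n$ produces more ordinates, but each ordinate remains as variable as before.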
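It is worth noting that Parzen (1957) first proposed a consistent method to estimate the spectral density. Furthermore, classical density estimation and spectral density estimation are very similar, and it was spectral density estimation which motivated methods to estimate the density function (one of the first papers on density estimation is Parzen (1962)). Equation (8.21) motivates the following nonparametric estimator of $f(\omega)$:

\[
\hat{f}_n(\omega_j) = \sum_{k}\frac{1}{bn}K\Big(\frac{j-k}{bn}\Big)I(\omega_k), \tag{8.22}
\]

where $K(\cdot)$ is a kernel which satisfies $\int K(x)dx = 1$ and $\int K(x)^2dx < \infty$. An example of $\hat{f}_n(\omega_j)$ is the local average about frequency $\omega_j$:

\[
\hat{f}_n(\omega_j) = \frac{1}{bn}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k).
\]

Theorem 8.4.2 Suppose $\{X_t\}$ satisfies $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$ and $E(\varepsilon_t^4) < \infty$. Let $\hat{f}_n(\omega)$ be the spectral estimator defined in (8.22). Then

\[
E(\hat{f}_n(\omega_j)) \to f(\omega_j) \tag{8.23}
\]

and

\[
\mathrm{var}(\hat{f}_n(\omega_j)) \approx
\begin{cases}
\dfrac{1}{bn}f(\omega_j)^2\displaystyle\int K(x)^2dx & 0 < \omega_j < \pi, \\[3mm]
\dfrac{2}{bn}f(\omega_j)^2\displaystyle\int K(x)^2dx & \omega_j = 0 \text{ or } \pi,
\end{cases} \tag{8.24}
\]

as $bn \to \infty$, $b \to 0$ and $n \to \infty$. (For the local average above $\int K(x)^2dx = 1$.)

PROOF. The proofs of both (8.23) and (8.24) are based on the kernel $K(x/b)$ getting narrower as $b \to 0$, hence there is more localisation as the sample size grows (just like nonparametric regression). We note that since $f(\omega) = \frac{\sigma^2}{2\pi}\big|\sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)\big|^2$, the spectral density $f$ is continuous in $\omega$. To prove (8.23) we take expectations:

\[
\big|E(\hat{f}_n(\omega_j)) - f(\omega_j)\big| = \Big|\sum_{k}\frac{1}{bn}K\Big(\frac{k}{bn}\Big)\big(E\,I(\omega_{j-k}) - f(\omega_j)\big)\Big|
\le \sum_{k}\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\big|E\,I(\omega_{j-k}) - f(\omega_{j-k})\big| + \sum_{k}\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\big|f(\omega_j) - f(\omega_{j-k})\big| := I + II.
\]

Using Lemma 8.4.1 we have

\[
I = \sum_{k}\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\big|E\,I(\omega_{j-k}) - f(\omega_{j-k})\big| \le C\Big(\sum_{k}\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\Big)\Big(\sum_{|k|\ge n}|c(k)| + \sum_{|k|\le n}\frac{|k|}{n}|c(k)|\Big) \to 0.
\]

Now we consider

\[
II = \sum_{k}\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\big|f(\omega_j) - f(\omega_{j-k})\big|.
\]

Since the spectral density $f(\omega)$ is continuous, we have $II \to 0$ as $bn \to \infty$, $b \to 0$ and $n \to \infty$. The above two bounds give (8.23). We will use Theorem 8.4.1 to prove (8.24). We first assume that $\omega_j \neq 0$ or $\pi$.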
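Evaluating the variance using Theorem 8.4.1 we have

\[
\begin{aligned}
\mathrm{var}(\hat{f}_n(\omega_j)) &= \sum_{k_1,k_2}\frac{1}{(bn)^2}K\Big(\frac{j-k_1}{bn}\Big)K\Big(\frac{j-k_2}{bn}\Big)\mathrm{cov}\big(I(\omega_{k_1}), I(\omega_{k_2})\big) \\
&= \sum_{k}\frac{1}{(bn)^2}K\Big(\frac{j-k}{bn}\Big)^2\mathrm{var}(I(\omega_k)) + O\Big(\frac{1}{n}\Big) \\
&= \sum_{k}\frac{1}{(bn)^2}K\Big(\frac{j-k}{bn}\Big)^2f(\omega_k)^2 + O\Big(\frac{1}{n} + \frac{1}{bn\,n^{1/2}}\Big) \approx \frac{1}{bn}f(\omega_j)^2\int K(x)^2dx.
\end{aligned}
\]

A similar proof can be used for the case $\omega_j = 0$ or $\pi$.

The above result means that the mean squared error of the estimator satisfies

\[
E\big(\hat{f}_n(\omega_j) - f(\omega_j)\big)^2 \to 0
\]

as $bn \to \infty$, $b \to 0$ and $n \to \infty$. Moreover

\[
E\big(\hat{f}_n(\omega_j) - f(\omega_j)\big)^2 = \mathrm{var}(\hat{f}_n(\omega_j)) + \big(E(\hat{f}_n(\omega_j)) - f(\omega_j)\big)^2
= O\Big(\frac{1}{bn}\Big) + O\Big(\sum_{|k|\ge n}|c(k)| + \sum_{|k|\le n}\frac{|k|}{n}|c(k)| + II\Big)^2.
\]

Hence the rate of convergence depends on the bias of the estimator, in particular on the rate of decay of the covariances. If the covariances decay exponentially (as in the case of ARMA processes), the contribution of term $I$ to the bias is extremely small and, provided the smoothing term $II$ is also negligible, $E\big(\hat{f}_n(\omega_j) - f(\omega_j)\big)^2 = O(\frac{1}{bn})$. There are several examples of kernels that one can use and each has its own optimality property. An interesting discussion on this is given in Priestley (1983), Chapter 6.

As mentioned briefly above, we can also estimate the spectrum by truncating the number of covariances estimated. We recall that

\[
I_X(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega).
\]

Hence a viable estimator of the spectrum is

\[
\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\hat{c}_n(k)\exp(ik\omega),
\]

where $\lambda(\cdot)$ is a weight function which gives very little weight (or no weight) to the covariances at large lags. A useful example, using the rectangular function $\lambda(x) = 1$ if $|x| \le 1$ and zero otherwise, is

\[
\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-m}^{m}\hat{c}_n(k)\exp(ik\omega),
\]

where $m \ll n$. Now $\tilde{f}_n(\omega)$ has similar properties to $\hat{f}_n(\omega)$, with $1/m$ playing the same role as the bandwidth $b$ (a small $m$ corresponds to a wide smoothing window). Indeed there is a very close relationship between the two, which can be seen by using (8.1). Substituting $\hat{c}_n(k) = \int_0^{2\pi}I_n(\theta)\exp(-ik\theta)d\theta$ into $\tilde{f}_n$ gives

\[
\tilde{f}_n(\omega) = \frac{1}{2\pi}\int_0^{2\pi}I_n(\theta)\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\exp(ik(\omega - \theta))d\theta = \int_0^{2\pi}I_n(\theta)W_m(\omega - \theta)d\theta,
\]

where $W_m(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\big(\frac{k}{m}\big)\exp(ik\omega)$.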
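The two estimators discussed above can be compared on simulated data. The following is a minimal sketch, not a definitive implementation: it assumes a Gaussian AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$ with unit innovation variance (so the true spectral density is known in closed form), uses the rectangular kernel and the rectangular lag window, and all function names are hypothetical.

```python
import cmath
import math
import random


def simulate_ar1(n, phi, seed=0, burn_in=200):
    """Simulate X_t = phi X_{t-1} + eps_t with iid N(0, 1) innovations (an assumption)."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for t in range(n + burn_in):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            x.append(prev)
    return x


def periodogram_ordinate(x, k):
    """I(omega_k) with omega_k = 2 pi k / n."""
    n = len(x)
    omega = 2.0 * math.pi * k / n
    dft = sum(x_t * cmath.exp(1j * omega * (t + 1)) for t, x_t in enumerate(x))
    return abs(dft) ** 2 / (2.0 * math.pi * n)


def local_average(x, j, half_width):
    """Rectangular-kernel version of (8.22): average of I(omega_k) for |k - j| <= half_width.

    Choose j so that the window stays away from frequency zero (no mean correction here).
    """
    ks = range(j - half_width, j + half_width + 1)
    return sum(periodogram_ordinate(x, k) for k in ks) / len(ks)


def sample_autocovariance(x, k):
    """c_hat_n(k) = (1/n) sum_t (x_t - mean)(x_{t+|k|} - mean)."""
    n = len(x)
    mean = sum(x) / n
    k = abs(k)
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n


def lag_window_estimate(x, omega, m):
    """Truncated estimator (1/2pi) sum_{|k| <= m} c_hat_n(k) exp(i k omega)."""
    total = sample_autocovariance(x, 0)
    for k in range(1, m + 1):
        # c_hat_n(k) = c_hat_n(-k), so the complex exponentials combine into cosines.
        total += 2.0 * sample_autocovariance(x, k) * math.cos(k * omega)
    return total / (2.0 * math.pi)


def ar1_spectral_density(omega, phi, sigma2=1.0):
    """Spectral density of the AR(1) process above."""
    return sigma2 / (2.0 * math.pi * (1.0 - 2.0 * phi * math.cos(omega) + phi ** 2))


if __name__ == "__main__":
    n, phi = 2048, 0.5
    x = simulate_ar1(n, phi, seed=1)
    j = n // 4                      # omega_j = pi / 2
    omega = 2.0 * math.pi * j / n
    print("true f        ", ar1_spectral_density(omega, phi))
    print("raw ordinate  ", periodogram_ordinate(x, j))
    print("local average ", local_average(x, j, 50))
    print("lag window    ", lag_window_estimate(x, omega, 15))
```

In this sketch the local average uses about $bn = 101$ ordinates and the lag window uses $m = 15$ covariances; both are typically much closer to the true value than the single raw ordinate.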
Now $W_m(\omega)$ and $\frac{1}{b}K(\frac{\omega}{b})$ (defined in (8.22)) are not necessarily the same function, but they share many of the same characteristics. In fact both estimators are asymptotically normal (we discuss this in the remark below).

Remark 8.4.2 (The distribution of the spectral density estimator) Using that $2I_n(\omega)/f(\omega)$ is asymptotically $\chi^2(2)$ distributed and that the periodogram is asymptotically uncorrelated at the fundamental frequencies, we can deduce the approximate distribution of $\hat{f}_n(\omega)$. To obtain the distribution consider the example

\[
\hat{f}_n(\omega_j) = \frac{1}{bn}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k).
\]

Since $2I(\omega_k)/f(\omega_k)$ are approximately $\chi^2(2)$, and since the sum $\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k)$ is taken over a local neighbourhood of $\omega_j$, we have that $2f(\omega_j)^{-1}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k)$ is approximately $\chi^2(2bn)$. Extending this argument to arbitrary kernels, $2bn\hat{f}_n(\omega_j)/f(\omega_j)$ is approximately $\chi^2(2bn)$. We note that when $bn$ is large, $\chi^2(2bn)$ is close to normal with mean $2bn$ and variance $4bn$. Hence $2bn\hat{f}_n(\omega_j)/f(\omega_j)$ is approximately normal with mean $2bn$ and variance $4bn$. Therefore

\[
\sqrt{bn}\big(\hat{f}_n(\omega_j) - f(\omega_j)\big) \approx N\big(0, f(\omega_j)^2\big).
\]

Using this approximation, we can construct confidence intervals for $f(\omega_j)$.

8.5 The Whittle Likelihood

In Chapter 4 we considered various methods for estimating the parameters of an ARMA process. The most efficient method (when the errors are Gaussian) is the Gaussian maximum likelihood estimator (GMLE). This estimator is defined in the time domain, but it is interesting to note that a very similar estimator, which is asymptotically equivalent to the GMLE, can be defined in the frequency domain. We first define the estimator using heuristics to justify it. We then show how it is related to the GMLE (it is the frequency domain approximation of the time domain estimator). First let us suppose that we observe $\{X_t\}_{t=1}^{n}$, where

\[
X_t = \sum_{j=1}^{p}\phi_j^{(0)}X_{t-j} + \sum_{j=1}^{q}\theta_j^{(0)}\varepsilon_{t-j} + \varepsilon_t,
\]

and $\{\varepsilon_t\}$ are iid random variables.
As before we will assume that $\phi_0 = \{\phi_j^{(0)}\}$ and $\theta_0 = \{\theta_j^{(0)}\}$ are such that the roots of the characteristic polynomials have absolute value greater than $1 + \delta$. Let us define the discrete Fourier transform

\[
J_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp(it\omega).
\]

We will consider $\omega$ at the fundamental frequencies $\omega_k = \frac{2\pi k}{n}$. As we mentioned in Section 8.2 we have

\[
\mathrm{cov}\big(J_n(\omega_{k_1}), J_n(\omega_{k_2})\big) \approx
\begin{cases}
f(\omega_{k_1}) & k_1 = k_2, \\
0 & k_1 \neq k_2.
\end{cases}
\]

Hence if the innovations are Gaussian then $J_n(\omega)$ is complex Gaussian and we have approximately

\[
J_n = \begin{pmatrix} J_n(\omega_1) \\ \vdots \\ J_n(\omega_n) \end{pmatrix} \sim N\big(0, \mathrm{diag}(f(\omega_1), \ldots, f(\omega_n))\big).
\]

Therefore, since $J_n$ is a normally distributed (complex) random vector with mean zero and diagonal variance matrix $\mathrm{diag}(f(\omega_1), \ldots, f(\omega_n))$, the negative log-likelihood of $J_n$ is (up to constants) approximately

\[
L_n^w(\phi, \theta) = \sum_{k=1}^{n}\Big(\log|f_{\phi,\theta}(\omega_k)| + \frac{|J_n(\omega_k)|^2}{f_{\phi,\theta}(\omega_k)}\Big).
\]

To estimate the parameters we choose the $\phi$ and $\theta$ which minimise the above criterion, that is

\[
(\hat{\phi}_n^w, \hat{\theta}_n^w) = \arg\min_{(\phi,\theta)\in\Theta}L_n^w(\phi, \theta), \tag{8.25}
\]

where $\Theta$ consists of all parameters for which the roots of the characteristic polynomials have absolute value greater than $(1 + \delta)$. Whittle (1962) showed that the above criterion is an approximation of the GMLE. The correct proof is quite complicated and uses several matrix approximations due to Grenander and Szegő (1958). Instead we give a heuristic proof which is quite enlightening.

Remark 8.5.1 (Some properties of circulant matrices)
(i) Let us define the $n$ dimensional circulant matrix

\[
\Gamma = \begin{pmatrix}
c(0) & c(1) & c(2) & \cdots & c(n-2) & c(n-1) \\
c(n-1) & c(0) & c(1) & \cdots & c(n-3) & c(n-2) \\
\vdots & & \ddots & & & \vdots \\
c(1) & c(2) & c(3) & \cdots & c(n-1) & c(0)
\end{pmatrix}.
\]

We see that the elements in each row are the same, just a rotation of each other. The eigenvalues and eigenvectors of $\Gamma$ have interesting properties. Define $f_n(\omega) = \sum_{k=0}^{n-1}c(k)\exp(ik\omega)$; then the eigenvalues are $\{f_n(\omega_j)\}$ with corresponding eigenvectors $e_j = (1, \exp(2\pi ij/n), \ldots, \exp(2\pi ij(n-1)/n))'$. Hence, letting $E = (e_1, \ldots, e_n)$, we can write $\Gamma = E\Lambda E^{-1}$, where $\Lambda = \mathrm{diag}(f_n(\omega_1), \ldots, f_n(\omega_n))$ and $E^{-1} = n^{-1}\bar{E}'$.
(ii) You may wonder about the relevance of circulant matrices to the current setting.
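However the variance covariance matrix of a stationary process is

\[
\Sigma_n = \mathrm{var}(X_n) = \begin{pmatrix}
c(0) & c(1) & c(2) & \cdots & c(n-2) & c(n-1) \\
c(1) & c(0) & c(1) & \cdots & c(n-3) & c(n-2) \\
\vdots & & \ddots & & & \vdots \\
c(n-1) & c(n-2) & c(n-3) & \cdots & c(1) & c(0)
\end{pmatrix}. \tag{8.26}
\]

This is a Toeplitz matrix and we observe that for large $n$ it is very close to a circulant matrix; the differences are in the corners of the matrix. Hence it can be shown that for large $n$

\[
\Sigma_n \approx E\Lambda E^{-1} \tag{8.27}
\]

and $\Sigma_n^{-1} \approx E\Lambda^{-1}E^{-1}$, where $E^{-1} = n^{-1}\bar{E}'$ (and $\bar{E}$ denotes the complex conjugate of $E$). We will use the results in the remark above to prove the lemma below. We first observe that the Gaussian likelihood for the ARMA process can be written in terms of its spectrum (see 4.10):

\[
L_n(\phi, \theta) = \log\det|\Sigma(\phi, \theta)| + X_n'\Sigma(\phi, \theta)^{-1}X_n = \log\det|\Sigma(f_{\phi,\theta})| + X_n'\Sigma(f_{\phi,\theta})^{-1}X_n, \tag{8.28}
\]

where $\Sigma(f_{\phi,\theta})_{s,t} = \int_0^{2\pi}f_{\phi,\theta}(\omega)\exp(i(s-t)\omega)d\omega$ and $X_n = (X_1, \ldots, X_n)'$. We now show that $L_n(\phi, \theta) \approx L_n^w(\phi, \theta)$.

Lemma 8.5.1 Suppose that $\{X_t\}$ is a stationary ARMA time series with absolutely summable covariances and $f_{\phi,\theta}(\omega)$ is the corresponding spectral density function. Then, up to an additive constant which does not depend on the parameters,

\[
\log\det|\Sigma(f_{\phi,\theta})| + X_n'\Sigma(f_{\phi,\theta})^{-1}X_n \approx \sum_{k=1}^{n}\Big(\log|f_{\phi,\theta}(\omega_k)| + \frac{|J_n(\omega_k)|^2}{f_{\phi,\theta}(\omega_k)}\Big)
\]

for large $n$.

PROOF. We give a heuristic proof (details on the precise proof can be found in the remark below). Using (8.27) we see that $\Sigma(f_{\phi,\theta})$ can be approximately written in terms of the eigenvalues and eigenvectors of the circulant matrix associated with $\Sigma(f_{\phi,\theta})$, that is

\[
\Sigma(f_{\phi,\theta}) \approx E\Lambda(f_{\phi,\theta})E^{-1} \quad\text{and}\quad \Sigma(f_{\phi,\theta})^{-1} \approx E\Lambda(f_{\phi,\theta})^{-1}E^{-1}, \tag{8.29}
\]

where

\[
\Lambda(f_{\phi,\theta}) = 2\pi\,\mathrm{diag}\big(f_{\phi,\theta}^{(n)}(\omega_1), \ldots, f_{\phi,\theta}^{(n)}(\omega_n)\big), \qquad f_{\phi,\theta}^{(n)}(\omega) = \frac{1}{2\pi}\sum_{j=-(n-1)}^{n-1}c_{\phi,\theta}(j)\exp(ij\omega) \tag{8.30}
\]

and $\omega_k = 2\pi k/n$. A basic calculation gives $X_n'E = \sqrt{2\pi n}\,(J_n(\omega_1), \ldots, J_n(\omega_n))$. Substituting (8.29) and (8.30) into (8.28) yields

\[
\frac{1}{n}L_n(\phi, \theta) \approx \frac{1}{n}\sum_{k=1}^{n}\Big(\log f_{\phi,\theta}^{(n)}(\omega_k) + \frac{|J_n(\omega_k)|^2}{f_{\phi,\theta}^{(n)}(\omega_k)}\Big) + \log 2\pi \approx \frac{1}{n}L_n^w(\phi, \theta) + \log 2\pi. \tag{8.31}
\]

Hence by using the approximation (8.29) we have derived the Whittle likelihood. This proof was first derived by Tata Subba Rao.

Remark 8.5.2 (A rough flavour of the proof) There are various ways to precisely prove this result.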
All of them show that the Toeplitz matrix can in some sense be approximated by a circulant matrix. This result uses Szegő's identity (Grenander and Szegő (1958)). Studying the Gaussian likelihood in (8.28) we note that the Whittle likelihood has a similar representation. That is

\[
L_n^w(\phi, \theta) = \sum_{k=1}^{n}\Big(\log|f_{\phi,\theta}(\omega_k)| + \frac{|J_n(\omega_k)|^2}{f_{\phi,\theta}(\omega_k)}\Big) \approx \sum_{k=1}^{n}\log|f_{\phi,\theta}(\omega_k)| + X_n'U(f_{\phi,\theta}^{-1})X_n,
\]

where $U(f_{\phi,\theta}^{-1})_{s,t} = \frac{1}{(2\pi)^2}\int_0^{2\pi}f_{\phi,\theta}(\omega)^{-1}\exp(i(s-t)\omega)d\omega$. Hence if we can show that $|U(f_{\phi,\theta}^{-1}) - \Sigma(f_{\phi,\theta})^{-1}| \to 0$ as $n \to \infty$ (and that its derivatives with respect to $\phi$ and $\theta$ also converge), then we can show that $L_n^w(\phi, \theta)$ and $L_n(\phi, \theta)$ and their derivatives are asymptotically equivalent. Hence the GMLE and the Whittle estimator are asymptotically equivalent. The difficult part of the proof is establishing that $|U(f_{\phi,\theta}^{-1}) - \Sigma(f_{\phi,\theta})^{-1}| \to 0$ as $n \to \infty$. It is worth noting that Rainer Dahlhaus has extensively developed this area and considered several interesting generalisations (see for example Dahlhaus (1996), Dahlhaus (1997) and Dahlhaus (2000)).

We now show consistency of the estimator (without showing that it is equivalent to the GMLE). To simplify calculations we slightly modify the estimator and exchange the sum in (8.25) with an integral to obtain

\[
L_n^w(\phi, \theta) = \int_0^{2\pi}\Big(\log|f_{\phi,\theta}(\omega)| + \frac{I_n(\omega)}{f_{\phi,\theta}(\omega)}\Big)d\omega.
\]

To estimate the parameters we choose the $\phi$ and $\theta$ which minimise the above criterion, that is

\[
(\hat{\phi}_n^w, \hat{\theta}_n^w) = \arg\min_{(\phi,\theta)\in\Theta}L_n^w(\phi, \theta). \tag{8.32}
\]

Lemma 8.5.2 (Consistency) Let us suppose that $d_k(\phi, \theta) = \int_0^{2\pi}f_{\phi,\theta}(\omega)^{-1}\exp(ik\omega)d\omega$ and $\sum_k|d_k(\phi, \theta)| < \infty$. Let $(\hat{\phi}_n^w, \hat{\theta}_n^w)$ be defined as in (8.32). Then we have

\[
(\hat{\phi}_n^w, \hat{\theta}_n^w) \xrightarrow{P} (\phi_0, \theta_0).
\]

PROOF. Recall that to show consistency we need to show pointwise convergence of $L_n^w$ and equicontinuity. First we show pointwise convergence by evaluating the variance of $L_n^w$ and showing that it converges to zero as $n \to \infty$. We first note that by using $d_k(\phi, \theta) = \int_0^{2\pi}f_{\phi,\theta}(\omega)^{-1}\exp(ik\omega)d\omega$ and $I_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega)$ we can write $L_n^w$ as

\[
L_n^w(\phi, \theta) = \int_0^{2\pi}\log|f_{\phi,\theta}(\omega)|d\omega + \frac{1}{2\pi n}\sum_{r=-(n-1)}^{n-1}d_r(\phi, \theta)\sum_{k=1}^{n-|r|}X_kX_{k+|r|}.
\]
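Therefore taking the variance gives

\[
\mathrm{var}(L_n^w(\phi, \theta)) = \frac{1}{(2\pi n)^2}\sum_{r_1,r_2=-(n-1)}^{n-1}d_{r_1}(\phi, \theta)d_{r_2}(\phi, \theta)\sum_{k_1=1}^{n-|r_1|}\sum_{k_2=1}^{n-|r_2|}\mathrm{cov}(X_{k_1}X_{k_1+r_1}, X_{k_2}X_{k_2+r_2}). \tag{8.33}
\]

We note that

\[
\mathrm{cov}(X_{k_1}X_{k_1+r_1}, X_{k_2}X_{k_2+r_2}) = \mathrm{cov}(X_{k_1}, X_{k_2})\mathrm{cov}(X_{k_1+r_1}, X_{k_2+r_2}) + \mathrm{cov}(X_{k_1}, X_{k_2+r_2})\mathrm{cov}(X_{k_1+r_1}, X_{k_2}) + \mathrm{cum}(X_{k_1}, X_{k_1+r_1}, X_{k_2}, X_{k_2+r_2}).
\]

Now we note that $\sum_{k_2}|\mathrm{cov}(X_{k_1}, X_{k_2})| < \infty$ and $\sum_{k_2}|\mathrm{cum}(X_{k_1}, X_{k_1+r_1}, X_{k_2}, X_{k_2+r_2})| < \infty$; substituting this and $\sum_k|d_k(\phi, \theta)| < \infty$ into (8.33), we have

\[
\mathrm{var}(L_n^w(\phi, \theta)) = O\Big(\frac{1}{n}\Big),
\]

hence $\mathrm{var}(L_n^w(\phi, \theta)) \to 0$ as $n \to \infty$. Define

\[
L^w(\phi, \theta) = \lim_{n\to\infty}E(L_n^w(\phi, \theta)) = \int_0^{2\pi}\Big(\log f_{\phi,\theta}(\omega) + \frac{f_{\phi_0,\theta_0}(\omega)}{f_{\phi,\theta}(\omega)}\Big)d\omega.
\]

Hence, since $\mathrm{var}(L_n^w(\phi, \theta)) \to 0$, we have

\[
L_n^w(\phi, \theta) \xrightarrow{P} L^w(\phi, \theta).
\]

To show equicontinuity we apply the mean value theorem to $L_n^w$. We note that because the parameters $(\phi, \theta) \in \Theta$ have characteristic polynomials whose roots are greater than $(1 + \delta)$ in absolute value, $f_{\phi,\theta}(\omega)$ is bounded away from zero (indeed there exists a $\delta^* > 0$ such that $\inf_{\omega,(\phi,\theta)\in\Theta}f_{\phi,\theta}(\omega) \ge \delta^*$). Hence it can be shown that there exists a random sequence $\{K_n\}$ such that $|L_n^w(\phi_1, \theta_1) - L_n^w(\phi_2, \theta_2)| \le K_n\|(\phi_1 - \phi_2), (\theta_1 - \theta_2)\|$ and $K_n$ converges almost surely to a finite constant as $n \to \infty$. Therefore $L_n^w$ is stochastically equicontinuous (and equicontinuous in probability). Since the parameter space $\Theta$ is compact, all three conditions in Section 5.6 are satisfied and we have consistency of the Whittle estimator.

We now show asymptotic normality of the Whittle estimator and in the following remark show its relationship to the GMLE.

Theorem 8.5.1 Let us suppose that $d_k(\phi, \theta) = \int_0^{2\pi}f_{\phi,\theta}(\omega)^{-1}\exp(ik\omega)d\omega$ and $\sum_k|d_k(\phi, \theta)| < \infty$. Let $(\hat{\phi}_n^w, \hat{\theta}_n^w)$ be defined as in (8.32). Then

\[
\sqrt{n}\big((\hat{\phi}_n^w - \phi_0), (\hat{\theta}_n^w - \theta_0)\big) \xrightarrow{D} N(0, V^{-1} + V^{-1}WV^{-1}),
\]

where

\[
V = \frac{1}{2\pi}\int_0^{2\pi}\frac{\nabla f_{\phi_0,\theta_0}(\omega)\big(\nabla f_{\phi_0,\theta_0}(\omega)\big)'}{f_{\phi_0,\theta_0}(\omega)^2}d\omega,
\]
\[
W = \frac{1}{2\pi}\int_0^{2\pi}\int_0^{2\pi}\nabla f_{\phi_0,\theta_0}(\omega_1)^{-1}\big(\nabla f_{\phi_0,\theta_0}(\omega_2)^{-1}\big)'f_{4,\phi_0,\theta_0}(\omega_1, -\omega_1, \omega_2)d\omega_1d\omega_2,
\]

and $f_{4,\phi_0,\theta_0}(\omega_1, \omega_2, \omega_3) = \kappa_4A(\omega_1)A(\omega_2)A(\omega_3)A(-\omega_1 - \omega_2 - \omega_3)$ is the fourth order spectrum corresponding to the ARMA process with $A(\omega) = \frac{\theta_0(\exp(i\omega))}{\phi_0(\exp(i\omega))}$.

PROOF. See, for example, Brockwell and Davis (1998), Chapter 10.8.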
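To make the criterion (8.25) concrete, the following sketch fits an AR(1) model by minimising the Whittle likelihood. It is an illustration under several assumptions that are not part of the notes: the data are simulated from a Gaussian AR(1) process, the innovation variance is treated as known and equal to one, the sum is taken over the fundamental frequencies in $(0, \pi)$ only (by symmetry of the periodogram), and the minimisation is a plain grid search. The function names are hypothetical.

```python
import cmath
import math
import random


def simulate_ar1(n, phi, seed=0, burn_in=200):
    """Simulate X_t = phi X_{t-1} + eps_t with iid N(0, 1) innovations (an assumption)."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for t in range(n + burn_in):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            x.append(prev)
    return x


def half_periodogram(x):
    """Return (omega_k, I_n(omega_k)) for the fundamental frequencies 0 < omega_k < pi."""
    n = len(x)
    omegas, values = [], []
    for k in range(1, n // 2):
        omega = 2.0 * math.pi * k / n
        dft = sum(x_t * cmath.exp(1j * omega * (t + 1)) for t, x_t in enumerate(x))
        omegas.append(omega)
        values.append(abs(dft) ** 2 / (2.0 * math.pi * n))
    return omegas, values


def whittle_criterion(phi, omegas, pgram, sigma2=1.0):
    """sum_k [ log f_phi(omega_k) + I_n(omega_k) / f_phi(omega_k) ] for an AR(1) model."""
    total = 0.0
    for omega, i_k in zip(omegas, pgram):
        f = sigma2 / (2.0 * math.pi * (1.0 - 2.0 * phi * math.cos(omega) + phi ** 2))
        total += math.log(f) + i_k / f
    return total


def whittle_ar1(x):
    """Grid-search minimiser of the Whittle criterion over phi in [-0.95, 0.95]."""
    omegas, pgram = half_periodogram(x)
    grid = [k / 100.0 for k in range(-95, 96)]
    return min(grid, key=lambda phi: whittle_criterion(phi, omegas, pgram))


if __name__ == "__main__":
    x = simulate_ar1(512, 0.6, seed=2)
    print("Whittle estimate of phi:", whittle_ar1(x))
```

A grid search is used only to keep the example dependency-free; in practice a numerical optimiser would be applied to the same criterion, and the innovation variance would be estimated as well.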
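Remark 8.5.3 (i) It is interesting to note that in the case that $\{X_t\}$ comes from a linear time series (such as an ARMA process), using $f_{4,\phi_0,\theta_0}(\omega_1, -\omega_1, \omega_2) = \kappa_4|A(\omega_1)|^2|A(\omega_2)|^2 = Cf(\omega_1)f(\omega_2)$ with $C = (2\pi)^2\kappa_4/\sigma^4$ (for linear processes) we have

\[
\begin{aligned}
W &= \frac{1}{2\pi}\int_0^{2\pi}\int_0^{2\pi}\nabla f_{\phi_0,\theta_0}(\omega_1)^{-1}\big(\nabla f_{\phi_0,\theta_0}(\omega_2)^{-1}\big)'f_{4,\phi_0,\theta_0}(\omega_1, -\omega_1, \omega_2)d\omega_1d\omega_2 \\
&= \frac{C}{2\pi}\Big(\int_0^{2\pi}\frac{\nabla f_{\phi_0,\theta_0}(\omega)}{f_{\phi_0,\theta_0}(\omega)^2}f_{\phi_0,\theta_0}(\omega)d\omega\Big)\Big(\int_0^{2\pi}\frac{\nabla f_{\phi_0,\theta_0}(\omega)}{f_{\phi_0,\theta_0}(\omega)^2}f_{\phi_0,\theta_0}(\omega)d\omega\Big)' \\
&= \frac{C}{2\pi}\Big(\nabla\int_0^{2\pi}\log f_{\phi_0,\theta_0}(\omega)d\omega\Big)\Big(\nabla\int_0^{2\pi}\log f_{\phi_0,\theta_0}(\omega)d\omega\Big)' \\
&= \frac{C}{2\pi}\Big(\nabla\,2\pi\log\frac{\sigma^2}{2\pi}\Big)\Big(\nabla\,2\pi\log\frac{\sigma^2}{2\pi}\Big)' = 0,
\end{aligned}
\]

where we note that $\int_0^{2\pi}\log f_{\phi,\theta}(\omega)d\omega = 2\pi\log\frac{\sigma^2}{2\pi}$ by Kolmogorov's formula, which does not depend on $(\phi, \theta)$. Hence for linear processes the higher order cumulant plays no role and the above theorem reduces to

\[
\sqrt{n}\big((\hat{\phi}_n^w - \phi_0), (\hat{\theta}_n^w - \theta_0)\big) \xrightarrow{D} N(0, V^{-1}).
\]

(ii) Since the GMLE and the Whittle likelihood are asymptotically equivalent they should lead to the same asymptotic distributions. We recall that the GMLE has the asymptotic distribution $\sqrt{n}(\hat{\phi}_n - \phi_0, \hat{\theta}_n - \theta_0) \xrightarrow{D} N(0, \sigma_0^2\Lambda^{-1})$, where

\[
\Lambda = \begin{pmatrix} E(U_tU_t') & E(V_tU_t') \\ E(U_tV_t') & E(V_tV_t') \end{pmatrix}
\]

and $\{U_t\}$ and $\{V_t\}$ are autoregressive processes which satisfy $\phi_0(B)U_t = \varepsilon_t$ and $\theta_0(B)V_t = \varepsilon_t$. It can be shown that

\[
\begin{pmatrix} E(U_tU_t') & E(V_tU_t') \\ E(U_tV_t') & E(V_tV_t') \end{pmatrix} = \frac{1}{2\pi}\int_0^{2\pi}\frac{\nabla f_{\phi_0,\theta_0}(\omega)\big(\nabla f_{\phi_0,\theta_0}(\omega)\big)'}{f_{\phi_0,\theta_0}(\omega)^2}d\omega.
\]

Chapter 9 Nonlinear Time Series

So far we have focused on linear time series, that is, time series which have the representation

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

where $\{\varepsilon_t\}$ are iid random variables. Such models are extremely useful and are used widely in several applications. However, a typical realisation from a linear time series will be quite regular, with no sudden bursts or jumps. This is due to the linearity of the system. If one looks at financial data, for example, there are sudden bursts of volatility and extreme values, which calm down after a while. It is not possible to model such behaviour well with a linear time series. In order to capture this nonlinear behaviour several nonlinear models have been proposed. The models typically consist of products of random variables, which help to explain the sudden erratic bursts in the data.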
Over the past 30 years there has been a lot of research into nonlinear time series models. Popular nonlinear models include the bilinear model, (G)ARCH-type models, random autoregressive coefficient models and threshold models, to name but a few (see, for example, Subba Rao (1977), Granger and Andersen (1978), Nicholls and Quinn (1982), Engle (1982), Subba Rao and Gabr (1984), Bollerslev (1986), Terdik (1999), Fan and Yao (2003) and Straumann (2005)). In this chapter we will focus on the ARCH model, which is extremely popular in financial time series (and its closely related cousin the GARCH model). Before fitting a nonlinear model it is important to establish whether it is worth fitting a nonlinear time series model to the data. In the second part of the chapter we consider a test for linearity of the time series.

9.1 The ARCH model

ARCH-type processes are often used to model the volatilities in financial markets. They are an example of a nonlinear stochastic process. The ARCH model was first proposed by Engle (1982), and since its conception various different ARCH flavours have been proposed. These include the benchmark GARCH process (Bollerslev (1986)), the EGARCH, IGARCH and AGARCH models, to name but a few. In this section we will focus on the original ARCH model. $\{X_t\}$ is called an ARCH($p$) process if it satisfies

\[
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2, \tag{9.1}
\]

where $\{Z_t\}$ are iid random variables with $E(Z_t) = 0$, $a_0 > 0$, $a_j > 0$ for $j = 1, \ldots, p$ and $\sum_{j=1}^{p}a_j = \rho < 1$. Excellent references for the properties of stationary ARCH processes and estimation are Giraitis et al. (2000) and Straumann (2005).

9.1.1 Some properties of the ARCH process

By expanding $X_t^2$ as a Volterra series expansion (a nonlinear generalisation of the moving-average process) we obtain the following theorem.
Theorem 9.1.1 Suppose $\{X_t\}$ is an ARCH($p$) process and $E(Z_t^2) = 1$. Then the series

\[
X_t^2 = a_0Z_t^2 + \sum_{k\ge 1}m_t(k), \tag{9.2}
\]

where

\[
m_t(k) = \sum_{j_1,\ldots,j_k\ge 1}a_0\Big(\prod_{r=1}^{k}a_{j_r}\Big)\prod_{r=0}^{k}Z^2_{t-\sum_{s=0}^{r}j_s} \qquad (j_0 = 0,\ a_j = 0 \text{ for } j > p),
\]

converges almost surely, has a finite mean and is the unique, stationary, ergodic solution of (9.1).

PROOF. A formal expansion of (9.1) gives (9.2). We first show that $X_t^2$ is well-defined. Since (9.2) is a sum of positive random variables and the coefficients are also positive, we need only show that the expectation of (9.2) is finite. By using $E(Z_t^2) = 1$, $\sum_{j=1}^{p}a_j = \rho < 1$ and the monotone convergence theorem we obtain $E(m_t(k)) = a_0\rho^k$, and hence a finite bound for the expectation of (9.2). Since $\{Z_t\}$ are iid random variables, we notice that $X_t^2 = g(Z_t, Z_{t-1}, \ldots)$ where

\[
g(x_t, x_{t-1}, \ldots) = a_0x_t^2 + \sum_{k=1}^{\infty}\sum_{j_1,\ldots,j_k\ge 1}a_0\Big(\prod_{r=1}^{k}a_{j_r}\Big)\prod_{r=0}^{k}\big(B^{\sum_{s=0}^{r}j_s}x_t\big)^2 \qquad (j_0 = 0)
\]

and $B^kx_t = x_{t-k}$. Therefore $g(\cdot)$ is a time-invariant function. Thus by using Theorem 5.2.1, $\{X_t^2\}$ is a stationary, ergodic process.

To show uniqueness of $X_t^2$ we must show that any other solution is equal to $X_t^2$ with probability one. Suppose $Y_t^2$ is another (stationary, finite mean) solution of (9.1). By recursively applying relation (9.1) $r$ times to $Y_t^2$ we have

\[
Y_t^2 = a_0Z_t^2 + \sum_{k=1}^{r-1}m_t(k) + B_r, \qquad\text{where}\qquad B_r = \sum_{j_r<\ldots<j_0=t}\Big(\prod_{i=1}^{r}a_{j_{i-1}-j_i}\Big)Y_{j_r}^2\prod_{i=0}^{r-1}Z_{j_i}^2
\]

(here the $j_i$ denote time points rather than lags). Thus the difference between $Y_t^2$ and $X_t^2$ is

\[
X_t^2 - Y_t^2 = A_r - B_r, \qquad\text{where}\qquad A_r = \sum_{k\ge r}m_t(k).
\]

We now show that for any $\epsilon > 0$, $\sum_{r=1}^{\infty}P(|A_r - B_r| > \epsilon) < \infty$ (since this implies by the Borel-Cantelli Lemma that the event $\{|A_r - B_r| > \epsilon\}$ can only occur finitely often with probability one; if this is true for all $\epsilon > 0$, we have $|A_r - B_r| \xrightarrow{a.s.} 0$). By using $E(Z_t^2) = 1$ and $\sum_{j=1}^{p}a_j = \rho$ we have $E(A_r) \le C\rho^r$. Furthermore $Y_{j_r}^2$ and $\prod_{i=0}^{r-1}Z_{j_i}^2$ are independent (if $i < r$ then $j_i > j_r$). Therefore $E\big(Y_{j_r}^2\prod_{i=0}^{r-1}Z_{j_i}^2\big) = E(Y_{j_r}^2)$ and we have

\[
E(B_r) = \sum_{j_r<\ldots<j_0=t}\Big(\prod_{i=1}^{r}a_{j_{i-1}-j_i}\Big)E(Y_{j_r}^2) \le \sup_sE(Y_s^2)\rho^r.
\]

Now by using the Markov inequality we have $P(A_r > \epsilon) \le C_1\rho^r/\epsilon$ and $P(B_r > \epsilon) \le C_1\rho^r/\epsilon$ for some constant $C_1$.
Therefore $P(|A_r - B_r| > \epsilon) \le C_2\rho^r/\epsilon$. Thus $\sum_{r=1}^{\infty}P(|A_r - B_r| > \epsilon) < \infty$. Since this is true for all $\epsilon > 0$, we have $Y_t^2 = X_t^2$ almost surely and therefore the required result.

We first observe that since $X_t = \sigma_tZ_t$ we have $\mathrm{cov}(\sigma_tZ_t, \sigma_sZ_s) = 0$ for $s \neq t$. Hence the ARCH process is an uncorrelated but dependent process. In some sense the ARCH model can be considered as a generalisation of the AR model. That is, the squares of the ARCH model satisfy

\[
X_t^2 = \sigma_t^2Z_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + (Z_t^2 - 1)\sigma_t^2. \tag{9.3}
\]

We observe that since $\sum_{j=1}^{p}|a_j| < 1$ the characteristic polynomial $a(z) = 1 - \sum_{j=1}^{p}a_jz^j$ has roots outside the unit circle. Moreover $\epsilon_t = (Z_t^2 - 1)\sigma_t^2$ are martingale differences (since $E((Z_t^2 - 1)\sigma_t^2 \mid X_{t-1}, X_{t-2}, \ldots) = \sigma_t^2E(Z_t^2 - 1) = 0$), hence $\mathrm{cov}(\epsilon_t, \epsilon_s) = 0$ for $s \neq t$. In many respects (9.3) is similar to an AR representation except that $\{\epsilon_t\}$ are martingale differences and not iid random variables.

We now obtain the best predictor of $X_t^2$ given $X_{t-1}, \ldots$. We first note that the best predictor of $X_t^2$ in the sigma algebra $\mathcal{F}_{t-1} = \sigma(X_{t-1}, \ldots)$ is the random variable $E(X_t^2 \mid \mathcal{F}_{t-1}) \in \mathcal{F}_{t-1}$, since it minimises the mean squared error $\min_{Y\in\mathcal{F}_{t-1}}E(X_t^2 - Y)^2$. Hence the conditional expectation is

\[
E\big(X_t^2 \mid X_{t-1}, \ldots\big) = E\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + (Z_t^2 - 1)\sigma_t^2 \,\Big|\, X_{t-1}, \ldots\Big) = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2.
\]

We mention that usually the best predictor gives a smaller mean squared error than the best linear predictor. However in this case, since the best predictor is a linear combination of $X_{t-1}^2, \ldots$, it is also the best linear predictor based on the past squares. Using (9.3) we see that by taking expectations we have

\[
E(X_t^2) = a_0 + \sum_{j=1}^{p}a_jE(X_{t-j}^2) \quad\Rightarrow\quad E(X_t^2) = \frac{a_0}{1 - \sum_{j=1}^{p}a_j}.
\]

Moreover, by using (9.2) it can be shown that $E(X_t^{2n})$ is finite if $E(Z_t^{2n})^{1/n}\sum_{j=1}^{p}a_j < 1$ (for $p = 1$ this condition is also necessary). We see this places quite a large restriction on the innovations $\{Z_t\}$ and the parameters, and in general only a few moments of the ARCH process tend to exist.
This, however, fits with empirical observations, where it is believed that financial data are often thick tailed.

9.2 The quasi-maximum likelihood for ARCH processes

In this section we consider an estimator of the parameters $a_0 = \{a_j : j = 0, \ldots, p\}$ given the observations $\{X_t : t = 1, \ldots, n\}$, where $\{X_t\}$ is an ARCH($p$) process. We use the conditional log-likelihood to construct the estimator. We will assume throughout that $E(Z_t^2) = 1$ and $\sum_{j=1}^{p}a_j = \rho < 1$.

We now construct an estimator of the ARCH parameters based on $Z_t \sim N(0, 1)$. It is worth mentioning that despite the criterion being constructed under this condition, it is not necessary that the innovations $Z_t$ are normally distributed. In fact, in the case that the innovations are not normally distributed but have a finite fourth moment, the estimator is still good. This is why it is called the quasi-maximum likelihood, rather than the maximum likelihood (similar to how the GMLE estimates the parameters of an ARMA model regardless of whether the innovations are Gaussian or not).

Let us suppose that $Z_t$ is Gaussian. Since $Z_t = X_t/\sqrt{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}$, $E(X_t \mid X_{t-1}, \ldots, X_{t-p}) = 0$ and $\mathrm{var}(X_t \mid X_{t-1}, \ldots, X_{t-p}) = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2$, the negative log density of $X_t$ given $X_{t-1}, \ldots, X_{t-p}$ is (up to an additive constant and a factor of $1/2$)

\[
\log\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big) + \frac{X_t^2}{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}.
\]

Therefore the corresponding conditional log density of $X_{p+1}, X_{p+2}, \ldots, X_n$ given $X_1, \ldots, X_p$ is

\[
\sum_{t=p+1}^{n}\Big(\log\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big) + \frac{X_t^2}{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}\Big).
\]

This inspires the conditional log-likelihood

\[
L_n(\alpha) = \frac{1}{n-p}\sum_{t=p+1}^{n}\Big(\log\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2\Big) + \frac{X_t^2}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2}\Big).
\]

To obtain the estimator we define the parameter space

\[
\Theta = \Big\{\alpha = (\alpha_0, \ldots, \alpha_p) : \sum_{j=1}^{p}\alpha_j \le 1,\ 0 < c_1 \le \alpha_0 \le c_2 < \infty,\ c_1 \le \alpha_j\Big\}
\]

and assume the true parameters lie in its interior, $a = (a_0, \ldots, a_p) \in \mathrm{Int}(\Theta)$. We let

\[
\hat{a}_n = \arg\min_{\alpha\in\Theta}L_n(\alpha). \tag{9.4}
\]

9.2.1 Consistency of the quasi-maximum likelihood estimator

In this section we will show consistency and asymptotic normality of the estimator $\hat{a}_n$.
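Before turning to the theory, the estimator (9.4) can be tried out on simulated data. The sketch below is only illustrative and rests on assumptions not made in the notes: it treats the ARCH(1) case with Gaussian innovations, fixes the bounds of the parameter space at hypothetical values ($c_1 = 0.05$, $c_2 = 2$, $\alpha_1 \le 0.95$), and minimises $L_n(\alpha)$ by a coarse grid search rather than a proper optimiser. The function names are hypothetical.

```python
import math
import random


def simulate_arch1(n, a0, a1, seed=0, burn_in=500):
    """Simulate X_t = sigma_t Z_t, sigma_t^2 = a0 + a1 X_{t-1}^2, with Z_t iid N(0, 1)."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for t in range(n + burn_in):
        sigma2 = a0 + a1 * prev ** 2
        prev = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            x.append(prev)
    return x


def qml_criterion(alpha0, alpha1, x):
    """L_n(alpha) for p = 1: average of log(s_t) + X_t^2 / s_t, s_t = alpha0 + alpha1 X_{t-1}^2."""
    total = 0.0
    for t in range(1, len(x)):
        s = alpha0 + alpha1 * x[t - 1] ** 2
        total += math.log(s) + x[t] ** 2 / s
    return total / (len(x) - 1)


def qml_arch1(x, c1=0.05, c2=2.0, step=0.05):
    """Grid-search minimiser of L_n over c1 <= alpha0 <= c2 and c1 <= alpha1 <= 0.95."""
    alpha0_grid = [c1 + i * step for i in range(int(round((c2 - c1) / step)) + 1)]
    alpha1_grid = [c1 + i * step for i in range(int(round((0.95 - c1) / step)) + 1)]
    best, best_value = None, float("inf")
    for alpha0 in alpha0_grid:
        for alpha1 in alpha1_grid:
            value = qml_criterion(alpha0, alpha1, x)
            if value < best_value:
                best, best_value = (alpha0, alpha1), value
    return best


if __name__ == "__main__":
    a0, a1 = 1.0, 0.4
    x = simulate_arch1(2000, a0, a1, seed=3)
    mean_square = sum(v * v for v in x) / len(x)
    print("sample mean of X_t^2:", mean_square, "theory:", a0 / (1.0 - a1))
    print("QML estimate (alpha0, alpha1):", qml_arch1(x))
```

The lower bound $c_1$ on the grid mirrors the role of the parameter space $\Theta$ in the text, where the parameters are bounded away from zero.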
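As mentioned above, very few moments of the ARCH process exist (the more moments that exist, the more restricted the parameters $a_0$ are). Hence we want to prove the results under weak moment conditions. We will see below that the choice of the parameter space $\Theta$ (where the parameters are bounded from below) helps reduce the number of moments required. We mention that the results proved here are not under the weakest conditions, and that ARCH models are a subset of GARCH models. Consistency and asymptotic normality of the GARCH QMLE parameter estimator (which is close to the ARCH QMLE discussed in the previous sections, but uses many of the ideas in ARMA estimation) was first shown (under rather weak conditions) in Berkes et al. (2003).

It is straightforward to prove the result by using the ergodic theorem repeatedly. Let

\[
f_\alpha(X_t) = \log\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2\Big) + \frac{X_t^2}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2}.
\]

By Theorem 5.2.1, for every $\alpha \in \Theta$, $\{f_\alpha(X_t) : t \in \mathbb{Z}\}$ is a stationary, ergodic process. This allows us to obtain the limit of $L_n(\alpha)$. We first want to show that the limits of these quantities are bounded.

Lemma 9.2.1 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j < 1$. Then

\[
\sup_{\alpha\in\Theta}\frac{1}{n-p}\sum_{t=p+1}^{n}\log\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2\Big) \le \frac{1}{n-p}\sum_{t=p+1}^{n}\log\Big(c_2 + \sum_{j=1}^{p}X_{t-j}^2\Big) \xrightarrow{a.s.} E\log\Big(c_2 + \sum_{j=1}^{p}X_{t-j}^2\Big),
\]
\[
\sup_{\alpha\in\Theta}\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{X_t^2}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2} \le \frac{1}{c_1}\frac{1}{n-p}\sum_{t=p+1}^{n}X_t^2 \xrightarrow{a.s.} \frac{1}{c_1}E(X_t^2),
\]
\[
\sup_{\alpha\in\Theta}\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{X_{t-j}^2}{\alpha_0 + \sum_{i=1}^{p}\alpha_iX_{t-i}^2} \le \frac{1}{c_1}, \qquad 1 \le j \le p,
\]
\[
\sup_{\alpha\in\Theta}\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{X_t^2X_{t-j}^2}{(\alpha_0 + \sum_{i=1}^{p}\alpha_iX_{t-i}^2)^2} \le \frac{1}{c_1^2}\frac{1}{n-p}\sum_{t=p+1}^{n}X_t^2 \xrightarrow{a.s.} \frac{1}{c_1^2}E(X_t^2), \qquad 1 \le j \le p,
\]
\[
\sup_{\alpha\in\Theta}\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{X_t^2X_{t-j_1}^2X_{t-j_2}^2}{(\alpha_0 + \sum_{i=1}^{p}\alpha_iX_{t-i}^2)^3} \le \frac{1}{c_1^3}\frac{1}{n-p}\sum_{t=p+1}^{n}X_t^2 \xrightarrow{a.s.} \frac{1}{c_1^3}E(X_t^2), \qquad 1 \le j_1, j_2 \le p,
\]
\[
\sup_{\alpha\in\Theta}\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{X_t^2X_{t-j_1}^2X_{t-j_2}^2X_{t-j_3}^2}{(\alpha_0 + \sum_{i=1}^{p}\alpha_iX_{t-i}^2)^4} \le \frac{1}{c_1^4}\frac{1}{n-p}\sum_{t=p+1}^{n}X_t^2 \xrightarrow{a.s.} \frac{1}{c_1^4}E(X_t^2), \qquad 1 \le j_1, j_2, j_3 \le p.
\]

PROOF. The proof is straightforward from the definition of $\Theta$: since $\alpha_j \ge c_1$, we have $X_{t-j}^2/(\alpha_0 + \sum_{i}\alpha_iX_{t-i}^2) \le 1/c_1$ and $1/(\alpha_0 + \sum_{i}\alpha_iX_{t-i}^2) \le 1/c_1$.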
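We note that if we did not bound the parameters $\alpha_j$ in $\Theta$ away from zero, then it would not be possible to show that the expectations were finite without assuming that $E(X_t^4) < \infty$, which we recall is a highly restrictive assumption. Let

\[
L(\alpha) = E\log\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2\Big) + E\Big(\frac{X_t^2}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2}\Big).
\]

Lemma 9.2.2 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j < 1$. Then

\[
L_n(\alpha) \xrightarrow{a.s.} L(\alpha), \qquad \nabla^2L_n(\alpha) \xrightarrow{a.s.} \nabla^2L(\alpha),
\]
\[
\sup_{\alpha\in\Theta}\|\nabla L_n(\alpha)\| \le \frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{c_1}\Big(\sum_{j=1}^{p}X_{t-j}^2 + c_1^{-1}X_t^2\Big) \xrightarrow{a.s.} \frac{1}{c_1}\Big(pE(X_{t-j}^2) + c_1^{-1}E(X_t^2)\Big),
\]
\[
\sup_{\alpha\in\Theta}\|\nabla^3L_n(\alpha)\| \le \frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{c_1^3}\Big(\sum_{j=1}^{p}X_{t-j}^2 + c_1^{-1}X_t^2\Big) \xrightarrow{a.s.} \frac{1}{c_1^3}\Big(pE(X_{t-j}^2) + c_1^{-1}E(X_t^2)\Big).
\]

Lemma 9.2.3 Suppose $L_n(\alpha)$ is defined as in (9.4). Then

\[
\sup_{\alpha\in\Theta}|L_n(\alpha) - L(\alpha)| \xrightarrow{a.s.} 0. \tag{9.5}
\]

PROOF. To prove uniform convergence it is sufficient to show pointwise convergence and equicontinuity of $L_n(\alpha)$ (since $\Theta$ is compact) (see Theorem 5.4.1). By using the ergodic theorem and Lemma 9.2.1 we have

\[
|L_n(\alpha) - L(\alpha)| \xrightarrow{a.s.} 0. \tag{9.6}
\]

We now show that $L_n(\alpha)$ is stochastically equicontinuous. By the mean value theorem, for every $\alpha_1, \alpha_2 \in \Theta$ there exists an $\bar{\alpha}$ such that

\[
|L_n(\alpha_1) - L_n(\alpha_2)| \le \|\nabla L_n(\bar{\alpha})\|_2\|\alpha_1 - \alpha_2\|_2 \le \Big(\frac{K}{n-p}\sum_{t=p+1}^{n}\frac{1}{c_1}\Big(1 + \frac{X_t^2}{c_1}\Big)\Big)\|\alpha_1 - \alpha_2\|_2,
\]

for a finite constant $K$ depending only on $p$. Since $\{X_t\}$ is an ergodic sequence, $\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{c_1}\big(1 + \frac{X_t^2}{c_1}\big) \xrightarrow{a.s.} E\big(\frac{1}{c_1}\big(1 + \frac{X_t^2}{c_1}\big)\big)$. Hence $L_n(\alpha)$ is stochastically equicontinuous. Now by pointwise convergence of $L_n(\alpha)$, equicontinuity of $L_n(\alpha)$ and the compactness of $\Theta$ we have uniform convergence of $L_n(\alpha)$.

Theorem 9.2.1 Suppose $\{X_t : t = 1, \ldots, n\}$ is from an ARCH($p$) process and the estimator $\hat{a}_n$ is as defined in (9.4). Then we have $\hat{a}_n \xrightarrow{a.s.} a_0$ as $n \to \infty$.

PROOF. The result follows immediately from Lemma 9.2.3 and Theorem 5.4.1.

9.2.2 Asymptotic normality of the quasi-maximum likelihood estimator

To prove asymptotic normality we use Taylor expansion techniques. Since $\nabla L_n(\hat{a}_n) = 0$ we have

\[
(\hat{a}_n - a_0) = -\nabla^2L_n(\bar{a}_n)^{-1}\nabla L_n(a_0), \tag{9.7}
\]

where $\bar{a}_n$ lies between $\hat{a}_n$ and $a_0$. In the following lemma we show that $\nabla^2L_n(\bar{a}_n) \xrightarrow{a.s.} \nabla^2L(a_0)$.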
Lemma 9.2.4 Let us suppose that $\{X_t\}$ is a stationary process and $\bar{a}_n$ lies between $\hat{a}_n$ and $a_0$. Then

\[
\nabla^2L_n(\bar{a}_n) \xrightarrow{a.s.} \nabla^2L(a_0), \qquad\text{where}\qquad \nabla^2L(a_0) = E\Big(\frac{\mathbf{X}_{t-1}\mathbf{X}_{t-1}'}{\sigma_t^4}\Big)
\]

and $\mathbf{X}_{t-1} = (1, X_{t-1}^2, \ldots, X_{t-p}^2)'$.

PROOF. To prove the result we consider

\[
\begin{aligned}
\|\nabla^2L_n(\bar{a}_n) - \nabla^2L(a_0)\|_2 &\le \|\nabla^2L_n(\bar{a}_n) - \nabla^2L_n(a_0)\|_2 + \|\nabla^2L_n(a_0) - \nabla^2L(a_0)\|_2 \\
&\le \|\nabla^3L_n(\tilde{a}_n)\|_2\|\bar{a}_n - a_0\|_2 + \|\nabla^2L_n(a_0) - \nabla^2L(a_0)\|_2.
\end{aligned}
\]

Using Lemma 9.2.1 we have that $\sup_{\alpha\in\Theta}\|\nabla^3L_n(\alpha)\|_2$ is in its limit almost surely bounded. Hence by using the above, Lemma 9.2.2, and that $\bar{a}_n \xrightarrow{a.s.} a_0$, we have

\[
\|\nabla^2L_n(\bar{a}_n) - \nabla^2L(a_0)\|_2 \xrightarrow{a.s.} 0,
\]

as required.

We now show asymptotic normality of $\nabla L_n(a_0)$, which requires the following assumption.

Assumption 9.2.1 For some $\delta > 0$, $E(|Z_t|^{4(1+\delta)}) < \infty$.

It can be shown that the first derivative of $L_n(\alpha)$ is

\[
\nabla L_n(\alpha) = \frac{1}{n-p}\sum_{t=p+1}^{n}\Big(\frac{\mathbf{X}_{t-1}}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2} - \frac{X_t^2\mathbf{X}_{t-1}}{(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2)^2}\Big), \tag{9.8}
\]

where $\mathbf{X}_{t-1} = (1, X_{t-1}^2, \ldots, X_{t-p}^2)'$. Now evaluating the above at the true parameters (using $X_t^2 = \sigma_t^2 + (Z_t^2 - 1)\sigma_t^2$) we have

\[
\nabla L_n(a_0) = -\frac{1}{n-p}\sum_{t=p+1}^{n}(Z_t^2 - 1)\frac{\mathbf{X}_{t-1}}{\sigma_t^2}.
\]

Lemma 9.2.5 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j < 1$ and Assumption 9.2.1 is satisfied. Then we have

\[
\sqrt{n}\,\nabla L_n(a_0) \xrightarrow{D} N(0, \mu_4\Sigma),
\]

where $\mu_4 = \mathrm{var}(Z_t^2)$ and $\Sigma = E\big(\frac{\mathbf{X}_{t-1}\mathbf{X}_{t-1}'}{\sigma_t^4}\big)$.

PROOF. To prove the result we note that each element of $\{(Z_t^2 - 1)\mathbf{X}_{t-1}/\sigma_t^2\}_t$ is a martingale difference sequence. Hence we use the martingale central limit theorem, together with the Cramer-Wold device, to prove the result.

Theorem 9.2.2 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j < 1$ and Assumption 9.2.1 is satisfied. Then we have

\[
\sqrt{n}(\hat{a}_n - a_0) \xrightarrow{D} N(0, \mu_4\Sigma^{-1}).
\]

PROOF. To prove the result we use (9.7) and Lemmas 9.2.4 and 9.2.5. It is straightforward, hence we omit the details.

9.3 Testing for linearity of a time series

In this section we consider a test for linearity, first proposed by Subba Rao and Gabr (1980) (where the full details can be found). In the same paper a test for Gaussianity is also given; however, in this section we will focus on the test for linearity.
We mention that the test is based on the third order spectrum. However, without much modification one can also construct a similar test based on the fourth (or higher) order spectrum, which may be useful when the innovation density is symmetric about zero, so that $\mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t) = 0$ and the fourth order spectrum is needed.

9.3.1 Motivating the test statistic

In order to motivate the test let us review some of the results in Section 8.3.4. In Section 8.3.4 we defined the higher order spectrum, which is based on the higher order cumulants of a process. The higher order spectrum of a linear process has a very nice form. Let us suppose that $\{X_t\}$ satisfies
$$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},$$
where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, $E(\varepsilon_t) = 0$ and $E(|\varepsilon_t|^3) < \infty$. Let $A(\omega) = \sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)$. Then it is straightforward to show that the third order spectrum is
$$f_3(\omega_1, \omega_2) = \kappa_3 A(\omega_1)A(\omega_2)A(-\omega_1 - \omega_2), \qquad (9.9)$$
where $\kappa_3 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t)$. We recall that the fourth order spectrum has a similar form and that the (second order) spectral density is $f(\omega) = \kappa_2 A(\omega)A(-\omega)$, where $\kappa_2 = \mathrm{var}(\varepsilon_t)$. In other words, the spectra (of all orders) of a linear process can be deduced from the transfer function $A(\omega)$ and the cumulants of the innovations. This is not the case for nonlinear processes, where the higher order spectrum can have a very complicated form. It is worth noting that the third and fourth order spectra of an ARCH(1) process are messy to evaluate; however, it can be shown that its third order spectrum does not have the form given in (9.9).

The representation of $f_3$ in terms of the transfer function $A(\omega)$ is not unique to linear processes: it is possible to construct a very strange nonlinear process where $f_3(\omega_1, \omega_2) = \kappa_3 A(\omega_1)A(\omega_2)A(-\omega_1 - \omega_2)$. However, if the third order spectrum does not have this form then we can conclude that the process is nonlinear. Returning to (9.9) and taking its absolute square gives
$$|f_3(\omega_1, \omega_2)|^2 = \kappa_3^2|A(\omega_1)|^2|A(\omega_2)|^2|A(-\omega_1 - \omega_2)|^2 = \frac{\kappa_3^2}{\kappa_2^3}f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2).$$
Therefore
$$\frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} = \frac{\kappa_3^2}{\kappa_2^3}. \qquad (9.10)$$
In other words, in the case that $f_3$ satisfies (9.9), the ratio $|f_3(\omega_1, \omega_2)|^2 / \big(f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)\big)$ is constant over all frequencies $(\omega_1, \omega_2)$.

This discussion motivates the following test for linearity. We estimate $f$ and $f_3$, then construct the test statistic based on (9.10) and the estimates of $f$ and $f_3$. Hence, in the test for linearity we will test
$$H_0: \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} = \text{constant} \qquad \text{against} \qquad H_A: \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} \text{ depends on } (\omega_1, \omega_2).$$

9.3.2 Estimates of the higher order spectrum

Let us suppose $\{X_t\}$ is a stationary time series and we observe $\{X_t\}_{t=1}^{n}$. In this section we use $\{X_t\}_{t=1}^{n}$ to estimate $f$ and $f_3$.

We recall that in Section 8.4.2 we considered methods to estimate the spectral density. We considered the periodogram (frequency domain) estimator $\hat{f}$ and the time domain estimator $\tilde{f}$; both estimates are asymptotically equivalent. We now define an estimator of $f_3$ based on the latter method. We recall that an estimate of the spectral density is
$$\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\hat{c}_n(k)\exp(ik\omega), \qquad (9.11)$$
where $\lambda$ is the lag window, $m \ll n$ and
$$\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(X_t - \bar{X})(X_{t+|k|} - \bar{X}).$$
Using a similar method we can define an estimator of $f_3$. We first need to estimate the third order cumulant
$$\hat{c}_n(k_1, k_2) = \frac{1}{n}\sum_{t=1}^{n-\max(0,k_1,k_2)}(X_t - \bar{X})(X_{t+k_1} - \bar{X})(X_{t+k_2} - \bar{X}).$$
Using the above, the third order spectrum estimator is
$$\tilde{f}_{3,n}(\omega_1, \omega_2) = \frac{1}{(2\pi)^2}\sum_{k_1=-(n-1)}^{n-1}\sum_{k_2=-(n-1)}^{n-1}\lambda\Big(\frac{k_1}{m}\Big)\lambda\Big(\frac{k_2}{m}\Big)\lambda\Big(\frac{-(k_1+k_2)}{m}\Big)\hat{c}_n(k_1, k_2)\exp(ik_1\omega_1 + ik_2\omega_2) \qquad (9.12)$$
(for details on this estimator see Van Ness (1966)). Now using (9.10) it is clear that the test statistic should be based on the ratios $\tilde{f}_{3,n}(\omega_1, \omega_2)/\big(\tilde{f}_n(\omega_1)\tilde{f}_n(\omega_2)\tilde{f}_n(-\omega_1 - \omega_2)\big)$ over several different values of $(\omega_1, \omega_2)$. That is,
$$\hat{g}(\omega_1, \omega_2) = \frac{\tilde{f}_{3,n}(\omega_1, \omega_2)}{\tilde{f}_n(\omega_1)\tilde{f}_n(\omega_2)\tilde{f}_n(-\omega_1 - \omega_2)}, \qquad (9.13)$$
where $\hat{g}(\omega_1, \omega_2)$ is an estimator of $g(\omega_1, \omega_2) = f_3(\omega_1, \omega_2)/\big(f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)\big)$.
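As a rough numerical illustration of the lag-window estimator (9.11), the following Python sketch evaluates the estimate directly from its definition. The Bartlett (triangular) window and all function names are assumptions made for this example; the notes do not fix a particular lag window.

```python
import math


def sample_autocov(x, k):
    """Sample autocovariance: (1/n) * sum_{t=1}^{n-|k|} (x_t - xbar)(x_{t+|k|} - xbar)."""
    n = len(x)
    k = abs(k)
    xbar = sum(x) / n
    return sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k)) / n


def bartlett_window(u):
    """Triangular lag window (an assumed choice of lambda); zero for |u| >= 1."""
    return max(0.0, 1.0 - abs(u))


def lag_window_spectrum(x, omega, m, window=bartlett_window):
    """Evaluate (1/(2*pi)) * sum_k window(k/m) * c_n(k) * exp(i*k*omega).

    Because c_n(k) = c_n(-k) and the window is symmetric, the terms at lags
    k and -k combine into 2 * cos(k * omega), so the result is real.
    """
    n = len(x)
    total = window(0.0) * sample_autocov(x, 0)
    for k in range(1, n):
        w = window(k / m)
        if w != 0.0:
            total += 2.0 * w * sample_autocov(x, k) * math.cos(k * omega)
    return total / (2.0 * math.pi)


# Small deterministic series used only to exercise the functions.
x = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(120)]
estimate = lag_window_spectrum(x, 0.7, m=8)
```

With $m = 1$ the Bartlett window removes every nonzero lag, so the estimate reduces to the sample variance divided by $2\pi$, which gives a quick sanity check. A third order analogue along the lines of (9.12) would replace the single sum by a double sum over $(k_1, k_2)$.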
If the sample size $n$ and the window length $m$ are sufficiently large, Van Ness (1966) and Brillinger (2001) have shown that $\hat{g}(\omega_1, \omega_2)$ is approximately normally distributed.

The ratio $\hat{g}(\omega_1, \omega_2)$ can be evaluated at all the fundamental frequencies $\{(\frac{2\pi k_1}{n}, \frac{2\pi k_2}{n})\}$, and potentially a test on the 'constantness' of $g(\omega_1, \omega_2)$ over all $\{(\frac{2\pi k_1}{n}, \frac{2\pi k_2}{n});\ 1 \le k_1, k_2 \le n\}$ could be performed (note that it can be shown that there are regions of $[0, 2\pi]^2$ where there are repetitions, but for ease of presentation we shall not take this into account). A problem is that, due to the smoothing in (9.11) and (9.12), there will be quite a lot of dependence in $\hat{g}$ over adjacent frequencies. Therefore, rather than consider the grid $\{(\frac{2\pi k_1}{n}, \frac{2\pi k_2}{n})\}$, we consider the coarser grid $\{(\frac{2\pi k_1 m}{n}, \frac{2\pi k_2 m}{n}) : 1 \le k_1, k_2 \le n/m\}$ (notice that $m$ is the window length). On the coarser grid the random variables $\hat{g}(\omega_1, \omega_2)$ tend to be less correlated. To see why frequencies which are not too close tend to be uncorrelated, consider the simpler example of the smoothed periodogram. The periodogram is asymptotically uncorrelated at adjacent frequencies; however, smoothing introduces dependence. But by choosing frequencies which are not too close, the smoothed periodogram is close to uncorrelated.

We now define a neighbourhood. Each neighbourhood $N_{s_1,s_2}$ consists of all the frequencies
$$N_{s_1,s_2} = \Big\{\Big(\frac{2\pi s_1 rm}{n} + \frac{2\pi k_1 m}{n},\ \frac{2\pi s_2 rm}{n} + \frac{2\pi k_2 m}{n}\Big) : -(r+1) \le k_1, k_2 \le r\Big\},$$
hence $N_{s_1,s_2}$ is the neighbourhood of $(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n})$. An illustration of this grouping is given in Figure 9.1. The idea is that all the small squares inside the local neighbourhood should have a mean which is about the same, because we have assumed that the spectra $f$ and $f_3$ are sufficiently smooth functions. In other words,
$$E\big(\tilde{f}_{3,n}(\omega_1, \omega_2)\big) \approx f_3\Big(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n}\Big),$$
$$E\big(\tilde{f}_n(\omega_1)\tilde{f}_n(\omega_2)\tilde{f}_n(-\omega_1 - \omega_2)\big) \approx f\Big(\frac{2\pi s_1 rm}{n}\Big)f\Big(\frac{2\pi s_2 rm}{n}\Big)f\Big(-\frac{2\pi s_1 rm}{n} - \frac{2\pi s_2 rm}{n}\Big) \qquad (9.14)$$
for all $(\omega_1, \omega_2) \in N_{s_1,s_2}$.

Now we define a vector containing $\hat{g}$ evaluated at all the frequencies in $N_{s_1,s_2}$:
$$\underline{\xi}_{s_1,s_2} = \big(\hat{g}(\omega_1, \omega_2) : (\omega_1, \omega_2) \in N_{s_1,s_2}\big) := \big(\xi_{s_1,s_2}(1), \ldots, \xi_{s_1,s_2}((2r)^2)\big).$$
We see that $\underline{\xi}_{s_1,s_2}$ is a $(2r)^2$-dimensional vector which contains $\hat{g}$ evaluated in a neighbourhood of $(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n})$. Since all the frequencies are in the neighbourhood of $(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n})$, by using (9.14) we can expect the mean of every single element of $\underline{\xi}_{s_1,s_2}$ to be about $g(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n})$.

There are $(n/(2mr))^2$ different neighbourhoods. We now construct $(2r)^2$ different vectors, each of dimension $(n/(2mr))^2$, where each vector contains one element from each neighbourhood $N_{s_1,s_2}$. That is, let
$$\underline{Y}_k = \big(\xi_{1,1}(k), \ldots, \xi_{\frac{n}{2mr},\frac{n}{2mr}}(k)\big).$$
Hence we have $(2r)^2$ random vectors $\{\underline{Y}_k\}$. Now, as mentioned above (see below (9.13)), each element of $\underline{Y}_k$ is normally distributed, where the mean of $\underline{Y}_k$ is approximately
$$E(\underline{Y}_k) \approx \Big(g\Big(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n}\Big) : 1 \le s_1, s_2 \le \frac{n}{2mr}\Big).$$

[Figure 9.1: The grid of all frequencies in $\{(\frac{2\pi k_1}{n}, \frac{2\pi k_2}{n})\}$ is made into a coarser grid. Each small square is a locally averaged frequency.]

Since we have chosen the elements of each neighbourhood to be approximately uncorrelated, we can assume that $\{\underline{Y}_k\}$ are iid normally distributed random vectors. Returning to the null hypothesis, we recall that we are testing
$$H_0: \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} = \text{constant} \qquad \text{against} \qquad H_A: \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} \text{ depends on } (\omega_1, \omega_2).$$
This means that under the null the elements of the vector $\underline{Y}_k$ have the same mean. Hence the above can be restated as
$$H_0: \mu_{1,1} = \mu_{2,1} = \ldots = \mu_{\frac{n}{2mr},\frac{n}{2mr}} \qquad \text{against} \qquad H_A: \text{at least one of the means is different},$$
where $\mu_{s_1,s_2} = E\big(\hat{g}(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n})\big)$. We now review some results in multivariate analysis which will help us to construct the test statistic.
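The regrouping of the coarse-grid values into one vector per within-neighbourhood position can be sketched as follows. This is only an illustrative layout under stated assumptions: the coarse-grid values are stored in a square array whose side is a multiple of $2r$, the neighbourhoods are taken to be non-overlapping blocks of $2r \times 2r$ grid points, and the function name is hypothetical.

```python
def neighbourhood_vectors(g, r):
    """Regroup a square grid of values into (2r)^2 vectors.

    g is a list of equal-length rows holding the statistic on the coarse grid.
    The grid is cut into non-overlapping blocks (neighbourhoods) of size
    2r x 2r; the vector for position (i, j) collects the (i, j)-th entry of
    every block, so each vector has one element per neighbourhood.
    """
    side = len(g)
    block = 2 * r
    if block <= 0 or side % block != 0 or any(len(row) != side for row in g):
        raise ValueError("g must be square with side a positive multiple of 2r")
    nb = side // block  # number of neighbourhoods along each axis
    vectors = []
    for i in range(block):
        for j in range(block):
            vectors.append([g[s1 * block + i][s2 * block + j]
                            for s1 in range(nb) for s2 in range(nb)])
    return vectors


# Toy 4 x 4 grid with r = 1: four neighbourhoods, each of 2 x 2 points.
g = [[10 * a + b for b in range(4)] for a in range(4)]
ys = neighbourhood_vectors(g, r=1)
```

In this toy example each of the four vectors has four entries, one taken from each neighbourhood, which mirrors the count of $(2r)^2$ vectors of dimension $(n/(2mr))^2$.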
9.3.3 Hotelling's $T^2$-statistic

In this section we summarise some results from multivariate analysis. An interested reader is referred to the excellent book Anderson (2003) (Chapter 5) for a detailed introduction.

Let us suppose that $\{\underline{X}_t\}$ are iid random vectors of dimension $p$, which are normally distributed with mean $\underline{\mu}$ and variance $\Sigma$. Hotelling's $T^2$-statistic is generally used to test hypotheses on the mean $\underline{\mu}$. In many respects it can be considered as a multivariate generalisation of the t-test. We first construct the test statistic to test the hypothesis $H_0: \underline{\mu} = 0$ and then consider the generalisation to the case $H_0: \mu_1 = \mu_2 = \ldots = \mu_p$.

Let us suppose we observe $\{\underline{X}_t\}_{t=1}^{n}$. Then under the null the distribution of $\sqrt{n}\,\bar{X}_n$ is $N(0, \Sigma)$, where $\bar{X}_n = n^{-1}\sum_{t=1}^{n}\underline{X}_t$. Therefore $n\bar{X}_n'\Sigma^{-1}\bar{X}_n \sim \chi^2(p)$. Hence, if the variance $\Sigma$ were known, we could use $n\bar{X}_n'\Sigma^{-1}\bar{X}_n$ as the test statistic. Of course, often $\Sigma$ will be unknown and needs to be estimated from the observations. Let
$$S = \sum_{t=1}^{n}(\underline{X}_t - \bar{X}_n)(\underline{X}_t - \bar{X}_n)'.$$
Then we can use $\frac{1}{n-p}S$ as an estimator of $\Sigma$ (noting that we normalise by $n - p$ because we are estimating $p$ means). Therefore under the null we can use as the test statistic
$$T^2 = \bar{X}_n' S^{-1}\bar{X}_n$$
(note we could normalise by $n/(n-p)$). We note that under the assumption of normality of $\underline{X}_t$, $S$ is effectively the sum of $n - p$ random variables, hence roughly speaking $S \sim \chi^2(n - p)$ (of course this is not strictly true since $S$ is a random matrix and not a scalar). We note that if $Y \sim \chi^2(p)$ and $X \sim \chi^2(n - p)$ are independent, then $(Y/p)/(X/(n-p)) = \frac{n-p}{p}\frac{Y}{X} \sim F_{p,n-p}$. Now under the null we have $n\bar{X}_n'\Sigma^{-1}\bar{X}_n \sim \chi^2(p)$ and $S \sim \chi^2(n - p)$, therefore the distribution of $T^2$ is
$$\frac{(n-p)n}{p}T^2 \sim F_{p,n-p}.$$
We note that under the alternative the distribution of $\frac{(n-p)n}{p}T^2$ will be a non-central $F_{p,n-p}$, where the non-centrality parameter depends on the deviation of the means $\{\mu_k\}$ from zero.

We now apply the above results to test the hypothesis
$$H_0: \mu_1 = \mu_2 = \ldots = \mu_p \qquad \text{against} \qquad H_A: \text{at least one of the means is different}.$$
Define the $(p-1) \times p$ matrix
$$B = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix}, \qquad (9.15)$$
so that $B\underline{X} = (X_1 - X_2, X_2 - X_3, \ldots, X_{p-1} - X_p)'$, where $\underline{X} = (X_1, \ldots, X_p)'$. Therefore the $T^2$-statistic is
$$T^2 = (B\bar{X}_n)'(BSB')^{-1}B\bar{X}_n. \qquad (9.16)$$
Since $B\bar{X}_n$ is a $(p-1)$-dimensional vector, under the null the distribution of the test statistic is
$$\frac{(n-(p-1))n}{(p-1)}T^2 \sim F_{p-1,n-p+1}.$$
We note that under the null, the $(p-1)$-dimensional vector $B\bar{X}_n$ satisfies
$$\sqrt{n}\,B\bar{X}_n = \sqrt{n}\,(\bar{X}_1 - \bar{X}_2, \bar{X}_2 - \bar{X}_3, \ldots, \bar{X}_{p-1} - \bar{X}_p)' \sim N(0, B\Sigma B').$$

9.3.4 The test statistic for the test for linearity

We now apply the above results to test for linearity of the time series. We want to test
$$H_0: \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} = \text{constant} \qquad \text{against} \qquad H_A: \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1)f(\omega_2)f(-\omega_1 - \omega_2)} \text{ depends on } (\omega_1, \omega_2).$$
We note that the above is equivalent to
$$H_0: \mu_{1,1} = \mu_{2,1} = \ldots = \mu_{\frac{n}{2mr},\frac{n}{2mr}} \qquad \text{against} \qquad H_A: \text{at least one of the means is different},$$
where $\mu_{s_1,s_2} = E\big(\hat{g}(\frac{2\pi s_1 rm}{n}, \frac{2\pi s_2 rm}{n})\big)$. We recall that we have constructed the vectors $\{\underline{Y}_k\}$, which are approximately iid normally distributed. We use (9.16) to construct the test statistic. Using the definition of $B$ given in (9.15), we use as the test statistic
$$T^2 = (B\bar{Y})'(BS_Y B')^{-1}B\bar{Y},$$
where
$$\bar{Y} = \frac{1}{(2r)^2}\sum_{k=1}^{(2r)^2}\underline{Y}_k \qquad \text{and} \qquad S_Y = \sum_{k=1}^{(2r)^2}(\underline{Y}_k - \bar{Y})(\underline{Y}_k - \bar{Y})'.$$
Under the null, writing $p = (n/(2mr))^2$ for the dimension of $\underline{Y}_k$ (so that $(2r)^2$ plays the role of the sample size in Section 9.3.3), the test statistic has the distribution
$$\frac{\big((2r)^2 - (p-1)\big)(2r)^2}{(p-1)}T^2 \sim F_{p-1,(2r)^2-p+1}.$$

Chapter 10 Mixingales

In this section we prove some of the results stated in the previous sections using mixingales. We first define a mixingale, noting that the definition we give is not the most general definition.

Definition 10.0.1 (Mixingale) Let $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$. $\{X_t\}$ is called a mixingale if it satisfies
$$\rho_{t,k} = \Big\{E\big(E(X_t|\mathcal{F}_{t-k}) - E(X_t)\big)^2\Big\}^{1/2},$$
where $\rho_{t,k} \to 0$ as $k \to \infty$. We note that if $\{X_t\}$ is a stationary process then $\rho_{t,k} = \rho_k$.

Lemma 10.0.1 Suppose $\{X_t\}$ is a mixingale.
Then $\{X_t\}$ almost surely satisfies the decomposition
$$X_t - E(X_t) = \sum_{j=0}^{\infty}\big(E(X_t|\mathcal{F}_{t-j}) - E(X_t|\mathcal{F}_{t-j-1})\big). \qquad (10.1)$$

PROOF. We first note that by using a telescoping argument
$$X_t - E(X_t) = \sum_{k=0}^{m}\big(E(X_t|\mathcal{F}_{t-k}) - E(X_t|\mathcal{F}_{t-k-1})\big) + \big(E(X_t|\mathcal{F}_{t-m-1}) - E(X_t)\big).$$
By definition of a mixingale $E\big(E(X_t|\mathcal{F}_{t-m-1}) - E(X_t)\big)^2 \to 0$ as $m \to \infty$, hence the remainder term in the above expansion becomes negligible as $m \to \infty$ and we have almost surely
$$X_t - E(X_t) = \sum_{k=0}^{\infty}\big(E(X_t|\mathcal{F}_{t-k}) - E(X_t|\mathcal{F}_{t-k-1})\big).$$
Thus giving the required result.

We observe that (10.1) resembles the Wold decomposition. The difference is that the Wold decomposition decomposes a stationary process into elements which are the errors in the best linear predictors, whereas the result above decomposes a process into sums of martingale differences.

It can be shown that functions of several ARCH-type processes are mixingales (where $\rho_{t,k} \le K\rho^k$ with $\rho < 1$), and Subba Rao (2006) and Dahlhaus and Subba Rao (2007) used these properties to obtain the rate of convergence for various types of ARCH parameter estimators. In a series of papers, Wei Biao Wu considered properties of a general class of stationary processes which satisfied Definition 10.0.1, where $\sum_{k=1}^{\infty}\rho_k < \infty$.

In Section 10.2 we use the mixingale property to prove Theorem 6.1.3. This is a simple illustration of how useful mixingales can be. In the following section we give a result on the rate of convergence of some random variables.

10.1 Obtaining almost sure rates of convergence for some sums

The following lemma is a simple variant on a result proved in Móricz (1976), Theorem 6.

Lemma 10.1.1 Let $\{S_T\}$ be a random sequence where $E(\sup_{1 \le t \le T}|S_t|^2) \le \phi(T)$ and $\{\phi(t)\}$ is a monotonically increasing sequence where $\phi(2^j)/\phi(2^{j-1}) \le K < \infty$ for all $j$. Then we have almost surely
$$\frac{1}{T}S_T = O\Big(\frac{\sqrt{\phi(T)(\log T)(\log\log T)^{1+\delta}}}{T}\Big).$$

PROOF. The idea behind the proof is that we find a subsequence of the natural numbers and define a random variable on this subsequence.
This random variable should 'dominate' (in some sense) $S_T$. We then obtain a rate of convergence for the subsequence (you will see that for the subsequence it is quite easy by using the Borel-Cantelli lemma), which, due to the dominance, can be transferred over to $S_T$. We make this argument precise below.

Define the sequence $V_j = \sup_{t \le 2^j}|S_t|$. Using Chebyshev's inequality we have
$$P(V_j > \varepsilon) \le \frac{\phi(2^j)}{\varepsilon^2}.$$
Let $\varepsilon(t) = \sqrt{\phi(t)(\log\log t)^{1+\delta}\log t}$. It is clear that
$$\sum_{j=1}^{\infty}P\big(V_j > \varepsilon(2^j)\big) \le \sum_{j=1}^{\infty}\frac{C\phi(2^j)}{\phi(2^j)(\log j)^{1+\delta}j} < \infty,$$
where $C$ is a finite constant. Now by Borel-Cantelli, this means that almost surely $V_j \le \varepsilon(2^j)$ for all but finitely many $j$. Let us now return to the original sequence $S_T$. Suppose $2^{j-1} \le T \le 2^j$, then by definition of $V_j$ we have
$$\frac{S_T}{\varepsilon(T)} \le \frac{V_j}{\varepsilon(2^{j-1})} \overset{a.s.}{\le} \frac{\varepsilon(2^j)}{\varepsilon(2^{j-1})} < \infty$$
under the stated assumptions. Therefore almost surely we have $S_T = O(\varepsilon(T))$, which gives us the required result.

We observe that the above result resembles the law of the iterated logarithm. It is a very simple and nice way of obtaining an almost sure rate of convergence. The main problem is obtaining bounds for $E(\sup_{1 \le t \le T}|S_t|^2)$. There is one exception to this: when $S_t$ is the sum of martingale differences one can simply apply Doob's inequality, which gives $E(\sup_{1 \le t \le T}|S_t|^2) \le 4E(|S_T|^2)$. In the case that $S_T$ is not the sum of martingale differences it is not so straightforward. However, if we can show that $S_T$ is the sum of mixingales then with some modifications a bound for $E(\sup_{1 \le t \le T}|S_t|^2)$ can be obtained. We will use this result in the section below.

10.2 Proof of Theorem 6.1.3

We summarise Theorem 6.1.3 below.

Theorem 1 Let us suppose that $\{X_t\}$ has an ARMA representation where the roots of the characteristic polynomials $\phi(z)$ and $\theta(z)$ have absolute value greater than $1 + \delta$. Then

(i)
$$\frac{1}{n}\sum_{t=r+1}^{n}\varepsilon_t X_{t-r} = O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big) \qquad (10.2)$$

(ii)
$$\frac{1}{n}\sum_{t=\max(i,j)}^{n}\big(X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j})\big) = O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big) \qquad (10.3)$$

for any $\gamma > 0$.

By using Lemma ??, and that $\sum_{t=r+1}^{n}\varepsilon_t X_{t-r}$ is the sum of martingale differences, we prove Theorem 6.1.3(i) below.

PROOF of Theorem 6.1.3(i). We first observe that $\{\varepsilon_t X_{t-r}\}$ are martingale differences, hence we can use Doob's inequality to give $E\big(\sup_{r+1 \le s \le T}(\sum_{t=r+1}^{s}\varepsilon_t X_{t-r})^2\big) \le 4(T - r)E(\varepsilon_t^2)E(X_t^2)$. Now we can apply Lemma ?? to obtain the result.

We now show that
$$\frac{1}{T}\sum_{t=\max(i,j)}^{T}\big(X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j})\big) = O\Big(\sqrt{\frac{(\log\log T)^{1+\gamma}\log T}{T}}\Big).$$
However the proof is more complex, since $\{X_{t-i}X_{t-j}\}$ are not martingale differences and we cannot directly use Doob's inequality. However, by showing that $\{X_{t-i}X_{t-j}\}$ is a mixingale we can still show the result.

To prove the result let $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$ and $\mathcal{G}_t = \sigma(X_{t-i}X_{t-j}, X_{t-1-i}X_{t-1-j}, \ldots)$. We observe that $\mathcal{G}_t \subset \mathcal{F}_{t-\min(i,j)}$; in particular, if $i > j$ then $\mathcal{G}_t \subset \mathcal{F}_{t-j}$.

Lemma 10.2.1 Let $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$ and suppose $X_t$ comes from an ARMA process, where the roots are greater than $1 + \delta$. Then if $E(\varepsilon_t^4) < \infty$ we have
$$E\big(E(X_{t-i}X_{t-j}|\mathcal{F}_{t-\min(i,j)-k}) - E(X_{t-i}X_{t-j})\big)^2 \le C\rho^k.$$

PROOF. By expanding $X_t$ as an MA($\infty$) process we have
$$E(X_{t-i}X_{t-j}|\mathcal{F}_{t-\min(i,j)-k}) - E(X_{t-i}X_{t-j}) = \sum_{j_1,j_2=0}^{\infty}a_{j_1}a_{j_2}\big\{E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}|\mathcal{F}_{t-k-\min(i,j)}) - E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2})\big\}.$$
Now in the case that $t - i - j_1 > t - k - \min(i,j)$ and $t - j - j_2 > t - k - \min(i,j)$, we have $E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}|\mathcal{F}_{t-k-\min(i,j)}) = E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2})$. Now by considering when $t - i - j_1 \le t - k - \min(i,j)$ or $t - j - j_2 \le t - k - \min(i,j)$ we have the result.

Lemma 10.2.2 Suppose $\{X_t\}$ comes from an ARMA process. Then

(i) The sequence $\{X_{t-i}X_{t-j}\}_t$ satisfies the mixingale property
$$E\big(E(X_{t-i}X_{t-j}|\mathcal{F}_{t-\min(i,j)-k}) - E(X_{t-i}X_{t-j})\big)^2 \le K\rho^k, \qquad (10.4)$$
and almost surely we can write
$$X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j}) = \sum_{k=0}^{\infty}V_{t,k}, \qquad (10.5)$$
where $V_{t,k} = E(X_{t-i}X_{t-j}|\mathcal{F}_{t-k-\min(i,j)}) - E(X_{t-i}X_{t-j}|\mathcal{F}_{t-k-\min(i,j)-1})$ are martingale differences.

(ii) Furthermore $E(V_{t,k}^2) \le K\rho^k$ and
$$E\Big(\sup_{\min(i,j) \le s \le n}\Big(\sum_{t=\min(i,j)}^{s}\{X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j})\}\Big)^2\Big) \le Kn, \qquad (10.6)$$
where $K$ is some finite constant.

PROOF. To prove (i) we note that by using Lemma 10.2.1 we have (10.4). To prove (10.5) we use the same telescoping argument used to prove Lemma 10.0.1.

To prove (ii) we use the above expansion to give
$$E\Big(\sup_{\min(i,j) \le s \le n}\Big(\sum_{t=\min(i,j)}^{s}\{X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j})\}\Big)^2\Big) = E\Big(\sup_{\min(i,j) \le s \le n}\Big(\sum_{k=0}^{\infty}\sum_{t=\min(i,j)}^{s}V_{t,k}\Big)^2\Big)$$
$$= E\Big(\sup_{\min(i,j) \le s \le n}\sum_{k_1=0}^{\infty}\sum_{k_2=0}^{\infty}\Big(\sum_{t=\min(i,j)}^{s}V_{t,k_1}\Big)\Big(\sum_{t=\min(i,j)}^{s}V_{t,k_2}\Big)\Big) \le \Big(\sum_{k=0}^{\infty}\Big\{E\sup_{\min(i,j) \le s \le n}\Big(\sum_{t=\min(i,j)}^{s}V_{t,k}\Big)^2\Big\}^{1/2}\Big)^2. \qquad (10.7)$$
Now we see that $\{V_{t,k}\}_t = \{E(X_{t-i}X_{t-j}|\mathcal{F}_{t-k-\min(i,j)}) - E(X_{t-i}X_{t-j}|\mathcal{F}_{t-k-\min(i,j)-1})\}_t$, therefore $\{V_{t,k}\}_t$ are also martingale differences. Hence we can apply Doob's inequality to $E\sup_{\min(i,j) \le s \le n}(\sum_{t=\min(i,j)}^{s}V_{t,k})^2$, and by using (10.4) we have
$$E\sup_{\min(i,j) \le s \le n}\Big(\sum_{t=\min(i,j)}^{s}V_{t,k}\Big)^2 \le 4E\Big(\sum_{t=\min(i,j)}^{n}V_{t,k}\Big)^2 = 4\sum_{t=\min(i,j)}^{n}E(V_{t,k}^2) \le Kn\rho^k.$$
Therefore now by using (10.7) we have
$$E\Big(\sup_{\min(i,j) \le s \le n}\Big(\sum_{t=\min(i,j)}^{s}\{X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j})\}\Big)^2\Big) \le Kn.$$
Thus giving (10.6).

We now use the above to prove Theorem 6.1.3(ii).

PROOF of Theorem 6.1.3(ii). To prove the result we use (10.6) and Lemma 10.1.1.

Appendix A

A.1 Background: some definitions and inequalities

Some norm definitions. The norm of an object is a positive number which measures the 'magnitude' of that object. Suppose $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, then we define $\|x\|_1 = \sum_{j=1}^{n}|x_j|$ and $\|x\|_2 = (\sum_{j=1}^{n}|x_j|^2)^{1/2}$ (this is known as the Euclidean norm). There are various norms for matrices; the most popular is the spectral norm $\|\cdot\|_{spec}$: let $A$ be a matrix, then $\|A\|_{spec} = \sqrt{\lambda_{\max}(AA')}$, where $\lambda_{\max}$ denotes the largest eigenvalue.

$\mathbb{Z}$ denotes the set of integers $\{\ldots, -1, 0, 1, 2, \ldots\}$. $\mathbb{R}$ denotes the real line $(-\infty, \infty)$.

Complex variables. $i = \sqrt{-1}$ and the complex variable $z = x + iy$, where $x$ and $y$ are real. Often the polar representation of a complex variable is useful. If $z = x + iy$, then it can also be written as $r\exp(i\theta)$, where $r = \sqrt{x^2 + y^2}$ and $\theta = \tan^{-1}(y/x)$. If $z = x + iy$, its complex conjugate is $\bar{z} = x - iy$.

The roots of an $r$th order polynomial $a(z)$ are those values $\lambda_1, \ldots, \lambda_r$ where $a(\lambda_i) = 0$ for $i = 1, \ldots, r$.

The mean value theorem.
This basically states that if the partial derivatives of the function $f(x_1, x_2, \ldots, x_n)$ are bounded in the domain $\Omega$, then for $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$
$$f(x_1, x_2, \ldots, x_n) - f(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n}(x_i - y_i)\frac{\partial f}{\partial x_i}\Big|_{x = x^*},$$
where $x^*$ lies somewhere between $x$ and $y$.

The Taylor series expansion. This is closely related to the mean value theorem, and a second order expansion is
$$f(x_1, x_2, \ldots, x_n) - f(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n}(x_i - y_i)\frac{\partial f}{\partial x_i}\Big|_{x = y} + \frac{1}{2}\sum_{i,j=1}^{n}(x_i - y_i)(x_j - y_j)\frac{\partial^2 f}{\partial x_i\partial x_j}\Big|_{x = x^*}.$$

Partial Fractions. We use the following result mainly for obtaining the MA($\infty$) expansion of an AR process. Suppose that $|g_i| > 1$ for $1 \le i \le n$. Then if $g(z) = \prod_{i=1}^{n}(1 - z/g_i)^{r_i}$, the inverse of $g(z)$ satisfies
$$\frac{1}{g(z)} = \sum_{i=1}^{n}\sum_{j=1}^{r_i}\frac{g_{i,j}}{(1 - \frac{z}{g_i})^j},$$
where $g_{i,j} = .....$ Now we can make a polynomial series expansion of $(1 - \frac{z}{g_i})^{-j}$ which is valid for all $|z| \le 1$.

Dominated and monotone convergence. We will use this all over the place to exchange infinite sums and expectations. Basically, if $\sum_{j=1}^{\infty}|a_j|E(|Z_j|) < \infty$, then by using dominated convergence we have $E(\sum_{j=1}^{\infty}a_j Z_j) = \sum_{j=1}^{\infty}a_j E(Z_j)$.

Cauchy-Schwarz inequality. In terms of sequences it is
$$\Big|\sum_{j=1}^{\infty}a_j b_j\Big| \le \Big(\sum_{j=1}^{\infty}a_j^2\Big)^{1/2}\Big(\sum_{j=1}^{\infty}b_j^2\Big)^{1/2}.$$
For integrals and expectations it is $E|XY| \le E(X^2)^{1/2}E(Y^2)^{1/2}$.

Hölder's inequality. This is a generalisation of the Cauchy-Schwarz inequality. It states that if $1 \le p, q \le \infty$ and $p^{-1} + q^{-1} = 1$, then $E|XY| \le E(|X|^p)^{1/p}E(|Y|^q)^{1/q}$. A similar result is true for sequences too.

Martingale differences. Let $\mathcal{F}_t$ be a sigma-algebra, where $X_t, X_{t-1}, \ldots \in \mathcal{F}_t$. Then $\{X_t\}$ is a sequence of martingale differences if $E(X_t|\mathcal{F}_{t-1}) = 0$.

Minkowski's inequality. If $1 < p < \infty$, then
$$\Big(E\Big|\sum_{i=1}^{n}X_i\Big|^p\Big)^{1/p} \le \sum_{i=1}^{n}\big(E(|X_i|^p)\big)^{1/p}.$$

Doob's inequality. This inequality concerns martingale differences. Let $S_n = \sum_{t=1}^{n}X_t$, then
$$E\Big(\sup_{n \le N}|S_n|^2\Big) \le 4E(S_N^2).$$

Burkholder's inequality. Suppose that $\{X_t\}$ are martingale differences and define $S_n = \sum_{k=1}^{n}X_k$. For any $p \ge 2$ we have
$$\{E(S_n^p)\}^{1/p} \le \Big(2p\sum_{k=1}^{n}E(X_k^p)^{2/p}\Big)^{1/2}.$$
An application is to the case that $\{X_t\}$ are identically distributed random variables, where we have the bound $E(S_n^p) \le E(X_0^p)(2p)^{p/2}n^{p/2}$. It is worth noting that the Burkholder inequality can also be defined for $p < 2$ (see Davidson (1994), page 242). It can also be generalised to random variables $\{X_t\}$ which are not necessarily martingale differences (see Dedecker and Doukhan (2003)).

Riemann-Stieltjes Integrals. In basic calculus we often use the basic definition of the Riemann integral, $\int g(x)f(x)dx$, and if the function $F(x)$ is continuous and $F'(x) = f(x)$, we can write $\int g(x)f(x)dx = \int g(x)dF(x)$. There are several instances where we need to broaden this definition to include functions $F$ which are not continuous everywhere. To do this we define the Riemann-Stieltjes integral, which coincides with the Riemann integral in the case that $F(x)$ is continuous. $\int g(x)dF(x)$ is defined in a slightly different way to the Riemann integral $\int g(x)f(x)dx$. Let us first consider the case that $F(x)$ is the step function $F(x) = \sum_{i=1}^{n}a_i I_{[x_{i-1},x_i]}$, then $\int g(x)dF(x)$ is defined as $\int g(x)dF(x) = \sum_{i=1}^{n}(a_i - a_{i-1})g(x_i)$ (with $a_0 = 0$). Already we see the advantage of this definition, since the derivative of the step function is not well defined at the jumps. As most functions can be written as the limit of step functions ($F(x) = \lim_{k \to \infty}F_k(x)$, where $F_k(x) = \sum_{i=1}^{n_k}a_{i,n_k}I_{[x_{i_k-1},x_{i_k}]}$), we define $\int g(x)dF(x) = \lim_{k \to \infty}\sum_{i=1}^{n_k}(a_{i,n_k} - a_{i-1,n_k})g(x_{i_k})$.

In statistics the function $F$ will usually be non-decreasing and bounded. We call such functions distributions.

Theorem A.1.1 (Helly's Theorem) Suppose that $\{F_n\}$ are a sequence of distributions with $F_n(-\infty) = 0$ and $\sup_n F_n(\infty) \le M < \infty$. There exists a distribution $F$, and a subsequence $F_{n_k}$, such that for each $x \in \mathbb{R}$, $F_{n_k}(x) \to F(x)$, and $F$ is right continuous.

Remark A.1.1 Martingales arise all the time.
It is useful to know that if the true distribution is used, the gradient of the conditional log likelihood evaluated at the true parameter is the sum of martingale differences. We show why this is true now. Let
$$B_T(\theta) = \sum_{t=2}^{T}\log f_\theta(X_t|X_{t-1}, \ldots, X_1)$$
be the conditional log likelihood and $C_T(\theta)$ its derivative, where
$$C_T(\theta) = \sum_{t=2}^{T}\frac{\partial\log f_\theta(X_t|X_{t-1}, \ldots, X_1)}{\partial\theta}.$$
We want to show that $C_T(\theta_0)$ is the sum of martingale differences. By definition, if $C_T(\theta_0)$ is the sum of martingale differences then
$$E\Big(\frac{\partial\log f_\theta(X_t|X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0}\,\Big|\,X_{t-1}, X_{t-2}, \ldots, X_1\Big) = 0;$$
we will show this. Rewriting the above in terms of integrals and exchanging derivative with integral we have
$$E\Big(\frac{\partial\log f_\theta(X_t|X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0}\,\Big|\,X_{t-1}, X_{t-2}, \ldots, X_1\Big) = \int\frac{\partial\log f_\theta(x_t|X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1}, \ldots, X_1)dx_t$$
$$= \int\frac{1}{f_{\theta_0}(x_t|X_{t-1}, \ldots, X_1)}\frac{\partial f_\theta(x_t|X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1}, \ldots, X_1)dx_t = \frac{\partial}{\partial\theta}\int f_\theta(x_t|X_{t-1}, \ldots, X_1)dx_t\Big|_{\theta=\theta_0} = 0.$$
Therefore $\{\frac{\partial\log f_\theta(X_t|X_{t-1}, \ldots, X_1)}{\partial\theta}|_{\theta=\theta_0}\}_t$ are a sequence of martingale differences and $C_T(\theta_0)$ is the sum of martingale differences (hence it is a martingale).

Bibliography

Hong-Zhi An, Zhao-Guo Chen, and E.J. Hannan. Autocorrelation, autoregression and autoregressive approximation. Ann. Statist., 10:926-936, 1982.
T. W. Anderson. Statistical Analysis of Time Series. Wiley, 1994.
T. W. Anderson. An Introduction to Multivariate Analysis. Wiley, New Jersey, 2003.
M. Bartlett. Introduction to Stochastic Processes: With Special Reference to Methods and Applications. Cambridge University Press, Cambridge, 1981.
I. Berkes, L. Horváth, and P. Kokoszka. GARCH processes: Structure and estimation. Bernoulli, 9:2001-2007, 2003.
I. Berkes, L. Horváth, P. Kokoszka, and Q. Shao. On discriminating between long range dependence and changes in mean. Ann. Statist., 34:1140-1165, 2006.
R.N. Bhattacharya, V.K. Gupta, and E. Waymire. The Hurst effect under trend. J. Appl. Probab., 20:649-662, 1983.
P.J. Bickel and D.A. Freedman. Some asymptotic theory for the bootstrap. Ann. Statist., pages 1196-1217, 1981.
P. Billingsley. Probability and Measure. Wiley, New York, 1995.
T. Bollerslev. Generalized autoregressive conditional heteroscedasticity. J. Econometrics, 31:301-327, 1986.
G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Cambridge University Press, Oakland, 1970.
D.R. Brillinger. Time Series: Data Analysis and Theory. SIAM Classics, 2001.
P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, New York, 1998.
R. Dahlhaus. Maximum likelihood estimation and model selection for locally stationary processes. J. Nonparametric Statist., 6:171-191, 1996.
R. Dahlhaus. Fitting time series models to nonstationary processes. Ann. Statist., 16:1-37, 1997.
R. Dahlhaus. A likelihood approximation for locally stationary processes. Ann. Statist., 28:1762-1794, 2000.
R. Dahlhaus and D. Janas. A frequency domain bootstrap for ratio statistics in time series analysis. Ann. Statist., 24:1934-1963, 1996.
R. Dahlhaus and S. Subba Rao. A recursive online algorithm for the estimation of time-varying ARCH parameters. Bernoulli, 13:389-422, 2007.
J. Davidson. Stochastic Limit Theory. Oxford University Press, Oxford, 1994.
J. Dedecker and P. Doukhan. A new covariance inequality. Stochastic Processes and their Applications, 106:63-80, 2003.
R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of the United Kingdom inflation. Econometrica, 50:987-1006, 1982.
J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, Berlin, 2003.
J. Franke and W. Härdle. On bootstrapping kernel spectral estimates. Ann. Statist., 20:121-145, 1992.
J. Franke and J-P. Kreiss. Bootstrapping stationary autoregressive moving average models. J. Time Ser. Anal., pages 297-317, 1992.
W. Fuller. Introduction to Statistical Time Series. Wiley, New York, 1995.
L. Giraitis, P. Kokoszka, and R. Leipus. Stationary ARCH models: Dependence structure and central limit theorem. Econometric Theory, 16:3-22, 2000.
C. W. J. Granger and A. P. Andersen. An Introduction to Bilinear Time Series Models. Vandenhoeck and Ruprecht, Göttingen, 1978.
U. Grenander and G. Szegő. Toeplitz Forms and Their Applications. Univ. California Press, Berkeley, 1958.
G.R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, 1994.
P. Hall and C.C. Heyde. Martingale Limit Theory and its Application. Academic Press, New York, 1980.
E. Hannan. Multiple Time Series. Wiley, New York, 1970.
E.J. Hannan and J. Rissanen. Recursive estimation of ARMA order. Biometrika, 69:81-94, 1982.
J.-P. Kreiss. Asymptotical properties of residual bootstrap for autoregression. Technical report www.math.tu-bs.de/stochastik/kreiss.htm, 1997.
M. Rosenblatt and U. Grenander. Statistical Analysis of Stationary Time Series. Chelsea Publishing Co, 1997.
T. Mikosch and C. Stărică. Is it really long memory we see in financial returns? In P. Embrechts, editor, Extremes and Integrated Risk Management, pages 149-168. Risk Books, London, 2000.
T. Mikosch and C. Stărică. Long-range dependence effects and ARCH modelling. In P. Doukhan, G. Oppenheim, and M.S. Taqqu, editors, Theory and Applications of Long Range Dependence, pages 439-459. Birkhäuser, Boston, 2003.
F. Móricz. Moment inequalities and the strong law of large numbers. Z. Wahrsch. verw. Gebiete, 35:298-314, 1976.
E. Moulines, P. Priouret, and F. Roueff. On recursive estimation for locally stationary time varying autoregressive processes. Ann. Statist., 33:2610-2654, 2005.
D.F. Nicholls and B.G. Quinn. Random Coefficient Autoregressive Models, An Introduction. Springer-Verlag, New York, 1982.
E. Parzen. On consistent estimates of the spectrum of a stationary process. Ann. Math. Statist., 1957.
E. Parzen. On estimation of the probability density function and the mode. Ann. Math. Statist., 1962.
E. Parzen. Stochastic Processes (Classics in Applied Mathematics). Society for Industrial Mathematics, 1999.
D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer, New York, 1999.
M. B. Priestley. Spectral Analysis and Time Series: Volumes I and II. Academic Press, London, 1983.
B.G. Quinn and E.J. Hannan. The Estimation and Tracking of Frequency. Cambridge University Press, 2001.
R. Shumway and D. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, New York, 2006.
D. S. Stoffer and K. D. Wall. Bootstrapping state space models: Gaussian maximum likelihood estimation. Journal of the American Statistical Association, 86:1024-1033, 1991.
D. S. Stoffer and K. D. Wall. Resampling in State Space Models. Cambridge University Press, 2004.
W.F. Stout. Almost Sure Convergence. Academic Press, New York, 1974.
D. Straumann. Estimation in Conditionally Heteroscedastic Time Series Models. Springer, Berlin, 2005.
S. Subba Rao. A note on uniform convergence of an ARCH($\infty$) estimator. Sankhya, pages 600-620, 2006.
T. Subba Rao. On the estimation of bilinear time series models. In Bull. Inst. Internat. Statist. (paper presented at 41st session of ISI, New Delhi, India), volume 41, 1977.
T. Subba Rao and M. M. Gabr. A test for linearity of a stationary time series. J. Time Ser. Anal., 1:145-158, 1980.
T. Subba Rao and M. M. Gabr. An Introduction to Bispectral Analysis and Bilinear Time Series Models. Lecture Notes in Statistics (24). Springer, New York, 1984.
Gy. Terdik. Bilinear Stochastic Models and Related Problems of Nonlinear Time Series Analysis; A Frequency Domain Approach, volume 142 of Lecture Notes in Statistics. Springer Verlag, New York, 1999.
J. W. Van Ness. Asymptotic properties of the bi-spectra. Ann. Math. Statist., 37:1257-1257, 1966.
P. Whittle. Gaussian estimation in stationary time series. Bulletin of the International Statistical Institute, 39:105-129, 1962.
UNC - READ - 3075680
The Educational Leadership Program College of Education Assistant Professor Job No. 33477 The Educational Leadership Program announces an opportunity to join our nationally recognized leadership faculty that includes highly accomplished area school l
UNC - READ - 1402767
Texas A&amp;M University Corpus ChristiDepartment of Educational Administration and ResearchTerm: Beginning August 2003 or January 2004Department ChairRank OpenResponsibilities: Tenure-track chair position responsible for: administering departmen
UNC - READ - 3084152
University ofOregonCollege ofEducationhe University of Oregona member of the Association of American Universities (AAU) seeks faculty dedicated to pre-eminent scholarship, research, teaching, and service. The University of Oregon has an enroll
Texas A&M - MATH - 251
Title:Oct318:03AM(1of11)Title:Oct318:18AM(2of11)Title:Oct318:21AM(3of11)Title:Oct318:31AM(4of11)Title:Oct318:43AM(5of11)Title:Oct318:55AM(6of11)Title:Oct319:53AM(7of11)Title:Oct3110:02AM(8of11)Title:Nov28:22AM(9of11)Title:Nov28:29AM
Texas A&M - MATH - 150
Title:Sep268:29PM(1of19)Title:Sep268:32PM(2of19)Title:Sep268:30PM(3of19)Title:Sep268:30PM(4of19)Title:Sep268:31PM(5of19)Title:Sep268:31PM(6of19)Title:Sep268:33PM(7of19)Title:Sep268:34PM(8of19)Title:Sep268:34PM(9of19)Title:Sep268:35P
Texas A&M - M - 131
Title:Feb127:30PM(1of18)Title:Feb127:30PM(2of18)Title:Feb127:30PM(3of18)Title:Feb127:30PM(4of18)Title:Feb127:30PM(5of18)Title:Feb127:57PM(6of18)Title:Feb127:30PM(7of18)Title:Feb128:19PM(8of18)Title:Feb128:20PM(9of18)Title:Feb128:30P
Texas A&M - MATH - 142
Homework #3 Math 142 Section:Name: Row:This assignment is due by 3:30 pm on February 12, 2008 You can turn it in to me in class or drop it by the office, Blocker 640D. Be sure that you follow the homework rules, they can be found on your syllabus
Texas A&M - MATH - 152
Math 152-copyright Joe Kahlig, 08CPage 1Section 10.1: A sequence is a list of numbers written in a denite order. a1 , a2 , a3 , . or {an } n=1 1. Find a general formula for these sequences. (a) 5 6 7 8 , , , , . 9 16 25 36 6 9 15 1, , , 2, , . 4
Texas A&M - MATH - 152
Math 152-copyright Joe Kahlig, 08CPage 1Section 9.3: Arc Length Suppose that C is a smooth curve dened by x = f (t) and y = g(t) for [a, b]. Let {Pi } be a set of points on the curve that partition of the interval [a, b] such that t is equal for
Texas A&M - MATH - 152
Math 152-copyright Joe Kahlig, 08CPage 1Section 9.4: Surface AreaThe surface area of a curve rotated about the x-axis:The surface area of a curve rotated about the y-axis:1. Find the area of the surface obtained by rotating the curve y = ab
Texas A&M - MATH - 152
Math 152-copyright Joe Kahlig, 08CPage 1Section 9.2: First Order Linear Equations A first-order linear differential equation is one that can be put into the form y + P (x)y = Q(x) 1. Classify these differential equations as linear or separable.
Texas A&M - MATH - 152
Math 152-copyright Joe Kahlig, 08CPage 1Section 9.5: Moments and Center of Mass We want to nd the point at which a thin plate would ballance. This point is called the center of mass(centroid) of the plate.First consider two points located on th
Texas A&M - MATH - 152
Texas A&M - MATH - 152
Math 152-copyright Joe Kahlig, 08BReview of Sections 6.4 and 6.5 1. Use the gure to compute the following. The areas of the regions are given.f(x)Area of Region I = 15 Area of Region II = 8I IIIAArea of Region III = 3IIBCA(a)0 C
Texas A&M - MATH - 251
Title:Aug318:04AM(1of8)Title:Aug318:26AM(2of8)Title:Aug318:34AM(3of8)Title:Aug318:38AM(4of8)Title:Aug318:46AM(5of8)Title:Aug318:51AM(6of8)Title:Aug319:54AM(7of8)Title:Aug319:59AM(8of8)
Texas A&M - MATH - 150
Title:Sep78:25PM(1of21)Title:Sep78:26PM(2of21)Title:Sep78:26PM(3of21)Title:Sep78:27PM(4of21)Title:Sep78:27PM(5of21)Title:Sep78:27PM(6of21)Title:Sep78:27PM(7of21)Title:Sep78:28PM(8of21)Title:Sep78:28PM(9of21)Title:Sep78:28PM(10of21)
Texas A&M - MATH - 251
Title:Sep199:53AM(1of13)Title:Sep199:59AM(2of13)Title:Sep1910:02AM(3of13)Title:Sep1910:06AM(4of13)Title:Sep218:31AM(5of13)Title:Sep218:33AM(6of13)Title:Sep218:40AM(7of13)Title:Sep218:45AM(8of13)Title:Sep218:51AM(9of13)Title:Sep219:3
Texas A&M - D - 104
CLUB EDTexas 4-H Club Managers Tool Box TEXAS 4H LEADERSHIP OPPORTUNITIESThe 4H program offers youth many opportunities to learn and apply leadership skills. Service opportunities exist for members from the club level on through the state lev
Texas A&M - D - 104
10/14/20084-H Recognition ModelDeveloped by: Texas 4-H &amp; Youth Development Strengthening Clubs Initiative TeamTopics2. National 4-H Recognition Model 3. Recognition IdeasRecognition of 4-H Members Focus on development of each individual 4-H
UNC - COMP - 770
Global IlluminationComputer Graphics COMP 770 (236) Spring 2007 Instructor: Brandon Lloyd13/26/07From last time Robustness issues Code structure Optimizations Acceleration structures Distribution ray tracing anti-aliasing depth of fie
UNC - COMP - 006
COMP 006D-001, Fall 2003Pledge FormCOMP 006D-001, Comprehensive PledgeFor all of my work in COMP 006D-001, including homework assignments, I pledge that I will neither give nor accept unauthorized aid. This includes, but is not limited to, the
UNC - READ - 2324085
Call for AbstractsCall for Abstractsfor Air &amp; Waste Management Association's98th Annual Conference &amp; ExhibitionThe Air &amp; Waste Management Association's 98th Annual Conference &amp; Exhibition will be held in Minneapolis, Minnesota, on June 2124, 20
UNC - CHAPT - 725
Environmental Organic Chemistry, 2nd Edition. Rene P. Schwarzenbach, Philip M. Gschwend and Dieter M. Imboden Copyright 02003 John Wiley &amp;L Sons, Inc.1197Appendix CPHYSICOCHEMICAL PROPERTIES OF ORGANIC COMPOUNDSAppendix C contains the names, m
Texas A&M - EDCI - 619
Texas A&M - EDCI - 619
UNC - ECON - 423
Econ 423: Questions from Previous Versions of Quiz 6 [Fall 2000-present]1. The actual money multiplier will decrease but the potential money multiplier will not be affected by an increase in the: (a) national debt of the federal government. (b) usur
UNC - ECON - 423
Econ 423: Questions from Previous Versions of Exam 21. Creative responses in mortgage markets to the stagflation of the 1970s and the resulting increases in nominal interest rates did not include increased emphasis on: (a) assumable loans. (b) ballo
UNC - ECON - 423
Econ 423: Questions from Previous Final Examinations1. Risk that can be reduced significantly through diversification is: (a) inflation risk. (b) specific risk, or unique risk. (c) default risk. (d) interest rate risk. (e) exchange rate risk. (f) ma
UNC - INLS - 521
CoverTitle pageTitle page verso18Rule 1.1D3Other title informationOriginal title in the same language as the title proper and appearing on the chief source of information020 245 _ _ 250 260 300044689429X Gray lady down : b a novel : ori
UNC - INLS - 521
CoverTitle page versoFacing title pagesZoom In1.6SERIES AREA [Field 440 - Series statement/added entry/title] or [Field 490 - Series statement] Title proper of series [Subfield a - Title]80 Rule 1.6B1 Title proper of series1.6B245 _ _
UNC - INLS - 760
INLS 760, Dr. Capra Assigned: Tues, March 4Project 6 Metadata and Full-Text SearchSpring 2008 Due: 5:00pm, Tues, March 18Overview For this project, you will add metadata and full-text search capabilities to the Digital Library (DL). What to do
UNC - INLS - 521
Title pageTitle page verso1.4 1.4CPUBLICATION, DISTRIBUTION, ETC. AREA[Field 260 - Publication, distribution, etc. (Imprint)]Place of publication, distribution, etc.[Subfield a - Place of publication, distribution, etc.] 35 Rule 1.4C1 Place
UNC - ECON - 423
Econ 423: Questions from Previous Versions of Quiz 6 [Fall 2000-present]1. Changes in the discount rate change the supply of money by changing. (a) potential money multipliers. (b) the Fed's holdings of foreign exchange. (b) the level of national de
UNC - TAM - 0506
TransAtlantic Masters ProgramPolitical Science 211 Fall 2005Varieties of Democratic Capitalism in Europe and North AmericaTuesdays and Thursdays 2-3:15 John D. Stephens Hamilton 353 Office hours: Tuesdays and Thursdays 3:30-5 962-0409, 932-1168 E
UNC - TAM - 0607
TransAtlantic Masters ProgramPolitical Science 745 Fall 2006Varieties of Democratic Capitalism in Europe and North AmericaTuesdays and Thursdays 3:30-4:45 Murphy 112 John D. Stephens Center for European Studies 223 East Franklin Street Office hou
UNC - TAM - 0506
University of North Carolina at Chapel Hill Poli Science 121 Europe Undivided Mon, Wed, Fri 8:00 10:30 a. m. UCIS, seminar roomProfessor Christiane Lemke Fall 2005 Office hours: Mon, Wed, 11-12Politics of EU-EnlargementOn May 1, 2004, ten new s
UNC - TAM - 0506
University of North Carolina Political Science/ Transatlantic Masters Program EU GOVERNANCE POLI 273 TAM 975 * Fall 2005 Instructor: Professor Liesbet Hooghe Class Hours: Wednesday, 2-5pm, Peabody 0010 Office hours: Wed. 10:30am-12:30noon, Hamilton H
UNC - TAM - 0607
POLI 891 .975: TAM/CES Fall 2006 Friday Lecture Series Instructor: Patrick Egan 459 Hamilton Hall T: 919-962-3041 Course Profile POLI 891 is a three credit pass/fail course designed to enhance students understanding of transatlantic studies through l
UNC - EPID - 600
ENVR 101 Fall 05 M-W 11 -11:50AM OLD CLINIC AUDITORIUM (Follow the red line) Cross South Columbia; go on right side between HSL and MacNider Hall Go down 1 flight of steps enter MacNider door (1 on Map); then go up to 2nd floor level; enter hall and
Texas A&M - MATH - 141
Math141,WIR Week 4-copyright Maggie Arnold1WIR Week 4-Chapter 41. Maximize P = 2x + 3y subject to: 2x + 4y 12 x - 2y 1 x 0, y 0 2x + 4y + u = 12 x - 2y + v = 1 -2x - 3y + P = 0 Using SMPLX 2 4 1 0 0 12 1 -2 0 1 0 1 -2 -3 0 0 1 0 pivot
Texas A&M - MATH - 304
MATH 304 Linear Algebra Lecture 3: Applications of systems of linear equations.Systems of linear equations a11 x1 + a12 x2 + + a1n xn = b1 a21 x1 + a22 x2 + + a2n xn = b2 am1 x1 + am2 x2 + + amn xn = bmHere x1 , x2 , . . . , xn a
Texas A&M - MATH - 645
Lecture 2 Copyright c Sue Geller 2006 This week we concentrate on mathematical induction. I hope all of you have seen it at some time but suspect that some of you may not have. So Im going to start at the beginning. The idea of induction is akin to a
Texas A&M - MATH - 151
MATH 151 Spring 2004, Solutions for Quiz # 12 Problem #1. (5 points) For the function f (x) = x (x - 1)2Section : 808 (2 points) Find the vertical and horizontal asymptotes of f (x). (2 points) Find the local extrema of f (x) (decide for each po
Texas A&M - M - 151
M151B Practice Problems for Final ExamCalculators will not be allowed on the exam. Unjustified answers will not receive credit. On the exam you will be given the following identities: n(n + 1) ; k= 2 k=1nn(n + 1)(2n + 1) k = ; 6 k=12nnk3 =
Texas A&M - MATH - 222
Math 222 - - Exam I SolutionsInstructor - Al Boggess Fall 19981. We wish to write the vector0 1 ,6 C w = B 10 A @ ,19as a linear combination of v1 ; v2 ; v3 . This means we must hunt for constants x1 ; x2 x3 with w = x1 v1 + x2 v2 + x3 v3 . Thi
Texas A&M - M - 640
Relevant Matlab Commands1. At any time, you can get help on any Matlab command by typing help and then the command name (e.g. help matrix will pull up help on entering matrices) 2. You can enter matrices in matlab as follows: type x = [2 - 3 4] for
Texas A&M - MATH - 222
Math 222 - Selected Homework Solutions from Sections 2.2 and 2.3Instructor - Al Boggess Fall 1998Page 98 - Section 2.25 We are to show det A = n det A. The key is to writeA = I ANow take determinants: det A = det IA = det I det A Now, I has
Texas A&M - MATH - 222
Math 222 - Selected Homework Solutions for Assignment 7Instructor - Al Boggess Fall 1998Section 4.1 12. Let v1 ; : : : vn be a basis for V and let L1 and L2 be two linear transformations mapping V into W . We are to show that if L1 vi = L2 vi for
Texas A&M - MATH - 222
Math 222 - Selected Homework Solutions from Sections 3.1 and 3.2Instructor - Al Boggess Fall 1998Exercises for Section 3.1 9 a Show 0 = 0 for each scalar . We must show 0 + x = x for all vectors x. For then, 0 must be 0 since the Zero in the vect