Saturday, August 25, 2012

How many parameters did he really estimate?

This is a post on mathematical statistics.  I am going to make things easy for myself and assume that our purpose in life is to forecast y, that y = xBeta + epsilon, and that epsilon is iid normal.  There are N observations of y and N+1 of x, and the aim is to forecast the y which will follow the (N+1)st x.  The forecast will be xBetahat for some estimate Betahat.
x is a k dimensional vector.

Given all of these assumptions, if Betahat is estimated by OLS, there is a simple formula for the expected value of that (N+1)st (y - xBetahat) squared:
it is equal to (N+k)/(N-k) times the average squared residual (the within sample average of (y - xBetahat) squared).

The k in the denominator is based on the fact that the unbiased estimate of the variance of epsilon is the sum of squared residuals divided by N-k (while the average squared residual is, of course, the sum divided by N).  The k in the numerator is due to the variance of (xBeta - xBetahat).
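
A quick simulation can check the (N+k)/(N-k) ratio. This is only a sketch and not part of the original argument: the sizes (N = 50, k = 5), the seed and the function name simulate_ratio are made up for illustration, the variance of epsilon is set to 1, and the new x is drawn at random from the in-sample rows, a case where x(X'X)^-1 x' averages exactly k/N. With a fresh random x the ratio should be close but need not match exactly.

```python
import numpy as np


def simulate_ratio(n_obs=50, k=5, n_reps=20000, seed=0):
    """Ratio of mean squared out-of-sample forecast error to mean squared
    within-sample residual, averaged over many simulated data sets."""
    rng = np.random.default_rng(seed)

    # Fixed design matrix and true coefficients (illustrative values).
    X = rng.normal(size=(n_obs, k))
    beta = rng.normal(size=k)

    # (X'X)^{-1} X', so that Betahat = proj @ y for any y.
    proj = np.linalg.solve(X.T @ X, X.T)            # shape (k, n_obs)

    # One row per replication: y = X beta + epsilon, epsilon iid N(0, 1).
    Y = X @ beta + rng.normal(size=(n_reps, n_obs))
    beta_hat = Y @ proj.T                           # shape (n_reps, k)

    # Within-sample average squared residual.
    resid = Y - beta_hat @ X.T
    in_sample = np.mean(resid ** 2)

    # The (N+1)st x: a randomly chosen in-sample row, with a fresh epsilon.
    x_new = X[rng.integers(0, n_obs, size=n_reps)]
    y_new = x_new @ beta + rng.normal(size=n_reps)
    forecast = np.sum(x_new * beta_hat, axis=1)
    out_of_sample = np.mean((y_new - forecast) ** 2)

    return out_of_sample / in_sample


# Compare the simulated ratio with the formula (N+k)/(N-k).
print(simulate_ratio(), (50 + 5) / (50 - 5))
```

The two printed numbers should agree up to simulation noise.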

Now this means that if all the assumptions above hold, and we assume that someone is making a forecast by running OLS and giving the fitted value, we can estimate k by seeing how much worse the out of sample forecasts are than the within sample forecasts.  More generally, that comparison yields a k which isn't exactly the number of parameters estimated by OLS but has something to do with the number of parameters which were estimated.
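
Backing k out of the comparison is just algebra: if R is the ratio of the out of sample mean squared forecast error to the within sample mean squared residual, then R = (N+k)/(N-k) gives k = N(R-1)/(R+1). A minimal sketch, where the function name implied_k and the numbers are hypothetical and everything rests on the assumptions above:

```python
def implied_k(n_obs, out_of_sample_mse, in_sample_mse):
    """Solve R = (N + k) / (N - k) for k, where R is the ratio of the
    out-of-sample mean squared forecast error to the within-sample
    mean squared residual."""
    ratio = out_of_sample_mse / in_sample_mse
    return n_obs * (ratio - 1) / (ratio + 1)


# Made-up numbers: N = 100 and out of sample squared errors nine times
# as large as the within sample squared residuals.
print(implied_k(100, 9.0, 1.0))
```

A ratio of 1 gives k = 0, and the implied k approaches N as the ratio grows without bound.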

I am writing this post, because I am thinking of efforts to forecast presidential elections using economic data. The number of reported parameters is typically quite low.  The within sample fit is always excellent.  The out of sample forecasts are practically worthless.  

This makes it very clear that the data have been dredged: there is ex post model specification, with researchers exploring various regressions until they get one which fits extremely well.  A calculation of the k which fits the ratio of out of sample mean squared forecast error to within sample mean squared residual might be fun.

I can't do it (I don't know N or anything), but my guess based on the Silver analysis is that the forecasting records correspond to k on the order of 80% of N.  Pretesting k variables is not the same as OLSing with k variables.  But I mean, that's some very impressively bad statistical analysis.
