Robert's Stochastic thoughts: 2/1/24

This will be a long boring post amplifying on my slogan.

I assert that asymptotic theory and asymptotic approximations have nothing useful to contribute to the study of statistics. I therefore reject that vast bul of mathematical statistics as absolutely worthless.

To stress the positive, I think useful work is done with numerical simulations -- monte carlos in which pseudo data are generated with psudo random number generators and extremely specific assumptions about data generating processes, statistics are calculated, then the process is repeated at least 10,000 times and the pseudo experimental distribution is examined. A problem is that computers only understand simple precise instructions. This means that the Monte Carlo results hold only for very specific (clearly false) assumptions about the data generating process. The approach used to deal with this is to make a variety of extremely specific assumptions and consider the set of distributions of the statistic which result.

I think this approach is useful and I think that mathematical statisticians all agree. They all do this. Often there is a long (often difficult) analysis of asymptotics, then the question of whether the results are at all relevant to the data sets which are actually used, then an answer to that question based on monte carlo simulations. This is the approach of the top experts on asymptotics (eg Hal White and PCB Phillips).

I see no point for the section of asymptotic analysis which no one trusts and note that the simulations often show that the asymptotic approximations are not useful at all. I think they are there for show. Simulating is easy (many many people can program a computer to do a simulation). Asymptotic theory is hard. One shows one is smart by doing asymptotic theory which one does not trust and which is not trustworthy. This reminds me of economic theory (most of which I consider totally pointless).

OK so now against asymptotics. I will attempt to explain what is done -- I will use the two simplest examples. In each case, there is an assumed data generating process and a sample size of data (hence N) which one imagines is generated. Then a statistic is estimated (often this is a function of the sample size and an estimate of a parameter of a parametric class of possible data generating processes). The statistic is modified by a function of the sample size (N). The result is a series of random variables (or one could say a series of distributions of random variables). The function of the sample size N is chosen so that the series of random variables converges in distribution to a random variable (convergence in distribution is convergence of the cumulative distribution function at all points where there are no atoms so it is continuous).

One set of examples (usually described differently) are laws of large numbers. A very simple law of large numbers assumes that the data generating process is a series of independent random numbers with identical distributions (iid). It is assumed (in the simplest case) that this distribution has a finite mean and a finite variance. The statisitc is the sample average. as N goes to invinity it converges to a degenerate distribution with all weight on the population average. It is also tru that the sample average converges to a distrubiton whith mean equal to the population mean and variance going to zero - that is for any positive epsilon there is an N1 so large that the variance of the sample mean is less than epsilon (conergence in quadratic rule). Also for any positive epsilon there is an N1 so large that if the sample size N>N1 then the probability that the sample mean is more than epsilon from the population mean is itself less than epsilon (convergence in probability). The problem is that there is no way to know what N! is. In particular, it depends on the underlying distribution. The population variance can be estimated using the sample variance. This is a consistent estimate so that the difference is less than epsilon if N>N2. The problem is that there is no way of knowing what N2 is.

Another very commonly used asymptotic approximation is the central limit theorem. Again I consider a very simple case of an iid random variable with a mean M and a finite variance V.

In that case (sample mean -M)N^0.5 will converge in distribution to a normal with mean zero and variance V. Again there is no way to know what the required N1 is. for some iid sitributions (say binary 1 or zero with probability 0.5 each or uniform from 0 to 1) N1 is quite low and the distribution looks just like a normal distribution for a sample size around 30. For others the distribution is not approximately normal for a sample size of 1,000,000,000.

I have criticisms of asymptotic analysis as such. The main one is that N has not gone to infinity. Also we are not imortal and my not live long enough to collect N1 observations.

Consider an even simpler problem of a series of numbers X_t (not stochastic just determistic numbers). Let's say we are interested in X_1000. What does knowing that the limit of X_t as t goes to infinity is 0 tell us about X_1000 ? Obvioiusly nothing. I can take a series and replace X_1000 with any number at all without changing the limit as t goes to infinity.

Also not only does the advice "use an asymptotic approximation" often lead one astray, it also doesn't actually lead one. The approach is to imaging a series of numbers such that X_1000 is the desired number and then look at the limit as t goes to infinity. The problem is that the same number X_1000 is the 1000th element of a of an uh large infinity of different series. one can make up a series such that the limit is 0 or 10 or pi or anything. the advice "think of the limit as t goes to infinity of an imaginary series with a limit that you just made up" is as valid an argument that X_1000 is approximately zero as it is that X_1000 is pi, that is it is an obviously totally invalide argument.

This is a very simple example, however there is the exact same problem with actual published asymptotic approximations. The distribution of the statistic for the actual sample size is one element of a very large infinity of possible series of distributions. Equally valid asymptotic analysis can imply completely different assertions about the distribution of the statistic for the actual sample size. As they can't both be valid and they are equally valid, they both have zero validity.

An example. Consider a random walk where x_t = (rho)x_(t-1) + epsilon_t where epsilon is n iid random variable with mean zero and finite variance. There is a standard result that if rho is less than 1 then (rhohat-rho)N^0.5 has a normal distribution. There is a not so standard result that if rho = 1 then (rhohat-rho)N^0.5 goes to a degenerate distribution equal to zero with probability one and (rhohat-rho)N goes to a strange distribution called a unit root distribution (with the expected value of (rhohat-rho)N less than 0.

Once I came late to a lecture on this and casually wrote converges in distribution to a normal with mean zero and variance before noticing that the usual answer was not correct in this case. The professor was the very brilliant Chris Vavanaugh who was one of the first two people to prove the result described below (and was not the first to publish).

Doeg Elmendorf who wasn't even in the class and is very very smart (and later head of the CBO) asked how there can be such a discontinuity at rho +1 when, for a sample of a thousand observations, there is almost no difference in the joint probability distribution or any possible statistic between rho = 1 and rho = 1 - 10^(-100). Prof Cavanaugh said that was his next topic.

The problem is misuse of asymptotics (or according to me use of asymptotics). Note that the question explicity referred to a sample size of 1000 not a sample size going to infinity.

So if rho = 0.999999 = 1 - 1/(a million) then rho^1000 is about 1 but rho^10000000000 iis about zero. taking N to infinity implies that, for a rho very slightly less than one, almost all of the regression coefficients of X_t2 on X_t1 (with t1. Now the same distribution of rhohat for rho = 0.999999 and N = 1000 is the thousandth element of many many series of random variables.

One of them is a series where Rho varies with the sample size N so Rho_N = 1-0.001/N

for N = 1000 rho = 0.999999 so the distribution of rhohat for the sample of 1000 is just the same as before. However the series of random variables (rhohat-rho)N^0.5 does not converge to a normal distribution -- it converges to a degenerate distribution which is 0 with probability 1.

In contrast (rho-rhohat)N converges to a unit root distribution for this completely different series of random variables which has the exact same distribution for the sample size of 1000.

There are two completely different equally valid asymptotic approximations.

So Cavanough decided which to trust by running simulations and said his new asymptotic aproximation worked well for large finite samples and the standard one was totally wrong.

See what happened ? Asymptotic theory did not answer the question at all. The conclusion (which is universally accepted) was based on simulations.

This is standard practice. I promise that mathematical statisticians will go on and on about asymptotics then check whether the approximation is valid using a simulation.

I see no reason not to cut out the asymptotic middle man.

Robert's Stochastic thoughts

Monday, February 19, 2024

Asymptotically We'll all be Dead