Monday, November 01, 2004

Faulty Polls or Faulty Analysis

One thing many poll watchers seem able to agree on this year is that there is something wrong with the polls. Polls by different agencies appear to give very different results, much too different to be explained by the stated sampling errors. In particular, it appears that the different agencies' likely-voter filters give systematically different results. I have been an enthusiastic proponent of this view since 2000.

Now I think that we may have seen patterns in random data and taken out our frustration over a deadlocked race on the pollsters.

I strongly suspect that some people are treating the margin of error of the polls as a margin of error of the difference Bush - Kerry, when it is not. In fact the standard error of the difference Bush minus Kerry is almost twice the standard error of support for Bush or for Kerry (see explanation below). I further suspect that some people assume that the sampling variance of the difference between the results of two polls is about the same as the variance of each poll, when in fact it is the sum of the variances of the two polls. Together these two mistakes would lead those people to underestimate by a factor of about 2.8 (2 times the square root of 2) the standard deviation of differences in Bush - Kerry between two valid (or equally biased) polls taken at the same time, which means underestimating the variance by a factor of almost 8. This would explain why such people think there is something strange going on. It is less clear to me why people who understand simple statistics had the same impression.
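To make the two factors concrete, here is a minimal sketch in Python. The poll (1000 respondents, 48 % for each candidate) is invented for illustration and is not one of the polls discussed in this post, and the function names are my own. It assumes simple random sampling with no weighting.

```python
import math

def share_standard_error(share_pct, n):
    """Sampling standard error, in percentage points, of one candidate's share."""
    p = share_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

def margin_standard_error(bush_pct, kerry_pct, n):
    """Sampling standard error, in percentage points, of Bush minus Kerry.

    Uses the close-race approximation 10 * sqrt(Bush + Kerry) / sqrt(N).
    """
    return 10.0 * math.sqrt(bush_pct + kerry_pct) / math.sqrt(n)

# Hypothetical poll, chosen only to illustrate the arithmetic.
n = 1000
bush, kerry = 48.0, 48.0

se_share = share_standard_error(bush, n)
se_margin = margin_standard_error(bush, kerry, n)

# For two independent polls of the same size the variances add,
# so the standard deviation of the gap between them grows by sqrt(2).
se_gap_between_polls = math.sqrt(2.0) * se_margin

ratio_margin_to_share = se_margin / se_share
ratio_gap_to_share = se_gap_between_polls / se_share

print(f"standard error of one share:         {se_share:.2f}")
print(f"standard error of Bush - Kerry:      {se_margin:.2f}")
print(f"sd of the gap between two polls:     {se_gap_between_polls:.2f}")
print(f"ratio, margin to share:              {ratio_margin_to_share:.2f}")
print(f"ratio, two-poll gap to share:        {ratio_gap_to_share:.2f}")
```

With these made-up numbers the first ratio comes out just under 2 and the second just under 2.8, which is where the factors quoted above come from.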

After Luigi Giamboni and I attempted unsuccessfully to improve on pollsters' raw numbers by reweighting with internals, I finally checked the variance across polls against the stated sampling variance. Or, taking square roots, I compared the standard deviation across polls to the root mean square of the sampling standard errors. The root mean square is close to the ordinary average standard error, which is what I will present here.

The variance across polls should be the sum of the sampling variance, the variance due to differences in bias across pollsters, and the variance due to changes in actual opinion. Nonetheless it is only slightly greater than the stated average sampling variance.
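The comparison can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name and the six polls below are hypothetical, not the data used in this post, and the leftover variance is read as bias plus opinion change only if those pieces are uncorrelated with sampling error.

```python
import math
import statistics

def spread_versus_sampling(margins, sampling_ses):
    """Compare the spread of Bush - Kerry across polls with the sampling error.

    margins:      Bush minus Kerry for each poll, in percentage points
    sampling_ses: sampling standard error of Bush minus Kerry for each poll
    """
    sd_across = statistics.stdev(margins)
    mean_se = statistics.mean(sampling_ses)
    rms_se = math.sqrt(statistics.mean([se * se for se in sampling_ses]))
    return {
        "sd_across": sd_across,
        "mean_se": mean_se,
        "rms_se": rms_se,
        # Variance left over after sampling; it can be negative in a small sample.
        "excess_variance": sd_across ** 2 - rms_se ** 2,
    }

# Hypothetical polls, for illustration only.
margins = [3.0, -1.0, 2.0, 5.0, -2.0, 1.0]
sampling_ses = [3.1, 2.9, 3.3, 3.0, 2.8, 3.2]

result = spread_versus_sampling(margins, sampling_ses)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The root mean square is never smaller than the plain average, but when the polls have similar sample sizes the two are nearly equal, which is why the plain average is a reasonable stand-in.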

I will present a lot of standard deviations across polls and average sampling standard errors for Bush-Kerry, that is, the percent who say they would vote for Bush minus the percent who say they would vote for Kerry.

First consider all 98 polls of likely voters who were prompted with Nader as well as Bush and Kerry, taken in 2004 and reported with an estimate of sampling variance. The standard deviation across polls of Bush-Kerry is 3.97 %. The average standard error due to sampling alone in the 98 polls which report this is 3.11 %. The difference of 0.86 % is partly due to all of the variance in true opinions over the year. Now a kind of table:

Likely voter, 3 way: 98 polls, standard deviation 3.97 %, average sampling standard error 3.11 %.

Likely voter, 2 way: 97 polls, standard deviation 3.64 %, average sampling standard error 2.81 %.

Registered voter, 3 way: 105 polls, standard deviation 3.90 %, average sampling standard error 2.90 %.

Registered voter, 2 way: 77 polls, standard deviation 5.23 %, average sampling standard error 3.17 %.

The variation across polls is only slightly higher than one would expect from sampling error alone even though some were taken in August and others in September.

Now look at polls whose sample period ended in October:

Likely voter, 3 way: 41 polls, standard deviation 3.22 %, average sampling standard error 3.13 %.

Likely voter, 2 way: 17 polls, standard deviation 2.04 %, average sampling standard error 3.10 %.

Registered voter, 3 way: 29 polls, standard deviation 2.85 %, average sampling standard error 2.95 %.

Registered voter, 2 way: 13 polls, standard deviation 2.56 %, average sampling standard error 3.19 %.

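There is somewhat less than zero evidence of anomalous variation across polls. That is, in three of the four October categories the variation across polls is lower than one would guess given the stated sampling error, and in the fourth (likely voters, 3 way) it is only marginally higher.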
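Since variances add and standard deviations do not, the cleaner comparison is between squares. The sketch below, in Python, takes the eight pairs of numbers from the two tables above and backs out the standard deviation left over after sampling error. It uses the average sampling standard error as a stand-in for the root mean square, which is an approximation, and the helper name is my own.

```python
import math

# (label, polls, sd across polls, average sampling standard error), from the tables above
rows = [
    ("All 2004, likely voter 3 way", 98, 3.97, 3.11),
    ("All 2004, likely voter 2 way", 97, 3.64, 2.81),
    ("All 2004, registered voter 3 way", 105, 3.90, 2.90),
    ("All 2004, registered voter 2 way", 77, 5.23, 3.17),
    ("October, likely voter 3 way", 41, 3.22, 3.13),
    ("October, likely voter 2 way", 17, 2.04, 3.10),
    ("October, registered voter 3 way", 29, 2.85, 2.95),
    ("October, registered voter 2 way", 13, 2.56, 3.19),
]

def excess_sd(sd_across, avg_se):
    """Signed square root of the variance not accounted for by sampling.

    A negative value means the polls vary less than sampling error alone predicts.
    """
    diff = sd_across ** 2 - avg_se ** 2
    return math.copysign(math.sqrt(abs(diff)), diff)

results = [(label, excess_sd(sd, se)) for label, _, sd, se in rows]
for label, value in results:
    print(f"{label:34s} {value:+.2f}")

negative_count = sum(1 for _, value in results if value < 0)
```

A negative entry does not mean anything is wrong with the polls. With only 13 to 41 polls in an October category, the sample standard deviation is itself a noisy estimate.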

Some explanation of the calculation

Bush-Kerry is 100% times the average over respondents of a variable which is 1 if they support Bush, -1 if they support Kerry, and 0 if they are undecided or support someone else. The square of this variable is 1 for anyone who supports Bush or Kerry and 0 otherwise, so when the race is close (the mean of the variable is near zero) its variance is approximately the probability that someone supports Bush or Kerry. A sample estimate of this probability is the fraction of people in the poll who say they support Bush or Kerry, (Bush+Kerry)/100%. The variable is averaged across respondents and multiplied by 100% to give Bush - Kerry, which therefore has variance 10,000*((Bush+Kerry)/100)/N, where N is the number of people polled. The sampling standard error in Bush - Kerry is therefore 10 times the square root of (Bush+Kerry) divided by the square root of N, in percentage points.
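Here is a small Python sketch of the formula, next to the exact version that subtracts the squared mean instead of ignoring it. The function names and the example poll (49 % Bush, 46 % Kerry, 1000 respondents) are hypothetical, and both functions assume a simple random sample with no weighting.

```python
import math

def margin_se_approx(bush_pct, kerry_pct, n):
    """10 * sqrt(Bush + Kerry) / sqrt(N), in percentage points."""
    return 10.0 * math.sqrt(bush_pct + kerry_pct) / math.sqrt(n)

def margin_se_exact(bush_pct, kerry_pct, n):
    """Same quantity without the close-race approximation.

    The per-respondent variable is +1 (Bush), -1 (Kerry) or 0 (other), so its
    variance is P(Bush or Kerry) minus the square of its mean, P(Bush) - P(Kerry).
    """
    b = bush_pct / 100.0
    k = kerry_pct / 100.0
    variance_per_respondent = (b + k) - (b - k) ** 2
    return 100.0 * math.sqrt(variance_per_respondent / n)

# Hypothetical poll, for illustration only.
approx = margin_se_approx(49.0, 46.0, 1000)
exact = margin_se_exact(49.0, 46.0, 1000)

print(f"approximate standard error of Bush - Kerry: {approx:.3f}")
print(f"exact standard error of Bush - Kerry:       {exact:.3f}")
```

For a race anywhere near tied the two agree to within a few thousandths of a percentage point, so the simpler formula loses nothing that matters here.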

I am very, very embarrassed to admit that Brad DeLong had to explain this to me in 2000, but I trust no one will read this far.

