Sunday, May 16, 2004

Joshua Marshall writes “People who analyze polling data will often take a group of polls, toss out the outliers on either side, and then focus on the cluster of data in the middle which seems overlapping and confirming.” And I assume that he knows what he is talking about, that is, that people who analyze polling data often follow that procedure. I haven’t the foggiest idea why anyone who knows anything about statistics would do such a thing.

I think a strong case can be made for averaging all available polls.

I think a case might be made for ignoring the results from a polling agency whose performance is systematically poor or biased. Poor here means predictions of election outcomes with a large root mean squared error; biased means predictions with a significant mean error. Notice that for electoral polls, new information on poor or biased performance comes only on election day. For pure opinion polls, where the aim is to find out whether people approve of something or think that something is important, such information never comes.
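To make the distinction concrete, here is a toy Python calculation with numbers I invented, for a single hypothetical firm’s past election-day prediction errors. The mean error measures bias; the root mean squared error measures how poor the predictions are overall.

    # A toy calculation, with made-up numbers: the signed errors (predicted
    # minus actual margin, in points) of one hypothetical firm's final
    # pre-election polls over several past elections.
    import math
    import statistics

    past_errors = [2.1, 1.4, 3.0, 1.8, 2.5]   # hypothetical, for illustration only

    bias = statistics.mean(past_errors)                            # "biased": mean error far from zero
    rmse = math.sqrt(statistics.mean(e * e for e in past_errors))  # "poor": typical miss is large

    print(f"mean error (bias): {bias:+.2f} points")
    print(f"root mean squared error: {rmse:.2f} points")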

In particular, strong evidence that one polling agency is, on average, more favourable to Republicans than the average poll is strong evidence of bias only if one has strong reasons to believe that the average poll is unbiased. Such evidence is, in practice, obtained only on election days, and my sense is that it is quite weak. For example, in 2000 Gallup polls of “likely voters” were significantly more favourable to Bush than the average poll, yet the average poll over-predicted Bush’s share of the popular vote by an amount not explained by sampling error alone. From this I conclude nothing. We have one data point, the 2000 presidential election, which is not enough to discredit Gallup.

A still more pointed example is Rasmussen’s tracking poll which, unless I am mistaken, predicted that Bush would win by 6%. I check the Rasmussen tracking poll about twice a day, and my only irritation is that half of the time I find it has not been updated. I am currently in a good mood, because Rasmussen has Kerry ahead today, unlike yesterday. I know this is crazy, but putting little weight on the bad performance on November 7, 2000 is not crazy.

The procedure described by Marshall makes very little sense. Evidently the practice is to discard a poll, not a polling agency; in earlier posts Marshall seems to me to have done exactly that. I wonder why people do it. I will present three unsuccessful attempts to make sense of it.

First, imagine that the analysts think that by far the principal reason polls differ is sampling error, even though this is not reasonable. If they thought this, then the best way of pooling polls would be an average weighted by the sample sizes of the polls. That is, the weight on each poll would have nothing to do with whether it is an outlier. If each sample is a true random sample and there is no other difference between polls, this is a mathematical result about as dubious as 2+2=4.
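To spell this out, here is a minimal sketch of sample-size-weighted pooling, with polls I invented for illustration. Note that the outlier gets no special treatment.

    # A minimal sketch of pooling under the pure-sampling-error assumption.
    # Each (margin, sample size) pair below is invented for illustration.
    polls = [(2.0, 1200), (-1.0, 800), (9.0, 600), (1.0, 1500)]   # the 9-point poll is an "outlier"

    total_n = sum(n for _, n in polls)
    pooled = sum(margin * n for margin, n in polls) / total_n

    print(f"sample-size-weighted pooled margin: {pooled:+.2f} points")
    # The weight on each poll depends only on its sample size,
    # not on whether its margin is an outlier.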

It is possible that there are differences between polls other than random sampling error; indeed this is clearly true. I only know what I read at pollingreport, but I can see that there are significant firm-specific differences. This is the case considered above: if there is evidence that a firm is unreliable, its polls should be given low weight (or ignored entirely), whether or not a specific poll is an outlier.

The only way to rationalize the practice described by Marshall is to assume that there is some process, other than random sampling error, which causes some specific polls to be far from the truth but does not affect other polls by the same firm. Another way of putting this: if the errors in polls have a fat-tailed distribution, a trimmed mean is a better estimate of the state of opinion than an untrimmed mean. I haven’t the faintest idea what could cause such a distribution of errors. I suppose it is possible that this is because I don’t know as much as the analysts do about how pollsters really pick their samples, but I certainly am not ready to rule out the possibility that the analysts are confused.
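For what it is worth, here is a rough simulation of that fat-tailed story, with parameters I made up: most polls miss the truth by ordinary sampling noise, but roughly one poll in ten goes wildly wrong for some unexplained poll-specific reason. In such a world the trimmed mean really does beat the plain mean.

    # A rough simulation of the fat-tailed story, with invented parameters.
    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(0)
    true_margin, n_polls, n_sims = 2.0, 10, 20_000

    mean_errs, trimmed_errs = [], []
    for _ in range(n_sims):
        errors = rng.normal(0.0, 2.0, n_polls)             # ordinary sampling noise
        wild = rng.random(n_polls) < 0.1                   # occasional "bad poll"
        errors[wild] += rng.normal(0.0, 10.0, wild.sum())
        polls = true_margin + errors
        mean_errs.append(polls.mean() - true_margin)
        trimmed_errs.append(trim_mean(polls, 0.1) - true_margin)   # drop top and bottom 10%

    print("RMSE of plain mean:  ", np.sqrt(np.mean(np.square(mean_errs))))
    print("RMSE of trimmed mean:", np.sqrt(np.mean(np.square(trimmed_errs))))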


[Update] I have read the post whose intro set me off. The analysis in the post is, as usual for Marshall, excellent, not merely far better than the approach to analyzing polls mentioned in the intro.

Marshall writes “But to go back to my analogy about analyzing polls, even if we set aside the issue of whether there was this specific black operation -- noted by Hersh -- the basic story seems more and more clear, and increasingly confirmed from multiple sources. That is, that irregular methods originally approved for use against al Qaida terrorists who had just recently landed a devastating blow against the US, were later expanded (by which mix of urgency, desperation, reason, bad values or hubris remains to be determined) to the prosecution of the insurgency in Iraq.”

Now this makes sense. The fact that similar complex verbal claims are made by multiple sources is evidence that the shared aspects of the claims are true. If the claims did not confirm each other, we should have little confidence in the “average” claim, even if we could average sentences as we average numbers. Since all the sources are anonymous, Marshall is not relying on their track record; they could not have one (of course he is, in part, relying on Hersh’s track record, as well he should).

So what’s the difference? I think the issue is that similar numbers do not confirm each other to the same extent that similar stories do.

It is easy for different polling results for Kerry minus Bush, rounded to the nearest percent, to be identical even though each is certainly an imprecise measure, and even if each is biased or grossly unreliable. Since the possible numbers are few, it is easy for two such numbers to be identical by coincidence, and easier still for them to be very close by coincidence. The important thing is that this is true even if both are far from the true population average.
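Here is a crude simulation of that point, again with invented numbers: two polls sharing a hypothetical four-point house effect agree to the nearest point quite often, and when they agree they are usually both far from the truth.

    # A crude simulation of the coincidence point, with invented numbers.
    import numpy as np

    rng = np.random.default_rng(1)
    true_margin, house_effect, noise_sd, n_sims = 0.0, 4.0, 1.5, 100_000

    poll_a = np.rint(true_margin + house_effect + rng.normal(0, noise_sd, n_sims))
    poll_b = np.rint(true_margin + house_effect + rng.normal(0, noise_sd, n_sims))

    agree = poll_a == poll_b
    both_far_off = (np.abs(poll_a - true_margin) >= 3) & (np.abs(poll_b - true_margin) >= 3)

    print("share of simulations where the two polls agree exactly:", agree.mean())
    print("... and both are at least 3 points from the truth:     ", (agree & both_far_off).mean())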

On the other hand, it is more difficult for different stories to correspond unless the shared features are true. The number of different possible lies is so immense that they can’t match by accident. Of course, more difficult does not mean very difficult. The different sources might have conspired to mislead the press. Different reporters might, unknowingly, be quoting the same small set of anonymous sources. Finally, lies can co-ordinate around rumours, with each liar claiming to have direct information that the rumour is true (I think Gary Sick was honestly misled by such rumour-confirming liars when he researched “October Surprise”).

One minor question is why Marshall chose to present his clearly sound approach by analogy with a dubious method for analysing polling data. Checking whether a story is confirmed by multiple sources is not a new or controversial approach to journalism. I’m afraid that he momentarily lapsed into number envy, that is, found the intellectual status of statistical analysis attractive, even though, in this case, his analysis is much more sound than that of the numerical analysts.

A caveat: I criticise discarding outliers specifically in the case of pooling polls. The reason is that it seems to me hard to imagine why there would be bad polls, that is, why the distribution of polling errors would have poll-specific (not firm-specific) fat tails. In general, I am rather attracted to the approach of down-weighting or discarding outliers, as is clear from my CV.

A final question is whether the attraction of discarding outliers is based in part on the analogy with the traditional practice of dismissing sources (or witnesses) whose claims are not confirmed by anyone else. I have no reason for my confidence, but, in my heart, I am not able to doubt that such a false analogy leads analysts to discard outliers even when they shouldn’t and would lead most people to discard outliers even when they are told (and understand and believe) that the numbers are drawn from an iid normal distribution.
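As a sanity check on that last clause, here is a quick simulation under exactly that assumption: with iid normal draws, trimming only discards information, and the plain mean comes out (slightly) ahead.

    # A quick check under the stated assumption that the numbers really are
    # drawn iid from a normal distribution: discarding the extremes only
    # throws away information, so the plain mean should win.
    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(2)
    n_obs, n_sims = 10, 20_000

    samples = rng.normal(0.0, 1.0, (n_sims, n_obs))
    plain = samples.mean(axis=1)
    trimmed = trim_mean(samples, 0.1, axis=1)   # drop top and bottom 10% of each sample

    print("RMSE of plain mean:  ", np.sqrt(np.mean(plain ** 2)))     # smaller
    print("RMSE of trimmed mean:", np.sqrt(np.mean(trimmed ** 2)))   # slightly larger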
