The Simplest Meta-Analysis Problem
This is relevant to understanding the fascinating work of pyjamas in bananas
Let us say we are interested in finding the probability p that outcome A occurs when we perform experiment B. The natural approach is to perform experiment B N times and calculate the frequency of A occurring pe (pe for probability estimate which is equal to the mean of a variable which is 1 if A happens and zero otherwise). This approach has much going for it. The estimate is consistent by definition, since the limit of this estimate as N goes to infinity is Fisher's definition of probability. Also it is unbiased (for any sample size the expected value of the estimate is equal to the true probability) and is the maximum likelihood estimate so it is the best asymptotically normal estimate of the probability.
Now let us say that various groups indexed by i have done experiment B and group i has done it Ni times. Note I have assumed that each has done an identical experiment, so I assume that group 10 did not just make up the data (it happens). Further I explicitly assume that they perform B identically as described in the protocol with no slips (that is I personally am not working in any of the groups) and measure whether A happened the same way.
What do we do with all the data ? Well given my assumptions we already know. The best estimate of p is the total number of cases in which A occurred divided by the total number of times which B was performed. That is, we should just pool the data as if they came from one big experiment.
Now what if they don't report all of their raw data but just pei (the frequency) and Ni (the sample size) ? No problem. The best estimate of p given all of the data is equal to the weighted average of the frequencies (pei) weighted by the sample sizes.
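A quick check of that equivalence, as a minimal sketch in Python. The three groups and their counts are made up for illustration, and the function names are mine.

```python
# Hypothetical reports from three groups: (times A occurred, times B was performed).
reports = [(12, 40), (30, 100), (7, 20)]

def pooled_estimate(reports):
    """Total number of successes divided by total number of trials."""
    total_successes = sum(k for k, n in reports)
    total_trials = sum(n for k, n in reports)
    return total_successes / total_trials

def size_weighted_average(freqs, sizes):
    """Average of the reported frequencies, weighted by sample size."""
    return sum(n * pe for pe, n in zip(freqs, sizes)) / sum(sizes)

# What the groups would report if they withheld the raw counts.
freqs = [k / n for k, n in reports]
sizes = [n for k, n in reports]

pooled = pooled_estimate(reports)
weighted = size_weighted_average(freqs, sizes)
print(pooled, weighted)  # the two agree up to floating point rounding
```

The agreement is just algebra: Ni times pei is the number of times A occurred in group i.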
Now what if they don't report the sample sizes (never happens but what if) but do report estimated standard errors esei of their estimate pei of p ? No problem. The square of the estimated standard error is pei(1-pei)/Ni, so Ni = pei(1-pei)/esei^2 and we weight by
1) weighti = pei(1-pei)/esei^2.
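Since the squared estimated standard error of a frequency is pei(1-pei)/Ni, the sample size can be backed out as pei(1-pei)/esei^2. Here is a minimal sketch in Python. The data and function names are hypothetical, and it assumes no group reports a frequency of exactly 0 or 1 (the estimated standard error would then be zero and the sample size could not be recovered).

```python
import math

def estimated_se(pe, n):
    """Estimated standard error of a frequency pe observed in n trials."""
    return math.sqrt(pe * (1 - pe) / n)

def implied_sample_size(pe, ese):
    """Back out the sample size from a frequency and its estimated standard error."""
    if ese <= 0:
        raise ValueError("ese must be positive, so pe must not be 0 or 1")
    return pe * (1 - pe) / ese ** 2

# Hypothetical reports: (frequency, true sample size). The sizes are used only
# to build the standard errors that the groups would publish.
reports = [(0.30, 40), (0.30, 100), (0.35, 20)]
freqs = [pe for pe, n in reports]
eses = [estimated_se(pe, n) for pe, n in reports]

# Weights recovered from the published numbers alone.
weights = [implied_sample_size(pe, ese) for pe, ese in zip(freqs, eses)]
estimate = sum(w * pe for w, pe in zip(weights, freqs)) / sum(weights)
print(weights, estimate)
```

The recovered weights are the sample sizes, so this is again the pooled estimate.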
This is all obvious so far. However, it appears that there can be a temptation to do something wrong. The misleading analogy is as follows. What should we do if we have a bunch of parameter estimates and each has a KNOWN standard deviation sei ? Here the best overall estimate of the parameter is the weighted average weighted by 1/sei^2. 1/sei^2 is called the precision so this is a precision weighted average.
One might make a mistake and decide to pretend that the estimated standard errors are exact so esei = sei. This would be a mistake, since, by assumption, in the example sei depends on i only through the sample size and is equal to the square root of p(1-p)/Ni, while esei also moves with the noisy pei. In fact, it would be a very bad mistake. The average weighted by estimated precision will give a biased estimate of p unless p = 0.5. Furthermore, if the number of experiments performed by each group i (Ni) is drawn from a stationary distribution, the probability limit of the overall estimate of p as the number of groups goes to infinity is not p, that is the overall estimate is inconsistent.
How do I know this ? Well I know that the sample size weighted average is an unbiased consistent asymptotically efficient estimate of p and I can compare the expected value of this correct overall estimate to the expected value of the weighted average with weights equal to the estimated precision. Assume that Ni = N for each group, that is, that each group performs the experiment the same number of times (please this is just to make things easier to type). Then we see that the best unbiased asymptotically normal estimate of p is the unweighted average of pei. The weighted average with weights based on estimated precision will put less weight on experiments where pei is close to 0.5. If p is less than 0.5, this means that, on average, experiments with high pei have lower weight than experiments with low pei. This means that the expected value of the (estimated precision) weighted average will be lower than p, which equals the expected value of the unweighted average. This is true for any number of groups and therefore true asymptotically too.
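To make the direction of the bias concrete, here is a small sketch in Python that computes the many-groups limit of the (estimated precision) weighted average exactly when every group runs N trials. The function names and the choice N = 50 are mine, for illustration only. Outcomes with pei equal to 0 or 1 are skipped because the estimated standard error is zero there, which is an extra assumption of the sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def precision_weighted_limit(p, n):
    """E[w * pe] / E[w] with w = 1 / ese^2 = n / (pe * (1 - pe)).

    This ratio is what the estimated precision weighted average converges to
    as the number of groups grows, given that each group runs n trials.
    Outcomes k = 0 and k = n are skipped (assumption: such groups are dropped).
    """
    numerator = 0.0
    denominator = 0.0
    for k in range(1, n):
        pe = k / n
        weight = n / (pe * (1 - pe))
        prob = binomial_pmf(k, n, p)
        numerator += prob * weight * pe
        denominator += prob * weight
    return numerator / denominator

low = precision_weighted_limit(0.3, 50)   # comes out below 0.3
mid = precision_weighted_limit(0.5, 50)   # 0.5 by symmetry
high = precision_weighted_limit(0.7, 50)  # comes out above 0.7
print(low, mid, high)
```

The limit is pushed away from 0.5 in both directions, and it sits exactly at 0.5 only when p = 0.5.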
So is this mistake even possible ? It sure is. Not only is it possible that a data analyst will assume it is OK to act as if an estimate is exactly true even when knowing it isn't, but there is at least one case in which the (estimated precision) weighted average is the best overall estimate of a parameter. The problem is that most meta-analysts consider only the second simplest problem -- estimating the mean of a variable which is normally distributed. In this case, the estimated mean and the estimated standard deviation are independent. Therefore weighting by the estimated precision cannot bias the overall mean.
Even in the case of pooling estimates of the mean of a normally distributed variable, the sample size weighted mean of means is the most efficient estimate if we are sure that each group's measurement is drawn from an identical distribution, so that we know the true standard errors of their estimated means are proportional to one over the square root of the sample size. However, the (estimated precision) weighted mean is a better estimate if the variances of the normals observed by different groups are different, as they would be if some groups were better at measuring than others.
There is no general theory in statistics that says that estimated means and estimated standard deviations are, in general or usually, independent. That is a particular property of estimates of the parameters of the normal distribution. However applied statistics is unfortunately often based on the assumption that all random variables are normally distributed. The binary distribution of the example above is just one of many cases in which estimated means are not independent of estimated standard deviations even if every group is reporting data drawn from the same distribution.
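A simulation makes the contrast visible. The following is a rough sketch in Python using only the standard library. The number of groups, the sample size, the seed, and p = 0.3 are arbitrary choices of mine, and correlation is only a crude stand-in for dependence.

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def mean_sd_correlation(draw, groups=2000, n=50, seed=0):
    """Correlation across groups between each group's sample mean and sample sd."""
    rng = random.Random(seed)
    means, sds = [], []
    for _ in range(groups):
        sample = [draw(rng) for _ in range(n)]
        means.append(statistics.fmean(sample))
        sds.append(statistics.stdev(sample))
    return pearson(means, sds)

# Normal data: the sample mean and sample sd are independent.
corr_normal = mean_sd_correlation(lambda rng: rng.gauss(0.0, 1.0))
# Binary data with p = 0.3: the sd is a function of the mean.
corr_binary = mean_sd_correlation(lambda rng: 1.0 if rng.random() < 0.3 else 0.0)
print(corr_normal, corr_binary)
```

With normal draws the correlation should hover near zero, while with binary draws it should be strongly positive, because for p below 0.5 a higher frequency mechanically means a higher estimated standard deviation.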
OK so what to do about real world problems ? I think it is unwise to assume that estimated means and estimated standard deviations are independent -- not all variables are normally distributed. It is also unwise to assume that all experimental groups produce data of identical quality. My preferred solution is to calculate a sample size weighted average and give a standard error for that overall average which is robust to variable quality of measurements by different groups (for example assuming that esei = sei when calculating the standard error of the overall sample size weighted average).
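One way to read that recipe, as a minimal sketch in Python: weight the estimates by sample size, then compute the variance of the weighted average by plugging in each group's own reported standard error. This assumes the groups' estimates are independent of each other. The numbers and the function name are hypothetical.

```python
import math

def size_weighted_with_se(estimates, eses, sizes):
    """Sample size weighted average and its standard error.

    The standard error treats each reported esei as if it were the true sei,
    so groups are allowed to differ in measurement quality.
    Assumes the group estimates are independent of each other.
    """
    total = sum(sizes)
    shares = [n / total for n in sizes]
    estimate = sum(w * e for w, e in zip(shares, estimates))
    variance = sum((w * s) ** 2 for w, s in zip(shares, eses))
    return estimate, math.sqrt(variance)

# Hypothetical reports from three groups.
estimates = [0.30, 0.30, 0.35]
eses = [0.07, 0.05, 0.11]
sizes = [40, 100, 20]

overall, overall_se = size_weighted_with_se(estimates, eses, sizes)
print(overall, overall_se)
```

The estimated standard errors enter only the reported uncertainty, never the weights, so noise in them cannot pull the point estimate around.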
Such an estimate is unbiased if each experimental result is unbiased (pretty much a necessary assumption for meta analysis) and consistent in the number of groups. It is robust to the incompetence of one group (although not to scientific fraud).
Also, the estimated precision weighted average is efficient, unbiased, and consistent if the estimated precision is independent of the estimated mean. The sample size weighted average is not efficient, but it is unbiased and consistent even if the estimated precision is not independent of the estimated mean. This means that a Wu-Hausman test can be used to test the null that the (estimated precision) weighted mean is unbiased.
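A rough sketch of what such a test could look like for a single pooled parameter, in Python. It uses the standard Hausman form: the squared difference between the two estimates divided by the difference of their variances, compared to a chi-square with one degree of freedom. The details are my assumptions rather than a worked-out procedure: the groups are treated as independent, the reported standard errors are plugged in as if known, and the data are made up.

```python
import math

def hausman_weighted_means(estimates, eses, sizes):
    """Hausman-style statistic comparing two weighted means, with an approximate p-value."""
    # Efficient under the null: estimated precision weights.
    precisions = [1.0 / s ** 2 for s in eses]
    b_eff = sum(w * e for w, e in zip(precisions, estimates)) / sum(precisions)
    var_eff = 1.0 / sum(precisions)

    # Consistent under null and alternative: sample size weights.
    total = sum(sizes)
    shares = [n / total for n in sizes]
    b_con = sum(w * e for w, e in zip(shares, estimates))
    var_con = sum((w * s) ** 2 for w, s in zip(shares, eses))

    var_diff = var_con - var_eff
    if var_diff <= 0:
        raise ValueError("the two sets of weights coincide, so the test is undefined")
    stat = (b_con - b_eff) ** 2 / var_diff
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical binary outcome studies, 50 patients each.
estimates = [0.2, 0.3, 0.4]
sizes = [50, 50, 50]
eses = [math.sqrt(pe * (1 - pe) / n) for pe, n in zip(estimates, sizes)]

stat, p_value = hausman_weighted_means(estimates, eses, sizes)
print(stat, p_value)
```

With so few groups the chi-square approximation is not to be trusted. The point is only the shape of the calculation.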
OK so why am I thinking about this ? It is related to meta-analysis of the effects of anti-depressants. Different results are obtained by using different weights when averaging across studies. The differences are large compared to the estimated standard errors of the overall averages. Why might the effect of an anti-depressant be like a binary ? Let's say event A is the patient recovers. Sad to say, this happens less than half of the time during the test period even with modern anti-depressants (also it often happens with just a placebo). A study in which more patients recovered will have a larger variance of the patient outcome 1 = recovery 0 = still depressed, even if the results of all studies are drawn from the same distribution. Using the variance of patient outcomes as well as the sample size to calculate estimated standard deviations of the effect will bias the estimated precision weighted average effect away from curing half of the patients which means (I think) towards zero.
Of course the data are not 1 no longer depressed and 0 still depressed. Instead they are scores on a scale. However, I honestly believe that, in practice, the response to anti depressants is bimodal with chance of being roughly cured less than 50% and chance of no effect of the antidepressant more than 50%. In any case, it is easy to test the hypothesis that estimated means and estimated standard deviations are independent and it is easy to test the hypothesis that the expected value of the sample size weighted mean and the estimated precision weighted mean are the same.
There is, I think, a very strong presumption that, if one of the two is biased, it is the (estimated precision) weighted mean.