Caveat Lector II
More work on meta-analysis of the effects of anti-depressants. By now I am sure that I understand the notation "d" used by physicians when talking about statistics, since I can reproduced the results in Kirsch et al (2008). This, and repeated checking, convince me that I managed to read a *.gif and type numbers in correctly. Finally, I have reproduced the results calculated with a spread sheet using STATA 9. The following post might still contain boo-boos, but it is, I hope, getting close to a journal submission.
Kirsch et al (2008) estimate the average benefit of SSRI’s using a simple standard meta-analysis procedure “We conducted two types of data analysis …and another using each study's drug and placebo groups' arithmetic mean (weighted for the inverse of the variance) as the meta-analytic “effect size” . They first calculated an estimate of the overall average change of the Hamilton Rating Scale of Depression (HRSD) for patients who received an SSRI as the precision weighted average of the changes reported for the treated subsample in each trial. Then they calculated the average change for patients who received the placebo in the same way and calculated the difference. This approach involves two methodological choices. First the use of precision weights (weights inverse to the reported variance of the estimate of the mean change in each study). Second the decision to calculate overall mean changes for treated and control groups and then take the difference rather than first calculating the difference in the HRSD then calculating a weighted average across trials of those differences. In each case, Kirsch chose a method which, under strong assumptions, gives an efficient and unbiased estimate of the true overall average benefit. In each case there are alternative approaches which are less efficient under those assumptions but which are unbiased not only when the Kirsch et al estimates are unbiased, but also for many cases in which the Kirsch et al estimates are biased. That is they are less efficient under the null but more robust. In each case the null hypothesis that the Kirsch et al estimator is unbiased has been tested and rejected. The available unbiased estimate of the overall average benefit of SSRI’s is equal to 2.65 HRSD units, which is considerably higher than Kirsch et al’s biased estimate
Bias due to precision weighting when disturbance terms are not normal.
First consider the estimation of the mean change in HRSD for patients who received an SSRI. A simple natural alternative to the precision weighted average is the simple average change in HRSD across all patients. This is equal to the average of the average change across trials weighted by the number of patients who received an SSRI (from now on Nsi). This is an unbiased estimate of the sample size (Nsi ) weighted average expected change across trials.
The precision weighted average is an unbiased and efficient estimate of the sample size (Nsi ) weighted average expected change across trials if, within each trial, the improvement of the average patient were equal to the expected improvement (itself a function of the characteristics of the trial) plus a disturbance term with a mean zero normal distribution. The assumption of normality has the critically important implication that for given population means and averages the sample mean and the sample average are distributed independently. This is a particular property of the normal distribution which does not hold for other distributions.
In general it is possible that sampling alone will create a positive correlation between the sample mean and the sample variance. This would mean that trials that showed a higher than expected improvement would have a lower than expected weight and, thus, that the precision weighted average is biased down.
In a first effort to see if this might be a problem, the 35 sample standard deviations (sds) were regressed on the sample mean improvements in RRSD (CHANGEs) giving a coefficient of 0.25 with a standard error of. 0.1 Under the null of zero correlation, the ratio of the coefficient and the standard error (2.51) definitely does not have a t-distribution, since the sample standard error is definitely not normally distributed.
A more reasonable second step was to regress sds on CHANGEs by weighted least squares with weights equal to Ns). This gives coefficient (0.39) with a standard error of (0.08). Cehbychev’s inequality is sufficient to reject the null of zero correlation at the 5% level without any distributional assumptions.
Since there is an estimate which is efficient and unbiased under the null and another estimate which is unbiased both under the null and the alternative, it is possible to test the null hypothesis that the precision weighted average is unbiased using a Wu-Hausman test. The square of the difference in the estimates divided by the difference in the estimated variances is distributed as a chi2(1) under the null.
The variance of the CHANGE within subsamples (Varsi and Varpi) are needed to calculate the precision weighted average of change and to calculate the variance of both the precision weighted average and the simple average. Varsi can be calculated from the data in Table 1 of Kirsch et al using the formula Varsi = (CHANGEsi /dsi)2 The variance of the precision weighted average is the inverse of the sum over i of (Nsi /Varsi ). The variance of the simple average is the sum over i of
Nsi Varsi divided by the square of the overall total number of patients who received SSRIs.
The calculated statistic is 133.15 overwhelmingly rejecting the null.
The huge Wu-Hausman test statistic is partly the result of one outlier protocol 62 subsample mild, which has a very low reported sds and CHANGEs. If the very low sds is replaced by the Ns weighted average sds the Wu Housman statistic drops to 20.57 which is still overwhelmingly significant.
Define Npi As the number of patients who received the placebo in trial i, sdpi as the standard deviation in the change in HRSD among patients who received the placebo in trial i and changepi as the average change in HRSD among patients who received the placebo in trial i.
It is possible to test the null that the precision weighted average change of patients who received the placebo is an unbiased estimate of the overall average expected change in HRSD when receiving a placebo against the alternative that it is biased but the simple average is unbiased. In this case the Wu-Hausman test statistic is 0.65 which is not at all significant.
Finally it is possible to test the null hypothesis that the estimate of the additional benefit of an SSRI over a placebo estimated by Kirsch et al (1.8) is biased against the alternative hypothesis that it is biased but the difference in the simple average changes in HRSD across patients (2.18) is an unbiased estimate of the additional benefit. First note that, as discussed below, both methods require the implicit assumption that the disturbances to individual patients’ change in HRSD are independent. This implies that the overall average CHANGE of SSRI treated patients is distributed independently of the overall average CHANGE of placebo treated patients, so the variance of the difference between these averages is equal to the sum of the variances of these averages.
In this case the Wu-Hausman test statistic is 38.07 which is overwhelmingly significant and provides strong evidence that Kirsch et al’s estimate is biased down.
Bias due to the Exclusion of Indicator Variables for Each Trial
A separate and, perhaps, more important issue is the difference between first calculating overall mean changes in HRSD for SSRI treated and placebo treated patients and taking the difference as Kirsch et all did, instead of first calculating the difference in HRSD improvement of SSRI and placebo treated patients within each trial and then calculating an overall average. A basic principle of clinical trials is that the only reliable evidence is differences between treated and controlled patients when neither the patients nor the researchers know who is treated. This suggests that the basic principles of clinical testing imply that the first step of valid analysis must be calculating these differences. The null hypotheses that other approaches are unbiased are tested and rejected below. The explanation of this statistical result should be familiar as concern about the potential biases are exactly the reason that controlled double blind studies are conducted.
Some notation will be useful in the discussion below. First Changepij is the change in HRSD of the jth patient in the ith trial. In general Changepij will be affected by factors specific to patient j and by factors common to all patients in trial i so
1) Changepij = c + dSSRI + i + ij .
If is correlated with dSSRI, OLS and weighted least squares estimates of will be biased. Even if is not correlated with dSSRI, such estimates will be inefficient, because they are based on the assumption that the variance of is zero.
In order to explain the hypothesis tests, it is helpful to explain how the relevant calculations can be interpreted as OLS or weighted least squares regressions. For simplicity, first the case of simple OLS regressions with data on individual patients from different trials is discussed. As discussed above, a very simple estimate of the effect of SSRI’s in addition to the placebo effect is the average change of HRSD of SSRI treated patients minus the average change in HRSD of placebo treated patients. This is exactly equal to the coefficient on an indicator variable (dSSRI) which is one if the patient took an SSRI and 0 otherwise in a simple OLS regression which also includes a constant term. This estimate is biased if there is any excluded variable which is correlated both with dSSRI and with the change in HRSD (CHANGE). Since different fractions of patients were given SSRI’s in different trials, indicator variables for the trials are correlated with dSSRI. An indicator variable for the trial implementing protocol 62 [mild] would be one for patients enrolled in that trial whether they received an SSRI or a placebo, and zero for other patients.
The coefficient of CHANGE on dSSRI in a regression which also contains 35 indicator variables for the 35 trials (or equivalently 34 indicators for 34 trials and a constant) is called the within estimator of the additional benefit the SSRIs, because it is based on differences within each group, in this case each clinical trial.
Now it is easy to calculate the within estimate using only the data reported in table 1 of Kirsch et al and in particular CHANGE and the number of patients (N) in each of the 70 subsamples. The OLS estimate with individual data is equal to the weighted least squares estimates using N as weights of the average change in HRSD on the average value of dSSRI (the fraction of the patients in the trial who received an SSRI) and the indicator for the trial.
The coefficient is also equal to the weighted average of the difference in CHANGE for the SSRI treated patients and the placebo treated patients in each trial. The weights to be used to obtain the within estimate are the harmonic mean of the number of patients who received the SSRI and the number of patients who received the placebo Nhi = (NsiNpi/(Nsi + Npi ). The variance of the within estimate of the coefficient on dSSRI is the sum over I of Nhi((Varsi/Nsi)+(Varpi/Npi)) divided by the square of the sum over I of Nhi.
The resulting estimate of the benefit of the SSRIs is 2.65.with a standard error of 0.255.
A very strong case can be made a priori that such indicator variables should be included in the regression. As noted above, the indicator variables are correlated with the variable of interest, since different fractions of patients received an SSRI in different trials. Different trials also have different average values of CHANGE for many reasons. First the average baseline HRSD is different for different trials. Second all trials include not only ingestion of an SSRI or a placebo but also clinical management which may differ across trials and cause different average changes in HRSD. Finally, application Hamilton rating scale requires judgment on the part of the clinician and the application of the scale might not be identical across trials.
Common factors affecting different patients in the same trial ( imply that OLS is not appropriate. At the very least, common factors imply that OLS is not efficient even if the common factors are not correlated with dSSRI. It also seems extremely unlikely that OLS will be unbiased.
To see this, note that the OLS estimate of the coefficient is the weighted average of the within estimate described above and the, so called, between estimate which is equal to a weighted least squares regression of each trial’s overall average of CHANGE on the fraction of patients in the trial who receive an SSRI (FSRRI) and a constant. The weights are equal to the total number of patients in the trial. For the data presented in Table 1 of Kirsch et al, the between estimate of the additional benefit of an SSRI over a placebo is actually negative (-.).
Assuming that the within estimate is unbiased, the OLS estimate is unbiased only if the sample size weighted between estimate is unbiased. It should be clear that the data used in calculating the between estimate are not the sort of data usually considered acceptable when evaluating a pharmaceutical. For example, if one were willing to trust between estimators, one could estimate the effect of a pharmaceutical using one trial in which all patients receive the pharmaceutical and another trial in which all patients received a placebo. The view that trials should be double blind is inconsistent with the view that between estimators should be used at all.
Even using the extremely limited data available in table 1 of Kirsch et al, it is possible to demonstrate in two different ways that the OLS estimate of the benefit of SSRIs is biased. First, the only other variable in the table – baseline, is correlated with ACHANGE and with FSSRI. ABASELINEi is the overall average initial HRSD of patients involved in trial i. Weighted least squares, where the total number of patients in each trial are used as weights, of ACHANGE and FSSRI on ABASELINE are reported below. They are sufficient to demonstrate that the OLS estimate of the average benefit of an SSRI is biased down.
Also it is possible to perform a Wu-Hausman test to test the null that the OLS estimate is unbiased and efficient. This null hypothesis is not extremely interesting, since, for the OLS estimate to be efficient, it would be necessary for the correlation within trials of CHANGE to be zero. However, it is easily performed and rejects the null that OLS is efficient and unbiased. The statistic, which has a Chi2(1) distribution under the null, is equal to 13.49 overwhelmingly rejecting the null that the OLS estimate is unbiased.
Similarly, it is possible to test whether the Kirsch et al estimator is unbiased by comparing the estimate to weighted least squares estimates of the model including a full set of indicator variables for the trials in which the weights are equal to the subsample size divided by the subsample variance. This last estimator is a precision weighted within estimator. The coefficient on dSSRI is equal to the weighted average difference in CHANGE with an SSRI and placebo for each trial with the weight on the difference in the ith trial equal to 1/((varsi/Nsi)+(varpi/Npi)). The precision weighted within estimate is 2.40.
The variance of the precision weighted within estimator is equal to the sum of these weights. Thus it is possible to perform a Wu-Hausman test of the null that both the Kirsch et al estimator and the precision weighted fixed effects estimator are unbiased against the alternative that only the precision weighted fixed effects estimator is unbiased. The statistic, which is distributed as a Chi2(1) under the null is equal to 55.37 overwhelmingly rejecting the null hypothesis.
Also the null hypothesis that the precision weighted within estimator and the subsample size weighted within estimator are both unbiased is rejected against the alternative that only the sample size weighted estimator is unbiased, since the relevant Wu-Hausman statistic is 13.49.
Finally, the null that the Kirsch et al estimator and the subsample size weighted within estimator are unbiased is rejected against the alternative that only the subsample size weighted estimator is unbiased since the Wu-Hausman test statistic is 65.00.
The Wu-Hausman test is only valid if one of the estimators is efficient under the null hypothesis. In out case, this requires the assumption that all the common terms are exactly zero. A more interesting hypothesis is that is iid across trials with a positive variance and is uncorrelated with the fraction of patients receiving an SSRI. In this case, the most efficient estimate is the random effects estimator which is a weighted average of the within and between estimates with weights equal to the inverse of the variances of the estimates.
Unfortunately, the correct weights for the between estimator are not the inverse of the number of patients in the trial. Rather they are the inverse of the sum of the variance of and the inverse of the number of patients in the trial times the variance of . The variances of and can be estimated by maximum likelihood (if one can explain weighted random effects to a statistical package). Alternatively they can be estimated by iterated weighted least squares starting with the weighted least squares between estimate for zero variance , estimating the variance of from the difference between the sum of squared residuals from that regression and the sum of Vari . Then repeating weighted OLS with the new weights and calculating a new estimate of the variance of and continuing until the procedure converges.
This gives a between estimate quite different from the first estimate based on the assumption that the variance of is zero. The estimated coefficient on dSSRI is -0.48 with estimated variance 2.43. Thus the random effects estimate of the coefficient on dSSRI is 2.57 with a variance of 0.0637. Note that the random effects estimate is very close to the within estimate. Nonetheless the Wu_hausman test statistic is of 3.93 implies rejection of the null hypothesis that the random effects estimate is unbiased (the 5% critical level is 3.84).
Of the estimates reported above, the only estimate of the overall average effect of the SSRI’s on HRSD which is not known to be biased is 2.65 -- the result of sample size weighted within estimation.