Robert's Stochastic thoughts

Tuesday, March 11, 2008

Caveat Lector

For the past 2 days I have been interested in meta-analysis, that is, mostly criticizing the meta-analysis of trials of modern anti depressants (SSRI's).
I have attempted some calculations based on the arguments in the post below and the one after the one after that. It is very possible that I have not understood the paper which I am criticizing as the description of statistical methods is brief and the technical terms used by psychologists when talking about statistics are different from the technical terms used by economists when talking about statistics.

Also many of my calculations were performed with a spreadsheet (good for making errors), I typed in the raw data from a *.gif file (ditto) and may have made anynumber of booboos including conceptual errors. I consider this blog post like a conversation in the corridor and not like a draft paper, working paper or heaven forfend journal submission.

OK that said, I have the impression that I was right.

My concern in the post below is that precision* weighted means of means may be biased and inconsistent if the data are drawn from a distribution such that the sample variance is higher when the sample mean is higher.

The example I gave was estimating the distribution of a variable which takes values 0 or 1, that is estimating the probability that the variable is equal to one. If there are many samples drawn from the same distribution, the best asymptotically normal estimate of the probability is the overall frequency. This is equal to the weighted average of the estimates of the probability for each subsample weighted by the sample size. It is not equal to the average of the estimates of the probability for each subsample each weighted by the inverse of the estimated variance of that estimate of the probability. The inverse of the estimated variance of an estimate is called the precision. For a frequency estimate with a sample size of Ni it is Ni/(pe*(1-pe)) where pe is the frequency of ones in the sample (that is the sample freequency).
This puts lower weight on estimates close to 0.5. In particular, if the true probability is less than 0.5, the precision weighted mean estimate of the probability will be biased down.

I think this is relevant to estimating the effects of anti-depressants, because, IIRC the general view of clinicians is that a minority of patients experience a dramatic benefit from anti-depressants and the majority much less benefit (or maybe zero benefit).

It is important to stress two things. First it is not an unusual property of the binary distribution that estimated means and variances are not independent. Rather it is an extraordinary feature of the normal distribution that estimates of the mean and estimates of the variance are independent. To have confidence that sample means and sample variances are independent, we must be confident that the distribution is one of the few such that this is the case.

Second, it is not sufficient that freequencies, that is estimates of the probability, are normally distributed. This is almost exactly the case for freequencies calculated with more than 30 observations. However, the problem depends on the distribution from which each individual observation is drawn.

Now I am attempting to see whether this theoretical problem really is a problem analyzing the data collected by

Irving Kirsch, Brett J. Deacon, Tania B. Huedo-Medina, Alan Scoboria, Thomas J. Moore, Blair T. Johnson in table 1 of "Initial Severity and Antidepressant Benefits: A Meta-Analysis of Data Submitted to the Food and Drug Administration"

I should point out that Kirsch et al did not use a precision weighted mean. They decided that one study had an absurdly low sample variance and imposed a higher value.

column 3 of table 1 reports the average improvement in the Hamilton rating scale of depression (HRSD) for 35 trials of SSRI's submitted to the FDA (change). Column 4 reports "d" which (if I understand correctly) is the average improvement divided by the sample standard deviation of the improvement. The the estimate of the standard deviation (sd) of the effect is the ratio of column 3 divided by column 4. If I am confused about the definition of d, the rest of this post is worthless.

is it true that studies with a higher mean improvement have a higher standard deviation of improvements ? To see first I ran an OLS regression of sd . This is embarrassing as I am stressing that not all variables are normal and, in this case sd definitely isn't normally distributed. Still it is somewhat interesting that the coefficient is very slightly over 0.25 with an estimated standard error of a tiny bit less less than 0.1. The ratio definitely does not have a t-distribution, which is a pity, because STATA naively thinks that the null that the true coefficient is zero is rejected at the 2% level. When the estimated variance of the effect is regressed on change the t-like statistic (which does not have a t-distribution) is 2.25.

In practice, the means are not weighted by the inverse of the estimated variance but rather by the inverse of the estimated variance of the mean, that is they are weighted by the sample size divided by the estimated variance. Thus it makes more sense to regress the estimated variance on the average improvement (change) weighting by the number of observations. This gives a t-ratio of 5.2 and that is overwhelming.

When sd is regressed on change using sample sizes as weights (pweights for STATA and I checked by hand STATA does what I think it does with pweights) I get a coefficient of over 0.38 and a standard error of under 0.079 and a ratio of 4.95. This is huge.

Now I ask is the precision weighted mean improvement significantly lower than the sample size weighted estimated mean ? Here the sample size weighted estimate is unbiased and consistent if each study is unbiased (the standard hypothesis in meta analysis). It is also consistent if a finite number of studies are no good at all (for example fraudulent). That is, as the number of valid studies goes to infinity, the sample size weighted mean result will converge to the true expected value even if there is a fixed number of worthless totally wrong or fraudulent studies.

The same is not necessarily true of the precision weighted average effect. It is however, a more precise (or efficient) estimate of the mean. This means that I can perform Wu-Hausman test. If both estimates are unbiased, the expected difference of the estimates is zero and the variance of the difference in the estimates is equal to the estimated variance of the sample size weighted average minus the variance of the efficient precision weighted average. This means that the square of the difference in the estimates divided by the difference of the variances should have a chi squared(1) distribution.

I calculated the averages in excel (warning). I get the sample size weighted average is 10.04 and the precision weighted average is 9.59. My estimate of the variance of the sample size weighted average is 0.01947 and my estimate of the variance of the precision weighted average is 0.01799 so the difference is 0.001485. This gives a Wu-Hausman test statistic of 0.45^2/0.001485) = 133.1 which is, to use a technical term hubongous.

I don't like the Wu-Hausman test which is unreliable in small samples. For my purposes the sample size is the sum of the sample sizes minus 35 parameter estimates or 3257 which is rather large. However, I feel more comfortable using a modified Wu-Hausman test which has low power but definitely no size distortion. Personally, I prefer to divide by the larger of the two estimated variances. That gives a modified statistic of 10.15 which is still overwhelmingly significant.

The precision weighted mean estimate of the effect of SSRI's on the HRSD is definitely biased down (unless I made a mistake).

Now the interest in the studies is not the average change with SSRI's but the difference between that average and the average change with a placebo.

For the placebo parts of the studies, the regression coefficient of the estimated standard deviation (psd) on the average change (pchange) is less than 0.0131 with a t-ratio of 0.13. The regression of the estimated variance on pchange has coefficient
less than 0.0344 with t-ratio 0.02. The sample size weighted regression of psd on pchange has coefficient -0.0251 and t-ratio -0.21. The sample size weighted regression of the estimated variance on pchange has coefficient -0.810 and t-ratio -0.40

There is essentially no evidence of a positive correlation of mean change and sample variance for depressed people who received the placebo.

I almost dare predict that the Wu-Hausman test might not reject the null that the precision weighted mean placebo effect is unbiased. I am now off to calculate (see how long it takes me).

update: I have now calculated the Wu-Hausman statistic for the precision weighted mean change in the placebo treated patients and the sample size weighted change in the Placebo treated patients. The precision weighted mean change is 7.81 the sample size weighte mean is 7.852 and the Wu-Hausman statistic is 0.649. The expected value of the statistic under the null is 1. There is no evidence that the precision weighted mean of the change HRSD in patients receiving the Placebo is biased.

update 2: a robustness check. One experiment was quite unlike the others. It was experiment 62 mild, reported in row 4 of table 1. It is unique in thet the patients had relatively mild depression (average inititial HRSD of 17). The reduction of the HRSD score of patients who received Prozac was very low compared to the reduction of treated patients in most studies. The sample variance of the change in HTSD was also very low. This single totally atypical study is responsible for a considerable part of the difference between the precision weighted average aand the sample size weighted average change of HRSD of patients who received SSRI's.

As a robustness check, I replaced the estimated variance from the study with the sample size weighted average estimated variance of 64.79. Naturally, this has no effect on the sample size weighted mean. The precision weighted mean rises to 9.906 and the Wu-Hausman test statistic declines to 20.57 which is still overwhelmingly significant.

Kirsch et al wrote " One trial reported SDcs for its drug and placebo groups that were less than 25% the size of the other trials; because preliminary analyses also revealed that this trial was an outlier, these two standard deviations were treated as missing and imputed." They don't write which trial, but the description fits 62 mild. Interestingly, my results using the suspect SDcs are very similar to those reported by Kirsch et al with a difference of precision weighted means drug minus placebo of 1.78, while if I impute using the average variance (SDcs squared) I get a difference of over 2.09. Kirsch et al imputing using (among other variables) the baseline HRSD and not a simple mean. I guess that their imputed value was also fairly small, because the initial HRSD was low.

update 3: More meta-analysis.

A simple OLS regression across studies of the average change in HRSD of patients who received placebo on the average change in HRSD of patients who received SSRI's gives a large coefficient 0.288 with a t-ratio of 1.72 significant at the 10% level (5 % one tailed). when the regression is weighted by the precision of the estimates of the change for those who received the placebo (number who received the placebo divided by estimated variance) the coefficient rises to over 4.19 with a t-ratio of 5.1. The logic of this weighting is that the placebo samples are generally smaller than the SSRI samples so the precision of the change for those who received the placebo is lower.

This suggests that there are experiment effects on the change in HRSD. They may be due to different patient populations, in particular different baseline HRSD. They may also be due to the fact that the Hamilton test is partly subjective and might be applied differently by different teams of psychologists. Unfortunately such differences affect the estimated difference between SSRI and Placebo as different fractions of patients received placebos in different trials.

To avoid this, the analysis can be conducted by first , for each trial, taking the difference in the change of HRSD for patients receiving SSRI and patients receiving the placebo and then averaging this difference across trials.

The weights for such an average can be determined by analogy with sample size weights. Sample size weights would give an efficient estimate of the overall difference in changes of HRSD if the true variance of the changes was the same for all different trials. If it is assumed also to be the same for placebo and SSRI parts of trials, the resulting weights are 1/((1/Ndi)+(1/Npi)) where Ndi is the number of patients receiving SSRI in trial i and Npi is the number of patients receiving placebo in trial i. These weights are correct because the variance of a difference of two independent variables is the sum of the variances.

This approach will give an unbiased estimate of the difference of effects if each study is unbiased, because the weights are exogenous.

This approach gives an estimate of the additional benefit of SSRI's over placebos of over 2.65. excluding the one study with mildly depressed patients (protocol 62 mild) causes the estimate to increase to 2.77.

And just to leave no doubt that I am fiddling till I get the estimate over 3, I note that if only the 27 studies with an average baseline HRSD over 24 are included the estimate increases to over 3.31.

But hey just for fun and I said I was doing it and this is my blog.

pulled back from comments. PJ himself writes

Hi, interesting stuff, but I'm not sure if you can explain the findings of Kirsch et al by this alone - essentially you are pointing out fairly subtle differences in detected effect sizes, but Kirsch et al find an effect size that is substantially different from you, or me.

As I discuss here, I'm not convinced that Kirsch et al found their effect size measure 'd' by dividing the change score by the SD of the change, even though it does indeed look like that is what they are claiming in the paper - which is why I derived my estimates for the SD of the change from the confidence intervals.

What is interesting is that you find pretty similar effect sizes to me, and I just can't see how Kirsch et al arrived at their figure of a change of 1.8 in the HRSD - just looking at the numbers for the studies I can't see how you could combine them together to reach their figure, and look at the scatter plot and regression here, how are they getting an estimated effect size that seems so far outside the region of the data?

I'd also point out that their measure 'd', is a pretty odd measure of effect size, if it is the HRSD change/change SD that is a very strange measure indeed, but even a properly formed SMD (dividing by estimates of the SD of the HRSD score, not the change in the HRSD score) makes it difficult to assign any meaning to subtracting the drug 'd' from the placebo 'd' to come up with an effect size, and it is difficult to compare such a measure with the Cohen d, which is a true SMD.

But even then, a cursory look at the relationship between the difference between 'd's and the difference between change scores still suggests an effect size much larger than 1.8 HRSD points, indeed more like the 2.7 points that every other method of analysis seems to suggest.

In actual fact I find that calculating a true meta-analytic SMD (where it is the difference in change scores that is normalised to the SDs of the change scores) that can be compared with a Cohen d showed quite a small effect size (.25) so I don't know why, if they sought to deliberately underestimate the 'clinical efficacy' of anti-depressants, they didn't just compare that to NICE's Cohen d > .5 criterion.

You'd think, in these days of online open access journals they'd have given a bit more detail and supplemental material so us obsessive types could tell exactly what it was they did - and then slag it off!
# posted by Blogger pj : 2:40 PM

2 comments:

pj said...: Hi, interesting stuff, but I'm not sure if you can explain the findings of Kirsch et al by this alone - essentially you are pointing out fairly subtle differences in detected effect sizes, but Kirsch et al find an effect size that is substantially different from you, or me.

As I discuss here, I'm not convinced that Kirsch et al found their effect size measure 'd' by dividing the change score by the SD of the change, even though it does indeed look like that is what they are claiming in the paper - which is why I derived my estimates for the SD of the change from the confidence intervals.

What is interesting is that you find pretty similar effect sizes to me, and I just can't see how Kirsch et al arrived at their figure of a change of 1.8 in the HRSD - just looking at the numbers for the studies I can't see how you could combine them together to reach their figure, and look at the scatter plot and regression here, how are they getting an estimated effect size that seems so far outside the region of the data?

I'd also point out that their measure 'd', is a pretty odd measure of effect size, if it is the HRSD change/change SD that is a very strange measure indeed, but even a properly formed SMD (dividing by estimates of the SD of the HRSD score, not the change in the HRSD score) makes it difficult to assign any meaning to subtracting the drug 'd' from the placebo 'd' to come up with an effect size, and it is difficult to compare such a measure with the Cohen d, which is a true SMD.

But even then, a cursory look at the relationship between the difference between 'd's and the difference between change scores still suggests an effect size much larger than 1.8 HRSD points, indeed more like the 2.7 points that every other method of analysis seems to suggest.

In actual fact I find that calculating a true meta-analytic SMD (where it is the difference in change scores that is normalised to the SDs of the change scores) that can be compared with a Cohen d showed quite a small effect size (.25) so I don't know why, if they sought to deliberately underestimate the 'clinical efficacy' of anti-depressants, they didn't just compare that to NICE's Cohen d > .5 criterion.

You'd think, in these days of online open access journals they'd have given a bit more detail and supplemental material so us obsessive types could tell exactly what it was they did - and then slag it off!; 2:40 PM
Robert said...: Actually I think I understand how Kirsch et al got their results. I get a weighted average difference of change of 1.

notation: changeij is the average change in HRSD of patients in study i who got the SSRI (j=1) or the placebo (j=0). Nij is the sample size of patients in trial i who get j pills.

In one calculation I used changeij/dij as the standard deviation of the change and thus (changeij/dij)^2/Nij as the estimated variance of the average change. The *separately* for drug and placebo data, I calculated the precision weighted average over i of changeij. This gave me an average change of 7.809 for the placebo and 9.592 for the SSRI treated for a difference of 1.78.

I guess this is what they did. I the confidence intervals are screwy and d is as described in the paper.; 3:58 PM