Saturday, March 08, 2008

Prozac Fan Talks Back
1.8>0.

Update 2: DeLongians and missing links. If you are still clicking links and getting here, don't stop clicking. The really good stuff is just one click away at Pyjamas in Bananas.

I am going to comment on a controversy from February 25, the stone age in Blogtime. I had a thought (which turns out to have been wrong) and it got me to click links in a post by Kevin "always click the link" Drum.

I get emotional and run on. My points, if any, are these. The study which is described as showing that SSRI's (modern antidepressants) don't work in fact finds a positive effect of SSRI's on measured depression which is overwhelmingly statistically significant. Journalists, including Sarah Boseley, health editor of The Guardian, made claims about the results of this study which are absolutely totally false. The authors of the study put great emphasis on a standard for clinical significance which seems to me to be very high and very poorly conceived.

Like Kevin Drum, I find it hard to believe this claim, from an article by Sarah Boseley in The Guardian: "The study examined all available data on the drugs, including results from clinical trials that the manufacturers chose not to publish at the time. The trials compared the effect on patients taking the drugs with those given a placebo or sugar pill.

When all the data was pulled together, it appeared that patients had improved - but those on placebo improved just as much as those on the drugs."

Drum notes extensive anecdotal evidence. I am much more strongly convinced based on my personal experience. (I should note that, according to my "don't do this at home, kids" application of the Hamilton Rating Scale to my recollection of my state when I first took Prozac, I was severely depressed, and the study finds a larger difference between drug and placebo for the severely depressed.) Also, Boseley's claim about the study she claims to be summarizing is absolutely totally false.

Interestingly, I got to the Guardian article by clicking the following link in Drum's post: "Compared to a placebo, they improved patients' scores on the most widely used depression scale by only 1.8 points:" Clearly Sarah Boseley, health editor of The Guardian, is totally dishonest, totally innumerate or both. 1.8>0. Patients on placebo did not improve just as much as patients on SSRI's. Furthermore, this isn't even a case of treating a statistically insignificant difference from zero as evidence (or proof) that the true value is zero.
In fact, pooling the studies, "Irving Kirsch, Brett J. Deacon, Tania B. Huedo-Medina, Alan Scoboria, Thomas J. Moore & Blair T. Johnson" find a significant additional benefit of taking an SSRI, rejecting the null of no benefit with a p value of "<0.001". The standard level for statistical significance is p<0.05. In fact, as shown in their paper, Kirsch, Deacon, Huedo-Medina, Scoboria, Moore & Johnson find overwhelmingly strong evidence that SSRI's cause improvement in depression.

They stress the novel result that the estimate of the magnitude of the effect is smaller than that based on published trials. Researchers tend to stress the novelty of their results. Oddly, big Pharma, which spends huge amounts of money on advertising, doesn't seem to have managed to hire anyone intelligent enough to point out that 1.8>0.

Kirsch, Deacon, Huedo-Medina, Scoboria, Moore & Johnson note that it is lower than 3 and that "National Institute for Clinical Excellence (NICE) used a drug–placebo difference of three points as a criterion for clinical significance when establishing guidelines for the treatment of depression in the United Kingdom." Thus the opinion of a bureaucratic organization is held to be the truth. What do this 1.8 and this 3 mean? I quote from the background information for non-experts provided by the journal.

Doctors measure the severity of depression using the “Hamilton Rating Scale of Depression” (HRSD), a 17–21 item questionnaire. The answers to each question are given a score and a total score for the questionnaire of more than 18 indicates severe depression.


It sure appears to me that 1.8 is a tenth of the way from the best possible score (which I would guess is very rare) to severe depression. But the NICE says it's clinically insignificant and the NICE must know.

Now the NICE clearly has no clue what it is talking about. I dare write this, because I am about to talk about math.

I quote from the article

"standardized mean difference (d), which divides change by the standard deviation of the change score (SDc) [10], and another using each study's drug and placebo groups' arithmetic mean (weighted for the inverse of the variance) as the meta-analytic “effect size” [11].

The first analysis permitted a determination of the absolute magnitude of change in both the placebo and treatment groups. Results permitted a determination of overall trends, analyses of baseline scores in relation to change, and for both types of models, tests of model specification, which assess the extent to which only sampling error remains unexplained. The results in raw metric are presented comparing both groups, but because of the variation of the SDcs, the standardized mean difference was used in moderator analyses in order to attain better-fitting models [12]. These results are compared to the criterion for clinical significance used by NICE, which is a three-point difference in Hamilton Rating Scale of Depression (HRSD) scores or a standardized mean difference (d) of 0.50 [1]."


Now the standardized mean difference is not used to test the hypothesis that the true difference is zero (that would mean dividing by the standard deviation of the mean, i.e. the standard error, not by the standard deviation), nor is it used to calculate a confidence interval around the estimated effect. NICE suggests it be used as a measure of the magnitude of an effect (its "clinical significance") for no good reason at all.
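To make the distinction concrete, here is a minimal sketch of the two quantities. It simplifies to a single mean difference with one standard deviation, and the standard deviation of 8 is a made-up number of mine, not a figure from the paper. The function names are mine too.

```python
import math

def standardized_mean_difference(mean_diff, sd):
    """Effect size d: the mean difference in units of the standard deviation."""
    return mean_diff / sd

def z_statistic(mean_diff, sd, n):
    """Test statistic: the mean difference in units of the standard error."""
    standard_error = sd / math.sqrt(n)
    return mean_diff / standard_error

mean_diff = 1.8   # the pooled drug-placebo difference discussed above
sd = 8.0          # hypothetical standard deviation, for illustration only

for n in (25, 100, 400):
    d = standardized_mean_difference(mean_diff, sd)
    z = z_statistic(mean_diff, sd, n)
    print(f"n = {n:3d}: d = {d:.3f}, z = {z:.3f}")
```

The point of the sketch is that d stays put as the sample grows while z grows with the square root of n, so d says nothing about whether an effect is distinguishable from zero.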

Take two studies of two different drugs. Both show a positive effect, say 2.9 for drug 1 and 1.5 for drug 2. The sample sizes are large enough that these effects are statistically significantly different not only from zero but from each other. Hell, I'm pretending, so make the sample size 400 in each case. The sd of the effect of drug 1 is 10 and the sd of the effect of drug 2 is 2. Hmm, the sd's of the means are 0.5 and 0.1, so the z-scores testing whether each drug is effective are 5.8 and 15, both strongly significant. The sd of the difference in means is the square root of the sum of the squares of the sd's of the means or ... uhm, roughly 0.50990195, so the z-score for the difference is more than 2.7, that is, statistically significant.
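The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the made-up example, with variable names of my own choosing. Note that the sd of the mean for drug 2 is 2/20 = 0.1, which is the value that gives the z-score of 15.

```python
import math

n = 400  # patients per study in the made-up example

# (mean benefit, standard deviation of the benefit) for each drug
drugs = {"drug 1": (2.9, 10.0), "drug 2": (1.5, 2.0)}

def standard_error(sd, n):
    """Standard deviation of the sample mean."""
    return sd / math.sqrt(n)

se = {name: standard_error(sd, n) for name, (mean, sd) in drugs.items()}
z = {name: mean / se[name] for name, (mean, sd) in drugs.items()}

# The two studies are independent, so the variances of the means add.
se_diff = math.sqrt(se["drug 1"] ** 2 + se["drug 2"] ** 2)
z_diff = (drugs["drug 1"][0] - drugs["drug 2"][0]) / se_diff

for name in drugs:
    print(f"{name}: sd of mean = {se[name]:.2f}, z = {z[name]:.1f}")
print(f"difference: sd = {se_diff:.8f}, z = {z_diff:.2f}")
```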

NICE would conclude that drug 2 has a clinically significant benefit because it has a standardized mean difference of 0.75 while drug 1 does not have a clinically significant benefit because it has an average benefit of 2.9<3 and a standardized mean difference of 0.29 <0.5.

Thus, although the data demonstrate that the mean benefit of drug 1 is greater than the mean benefit of drug 2, NICE would conclude that the benefit of drug 2 is clinically significant while the benefit of drug 1 is clinically insignificant.
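Here is a small sketch of the rule as I read it from the quoted passage: an effect passes if the raw difference is at least 3 points or the standardized mean difference is at least 0.50. The "or" and the function name are my assumptions, not NICE's wording.

```python
def nice_clinically_significant(mean_diff, sd, raw_cutoff=3.0, d_cutoff=0.5):
    """Assumed reading of the NICE rule: pass on either the raw or the standardized cutoff."""
    d = mean_diff / sd  # standardized mean difference
    return mean_diff >= raw_cutoff or d >= d_cutoff

# (mean benefit, standard deviation of the benefit) from the made-up example
drug_1 = (2.9, 10.0)
drug_2 = (1.5, 2.0)

for name, (mean_diff, sd) in (("drug 1", drug_1), ("drug 2", drug_2)):
    verdict = nice_clinically_significant(mean_diff, sd)
    print(f"{name}: mean = {mean_diff}, d = {mean_diff / sd:.2f}, clinically significant: {verdict}")
```

Under this reading the drug with the larger mean benefit fails and the drug with the smaller mean benefit passes.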

I think this is crazy. NICE developed a rule of thumb which mixes point estimates and confidence intervals. The mixed rule requires statistical significance (a 95% interval must not contain zero) and an arbitrary level of the point estimate of the measured effect (if there were any calculation behind it, it wouldn't be exactly three), which takes no account of the fact that the point estimate is not exact.

Now some may argue that a high mean effect with a high standard deviation implies a very bad effect on many people. This is not true. It is based on assuming that the distribution of effects is normal. Without such an assumption, there is no way of calculating, from means and standard deviations alone, the probability that some patient's score will drop by more than, say, 3.
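A sketch of why the mean and standard deviation are not enough: the three distributions below all have mean 2.9 and sd 10, yet the share of patients who end up more than 3 points worse differs a lot. The distributions are invented for illustration, and I take positive numbers to mean improvement and negative numbers to mean getting worse, which is my own sign convention.

```python
import math

def mean_and_sd(dist):
    """dist is a list of (improvement, probability) pairs."""
    mean = sum(p * x for x, p in dist)
    var = sum(p * (x - mean) ** 2 for x, p in dist)
    return mean, math.sqrt(var)

def prob_worse_than(dist, threshold):
    """Probability that a patient ends up more than `threshold` points worse."""
    return sum(p for x, p in dist if x < -threshold)

def normal_prob_worse_than(mean, sd, threshold):
    """Same probability if improvements were normally distributed."""
    z = (-threshold - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Two invented discrete distributions, both with mean 2.9 and sd 10.
two_point = [(12.9, 0.5), (-7.1, 0.5)]
three_point = [(22.9, 0.125), (2.9, 0.75), (-17.1, 0.125)]

for name, dist in (("two-point", two_point), ("three-point", three_point)):
    mean, sd = mean_and_sd(dist)
    print(f"{name}: mean = {mean:.1f}, sd = {sd:.1f}, "
          f"P(more than 3 worse) = {prob_worse_than(dist, 3.0):.3f}")
print(f"normal: P(more than 3 worse) = {normal_prob_worse_than(2.9, 10.0, 3.0):.3f}")
```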

update: The very excellent Michael O'Hare falls for it. He writes "Antidepressants seem to be only somewhat more effective against depression than placebos," and "drugs that don't work much better than sugar pills?". He accepts the idea that 1.8 is roughly zero (note he is smart and honest and doesn't say that 1.8=0 as Sarah Boseley did). To be fair to O'Hare, his post is not about SSRI's but about the power of placebos.

However, I am not convinced that he understands the evidence on the placebo effect.
The studies report the change in the Hamilton rating from the first interview until some period after beginning treatment with the SSRI or the placebo. There was a huge improvement among people who took the placebo. It is, however, not at all clear that this was a placebo effect. O'Hare's implicit assumption is that there would have been no improvement without the placebo. He mentions no evidence to this effect.
Another explanation of the data is that depression comes and goes. So people who are depressed when the study started might improve even if there is no treatment, not even with a placebo.
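A toy simulation shows how this could happen with no treatment and no placebo at all. Everything in it is an assumption of mine: each person has a long-run average score, the score on any given day fluctuates around it, and only people who score above 18 at the first interview are enrolled. None of the numbers come from the paper.

```python
import random

def simulate_untreated(n_people=20000, cutoff=18.0, seed=0):
    """Mean improvement among enrolled people who receive nothing at all."""
    rng = random.Random(seed)
    improvements = []
    for _ in range(n_people):
        usual = rng.gauss(15.0, 5.0)          # long-run average score (hypothetical)
        first = usual + rng.gauss(0.0, 5.0)   # score at the first interview
        if first <= cutoff:
            continue                          # not depressed enough to be enrolled
        second = usual + rng.gauss(0.0, 5.0)  # score at follow-up, untreated
        improvements.append(first - second)
    return sum(improvements) / len(improvements), len(improvements)

mean_improvement, enrolled = simulate_untreated()
print(f"enrolled: {enrolled}, mean improvement without any treatment: {mean_improvement:.2f}")
```

In this toy model the enrolled group improves by several points on average simply because people were selected on a bad day, which is why improvement in the placebo arm alone does not measure a placebo effect.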

To measure the placebo effect, it would be necessary to compare it to no treatment, that is, to interview some patients, ask them to come back in, say, 4 weeks, and do nothing else. A substantial fraction of people would not cooperate with such a study, and, I would guess, a larger fraction of depressed than non-depressed people.

update: I have added actual links to the article to click (sorry).
Pyjamas in Bananas seems to have done really excellent work. I pull this up from the comments:

Putting the limitations of NICE's arbitrary criteria for 'clinical significance' to one side, I'm not at all sure how Kirsch et al arrived at an effect size of 1.8 HRSD points (my own analyses, and frankly just eye-balling the data, suggests a figure more like 2.8 points overall, with both paroxetine and venlafaxine exceeding three points of HRSD change).

By using the rather odd measure of HRSD change score normalised as the standardised mean difference for each group separately their results look rather different to what we would expect from analysing the raw HRSD change scores, and it is not clear that they can be compared to the NICE criterion of d > .5.

Adopting this measure means that a greater improvement from baseline HRSD to outcome HRSD with drug treatment can be normalised away by a greater variance in this group - and it is hardly inconceivable that outcome HRSD variance would increase with an effective, but non-uniform, response to drug treatment.

It is also worth noting that the main conclusion of this paper, at least as far as the media is concerned, that newer anti-depressants shouldn't be prescribed in 'mild' or 'moderate' depression is based on only a single 'moderate' study and extrapolation of the regression line, since the rest of the studies were in 'severe' or 'very severe' depression.

2 comments:

pj said...

Putting the limitations of NICE's arbitrary criteria for 'clinical significance' to one side, I'm not at all sure how Kirsch et al arrived at an effect size of 1.8 HRSD points (my own analyses, and frankly just eye-balling the data, suggests a figure more like 2.8 points overall, with both paroxetine and venlafaxine exceeding three points of HRSD change).

By using the rather odd measure of HRSD change score normalised as the standardised mean difference for each group separately their results look rather different to what we would expect from analysing the raw HRSD change scores, and it is not clear that they can be compared to the NICE criterion of d > .5.

Adopting this measure means that a greater improvement from baseline HRSD to outcome HRSD with drug treatment can be normalised away by a greater variance in this group - and it is hardly inconceivable that outcome HRSD variance would increase with an effective, but non-uniform, response to drug treatment.

It is also worth noting that the main conclusion of this paper, at least as far as the media is concerned, that newer anti-depressants shouldn't be prescribed in 'mild' or 'moderate' depression is based on only a single 'moderate' study and extrapolation of the regression line, since the rest of the studies were in 'severe' or 'very severe' depression.

James Wimberley said...

Doesn't the NICE criterion of "clinical significance" run against its core mandate of promoting cost-effective treatment?

Aspirin is dirt cheap; so it makes sense to take it preventively against heart disease at a very low positive number for effectiveness. On the other hand, if a treatment for depression costs £20,000 it would be reasonable to expect a very large difference against the second-best treatment.

But NICE should incorporate cost-effectiveness explicitly using some indicator of differential quality of life per pound, not by inventing arbitrary cutoffs for clinical effectiveness.