I have been wondering about the frequent use and alarming rhetorical power of the word "benchmark". It often appears in the phrase "benchmark model," which is inconvenient, because I want to contrast benchmarks and models and don't want to write about the difference between benchmark models and other models.

Here I use "hypothesis" to refer to a collection of statements which we think might be true, such that we are eager to find out if they are all true; "model" for a collection of statements which we know are false but which might be a useful approximation to the truth; and "benchmark" for a model which we wish to use only by contrasting it with models which we think might be useful approximations to the truth.

I imagine hopes followed by disappointments in the following order.

1) (compound complex) statement P might be true and P implies Q which we can observe.

2) Q is false so P isn't true, but P might still be a useful approximation to the truth because other implications of P are approximately true.

3) All the attempts to use P to approximate reality have failed, because each implication is far from the truth. P has been modified every time we try to use it, so that the implication (which would be useful if correct but which is incorrect) is eliminated. We can fit an observed pattern after observing it, but continually fail to predict anything correctly. Work starting with P shares the fault of totally undisciplined empiricism, which can describe but not forecast.

4) However, P is a useful benchmark. We can understand each of the stylized facts by remembering why each proves P false, that is, by noting how P had to be modified to fit the fact.

I think macroeconomics is reaching the 4th stage. The DSGE models which have dominated academic work for decades are based on assumptions which (it is now asserted) were always assumed to be false. They are not especially useful for forecasting (and it is now asserted that they were never meant to be used to forecast). They offer limited guidance for policy in a crisis, because the crisis occurs exactly when one of the standard assumptions fails. However, they are still used as benchmarks. New models are presented as modifications of a standard model. One modification is made per article. Insights are obtained, because the modified assumption must cause the difference in results between the benchmark model and the new model.

My view is that the claim that something is a useful benchmark might be false.

In fact, I think it is similar to the claim that a model is a useful approximation to reality. A model is a useful approximation if it gives approximately accurate conditional forecasts. It is used by calculating what the outcomes caused by different policies would be if the model were the truth. It is a useful approximation if the predictions of outcomes conditional on policies are approximately accurate. The useful model is used to understand approximately how things would be different if different policies were implemented. Similarly, a benchmark model is used to understand how things would be different if different assumptions were true. So we determine the effect of, say, some financial friction by comparing a new DSGE model with the financial friction to the standard DSGE model without it. Again the effort is to see how changing something changes outcomes. The difference might be that policy makers can't really eliminate the financial friction, so the actual outcome is compared to something which can be imagined but not achieved. However, the claims are roughly equally strong. Blanchard discusses considering a policy and considering a distortion as if they were the same sort of considering. "They can be useful upstream, before DSGE modeling, as a first cut to think about the effects of a particular distortion or a particular policy".

I think the choice of a benchmark is important because one modification is considered at a time. If implications were a linear function of assumptions, then it wouldn't matter to which model one made a change. But they aren't. The way in which an unrealistic DSGE model differs from the same model with a financial friction can be completely different from the way in which the real world would be different if a financial friction were eliminated.
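A toy illustration of the linearity point, under assumptions that are entirely mine: two made-up "outcome" functions of two yes/no assumptions, one additive and one with an interaction term. The function names and coefficients are hypothetical and stand for nothing in any actual DSGE model. The only point is that the measured effect of adding the friction is the same from any benchmark in the additive case and depends on the benchmark otherwise.

```python
def linear_outcome(friction: int, other_assumption: int) -> float:
    """Outcome when assumptions enter additively (made-up coefficients)."""
    return 1.0 - 0.5 * friction - 0.3 * other_assumption


def nonlinear_outcome(friction: int, other_assumption: int) -> float:
    """Same as above plus a made-up interaction between the two assumptions."""
    return (1.0 - 0.5 * friction - 0.3 * other_assumption
            - 0.4 * friction * other_assumption)


def effect_of_friction(outcome, other_assumption: int) -> float:
    """Change in the outcome from adding the friction to a given benchmark.

    The benchmark is identified by the value of the other assumption.
    """
    return outcome(1, other_assumption) - outcome(0, other_assumption)


if __name__ == "__main__":
    cases = [("linear", linear_outcome), ("nonlinear", nonlinear_outcome)]
    for name, outcome in cases:
        for other in (0, 1):
            effect = effect_of_friction(outcome, other)
            print(f"{name:9s} benchmark other_assumption={other}: {effect:+.2f}")
```

In the additive case the two benchmarks give the same answer about the friction. With the interaction term they do not, so the answer one reports depends on which unrealistic benchmark one happened to start from.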

But I think there is a more important problem with accepting a DSGE model as at least a useful benchmark. The result has been that the vast majority of models in the literature share many of the implications of the benchmark model. So, for example, if the benchmark model has Ricardian equivalence, so do most of the modified models. The result is that if one surveys the literature and attempts to see what it seems to imply about the effects of the timing of lump sum taxes, it sure seems to imply there are probably no such effects. Most models imply no effect. The possibility of improving outcomes with temporary lump sum tax cuts is not discussed. When such cuts were proposed in the USA in 2009 (as part of the ARRA stimulus bill) many economists argued that policy makers were ignoring the results of decades of academic research. In fact, they were ignoring the implication of the standard benchmark model which was used, just as a benchmark, in spite of its poor performance.

This is the same pattern seen following the stronger hope that a model might be a useful approximation. The model is introduced with the expressed hope that it might be a useful approximation. Implications are derived. They turn out to be false. It is noted that models are false by definition, and that other implications of the model might be useful approximations. After years or decades, the model is no longer used by specialists in the field. However, it is still presented to outsiders as a useful first order approximation when it isn't. In this context "first order" means "according to the first model developed by my school of thought".

In both cases, actual practical implications are derived through a process which is completely invulnerable to evidence.

I am writing this because the more diplomatic critics of mainstream academic macroeconomics insist that the models, which they find unsatisfactory, are useful benchmarks (an example from a not so diplomatic critic). I think this claim is made without any consideration of the possibility that it might be false, and, indeed, a damaging falsehood. It is the least one can say if one isn't willing to tell people that they have wasted decades of their working life. But that doesn't mean that it isn't more than one should say.

Here is a simple example illustrating the danger of changing one assumption at a time. The model is just the original Lucas supply function. The idea is that output is chosen by suppliers who don't observe the price level, so it is equal to the actual price level minus the rational forecast of the price level. This implies that output is white noise and that the location of the distribution of output doesn't depend on the behavior of the price level and therefore doesn't depend on monetary policy. With a standard assumption (or approximation) it implies that the expected value of output conditional on data available to agents is a constant which doesn't depend on monetary policy. This is the policy ineffectiveness proposition, which led Sargent and Wallace to note that, in their model, the optimal policy was to set the inflation rate to some desired target and ignore everything else. Notably, this is the policy mandate of the European Central Bank.

There are two counterarguments, neither of which amounts to much on its own. The first is that agents in the model are assumed to have rational expectations and so automatically know the policy rule. It is much more reasonable to assume that agents are boundedly rational and learn the policy rule. It was correctly argued that, given the other assumptions, this learning will have only temporary effects and that the rational expectations assumption will become true in the long run.

The second is hysteresis. It was later argued (based on massive evidence) that the current unemployment rate affects the future non-accelerating inflation rate of unemployment, that is, that cyclical unemployment becomes structural. In this case, supply depends not only on price level prediction errors but also on the time-varying natural rate. It was correctly argued that, in this model, the optimal policy was still to target inflation, because the expected level of output didn't depend on policy. Here, in passing, it is worth noting that the additional assumptions mentioned above, which were required to get from "location" to "expected value", become critical [Cite Pelloni et al].
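For readers who prefer symbols, the argument can be sketched as follows. The notation is my own shorthand, not taken from the original papers: $\bar{p}_t$ is the part of the price level set by the policy rule using information known at $t-1$, $\varepsilon_t$ is an unforecastable shock, $n_t$ is the natural level of output, and $h$ is a hypothetical hysteresis parameter. I ignore the distinction between "location" and "expected value" flagged in the text.

```latex
\begin{align}
  y_t &= p_t - E_{t-1}[p_t]
      && \text{(supply responds only to price surprises)} \\
  p_t &= \bar{p}_t + \varepsilon_t, \qquad E_{t-1}[\varepsilon_t] = 0
      && \text{(policy rule plus an unforecastable shock)} \\
  E_{t-1}[p_t] &= \bar{p}_t \;\Longrightarrow\; y_t = \varepsilon_t
      && \text{(rational expectations: the rule drops out)} \\
  y_t &= n_t + p_t - E_{t-1}[p_t], \qquad n_{t+1} = n_t + h\,(y_t - n_t)
      && \text{(hysteresis variant)} \\
  E_{t-1}[n_{t+1} - n_t] &= h\,E_{t-1}\bigl[p_t - E_{t-1}[p_t]\bigr] = 0
      && \text{(expected natural level is still policy-invariant)}
\end{align}
```

The last line is the sense in which hysteresis alone changes nothing: with rational expectations the surprise has mean zero whatever the rule, so the rule cannot shift the expected path of $n_t$.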

But consider a newly installed monetary authority setting policy for an economy populated by boundedly rational agents who have to learn the policy rule. The authority should think about what would happen if she were less of an inflation hawk than people expect (not with rational expectations but with the actual beliefs of the boundedly rational agents in the economy). The result would be temporarily higher output while agents learn. This would cause permanently higher output because of hysteresis. Alone, neither boundedly rational learning nor hysteresis changes the optimal policy. Together they change everything. The rule that only one change in the benchmark model is considered at a time can prevent people from seeing this. In fact, I think it has prevented most macroeconomists from seeing this.
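The interaction can be checked with a deliberately crude simulation. Everything in it is an assumption of mine made for illustration: there are no shocks, agents update their belief about inflation with a constant gain, and the natural level of output moves by a fraction `hysteresis` of the gap between output and its natural level. The function name, the parameter names and the numbers are hypothetical.

```python
def simulate(actual_inflation: float, initial_belief: float, gain: float,
             hysteresis: float, periods: int = 200) -> list[float]:
    """Return the path of output in a toy Lucas-supply economy.

    Output equals the natural level plus the inflation surprise. Agents
    close a fraction `gain` of their forecast error each period, and the
    natural level absorbs a fraction `hysteresis` of the output gap.
    """
    belief = initial_belief
    natural = 0.0
    path = []
    for _ in range(periods):
        surprise = actual_inflation - belief
        output = natural + surprise
        path.append(output)
        natural += hysteresis * (output - natural)  # cyclical becomes structural
        belief += gain * surprise                   # boundedly rational learning
    return path


# Four cases: the authority delivers 4 where learning agents expected 2.
CASES = {
    "neither":         simulate(4.0, initial_belief=4.0, gain=0.2, hysteresis=0.0),
    "learning only":   simulate(4.0, initial_belief=2.0, gain=0.2, hysteresis=0.0),
    "hysteresis only": simulate(4.0, initial_belief=4.0, gain=0.2, hysteresis=0.1),
    "both":            simulate(4.0, initial_belief=2.0, gain=0.2, hysteresis=0.1),
}

if __name__ == "__main__":
    for name, path in CASES.items():
        print(f"{name:16s} first period {path[0]:.3f}   long run {path[-1]:.3f}")
```

In this sketch the long-run level of output in the "both" case is the hysteresis parameter times the initial belief gap divided by the gain, while in each of the other three cases it returns to zero. Changing one assumption at a time would report "no permanent effect" twice and never reveal the case that matters.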

OK Amateur partisan intellectual history after the jump.

A testable hypothesis always includes the core hypothesis of interest and auxiliary hypotheses required to obtain testable predictions (so Newton's model of the solar system includes the core hypotheses of his law of gravity and laws of motion and the auxiliary hypotheses that the sun and planets are rigid spheres and that the effects of all forces but gravity are negligible). The problem is that the so-called core hypotheses of the PIH, REH and EMH are no such thing. They are, in fact, always the same non-hypothesis that, ex post, one can find some utility function such that the actions of agents are consistent with rational maximization of the expected value of that utility function. This is true, because it must be true. It is agreed (and easily demonstrated) that the assumption that agents maximize something has no implications at all without some further assumptions about what they maximize. The core hypothesis is not falsifiable. If rejection due to failure of auxiliary hypotheses is not considered a reason to abandon the research program, then the research program is completely invulnerable to evidence.

This is a deadly problem, but I want to write about a different, less important problem.

I am very irritated by the phrase "all models are false by definition". It mocks model testers who have demonstrated that some model has false implications. The implication is that the model testers misunderstood the aim of the model developers, incorrectly perceiving a model to be a hypothesis. Foolish salt water economists decided for some silly reason that the permanent income hypothesis, the rational expectations hypothesis and the efficient markets hypothesis were hypotheses. I claim that this shows bad faith. A statement is a hypothesis (with the associated scientific dignity) until it is proven false, then it turns out that it was always a model and the people who proved the statement false are silly.

The repeated use of the word "hypothesis" in the 50s, 60s and 70s strongly suggests that the equations in question were not originally considered parts of models which were false by definition. Thomas Sargent's phrase "take a model seriously" sure seems to imply "treat a model as a null hypothesis." And, in fact, Sargent once said (original pdf download here) that Lucas and Prescott were enthusiastic about hypothesis testing until he falsified too many of their hypotheses, and that both independently told him exactly that.

"My recollection is that Bob Lucas and Ed Prescott were initially very enthusiastic about rational expectations econometrics. After all, it simply involved imposing on ourselves the same high standards we had criticized the Keynesians for failing to live up to. But after about five years of doing likelihood ratio tests on rational expectations models, I recall Bob Lucas and Ed Prescott both telling me that those tests were rejecting too many good models. The idea of calibration is to ignore some of the probabilistic implications of your model but to retain others."