The Value of Replication

Daryl Bem is a respected psychology researcher who decided to try his hand at parapsychology. Last year he published a series of studies in which he claimed evidence for precognition, meaning that test subjects were influenced in their choices by future events. The studies were published in a peer-reviewed psychology journal, the Journal of Personality and Social Psychology. This created something of a controversy, and was deemed by some to be a failure of peer review.

While the study designs were clever (he simply reversed the direction of some standard psychology experiments, putting the influencing factor after the effect it was supposed to have), and the studies looked fine on paper, the research raised many red flags — particularly in Bem’s conclusions.

The episode has created the opportunity to debate some important aspects of the scientific literature. Eric-Jan Wagenmakers and others questioned the p-value approach to statistical analysis, arguing that it tends to over-call positive results. They argue for a Bayesian analysis, and in their re-analysis of the Bem data they found the evidence for psi to be “weak to non-existent.” This is essentially the same approach to the data that we support as science-based medicine, and the Bem study is a good example of why. If the standard techniques are finding evidence for the impossible, then it is more likely that the techniques are flawed than that the entire body of physical science is wrong.

Now another debate has been spawned by the same Bem research — that involving the role and value of exact replication. There have already been several attempts to replicate Bem’s research, with negative results: Galak and Nelson, Hadlaczky, and Circee, for example. Others, such as psychologist Richard Wiseman, have also replicated Bem’s research with negative results, but are running into trouble getting their studies published — and this is the crux of the new debate.

According to Wiseman (as reported by The Psychologist, and discussed by Ben Goldacre), the Journal of Personality and Social Psychology turned down Wiseman’s submission on the grounds that they don’t publish replications, only “theory-advancing research.” In other words, strict replications are not considered to be of sufficient scientific value and interest to warrant space in their journal. Meanwhile other journals are reluctant to publish the replication because they feel the study should go in the journal that published the original research, which makes sense.

This episode illustrates potential problems with the scientific literature. We often advocate at SBM that individual studies can never be that reliable. Rather, we need to look at the pattern of research in the entire literature. That means, however, understanding how the scientific literature operates and how that may create spurious artifactual patterns.

For example, I recently wrote about the so-called “decline effect,” a tendency for effect sizes to shrink or “decline” as research on a phenomenon progresses. In fact, this was first observed in the psi research, as the effect is very dramatic there: so far, all psi effects have declined to non-existence. The decline effect is likely a result of artifacts in the literature. Journals are more inclined to publish dramatic positive studies (“theory-advancing research”), and are less interested in boring replications, or in initially negative research. A journal is unlikely to put out a press release that says, “We had this idea, and it turned out to be wrong, so never mind.” Also, as research techniques and questions are honed, research results are likely to become closer to actual effect sizes, which means the effect of researcher bias will be diminished.

If the literature itself is biased toward positive studies, and dramatic studies, then this would further tend to exaggerate apparent phenomena, whether it is the effectiveness of a new drug or the existence of anomalous cognition. If journals are reluctant to publish replications, that might “hide the decline” (to borrow an inflammatory phrase), meaning that perhaps there is even more of a decline effect if we consider unpublished negative replications. In medicine this would be critical to know: are we basing some treatments on a spurious signal in the noise of research?
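To make the point concrete, here is a minimal simulation sketch of how selective publication can inflate an apparent effect. Everything in it is an assumption chosen for illustration: the true effect size (0.1 standard deviations), the study size (25 subjects), the number of studies, the known-variance significance test, and the rule that only significantly positive studies are published. None of these numbers come from Bem's work or from any real literature.

```python
import random
import statistics


def simulate_literature(true_effect=0.1, n_subjects=25, n_studies=2000, seed=1):
    """Simulate many small studies of one modest true effect.

    Each hypothetical study measures n_subjects scores drawn from a normal
    distribution with mean true_effect and standard deviation 1, and reports
    the sample mean as its effect estimate. All parameters are illustrative
    assumptions.
    """
    rng = random.Random(seed)
    std_error = 1 / n_subjects ** 0.5   # known-variance simplification
    cutoff = 1.96 * std_error           # estimate needed for a significant positive result
    all_estimates, published = [], []
    for _ in range(n_studies):
        scores = [rng.gauss(true_effect, 1) for _ in range(n_subjects)]
        estimate = statistics.fmean(scores)
        all_estimates.append(estimate)
        if estimate > cutoff:           # assumed rule: only "positive" studies get into print
            published.append(estimate)
    return all_estimates, published


all_estimates, published = simulate_literature()
print("mean effect, every study run:  ", round(statistics.fmean(all_estimates), 3))
print("mean effect, published studies:", round(statistics.fmean(published), 3))
print("share of studies published:    ", round(len(published) / len(all_estimates), 3))
```

Under these assumptions, every published estimate has to exceed the significance cutoff, so the published average is necessarily several times the true effect, while the average over all studies (including the ones left in the file drawer) stays close to it. Later, larger, or unpublished replications would then look like a "decline."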

There have already been proposals to create a registry of studies, before they are even conducted (specifically for human research), so that the totality of evidence will be transparent and known — not just the headline-grabbing positive studies, or the ones that meet the desires of the researchers or those funding the research. This proposal is primarily to deal with the issue of publication bias — the tendency not to publish negative studies.

Wiseman now makes the same call for a registry of trials, created before they even begin, to avoid the bias of not publishing replications. In fact, he has taken it upon himself to create a registry of attempted replications of Bem’s research.

While this may be a specific fix for replications of Bem’s psi research, the bigger issues remain. Goldacre argues that there are systemic problems with how information filters down to professionals and the public. Reporting is highly biased toward dramatic positive studies, while retractions, corrections, and failed replications are quiet voices lost in the wilderness of information.

Most readers will already understand the critical value of replication to the process of science. Individual studies are plagued by flaws and biases. Most preliminary studies turn out to be wrong in the long run. We can really only arrive at a confident conclusion when a research paradigm produces reliable results in different labs with different researchers. Replication allows for biases and systematic errors to average out. Only if a phenomenon is real should it reliably replicate.
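A bit of arithmetic shows why independent replication is such a strong filter. The sketch below assumes, purely for illustration, a 5% false-positive rate per study when there is no real effect, 80% power when there is one, and fully independent studies with no shared bias. Those are idealized numbers, not measurements of any real field.

```python
def chance_all_positive(per_study_chance, n_studies):
    """Probability that n independent studies all come out positive."""
    return per_study_chance ** n_studies


ALPHA = 0.05  # assumed chance of a false positive when there is no effect
POWER = 0.80  # assumed chance of detecting an effect that is real

for k in (1, 2, 3):
    no_effect = chance_all_positive(ALPHA, k)
    real_effect = chance_all_positive(POWER, k)
    print(f"{k} studies: all positive by chance {no_effect:.6f}, "
          f"all positive if real {real_effect:.3f}")
```

The independence assumption is doing the real work here. If the replications share the same lab, methods, or biases, a spurious result can keep reappearing, which is why replication in different labs with different researchers matters.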

Further — the excuse by journals that they don’t have the space now seems quaint and obsolete, in the age of digital publishing. The scientific publishing industry needs a bit of an overhaul, to fully adapt to the possibilities of the digital age and to use this as an opportunity to fix some endemic problems. For example, journals can publish just abstracts of certain papers with the full articles available only online. Journals can use the extra space made available by online publishing (whether online only or partially in print) to make dedicated room for negative studies and for exact replications (replications that also expand the research are easier to publish). Databases and reviews of such studies can also make it as easy to find and access negative studies and replications as it is the more dramatic studies that tend to grab headlines.

The scientific endeavor is now a victim of its own success, in that research is producing a tsunami of information. The modern challenge is to sort through this information in a systematic way so that we can find the real patterns in the evidence and reach reliable conclusions on specific questions. The present system has not fully adapted to this volume of information, and there remain obsolete practices that produce spurious apparent patterns in the research. These fake patterns of evidence tend to be biased toward the false positive: falsely concluding that there is an effect when there really isn’t, or at least exaggerating effects.

These artifactual problems with the literature as a whole combine with the statistical flaws in relying on the p-value, which tends to over-call positive results as well. This problem can be fixed by moving to a more Bayesian approach (considering prior probability).

All of this is happening at a time when prior probability (scientific plausibility) is being given less attention than it should, in that highly implausible notions are being seriously entertained in the peer-reviewed literature. Bem’s psi research is an excellent example, but we deal with many other examples frequently at SBM, such as homeopathy and acupuncture. Current statistical methods and publication biases are not equipped to deal with the results of research into highly implausible claims. The result is an excess of false-positive studies in the literature — a residue that is then used to justify still more research into highly implausible ideas. These ideas can never quite reach the critical mass of evidence to be generally accepted as real, but they do generate enough noise to confuse the public and regulators, and to create an endless treadmill of still more research.
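The role of prior probability can be illustrated with a short calculation. This is a sketch under stated assumptions, not a model of any actual field: it assumes every study has 80% power and a 5% false-positive rate, ignores publication bias entirely, and uses made-up prior probabilities for the three categories of hypothesis.

```python
def share_of_positives_that_are_real(prior, power=0.80, alpha=0.05):
    """Fraction of positive studies that reflect a real effect.

    prior: assumed fraction of tested hypotheses that are actually true
    power: assumed chance a study detects a real effect
    alpha: assumed chance a study reports an effect that is not there
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical priors, chosen only to span plausible to highly implausible.
scenarios = [
    ("plausible (1 in 2)", 0.5),
    ("long shot (1 in 10)", 0.1),
    ("highly implausible (1 in 1000)", 0.001),
]
for label, prior in scenarios:
    share = share_of_positives_that_are_real(prior)
    print(f"{label}: {share:.1%} of positive studies are real")
```

With these assumed numbers, most positive results are real when half of the tested hypotheses are true, but fewer than 2 in 100 are real when only 1 hypothesis in 1000 is. The same nominal threshold for a positive result means something very different depending on plausibility, and adding publication bias would only make the implausible case worse.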

The bright spot is that highly implausible research has helped to highlight some of these flaws in the literature. Now all we have to do is fix them.
