What is the reproducibility crisis in science and what can we do about it?

What is the reproducibility crisis
in science and what can we do
about it?
Dorothy V. M. Bishop
Professor of Developmental Neuropsychology
University of Oxford
@deevybee

What is the problem?
“There is increasing concern about the
reliability of biomedical research, with recent
articles suggesting that up to 85% of
research funding is wasted.”
Bustin, S. A. (2015). The reproducibility of
biomedical research: Sleepers awake!
Biomolecular Detection and
Quantification
2005. PLoS Medicine, 2(8), e124. doi:
10.1371/journal.pmed.0020124

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
Hypothetico-deductive scientific method
based on original by Chris Chambers

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
How common?

Which Article Should You Write?
There are two possible articles you can write: (a) the article you planned to
write when you designed your study or (b) the article that makes the most sense
now that you have seen the results. They are rarely the same, and the correct
answer is (b).
re Data Analysis: Examine them from every angle. Analyze the sexes separately.
Make up new composite indexes. If a datum suggests a new hypothesis, try to
find additional evidence for it elsewhere in the data. If you see dim traces of
interesting patterns, try to reorganize the data to bring them into bolder relief. If
there are participants you don’t like, or trials, observers, or interviewers who
gave you anomalous results, drop them (temporarily). Go on a fishing expedition
for something— anything —interesting.
Writing the Empirical Journal Article
Daryl J. Bem
The Compleat Academic: A Practical Guide for the Beginning Social
Scientist, 2nd Edition. Washington, DC: American Psychological
Association, 2004.
“This book provides invaluable guidance that will help new academics plan,
play, and ultimately win the academic career game.”
Explicitly advises
HARKing!

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
p-hacking
P-hacking: doing many tests and only reporting the
significant ones. Collecting extra data or removing
outliers to push ‘nearly significant’ results over
boundary.
How common?

John et al, 2012 –survey of psychologists

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
p-hacking
Low
statistical
power
Sample size too small
to detect real effect

Button KS et al. 2013. Power failure: why small sample size
undermines the reliability of neuroscience. Nature Reviews
Neuroscience 14:365-376.
Median power of studies included in
neuroscience meta-analyses

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
p-hacking
Low
statistical
power
Publication
bias
Null findings don’t get
published – literature
distorted
Fanelli, 2010: 92% papers
report positive findings

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
p-hacking
Low
statistical
power
Publication
bias
Methods to avert bias
not reported
MacLeod et al, 2015: in
vivo research, only around
25% papers reported
randomisation/blinding
Failure to
control for
bias

Generate
and specify
hypotheses
Design
study
Collect data
Analyse
data & test
hypotheses
Interpret
data
Publish or
conduct
next
experiment
p-hacking
Low
statistical
power
Publication
bias
Failure to
control for
bias
Poor quality
control, e.g.
misidentified
cell lines/
reagents

Bustin (2015) on RNA biomarkers:
“molecular techniques can be unfit for purpose”
Poor fidelity of reagents/cell lines

1956
De Groot
Failure to distinguish between
hypothesis-testing and
hypothesis-generating
(exploratory) research
-> misuse of statistical tests
Historical timeline: concerns about reproducibility

1956
De Groot
1975
Greenwald
“As it is functioning in at least some areas of
behavioral science research, the research-
publication system may be regarded as a
device for systematically generating and
propagating anecdotal information.”

1956
De Groot
1975
Greenwald
The “file drawer” problem
1979
Rosenthal

1956
De Groot
1975
Greenwald
1987
Newcombe
“Small studies continue to be carried out
with little more than a blind hope of
showing the desired effect. Nevertheless,
papers based on such work are submitted
for publication, especially if the results
turn out to be statistically significant.”
1979
Rosenthal

1956
De Groot
1975
Greenwald
1987
Newcombe
1993
Dickersin
& Min
Clinical trials with ‘significant’ results substantially more
likely to be published. “Most unpublished trials remained
so because investigators thought the results were ‘not
interesting’ or they ‘did not have enough time’”
1979
Rosenthal

1956
De Groot
1975
Greenwald
1987
Newcombe
1993
Dickersin
& Min
“The misidentified cell lines reported here have already
been unwittingly used in several hundreds of potentially
misleading reports, including use as inappropriate tumor
models and subclones masquerading as independent
replicates.”
1999
Macleod
et al
1979
Rosenthal

Why is this making headlines now?
• Increase in studies quantifying the problem
• Concern from those who use research:
• Doctors and Patients
• Pharma companies
• Social media
“It really is striking just for how long there have been reports about the poor
quality of research methodology, inadequate implementation of research
methods and use of inappropriate analysis procedures as well as lack of
transparency of reporting. All have failed to stir researchers, funders,
regulators, institutions or companies into action”. Bustin, 2014

Failure to appreciate power of ‘the prepared mind’
Natural instinct is to look for consistent evidence, not disproof
Problems caused by researchers: 1

“The self-deception comes in
that over the next 20 years,
people believed they saw
specks of light that
corresponded to what they
thought Vulcan should look
during an eclipse: round objects
crossing the face of the sun,
which were interpreted as
transits of Vulcan.”

Seeing things in complex data requires skill
Bailey and von Bonin (1951) noted problems in
Brodmann's approach — lack of observer
independency, reproducibility and objectivity
Yet have stood test of time: still used today
Brodmann areas, 1909

Seeing things in complex data requires skill
Or pareidolia
Bailey and von Bonin (1951) noted problems in
Brodmann's approach — lack of observer
independency, reproducibility and objectivity
Yet have stood test of time: still used today
Brodmann areas, 1909

Discusses failure so replicate studies on preferential
looking in babies – role of experimenter expertise

Special expertise or Jesus in toast?
How to decide
• Eradicate subjectivity from methods
• Adopt standards from industry for checking/double-
checking
• Automate data collection and analysis as far as possible
• Make recordings of methods (e.g. Journal of Visualised
Experiments)
• Make data and analysis scripts open

Failure to understand statistics (esp. p-values and power)
http://deevybee.blogspot.co.uk/2016/01/the-amazing-significo-why-researchers.html
Problems caused by researchers: 2

Gelman A, and Loken E. 2013. The garden of forking
paths: Why multiple comparisons can be a problem,
even when there is no 'fishing expedition' or 'p-hacking'
and the research hypothesis was posited ahead of
time.
www.stat.columbia.edu/~gelman/research/unpublished/p_
hacking.pdf
"El jardín de senderos que se bifurcan"

1 contrast
Probability of a
‘significant’ p-value
< .05 = .05
Large population
database used to explore
link between ADHD and
handedness
https://figshare.com/articles/The_Garden_of_Forking_Paths/2100379

Focus just on Young
subgroup:
2 contrasts at this level
Probability of a
‘significant’ p-value < .05
= .10
Large population
handedness

Focus just on Young on
measure of hand skill:
Probability of a
= .19
Large population
handedness

Focus just on Young,
Females on
Probability of a
= .34
Large population
handedness

Focus just on Young,
Urban, Females on
Probability of a
= .56
Large population
handedness

Problem exacerbated because
• Can now easily gather huge multivariate datasets
• Can easily do complex statistical analyses
Problems with exploratory analyses
that use methods that presuppose
hypothesis-testing approach

When p-hacking meets an ideological
agenda: a particularly toxic mix

http://www.snopes.com/medical/disease/cdcwhistleblower.asp

Solutions
a. Using simulated datasets to give insight
into statistical methods

Illustrated with field of ERP/EEG
• Flexibility in analysis in terms of:
• Electrodes
• Time intervals
• Frequency ranges
• Measurement of peaks
• etc, etc
• Often see analyses with 4- or 5-way ANOVA (group x side x
site x condition x interval)
• Standard stats packages correct p-values for N levels
WITHIN a factor, but not for overall N factors and
interactions
.
Cramer AOJ, et al 2016. Hidden multiplicity in exploratory multiway ANOVA: Prevalence and
remedies. Psychonomic Bulletin & Review 23:640-647

Solutions
b. Distinguish exploration from hypothesis-
testing analyses
• Subdivide data into exploration and replication
sets.
• Or replicate in another dataset

Solutions
c. Masked data
Comparison of coronary care units vs treatment at home
From Ben Goldacre’s blog:
http://www.badscience.net/2010/04/righteous-mischief-from-archie-cochrane/
Archie Cochrane

Solutions
c. Masked data
MacCoun R., Perlmutter S. 2015 Hide results to seek the truth. Nature 526, 187-189.
“...temporarily and judiciously removing data labels and altering data
values to fight bias and error”

Solutions
d. Preregistration of analyses

https://www.technologyreview.com/s/601348/merck-wants-its-money-back-if-
university-research-is-wrong/
Solutions
e. Funding contingent on adoption of reproducible practices

• Reluctance to collaborate with competitors
• Reluctance to share data
• Fabricated data
Problems caused by researchers. 3
Solutions to these may require changes to incentive structures, which
leads us to....

http://deevybee.blogspot.co.uk/
Problems caused by journals
• More concern for newsworthiness than methods
• Won’t publish replications (or failures to replicate)
• Won’t publish ‘negative’ findings

• Reward those with most grant income
• Reward according to journal impact factor
Problems caused by institutions

Income to institution increases with the
amount of funding and so….
• The system
encourages us to
assume that:
• Big grant is better
than small grant
• Many grants are
better than one grant
51
"This is Dr Bagshaw, discoverer of the
infinitely expanding research grant“
©Cartoonstock

This is counterproductive because
• Amount of funding needed to do research is not a
proxy for value of that research
• Some activities intrinsically more expensive
• Does not make sense to disfavour research areas
that cost less
52
Daniel Kahneman

Furthermore....
• Desperate scramble for
research funds leads to
researchers being
overcommitted ->
poorly conducted
studies
• Ridiculous amount of
waste due to the
‘academic backlog’
53

Journal impact factor as measure
of quality
• Mean number of citations to
articles published in any given
journal in the two preceding years
• Originally designed to help
libraries decide on subscriptions
• Now often used as proxy for
quality of an article
54
Eugene Garfield

Problems with journal impact factors
• Impact factor not a good
indication of the citations for
individual articles in the
journal, because distribution
very skewed
• Typically, around half the
articles have very few
citations
55
http://www.dcscience.net/colquhoun-nature-impact-2003.pdf
N citations for sample of papers
in Nature

Problems caused by employers
• Reward research reproducibility over impact
factor in evaluation
• Consider ‘bang for your buck’ rather than
amount of grant income
• Reward those who adopt open science practices
Solutions for institutions
Nat Biotech, 32(9), 871-873. doi: 10.1038/nbt.3004
Marcia McNutt
Science 2014 • VOL 346 ISSUE 6214

Problems caused by funders
• Don’t require that all data reported
Though growing interest in data sharing
• No interest in funding replications
• No interest in funding systematic reviews
Problems caused by funders

http://www.acmedsci.ac.uk/policy/policy-projects/reproducibility-and-reliability-of-biomedical-research/symposium-
resources-links/
Resources

What is the reproducibility crisis in science and what can we do about it?

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to What is the reproducibility crisis in science and what can we do about it?

Similar to What is the reproducibility crisis in science and what can we do about it? (20)

More from Dorothy Bishop

More from Dorothy Bishop (20)

Recently uploaded

Recently uploaded (20)

What is the reproducibility crisis in science and what can we do about it?