Reply to de Winter and Dodou (2014): Growing bias and the hierarchy are actually supported, despite differences in design, errors, and disconfirmation biases.
I appreciate the efforts that de Winter and Dodou (2014) have put into replicating and challenging claims made by Fanelli (2010, 2012), as well as those of Pautasso (2010). This is how all sciences should make progress, and it is therefore both a duty and an honour to respond to this challenge.
The results presented are largely in agreement with the claims of Fanelli (2010, 2012), but this fact is obfuscated by a somewhat selective interpretation of the findings, reinforced by differences in study design and by major flaws in the sampling and analytical design.
FLAWS IN INTERPRETATION
1) Fanelli (2012) claimed that negative results are disappearing in percentage, which is exactly what is found here. Even de Winter and Dodou (2014) quote Fanelli (2012) as using percentage figures, so I am quite baffled as to why they consider their results at odds with mine. For the record, the absolute number of negative results in Fanelli (2012) did not show a decline, and it was never claimed in the paper that it did.
An absolute increase in the number of both positives and negatives is generally to be expected, since the annual number of records added to databases has risen steadily. In their discussion, de Winter and Dodou describe a 13.9-fold increase in positives as “equally staggering” as a 4.3-fold increase in negatives. This is a rather surprising lack of enthusiasm for a result that is possibly more extreme than what both Fanelli (2012) and Pautasso (2010) had reported.
2) The Hierarchy of the Sciences also seems remarkably well supported by de Winter and Dodou, despite their claims to the contrary. By their own admission, the rate of increase in positive results has been fastest in the social sciences and, as reported in Table 1 of their work, actually shows a hierarchy-like pattern (i.e. physical-biological-social) with all proxies.
Instead of conceding this point, de Winter and Dodou (2014) claim that the Hierarchy theory is refuted because the ratio of non-significant results is similar across all domains when based on p-values (Figure 6), and higher in the physical sciences when based on textual reporting (Figure 7, but note how the Social-Biological difference is strong, and exactly in the direction predicted).
In reality, all that these findings show is that the overall frequency of positives and negatives is highly sensitive to the particular proxy used. Temporal changes measured with the same proxy, however, should logically be considered more consistent, since different practices between disciplines or countries are controlled for. This point was already made by Fanelli (2012). De Winter and Dodou (2014) even discuss this issue at length, yet they overlook it when discussing their evidence for the Hierarchy of the Sciences.
In Figure 1, de Winter and Dodou (2014) reproduce data already presented by Fanelli (2012). Since Fanelli (2012) had also presented these data in graph form, the need for Figure 1 is rather unclear. In any case, Figure 1 illustrates how, even with Fanelli (2012)’s proxy, the three domains overlap, yet differ in absolute magnitude as well as in steepness when other confounders are controlled for. So, once again, it is unclear why the authors would consider their findings, which are visually identical to, and in some respects actually stronger than, those of Fanelli (2012), a refutation of the latter’s claims.
It should also be remarked that, unlike Fanelli (2012), Fanelli (2010), which made the original claim for a correlation between the Hierarchy of the Sciences and reporting biases, had shown that this was only true for pure disciplines, whereas the applied disciplines showed uniformly high frequencies. Aggregating all disciplines, as de Winter and Dodou (2014) do, risks erasing differences between domains.
3) Finally, even though de Winter and Dodou (2014) compared trends between geo-economic regions just as Fanelli (2012) did, they spend little time comparing the two independent sets of results. This is quite a pity, since these results are remarkably in agreement: Asian countries show an overall stronger increase than both the US and the EU, as Figures 9 and 10 and Table 2 report.
Part of the authors’ claim to have refuted previous evidence rests on the lack of statistical significance in some of their analyses. But this is where weaknesses in their study design become an important source of confusion.
FLAWS IN STUDY DESIGN
1) De Winter and Dodou (2014) use linear regression on proportion data. This is a statistical mistake, which violates the critical assumptions of normality and homoscedasticity for linear regression. Moreover, judging by the data sets they have posted online, de Winter and Dodou (2014) have applied linear regression to each proportion by year, irrespective of sample size or any other factor. Such analysis is invalid unless each data point is weighted by sample size. Furthermore, the statistical power is hugely limited, since the effective sample size corresponds to the number of years considered.
All these mistakes were avoided by Fanelli (2012), in which each data point corresponds to one paper from the sample and is analysed for multiple characteristics through logistic regression, a more robust and powerful analysis that correctly models binary outcomes.
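The power difference between the two approaches can be illustrated with a minimal sketch on synthetic data (not the actual dataset of either study); a hand-rolled Newton-Raphson fit stands in here for standard logistic-regression software. The point is that a paper-level analysis has an effective sample size equal to the number of papers, not the number of years:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 20 years, 300-700 papers per year, with the true
# probability of a positive result rising over time on the logit scale.
years = np.arange(1990, 2010)
papers_per_year = rng.integers(300, 700, size=years.size)

year_col, outcome = [], []
for y, n in zip(years, papers_per_year):
    p = 1 / (1 + np.exp(-0.08 * (y - 2000)))  # true yearly log-odds slope: 0.08
    year_col.append(np.full(n, y, dtype=float))
    outcome.append(rng.random(n) < p)
year_col = np.concatenate(year_col)
outcome = np.concatenate(outcome).astype(float)

# Paper-level logistic regression, fitted by Newton-Raphson iterations.
X = np.column_stack([np.ones_like(year_col), year_col - 2000])
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))          # fitted probabilities
    W = mu * (1 - mu)                           # IRLS weights
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (outcome - mu))

print(f"estimated yearly log-odds slope: {beta[1]:+.3f} (true value +0.08)")
print(f"effective sample size: {outcome.size} papers, not {years.size} yearly points")
```

A regression on yearly proportions, by contrast, would have only 20 data points here regardless of how many papers each year contains, and would ignore the differing precision of each yearly proportion unless the points were weighted by sample size.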
2) De Winter and Dodou (2014) present individual linear regression slopes in tables, and do not even attempt a multivariable analysis, for no apparent reason other than, I take the liberty to presume, a lack of familiarity with such techniques. De Winter and Dodou (2014) acknowledge that their analyses do not correct for obvious confounders, in particular the different frequencies with which countries appear in different fields, and correctly consider this a major limitation. Unfortunately, they fail to mention that Fanelli (2012) (as well as Fanelli (2010) and all other studies I conducted on these issues) avoided such limitations precisely by using a multiple regression approach.
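What such a confounder correction buys can be sketched with fabricated data, again using an illustrative Newton-Raphson fit rather than the original analysis. In this hypothetical scenario the disciplinary mix shifts over time while only the discipline, not the year, affects the probability of a positive result; adding the confounder as a covariate removes the spurious temporal trend that a crude regression reports:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated confounded sample: two domains whose mix shifts over time,
# while the true model has a domain effect but NO time trend.
n = 8000
year = rng.integers(1990, 2010, size=n).astype(float)
p_social = (year - 1990) / 40 + 0.2            # later years: more social-science papers
social = (rng.random(n) < p_social).astype(float)
logit_p = -0.5 + 1.5 * social                  # only the domain matters
positive = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(float)

def fit_logit(X, y, iters=25):
    """Minimal Newton-Raphson logistic regression (illustrative only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        W = mu * (1 - mu)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))
    return beta

yc = year - 2000
b_crude = fit_logit(np.column_stack([np.ones(n), yc]), positive)
b_adj = fit_logit(np.column_stack([np.ones(n), yc, social]), positive)

print(f"crude year slope:    {b_crude[1]:+.4f}")  # spurious positive trend
print(f"adjusted year slope: {b_adj[1]:+.4f}")    # close to the true value of zero
```

The crude model mistakes the changing disciplinary composition for a temporal trend; the multivariable model recovers both the null time effect and the domain effect. Fitting separate univariate slopes, as in the tables criticised above, offers no such protection.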
3) Disciplinary classification: de Winter and Dodou (2014) used the Scopus classification system. This is another major flaw, avoided by Fanelli (2012). Like most classification systems in other databases, Scopus disciplinary categories are overlapping, which means that any given paper in de Winter and Dodou’s sample is likely to fall in multiple domains, for example both in the physical and in the social sciences. This alone should make any comparison between domains severely flawed. Fanelli (2012) (as well as Fanelli (2010) and all of my studies) used the Essential Science Indicators classification system, which is mutually exclusive (no overlap between disciplines).
4) A similar mistake to the above was made by de Winter and Dodou (2014) when classifying countries. De Winter and Dodou (2014) have aggregated all the countries of all addresses, which means that each paper could appear simultaneously in the US, EU and Asian samples. This time, the mistake is rather unjustified, since Fanelli (2012) had explicitly limited the analysis to the country of the corresponding author, an attribution that is usually easy to make for any paper.
5) Finally, as de Winter and Dodou (2014) correctly discuss, their proxy differs in major ways from that used by Fanelli (2012), and partially from what Pautasso (2010) measured, too.
I deem it unnecessary to discuss this matter at length, although it alone should send a warning against any claims to refutation. Claims of support should be hedged too, of course, but there is an important asymmetry, which I discuss in the last paragraph.
In conclusion, despite substantial differences in the proxy used and study design, and despite major flaws in the sampling and analytical strategy, the results presented by de Winter and Dodou (2014) are in remarkable agreement with the claims made by Fanelli (2010, 2012). De Winter and Dodou (2014) insist on the contrary, and in several passages seem to betray a “disconfirmation bias” against my findings.
Hostile replications are a positive force in science, so by no means do I wish to criticize de Winter and Dodou (2014)’s scepticism towards my results, or to discourage other researchers from attempting similar replications.
A more balanced interpretation would make scientific self-correction more efficient. However, the fact that their expectations are made explicit, together with their meticulous reporting of methods and results, is an example of how scientific disputes should be conducted in all fields.
When independent studies, using different methods and performed by researchers who are sceptical of previous claims, find completely different results, the lack of agreement is easily explained away as an effect of biased methodological choices on one or both sides. Conversely, however, when under the same conditions studies find patterns that are, to any extent, in agreement, this strongly suggests that the underlying phenomena are real, because they are measurable despite all a priori biases and methodological degrees of freedom.
I therefore thank de Winter and Dodou (2014) for offering results that will help the scientific community get to the bottom of important and controversial problems.
REFERENCES
de Winter JCF, Dodou D (2014). A surge of p-values between 0.040 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ PrePrints. doi:10.7287/peerj.preprints.447v1
Fanelli D (2010). "Positive" results increase down the Hierarchy of the Sciences. PLoS ONE. doi:10.1371/journal.pone.0010068
Fanelli D (2012). Negative results are disappearing from most disciplines and countries. Scientometrics. doi:10.1007/s11192-011-0494-7
Pautasso M (2010). Worsening file-drawer problem in the abstracts of natural, medical and social science databases. Scientometrics 85(1): 193–202. doi:10.1007/s11192-010-0233-5
For further literature and information please go to: danielefanelli.com