Google Scholar is a serious alternative to Web of Science

Argues that Google Scholar needs to be treated as a serious alternative data source for citation analysis

[This post was reposted at the LSE Impact blog and became one of the top-12 most popular posts of the year, clocking up more than 1,000 Facebook shares and nearly 300 LinkedIn shares - the largest number of any of the 12 posts].

Publish or Perish uses Google Scholar as one of its data sources (the other being Microsoft Academic). Many bibliometricians and university administrators are fairly conservative in their approach to citation analysis. It is not unusual to see them prefer the Web of Science (ISI for short) as “the gold standard” and discard Google Scholar out of hand, simply because they have heard some wild-west stories about its “overly generous” coverage. These stories are typically based on one or more of the following misconceptions, which I will dispute below.

  • First, the impression that everything “on the web” citing an academic’s work counts as a citation.
  • Second, the assumption that any publication that is not listed in the Web of Science is not worth considering at all.
  • Third, a general impression that citation counts in Google Scholar are completely unreliable.

Not everything published on the Internet counts in Google Scholar

Some academics are under the misplaced impression that anything posted on the Internet that includes references will be counted in Google Scholar. This might also be the source of the misconception that one can simply put phantom papers online to improve one’s citation count. However, Google Scholar only indexes scholarly publications. As their website indicates: “We work with publishers of scholarly information to index peer-reviewed papers, theses, preprints, abstracts, and technical reports from all disciplines of research.”

Some non-scholarly citations, such as student handbooks, library guides or editorial notes, slip through. However, incidental problems in this regard are unlikely to distort citation metrics, especially robust ones such as the h-index. Hence, although there might be some overestimation of the number of scholarly citations in Google Scholar, for many disciplines this is preferable to the very significant and systematic underestimation of scholarly citations in ISI or Scopus. Moreover, as long as one compares like with like, i.e. compares citation records for the same data source, this should not be a problem at all.

Non-ISI listed publications can be high-quality publications

There is also a frequent assumption amongst research administrators that ISI listing is a stamp of quality and that hence one should ignore non-ISI listed publications and citations. There are two problems with this assumption. First, ISI has a bias towards Science, English-language and North American journals. Second, ISI almost completely ignores a vast majority of publications in the Social Sciences and Humanities.

  • ISI journal listing is very incomplete in the Social Sciences & Humanities: ISI’s listing of journals is much more comprehensive in the Sciences than in the Social Sciences and Humanities. Butler (2006) analysed the distribution of publication output by field for Australian universities between 1999 and 2001. She found that whereas for the Chemical, Biological, Physical and Medical/Health sciences between 69.3% and 84.6% of the publications were published in ISI listed journals, this was the case for only 4.4%-18.7% of the publications in the Social Sciences such as Management, History, Education and Arts. Many high-quality journals in the field of Economics & Business are not ISI listed. Only 30%-40% of the journals in Accounting, Marketing and General Management & Strategy listed on my Journal Quality List (already a pretty selective list) are ISI listed. There is no doubt that, on average, journals that are ISI listed are perceived to be of higher quality. However, there are a very substantial number of non-ISI indexed journals that have a higher than average h-index.
  • ISI has very limited coverage of non-journal publications: Second, even in the Cited Reference search ISI only includes citations in ISI listed journals. In the General Search function it almost completely ignores any publications that are not in ISI-listed journals. As a result a vast majority of publications and citations in the Social Sciences & Humanities, as well as in Engineering & Computer Science, are ignored. In the Social Sciences and Humanities this is mainly caused by an almost complete neglect of books, book chapters, publications in languages other than English, and publications in non-ISI listed journals. In Engineering and Computer Science, this is mostly caused by a neglect of conference proceedings. ISI has recently introduced conference proceedings in their database. However, it does not provide any details of which conferences are covered beyond listing some disciplines that are covered. I was unable to find any of my own publications in conference proceedings. As a result ISI very seriously underestimates both the number of publications and the number of citations for academics in the Social Sciences & Humanities and in Engineering & Computer Science.

Google Scholar’s flaws have been played up far too much

Peter Jacsó, a prominent academic in Information and Library Science, has published several rather critical articles about Google Scholar (see e.g. Jacsó, 2006a/b). When confronted with titles such as “Dubious hit counts and cuckoo’s eggs” and “Deflated, inflated and phantom citation counts”, Deans, academic administrators and tenure/promotion committees could be excused for assuming Google Scholar provides unreliable data.

However, the bulk of Jacsó’s (2006b) critique is levelled at Google Scholar’s inconsistent number of results for keyword searches, which are not at all relevant for the author and journal impact searches that most academics use Publish or Perish for. For these types of searches, the following caveats are important.

  • Citation metrics are robust and insensitive to occasional errors: Most of the metrics used in Publish or Perish are fairly robust and insensitive to occasional errors, as these will not generally change the h-index or g-index and will only have a minor impact on the number of citations per paper. There is no doubt that Google Scholar’s automatic parsing occasionally provides us with nonsensical results. However, these errors do not appear to be as frequent or as important as implied by Jacsó’s articles. They also do not generally impact the results of author or journal queries much, if at all.
  • Google Scholar parsing has improved significantly: Google Scholar has also significantly improved its parsing since the errors were pointed out to them. However, many academics are still referring to Jacsó’s 2006 articles as convincing arguments against any use of Google Scholar. I would argue this is inappropriate. As academics, we are all too well aware that all of our research results include a certain error margin. We cannot expect citation data to be any different.
  • Google Scholar errors are random rather than systematic: What is most important is that errors are random rather than systematic. I have no reason to believe that the Google Scholar errors identified in Jacsó’s articles are anything other than random. Hence they will not normally advantage or disadvantage individual academics or journals.
  • ISI and Scopus have systematic errors of coverage: In contrast, commercial databases such as ISI and Scopus have systematic errors as they do not include many journals in the Social Sciences and Humanities, nor do they have good coverage of conference proceedings, books or book chapters. Therefore, although it is always a good idea to use multiple data sources, rejecting Google Scholar out of hand because of presumed parsing errors is not rational. Nor is presuming ISI is error-free simply because it charges high subscription fees.
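
To make the robustness point in the first caveat concrete, here is a minimal Python sketch. It uses the standard definitions of the h-index and g-index (with the g-index capped at the number of papers, which is one common convention) and compares a citation record with and without a couple of counting errors. The citation counts are made up for illustration and the function names are my own; this is not how Publish or Perish itself computes its metrics.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g
    citations (capped at the number of papers in the record)."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


def cites_per_paper(citations):
    """Mean number of citations per paper (0.0 for an empty record)."""
    return sum(citations) / len(citations) if citations else 0.0


# Hypothetical citation record for one academic (invented numbers).
clean = [20, 15, 12, 9, 8, 6, 5, 3, 1, 1, 0, 0]

# The same record with two illustrative parsing errors.
noisy = list(clean)
noisy[1] += 2  # one paper wrongly credited with two extra citations
noisy[6] -= 1  # another paper missing one genuine citation

for label, record in (("clean", clean), ("noisy", noisy)):
    print(label, h_index(record), g_index(record),
          round(cites_per_paper(record), 2))
```

With these invented numbers the h-index stays at 6 and the g-index at 8 in both records, while citations per paper moves only from about 6.67 to 6.75. An error would have to fall on a paper sitting right at the h or g threshold to change either index, which is why occasional random errors rarely matter for these metrics.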

Conclusion

As I have argued in the past, Google Scholar and Publish or Perish have democratised citation analysis. Rather than leaving it in the hands of those with access to commercial databases with high subscription fees, everyone with a computer and internet access can now run their own analyses. If you'd like to know more about this, have a look at this presentation.
