It's Gonna Get a Lot Easier to Break Science Journal Paywalls

Scientific search engines are the Napster of academic papers—and they're only getting more powerful.

Anurag Acharya’s problem was that the Google search bar is very smart, but also kind of dumb. As a Googler working on search 13 years ago, Acharya wanted to make search results encompass scholarly journal articles. A laudable goal, because unlike the open web, most of the raw output of scientific research was invisible—hidden behind paywalls. People might not even know it existed. “I grew up in India, and most of the time you didn’t even know if something existed. If you knew it existed, you could try to get it,” Acharya says. “‘How do I get access?’ is a second problem. If I don’t know about it, I won’t even try.”

Acharya and a colleague named Alex Verstak decided that their corner of search would break with Google tradition and look behind paywalls—showing citations and abstracts even if it couldn’t cough up an actual PDF. “It was useful even if you did not have university access. That was a deliberate decision we made,” Acharya says.

Then they hit that dumbness problem. The search bar doesn’t know what flavor of information you’re looking for. You type in “cancer”; do you want results that tell you your symptoms aren’t cancer (please), or do you want the Journal of the American Medical Association? The search bar doesn’t know.

Acharya and Verstak didn't try to teach it. Instead, they built a spinoff, a search bar separate from Google-prime that would only look for journal articles, case law, patents—hardcore primary sources. And it worked. “We showed it to Larry [Page] and he said, ‘why is this not already out?’ That’s always a positive sign,” Acharya says.

Today, even though you can’t access Scholar directly from the Google-prime page, it has become the internet’s default scientific search engine—even more so than the once-monopolistic Web of Science, the National Institutes of Health’s PubMed, or Scopus, owned by the giant scientific publisher Elsevier.

But most science is still paywalled. More than three quarters of published journal articles—114 million on the World Wide Web alone, by one (lowball) estimate—are only available if you are affiliated with an institution that can afford pricey subscriptions or you can swing $40-per-article fees. In the last several years, though, scientists have made strides to loosen the grip of giant science publishers. They skip over the lengthy peer review process mediated by the big journals and just … post. Review comes after. The paywall isn’t crumbling, but it might be eroding. The open science movement, with its free distribution of articles before their official publication, is a big reason.

Another reason, though, is stealthy improvement in scientific search engines like Google Scholar, Microsoft Academic, and Semantic Scholar—web tools increasingly able to see around paywalls or find copies of articles that have jumped over them. Scientific publishing ain’t like book publishing or journalism. In fact, it’s a little more like music, pre-iTunes, pre-Spotify. You know, right about when everyone started using Napster.

Before World War II most scientific journals were published by small professional societies. But capitalism’s gonna capitalism. By the early 1970s the top five scientific publishers—Reed-Elsevier, Wiley-Blackwell, Springer, Taylor & Francis, and Sage—published about 20 percent of all journal articles. In 1996, when the transition to digital was underway and the PDF became the format of choice for journals, that number went up to 30 percent. Ten years later it was 50 percent.

Those big-five publishers became the change they wanted to see in the publishing world—by buying it. Owning over 2,500 journals (including the powerhouse Cell) and 35,000 books and references (including Gray’s Anatomy) is big, right? Well, that’s Elsevier, the largest scientific publisher in the world, which also owns ScienceDirect, the online gateway to all those journals. It owns the (pre-Google Scholar) scientific search engine Scopus. It bought Mendeley, a reference manager with social and community functions. It even owns a company that monitors mentions of scientific work on social media. “Everywhere in the research ecosystem, from submission of papers to research evaluations made based on those papers and various acts associated with them online, Elsevier is present,” says Vincent Larivière, an information scientist at the University of Montreal and author of the paper with those publishing stats cited a paragraph back.

The company says all that is actually in the service of wider dissemination. “We are firmly in the open science space. We have tools, services, and partnerships that help create a more inclusive, more collaborative, more transparent world of research,” says Gemma Hersh,[1] Elsevier’s vice president for open science. “Our mission is around improving research performance and working with the research community to do that.” Indeed, in addition to traditional, for-profit journals it also owns SSRN, a preprint server—one of those places that hosts unpaywalled, pre-publication articles—and publishes thousands of articles at various levels of openness.

So Elsevier is science publishing’s version of Too Big to Fail. As such, it has faced various boycotts, slightly piratical workarounds, and general anger. (“The term ‘boycott’ comes up a lot, but I struggle with that. If I can be blunt, I think it’s a word that’s maybe misapplied,” Hersh says. “More researchers submit to us every year, and we publish more articles every year.”)

If you’re not someone with “.edu” in your email, this might make you a little nuts. Not just because you might want to actually see some cool science, but because you already paid for that research. Your taxes (or maybe some zillionaire’s grant money) paid the scientists and funded the studies. The experts who reviewed and critiqued the results and conclusions before publication were volunteers. Then the journal that published it charged a university or a library—again, probably funded at least in part by your taxes—to subscribe. And then you gotta buy the article? Or the researcher had to pony up $2,000[2] to make it open access?

Now, publishers like Elsevier will say that the processes of editing, peer review, copy editing, and distribution are a major, necessary value add. And look at the flip side: so-called predatory journals that charge authors to publish nominally open-access articles with no real editing or review (and that, yes, show up in search results). Still, the scientific publishing business is a $10 billion-a-year game. In 2010, Elsevier reported profits of $1 billion and a 35 percent margin. So, yeah.

In that early-digital-music metaphor, the publishers are the record labels and the PDFs are MP3s. But you still need a Napster. That’s where open-science-powered search engines come in.

A couple years after Acharya and Verstak built Scholar, a team at Microsoft built their own version, called Academic. It was at the time a much, let’s say, leaner experience, with far fewer papers available. But then in 2015, Microsoft released a 2.0, and it’s a killer.

Microsoft’s communication team declined to make any of the people who run it available, but a paper from the team at Microsoft Research lays the specs out pretty well: It figures out the bibliographic data of papers and combines that with results from Bing. (A real search engine that exists!) And you know what? It’s pretty great. It sees 83 million papers, not far from estimates of the size of Google’s universe, and handles the same kind of natural-language queries. Unlike Scholar, people can hook into Microsoft Academic’s API and see its citation graph, too.

Even as recently as 2015, scientific search engines weren’t much use to anyone outside universities and libraries. You could find a citation to a paper, sure—but good luck actually reading it. Now, even though more overt efforts to subvert copyright like Sci-Hub are falling to lawsuits from places like Elsevier and the American Chemical Society, the open science movement is gaining momentum. PDFs are falling off virtual trucks all over the internet, posted on university websites or places like ResearchGate and Academia.edu, hosts for exactly this kind of thing. Scholar’s and Academic’s first sorties against the paywall have been joined by reinforcements. It’s starting to look like a siege.

For example, the Chan Zuckerberg Initiative, the philanthropic arm of the founder of Facebook, is working on something aimed at increasing access. The founders of Mendeley have a new, venture-backed PDF finder called Kopernio. A browser extension called Unpaywall roots around the web for free PDFs of articles.
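
For the curious, the gist of what a tool like Unpaywall does fits in a few lines: hand it a DOI, and it asks a public index whether a free, legal copy of that paper lives anywhere on the web. Here is a minimal Python sketch against Unpaywall’s public REST API (the endpoint and response fields as I understand them; the email address is a placeholder, and the DOI is just an example to swap out).

```python
import json
import urllib.request


def find_free_pdf(doi, email):
    """Return a URL to a free, legal copy of the article, if one is indexed."""
    url = "https://api.unpaywall.org/v2/{}?email={}".format(doi, email)
    with urllib.request.urlopen(url) as resp:
        record = json.load(resp)
    best = record.get("best_oa_location")  # best known open-access copy, or None
    if best:
        return best.get("url_for_pdf") or best.get("url")
    return None


if __name__ == "__main__":
    # Placeholder DOI and email; swap in any article's DOI and your own address.
    print(find_free_pdf("10.1371/journal.pone.0127502", "you@example.com"))
```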

A particularly novel web crawler comes from the nonprofit Allen Institute for Artificial Intelligence. Semantic Scholar pores over a corpus of 40 million citations in computer science and biomedicine, extracting tables and charts and using machine learning to flag meaningful cites as “highly influential citations,” a new metric. Almost a million people use it every month.

“We use AI techniques, particularly natural language processing and machine vision, to process the PDF and extract information that helps readers decide if the paper is of interest,” says Oren Etzioni, CEO of the Allen Institute for AI. “The net effect of all this is that more and more is open, and a number of publishers … have said making content discoverable via these search engines is not a bad thing.”
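
To make the “highly influential citations” idea concrete, here is a deliberately crude toy in Python. It is not the Allen Institute’s actual model, which relies on machine learning over the full text; it just captures the intuition that a citation a paper mentions repeatedly, and leans on in its methods, probably matters more than a passing nod. Every paper name and number below is invented.

```python
# Toy heuristic only: score one citation higher when the citing paper
# mentions it often and discusses it in its methods section.
def influence_score(mention_count, appears_in_methods):
    """Crude stand-in score for how much one citing paper leans on a cite."""
    return mention_count + (3 if appears_in_methods else 0)


# Invented citations of some hypothetical "paper X".
citations_of_paper_x = [
    {"citing": "paper_A", "mentions": 7, "in_methods": True},   # builds on X
    {"citing": "paper_B", "mentions": 1, "in_methods": False},  # passing nod
]

for c in citations_of_paper_x:
    score = influence_score(c["mentions"], c["in_methods"])
    label = "highly influential" if score >= 5 else "incidental"
    print(f'{c["citing"]}: score {score} ({label})')
```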

Even with all these increases in discoverability and access, the technical challenges of scientific search don’t stop with paywalls. When Acharya and Verstak started out, Google relied on PageRank, a way to infer a web page’s importance from the hyperlinks pointing to it. That’s not how scientific citations work. “The linkage between articles is in text. There are references, and references are all approximate,” Acharya says. “In scholarship, all your citations are one way. Everybody cites older stuff, and papers never get modified.”
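
To see what he means, here is a toy illustration in Python, emphatically not Google’s or Scholar’s actual ranking code: run plain PageRank over a tiny, made-up citation graph in which papers only ever cite older papers, and watch the rank pile up on the oldest work.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each paper to the papers it cites (always older ones)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for paper, cited in graph.items():
            if not cited:  # a paper that cites nothing spreads its rank evenly
                for n in nodes:
                    new_rank[n] += damping * rank[paper] / len(nodes)
            else:
                for target in cited:
                    new_rank[target] += damping * rank[paper] / len(cited)
        rank = new_rank
    return rank


# Newer papers cite older ones; nothing ever cites forward in time.
citations = {
    "paper_2017": ["paper_2010", "paper_1996"],
    "paper_2010": ["paper_1996"],
    "paper_1996": [],
}
print(pagerank(citations))  # the 1996 paper ends up with the highest rank
```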

Plus, unlike a URL, the location or citation for a journal article is not the actual journal article. In fact, there might be multiple copies of the article at various locations. From a perspective as much philosophical as bibliographical, a PDF online is really just a picture of the knowledge. So the search result showing a citation might attach to multiple versions of the actual article.

That’s a special problem when researchers can post pre-print versions of their own work but might not have copyright to the publication of record, the peer-reviewed, copy-edited version in the journal. Sometimes the differences are small; sometimes they’re not.

Why don’t the search engines just use metadata to understand what version belongs where? Like when you download music, your app of choice automatically populates with things like an image, the artist’s name, the song titles … the data about the thing.

The answer: metadata LOL. It’s a big problem. “It varies by source,” Etzioni says. “A whole bunch of that information is not available as structured metadata.” Even when there is metadata, it’s in idiosyncratic formats from publisher to publisher and server to server. “In a surprising way, we’re kind of in the dark ages, and the problem just keeps getting worse,” he says. More papers get published; more are digital. Even specialists can’t keep up.
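
To get a flavor of the mess, here is a small, hypothetical Python sketch: the “same” paper arrives from two sources with different field names and formats, and a normalizer has to guess its way to a single record. None of the field names below come from any real publisher’s schema.

```python
def normalize(record):
    """Map assorted publisher/repository fields onto one minimal schema."""
    title = record.get("title") or record.get("Title") or record.get("dc:title")
    authors = (record.get("authors")
               or record.get("creator")
               or record.get("AuthorList") or [])
    if isinstance(authors, str):  # some sources ship authors as one big string
        authors = [a.strip() for a in authors.split(";")]
    year = record.get("year") or record.get("pub_date", "")[:4]
    doi = (record.get("doi") or record.get("DOI") or "").lower() or None
    return {"title": title, "authors": list(authors),
            "year": int(year) if year else None, "doi": doi}


# Two descriptions of the same (hypothetical) paper, from two sources:
a = {"dc:title": "On Paywalls", "creator": "A. Author; B. Author",
     "pub_date": "2017-11-30", "DOI": "10.1234/EXAMPLE"}
b = {"title": "On Paywalls", "authors": ["A. Author", "B. Author"],
     "year": 2017, "doi": "10.1234/example"}
print(normalize(a) == normalize(b))  # True: one record, two disguises
```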

Which is why scientific search and open science are so intertwined and so critical. The reputation of a journal and the number of times a specific paper in that journal gets cited are metrics for determining who gets grants and who gets tenure, and by extension who gets to do bigger and bigger science. “Where the for-profit publishers and academic presses sort of have us by the balls is that we are addicted to prestige,” says Guy Geltner, a historian at the University of Amsterdam, open science advocate, and founder of a new user-owned social site for scientists called Scholarly Hub.

The thing is, as is typical for Google, Scholar is opaque about how it works and what it finds. Acharya wouldn’t give me numbers of users or the number of papers it searches. (“It’s larger than the estimates that are out there,” he says, and “an order of magnitude bigger than when we started.”) No one outside Google fully understands how the search engine applies its criteria for inclusion,[3] and indeed Scholar hoovers up way more than just PDFs of published or pre-published articles. You get course syllabi, undergraduate coursework, PowerPoint presentations … actually, for a reporter, it’s kind of fun. But tricky.

That means the citation data is also obscure, which makes it hard to know what Scholar’s findings mean for science as a whole. Scholar may be a low-priority side-project (please don’t kill it like you killed Reader!) but maybe that data is going to be valuable someday. Elsevier obviously thinks it’s useful.

The scientific landscape is shifting. "If you took a group of academics right now and asked them to create a new system of publishing, nobody would suggest what we're currently doing," says David Barner, a psychologist at UC San Diego and open science advocate. But change, Barner says, is hard. The people who'd make those changes are already overworked, already volunteering their time.

Even Elsevier knows that change is coming. “Rather than scrabble around in one of the many programs you’ve mentioned, anyone can come to our Science and Society page, which details a host of programs and organizations we work with to cater through every scenario where somebody wants access,” Hersh says. And that’d be to the final, published, peer-reviewed version—the archived, permanent version of record.

Digital revolutions have a way of #disrupting no matter what. As journal articles get more open and more searchable, value will come from understanding what people search for—as Google long ago understood about the open web. “We’re a high quality publisher, but we’re also an information analytics company, evolving services that the research community can use,” Hersh says.

Because reputation and citation are core currencies for scientists, researchers have to be educated about the possibilities of open publication, and at the same time prestigious, reputable open venues have to exist. Preprints are great, and the researchers maintain copyright to them, but it’s also possible that the final citation of record could be different after it goes through review. There has to be a place where primary scientific work is available to the people who funded it, and a way for them to find it.

Because if there isn’t? “A huge part of research output is suffocating behind paywalls. Sixty-five of the 100 most cited articles in history are behind paywalls. That’s the opposite of what science is supposed to do,” Geltner says. “We’re not factories producing proprietary knowledge. We’re engaged in debates, and we want the public to learn from those debates.”

I'm sensitive to the irony of a WIRED writer talking about the social risks of a paywall, though I'd draw a distinction between paying a journalistic outlet for its journalism and paying a scientific publisher for someone else's science.

An even more critical difference, though, is that a science paywall does more than separate gown from town. When all the solid, good information is behind a paywall, what’s left outside in the wasteland will be crap—propaganda and marketing. Those are always free, because people with political agendas and financial interests underwrite them. Understanding that vaccines are critical to public health and human-driven carbon emissions are un-terraforming the planet cannot be the purview of the one percent. “Access to science is going to be a first-world privilege,” Geltner says. “That’s the opposite of what science is supposed to be about.”

[1] UPDATE 12/3/17 11:55 AM: Corrected the spelling of this name.
[2] UPDATE 12/4/17 1:25 PM: Removed the word “another”; researchers sometimes pay to make their own articles open-access.
[3] UPDATE 12/4/17 1:25 PM: Clarified to show that Google publishes inclusion criteria.