If Computers Are So Smart, How Come They Can’t Read?

Deep learning excels at learning statistical correlations, but lacks robust ways of understanding how the meanings of sentences relate to their parts.
A library in Qatar. Reading isn’t just about statistics; it’s about synthesizing knowledge. Photo: Karim Jaafar/AFP/Getty Images

At TED, in early 2018, the futurist and inventor Ray Kurzweil, currently a director of engineering at Google, announced his latest project, “Google Talk to Books,” which claimed to use natural language understanding to “provide an entirely new way to explore books.” Quartz dutifully hyped it as “Google’s astounding new search tool [that] will answer any question by reading thousands of books.”

If such a tool actually existed and worked robustly, it would be amazing. But so far it doesn’t. If we could give computers one capability that they don’t already have, it would be the ability to genuinely understand language. In medicine, for example, several thousand papers are published every day; no doctor or researcher can possibly read them all. Drug discovery gets delayed because information is locked up in unread literature. New treatments don’t get applied, because doctors don’t have time to discover them. AI programs that could synthesize the medical literature, or even just reliably scan your email for things to add to your to-do list, would be a revolution.

This article is adapted from Rebooting AI: Building Artificial Intelligence We Can Trust, by Gary Marcus and Ernest Davis. Marcus is Founder and CEO of Robust.AI and a professor emeritus at NYU. Davis is a professor of computer science at NYU.

But drill down into tools like Google Talk to Books (GTB) and you quickly realize we are nowhere near genuine machine reading yet. When we asked GTB, “Where did Harry Potter meet Hermione Granger?” only six of the 20 answers were even about Harry Potter; most of the rest were about other people named Harry or on completely unrelated topics. Only one mentioned Hermione, and none answered the question. When we asked GTB, “Who was the oldest Supreme Court justice in 1980?” we got another fail. Any reasonably bright human could go to Wikipedia’s list of Supreme Court justices and figure out that it was William Brennan. Google Talk to Books couldn’t: no sentence in any book it had digested stated the answer explicitly, and it had no way to make inferences beyond what was directly spelled out.

The most telling problem, though, was that we got totally different answers depending on how we asked the question. When we asked GTB, “Who betrayed his teacher for 30 pieces of silver?” (a famous incident in a famous story), only three of the 20 answers correctly identified Judas. Things got even worse as we strayed from the exact wording of “pieces of silver.” When we asked a slightly less specific question, “Who betrayed his teacher for 30 coins?” Judas turned up in only one of the top 20 answers, and when we asked “Who sold out his teacher for 30 coins?” Judas disappeared from the top 20 results altogether.


To get a sense for why robust machine reading is still such a distant prospect, it helps to appreciate—in detail—what is required even to comprehend a children’s story.

Suppose that you read the following passage from Farmer Boy, a children’s book by Laura Ingalls Wilder. Almanzo, a 9-year-old boy, finds a wallet (then called a “pocketbook”) full of money dropped in the street. Almanzo’s father guesses that the pocketbook might belong to Mr. Thompson, and Almanzo finds Mr. Thompson at one of the stores in town.

Almanzo turned to Mr. Thompson and asked, “Did you lose a pocketbook?” Mr. Thompson jumped. He slapped a hand to his pocket, and fairly shouted.

“Yes, I have! Fifteen hundred dollars in it, too! What about it? What do you know about it?”

“Is this it?” Almanzo asked.

“Yes, yes, that’s it!” Mr. Thompson said, snatching the pocketbook. He opened it and hurriedly counted the money. He counted all the bills over twice. … Then he breathed a long sigh of relief and said, “Well, this durn boy didn’t steal any of it.”

A good reading system would be able to answer questions like these:

• Why did Mr. Thompson slap his pocket with his hand?

• Before Almanzo spoke, did Mr. Thompson realize that he had lost his wallet?

• What was Almanzo referring to when he asked, “Is this it?”

• Was all of the money still in the wallet?

All of these questions are easy for people. But no AI yet devised comes close—because each of these questions requires a reader to follow a chain of inferences that are only implicit in the story, and current techniques do not carry out inference in this sense. What is implicit is largely outside their scope. Such chains of reasoning often demand that the reader put together background knowledge about people and objects and, more generally, about how the world works. No current system has a broad enough fund of general knowledge to do this well.

Take question one, for example. Before Almanzo speaks, Mr. Thompson doesn’t know he has lost the wallet and assumes that it is still in his pocket. When Almanzo asks him whether he has lost a wallet, Thompson realizes he might in fact have lost it. It is to test this possibility—the wallet might be lost—that Thompson slaps his pocket. Since the wallet isn’t where he usually keeps it, Thompson concludes that he has indeed lost it. Consciously or unconsciously, as you digested the story of Almanzo and the wallet, you probably drew on most or all of the following kinds of facts:

• People can drop things without realizing it. [Knowledge about the relation of events and people’s mental states.]

• People often carry their wallet in their pocket. [Knowledge about how people use objects.]

• If people assume that something important to them is true, and they find out that it might not be true, then it is often urgent for them to verify it. [Knowledge about the kinds of things that are psychologically important to people.]

• You can often find out whether something is inside your pocket by feeling the pocket on the outside. [This combines knowledge about how different objects (hands, pockets, wallets) interact with one another with knowledge about how the senses work.]
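
To make the shape of that reasoning concrete, here is a minimal forward-chaining sketch in Python. Everything in it, the facts, the rules, and the little loop that applies them, is our own toy invention for illustration; it is not how any deployed system works.

```python
# Toy forward-chaining over hand-written background facts (illustrative only).
facts = {
    ("Thompson", "believes", "wallet is in his pocket"),
    ("Almanzo", "asks Thompson about", "a lost wallet"),
}

# Each rule: if every premise is already a known fact, add the conclusion.
rules = [
    # Being asked whether you lost something makes you doubt you still have it.
    ({("Thompson", "believes", "wallet is in his pocket"),
      ("Almanzo", "asks Thompson about", "a lost wallet")},
     ("Thompson", "suspects", "wallet may be lost")),
    # If something important may be missing, it is urgent to check where you keep it.
    ({("Thompson", "suspects", "wallet may be lost")},
     ("Thompson", "slaps his pocket to check for", "the wallet")),
]

changed = True
while changed:  # keep applying rules until nothing new can be added
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(("Thompson", "slaps his pocket to check for", "the wallet") in facts)  # True
```

Even this toy version makes the difficulty plain: someone has to supply the facts and the rules, a genuinely useful system would need an enormous number of them, and it would have to apply them flexibly, drawing on background knowledge and chaining it into conclusions.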

The currently popular approach to AI doesn’t do any of that; instead of representing knowledge, it just represents probabilities, mainly of how often words tend to co-occur in different contexts. This means you can generate strings of words that sound humanlike, but there’s no real coherence there.

To illustrate what we mean by this, we fed the Almanzo story into GPT-2, a state-of-the-art text-generation system created by OpenAI, allegedly so good that OpenAI couldn’t safely share it with the public, despite its mission to make AI freely available. In went the text we presented above, ending with Thompson’s discovery that all his money was still there; out came this:

It took a lot of time, maybe hours, for him to get the money from the safe place where he hid it, so he brought it back in a bundle and left it on the table. Then he noticed the money that had been hiding in the bed, and began walking toward the bed with a large bundle.

The words are fluent, but the ideas are incoherent. If Mr. Thompson just got his wallet back, how come the money is suddenly hidden in a safe place? The system has glommed on to the statistical correlation between wallets and safe places, but entirely lost the thread of the story. (You can try this yourself; every continuation will be different; few of them will be coherent.) No current system is any better.
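
If you want to run this kind of experiment yourself, one route (not necessarily the interface we used) is the publicly released GPT-2 weights, accessed here through the Hugging Face transformers library; the "gpt2" model name and the sampling settings below are just reasonable defaults, not anything canonical.

```python
# Sample a continuation from the released GPT-2 model
# (requires: pip install transformers torch).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Any prompt works; here we start with the opening of the Farmer Boy passage.
prompt = (
    'Almanzo turned to Mr. Thompson and asked, "Did you lose a pocketbook?" '
    "Mr. Thompson jumped. He slapped a hand to his pocket, and fairly shouted."
)
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=True draws tokens at random from the model's distribution,
# so each run produces a different continuation.
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```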


How did we get into this mess?

Current AI systems are largely powered by a statistical technique called deep learning, and deep learning is very effective at learning correlations, such as correlations between images or sounds and labels. But deep learning struggles when it comes to understanding how objects like sentences relate to their parts (like words and phrases).

Why? It’s missing what linguists call compositionality: a way of constructing the meaning of a complex sentence from the meaning of its parts. For example, in the sentence "The moon is 240,000 miles from the Earth," the word moon means one specific astronomical object, Earth means another, mile means a unit of distance, 240,000 means a number, and then, by virtue of the way that phrases and sentences work compositionally in English, 240,000 miles means a particular length, and the sentence "The moon is 240,000 miles from the Earth" asserts that the distance between the two heavenly bodies is that particular length.
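
As a loose illustration, here is what an explicitly compositional, hand-built representation of that sentence could look like; the class names and fields are our own inventions, chosen only to make the structure visible, and are not part of any real NLP system.

```python
from dataclasses import dataclass

# Toy meaning representation: each word contributes a piece, and a grammar-like
# rule says how the pieces combine into the meaning of the whole sentence.

@dataclass(frozen=True)
class Entity:
    name: str            # "moon" -> one specific astronomical object

@dataclass(frozen=True)
class Quantity:
    value: float         # 240,000
    unit: str            # "mile" -> a unit of distance

@dataclass(frozen=True)
class DistanceAssertion:
    a: Entity
    b: Entity
    distance: Quantity   # the claim: the distance between a and b is this length

# Word meanings (the parts).
moon = Entity("Moon")
earth = Entity("Earth")
length = Quantity(240_000, "mile")   # "240,000 miles" -> a particular length

# Compositional rule for "<X> is <length> from <Y>": the sentence asserts that
# the distance between X and Y is that length.
sentence_meaning = DistanceAssertion(moon, earth, length)
print(sentence_meaning)
```

The particular classes don’t matter; what matters is that the meaning of the whole is assembled, piece by piece, from the meanings of the parts.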

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure. It can learn that dogs have tails and legs, but it doesn’t know how they relate to the life cycle of a dog. Deep learning doesn’t recognize a dog as an animal composed of parts like a head, a tail, and four legs, or even what an animal is, let alone what a head is, and how the concept of head varies across frogs, dogs, and people, different in details yet bearing a common relation to bodies. Nor does deep learning recognize that a sentence like "The moon is 240,000 miles from the Earth" contains phrases that refer to two heavenly bodies and a length.

At the same time, deep learning has no good way to incorporate background knowledge. A system can learn to predict that the words wallet and safe place occur in similar kinds of sentences ("He put his money in the wallet," "He put his money in a safe place"), but it has no way to relate that to the fact that people like to protect their possessions.
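
Here is a toy sketch of the kind of raw material such a system works from: co-occurrence counts over a four-sentence corpus we made up, with “safe place” joined into a single token for simplicity. Real systems use vastly more data and subtler statistics, but the underlying ingredient is the same.

```python
from collections import Counter

# A made-up corpus. A distributional model sees only which words appear near
# which other words; nothing here encodes why people use safe places.
corpus = [
    "he put his money in the wallet",
    "he put his money in the safe_place",
    "she kept the cash in her wallet",
    "she kept the cash in her safe_place",
]

def context_counts(target, sentences, window=3):
    """Count the words within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

print(context_counts("wallet", corpus))
print(context_counts("safe_place", corpus))
# The two context profiles come out identical, so a distributional model treats
# "wallet" and "safe place" as near-synonyms; the counts say nothing at all
# about people wanting to protect their possessions.
```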

In the language of cognitive psychology, what you do when you read any text is build up a cognitive model of the meaning of what the text is saying. As you read the passage from Farmer Boy, for example, you gradually build up a mental representation—internal to your brain—of all the people, objects, and incidents in the story and the relations among them: Almanzo, the wallet, and Mr. Thompson, and also the events of Almanzo speaking to Mr. Thompson, Mr. Thompson shouting and slapping his pocket, Mr. Thompson snatching the wallet from Almanzo, and so on. It’s only after you’ve read the text and constructed the cognitive model that you do whatever you do with the narrative—answer questions about it, translate it into Russian, illustrate it, or just remember it for later.
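
As a rough picture of what such a cognitive model might contain, here is a hand-built sketch of the scene as explicit entities and events. The classes and field names are ours, purely illustrative; a reader’s actual mental model is of course far richer.

```python
from dataclasses import dataclass
from typing import Optional

# Toy "cognitive model" of the Farmer Boy scene: explicit entities, explicit
# events, explicit relations between them (our own illustrative schema).

@dataclass
class Entity:
    name: str
    kind: str

@dataclass
class Event:
    action: str
    agent: Entity
    target: Optional[Entity] = None

almanzo = Entity("Almanzo", "person")
thompson = Entity("Mr. Thompson", "person")
wallet = Entity("pocketbook", "object")

events = [
    Event("asks about a lost pocketbook", agent=almanzo, target=thompson),
    Event("slaps his pocket", agent=thompson),
    Event("snatches", agent=thompson, target=wallet),
    Event("counts the money in", agent=thompson, target=wallet),
]

# Once the scene is represented this explicitly, downstream tasks (answering
# questions, translating, summarizing) can all read off the same structure.
for e in events:
    print(e.agent.name, e.action, e.target.name if e.target else "")
```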

Ever since 2013, when DeepMind built a system that played Atari games—often better than humans—without cognitive models, and soon afterward sold itself to Google for more than half a billion dollars, cognitive models have gone out of fashion. But what works for games with their fixed rules and limited options doesn’t work for reading. The simulated prose of the cognitive-model-free GPT-2 is entertaining, but it's a far cry from genuine reading comprehension.

That’s because, in the final analysis, statistics are no substitute for real-world understanding. Instead, there is a fundamental mismatch between the kind of statistical computation that powers current AI programs and the cognitive-model construction that would be required for systems to actually comprehend what they are trying to read.

We don’t think it is impossible for machines to do better. But mere quantitative improvement—with more data, more layers in our neural networks, and more computers in the networked clusters of powerful machines that run those networks—isn’t going to cut it.

Instead, we believe it is time for an entirely new approach that is inspired by human cognitive psychology and centered around reasoning and the challenge of creating machine-interpretable versions of common sense.

Reading isn’t just about statistics; it’s about synthesizing knowledge: combining what you already know with what the author is trying to tell you. Kids manage that routinely; machines still can’t.


From Rebooting AI: Building Artificial Intelligence We Can Trust, by Gary Marcus and Ernest Davis. Copyright © 2019 by Gary Marcus and Ernest Davis. Reprinted by permission of Pantheon Books, an imprint of the Knopf Doubleday Publishing Group, a division of Penguin Random House LLC.

