The Best AI Still Flunks 8th Grade Science

We’re a long way from machines that can carry on a real conversation. We’re even a long way from machines that can take a basic science test.

In 2012, IBM Watson went to medical school. So said The New York Times, announcing that the tech giant's artificially intelligent question-and-answer machine had begun a "stint as a medical student" at the Cleveland Clinic Lerner College of Medicine.

This was just a metaphor. Clinicians were helping IBM train Watson for use in medical research. But as metaphors go, it wasn't a very good one. Three years later, our artificially intelligent machines can't even pass an eighth-grade science test, much less go to medical school.

So says Oren Etzioni, a professor of computer science at the University of Washington and the executive director of the Allen Institute for Artificial Intelligence, the AI think tank funded by Microsoft co-founder Paul Allen. Etzioni and the not-for-profit Allen Institute recently ran a contest, inviting nearly 800 teams of researchers to build AI systems that could take an eighth-grade science test, and today the Institute released the results: the top performers answered about 60 percent of the questions correctly. In other words, they flunked.

For Etzioni, this five-month-long contest serves as a reality check for the state of artificial intelligence. Yes, thanks to the rise of deep neural networks, networks of hardware and software that approximate the web of neurons in the human brain, companies like Google and Facebook and Microsoft have achieved human-like performance in identifying images and recognizing spoken words, among other tasks. But we're still a long way from machines that can really think, from AI that can carry on a real conversation, even from systems that can pass a basic science test.

Whither Watson?

You might say that, way back in 2011, IBM Watson beat the best humans on Earth at Jeopardy!, the venerable TV trivia game show. And it did. Google just built a system that could top a professional at the ancient game of Go. But for a machine, these are somewhat easier tasks than taking a science test. "Jeopardy! is [about] finding a single fact, while I would imagine---and hope---that 8th-grade science asks students to solve problems that require several steps, and combine multiple facts to show understanding," says Chris Nicholson, CEO and founder of AI startup Skymind.

The Allen Institute's science test includes more than just trivia. It asks that machines understand basic ideas, serving up not only questions like "Which part of the eye does light hit first?" but also more complex questions that revolve around concepts like evolutionary adaptation. "Some types of fish live most of their adult lives in salt water but lay their eggs in freshwater," one question reads. "The ability of these fish to survive in these different environments is an example of [what]?"

These were multiple-choice questions---and the machines still couldn't pass, despite using state-of-the-art techniques, including deep neural nets. "Natural language processing, reasoning, picking up a science textbook and understanding---this presents a host of more difficult challenges," Etzioni says. "To get these questions right requires a lot more reasoning."

Yes, most of the contestants were academics, independent researchers, or computer scientists outside the largest tech companies. But Etzioni isn't sure the tech giants would perform all that much better, despite employing some of the top researchers in the field. "It's entirely possible that the scores would have gone higher had companies like Google and others put their 'big guns' to work," he says. "[But] the 'wisdom of the crowds' is quite powerful and there [are] some very talented folks engaged in these contests." Chaim Linhart, an Israeli researcher who participated in the competition, agrees. "In most competitions, I think the winning models are very specific to the test dataset, so even companies that work in the same domain don't necessarily have a significant advantage," he says.

What about Watson? According to Etzioni, IBM declined to participate (the company says it has turned its attention away from contests like this and towards "real world" applications). But Watson is perhaps not the best litmus test. Watson was good at Jeopardy!. That's what it was built for. But today, Watson is really just a brand name for a wide range of AI tools offered by IBM, and those tools aren't necessarily state of the art.

Back to Work

Etzioni's eighth-grade science test is really a test of natural language understanding---how well a machine understands the natural way humans speak and write. IBM's services do include natural language processing, but since Watson's arrival, this kind of tech has received a new boost from deep neural nets. Just as you can teach a neural net to recognize a cat by feeding it myriad cat photos, you can teach it to understand natural language using mountains of digital dialogue. Google, for instance, has used neural nets to build a chatbot that debates the meaning of life.
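
To make that idea concrete, here's a minimal sketch in Python of learning from labeled examples, using a simple bag-of-words classifier as a stand-in for a neural net. The question, candidate answers, and labels are invented for illustration; this isn't code from the contest or from any of the systems described here.

# Toy illustration of "learning from examples": a classifier learns
# to score candidate answers to one (invented) science question from
# labeled training pairs. Data and labels are made up for this sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: (question + candidate answer, is_correct)
examples = [
    ("Which part of the eye does light hit first? the cornea", 1),
    ("Which part of the eye does light hit first? the retina", 0),
    ("Which part of the eye does light hit first? the iris", 0),
    ("Which part of the eye does light hit first? the lens", 0),
]
texts = [text for text, _ in examples]
labels = [label for _, label in examples]

# Turn each sentence into word counts, then fit a linear classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Score a candidate answer: probability that it's the correct one.
candidate = "Which part of the eye does light hit first? the cornea"
print(model.predict_proba(vectorizer.transform([candidate]))[0][1])

Notice what the sketch doesn't do: it matches words it has seen before, with no multi-step reasoning of the kind Etzioni says the harder questions demand.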

But this chatbot wasn't completely convincing. As it stands, no single technique gets you to the state of the art. "So far, there is no universal method," says Dutch researcher Benedikt Wilbertz, another participant in the Allen AI contest. "This challenge needed its own mix of machine learning and [other] AI tools." Indeed, the top participants in the Allen AI challenge used deep learning as well as various other techniques. And the end result was still well below perfect.

Doug Lenat, who runs an AI project called Cyc, says that teaching today's machines to take basic science tests doesn't even make much sense. We should be striving for something more---something much further out. "If you're talking about passing multiple choice science tests, I always felt that was not actually the test AI should be aiming to pass," he says. "The focus on natural language understanding---science tests, and so on---is something that should follow from a program being actually intelligent. Otherwise, you end up hitting the target but producing the veneer of understanding." In other words, a machine that passes an eighth-grade science test isn't all that smart.

So, we've yet to build a machine that's even sorta close to real intelligence. But the work continues.