Pixar Vets Reinvent Speech Recognition So It Works for Kids


Oren Jacob and his daughter, Toby, had just finished a Skype call. They'd been chatting with some other family members on Jacob's smartphone, and it was still sitting on the table in front of them, when the 7-year-old Toby picked it up and asked if she could call her American Girl doll. Jacob paused before answering. "No, you can't," he said. "But let me get back to you on that."

Having spent 20 years of his career at Pixar, including a stint as chief technology officer, Jacob had worked on films like Toy Story and Finding Nemo, using technology to animate some of the most iconic movie characters of recent years. But on that day back in 2011, his daughter touched on something he hadn't thought about before.

Though characters like Woody and Buzz Lightyear are wonderfully realistic and lovable, the relationship that kids have with them is largely one-sided. Kids can hear these characters talk---not only through movies, but games, toys, and other movie merchandise---but they can't engage them. They can't really carry on a conversation with Woody or Buzz.

It was this idea that inspired Jacob to team up with his former Pixar colleague, Martin Reddy, and launch a new company, ToyTalk. The San Francisco-based outfit develops mobile games that let kids have conversations with animated characters---dialogues that can last for hours. The most recent game, SpeakaLegend, which lets kids chat with mythical creatures like dragons and unicorns, launched Thursday in the App Store.

These apps are rather clever in their own right, but what could potentially turn ToyTalk into a Pixar-like company is the technology it built to power them all. Known as PullString, it's equal parts speech recognition engine and script writing tool, and it's quite a departure from other speech rec tools developed by the likes of Microsoft, Google, and Apple. It's tailored specifically to kids, whose sentence structure, pitch, and vocal tone have posed challenges for traditional tools.

Having applied PullString to its own games, ToyTalk is hoping to license the technology to other companies in the toy industry and beyond. Many in the industry believe this could not only reinvent kids' entertainment, but also significantly change speech recognition as we know it.

The Way Kids Communicate

The race to develop superior speech technology has never been more cutthroat. For proof, check out Microsoft’s recent marketing campaign, pitting its virtual assistant, Cortana, against Siri.

Speech capability is becoming a selling point not only for phones, but for video game consoles, televisions, and even refrigerators. But as these companies push their speech-enabled devices into our pockets and our homes, they're ignoring perhaps the most important population of potential customers: children.

"The way kids talk and communicate is very different from how adults do, both in terms of how they use language and the fundamental frequencies that come out of their throats," says Gary Clayton, former chief creative officer of the leading speech recognition company, Nuance. [1] "But pretty much every other speech recognition technology out there is just horrible with kids."

But as he points out, the way today's children use technology will likely dictate the tech landscape for decades to come. If you can get kids hooked on speech technology young, they'll stay with it forever. "Oren's not only building his own business," Clayton says, "he's building speech technology from the ground up."

A Bit of Trickery

When Jacob and Reddy began working on ToyTalk's first app in the summer of 2011, Apple had yet to announce Siri to the public. And while speech recognition technology did exist at the time, the field was far less mature than it is today. What's more, their task was harder than Apple's.

They weren't simply trying to build technology that could understand a question and search the web for an answer. They wanted to build technology that could truly indulge a child's whimsical imagination by holding a sustained conversation.

Kids don't want to ask a monkey character in a game what the weather will be on Tuesday. They want to sing him a song or ask him about life in the zoo. That meant Jacob and Reddy had to build a system that could not only understand what kids were saying, but could also predict what the kids might say, so the characters would always have an answer at the ready.

Developing such technology required a bit of Oz-ian wizardry. In the early days, the founders set up a playroom in downtown San Francisco and invited parents---hundreds of them---to bring their kids over to sample a mockup of their app. While the kids played downstairs, Jacob and Reddy would run a Skype call to a room upstairs, where, unbeknownst to the kids, they would carry on conversations in the voices of the characters. "We were basically doing live improv for kids, which is exhausting," Jacob says. "After 40 minutes, we'd be on the floor twitching."

After a few months, the founders covered their video feeds from the room, so they could only comment on what they heard, and not what they saw. Then they cut the Skype audio too, sending whatever the kids said off to a third-party speech recognition engine. The people upstairs would then respond to what they read on the raw, and often cryptic, transcript from this engine. Finally, the founders wrote every conceivable response they could think of on Post-it notes, lined the walls with them, and restricted their responses to only what was on the wall.

Once that was going smoothly, they took the final step, using their extended research to build PullString and remove the human intermediary altogether.

Learning on the Job

What they learned is that the speech rec technology needed to be more accurate than standard engines. As Clayton explains, kids' voices are higher and ever-changing. Their sentence structure is unpredictable and, at times, chaotic. They draw out vowels and fumble certain sounds altogether. Today's speech recognizers, he says, just don't have room for such variety.

While ToyTalk uses existing third-party technology for its raw speech recognition, it works with those partners to develop better recognition models using ToyTalk's own data. Now, ToyTalk has a trove of some 20 million children’s utterances, which Jacob believes is the largest database of kids' conversation in the world. The data is anonymized, and parents must give their consent via email before kids can play, but once they do, that data belongs to ToyTalk. The more kids play, the bigger that trove becomes and the smarter PullString gets.

At the same time, the company needed an automated way to respond to what the system was hearing. In the end, they hired a handful of writers to create massive volumes of dialogue, penning several possible answers to every question. For instance, if one character asks "What's your favorite ice cream flavor?" it must have a different answer prepared for each of the five flavors a child is most likely to name.

But just as important as predicting the right answer to a question is knowing what not to talk about. A fairy should have a lot to say to a kid about ice cream. Not so much the airstrikes in Syria. "Virtual assistants are awesome when they can answer every question. In our case, it's the opposite," Jacob says. "I have to know a lot of things that I’m not able to answer, and redirect the conversation to something that is within character."
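To make the idea concrete, here is a minimal Python sketch of that kind of scripted dialogue: a handful of pre-written replies keyed to likely answers, plus an in-character redirect for everything else. It is only an illustration under stated assumptions, not PullString itself. The flavors, the character lines, and the names `SCRIPTED_REPLIES`, `REDIRECT`, and `reply_to` are all invented here, and simple keyword matching stands in for whatever matching the real system uses.

```python
import re

# Hypothetical script for one prompt ("What's your favorite ice cream flavor?").
# The flavors and lines are invented for illustration, not taken from ToyTalk.
SCRIPTED_REPLIES = {
    "chocolate": "Chocolate! That's what dragons eat after a long flight.",
    "vanilla": "Vanilla is a classic. Unicorns love it too.",
    "strawberry": "Strawberry! It matches my wings.",
    "mint": "Minty fresh! Just like a mountain breeze.",
    "cookie dough": "Cookie dough? Now I'm hungry.",
}

# In-character redirect for anything the script does not cover.
REDIRECT = "Hmm, I don't know much about that. Want to hear about my cave?"


def reply_to(transcript: str) -> str:
    """Return a scripted line if a known keyword appears, else the redirect."""
    # Lowercase, replace punctuation with spaces, and collapse whitespace so
    # a noisy transcript still matches the script's keywords.
    text = re.sub(r"[^a-z\s]", " ", transcript.lower())
    text = " ".join(text.split())
    for keyword, line in SCRIPTED_REPLIES.items():
        if re.search(rf"\b{re.escape(keyword)}\b", text):
            return line
    return REDIRECT


if __name__ == "__main__":
    print(reply_to("I like CHOCOLATE!"))
    print(reply_to("What about the airstrikes in Syria?"))
```

In this sketch the second call finds no keyword and falls through to the redirect, which is the behavior Jacob describes: the character steers back to something it can talk about rather than attempting an answer.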

The Knock-on Effect

But what really attracted the company's investors was how well the speech rec system could learn. They're betting that all this data will soon become a valuable asset throughout the media and entertainment industry.

"We're seeing a lot of demand from all the usual suspects saying: 'We have all these characters and we know mobile is where all the action is, but we don’t have the perspective or the platforms you’ve developed,'" explains David Sze, a partner with Greylock Ventures, which has contributed to ToyTalk's $16 million in venture funding. "What they’ve built is a platform for massive scale, and there's so much demand for that right now."

Clayton agrees: "I've been in the speech business a long time, and I don’t mind going on record saying I think that kids' speech is going to become extremely valuable. It’s hard to do, and these guys are really the first, the best, the most." And Jacob says some toy companies are already testing PullString to power apps based on existing characters.

But all this emphasis on PullString's potential ignores the fact that the ToyTalk team, which hails from Pixar, Disney, Zynga, and Apple, among other places, has also built some pretty neat games.

A World of Conversation

On SpeakaLegend, characters not only respond to what kids say, they respond to the things they touch on screen as well. If, for instance, a child tickles a character's belly, it might trigger a different reaction. And the characters have attitude, which is a more technically complex challenge to pull off in real time than it might appear.

Not only does the system have to understand what the kid is saying well enough to generate a logical answer, it must also change the character's physicality depending on the answer. "Does the character pause? Does he interrupt you? Does he slow down?" Jacob says. "As a form of character entertainment that’s part of what we have to think about. It hopefully makes them appealing enough that you talk to them more."

So far, that strategy seems to be paying off. At a time when the typical mobile experience lasts a few minutes, if not seconds, Jacob says kids are averaging 45 minutes of playtime on ToyTalk's games. With parents' permission, the company even posts some of those conversations on its website. Warning: cute stuff ahead.

What Jacob says excites him most is the fact that this technology could give kids a whole new way to play that falls somewhere in between the playground and the imaginary friend. "I think at some deep level if we succeed, we'll inspire the imagination of kids to talk about things they might not otherwise talk about," he says.

Still, he knows that ToyTalk's future, or at least the future he imagines, depends on convincing other companies to adopt PullString on their own and capturing that market before the bigger guys get there. "ToyTalk is most successful if going forward a whole lot of kids are talking to a whole lot of characters. I hope a bunch of those are our characters and a bunch are other people’s characters, too," he says. "I want to see a world full of conversation."

1. Correction 09/25/14 12:16 PM EST An earlier version of this story mistakenly stated that Gary Clayton was chief operating officer, not chief creative officer, of Nuance.