For several years, the emergence of voice recognition has intrigued many with its potential to transform learning environments.
At the same time, skepticism has surfaced. People have struggled to envision the use cases for voice recognition tools. They’ve also seen the challenges that arise when children speak to Alexa or Siri at home, and the frustration that sometimes results when a child just isn’t understood.
Getting voice recognition right is more important than ever, given the rise in the use of technology in education brought on by COVID-19 and, in many cases, the absence of in-person adults ready to guide students.
Patricia Scanlon, an expert in speech recognition and artificial intelligence who studied signal processing engineering, saw this challenge—and opportunity—first-hand when she observed her daughter’s own interactions with educational apps. That experience, combined with Scanlon’s research, resulted in SoapBox Labs, a Dublin, Ireland-based startup that is taking on the incredibly complex challenge of making voice technology work for children. Advances in artificial intelligence have enabled engineers and researchers like Scanlon to train voice technology to anticipate the complexity of children’s voices, which are, in so many ways, more highly variable and complex than those of adults.
I had the opportunity to speak with Scanlon to learn more about voice technology and why it could be the next step change in education that teachers and parents are seeking.
Michael Horn: SoapBox Labs has done some really important work around voice recognition for children, and as someone with two six-year-olds in the house who talk to Siri and other devices, I know the importance of tailoring to that age and making sure that you build tools that can understand them. You’re obviously a global expert on this topic, so how did you get into this field of voice technology and working with children specifically?
Patricia Scanlon: I’m an engineer by training and have a PhD in speech recognition. I’ve held research posts at Columbia University and IBM Research in New York and then at Bell Labs for seven years after my PhD. Then back in 2013, I was observing my own daughter interacting with educational apps, particularly ones tailored to emergent readers that teach letter sounds, phonics, and decoding. I started quizzing my daughter and realized she didn’t know the answers to basic questions about what she had learned. After more probing, I realized the apps really only assessed her passive receptive skills, not her expressive skills. My daughter was answering multiple-choice questions and so wasn’t really being tested on her ability to recall or pronounce words. Having spent my whole career in speech recognition and artificial intelligence, it was quickly very obvious to me what was missing. If you’re going to truly have that personalized learning experience in reading or language learning, you need voice technology that listens to the child. So this all came from me as an engineer identifying a gap based on my own daughter’s experiences at a time when voice assistants were starting to take off, but voice technology was not really embedded in any products, and certainly not any products tailored specifically for children. I realized there was a massive gap and nobody was focused on it, so that’s when I founded the company.
Horn: From a Disruptive Innovation perspective, voice technology started in simple applications like interactions with phone operators. Now, obviously, it’s gotten a lot more complicated, and one of your key insights is that children’s speech patterns are actually very different from those of adults—and so you need a very different solution. Can you dimensionalize that problem a little bit more for us?
Scanlon: Kids’ voices are physically very different. They have shorter, thinner vocal tracts, smaller vocal folds, and an immature larynx. And anybody who has worked with kids knows it’s not just a physical thing; their language and behaviors are different, too. If an adult wants to engage with Alexa or Siri or Google, they know what they need to say to get the best possible response. Kids don’t think like that, and the system starts to break down when it can’t handle the unpredictable things kids say. The implications of this are incredibly important for education. You don’t get away with a canned response in education. So we felt deeply that we needed to build a proprietary system from the ground up, just for kids.
Horn: So we aren’t really talking about voice tech for kids as a single use case. This problem requires a very different system altogether. Can you talk about SoapBox Labs and the solution that you built to start to solve that problem?
Scanlon: We felt like this was a huge problem that would take many years and a huge amount of expertise to pull off, so we didn’t even try to develop a front-end content piece; there are so many amazing education resource providers who already do that. We stuck with what we are good at, which is the speech recognition/machine learning/AI piece. We then licensed our voice technology to third parties to integrate into their products and services. So SoapBox Labs is like the oil in the edtech product engine. And we always knew we didn’t want to just build a mediocre product. Before SoapBox Labs, the accuracy of the voice technology kids used was just abysmal and concerning. If you tell a kid they’re right when they’re wrong, a false positive, that’s not educational; that’s reinforcing errors. Equally, telling a kid they are wrong when they are right, a false negative, damages confidence and frustrates them. Both errors can do more harm than good and screw up the whole feedback loop. And if you lose faith in something that’s scalable, that’s catastrophic. So we set ourselves a fairly high bar and focused on ages two to 12. We now have an amazing team with over 130 years of speech recognition and computational linguistics expertise in the company. People love the novelty of AI, but you need to hold back and invest a little bit deeper to make sure you’re getting a quality of experience that is on par with human evaluators. That’s what’s key. We took the time to build this technology right.
Horn: It sounds like the strategy is to partner with all the education technology providers out there so that you’re the voice tech inside of their solutions. Are you doing anything that goes directly to parents?
Scanlon: We don’t go directly to parents or even to schools or districts. We work directly with the education technology experts, and that’s actually been really enjoyable because you get to power so many different products across a range of different industries, from speech therapy and dyslexia screenings to English language learning to formative and summative literacy assessments. We are able to touch all parts of the child’s reading journey as well, from decoding to fluency and comprehension. We do know that teachers don’t have the resources to be that human evaluator who can sit with the child as they read aloud all of the time. It requires human power that is just not scalable. But if we can help teachers and parents by letting the system listen to, encourage, and correct the child, and invisibly assess them all at once, it takes the stress out of the situation. Kids don’t know they’ve been assessed, and it becomes a playful and engaging experience. You get amazing data out of it that can be surfaced to the teacher. It can also be surfaced to the parent, and interventions can happen more quickly. They say it takes four times longer to intervene when the child is age eight than it does with a four- or five-year-old. If you can reach children early and catch them up with their peers, that has an incredible impact that is tangible to the parents and educators.
Horn: One of your partners is Amplify, formerly known as Wireless Generation. Their original product was the simple handheld device that teachers would use to listen to students reading and figure out where they were making mistakes. It sounds like in some ways you are now powering an automated solution for Amplify to be able to do that sort of work. Is that correct?
Scanlon: We’ve loved working with Amplify, who helped move literacy assessments from egg timers and pen and paper to Palm Pilots, allowing teachers to do assessments a lot more quickly and easily. We believe that voice technology is the next step change in assessment. So when COVID hit, we wanted to do something together to help teachers right away. The voice-powered assessment tool Amplify built, powered by our voice technology, allows kids to read a passage so the system can assess each kid individually, and remotely if necessary, and share that data with their teachers. The feedback surfaced to teachers allows them to quickly go through the data, see the errors, and identify each child’s reading level and their potential learning loss throughout the pandemic. Voice technology is a game-changer because you can assess frequently while kids are practicing their reading. It doesn’t have to be a separate activity.
Horn: We see a lot of hype right now around having Alexas, Siris, and Google solutions in school classrooms. How does SoapBox Labs differ from the other approaches that are out there around voice technology?
Scanlon: SoapBox Labs is voice technology. Alexa, Siri, and Google are voice assistants, meaning they are built to serve a very particular need, which is to be a virtual assistant. While they are great for asking general knowledge questions and giving predictable commands, they’re not customizable to different content. They’re also not designed to do pronunciation or fluency or comprehension assessment. Voice technology is so much more than an assistant. Back in 2013 when we were starting, we were still having to educate people about why voice technology is important for education. From 2015 onwards, voice assistants’ prevalence broke open the markets and normalized voice technology, which really helped us.
Horn: Sometimes people look at voice technology in education and it sounds like a nice gadget or something that’s simply nice to have. But as a teacher, maybe I wonder if I really need it. Can you talk more about why voice is actually core to learning?
Scanlon: Voice is the natural interface, and I think that’s really key. As voice technology accuracy improves over time, it makes sense that we should use our voices more and more. Most of us use technology and touchpoints throughout the day in which voice as an interface makes more sense, particularly for young kids who are pre-literate, where the alternative is buttons on screens and menu systems that they can’t read. So people have been trying to design around the fact that the child can’t read; now voice technology just cuts through all that. And then you can start having educational conversations with kids—“What’s two plus two?”, “How many legs does a horse have?”, etc. You can introduce vocabulary, you can introduce questions, they can listen to a story, and you can assess their understanding. The current voice-first generation doesn’t have the inhibitions we have around voice, so we can do that very rigorous assessment and it will unlock interaction.
Horn: What are you most concerned about as voice technology spreads in education applications? And what are you most excited about?
Scanlon: My concerns are that people will view voice technology as only voice assistants and deem it not that useful to them in the bigger education picture. The privacy aspect also has to be surfaced. Users want to know that kids’ data is secure and safe. Transparency in data usage is really key. We’d like to see this technology taken offline, and I think that’s very possible in the near future. If you think about all the use cases for a child—how many of them really need open access to the internet? They’re not ordering rideshare services or checking out sports scores. They are looking at content and having fun, engaging voice experiences. Such voice interactions do not always need to go to the cloud for processing. I think that’s where this technology is going in the future and that’s really exciting because then internet access will also be less of a barrier, in addition to addressing concerns around privacy. We can do this in a powerful and cost-efficient way, and that’s where you get scale.