Thoughts on Web Speech API

Last week I spent time working with the Web Speech API for a hackathon at Khan Academy. Here are some quick lessons learned from my experience:

Web Speech is a weird partnership between browser and OS. There are similarities across browsers on a single operating system, and across operating systems with a single browser. Make sure to test as many browser/OS combinations as you can.

Voice has a big impact on how text is read. I was experimenting a lot with math expressions, and I found some voices would read with correct math terminology while others would not. For example, on macOS the default voice (Samantha) reads 3x × 5 as “three times ex five” while Gordon reads it as “three ex times five”.
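Switching voices comes down to setting `utterance.voice` before calling `speak()`. Here's a minimal sketch; the `pickVoice` helper is my own, and the commented-out lines assume a browser environment where `speechSynthesis` exists:

```javascript
// Hypothetical helper: pick a voice by name from a voice list,
// falling back to the first available voice (or null if the list is empty).
function pickVoice(voices, preferredName) {
  return voices.find((v) => v.name === preferredName) || voices[0] || null;
}

// In the browser (not runnable outside one):
// const utterance = new SpeechSynthesisUtterance("3x × 5");
// utterance.voice = pickVoice(speechSynthesis.getVoices(), "Gordon");
// speechSynthesis.speak(utterance);
```

The fallback matters because voice names are not portable: "Gordon" exists on macOS but not on Windows or Linux, so a hard-coded name will silently fail elsewhere.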

The default voice on Ubuntu was really bad. Most of the voices on ChromeOS, macOS, and Windows were fine, but Ubuntu was almost unintelligible. I didn’t play around with other voice options, but it might be a showstopper if you have a significant number of Linux users.

Sometimes speechSynthesis.getVoices() returns an empty array. I couldn’t find any official documentation about this, but some articles online suggest that the browser populates that list well after the initial page load. If you are trying to populate a select field with voice options, make the request for voices as late as possible.
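One pattern I've seen suggested (not official documentation, just a common workaround) is to wrap the lookup in a promise that falls back to the `voiceschanged` event when the first call comes back empty. Chrome fires that event once the list is ready; some browsers populate the list synchronously and never fire it, hence the up-front check:

```javascript
// Sketch: resolve with the voice list, waiting for "voiceschanged"
// if the browser hasn't populated it yet.
function getVoicesAsync(synth) {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    synth.addEventListener(
      "voiceschanged",
      () => resolve(synth.getVoices()),
      { once: true }
    );
  });
}

// Usage in the browser:
// getVoicesAsync(speechSynthesis).then((voices) => populateSelect(voices));
```

Even with this, I'd still populate the select field lazily, since I couldn't find any guarantee about when (or whether) the event fires on every platform.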

speechSynthesis.speak() and speechSynthesis.cancel() are unpredictably asynchronous. I think this goes back to the weird partnership between browser and OS: speak() doesn’t start speaking immediately, and cancel() can “swallow” subsequent speak() calls. I needed to use timeouts with a 100ms delay in a few areas to ensure that the utterance would speak when expected.
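The workaround looked roughly like this. The function name is my own, and the 100ms delay is just what happened to work in practice; it isn't documented anywhere that I could find:

```javascript
// Sketch: cancel any in-flight speech, then queue the new utterance
// after a short delay so cancel() doesn't swallow it.
function speakAfterCancel(synth, text, delayMs = 100) {
  synth.cancel();
  return new Promise((resolve) => {
    setTimeout(() => {
      const utterance = new SpeechSynthesisUtterance(text);
      synth.speak(utterance);
      resolve(utterance);
    }, delayMs);
  });
}

// Usage in the browser:
// speakAfterCancel(speechSynthesis, "3x times 5");
```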

The Web Speech API is neat, but it has a lot of rough edges. It’s unlikely that a major user-facing feature would rely on it alone, especially with the rapid developments in AI text-to-speech. That said, Web Speech provides a solid foundation to progressively enhance with more powerful TTS solutions.