Episode 89: The Future of Text to Speech with David Vazquez
Hi, I'm calling to book a women's haircut for a client. I'm looking for something on May 3rd.
Give me one second.
Sure, what time are you looking for around?
What was that?

That was an artificially intelligent entity interacting in a very specific domain with a human being, ostensibly with no way for that person to know they were speaking with an artificial entity.

And why do you think that's so significant?

I think it's significant because it's very impressive to see how far these systems have come in terms of mimicking human behavior, sounding natural, sounding conversational. I also think it's notable, and it was noticed by many people, that the human being on the other end wasn't aware that they were speaking with this entity. It's one thing to interact with a natural-sounding voice user interface or an assistant that you're aware of; you're given the opportunity to suspend disbelief. In this instance that opportunity wasn't there, so the person on the phone wasn't aware of it at all. I think that's the notable part. We're at a point now where most people understand the implications behind the AI we live with; we interact with these systems in our daily lives, so we know.
When you hear a voice like that setting up a haircut, it's impressive and a little frightening. Today I'm excited to welcome David Vazquez to the show. As you may have gathered from our earlier exchange, we're talking about text to speech and what it takes to make it effective. Text to speech has been around for a long time, which is probably why we've accepted that it sounds kind of crappy, but these days it's improving at such a rapid pace that CEOs need to be thinking about how they deploy text-to-speech technology. Plus, we'll hear from David about where he sees this technology going in the future.
I should also point out that since we recorded this conversation, Google has announced that it is addressing the issue you heard David raise a minute ago: that Duplex ought to identify itself as a bot. It will start every conversation by introducing itself as Google's automated booking service, and, in response to privacy concerns, it will address how it uses those calls to collect all kinds of interaction and conversational data. We've got a great show today, so stick around. I'm Jon Prial. Welcome to the Impact Podcast.
Hi David, welcome to the show. Thanks for being here.

Nice to be here.

Would you tell us a little bit about yourself?

I'm a text-to-speech voice designer at Google. My background is in speech, audio interaction, and language. I got my start about 15 years ago as a composer and sound designer for educational games and toys for children, and it was during that time that I became immersed in dialogue design and voice acting, at the intersection between voice, sound design, and music.

You know, what stood out for me when I heard that demo was that this entity just seemed so human, and the last thing I was expecting from a computer system was natural-sounding speech. You've been spending your life around this technology. What did you hear?

Well, I definitely heard what I see as some of the most important pieces now in voice interaction, in speaking with text-to-speech systems, and that was the nonverbal cues, which were predicated on real understanding, as opposed to just transcribing sounds into text, and perhaps an emotional connection. In the kind of service interactions we have with human beings, they acknowledge what we're saying, and we know that they're working on our problem or conversing with us. This told me the system understood what I was saying and was acknowledging it in a very human way through nonverbal cues; the "mm-hmms" are representative of that.

And you mentioned it was just a demo, and the person answering the phone at the hair salon did not know that this was a bot. You also used the word "entity"; I feel like I'm in a Star Trek world. Google has said the assistant will announce itself as the Google Assistant, to get rid of that creepy factor, which I think is probably the right answer. As we go beyond Google Home and Amazon Echo, I assume these assistants will begin to interact with the users of the system as this entity did. So do you expect to see them evolve as they add more of this natural level of communication, or do you think it will remain more command and response?

I think it'll be more and more natural. The kind of work that I've done and am focused on now, at a high level, is precisely to imbue these systems with more human qualities: to understand emotional responses and respond in a more emotional way. I think that part is great, and I think we want to continue doing it. Whether it's Amazon's systems, Siri, or the Google Home and Google Assistant speakers, I think it's all fair game. In my experience, the more these systems sound like human beings, whether they're reading books or just sounding natural and conversational, the better, so long as you understand who you're interacting with.

I think you and I are really talking through the journey we're going to be on as this technology continues to weave into more and more of our lives. I think there are two sides to this: we have to understand how humans are going to react and learn from that, and then we have to design these systems to react in certain ways as well. When I thought about how the device, the entity, was reacting to the receptionist at the hair salon, it was how things were said, and I felt, on both sides, some empathy. As you talk about systems becoming more and more human, how does one begin to think about building a system that has a little more empathy? How do you build the words and the intonations around it?

When you start talking about empathy in conversation, in speech, which is what we're talking about here, it's really important to understand the intent of the user in an interaction and to be able to offer a response in the right style, with the right tone of voice. When you're speaking with somebody and you say, "Well, I want this haircut, I want this booked," you put a little bit of emphasis on a word, or maybe you raise your voice, or enunciate a bit more, or make your voice a little louder. All of these things carry a richness of information that could be considered nonverbal, emotional communication. Those are the things that are important to communicate in synthesized voices, and that is what has been lacking, obviously, because of technological constraints. That's why we had very robotic-sounding voices. The voices now, the product of the next generation of machine learning, are capable of doing just those things: of speaking as if there's a smile on their face. There's no face, but there's a smile.

That's a great analogy; I love that. I'm a big audiobook fan. I did a lot of commuting in my early days and was voracious in listening to books, and the way I summarize what I'm listening to is this: I'm not listening to an actor read a book, I'm listening to an actor perform the book. I was feeling their interpretations of the words, even hearing the same actor change voices when things were in quotes in the book. In terms of reading voices, how far away are we from seeing that?

I don't know if I could put a timeline on it. I've been just blown away at the rate at which the audio output component, the voice and the experience of telling the story, has come along. But that level of storytelling involves true human understanding, true comprehension: the ability to understand what you're reading and how you want to communicate it to your audience. Those are purely human qualities. With audiobooks and storytelling, we're not anywhere close to having these systems understand what they're reading and understand their audience, but we have to be able to make the voices reading those stories sound as if they did. That magic is part of prosody, and there is an understanding there; there is natural language processing and understanding.
You understand parts of speech, or there are named entities in the text that can be correlated to different components of the story or of the utterance being said. So there's an understanding, but that true human understanding, that spark of storytelling between the teller and the receiver, is so human. We're not there yet.

But it really does a lot, and it can interpret well. When I listen to a sentence with or without the comma, it sounds extremely different. I know it's more of an academic exercise, perhaps, but there's a phrase I read somewhere: "Don't desert me in the desert." If you don't get that right, you just lose the audience completely.

Right. The proper pause, understanding contrastive elements in a sentence: simple things that are actually relatively straightforward to do with these systems make a huge difference in how that content is consumed by the human who is the audience listening to the stories.

I've listened to a few audiobooks this way, and I can say that I've enjoyed it, but it's not there yet. It reminds me of voice recognition, because it's hard to measure whether you got all these intonation elements right. I don't remember the exact number, but it was something like: at only 94 percent recognition accuracy, users won't use a system; you've got to get to 97 or 98 before users will use it consistently. You need a measurement, because if it doesn't work, people won't use it. It's almost the same story here.

Yes, yes, it really is. Until you have that sentient entity who can read you your favorite audiobooks, there have to be ways to design the experience so that people are still receiving information, are still delighted, and are willing to listen.
A project that I worked on when I was at Nuance was a news reader that basically read you any web content you chose: you curated the content, and it was served up on a daily basis. One of the challenges we had was how to do this without torturing the listener, because the systems we were developing at that point were pretty good, but they weren't anywhere near what we have now. So we thought, why don't we couch this in a narrative experience that we 21st-century human beings are all accustomed to, which is the radio or podcast experience? Why not put music underneath, have segues, and read the top of the articles? We did that: we had a music engine that would algorithmically create music beds underneath the stories, and using text analysis it would generate music consistent with the tone of the article, whether upbeat, a bit more somber, or neutral. It would couch the experience, and at least in theory what we were doing was moving a little bit of the focus away from the fact that you were listening to a TTS voice, away from the inconsistencies in the synthesis. If there was a dropout in the stream, the experience continued, with music playing until the voice came back.

So I was listening to the news, listening to whatever I wanted to hear, and the design made those elements invisible; I was just taking in information. To go back to basics, to help our audience understand: Siri was a real person. I've seen Susan Bennett on YouTube, so I understand that as you produce these entities you need recorded sounds to build the speech. I think I understand now that you can't have an infinite list of pre-recorded responses to pick from; you have to build them. How has that evolved? How long did someone have to be in the studio, and how much did that person have to say, versus what you need now?

There have been improvements in the amount of data that would be necessary, but because of the level of naturalness and expressivity that we're going after, we're still in the process of collecting a bunch of additional types of data that we weren't collecting before. So I think it's still reliant on a large amount of high-quality data. We're seeing improvements with the technology today at many, many companies, where small amounts of data are yielding very remarkable replications, or clones, of the voices being sampled and modeled, but none of those are really production-ready. The standard, the old way, what I would call the Siri style of unit selection synthesis, required in the vicinity of 15 to 20 hours of runtime audio, which might mean 10,000 to 12,000 utterances. The aim is to capture as many as possible of the necessary elements in a given language, in this case English, phonemically, prosodically, and intonationally, so that you account for just about all of the possible intersections and combinations required to stitch together an utterance. What unit selection literally does is create a ransom note of sorts: it learns by using linguistic features, text, prosody, and all of these different pieces to understand what is being said, and then stitches together a response.

The newer systems take an entirely different approach. They're basically taking raw, unstructured data, learning from those observations, and then providing an interpretation of what that data is. They still require the same amount of audio data to work, large amounts of high-quality audio, but rather than using it to reconstitute, to stitch together and spit out a sentence that you need, the system observes the data and then says, okay, based on these observations, here's what I think you're looking for, and synthesizes that utterance.
So you'll still need people in the studio to give you this rich data set that builds really quality responses and voices. For the purpose of this question, let's call that "sophisticated." We still have IVR systems that might have some simple pre-recorded responses, and maybe the caller gets routed to a human along the way, hopefully seamlessly. What's your sense, then, of which applications will be okay living with just basic systems, and when will we need some of those sophisticated systems?

I think the days of IVR are over, in that those systems are very constrained to the domain you're looking at in the moment, and obviously they're static recordings. If there's any change in verbiage or nomenclature or structure, anything, you need to go back and re-record. What text to speech offers is an elastic and flexible system that will essentially generate any utterance that you need it to generate.
IVR as a whole, and this is my opinion, is going away, because the level of sophistication and the level of naturalness and expressivity of synthesized speech is far superior now. As for "production-ready," I was referring more to some of the incredible experimental work you're seeing in research papers, where they take maybe five minutes of a user's data; that kind of thing is still not ready. But in terms of high-quality, statistically modeled or neural speech synthesis systems, those exist now, and some of them are in production. They do require a large amount of high-quality data, and they require some sophisticated processing and hardware.

It's hard to remember, this was years ago now, but I think it was the American Airlines voice response system, and I do remember the system saying, "Let me get back to you." It was trivial, but it put, well, maybe it didn't quite put a smile on my face, right?

Right. Those systems were from before we got really, really deep into TTS. When I first joined Nuance, the mission was to build text to speech, but really it was in the service of supporting the enterprise platform. What we were building were airline and banking systems; we built the US Airways system, and it was about humanizing that as much as possible. Since we had the freedom to record these prompts, and they were constrained to a particular context, you could be expressive and you could be friendly. You could do everything possible so that users would want to stay on the phone, feel like they were talking to someone very natural, not want to hang up, and get the answers they needed. That was as much natural speech as it was natural language understanding.

When a CEO is getting started and needs to worry about voice, what type of team should they be putting in place?

I think you're going to need a group of machine learning specialists. You're going to want people who can prepare this data, this unstructured data, for use in generating voices. You're going to need linguists during the acoustic data capture component. You're going to need conversation designers. If your company is interested in providing not just a platform but services, you're going to want dialogue writers too. And you're going to want specialists who are focused on the voice itself. I say that because that is my job: I'm sitting at the crossroads of design and production, research and engineering. You need somebody focused on the voice experience itself, on the voice as the product, because to me that is the foundation for everything that comes after, whether you're building for Amazon's assistant, a phone app, or an in-car system. You need people who will design those experiences, but you also need people paying attention to the voice, and I think the voice is the first principle. Will people hear that voice and be turned off by it? Are they going to want to engage with it? Are they going to like how it sounds? Beyond the vocal qualities, are they going to want to hear other voices that they can talk to?

This is an evolution to me. I often think that the most important person when a company builds a product is the product manager: the one who can bring it all together, get the tech team engaged and the business team engaged, understand the end product, and get the consumer needs serviced, to really help define the product. I never considered, until you talked about somebody with these skills, that the role of the product manager might evolve, that there might even be a voice product manager. As the tools become available through all these large companies and the machine learning platforms they're building, is it fair to say that developing a voice application calls for some sort of voice PM role, in charge of the idea, the strategy, and the business opportunity, who can then come together with a team of conversation designers and engineers to build these voices on these platforms?

I think that is really an evolving role, and one that somebody is going to come and fill.

And I'm thinking of script design and managing the conversation, so that's a step under that voice product manager, someone focusing just on what those conversations might look like?

If I'm thinking about what I do, it's really focusing on the voice and the persona, and focusing a large part of it on the data design: what kind of data is going into these systems? In the past, the only requirement for the data one of these systems used was simply, is this phonemically rich, so that I can constitute any possible combination in the language, and can we feed it into the system? The answer was yes. Now, the data that you provide for these systems to learn from, to train on, or to stitch together content from, is really going to make the difference. It usually is going to be focused on the domain of the product. You can either look for a bunch of obtuse, phonemically dense Wikipedia entries that would fill that requirement and get you all the pieces you need to build a voice, or you can start looking at the interactions and product dialogues where that voice is going to be applied. If you're going to be making an audiobook voice, when you gather data you want it to be germane to the actual output. That's a huge part of developing a more natural-sounding voice.

And the data science and machine learning team has to have its learning engine understand what's coming in, to then produce the right reaction from the voice, so there's that interaction as well, understanding what's coming in?

Yes, yes: being in the context that these systems will be trained for, and then, if there's any additional information or annotation required for those contexts, specifically for audiobook narration, making sure that the ML people know that that's the kind of data you're feeding the systems and the kind of output you're looking for.

Right, and it has to do with the persona: what's your company image, how do you want to represent yourself? That links back a little to the voice talent. You're not going to generate a voice out of thin air; you're probably going to hire somebody and say, this is what I want, or maybe a couple of somebodies, a couple of different accents or voices or genders. You might have a few different voices, but a small number, tied to your corporate strategy?

Yes. As a brand or a company, you're looking to develop a relationship with users, so you're designing this interactive agent, this companion, that is going to encourage repeated usage and provide answers for users in a way that is very simple and easy. And you also want to make sure that the kind of experience you're delivering feels intimate, feels personalized.
Well, you talk about keeping them engaged, about making someone happy. Is there a different set of measures, or will they be the same measures we have in a classic marketing automation system, like dwell time and whether they're staying on your page? Will it be a different set of measures, or maybe an additional set?

Certainly. Metrics like retention, or containment, which I think was the term in IVR, for making sure that people stayed on the line and worked through their experience without resorting to an operator, are still important. But more and more, as we get into these much more interactive, much more natural, human-like systems, I think people are looking to feel good about the experience, about the voice they're speaking with. In some ways, I think they want to enchant the system itself. I've noticed that they want to be liked by the assistant, by the interaction, as much as the other way around, and they're looking to make a connection with that system, much more than just, okay, I asked for A and I got it, I'm happy, we're done. When you start introducing much more emotive and expressive qualities, humans react and respond to that. Of course they do, and they're looking for something a little more; they're looking to be engaged by the experience.

Very cool. Any thoughts as we look out into the future? What might be the next app that people haven't thought about? Where do you think we're going to see voice go?

I have been a huge proponent, and this is what's coming with this technology, of personalized speech. I think the future is moving away from these monolithic personae that represent a brand with one voice. The technology has allowed us to develop a multiplicity of voices with less data, or at least much more quickly than before, that are much more expressive. I personally see a future where you can have voices made in your image, or your friend's image, or in ways you'd never have imagined before, and see those represented in text messaging, or in your chatbots, or translating your voice into a different language when you're traveling. The opportunity for customization and generation is tremendous, and I think that's where the technology seems to be pointing. That's really where the exciting stuff is happening.

I guess, if that's true, somebody could take your voice or my voice from this podcast and create a voice out of it?

I do think so. For better or worse, I see no reason why it wouldn't happen.

And that's interesting and concerning, but technology always has two edges as it rolls out. There's always good that comes and bad that comes from each of these things. Coming back to those two edges: you opened up early on by saying you're listening to this voice system and you want to feel like it's smiling at you, and then we flipped it later in the conversation, that it's talking to me and I'm smiling back. I think we are really building a new set of emotions and recognition on both sides: awareness that I'm a human, that this entity is not, and that that's okay. Companies have quite the journey ahead, and it should be great. I'm looking forward to what the future holds, and I know you're doing a lot to drive it. David, thank you so much for being with us today.

Thank you for having me.
Thanks for listening. If you like what we're doing, we'd appreciate you telling other people about the show. Better yet, give us a five-star rating on iTunes or write a review; it will really help more people find us. And if you haven't already, please subscribe to the Impact Podcast on iTunes, SoundCloud, or wherever you go to find your podcasts.
Earlier this year, Google Duplex garnered lots of attention when it shared its demo of an interaction between a very life-like bot and a human being. It was an impressive display of the potential of text-to-speech applications. Although text to speech has been around for a long time, these days it’s improving so quickly that getting it right is going to be critical to keeping customers happy. In this episode of the Impact Podcast, Jon Prial welcomes David Vazquez, who until recently was a text-to-speech voice designer at Google. Together, they discuss what CEOs need to be thinking about to deploy text-to-speech technology and where it’s headed in the future.
You’ll hear about:
- Interesting applications of text-to-speech technology
- The ethical implications of not identifying a bot as a bot
- The importance of empathy in text to speech
- How to make a text-to-speech system effective
- How to deploy a system effectively
- Trends for the future of text to speech