Episode 118: The Business Case for Deep Fakes with Descript's Kundan Kumar

This is a podcast episode titled, Episode 118: The Business Case for Deep Fakes with Descript's Kundan Kumar. The summary for this episode is: Deep Fakes are incredibly realistic impersonations that blur the line between truth and fiction. So what happens when the tech to make them is available to everyone? We’re about to find out. Kundan Kumar is our guest on this episode of the <a href= "https://www.georgianpartners.com/the-impact-podcast/">Georgian Impact Podcast</a>. He is co-founder of Lyrebird AI, and now heads up research at Descript. Descript is making it vastly easier for anyone to manipulate audio, even to the point of inserting words that were never actually said. This brings up obvious ethical, trust, and security questions. Fortunately that’s something Kundan and the company are putting a lot of thought into. You’ll hear about: <ul> <li>How anyone can create a voice-double, and the <a href= "https://www.descript.com/ethics">ethical questions</a> that raises.</li> <li>Automating common audio editing tasks like removing ums and ahs, and bleeping swear words.</li> <li>Georgian’s podcast production workflow – and how we manipulate what you hear.</li> <li>How these new technologies are, in many ways, evolutions of existing machine-assistance features like auto-correct, and <a href= "https://www.newyorker.com/magazine/2019/10/14/can-a-machine-learn-to-write-for-the-new-yorker"> Google’s Smart Compose</a>.</li> <li>Maintaining trust when trickery is effortless and skepticism is ubiquitous.</li> </ul> Who is Kundan Kumar? <a href="https://twitter.com/kundan2510">Kundan Kumar</a> is Research Lead at <a href="https://www.descript.com">Descript</a> and co-founder of Lyrebird AI. He is a PhD student at <a href= "https://mila.quebec/en/mila/">Mila</a>, Quebec’s artificial intelligence institute, where he works on generative models for sequences, e.g. speech and music.

Key Takeaways

Transcript

Robot Jon Prial Introduces the Show: Take a listen - it's pretty convincing. 🤖

00:51 MIN

With Descript's Overdub, you can create a voice double so there's less back and forth to the studio. 🌟

00:14 MIN

Is this new deep fake technology any different than stunt doubles or CGI? 🤔

01:14 MIN

What are the ethics of cloning someone's voice?! Descript is leading the thinking in this area. 👏

01:03 MIN

Jon on the experience of creating his own voice double!

00:06 MIN

A new workflow for using voice clones with trust at its core. 👏🎧

00:21 MIN

Warning: This transcript was created using AI and will contain several inaccuracies.

Jon Prial here. We recorded this podcast long before the pandemic hit, but as we all shelter in place, the content of today's show, focused on communications at scale, is relevant, as is the discussion with today's guest on what facts and truth are in the digital age. What interesting times. Before we jump into the interview, I want to thank all of those unable to work at home, be they delivery people, food and pharmacy workers, and especially the healthcare workers struggling so hard to keep us safe. You're working for us all, and we'll do our part for you by staying home. And now I push the play button.

We often discuss the subject of trust on this podcast. That's because new technologies create questions around how businesses interact with their customers and how they choose to use the technologies that are available to them. One area that gets a lot of coverage for its trust challenges is deep fakes. That refers to video or audio that is created to look and sound like the real thing. But what if those same techniques also have legitimate business use cases? What if I told you that this entire introduction has been created by voice double software? Maybe you can tell by my intonation that this is audio created by an algorithm. And today we will be talking to its creator. This is robot Jon Prial, and welcome to the Georgian Impact Podcast.

I think that sounded pretty good. Not perfect, and we could have spent a bit more time tweaking it, but you get the point, right? Today, as you've just heard, we'll be putting a new wave of technology under the lens that will impact your company. It might help you, or it might potentially be a risk to the trust you've established with your customers. Podcasting is hot. I don't know where I'd be without podcasts at home, and I love having the opportunity to host this podcast. So I thank you. And for real, I am Jon Prial, and this is the Georgian Impact Podcast.

I'm very excited to have Kundan Kumar with us today. Now part of a merged company operating as Descript, Kundan was with Lyrebird AI, and he is currently leading the research team there. Coming out of the heart of the Montreal AI community, working with Yoshua Bengio, Kundan and his company brought what we consider (and I should say, full disclosure here, because Georgian is a customer of Descript) some groundbreaking technology to market. Kundan, why don't you tell me about Descript and what you've done with Lyrebird?

So Descript is a company which is creating a word processor for new media. By that, I mean it's enabling editing audio and video as if you were editing text, like in a Word document. A word processor for new media allows content creators to easily edit, record, and make their usually difficult-to-create media very, very easy.

And I'll tell you, as a user I appreciate it, because I used to sit there with playback software. It didn't matter which one it was, and I would say, okay, I really want to cut from 329.123 to 331.14, and it was a pain to get all those marks and everything. So no doubt, from my perspective, just being able to look at the text for editing has been a great time saver for us.

Yeah. So along with the things which already exist in the app, Lyrebird brings some magic into it. For example, before, you could just remove words and move words from one place to another, and it would move the corresponding audio or delete the corresponding audio. But with the new Overdub that we have created, now you can even add new things. It requires training on your voice beforehand, but then you will be able to add new things, so that you don't need to go back to the recording studio again once you have been through the process.

Now, although I think your number one use case is probably a podcast tool, you did talk about new media. Do you see other use cases where this might be used?
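
As an aside, the text-based editing described here (deleting a word in the transcript removes the matching audio) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Descript's implementation: the `Word` type, the `cut_ranges` and `filler_indices` helpers, the filler list, and the timestamps are all made up for the example. It assumes a transcript in which every word carries start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word aligned to the audio (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the recording
    end: float


def cut_ranges(words, deleted_indices):
    """Return the (start, end) audio spans to remove for the deleted words.

    Runs of consecutive deleted words are merged into a single span.
    """
    cuts = []
    previous = None
    for i in sorted(set(deleted_indices)):
        word = words[i]
        if previous is not None and i == previous + 1:
            # Extend the current span to cover this adjacent word.
            cuts[-1] = (cuts[-1][0], word.end)
        else:
            cuts.append((word.start, word.end))
        previous = i
    return cuts


FILLERS = {"um", "uh", "ah"}  # illustrative list only


def filler_indices(words):
    """Indices of words that count as fillers in this sketch."""
    return [i for i, w in enumerate(words) if w.text.lower() in FILLERS]


words = [Word("So", 0.0, 0.2), Word("um", 0.2, 0.5), Word("uh", 0.5, 0.9),
         Word("hello", 0.9, 1.4), Word("um", 1.4, 1.7), Word("there", 1.7, 2.1)]

to_delete = filler_indices(words)
cuts = cut_ranges(words, to_delete)
edited_text = " ".join(w.text for i, w in enumerate(words) if i not in to_delete)
print(cuts)         # [(0.2, 0.9), (1.4, 1.7)]
print(edited_text)  # So hello there
```

An audio tool would then splice out the returned spans. Real systems also have to deal with crossfades and the silence between words, which this sketch ignores.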

If you think about videos, one of the key components there is audio itself. You can represent audio or video as text, and then whatever you can do to edit, for example, cutting a podcast, you can apply similar things for cutting video as well. In the end, you are just transcribing audio there, but then you get the alignment of this text with the corresponding video. Right now, we are starting with allowing users to save time for audio. There is minimal support for video now, but going forward it's not going to be restricted to just audio. It's going to be a full-fledged media creation tool.

Let's put this in perspective a little bit. To some degree, it's just a natural evolution of where we've been pre-technology. In the making of movies, there are stunt doubles and body doubles that the people who watch the movie don't necessarily see. I don't know how long CGI has been around in making movies. Godzilla used to be a puppet, and the USS Enterprise was just a model, probably hung on a string. So that's movie making. Then again, if I think about what you're talking about with text editing, autocorrect is just another case where technology helps. I just have to be a little careful before hitting the send button. So do you see this as more of a natural evolution of all of these historical trends we've had?

Yeah, indeed. What technology really brings to the table, specifically with AI and machine learning, is that it enables you to automate many of the tedious and difficult-to-do things. Voice doubles or body doubles exist because it is difficult for actors to come back all the time. It is more expensive to bring actors back into the recording studio to redo some specific scenes. So these doubles exist to make it easier. Now with AI, for example with Overdub, this voice double is doing exactly that: facilitating redoing certain things that you have done before.

But at the same time, as you said, autocorrect. You said autocorrect for typing text; think of autocorrect for audio. For example, you wrote something, and there were some grammatical mistakes. Now you apply autocorrect first, which corrects the text that is there for that audio, and then you overdub to change the corresponding audio as well. You can just convert all your audio there into a grammatically correct, well-pronounced version of yourself.

Wow, layers upon layers. So it's interesting. In all the audiobooks I've listened to for years now, I've heard words mispronounced, and it drives me crazy. I realize that whatever person was in the studio didn't catch it, and the narrator has long left the building. So a feature like this obviously is perfect to make a correction of a pronunciation, perhaps. But as I think about it, and I think about even movies, these are all scripted, so someone has already written a script. So that's one end of the spectrum, where you know what you want to do. You do some simple scripting, perhaps. As you move up that spectrum, now we're doing some author's correcting and changing of that. But at the other extreme, we have some of these intentionally false deepfakes, where we're doing more than just simple correcting. You've got a lot of policy thoughts on that. Can you talk to me about how that's been going so far, and the response to the policies you've established?

Yes, as you mentioned, it's quite clear. There are basically two sides of this technology. You can use it for enabling better editing, easy editing, making your content better. But at the same time, if it works very well, it can be misused. At Descript, we have some strong ethics guidelines about how we want to pursue this technology. This technology already exists, which means that it is out there in the world, and people or other technologists are going to use it for things that we can't even imagine right now. So it is an important concern, and for us at Descript, being a leader in this technology, it is our responsibility to handle this ethically so that other people can follow the example. We want to be in discussion with leading researchers and policymakers about how we want to handle this. So far, we have decided that only the owner of the voice is able to create its voice double. By that, I mean that we will require an authorization from the user to create his voice double. And it's quite easy to get, to be honest, technically, because you just ask them to speak some random sentence, and then you create the voice only if the recordings match the authorization sentence.

So I should make that clear. I did this, because as part of the intro to this podcast the Overdub feature is used. I had to read this, record this, and send this to Descript. I'm just going to read a couple of sentences, because I think it's fascinating.

"I, the owner of this awesome voice you are listening to right now, give consent to Descript to create a voice double of this voice, based on the project information contained in this Descript project." I go on: "I understand that with my voice double it will be possible to generate speech that will sound like this voice." There are some really interesting points of clarity you're asking me to make. You have me state that I recognize that your employees will have access to it, that you'll be creating some responses for me, that you're going to monitor what I'm doing. I think it's fabulous. We will put a link to your ethics page in our show notes.

Sure. It is very important for us to first let the user know how we are going to create the voice, and really make the process as transparent as possible, to make sure that users are comfortable sharing their data with us and that we are handling it very responsibly.

That's fantastic. Now, I love the example. So right now, not everything is automated, and obviously automation is a huge value. Can we automatically remove ums? At some point, do you see that coming?

So it is already there in beta. Probably by the time this podcast is published, you will have access to that feature.

We're looking forward to that. And then I'm thinking about automation and broadcast on radio or broadcast television, where there's somebody sitting there with a finger on a beeping button in case somebody says a curse word. Could you end up at some point, in real time perhaps, catching it and getting rid of curse words?

So yeah, almost real time. Of course, if you're using Descript for creating content, this can be done automatically for you. Since we do it in text, it really makes doing these things very, very easy. Ums and ahs are implemented like that. Similarly, any curse words, as you said, can just be identified and then be marked like that.

So it may not be real time yet, but maybe that's a place you'll be in the future.

Yes.

So I really do think you've nailed the ethics piece. I think you've made it clear and understandable. You're very honest about the positives as well as the potential pitfalls. One other thought, though, as you think about this, and again, this might be getting out into the future: putting security first is much more than just tech, and there are always human factors. So we've got authorization of my voice, but there's obviously a team that produces these podcasts. Have we thought about examples yet where my voice might be changed by an unscrupulous employee?

Yeah. We have been thinking a lot about how to allow sharing access to voices. For example, you recorded your voice and created your voice double, but now you give it to one of your employees to edit with it. One of the ways we are thinking about is like "suggest changes" in Google Docs. Your employee is going to make the changes, but the audio is not going to be automatically generated. They are just going to be suggested changes. These are requests for generating overdubs from your voice. Then it comes to you, and you just need to accept or reject, and then the corresponding audio will be generated.

That's perfect. It makes a lot of sense. This is something I've never done: I want to talk to you about the Georgian Partners use case here. Here's what we do. We do a lot of post-production, and we do remove ums and spaces, and I'm happy that will be automated. We allow guests to stop and restart questions. We don't believe that's unethical. Sometimes, if a guest goes on a bit too long with an answer, we might insert a comment from me in the middle to break up their response, keep the dialogue going, and allow them to go on. I don't think that's wrong. But I'm just telling you what we do, and this has never been told publicly before, so this is the first time it's going to be out in the wild. We also don't believe this is wrong: if I ask question A and my guest answers question B, that probably means they had good press training. I was trained to do that. Then if I like answer B, I'm just going to go back and re-record my question to fit their really good answer, and that makes the entire episode better. And also, at the end, we might ask, "What have you thought? Would you like to add other things?" and if I get a really good answer, we might weave it back earlier into the podcast, and I'll get a better flow that way. So we've been pretty comfortable with this, and without a doubt the use of Descript made it easy for us. But we've never told this to our guests, and you really have me thinking about this. So what are your thoughts on what we should be doing at Georgian Partners about disclosure?

I feel that it is a responsibility of content creators as well to share how they are editing the recorded information that they got from their podcasters or their guests.
I think anything, if misplaced in certain contexts, can be harmful or can really convey a very different message. As a guest, as a person participating in the podcast, I want to know how the sentences that I have spoken in the podcast are being used.

It could be a matter of degrees. For example, if I'm typing on my iPhone and the spelling is autocorrected, we don't have it say, by the way, this was autocorrected by a system. There was a very interesting article on Smart Compose from Google in the New Yorker magazine recently. We're going to put a link to that in the show notes as well. The example was a parent writing a note to a child. The parent was going to write "I am pleased", and as the parent typed the letter P, instead of "pleased", "proud of you" cropped up: "I am proud of you." Now, that was a suggestion from the system, because maybe more people have written "I am proud of you" in the past. So my writing may get better. I guess there's a question here: should we begin to do more attribution of how systems have made things better in general?

I mean, this is probably relevant only when people are not aware of it. For example, let's say that you are creating a movie, and you have used some awesome tools, let's say 3D effects. You have created a movie with these 3D effects in conjunction with a lot of acting and all those things, and it comes out to be very amazing. The question is, who are you giving the credit to: the machine that helped you create it, the person that made this technology available, or the operator who is operating that machine? I think we are talking about it right now because this is something new and appears magical. But going forward, in five to ten years, it will just become common. People's expectations will be calibrated well, so that they will kind of expect that you have used similar things.

I love the answer. I think you're right. I think I shouldn't be thinking that hard about it, that it is just natural. It's cool that everyone sort of knows there are stunt doubles in the movie, and everybody sort of knows there's CGI, and everyone sort of knows there has been autocorrecting, and even if grammar gets improved, it's all fine. And we just have to get used to living in this hybrid kind of tech-human world. So in terms of what Georgian Partners does and what we're doing today, do you believe we should be a little more forward about what we do in our post-production, or is that not necessary?

I think that it is necessary to do that. For users, if I know a priori that you are going to make some grammatical corrections and you're going to do these things, then if I want, I have an opportunity to go over it before it is published. I think that is great. And then I kind of know how the sentences that I have spoken have been, let's say, improved, or how my presentation has been improved. But not knowing it, and then finding out later that it has been, let's say, misrepresented in some way, I'm not comfortable with that.

This in a way also means that when you are putting a system in place, you have to be careful about these things: making it as transparent as you can, giving users control of everything that you can or everything that the user wants. And then there will be some users who are not comfortable with it. They should have a choice to not use the system in the first place, because they are not comfortable. Being really transparent and honest about how the system works and how their voice is being used, for example for Overdub, is very important.

And just thinking about it is, I think, critical. So we will not overdub our customers, and we have to make a decision on overdubbing my own words.

I actually think, though, that, for example, as I am your customer right now, you can ask my permission: you know, I want to use your voice double to make certain edits, and this is how I am going to use it. When you do those edits in text, you then send me: these are the changes that I have made, do you accept them or not? And if I accept, that's great. The technology is still enabling better, easier editing of the content while still preserving the control of my voice for me.

You're giving us fantastic food for thought here. We have to think about it. We talked a little earlier about things that were scripted and unscripted, and although I have kind of an outline in front of me, everything I say is quite unscripted. So if I've said something wrong and we go into post-production, we could use Overdub to change a word. More often than not, I will re-record the entire sentence, and more often than not it's a different sentence, constructed differently, and it sounds different, but maybe it's a little more Jon Prial-ish. I might make a couple of passes, each one different, and we'll pick the best one. So I think when there's a script, it's crystal clear. In the case of how we work, we may or may not; we're going to make some decisions based on this podcast. For me, and I think for all of our customers, it's another step in this significant journey in terms of the merger of tech and humans, and tech and non-tech solutions, and how a company has to think about this so they can build trust, and make sure they're not breaking trust, with their customers.

Indeed. So there is a huge range of expertise required here. One, how are people going to think of the technology itself? How are you going to build trust in the face of people being skeptical about it? How are you going to make people comfortable with the things that they are not comfortable with right now? How are you going to enable it technically? How are you going to present it to the user in a nice interface so that it is fully transparent and honest?

Yeah.

There's no better way to end it than that. You're right. We have a lot to think about. We are going to do everything we can to get it right. You're helping us get it right, and we will keep this dialogue going, I'm sure. So thanks so much for your time. It was a pleasure chatting with you.

Thank you. Thank you very much. Very nice to be here.
