Episode 83: Understanding Differential Privacy with Chang Liu
Welcome to the Georgian Impact Podcast. I'm Jon Prial, and we're continuing with the intersection of trust and AI, a hot topic right now. Differential privacy is well on its way to becoming an integral part of the conversation. In this episode we'll be talking about what differential privacy is, why it's important, and why every company needs to think about having it as part of its product and tactical strategy. I'm thrilled to welcome Chang Liu to the show. Chang is an applied research scientist here at Georgian Partners and leverages her rich experience with analytics and model building to help our portfolio companies, and she's one of our experts in differential privacy. Chang, it's great to be with you.
Thanks for having me.
Our pleasure. So listen, let's get started. Privacy is probably not fully understood. People kind of assume that data can just be anonymized. They've all heard the term PII, personally identifiable information. They've heard and know that it can be removed. But hasn't it been shown in both theory and practice that this anonymization is, I don't know, should I call it an illusion?
Absolutely. We've seen many failure examples, actually. What happens is that there usually exist other open public datasets out there that we could potentially link with other information to retrieve that personally identifiable information. One example: recently New York City released a dataset of New York taxi information as a result of a Freedom of Information request. The dataset includes information such as pickup and drop-off times and locations. There's also the price of the fare and how much tip the driver received. They thought that by just stripping away the license plate and driver information, this information could be made public. What we've seen is that researchers were actually able to relate that information to celebrity Instagram photos and posts. Based on the time and location at which they were using which taxi, they were able to reference back to which celebrity took which taxi, and they could actually figure out whether that celebrity tipped or not, or even how much. So just taking the information and thinking that we stripped away the PII is not particularly privacy-preserving.
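Editor's note: the linkage described above can be sketched in a few lines. This is only an illustration with made-up toy records; the field names, the zone labels, and the five-minute matching window are all assumptions, not details of the actual taxi dataset or of the researchers' method.

```python
from datetime import datetime, timedelta

# Toy "anonymized" trip records: no names, no licence plates (all values invented).
trips = [
    {"trip_id": 1, "pickup_time": datetime(2013, 1, 1, 20, 15), "pickup_place": "Zone A", "tip": 3.0},
    {"trip_id": 2, "pickup_time": datetime(2013, 1, 1, 21, 40), "pickup_place": "Zone B", "tip": 0.0},
    {"trip_id": 3, "pickup_time": datetime(2013, 1, 2, 9, 5), "pickup_place": "Zone B", "tip": 5.0},
]

# Toy public side information, e.g. a timestamped photo of someone getting into a taxi.
sightings = [
    {"person": "Person X", "time": datetime(2013, 1, 1, 21, 42), "place": "Zone B"},
]

def link(trips, sightings, window=timedelta(minutes=5)):
    """Match each sighting to trips at the same place within a time window."""
    matches = []
    for s in sightings:
        for t in trips:
            same_place = t["pickup_place"] == s["place"]
            close_in_time = abs(t["pickup_time"] - s["time"]) <= window
            if same_place and close_in_time:
                matches.append((s["person"], t["trip_id"], t["tip"]))
    return matches

# The "anonymous" trip is now tied to a named person, along with the tip amount.
print(link(trips, sightings))
```

The point of the sketch is that no identifier in the trip table was needed: time and place alone act as a join key once a second dataset supplies the names.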
That's scarier to me, because they were naive, or maybe they thought it was okay to send this data out: strip out the PII and everything is good. But it's this integration, going across these different, I guess I'll use the word silos, of data, these different datasets, that allows some really scary things to happen. So do you think the problem has been solved or not solved, in terms of what companies should be doing with data?
This is a serious, real problem. Putting these pieces together, we need some answers that move beyond simple anonymization as a strategy. Clearly, we're in the age of machine learning, which relies on a large amount of information, and this information is often private or contains sensitive information. So without any additional protection, I guess no one should really assume that their privacy has been protected. And so there's this other field of study called differential privacy, which is also the topic of today's talk. It gives mathematically provable privacy guarantees. This is very interesting because we're starting to see a lot of the large companies, such as Microsoft, Google, and Apple, starting to use differential privacy in their software offerings to provide that privacy for their users.
Well, mathematically provable privacy guarantees: that's the key element of differential privacy, and we're going to dig into that. But I think it's probably worth just a few minutes to step back a little bit. You mentioned machine learning; it's behind almost all software now. To me, it analyzes data and makes predictions based on analysis of that data. So with all this proliferation of machine learning, is that one of the key elements that makes differential privacy important?
Oh, yes, absolutely. In the past few years, we have gathered so much information and data, and everybody is using machine learning because it's one of the most powerful tools that can process this amount of data and learn from it. With today's computational power, we can do this in a very efficient way, and we can use this information whether we're trying to recognize patterns in the data or even make predictions. Because machine learning is so powerful, I guess it's no surprise that when we're training a machine learning model, it memorizes the training data in some way, and this training data could very well be private or contain sensitive information. Therefore we should be very concerned with the privacy of machine learning models.
So this reaches further than just matching Instagram photos of stars with a database. It's actually doing these predictions, and it's much closer to what I see Netflix doing in predicting movies for me. Let's swing a little bit back to privacy. What are the new trends out there in terms of privacy associated with collecting so much data and creating a product based on this private data?
So in the past we've had cryptography or encryption methods, and those have been very popular techniques. However, if they're not engineered correctly, they can be easily compromised. This is why we believe that differential privacy is such a promising technique, because it actually protects individuals' data while preserving statistical information about the entire population, which means that we cannot really identify any particular individual's information, but we can still learn from the entire population.
Got it. Going back to your words, mathematically provable privacy guarantees, I think it's time for me to get your best definition, then, of what this is all about.
So when we say that a particular algorithm is differentially private, what we really mean is this: given a randomized algorithm, if we apply the algorithm to two neighbouring datasets, which means that they differ by only one record, the probability of getting any particular outcome on the two different datasets should be approximately the same.
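Editor's note: the verbal definition above is usually written as the following inequality. The symbols are assumptions introduced here for illustration and are not used in the episode: $\mathcal{M}$ is the randomized algorithm, $D$ and $D'$ are neighbouring datasets that differ in one record, $S$ is any set of possible outcomes, and $\varepsilon \ge 0$ controls how close "approximately the same" has to be.

```latex
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
  \qquad \text{for all neighbouring } D, D' \text{ and all outcome sets } S .
\]
```

Because the roles of $D$ and $D'$ can be swapped, the two probabilities are within a factor of $e^{\varepsilon}$ of each other in both directions. A smaller $\varepsilon$ forces them closer together, which means a stronger privacy guarantee.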
So that means if I run an analysis against these two datasets, I want to make sure the results are the same. Otherwise, it might reveal something about me. Every single individual in the dataset is protected in the way you described it.
Yes. The intuition is that, with that particular record or without that particular record, the algorithm should, in a probabilistic way, give similar results with or without you in the dataset.
Got that. And we can tell our companies: hey, look, even though individual records are removed, somehow, and it's not quite the same but not that dissimilar from the taxi database and the Instagram database being merged together, there's something that allows you to figure things out, and stripping one person out changes it. So I've kind of got that piece. We move to one where obviously we're going to guarantee, not just that I'm okay, I shouldn't be so selfish, but that an individual person is okay whether that person is in or out of these datasets. I think that's the proof point. Now, I know, as I did some reading, and you've been doing so much to teach me along the way here, there are other ways of doing some things to protect privacy. One of them was just kind of cool for me, and you told me it's been around since the sixties. It's called randomized response. People are asked questions in a particular survey, but in some cases they don't answer the question; they just flip a coin and put in yes if it's heads and no if it's tails. It has nothing to do with reality, but it really begins to mess up the data. Is that still something that happens?
So this is what we call, as you mentioned, randomized response, in which we actually inject noise into the data. What you can think of is that, in some particular case, for example, 50% of the people answer truthfully and the other 50% are just answering based on the coin flip, which we call noise. As the person collecting this data, I don't really know who is answering based on the coin flip and who is answering truthfully, so I have no way of really identifying your personal information. However, if we think globally, when we take this information as a whole, we know that 50% of it is noise, so we can actually calculate and backtrack. Because we know that in the noise we have a 50% chance of saying yes and a 50% chance of saying no, we can actually figure out the average answer that we're looking for for this population.
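Editor's note: a minimal simulation of the coin-flip survey described above, to show how the noise can be backed out at the population level. The function names, the 30% "true yes" rate, and the population size are assumptions made up for this sketch.

```python
import random

def randomized_response(truth: bool, rng: random.Random, p_truth: float = 0.5) -> bool:
    """Answer truthfully with probability p_truth; otherwise report a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5  # heads -> yes, tails -> no

def estimate_yes_rate(reports, p_truth: float = 0.5) -> float:
    """Back out the true yes-rate from the noisy reports.

    observed = p_truth * true_rate + (1 - p_truth) * 0.5, solved for true_rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(0)
# Hypothetical population in which roughly 30% would truthfully answer yes.
population = [rng.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(answer, rng) for answer in population]

actual = sum(population) / len(population)
print(f"actual yes-rate:    {actual:.3f}")
print(f"raw reported rate:  {sum(reports) / len(reports):.3f}")
print(f"corrected estimate: {estimate_yes_rate(reports):.3f}")
```

The collector never learns which individual reports were coin flips, yet the corrected estimate lands close to the actual rate, and it gets closer as the population grows.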
That's pretty cool. What I really like about it is that I often hear people talk about signal and noise, and we're trying to find the signal in the noise. In this case, because we're trying to protect people's privacy, we're putting noise in. That's different, and that's kind of cool. And I do believe I understand the math. So if a thousand people took the survey and 500 of them flipped the coin, you know about 250 were heads and 250 were tails, and what's left behind is the real number. So it's really the noise that you injected that allows you to remove it, but nobody knows what Jon Prial answered.
That's exactly right. Yes.
Now I've done that. So how do I then measure? What does that mean now in terms of being differentially private and ensuring the data is private? Relate that back for me to differential privacy.
Differential privacy is really based on the sensitivity of your algorithm. For example, when we talked about this randomized response, what we're trying to do, perhaps, is count the number of people who said yes and count the number of people who said no. Based on the sensitivity of the count, we can inject noise to a level that we're comfortable with. And as we mentioned before, differential privacy is a probabilistic guarantee, so we have this margin on the probability, which is called the privacy budget, and it's often denoted as epsilon. Based on the epsilon, we can sort of tune how much noise we want to inject into the dataset. So far in our example, we injected 50 percent noise. But if we are okay with less privacy, then we can actually inject less noise. So perhaps only 10% of the population will answer randomly, and the privacy budget will be much larger.
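Editor's note: a sketch connecting the two ideas above, under stated assumptions. The first function computes the epsilon of the coin-flip survey as described in the episode (answer truthfully with some probability, otherwise report a fair coin). The second uses the Laplace mechanism, a standard technique that is not named in the episode, to add noise to a count whose sensitivity is 1. Function names and numbers are illustrative only.

```python
import math
import random

def randomized_response_epsilon(p_truth: float) -> float:
    """Privacy budget of randomized response with a fair coin as the noise.

    P(report yes | truth is yes) = p_truth + (1 - p_truth) / 2
    P(report yes | truth is no)  = (1 - p_truth) / 2
    Epsilon is the log of the ratio of these two probabilities.
    """
    if not 0.0 <= p_truth < 1.0:
        raise ValueError("p_truth must be in [0, 1)")
    p_random = 1.0 - p_truth
    return math.log((p_truth + p_random / 2) / (p_random / 2))

def laplace_noisy_count(true_count: float, epsilon: float,
                        rng: random.Random, sensitivity: float = 1.0) -> float:
    """Laplace mechanism: add noise with scale = sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1, so sensitivity is 1.
    The difference of two exponential draws with the same rate is Laplace-distributed.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise

# 50% random answers gives epsilon = ln 3; only 10% random answers gives ln 19.
print(f"50% noise: epsilon = {randomized_response_epsilon(0.5):.2f}")
print(f"10% noise: epsilon = {randomized_response_epsilon(0.9):.2f}")

rng = random.Random(0)
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon = {eps:>4}: noisy count = {laplace_noisy_count(100, eps, rng):.1f}")
```

Less noise means a larger epsilon and therefore a weaker guarantee, which matches the point made above: moving from 50% to 10% random answers raises the privacy budget from about 1.10 to about 2.94.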
I like that you can actually tune this epsilon, which is cool. For me it's a business decision. You make a decision on how you want to, in this case, modify the data, add noise to the data. I guess my model is, look, everybody knows that a system with 99.9 percent uptime is good, but it's not acceptable. 99.999 is better, and more nines are always loved, but it's a business decision: what am I going to invest, and what am I getting? To some degree it's a service-level agreement, to some degree it's a guarantee. So it really is kind of the same piece.
Yes, and not only from the technical side, right? As you can probably imagine, if we inject noise then we're taking away some performance. In the example, we only used a simple algorithm, counting. But picture that we're using machine learning models that are way more complex, and perhaps you cannot actually reverse back to the exact number that we wanted to compute. A lot of times we may actually lose some accuracy or model performance because we're injecting this noise. So this is really a business decision as well, because we need to figure out the comfort level that we want for our privacy budget and how much model performance we want to sacrifice. We need to find a balance between the two.
I like that you've really thought about the trade-off with model performance. I think in the case of randomized response, it's straight math and it's easier, but obviously you have other ways of adding noise to the data or adding noise to your training datasets. There are lots of places to add noise, and in each one of these cases you can measure, but you've got to make a business decision as to the final impact on the analysis that this machine learning engine is running. Very cool. So, that's so important. Let's go through a couple of different examples. The first thing you talked to me about was companies' willingness to share data between different companies, so talk to me about how that works.
So incorporating differential privacy in software companies actually opens a lot of opportunities for these companies. First of all, it adds a level of privacy, and it actually builds trust between the company and their customers. On the other hand, after building this trust, when customers feel that their information is private, what differential privacy could enable is sharing data or models among its customers, so that the company can actually leverage the information from some customers to use for other customers, and we can boost the performance of all of these models for the different customers altogether.
So I see two benefits, if I have it right so far. I've got more data from multiple companies, therefore more users, and therefore I can have better results because I've got better analytics against it. And you're still able to go through this privacy measure, so these companies get better results and they're still nice and safe. Are there any other areas where companies are interested in differential privacy, besides going against multiple datasets for multiple companies?
What it could also help with is what we call the cold start problem, which we see with a lot of software companies. What happens in a cold start problem is that the data from their customers is in silos, and when a new customer comes on, they have to wait, sometimes months, to collect enough data from this new customer before we can build a model that performs relatively well for this customer. While we're waiting for a few months gathering data, the software company may often lose opportunities for revenue, or maybe the customer is waiting too long and has not gotten good performance, and there are lots of lost opportunities there. What we could do with the sharing of data is this: because of differential privacy, customers are more willing to share data, so we can use information from existing customers to help with this new customer, really speed up this entire starting cycle, and resolve the problem.
Interesting. Is it the same kind of construct? I have multiple companies' data together. The first thing we talked about was getting better results; this is now for somebody new joining the system. They just get up and running that much faster. Time to value is kind of there.
Exactly.
How does a company get started with differential privacy?
Always start by really understanding their data, the type of data they have, and what models they're using, in order to pick the best differentially private technique or algorithm. As we mentioned before, in most scenarios differential privacy will decrease the performance, so it is very crucial to understand how much you're willing to sacrifice in order to achieve the level of privacy that you want. So it's first understanding your business, and then doing some research to choose the most suitable differential privacy technique. This can actually be quite challenging and may require some research.
For clarity, you used the term "choosing the most suitable differential privacy technique." It's really choosing the right privacy-protecting techniques that can be measured or guaranteed with differential privacy. We talked about adding noise to the data, and we mentioned training models a little bit. I don't want to go into that on this call, and we may do another follow-on if people want to hear about it, but you can also be injecting noise into models, correct?
Correct. To learn more about this, we have a white paper, the CEO's Guide to Differential Privacy, which is available on our website to download. It gives a very nice introduction to what differential privacy is, and it explains some use cases of differential privacy. We're also currently working on developing an open-source differential privacy package that will be available to our portfolio companies and potentially to the public, so I guess users are encouraged to download the package and give it a try when it's available.
We'll put links to that in our show notes. I just want to go back a little bit. We used the word trust a bit, and I wanted to close on the discussion around trust. We've spent a lot of time on it in a number of podcasts. I think it's critically important that companies understand what's coming in terms of bias and risk. So I would just like you to help me understand how companies should be thinking about really making sure they do the right thing in terms of understanding that, and communicating that what they're doing is trustworthy.
Right. So what machines learn from is the data that we feed to the machine. If we only feed it male data, then the machine has never seen a female patient before, and of course we cannot expect the machine to perform well on female patients. Therefore we have to be really careful in choosing the data that we train the machine with. This actually leads to another very interesting research topic that we're starting to see in the literature, as researchers are starting to think about whether these techniques can really help alleviate this bias problem. The intuition is that we're injecting noise into the dataset or the training model, so what we're really doing is sort of injecting random noise, and this will also help with a more general model that could potentially prevent biases in machine learning models.
I think it's fantastic to see this stuff move from academia to reality. It's great to see it go from Facebook and Microsoft to the companies that you're working with, smaller companies. I think it's great to see the technology go so far and wide. Great story, great discussion. Thanks so much for taking the time to be with us, Chang.
It's a pleasure. Thank you so much.
Thanks for listening. If you like what we do, we'd appreciate your telling other people about the show. Better yet, give us a five-star rating on iTunes or write a review. Doing so will really help ensure that more people can find us. And if you haven't already, please subscribe to the Impact Podcast on iTunes, SoundCloud, or wherever you go to find your podcasts.