Episode 109: Who’s Who in your Data with Jeff Jonas
Jon Prial: A lot of effort goes into identifying who we are almost from the very moment we're born, birth certificates, fingerprints, passports, now facial recognition. And do we think for a moment that there are that many times that companies don't know where we are? How come it still feels like companies don't know me? Why do they send the same offer to me several times? I mean, shouldn't it be easy for CRMs to get it right and reach me just once? And if I move to a new house or city, these systems should know that I'm the same person, right? It seems wrong. Even though we're leaving a wake of data behind us, most CRMs and other corporate systems of record are just horrible at telling and maintaining who we are. That's where entity resolution software comes in. And it can tell the difference, for example, between a John Smith, Sr. and a John Smith, Jr. at the same address, but it can also help with much more salient problems like voter registration and fraud detection. So today I'm really pleased to be talking with Jeff Jonas, CEO of Senzing. Jeff's been working in the area of identity resolution for almost 35 years. Over that time, he's worked with casinos, government agencies and enterprises. Jeff and I have a bit of history as we were part of IBM together, and he's always had an amazing knack for making the difficult easy to understand. And staying within our wheelhouse here at Georgian, we'll be talking about data, AI, and recognizing how important solution and industry knowledge is to build a powerful application. I'm looking forward to this discussion. I hope you are too. I'm Jon Prial and welcome to the Georgian Impact Podcast. So give us a little bit about yourself, your history and your background.
Jeff Jonas: My name's Jeff Jonas. I'm the CEO and founder of Senzing, S-E-N-Z-I-N-G. My company is an IBM spin-off so we're not really a start-up, we're more like a reincarnation. In 2005, IBM bought my last entity resolution company. That company was called Systems Research and Development. We were located in Las Vegas and we were helping casinos better understand who they were doing business with. And it was ultimately that technology that was figuring out who is who and who is related to who, aka entity resolution, that caused IBM to buy the company. I stayed with IBM 11 and a half years. I was an IBM Fellow. It was an extraordinary journey. We started building a new product while at IBM, it was code-named G2, and we spun it out into Senzing in 2016.
Jon Prial: One question on the Las Vegas, you said, who were they doing business with, but wasn't it more about who they should not be doing business with in Las Vegas?
Jeff Jonas: Casinos definitely want to know who they're doing business with because some people are on what's called the exclusionary list. That's the Nevada Gaming Control Board's list of people who you better not be doing business with or you're going to get a big fine or lose your license. And so that would be one of a number of sources that they want to keep an eye on to make sure they're not giving you casino credit if you're on that list.
Jon Prial: Or if you're the MIT guys that have figured out how to count cards in a funny way.
Jeff Jonas: If you're a known opportunist, as they call them, which means you're using your mind to improve your odds, it's not illegal, but they might want to know that. And they have the right to refuse service, just like restaurants. If you're just too good for them, if you can change the odds using your head, they can ask you to go across the street and play at the neighbor's.
Jon Prial: Wow. Tell us more about Senzing please.
Jeff Jonas: So Senzing, we're at 20 full-time people now, 30 if you count the whole family including consultants. A lot of us came out of IBM. Most of the core engineers on the team have been with me in excess of a decade. And we are democratizing entity resolution. What we really do is we take a complicated task for programmers and make it easy. And in a way we're like Stripe and Twilio. With Stripe, if you want credit cards on your website, you can go write a credit card interface, or you can just use Stripe. If you want cell phone interfaces you can go write your own or just use Twilio. And that's what we're doing for entity resolution, and entity resolution is really... it just... it plagues almost every company. In fact, it plagues most people. If you have duplicates in your address book on your phone, because the first time you met them, you didn't know their last name and you put in a question mark on last name and the next time you met them you got their Yahoo account. Anybody that finds duplicates in their phone has an en... excuse me, has an entity resolution problem. But then banks have a big entity resolution problem. They think they have three customers, but it's really one.
Jon Prial: Now, it sounds like, from the banking perspective, this affects most traditional CRM systems and it's almost a classic data cleansing problem. But it's really more than that. There's so much more to be done once you get the right information. Is that fair to put it in that space?
Jeff Jonas: Yeah. And maybe our biggest use case is fraud. If you have people you're trying to keep out of the bank or defrauding the insurance company, you want to be able to stop them so they're not stealing from you and coming back the next week and stealing some more. That's understanding who is who and who's related to who, so in our entity resolution we're resolving identity: "Hey, these two people are the same, Elizabeth versus Beth. And they're related to this person because they share an address or they have been in three accidents all together." So we're building a resolved graph of who's who, and who's related to who.
Jon Prial: Just for fun then let me stay down on this Elizabeth and Beth thing. Obviously you've got decades of building a corpus of knowledge so that if I'm a bank then I'm going to use Senzing now, I already signed a license with you and you already know that Robert and Rob and Bob are all the same as a way to help get there. So you have a baseline of where you could start.
Jeff Jonas: The system comes with common sense so that you don't have to take the time to train it. So all the variations of Dick, Dickie, Richie, Ricardo, all the spellings of Mohammad and the transliterations across different cultures and languages, all of that's built in. Messy addresses from around the world, in the old days we used to have to use address hygiene software to clean it all up. Now we have a really world-class machine-learned address parser trained off of OpenStreetMap, and so you can just pass it long bags of words about addresses and it figures out what the parts are. And all of that common sense just occurs out of the box.
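The "common sense" name handling described here can be pictured as a lookup that maps nickname variants to one canonical form before comparing. The sketch below is only an illustration under that assumption: the `NICKNAMES` table and both function names are hypothetical, the table is tiny, and nothing here is taken from Senzing's actual implementation.

```python
# Hypothetical nickname table; a real system would ship a far larger one
# covering many cultures, spellings, and transliterations.
NICKNAMES = {
    "bob": "robert", "rob": "robert", "bobby": "robert",
    "beth": "elizabeth", "liz": "elizabeth",
    "dick": "richard", "dickie": "richard", "richie": "richard",
}

def canonical_given_name(name: str) -> str:
    """Lower-case the name and map known nicknames to a canonical form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

def same_given_name(a: str, b: str) -> bool:
    """True when both names reduce to the same canonical given name."""
    return canonical_given_name(a) == canonical_given_name(b)
```

With this table, `same_given_name("Bob", "Robert")` is true while `same_given_name("Beth", "Robert")` is false. A name on its own is weak evidence, so a check like this would only ever be one input to a match decision.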
Jon Prial: But what's in the box that's also interesting is let's say we've got a name and all of a sudden we figured out that it's a John Smith, Sr. and a John Smith, Jr. Prior to that it might've been one person. Now in an instant, you just figured out that it might be two people. What happens at that point?
Jeff Jonas: Well, that's one of the really unique things about what we're doing is we can change our mind about the past. So as you're ingesting data, figuring out who's who, you might decide two people are the same, because they have the same name, address and phone. A year later, you learn something and realize that it's a Junior and Senior. And at that split second, you want to go and fix the past. What most technologies out there have to do is they have to go reload. Well, if you're going to reload every 90 days, that just means you're going to have wrong answers for 90 days. And there's some kinds of businesses where that's pretty catastrophic-
Jon Prial: Wow.
Jeff Jonas: ...whether you're protecting a country or doing law enforcement or doing child safety. I guess it's also important, I might as well just say it now, but we're not a hosted SaaS service. Nobody sends us their data. People just go to our website, download the software without even giving us their email address, and they can run it and test it on their own data locally. All the data runs local. People run it on their cloud or on their laptop.
Jon Prial: So there's the third party data that you might augment yourself with, as you mentioned, OpenStreetMap, for example, but you let each company maintain their own set of data. That's their crown jewels at this point.
Jeff Jonas: We ship with the common sense and we ship with things called principles that understand how to resolve data. They're general purpose. They work for almost everybody, whether it's people data, or company data, whether it's restaurants in Singapore, or US marketing data about people. Out of the box it does a really great job without any training, tuning or experts. So people are getting pretty quick joy and then as they load their own data, it just gets smarter as it loads.
Jon Prial: And there are some other tricks of the trade beyond just the Senior and the Junior for example, that if someone is doing fraudulent activities, they don't just make up a new birth date, right? It's hard to create a new birth date to make it, what do they do when they're trying to fake somebody out?
Jeff Jonas: Well, we learned this in Vegas, actually. Everything I learned about identity resolution and national security and privacy I learned in Vegas. In Vegas, we found early on when we were doing a project for a corporate security group is they had people that had 32 different names, I don't know, eight different social security numbers, five different dates of birth. And they would use different combinations of these things. And so when people are trying to alter their identity, they're often reusing pieces and parts they've used before. They're using that address where they really know they can reliably get mail or they're using that cell phone or that date of birth or email address. They really try... it's just a key thing that people do when they're trying to deceive, is they do channel separation. Nobody uses the same name, address, and phone on every record. Only the idiots would do that. And if you want to catch clever bad people, you'd have to figure out it's really the same person. And so we do something very particular called entity-centric learning. And that means over time, we learn all the name variations they've ever used. We learn every date of birth they've ever used and every email, every phone, every address and all the variations of those. So if they put in the address one time as Main Street and the other time as Main Avenue, we're remembering that. And it's helping us get smarter to reidentify that same person.
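One way to picture entity-centric learning is an entity that remembers every value it has ever been seen with, so a new record is compared against that accumulated history rather than against a single earlier record. The sketch below assumes a simple "at least two shared fields" rule. The `Entity` class, the `matches` function, the field list, and the threshold are all hypothetical choices for illustration, not Senzing's matching logic.

```python
from collections import defaultdict

# Hypothetical field list for the illustration.
FIELDS = ("name", "address", "phone", "dob", "email")

def _norm(value):
    """Crude normalization: trim whitespace and lower-case."""
    return value.strip().lower()

class Entity:
    """Accumulates every value ever observed for one resolved entity."""

    def __init__(self):
        self.features = defaultdict(set)

    def absorb(self, record):
        """Remember all non-empty field values from a matched record."""
        for f in FIELDS:
            if record.get(f):
                self.features[f].add(_norm(record[f]))

    def shared_fields(self, record):
        """Fields where the record repeats any value seen before."""
        return [f for f in FIELDS
                if record.get(f) and _norm(record[f]) in self.features[f]]

def matches(entity, record, min_shared=2):
    """Assumed rule: a record matches if it reuses enough known values."""
    return len(entity.shared_fields(record)) >= min_shared
```

Suppose a first record carries a name, "12 Main Street", a phone, and a date of birth. A second record reuses the phone and date of birth but gives "12 Main Avenue" and a new email. Once the second record is absorbed, a third record with a brand-new name and phone still matches if it reuses "12 Main Avenue" and that email, even though it shares nothing with the first record alone.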
Jon Prial: It's interesting... you think of a data model for an entity, in this case a person with a name, an address, a date of birth, email, but there's much more to that. It's much more layered and nuanced when you start thinking about all the alternatives and the options. How does that then affect a CRM system? You'd expect a bank to reach out to an individual just one time, right?
Jeff Jonas: Well, yeah. Bank wants to make kind of their... Well, first of all, they want to know it's one customer and then they want to figure out on what channel to communicate with them. But they might have that customer's information in, I don't know, six or sixty different places and they think it's four different people. So what entity resolution does is helps them get a 360 degree view of that person. Then what they'll just have to decide is what will they consider the best name? Maybe if it's a female and different last names maybe the most recent name is the best name, just because they got married. So that's called data survivorship. It's about picking which address... name, address, and phone are current ones that you want to communicate to them with.
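Data survivorship, as described here, comes down to a rule for choosing which of several known values to use. A minimal sketch, assuming the rule is simply "most recently seen non-empty value wins" and that each record carries a hypothetical `seen` date:

```python
from datetime import date

def survive(records, fields=("name", "address", "phone")):
    """Pick the most recently seen non-empty value for each field.

    Assumes every record is a dict with a 'seen' date (a hypothetical
    field name) plus optional name, address, and phone values.
    """
    best = {}
    # Walk oldest to newest so later values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["seen"]):
        for f in fields:
            if rec.get(f):
                best[f] = rec[f]
    return best
```

If a 2015 record has a name, address, and phone, and a 2019 record has a new last name and address but no phone, the result keeps the 2019 name and address and falls back to the 2015 phone. Real survivorship rules are often richer than recency, for example preferring a trusted source system over a newer but less reliable one.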
Jon Prial: And you can stick with that one. Now at the same time, and I hope this is going to be an okay question, do the names really matter anymore? I feel like we're constantly being targeted, barraged with ads all the time. Does it matter anymore? Because I'm getting the sense from what I'm reading that perhaps it's now my device ID or maybe my location or where I visited on the web somewhere. Is there something more nefarious going on in terms of how businesses are using entities, or do you think that's kind of an outlier type of question?
Jeff Jonas: That's a bit more of an outlier. First I was thinking I was going to explain that name's important because sometimes family members share the same email address. And so if you don't notice that Patrick and Susan Smith are using the same email, if they're using separate names you want to really think of it as two different people, even though the email address is the same. So in that case name is very helpful, if people are sharing devices in a home you're not resolving an identity. You're just resolving the behavior of a device.
Jon Prial: That's right. You know, it's funny. I struggle sometimes when I'm putting names in my address book. I might have John and Mary Smith and then I'll use the iPhone labeling feature and I give John a phone number and Mary a phone number. But then when my phone rings, it says John and Mary Smith and well, I've got to look at the two point font thing to say who's really calling me. So it's still a struggle I think getting down to individual entities at times so.
Jeff Jonas: Everybody's struggling with it. The market's rather interesting. There's 50 companies out there at least that are offering entity resolution, but all the good stuff they say call for a quote. So we just put our prices on the website and all the good stuff. Proof of concepts are 150K and we say, "Download it today for free and do a POC this afternoon."
Jon Prial: Wow. Wow. Well that's because you've taken the corpus of knowledge and just made it so accessible. That's fantastic.
Jeff Jonas: Yeah.
Jon Prial: So let me change the subject, talk a little about an article. It was November of last year in The Times. It highlighted a project that you were working on called ERIC. And I'd love to hear a little more about why you did that and how that fits into this world of entity resolution as well, because this is such a great practical business case.
Jeff Jonas: You know, the team and I at Senzing are just so proud of this project and organization. So ERIC is a nonprofit. They've been running our software since 2012. So we've been in production there for, is that seven years? They are modernizing voter registration in America. They're governed by state election offices. Blue states and red states both subscribe. They're running over half the country right now, voter registration. They have over 350 million records and what they do when a state joins ERIC, they provide to ERIC, not to us, ERIC runs their own systems. They provide to ERIC voter and DMV data from their state. And here's what happens in America. When you move from one state to another, let's take the last time that you moved, Jon, did you... when you left the one state, did you ask to be removed from the voter rolls?
Jon Prial: I did not. I know I... I never left. I don't think I did anything in Connecticut other than register in Vermont, I think.
Jeff Jonas: Criminal. Well, it turns out no one remembers. Everybody's a criminal. You can stay registered if you still own property there, but you can only vote in state elections or local elections, not federal elections. But here's the deal. When people move, and it's a highly mobile country, we move once every five years, we forget to unregister. And so you end up with states where you have a lot of people that have moved on. And so what the system does is it helps states do two things. ERIC provides recommendations to states that say, "Hey, you might want to reach out to these voters and ask them if they moved because it appears they've now turned up in Oregon," and it tells Oregon, "Hey, some people just moved to your state and they've not registered yet. Maybe you should reach out." So it puts people on the rolls when they've moved, and it helps get people off the rolls if they've moved, and by the way it is just such a great system. We're all so proud of it. But another thing that speaks to our ease of use, this system has probably, I don't know, 70 or 80 data sources that flow in at least monthly. They provide insight reports to 24 and soon to be 30 states. Until recently their entire IT department, including running the entity resolution engine, was a whopping two people.
Jon Prial: Wow.
Jeff Jonas: This is unheard of.
Jon Prial: That's amazing. That's great.
Jeff Jonas: Yeah.
Jon Prial: And this could be the first time in my history of doing podcasts, maybe in my life, I've never been called a criminal before. So this is kind of a high point. It's great.
Jeff Jonas: We all have done it so there's...
Jon Prial: I guess I shall still sleep tonight.
Jeff Jonas: Yes.
Jon Prial: So you spoke a little about the common sense thing that you do, particularly real-time learning. Just to clear up some of the semantics a little bit, in my view machine learning takes this corpus of all kinds of data that gets correlated and eventually runs predictions. You're really in a different space. You're bringing domain expertise to a very unique solution and applying technology in a different way. And I'd love to get your sense a little bit, your description of kind of how you do that.
Jeff Jonas: Okay. Well, let me start at the top. We are the first real-time AI that's been purpose-built for entity resolution. And when I say AI, I mean, it's human smart. And we get that feedback routinely. People might tell us, "Hey, we found a few records you missed," but then they'll say, "But you found a whole bunch of things that humans missed." So we're routinely outperforming humans on matching identity data. So AI acts human smart, but the next question is, "Do you use machine learning?" Well, the definition that I am drawn to, and I actually got this from an IBMer, is systems that learn through experience, and Senzing absolutely learns through experience, but we can't use these neural nets and some of these popular machine learning techniques these days, because we're a real-time engine and our customers don't have... some of them don't have millions of records of training data. We needed to have a type of machine learning that didn't require training data.
Jon Prial: Well, you actually brought it to the table to start with.
Jeff Jonas: Well the common sense layer has things like all the name transliterations, all the spellings of Mohammed and Dick, Dickie, Richard, Ricardo, stuff like that as I mentioned, also, the address parsing. It also has common sense around what kinds of things would constitute a good match. If somebody's got the same ID, like a passport and the name's pretty close, it's the same person. If you have two people with the same address with the same phone, but they're Junior and a Senior, different people. So those are all built into the common sense layer. But then we do real time machine learning on top of that. A couple of examples, maybe, you know that ID numbers are really great ways to identify when people are the same, but let's say there's an ID that's turning up. Let's make it up. Let's say it's 11223344. And somebody's just putting in it's a bogus passport number.
Jon Prial: Right.
Jeff Jonas: Well, no one told you up front that bogus passport number's on its way. That'll foul up most systems. What we do is in real time we notice that that passport number is not very discriminating. We're like, "Man. Quite a few people have 11223344." And the moment we learn that, this is learning through data and learning through experience, the moment we learn that we do two things. We say, "Hey, let's stop thinking that that number is a good number. All the other passport numbers are good. Let's stop thinking that number's good going forward." And then we do something else which is totally special, it's super hard to do, is we go, "Well, now that we know that that passport number's crap, let's go backwards in time and fix every record we've ever seen."
Jon Prial: Fantastic.
Jeff Jonas: I want you to imagine this. You've already seen a billion records. You've made a billion decisions. Now you get record billion and one, and it's with that record you learn that 11223344 is no good. You have to ask yourself, "Had I known this at the very beginning before I made a billion decisions, would I have made any of those decisions differently?" And fix them all.
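The bogus passport number example can be sketched as a resolver that tracks how many different names share each ID value and stops trusting a value once too many do. This toy version assumes a made-up threshold and regroups all stored records every time it is asked, so earlier merges are undone automatically. That is a deliberate simplification: it is not the incremental, real-time approach described in the interview, and `TinyResolver` and its methods are hypothetical names.

```python
from collections import defaultdict

class TinyResolver:
    """Toy resolver: a shared passport number means same entity,
    unless that number turns out to be non-discriminating."""

    def __init__(self, generic_threshold=3):
        # Assumed cutoff: a value shared by this many names is distrusted.
        self.generic_threshold = generic_threshold
        self.records = []                          # (name, passport) pairs
        self.names_by_passport = defaultdict(set)

    def add(self, name, passport):
        self.records.append((name, passport))
        self.names_by_passport[passport].add(name)

    def is_generic(self, passport):
        """True once too many distinct names share this value."""
        return len(self.names_by_passport[passport]) >= self.generic_threshold

    def entities(self):
        """Regroup every stored record using what is known right now."""
        groups = defaultdict(set)
        for name, passport in self.records:
            if self.is_generic(passport):
                key = ("name", name)          # distrust the ID, fall back to name
            else:
                key = ("passport", passport)
            groups[key].add(name)
        return sorted(sorted(g) for g in groups.values())
```

With the default threshold, two people who both present 11223344 are at first merged into one entity. When a third name arrives with the same number, the value is treated as generic and all three come apart, including the two that had already been merged, while records sharing a trustworthy passport number stay together.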
Jon Prial: Outstanding.
Jeff Jonas: If you can't do that, you have a system that requires being reloaded. That is the Achilles heel of almost every entity resolution system out there, because that means every month or every quarter, you've got to reload.
Jon Prial: Right, right.
Jeff Jonas: Not Senzing, because of this style of real-time learning. So that's one kind of learning. When we learn that you have a nickname, like if you had a nickname like Blake, the moment we learn that nickname, we go back in time and see if there's any other Blakes at your address and phone that are maybe... now you realize that it's you, and so we do things like that. And what's happening as it's ingesting is it's using common sense to get started. So it works really well on the first few records and then it tracks an entire statistical distribution of how data's showing up. And it uses that to self-tune in real time.
Jon Prial: It's actually more... you've used the word real time. To me, it's actually even more continuous improvement. Continuous is actually slightly broader or stronger than real time to me because you're going back in time all the time. It's interesting.
Jeff Jonas: Yeah, absolutely. There's a bunch of programs getting started up to do continuous vetting. Like you validate somebody in the moment to make their... they should have credentials to, I don't know, drive the boat. And then maybe once every five years they check. But when you have a real time engine that can do continuous, as you're mentioning, it means you can revalidate that that person really deserves those credentials every second as every record arrives.
Jon Prial: Fantastic. So now you mentioned you don't share data across companies, so the ERIC data's in an ERIC data system augmented by state data. So you don't want Jon Prial and Company A mixed up with Jonathan Prial and Company B. But do you have the opportunity to share learnings back at the Senzing level that you might... we know Jon and Jonathan might be the same, but as you learn across different companies, do you have the ability to aggregate some of your learnings?
Jeff Jonas: Yeah. So just one thing to make clear is we don't have any data. When people subscribe to Senzing, we don't get any data. We send them software, they run it local to them. If two banks want to share data because they're merging, they get to choose that. What happens though is we get feedback from our customers and when we get that feedback, we imagine to ourselves, what could we do to make it faster or smarter in a general purpose way? And that's really cool because when we do something that helps vessel resolution, vessels that have call signs, and vessel names, and things called IMO numbers, when we do something that makes it match vessels better in a general manner, it matches company data and people data better as well. And so we produce new builds every 30 days, and then people just download the latest builds and they're finding that we're just getting faster and better.
Jon Prial: Nice. Now I liked that you talked about you don't have any data. So this is a great way to wrap this up and talk a bit about privacy and trust. You are a privacy by design company. We actually had a wonderful interview and I enjoyed Ann Cavoukian's books so I'd love to get your take on why you went this route of privacy by design.
Jeff Jonas: Well, I'll tell you what, when I was doing this work in Vegas, doing entity resolution, I didn't even know the word privacy existed. I didn't realize there's policy debates around it. I was just literally writing software. I mean one day we were writing software to track labor. Another time we were doing inventory of fish tanks. Then they were like, "Can you match these watch lists to make sure we're not doing business with these people?" We were just hammering out widgets. Only later did I become a late bloomer, and then I realized that you really want to bake as much privacy into software as possible. Originally I started creating some specialty privacy products, but it turns out, in my experience, people don't just run around and say, "We need to buy privacy software." Maybe it's different now that the GDPR and the California CCPA law are coming, but people just weren't buying privacy features as a privacy feature. So we decided when we started to build this G2 class technology, which is what's inside Senzing, we just built every bit of privacy into it we could. Like literally on day zero, I took every privacy feature I'd ever imagined, there's seven primary features, and all seven of those features I made either embedded inside of Senzing or compatible with it. And we're proud of that.
Jon Prial: You categorize yourself as a late bloomer. I'm going to take exception with that. It is only, or already 2019, and I would argue you are way, way ahead of the curve on where we need to be with privacy and the ability for companies to deliver trust downstream to their customers. Jeff Jonas, it's been a pleasure to chat with you. Thank you so much for your insights. I really appreciate it.
Jeff Jonas: Thanks man. Good talking, Jon.
DESCRIPTION
A lot of effort goes into identifying who we are almost from the very moment we’re born. Birth certificates, passports, fingerprints, now facial recognition. So why does it still feel like companies don’t know us? Why do they send the same offer several times? Shouldn’t it be easy for CRMs to get it right? Well, it can be if you cleanse your data using entity resolution software.
In this episode of the Georgian Impact Podcast, Jon Prial talks with Jeff Jonas, CEO of Senzing. Jeff has worked in the area of entity resolution for almost 35 years. Over that time he’s worked with casinos, government agencies, and enterprises.
You’ll hear about:
- Why entity resolution is a harder problem than you'd imagine
- How it can be used to detect financial and voter fraud and identify terrorists
- Why privacy by design is core to Senzing’s thinking
Who is Jeff Jonas?
Jeff Jonas is an acclaimed data scientist and CEO of Senzing. He is at the forefront of solving some of the world’s most complex business and big data problems for government and companies. A former IBM Fellow, Jonas is the leading creator of entity resolution systems. National Geographic recognized him as the Wizard of Big Data. Today, many organizations rely on his systems to extract useful intelligence from tsunamis of data.
A three-time entrepreneur, Jonas sold his last company to IBM in 2005.