Episode 82: Aggregating Massive Data Sets with Chris Moore
Hi, everyone, it's Jon Prial from the Georgian Impact Podcast. A few months back, I interviewed Bill Adler, the founding president and CEO of True Fit, a data-driven personalization platform for footwear and apparel retailers. We talked about the convergence of machine learning and fashion. If you haven't heard it yet, it's episode number 75, and it's worth checking out. But at the time I also spoke with Chris Moore, the company's Chief Analytics Officer. We had a fascinating discussion about how the company aggregates and works with its massive data set. A fashion data set includes sizes, colors, brands, styles, and designs. It's broad and shallow, which brings its own challenges. Anyway, we had a fantastic dialogue that didn't make it into the main episode. Today, I wanted to share that conversation with you. It's about 17 minutes, and if you're interested in understanding the
All right, here it is.
Thanks, Chris, for joining us.
Thank you very much.
I'd like to start with you, Chris, with kind of a recap on an astounding number that Bill Adler mentioned. Tell me again how many characteristics a particular piece of clothing might have.
It varies by the nature, the type of clothing, but typically we're collecting about 140 to 150 data points on every item, and that covers a wide range of things.
But I want to say, that would be 150 for a top, and then 150 for trousers, and another, different hundred for shoes. Perhaps color is color, but everything else is different, right?
Yeah. So each category has its own set, some more than others, because there's a lot more variety in some categories than others. So if you look at dresses, for instance, not only do you have fabric, and there's a lot of variety in the fabrics that end up in dresses, but you also have to deal with pattern, color, and shape. Dresses come in a lot of different silhouettes, so there's a lot of detail in that piece. You also have to look at, say, coverage. Is this sleeveless, short sleeve, mid sleeve, whatever? What's the hem like? There's a lot of detail that shows up, especially in the more fashionable areas like dresses, because there's just so much variety.
But it can't be that every manufacturer necessarily even lists these things the same way. So how do you go out and get that? And then I guess my follow-up question would be: you would typically talk about normalizing data. You have to normalize. How do you begin to manage all this data that you're collecting?
Right. So you're right that normalization is the whole name of the game. You can't really scale a business like this unless you can take the data from all the crazy places you got it and normalize it into one system where you can really work with it and scale that business without having to redo all the analysis for every client individually. So the normalization game with respect to garments, for instance, starts with what we call specs, the blueprints for making the clothing. They come from the manufacturers in all kinds of different ways, so we do a fair amount of that normalization in house. In the worst case, you get a picture of a scribble. Usually it's way better than that. It may be a PDF with a bunch of figures and facts written on it that are parsable by an algorithm. Sometimes you get a spreadsheet, or sometimes you get an API that you can call into their product management system. So it can come in a lot of different ways. What you find when you pull that data in will be a bunch of different what we call points of measure, where we say the sleeve length, measured from the center back to the cuff, is a certain length. Someone else might express sleeve length as from the top of the shoulder to the cuff. So there's a lot of stuff like that. You can imagine pretty easily things that you can normalize, right? Because you always have a shoulder measurement, and you've got the back of the collar to the bottom of the garment. You can imagine just a whole bunch of different points of measure.
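To make the points-of-measure idea concrete, here is a minimal sketch of converting two sleeve-length conventions into one standard. The `RawSpec` shape, the field names, and the conversion rule (center-back sleeve length is roughly half the shoulder width plus the shoulder-to-cuff length) are assumptions for illustration, not True Fit's actual schema or method.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RawSpec:
    """One manufacturer's spec as received (hypothetical shape)."""
    source: str
    measurements_cm: Dict[str, float]  # point-of-measure name -> centimeters


def sleeve_center_back_to_cuff(spec: RawSpec) -> Optional[float]:
    """Return sleeve length in the standard 'center back to cuff' convention."""
    m = spec.measurements_cm
    # Already in the standard convention: use it directly.
    if "sleeve_center_back_to_cuff" in m:
        return m["sleeve_center_back_to_cuff"]
    # Assumption: center back lies halfway across the shoulders, so add
    # half the shoulder width to a shoulder-to-cuff measurement.
    if "sleeve_shoulder_to_cuff" in m and "shoulder_width" in m:
        return m["shoulder_width"] / 2 + m["sleeve_shoulder_to_cuff"]
    # Not enough information in this spec to normalize the sleeve.
    return None


brand_a = RawSpec("brand_a.pdf", {"sleeve_center_back_to_cuff": 86.5})
brand_b = RawSpec(
    "brand_b.xlsx",
    {"shoulder_width": 46.0, "sleeve_shoulder_to_cuff": 64.0},
)
```

Both specs end up in the same convention while the original measurements stay untouched inside `RawSpec`, so a different conversion rule could be applied later.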
As long as it's well specified enough to manufacture, it's well specified enough for us to convert it to our standard way of doing it.
Now, you have a personal history before you joined True Fit. You've been in a number of different industries, including pharmaceuticals. So I guess, as we've always felt, although it's counterintuitive, focusing on the data is kind of universal. And although you're going to bring a little bit of fashion to this, our audience isn't going to be just fashion people. So talk to me about the best practices in terms of how you and your team began to focus on and manage this just massive amount of data.
I think the most important thing I've learned throughout my career, in terms of dealing with large amounts of data and normalizing it, is that you never throw away the source data. Never, ever. You don't use it for work, because you've got to normalize it and convert it into what's really useful to you. But you definitely find, as you progress in dealing with the data, that there are ways you wish you had normalized it differently, or little bits of data whose importance you didn't really understand at first. So always having that raw data to fall back to, and then converting it afresh in a new way that reflects the things you've learned, is really key. I mean, it's a simple thing, but it's really one of the most important things I've learned in my career, and it has served me well in lots of different positions.
What about your training data set and how often you rework your model? Because I never thought about that in terms of something new coming up. Maybe I'm just thinking about an open back, or a particular detail that you might not have even thought about. How do you think about training the system and constantly staying at the leading edge of fit and what's happening?
I'll give you an example from the open back, for instance. When you're worried about fit, the open back is probably not that big an issue. A strapless dress fits very differently from a dress that's got shoulders, but a partial open back doesn't matter too much when you're just worried about fit. So early on we didn't really worry about it too much. Since then we've gotten much more interested in style, helping people find dresses, for instance, that we think they're going to like based on what we understand about their personal style. Now something like that matters a lot, because it may be that you have a preference or a lack of a preference, and that's something we want to be able to understand and capture. So we need to have that in our model of what that dress is. So a lot of times what forces you to expand your model is starting to use the data in different ways.
And actually you're bringing me to my next question, which is: now we've got this data, I mean gobs and gobs of it, and you don't want to manifest that to the end user. You want to make it simple for the end user. You want to be trying to, I guess, come to this sense of style, whether it's a dress or a shoe or something, so that it all aggregates into something for the consumer. So I get that Netflix could figure out the kind of movies I want to watch by predicting, by looking at the data. How much more challenging is it, with the size of the data you have, to come up with these sorts of predictions, whether the output might be style, or what are the different types of output that you might come up with for the end users of the system?
Right. The challenge in fashion generally is that the data are much thinner. There just aren't that many movies compared to how many different dresses, tops, and pants there are. So a typical movie might have tens or hundreds of thousands of users in something like the Netflix database, whereas a particular dress that was sold one year at a particular retailer might have 20 people who bought it, if it's kind of high-end. If it's more sort of medium, maybe it's got a few hundred. So you have a situation where there's a lot less data about each style and a lot more styles to deal with. And one of the ways around that is what we've been talking about, and that's having attributes that help you understand the style not just as, well, something where I don't know what it is. Instead of just saying it's a dress, you have a way of understanding it as a dress that has these properties: it has this shape, it's something that fits tightly or loosely, it has an open back, it's sleeveless, whatever it is. And having all those properties allows you to start trying to model why it is people like something in terms of those properties, even if they themselves wouldn't express it that way.
Now, I like this depth versus breadth. The breadth of options, the breadth of choices you can buy in terms of dresses, with a slimmer number of people making each choice, is actually a more complex problem to solve, perhaps. How do you get the team focused on the type of algorithms and moving along then? How do you work with this thin data?
It's really about trying to produce what we call features, in the data science lingo, that are relevant to the problems you're trying to solve and which you can extract from every possible example of the clothing you're working with. That at least enables you to convert clothing from just "well, here's this item and I have a picture of it" into a set of numbers, really, about all the features that garment has. And if you're good at your feature engineering, you can essentially help the machine learning algorithm get at the underlying truth of what's going on. So it's much easier for an algorithm, for instance, to have a list of features about what the sleeves are like and whether it's an open back rather than to have just an image. There is value in an image, and there has been recent work on actually doing a fair amount of impressive stuff with nothing but an image, but it certainly is easier, especially early on, to have more and have these features to work with.
That's great. That's great. And I was thinking about, I'm going to use the word feature not from the data science and machine learning point of view but from another point of view. I can see how things change, and this is really about staying on top of this data. So I've got a couple of questions for you. The first one is: somewhere along the line, and I remember when Air Jordans came out, there was a sneaker that had a bladder that you would pump with air. How would you discover that? How would that show up in terms of what you have?
So if you were to look at our current features for describing shoes, we have a whole bunch of the feature set that surrounds cushioning. So initially, probably what you would see is that we would code those air bladder shoes as having really thick cushioning, but we wouldn't necessarily distinguish them immediately as air bladders.
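The idea of converting a garment into "a set of numbers" can be sketched with a simple one-hot encoding of a few attributes. The attribute names and vocabularies below are hypothetical, chosen from properties mentioned in the conversation (sleeves, tight or loose fit, open back); a real feature set would be far larger.

```python
# Hypothetical attribute vocabularies, for illustration only.
SLEEVES = ["sleeveless", "short", "mid", "long"]
FITS = ["tight", "regular", "loose"]


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == option else 0.0 for option in vocabulary]


def encode_dress(attrs):
    """Turn a dict of dress attributes into a flat numeric feature vector."""
    return (
        one_hot(attrs["sleeve"], SLEEVES)
        + one_hot(attrs["fit"], FITS)
        + [1.0 if attrs["open_back"] else 0.0]
    )


dress = {"sleeve": "sleeveless", "fit": "tight", "open_back": True}
vector = encode_dress(dress)
# vector has 4 sleeve slots, 3 fit slots, and 1 open-back flag
```

Two dresses that few people bought can still be compared through vectors like this, which is the point about attributes helping when per-item purchase data are thin.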
But there's another piece of the system that looks at the text that describes the shoes and uses that as features. So that part of the system would start picking up on words like "air bladder" and be able to use that as part of the algorithm.
But you had cushioning. That's fantastic. Now, all humans, I should not categorize this as all humans, but most people, many people, get a little heavier over time. How do you think about that when you're working with your manufacturers, who might be changing sizes? Because I think I read something that the sizes are changing, how things are described, as people want to buy a smaller size but it's bigger, or want to buy a bigger size because it makes them think they've got muscles or whatever. Or the fact that people change themselves. So there's kind of two sets of data changing: the people themselves and their bodies, and how the manufacturers label things. Does that all just come back down to the specs?
Well, the specs will get some of that, but there is more to this. I think, as you say, people change. Our preferences also change, right? And at some level it doesn't matter. If a user is wearing a medium-sized garment, I don't need to know, from the perspective of the algorithm designer, whether they're wearing their clothes very tightly because they like it that way, or very loosely. They're buying mediums, right? And if I can understand what is the unifying information about the mediums that they like, and how that differs from the ones where they need to buy a large, or the ones where they need to buy a small, I can put it all together without really having to know what medium means.
Right. It's really just a clustering or categorization coming out of the machine learning model. It's really not my view of medium. It's your data's view of medium, correct?
And what you tell me about medium is really about the clothes that the system observes you buying and keeping that happen to be mediums. But of course, the system knows about all the clothes, and it has the ability to say, okay, if he wants this new shirt that he's never tried before to fit like the mediums he wears, he needs to buy a large in that shirt.
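As a rough illustration of "fit like the mediums he wears", the sketch below picks the size of a new shirt whose chest measurement is closest to the average chest measurement of garments the user bought and kept. This is a deliberately simple nearest-match on one measurement, with made-up numbers; the system described in the conversation is a machine learning model over many features, not this rule.

```python
from statistics import mean


def recommend_size(kept_chest_cm, size_chart_cm):
    """Pick the size whose chest measurement is closest to what the user keeps.

    kept_chest_cm: chest measurements of garments the user bought and kept.
    size_chart_cm: mapping of size label -> chest measurement for the new item.
    """
    target = mean(kept_chest_cm)
    return min(size_chart_cm, key=lambda size: abs(size_chart_cm[size] - target))


# Mediums the user kept (hypothetical measurements from normalized specs).
kept_mediums = [104.0, 106.0, 105.0]
# A new shirt from a brand that cuts smaller (hypothetical size chart).
new_shirt = {"S": 92.0, "M": 98.0, "L": 104.5, "XL": 111.0}

size = recommend_size(kept_mediums, new_shirt)
```

With these numbers the new shirt's large is the closest match to the user's kept mediums, and the code never needs to know how tightly the user likes to wear clothes.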
I still don't know how big you really are, and whether you're wearing it tight or loose, but it doesn't matter. I don't need to know, right? What I'm trying to do is line up the properties of things you like with things that you don't know yet and help introduce you to them. So it does not need to know some of that. On the heavier bit, your more recent purchases count more than what's happened in the past, and that enables us to track the slow changes in our bodies over time.
As we think about our audience in terms of industry, we've got people listening in supply chain, we've got people listening in e-commerce. Hopefully we've got a breadth of listeners here who are all kind of faced with the same problem of managing the data and staying on top of the evolution of the data. So what would be your key takeaway for one of our customers? You've been in many industries. If you kind of sit back and neutrally look at your data, what matters the most to a senior data science CAO looking at the data?
I might have to run with more than one thing. One thing is being able to develop good features. That requires that you really have normalized your data into some common set of features, like we described with the specs, for instance. We also generate features by having humans look at pictures and descriptions of garments. We generate features by looking at the text that's used to describe them. So you need a way to bring all that stuff into a feature set that's reasonably consistent across your world. The next step is really being very careful about algorithm selection. You need algorithms that meet your business needs. Various algorithms are going to be better or worse at different kinds of performance, and better or worse at generating recommendations very quickly. So in our world, within the True Fit world, we need algorithms where we can do more of the computing offline, creating a bunch of artifacts that, when a user shows up on a web page, we can very quickly put together and get a recommendation out, so that they don't notice the 20 or 50 seconds it took us to compute. There's no time in the web world to say, hold on, I need to do 20 seconds of computing. So we kind of work that way. But in another business context, you might have situations like, you know, when we were tracking ships moving around. In that context, ships don't move very fast. You get updates on their position once every hour or two, so you can do an awful lot of computing about the whole world full of ships in an hour. So there's a lot of time, and a lot of techniques that you can employ, when you have a problem that looks like that versus a problem that looks like consumer service through a website. So figuring out algorithms which are likely to meet your performance criteria is really important.
Both in terms of the quality of the predictions the algorithms make and the time it takes to generate them.
Exactly.
You don't want to have problems with cart abandonment. You do not want to be the cause of another element of cart abandonment.
Right, exactly. Or of just failure to press the buy button in the first place because the page didn't load, or whatever. The last piece is really a question about personalization. You can make things look a little bit personal by clustering the users and clustering the garments and just saying, okay, well, I've got five clusters of users, this guy's in cluster two, cluster two people do this, send that rec out. The problem is that unless you have an awful lot of clusters, it doesn't feel very personal. So you have to figure out, for your business need, can you get away with that, or do you need to do something that's a lot more personalized feeling? And then you need to talk about how to make algorithms that can learn information across all the clusters and produce something that is, I won't say unique to a user, but almost unique to a user, and there's probably somebody else who has exactly your profile because there are so many customers, right? But it's something that's really more refined to a particular user, that takes into account all the features that you know about that user, rather than something that just buckets them into category five. And the temptation, especially in retail, because the data are so thin, is to do that. But the problem is then you lose the real personalization. So one of the things we try to do is make sure that in our algorithm choices, we're being as personalized as the data support being. And I say that because in reality what you end up doing is having fallbacks. So when a user is fresh to the site, I have one algorithm that I can use to do the best I can in that case, which is not very personal because I don't know much about them. Another kicks in when I've got some more information, and another algorithm kicks in when I've got even more information. So having the ability to make the most personal and the best recommendation you can make at different levels of data availability ends up being important in this space.
That's great. Thank you so much.
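The fallback idea Chris describes, one algorithm for a fresh user and others that kick in as more data becomes available, can be sketched as an ordered list of tiers. The tier names, thresholds, and placeholder strategies below are hypothetical; the conversation does not say how many tiers True Fit uses or where the cutoffs sit.

```python
from typing import Callable, List, Tuple


# Placeholder strategies; real ones would call trained models.
def popular_items(kept: List[str]) -> List[str]:
    return ["best-seller"]


def cluster_items(kept: List[str]) -> List[str]:
    return ["cluster-favorite"]


def personalized_items(kept: List[str]) -> List[str]:
    return ["picked-for-" + kept[-1]]


# Ordered from most to least personal: (minimum kept items, name, strategy).
# The thresholds 5 and 1 are assumptions for illustration.
TIERS: List[Tuple[int, str, Callable[[List[str]], List[str]]]] = [
    (5, "personalized", personalized_items),
    (1, "cluster", cluster_items),
    (0, "popular", popular_items),
]


def recommend(kept: List[str]) -> Tuple[str, List[str]]:
    """Use the most personal strategy that the user's data can support."""
    for min_items, name, strategy in TIERS:
        if len(kept) >= min_items:
            return name, strategy(kept)
    raise AssertionError("unreachable: the last tier accepts any history")
```

A brand-new visitor gets the popular tier, a user with one kept item moves to the cluster tier, and a user with five or more kept items reaches the personalized tier.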
Thanks for listening. If you like what we're doing, we'd appreciate your telling other people about the show. Better yet, give us a five-star rating on iTunes or write a review. Doing so will really help ensure that more people can find us. And if you haven't already, please subscribe to the Impact Podcast on iTunes, SoundCloud, or wherever you go to find your podcasts.