MLOps with Demetrios Brinkmann and David Aponte
Jon Prial: Today, a very interesting topic. And please, when you hear the word Ops, do not stop this podcast. Surely you've heard of Systems Ops and DataOps, but in this new world of machine learning and AI, where data basically writes code and algorithms, MLOps has emerged as something much broader. It's important, and so different, that a huge community has developed around this space. Our guests today help run the MLOps community, which includes co-hosting a podcast. I might be outnumbered, but hopefully I'm not too intimidated. I'm Jon Prial; welcome to Georgian's Impact Podcast. Gentlemen, great to have you here. And it's my pleasure to leave your introductions to each one of you.
David Aponte: Well, my name is David Aponte and I'm a machine learning engineer. I work on machine learning infrastructure at a company called BenevolentAI. I'm also an organizer for the MLOps community, helping develop content and helping organize things. And yeah, I love what I do. I love to build stuff, I love to code, and I love to solve hard problems.
Jon Prial: Well, since this recording, let me just let you know: David has taken on a new role. He's now a software engineer at Microsoft, focusing on MLOps.
Demetrios Brinkmann: My name is Demetrios Brinkmann. I am the self-titled chief happiness engineer at the MLOps community. Basically, I go around and try to organize everything and make sure it runs without a hitch. I'm also doing a lot of stuff in the ethical AI space: I work with a company called Ethics Grade, and we're grading different companies on their ethical AI initiatives.
Jon Prial: Running a community, doing interesting jobs. Wow, that's just great. Let's get started. Who can define for me what the MLOps community is? How big is it? Tell me about the MLOps community.
Demetrios Brinkmann: Yeah, we recently turned one year old, and we are pushing around 3,600 people in Slack right now, which is the main place where we congregate. We've got another 2,100 to 2,200 subscribers on YouTube and, I think, 300 to 400 people in podcast land. But the spot where we really have a lot of the community feel is in Slack. That's where people can come: they ask questions, they get answers, we make jokes. We have memes about different things happening in the industry, and so that's the spot.
Jon Prial: And just to get a sense of the breadth besides the people, how many channels are in that Slack?
David Aponte: Yeah, I mean, we have all sorts of channels. One for Fresh Jams, if you want to listen to some good music. Another one just for vendors, if you want to talk there. We have a shameless channel. We have an MLOps questions-and-answers channel, which is probably one of our more popular channels. We have an engineering labs channel, a learning resources channel, an open source channel, a data science channel, explainable AI. I mean, we've got so many, and there are new ones popping up all the time. It's great.
Demetrios Brinkmann: But my favorite these days is probably the bad startup ideas channel.
David Aponte: That's a really funny one, actually, yeah. It's a good one.
Jon Prial: Excellent. There's a great reason for people to go. So I'd like to get a definitional statement; let's talk a bit about MLOps. The community has been around about a year, and I don't know how long MLOps has been around as an entity, but maybe the best way to help me and our audience is to compare it with DataOps. My view is, on the data side there are data scientists and data analysts; they've been around. There are the Clouderas and the Splunks and things like data lakes. But now we have this MLOps. What's the same and what's different in terms of how these people do things?
David Aponte: I think that's a great question. It's probably one of the first questions that most stakeholders will ask when they hear words like MLOps. Everyone's going to have a bit of a different answer to this, so I'll give you mine. I would consider MLOps as covering not just the model component, but the entire life cycle of a machine learning workflow: everything from getting the data that you need to train your model, to outputting those predictions and monitoring them. All of that requires lots of different disciplines, lots of different priorities, and lots of different stakeholders. Data engineers, machine learning engineers, data scientists, product managers, and even senior leadership should be involved. A big part of what I would say MLOps is, is unifying machine learning system development and machine learning system operations. The development side is really challenging to organize because most of it is research. Most of it is still experimental: you're iterating on ideas, trying new ideas out, and that is science. We've spoken to people who are very adamant about communicating that machine learning is still research, and I think that's important to echo. It's been around for, I guess you could say, a while, but not in its full form, and definitely not at the scale at which people are using it now. Machine learning is going into almost every industry. So going to your question about the difference between MLOps and DataOps, I actually don't see much of a difference between them, though it is somewhat useful to differentiate them. I would argue that MLOps encapsulates the DataOps component. For people who think it's better to separate them, one of the biggest differences is that MLOps has that additional layer of the model. When a model learns, it learns its behavior from the data, and sometimes there's randomness involved, so it's a little bit different from traditional software. Another layer is that the machine learning component is closely tied to the data component: the quality of the data affects the quality of the model. And then there's the operations side, the monitoring, the infrastructure, all the things that support that for all sorts of environments and business use cases. There's a little more precedent there from the world of DevOps and mature software engineering. But in general, I see a lot of these things overlapping. So sorry if my answer isn't that helpful, but I would argue that MLOps actually captures the data component of the machine learning life cycle.
Demetrios Brinkmann: It's interesting that you say that, because within MLOps you can nest DataOps, but within DataOps you cannot nest MLOps. You can have data without necessarily doing machine learning on it, but MLOps implies that you're doing machine learning with it.
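To make the lifecycle David describes concrete, here is a minimal sketch of those stages: data ingestion, training, an evaluation gate, and monitoring. Everything in it, the function names, the toy data, and the 0.9 promotion threshold, is a hypothetical placeholder rather than any real system's API.

```python
# A toy sketch of the ML lifecycle: ingest -> train -> evaluate -> deploy gate
# -> monitor. All names and data are hypothetical placeholders.

def ingest_data():
    """Pull and validate the raw data the model will train on."""
    return [(x, x % 2) for x in range(100)]  # toy (feature, label) pairs

def train_model(data):
    """'Train' a trivial model; in practice this is the experimental, iterative part."""
    return lambda x: x % 2

def evaluate(model, data):
    """Score the model so an automated promotion decision can be made."""
    return sum(model(x) == y for x, y in data) / len(data)

def monitor(model, live_inputs):
    """Watch live predictions; real systems also check for drift and regressions."""
    return [model(x) for x in live_inputs]

data = ingest_data()
model = train_model(data)
accuracy = evaluate(model, data)
if accuracy > 0.9:  # promotion gate: only deploy a model that passes evaluation
    outputs = monitor(model, live_inputs=range(10))
    print(f"deployed; accuracy={accuracy:.2f}, sample outputs={outputs[:5]}")
```

The point is not the toy model but the shape: each stage is a separate, automatable step, which is what lets operations practices like continuous integration and delivery wrap around the science.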
Jon Prial: Let me have Demetrios comment on this. David just talked about a much broader set of capabilities, for example, product management, which in theory could be part of DataOps, but DataOps feels a little more narrowcast to me: go to the DataOps team, get what you need, and you're done, whereas here there's a whole product management dimension. Maybe that's the nature of iterating with ML: the data keeps changing and you keep refining your models. Versus, and I don't want to demean DataOps, once you get your report done, you're sort of done. Demetrios, do you think MLOps is something much broader, going beyond just ML into this whole product world?
Demetrios Brinkmann: Well, I think that's a great challenge these days with machine learning: the product management. That's something we talk about in the community quite a bit, because it's a moving target, and we say that a lot. It's never so clear. It's called data science, right? There's a science behind it. With machine learning, when you're creating these models, it's never clear if you can do what you really want to do, if you can create a model that has the accuracy you need or is able to predict the things you need it to predict, because you may not have the data, or it just might be more difficult than you anticipated. So product manager in the machine learning field is a very interesting job to have, and tip of the hat to them, for sure, because it's not easy.
David Aponte: I would just add to that: one of the things that makes it a little more challenging when you have machine learning or AI in a product is that the outputs are all learned behavior from the data. Then you get into questions like, "What is it learning? Is it learning things that we want it to learn? What are we optimizing for?" A good example is YouTube. I forget when this happened, but when they started using neural networks, they were optimizing for engagement, and it really worked. You watch some YouTube videos, you will go down that rabbit hole; they're good at doing that. But there's the interesting question of, "Is that what we want to optimize for? Do we want to optimize for diversity?" We want to make sure that what these systems are learning is ethical, that it gives a fair representation of whatever it is they're learning. It's all of these scientific questions, these quality-concern questions, that I think are outside the scope of traditional software development. So a product manager now has to take on that additional complexity, has to understand these additional questions that may be outside the scope of a traditional software product. And because the machine learning is embedded into the product, if not the heart of what that product is doing for a lot of applications, it becomes even more complicated. We don't always know what a model is learning, what the outputs are, and whether those answers are the answers we actually want. Are those predictions what we want it to do? Do we want it learning what we think will keep someone engaged, when it could potentially lead them down paths that could be harmful? There was a podcast about some extremists getting influenced by what was being recommended on YouTube. So again, it could have some really big implications for what we're exposed to, because the algorithm is essentially narrowing down that search space. It's telling you what to watch, in a lot of cases, and it's so good at what it does that we end up watching it. But there are some interesting questions around, is that what we want it to do? Was that really the right thing to do? These are ethical questions now. So again, it just becomes more complicated, I think.
Jon Prial: For that podcast, I think you're referring to the New York Times series called Rabbit Hole, and I do recommend it; it was an astounding podcast. But you covered two pieces there. One was accuracy, that the model is working, and then there's bias. I could be accurate and give you heart attack data for white males because that's what's in the data, and maybe that's useful if I'm a white male going into a doctor, but maybe not useful if I'm anything other than that. There's an element of accuracy and an element of bias. How do you address that? How do you think about that? How does the community think about those two sides of the problem?
David Aponte: That's a great question, a very hard question, and it's something we're still working on as a community. This is where I think MLOps is going to get really interesting, because this is not just about engineering. It's about the science behind all of this engineering. You asked, what are we trying to do? A lot of what MLOps is really trying to do is automate some of these processes to enable that sort of DevOps workflow, where you're continuously integrating and continuously delivering. But when you have that and you have all the science there, now you need to try to automate the science: automatically validating the quality of the data, automatically validating the quality of the model, knowing whether it's good to promote or needs to be retrained. You can look at it as a pipeline of specialized components, each focusing on a specific part of the problem. You mentioned bias. So there are tools dedicated to detecting bias, quantifying bias, and using that to allow a team to develop even a legal strategy. And there are other products or services focused on automatically detecting whether your model's outputs show model drift. So there are all sorts of services specializing in specific parts of the problem, and some of them are still experimental. Say, using machine learning to detect bias: that's something I've seen in a paper, and it's really cool because there's not much precedent for it. You have to think about developing these things from scratch. So there are engineering challenges, but there's also the science: all these new techniques for understanding the quality of the data and the quality of the model. I think with MLOps you're going to see more and more of these research fields specializing in one specific component of the machine learning workflow, like bias. Another one I saw is this product called Ease.ml, and they have a whole ecosystem of tools dedicated to these different components of the ML workflow. We're going to see more and more of that as the field matures; it's still in the early phases.
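As a small illustration of the bias quantification David mentions, here is a hedged sketch that compares a model's accuracy across subgroups, in the spirit of Jon's heart-attack example. The labels, predictions, group names, and 0.05 gap threshold are all made up for illustration; real bias tooling uses much richer fairness metrics.

```python
# A toy sketch of quantifying bias: compare per-group accuracy and flag a gap.
# All data, group labels, and thresholds below are hypothetical.
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy for each subgroup of the evaluation set."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical evaluation data: true labels, model predictions, group attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

scores = accuracy_by_group(y_true, y_pred, groups)
print(scores)  # {'a': 1.0, 'b': 0.5}: accurate overall, but unequally so

gap = max(scores.values()) - min(scores.values())
if gap > 0.05:  # threshold is an assumption; real policies are more nuanced
    print(f"warning: accuracy gap of {gap:.2f} across groups")
```

This is exactly Jon's point: a model can look accurate in aggregate while performing much worse for one group, which is why per-group evaluation belongs in the automated pipeline.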
Demetrios Brinkmann: I will say, too, that every meetup we've had so far that focuses on the ethical issues has been among the lowest attended. I find that fascinating, because it feels like for the engineers, the ethical concerns aren't really on their mind. They're more about, "Hey, how do I build? What's the best way to optimize this and make it better?" But like David is saying, this is something that needs to start coming into the conversation, and it's already starting; it's coming up more and more. So we're going to be seeing it happen, and it should be top of mind for engineers. We'll just see if the interest is there and if it picks up.
Jon Prial: Are you making an overt outreach to diversify the community, so it's not just engineers but also sociologists, ethicists, lawyers? Do you feel like you need to do more active recruitment?
Demetrios Brinkmann: We haven't done that. Now that you say it, it's probably something that could be interesting. I mean, I feel like the community at the moment is so technical, and we have only one or two of these ethics-adjacent channels, like explainable AI and such. There isn't something for that breed of person who is very much into the ethical side, but I do see that there needs to be a marrying of the two, the ethical side and the technical side. I feel that is something that definitely needs to happen.
David Aponte: Yeah, just thinking about the right team structure: what type of people should you have on a machine learning team or a machine learning infrastructure team? I definitely think there needs to be a diverse set of stakeholders. MLOps is not just a technical problem. We've talked about this on other podcasts, but I think it's important to reiterate here. Like you're mentioning, we don't need only engineers. We need domain experts. For example, at BenevolentAI, where it's drug discovery, there's a lot of knowledge about biology, about chemistry, about how to develop drugs, how to test drugs. That industry has been around for a while, and I think you need that when you're thinking about, "Okay, what data do I want to get? What models do I want to build?" So you need a mixture of those things. This new concern with ethics, I also think that's really important. Your company could be liable if you're not thinking about some of these things, so it's becoming more and more important. And like you're getting at, we need to diversify the people included in these ML teams so that it's not just a bunch of tech people, because then we keep getting more of the same stuff. We need some different perspectives. I think some companies are already doing that. When we look at companies like Google or Microsoft and the sorts of roles they're developing in this field, there are all sorts of new roles coming out that work alongside machine learning teams and are not just technical stakeholders. That says something. That says this is not just a technical problem.
Jon Prial: And in addition to the diversity of roles, just on more of the basics of MLOps: when I think about agile development and scrum masters, it's not necessarily a programmer that has evolved to be a scrum master. It could be someone with really good project manager skills. Do you see that evolving in MLOps, so that people are project managing and becoming the right type of leader in an MLOps role? And what backgrounds work best?
David Aponte: So first, on what type of backgrounds work best: it depends on what type of work you're doing. Let's say you're working as a machine learning engineer, an MLOps engineer, whatever the role is, or a data engineer; obviously, having a good understanding of software engineering goes a long way. But then, as a lot of us know, machine learning is also math. There's a lot of applied math, so you need to understand some of those things. I would say even people involved in the monitoring stage should understand them: a good understanding of basic statistics and distributions, and of how to evaluate the quality of the data at every level. That could even apply if you have an SRE type of person on your team, responsible for maintaining the reliability of your machine learning infrastructure. They will often have to monitor how much CPU or memory you're using, and there you need analytical skills. But if you're monitoring the distribution of some predictions or some input features, then you need to know more about the data science side of things. So it really depends on the role you're aiming for. There are lots of different roles, and I do think more and more product managers and project managers are coming into play. It must be challenging, because you now have to understand not just product software development and those ways of working, but all those added pieces of complexity. So I think a product manager for a machine learning product should have a decent sense of the basics of what's going on: the data science, maybe a little bit of the math, how to monitor it, and the problems that can come up in a system like this. That way there's more common language. If it's just the engineers focusing on these things, it's not going to be solved in, I guess, the best way.
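For the distribution monitoring David describes, a common pattern is to compare a live window of a feature (or of the model's predictions) against a training-time baseline with a two-sample statistical test. Here is a minimal sketch using SciPy's Kolmogorov-Smirnov test; the synthetic data and the 0.01 alert threshold are assumptions for illustration.

```python
# A minimal sketch of distribution monitoring: compare a live window of an
# input feature against its training-time baseline with a two-sample KS test.
# The synthetic data and the 0.01 p-value threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature at training time
live = rng.normal(loc=0.5, scale=1.0, size=1_000)      # feature in production, shifted

stat, p_value = ks_2samp(baseline, live)
if p_value < 0.01:
    print(f"possible drift: KS statistic={stat:.3f}, p={p_value:.2e}")
else:
    print("no significant drift detected")
```

In production a check like this would run on a schedule for each feature, with alerts feeding back into the retraining decision David mentioned earlier.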
Demetrios Brinkmann: Yeah. A lot of the time when I talk to product managers, they'll say half, if not more, of their job is just being a translator: translating from the business side to the engineers or data scientists. Because what you don't realize is, the business side will say, "Hey, we want x, y, z." And maybe the data scientists can do that, they can create a model that does that, but the request doesn't translate directly, right? You have to be able to understand what they're really looking for. As David said earlier, you have these deep subject matter experts who are asking for something, and then you need the data scientist, or someone, who is able to understand exactly what they're asking for and how to make it a reality.
Jon Prial: We've got data centers, we've got hybrid and private clouds and public clouds, and yet we're constantly iterating on these models, and I get nervous about how something goes into production. But don't forget, I'm an old-school person; as you know, my days of development were waterfall. So it's a different world here, and now it's fast-moving. How do you even think about things like that?
David Aponte: Well, you just mentioned the cloud, right? How is the cloud affecting all of us? I think it's accelerating the rate of development. For example, think about scalability and building systems that are robust to millions of users. Some of these websites or companies can have 50 million users in a day. When you have that much scale, it's hard to do it all on your own. Moving your workloads to the cloud essentially lets you have infinite compute, infinite resources, for the money. And that's the thing: you have to pay a lot of money for the cloud, for the managed services, and even for compute itself, whether you just use a virtual machine and run your workloads on that, or stand up infrastructure using your own Kubernetes cluster, which is what we do at Benevolent. Either way, it allows you to scale out. It allows you to meet the demand of the traffic we're dealing with today. There are so many users, and it's only growing, so you have to think about how to operate at that scale. Realistically, more and more companies are moving their workloads to the cloud because the providers have great services. A lot of these providers, like GCP, AWS, or Azure, have so many great, even MLOps-specific, managed services that can handle the scale some of these big companies have. And I think that's changing things a lot, because before, you either had your own data center or your own server, your own rack that you were maintaining, and you needed experts maintaining it and making it better. Now you can move that cost to AWS or GCP, and they're really good at what they do. The services they provide are awesome. I think that's going to make machine learning development a little easier, as there's now access to specialized hardware like GPUs and TPUs. These are really expensive, and not every person has access to them, but the cloud gives you that. They're even giving offerings that let people start using the cloud: if you sign up for GCP, you get $300 worth of credit, I believe, and you can just spin up and start playing around with things. So I think it's lowering the barrier to entry, and it's also accelerating development, in particular for machine learning, because of the specialized hardware and all of these services that are really complicated to stand up yourself.
Jon Prial: Well, the accessibility of this is a big deal. This was short and sweet and, I think, really great content. Let me ask Demetrios: we'll put the MLOps community, your podcast, and such in our show notes, but give me a wrap on why we should get more people following what the two of you are doing, which I think is really great work.
Demetrios Brinkmann: Right now it feels like what we're trying to do is demystify this space. It's so new, and there are so many different tools, so many different processes, so many different architectures from all these companies that are out front, that we can say are leading the way. They're coming out with blog posts; they're showing the ways they're doing it. But if you're at a startup, or you're a one-man band, it can be difficult to figure out how to bridge that gap. You're looking at the tools on offer and you don't understand which ones are right for your need right now, because it's so confusing. There are so many different options and so many ways to do this that it's really difficult to figure out what's best for your use case. So that's really what we're trying to do: demystify this space so it's a bit clearer, and so people have a place to ask questions and learn from the ones who are a little further ahead of them in this journey.
Jon Prial: Yeah. That's just a great close, and I think it will hopefully drive more people to your sites to get this information. Clearly the world of programming has changed: it's data-driven, it's ML-driven. We've got to get it right, we've got to iterate, we've got to keep the bias out, and there's so much to be done. I think it's such an exciting space. David, Demetrios, thank you so much for taking the time to be with us today. It was just a pleasure.
David Aponte: That was an absolute pleasure. For anyone listening, feel free to reach out to us. We highly encourage you guys to check out the MLOps community, pop in, introduce yourself and ask your question or just say hi.
Jon Prial: And I think on that note, if you want more information, we'll get it out there in our show notes for the MLOps community. Fantastic dialogue. Just thank you so much.
DESCRIPTION
David Aponte and Demetrios Brinkmann are our guests on this episode of the Georgian Impact Podcast. David is a Software Engineer at Microsoft with a focus on MLOps. Demetrios is the Community Coordinator for the MLOps Community and also works in the ethical AI space. Together they will break down MLOps and why it is so important.