The nitty-gritty of fine-tuning a GenAI model
Jon Prial: The material and information presented in this podcast is for discussion and general informational purposes only, and is not intended to be and should not be construed as legal, business, tax, investment, or other professional advice. The material and information do not constitute a recommendation, offer, solicitation, or invitation for the sale of any securities, financial instruments, investments, or other services, including any securities of any investment fund or other entity managed or advised directly or indirectly by Georgian or any of its affiliates. The views and opinions expressed by any guest are their own and do not reflect the opinions of Georgian. We've all heard about how generative AI is changing business. If you crack open the door and peer in on the AI teams, you'll see them playing with models. And no, we're not talking about planes and trains. We're talking about providing the correct inputs necessary to drive desired outputs in an AI model. In this podcast, we'll cover whether GenAI has changed how we think about data moats, and the role fine-tuning plays in helping companies find use cases for GenAI. For background, fine-tuning is taking a generic foundational model, like OpenAI's GPT-4, and training it on your own specific data so you can tailor the foundational model to your use case. Not easy, but this is really interesting stuff. I'm going to call this the high-level nitty-gritty. So let's open the door the rest of the way and sit down with Georgian's Rohit Saha. Rohit, tell me about what you do on Georgian's AI team.
Rohit Saha: I'm Rohit Saha. I'm one of the ML scientists on Georgian's R&D team. I get to work with our portfolio companies, which we consider our customers, to help them accelerate their data science roadmaps. What that means is I work with the data science teams to help them scope research problems. Over the last two and a half years, as Georgian has continued to invest in more companies, I've had the privilege of working across computer vision, natural language processing, and large language models, and I get to see commonalities across different projects and bring solutions that can scale faster.
Jon Prial: I'm excited that you mentioned LLMs, because I really want to get into that. Before I do, I just want to step back a little and see if the paradigm has changed. This explosion of LLMs, which is really the whole generative AI thing, has somewhat changed the landscape. But we have been working with companies in the world of ML and AI for a while, and a lot of what we do with them has been figuring out how to get the model right: these are perhaps smaller data science and AI teams working out how to do it, what data to use, how to build a data moat, how to get the right size of team. So have things changed? Before, the focus was on a model; now there are LLMs in the picture. Has it changed or has it not changed?
Rohit Saha: A couple of years ago, companies that had a data moat had the opportunity to leverage ML or AI models to build solutions on top of it. That bit, Jon, has not changed over the last few years, and it has not changed even today. Companies that hold a data moat can leverage large language models as yet another tool in their toolbox. Yes, LLMs do have some benefits over conventional ML systems, and maybe we can get to those later on. Aside from that, with regards to teams, if I look at just the data science bit, just the modeling piece, nothing has changed. If you had two or three data scientists on your team three years ago, before the LLM or GenAI boom, you still need those two or three data scientists today: people who have a good understanding of your data moat and of the domain in which you operate. The part that will change is the engineering bit. Why? Because traditionally these models were very small. Today these models are huge. What that means is that in order to deploy these solutions, you need more engineering hands to ensure that these models are making predictions fast enough, which is a challenge.
Jon Prial: Interesting. I like the challenge. When people think about LLMs and generative AI, I think our audience automatically pictures what consumers are doing playing with this stuff: individual employees interacting with a GenAI system, students doing their homework, some kind of cool productivity thing. But here, we're staying focused on businesses. Georgian invests in high-growth B2B software companies, and of course we want our companies to find solutions to their business and technical challenges so they can grow and find some differentiation. So you say these LLMs are big, but as you dig in, what can companies layer on top of an LLM? How do you begin to help them find efficiencies, and find the solutions they could use in this new world, leveraging this new technology?
Rohit Saha: There are two buckets. The first bucket, Jon, is that you can leverage large language models to enhance productivity, as you mentioned. What that means is, if a software developer on your team previously took, let's say, two hours to complete a piece of code, today, with the help of large language models such as ChatGPT, you can potentially do that in much less time. If you are writing a draft of an email you want to send, you could use a GenAI feature, a large language model, to help you create that first draft, keeping in mind the audience to whom you're writing. These are productivity hacks, and as we have seen, there are a bunch of services out there offering them free of cost. What I'm more interested in is bucket number two, which is: how do you leverage these language models to actually build enterprise software? Because now businesses can actually start to charge customers for the solutions they build on top of this. What does that mean? Well, you could say, "Hey, Rohit, in my company we already use machine learning models, and we have been servicing our customers for a very long time. Why should I even consider switching over to a large language model?" And I say, Jon, it's because these models are extremely powerful. You don't need as much data to start getting value out of these models. What that means is, if you have a use case for which you don't have enough data, well, you don't have to wait. Because these models are so powerful, they can start learning patterns from your data right away. You don't have to wait, like you used to, to gather huge amounts of data and then run it through a labeling process with human annotators and whatnot. You can suddenly hit the ground running with very few annotated samples and bring value to your customers much faster.
Jon Prial: So I have the LLM provided by some large company, but I've got my own company, and you're working with the data science teams, the AI and ML teams. They now have a toolkit they never had before, and they do have their own data. What are the techniques that are used? What is that called today? In the past there were a number of different techniques; we did a podcast a while back on transfer learning. How have things evolved? What are the types of tools? So if a CEO walks over and talks to her staff, what would the questions be? What are you doing? What tools are you using? What techniques are you using? What type of questions would a CEO ask her staff?
Rohit Saha: Well, if I consider myself a CEO for a few minutes, I guess what I would ask is: setting aside ChatGPT, if my company needs to build an enterprise solution so that we can bring value to our customers faster, what models should I look into? Do you think we can build a product on top of a third-party API such as OpenAI's GPT or Google's PaLM, or should we really consider building an in-house solution? Meaning, should we entertain the idea of using open-source models as opposed to being dependent on third-party APIs? What does that mean from a hiring perspective? Do we need special talent on our team who understand these large language models? These are the kinds of questions, Jon, that I would ask my team, just to gauge, first of all, how much data we have to begin with, and whether we need extra talent, people who can deliver this product. So those are the kinds of questions I would start asking.
Jon Prial: And you talked about extra talent, but at the same time you talked about some things being productivity enhancers, where you might need fewer resources. Are the types of skills that our teams require evolving in this new world of generative AI?
Rohit Saha: I would say so, yes. Right now, the ability to take these models and fine-tune them on your use cases is being democratized by a lot of companies and a lot of open-source toolkits. The main challenge is to actually figure out how these models behave so that you can leverage them for your downstream tasks. So you have these toolkits, Jon, that you can leverage, right? But you need people who understand how to use these tools to get the maximum performance. Moving forward, as LLMs become a household name across data science teams, you will require people who understand how these models behave, what their shortcomings are, when to use them and when not to use them. So definitely, from a talent perspective, we will see a shift where data science teams will have to very quickly learn large language models: how to deploy them, the cases in which they work, and the cases in which they don't.
Jon Prial: You're really reiterating some of our critical principles around generative AI in terms of instilling this AI mindset and embracing this technology. One of the keys I'd like to understand a little more, and I think you touched on it, is creating a dynamic data strategy: how are companies going to leverage what they have, not get distracted, and focus on differentiation? How would you have companies go about that?
Rohit Saha: As a company, let's say you have identified a use case for a product you're building, and let's say you have managed to gather a couple hundred data points. You can actually leverage those data points and one of these open-source models to run the process of fine-tuning. As a matter of fact, the AI team at Georgian is working to create an open-source repository where we are cataloging open-source models. One of the ways to leverage these models on your data is through the process of fine-tuning, and methods such as QLoRA or prefix tuning are very popular ways for teams to start building on top of these models. So from a strategy perspective, Jon, what teams really need to do is identify the use case they're trying to solve and, just as with previous ML models, identify what goes into these models and what should come out of them. Once you have a good understanding of the input and the output, the fine-tuning bit is very similar to what we have seen in the past; there's just new code, new mechanisms you have to consider. Once you have that, you can deploy it to your customers and then gather feedback.
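To make that concrete, here is a minimal sketch of what QLoRA fine-tuning can look like in practice, assuming the Hugging Face transformers, peft, bitsandbytes, and datasets libraries. The model name and the train.jsonl file are placeholders for whichever open-source model and couple hundred examples you have.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: any open-source causal LM

# Load the frozen base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # some LLM tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach small low-rank adapters; only these weights are trained.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()  # typically well under 1% of the model

# A couple hundred examples with a "text" field can be enough to start.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                           max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="qlora-out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The design choice to note: the quantized base model stays frozen while only the tiny adapter matrices update, which is what makes fine-tuning a multi-billion-parameter model feasible on a single GPU.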
Jon Prial: It sounds like an even stronger view of iteration than what we've had in the past, where a model is built but you have to constantly test it and make sure you're evolving and not going down dark paths. And this is even more relevant now, because the degree of interaction end users have with this new technology is even greater. So I appreciate how you stay on top of testing and testing and testing, to make sure you've got the right models working for our customers' customers.
Rohit Saha: Absolutely, Jon. The data flywheel has always been talked about in the past: how critical it is to close the loop. You don't want to just start exposing predictions to your customers and then go grab a beer and call it a day. You need to get that feedback.
Jon Prial: Darn.
Rohit Saha: You need to get that feedback. You need to incorporate it into your model, because only then will the model be able to stay on top of user trends, of what your customers really prefer and what they don't want. We use the term data flywheel to represent that scenario where you train your model, you get predictions, you get feedback, you incorporate that feedback and retrain your model, and then the cycle goes on. As an ML scientist, I notice there is a gap in the market, where not enough work is happening to address the challenges businesses will come across as they start to incorporate more of these large language models into their day-to-day operations. So the AI team at Georgian came up with an open-source repository called the LLM Fine-Tuning Hub, where researchers can get access to the open-source large language models that are out there, along with instructions and practices for leveraging them on their use cases. We also run our own benchmarking experiments, because we noticed that while there are some initiatives to benchmark these large language models across tasks, they're primarily focused on academic tasks.
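As a rough illustration of the flywheel Rohit describes, the loop might be structured like this. Every function name here is a hypothetical placeholder for your own training, serving, and feedback-collection code, not a real library call.

```python
# An illustrative sketch of the data flywheel: train, predict, collect
# feedback, retrain. train(), serve(), and collect_feedback() are
# hypothetical placeholders for your own pipeline.
def data_flywheel(model, labeled_data, rounds=5):
    for _ in range(rounds):
        model = train(model, labeled_data)        # (re)train on what you have
        predictions = serve(model)                # expose predictions to users
        feedback = collect_feedback(predictions)  # corrections, ratings, edits
        labeled_data += feedback                  # fold feedback back into the data
    return model
```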
Jon Prial: In a previous podcast we learned from Parinaz Sobhani; she talked about working with our customers to find the golden use case. Tell me what that means to you as you work with your customers.
Rohit Saha: What that means to me, Jon: right now I'm working with three to four customers that are knee-deep in use cases. One such use case is the task of classification, which in the world of AI means that if I give the model a text or an image, it should be able to tell, say, whether that text belongs to a Twitter conversation or a Reddit conversation, or whether an image contains a cat or a dog, to put it in very simple words. For the task of classification, what we are seeing, Jon, is that these models are fantastic in low-data situations. Traditionally, if you wanted to build an ML system that could do the task of cats versus dogs or Reddit versus Twitter, it would need thousands of samples, thousands of hand-labeled samples. You can imagine the amount of time and effort that goes into annotating those samples, because it's a very laborious, manual process where you need humans to go through each example and annotate it. Now, with LLMs, what we're seeing is that you can get very good performance with a couple hundred samples.
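One way to see this low-data behavior is to prompt an LLM with just a handful of labeled examples before fine-tuning anything. A minimal sketch, assuming the OpenAI Python client; the model name and the toy examples are placeholders:

```python
# Few-shot classification via a chat-completions API: a handful of
# labeled examples in the prompt, no training step at all.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_examples = [
    ("Just shipped v2.0, thread below on what changed", "Twitter"),
    ("ELI5: why do transformers need positional encodings?", "Reddit"),
]

def classify(text: str) -> str:
    """Label a piece of text as a Twitter or Reddit conversation."""
    messages = [{"role": "system",
                 "content": "Classify each message as 'Twitter' or 'Reddit'. "
                            "Answer with one word."}]
    for example, label in few_shot_examples:
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": text})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content.strip()

print(classify("Upvote if you also came here from the tutorial"))
```

In practice you would grade outputs like these against a held-out labeled set, and move to fine-tuning (as in the QLoRA sketch earlier) once you've gathered your couple hundred examples.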
Jon Prial: Wow. This is interesting to me; I've often thought about labeling, and I love all the cats-and-dogs stories. So companies actually need the different mindset you're bringing to them, so they don't have to think about the vast amount of data they have to classify or label; they can actually do things faster and more effectively in this new world. How do companies deal with all this in the age of generative AI?
Rohit Saha: With regards to the terminology change: fine-tuning as a concept has been prevalent in the industry for the last decade. What has changed, Jon, is that these models, given how huge they are, have their own ways of being fine-tuned. To give you an example, and I'll try my best to keep it high-level, Jon: historically, there's a model called BERT. It's a household name in the NLP world. The way you fine-tune a BERT is by taking the model as is, slapping another layer on top, and tuning that layer on your own data set. With large language models, that bit has changed slightly, and teams need to understand that fine-tuning a BERT is slightly different from fine-tuning a Falcon or a Llama 2 (these are, by the way, very popular open-source large language models). The technology has changed just a little bit. The methods we use to fine-tune these large language models, such as QLoRA and prefix tuning, are new, so data scientists will have to learn these methodologies and the tools they can leverage, which are slightly different from fine-tuning a BERT or a DistilBERT or a RoBERTa.
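For contrast with the QLoRA sketch above, here's what the "classic" BERT recipe Rohit describes can look like: a pretrained encoder with a fresh classification layer on top, trained end to end, again assuming the Hugging Face transformers and datasets libraries. The labeled.csv file is a placeholder for your own data.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2 attaches the new, randomly initialized classification layer
# (e.g. Reddit vs. Twitter) on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Placeholder dataset: a CSV with "text" and "label" columns.
dataset = load_dataset("csv", data_files="labeled.csv")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                           padding="max_length", max_length=128))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
).train()
```

Every parameter updates here, which is workable for a few hundred million weights but impractical at LLM scale; that's why adapter-style methods like QLoRA freeze the base model and train only small add-on layers instead.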
Jon Prial: Rohit, thank you so much for giving us the time today.
Rohit Saha: Oh thank you, Jon. Very happy to have this conversation.
DESCRIPTION
We've all heard about how generative AI is changing almost every aspect of a business. If you crack open the door and peer in on the AI teams, you'll see them playing with models. And no, we're not talking about planes and trains. We're talking about providing the correct inputs necessary to drive desired outputs in an AI model.
On this episode of the Georgian Impact Podcast, we will be discussing the impact of generative AI and fine-tuning data strategy with Rohit Saha, an ML scientist on Georgian's R&D team. Rohit will explore how large language models (LLMs) and fine-tuning are changing the AI landscape for businesses, the necessary skills for data science teams in the age of generative AI, and the pivotal role of dynamic data strategy in leveraging new technology effectively.
You’ll Hear About:
- The role of fine-tuning in tailoring foundational AI models to specific use cases.
- How the landscape of ML and AI has evolved with the emergence of LLMs.
- Leveraging LLMs to enhance productivity and build enterprise software.
- Evolution of skills and talent required in the era of generative AI.
- Creating a dynamic data strategy and leveraging open source models for fine-tuning.
- Identifying golden use cases and the impact of LLMs on classification tasks.