Tracking Large Language Models (LLMs) with MLflow: A Complete Guide
As Large Language Models (LLMs) grow in size and complexity, tracking their performance, experiments, and deployments becomes increasingly challenging. This is where MLflow comes in: it provides a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.
In this guide, we explore how to use MLflow to track, evaluate, and deploy LLMs. We’ll cover everything from setting up your environment to advanced evaluation techniques, with code examples and best practices along the way.
Set up your environment
Before we dive into tracking LLMs with MLflow, let’s set up our development environment. We need to install MLflow and several other key libraries:
    pip install 'mlflow>=2.8.1'
    pip install openai
    pip install chromadb==0.4.15
    pip install langchain==0.0.348
    pip install tiktoken
    pip install 'mlflow[genai]'
    pip install databricks-sdk --upgrade
After installation, it is good practice to restart your Python environment so that all libraries are loaded correctly. Then, in a Jupyter notebook, you can check the installation with:
    import mlflow
    import chromadb

    print(f"MLflow version: {mlflow.__version__}")
    print(f"ChromaDB version: {chromadb.__version__}")
This will confirm the versions of the key libraries we will use.
Understanding MLflow’s LLM tracking capabilities
MLflow’s LLM tracking system builds on existing tracking capabilities and adds features specifically designed for the unique aspects of LLMs. Let’s break down the main components:
Runs and experiments
In MLflow, a ‘run’ represents a single execution of your model code, while an ‘experiment’ is a collection of related runs. For LLMs, a run can represent a single query or a set of prompts processed by the model.
Key tracking components
- Parameters: Input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log these with mlflow.log_param() or mlflow.log_params().
- Metrics: Quantitative measures of your LLM’s performance, such as accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track these.
- Predictions: For LLMs, it is critical to log both the input prompts and the outputs of the model. MLflow saves these as table artifacts (JSON files) using mlflow.log_table().
- Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to save them.
Let’s look at a simple example of logging an LLM run:

    import mlflow
    from openai import OpenAI

    # Requires openai>=1.0 and the OPENAI_API_KEY environment variable.
    client = OpenAI()

    # text-davinci-002 has been retired by OpenAI; gpt-3.5-turbo-instruct
    # is a replacement that works with the completions API.
    MODEL = "gpt-3.5-turbo-instruct"


    def query_llm(prompt, max_tokens=100):
        response = client.completions.create(
            model=MODEL,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        return response.choices[0].text.strip()


    with mlflow.start_run():
        prompt = "Explain the concept of machine learning in simple terms."

        # Log parameters
        mlflow.log_param("model", MODEL)
        mlflow.log_param("max_tokens", 100)

        # Query the LLM and log the result
        result = query_llm(prompt)
        mlflow.log_metric("response_length", len(result))

        # Log the prompt and response as a table artifact
        mlflow.log_table(
            data={"prompt": [prompt], "response": [result]},
            artifact_file="prompt_responses.json",
        )

        print(f"Response: {result}")

This example logs parameters, a metric, and the prompt/response pair as a table artifact.
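Generation settings are often kept in a nested configuration, while mlflow.log_params() expects a flat dictionary. The sketch below shows one possible way to bridge the two. The helper names (flatten_params, log_generation_config) and the sample settings are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def flatten_params(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested config into dot-separated keys for mlflow.log_params()."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_generation_config(config: Dict[str, Any]) -> None:
    """Log every setting of a generation config to the active run in one call."""
    import mlflow  # imported lazily so flatten_params works without MLflow installed

    mlflow.log_params(flatten_params(config))


# Hypothetical settings, for illustration only.
generation_config = {
    "model": "gpt-3.5-turbo-instruct",
    "sampling": {"temperature": 0.2, "top_k": 40},
    "max_tokens": 100,
}
flat_params = flatten_params(generation_config)
print(flat_params)
```

Inside a `with mlflow.start_run():` block, calling log_generation_config(generation_config) would record all four settings at once instead of one log_param() call per setting.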
Deploying LLMs with MLflow
MLflow provides powerful capabilities for deploying LLMs, making it easier to serve your models in production environments. Let’s see how you can deploy an LLM using MLflow’s deployment features.
Create an endpoint
First, we create an endpoint for our LLM using MLflow’s deployment client:
    import mlflow
    from mlflow.deployments import get_deploy_client

    # Initialize the deployment client
    client = get_deploy_client("databricks")

    # Define the endpoint configuration
    endpoint_name = "llm-endpoint"
    endpoint_config = {
        "served_entities": [{
            "name": "gpt-model",
            "external_model": {
                "name": "gpt-3.5-turbo",
                "provider": "openai",
                "task": "llm/v1/completions",
                "openai_config": {
                    "openai_api_type": "azure",
                    "openai_api_key": "{{secrets/scope/openai_api_key}}",
                    "openai_api_base": "{{secrets/scope/openai_api_base}}",
                    "openai_deployment_name": "gpt-35-turbo",
                    "openai_api_version": "2023-05-15",
                },
            },
        }],
    }

    # Create the endpoint
    client.create_endpoint(name=endpoint_name, config=endpoint_config)
This code sets up an endpoint for a GPT-3.5 Turbo model served through Azure OpenAI. Note the use of Databricks secret references ({{secrets/scope/...}}), which keep the API key and base URL out of the code.
Test the endpoint
Once the endpoint is created, we can test it:
    response = client.predict(
        endpoint=endpoint_name,
        inputs={
            "prompt": "Explain the concept of neural networks briefly.",
            "max_tokens": 100,
        },
    )
    print(response)
This will send a prompt to our deployed model and return the generated response.
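Latency was listed earlier as a typical LLM metric, and an endpoint test is a natural place to measure it. The following is a minimal sketch, assuming a deployment client like the one above. The helpers timed_call and log_endpoint_latency, the metric name latency_seconds, and the stand-in fake_predict are hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def log_endpoint_latency(client: Any, endpoint: str, prompt: str) -> Any:
    """Query a deployments endpoint and log its latency to the active MLflow run."""
    import mlflow  # imported lazily so timed_call works without MLflow installed

    response, seconds = timed_call(
        client.predict,
        endpoint=endpoint,
        inputs={"prompt": prompt, "max_tokens": 100},
    )
    mlflow.log_metric("latency_seconds", seconds)
    return response


# Stand-in for a real endpoint call, so the timing helper can be tried offline.
def fake_predict(endpoint: str, inputs: dict) -> dict:
    return {"endpoint": endpoint, "echo": inputs["prompt"]}


response, seconds = timed_call(
    fake_predict, endpoint="llm-endpoint", inputs={"prompt": "hi"}
)
print(response, f"{seconds:.6f}s")
```

With a real client, log_endpoint_latency(client, endpoint_name, "Explain the concept of neural networks briefly.") would return the response and record one latency value per call.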
Evaluating LLMs with MLflow
Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including both built-in and custom metrics.
Preparing your LLM for evaluation
To evaluate your LLM with mlflow.evaluate(), your model must take one of these forms:
- A mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged MLflow model.
- A Python function that takes string input and outputs a single string.
- An MLflow Deployments endpoint URI.
- No model at all: set model=None and include the model outputs in the evaluation data.
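The last option is useful when answers have already been generated elsewhere. The sketch below shows one way this might look, assuming an MLflow version whose mlflow.evaluate() accepts the predictions argument for static data. The helper names build_static_eval_rows and evaluate_static and the column name "outputs" are hypothetical choices, not fixed by MLflow.

```python
from typing import Dict, List, Sequence


def build_static_eval_rows(
    questions: Sequence[str],
    outputs: Sequence[str],
    ground_truths: Sequence[str],
) -> List[Dict[str, str]]:
    """Pair each question with its pre-computed model output and reference answer."""
    if not (len(questions) == len(outputs) == len(ground_truths)):
        raise ValueError("questions, outputs and ground_truths must have equal length")
    return [
        {"question": q, "outputs": o, "ground_truth": g}
        for q, o, g in zip(questions, outputs, ground_truths)
    ]


def evaluate_static(rows: List[Dict[str, str]]):
    """Evaluate pre-computed outputs without loading or calling a model."""
    import mlflow  # imported lazily: only needed when actually evaluating
    import pandas as pd

    with mlflow.start_run():
        return mlflow.evaluate(
            model=None,
            data=pd.DataFrame(rows),
            predictions="outputs",  # column holding the model outputs
            targets="ground_truth",
            model_type="question-answering",
        )


# Hypothetical pre-computed answer, for illustration only.
rows = build_static_eval_rows(
    ["What is machine learning?"],
    ["Machine learning lets systems learn from data."],
    ["Machine learning is a subset of AI that learns from experience."],
)
print(rows[0]["outputs"])
```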
Let’s look at an example with a logged MLflow model:
    import mlflow
    import openai
    import pandas as pd

    with mlflow.start_run():
        system_prompt = "Answer the following question concisely."
        logged_model_info = mlflow.openai.log_model(
            model="gpt-3.5-turbo",
            task=openai.chat.completions,
            artifact_path="model",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "{question}"},
            ],
        )

        # Prepare evaluation data
        eval_data = pd.DataFrame({
            "question": ["What is machine learning?", "Explain neural networks."],
            "ground_truth": [
                "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
                "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information.",
            ],
        })

        # Evaluate the model
        results = mlflow.evaluate(
            logged_model_info.model_uri,
            eval_data,
            targets="ground_truth",
            model_type="question-answering",
        )
        print(f"Evaluation metrics: {results.metrics}")
This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow’s built-in metrics for question answering.
Custom evaluation metrics
MLflow allows you to define custom metrics for LLM evaluation. Here’s an example of creating a custom metric for evaluating the professionalism of responses:
    from mlflow.metrics.genai import EvaluationExample, make_genai_metric

    professionalism = make_genai_metric(
        name="professionalism",
        definition="Measure of formal and appropriate communication style.",
        grading_prompt=(
            "Score the professionalism of the answer on a scale of 0-4:\n"
            "0: Extremely casual or inappropriate\n"
            "1: Casual but respectful\n"
            "2: Moderately formal\n"
            "3: Professional and appropriate\n"
            "4: Highly formal and expertly crafted"
        ),
        examples=[
            EvaluationExample(
                input="What is MLflow?",
                output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
                score=1,
                justification="The response is casual and uses informal language.",
            ),
            EvaluationExample(
                input="What is MLflow?",
                output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
                score=4,
                justification="The response is formal, concise, and professionally worded.",
            ),
        ],
        model="openai:/gpt-3.5-turbo-16k",
        parameters={"temperature": 0.0},
        aggregations=["mean", "variance"],
        greater_is_better=True,
    )

    # Use the custom metric in evaluation
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[professionalism],
    )

    # GenAI metric keys include the metric version, e.g. "professionalism/v1/mean"
    print(f"Professionalism score: {results.metrics['professionalism/v1/mean']}")
This custom metric uses GPT-3.5 Turbo as a judge to score the professionalism of responses, showing how LLMs themselves can be used for evaluation.
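Not every custom metric needs an LLM judge. As a contrast, here is a minimal sketch of a deterministic metric built with mlflow.metrics.make_metric. The metric itself (a word count, where fewer words counts as better) and the names word_counts and make_word_count_metric are hypothetical examples, and the eval_fn signature assumes the (predictions, targets, metrics) form of recent MLflow 2.x releases.

```python
from typing import Iterable, List


def word_counts(predictions: Iterable[str]) -> List[int]:
    """Number of whitespace-separated words in each prediction."""
    return [len(str(text).split()) for text in predictions]


def make_word_count_metric():
    """Wrap word_counts as an MLflow metric where lower values are better."""
    from mlflow.metrics import MetricValue, make_metric  # imported lazily

    def eval_fn(predictions, targets, metrics):
        scores = word_counts(predictions)
        mean = sum(scores) / len(scores) if scores else 0.0
        return MetricValue(scores=scores, aggregate_results={"mean": mean})

    return make_metric(eval_fn=eval_fn, greater_is_better=False, name="word_count")


# The scoring logic can be checked on its own, without MLflow.
print(word_counts(["MLflow tracks experiments.", "Yes"]))
```

The resulting metric could then be passed alongside the LLM-judged one, for example extra_metrics=[professionalism, make_word_count_metric()].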
Advanced LLM Evaluation Techniques
As LLMs become more sophisticated, so do the techniques for evaluating them. Let’s explore some advanced evaluation methods using MLflow.
Evaluating Retrieval-Augmented Generation (RAG)
RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and generation components. Here’s how to set up and evaluate a RAG system with MLflow:
    import mlflow
    from langchain.document_loaders import WebBaseLoader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI

    # Load and preprocess documents
    loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    # Create vector store
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(texts, embeddings)

    # Create RAG chain
    llm = OpenAI(temperature=0)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
        return_source_documents=True,
    )


    # Evaluation function: returns the answer and the retrieved source texts
    def evaluate_rag(question):
        result = qa_chain({"query": question})
        return result["result"], [doc.page_content for doc in result["source_documents"]]


    # Prepare evaluation data
    eval_questions = [
        "What is MLflow?",
        "How does MLflow handle experiment tracking?",
        "What are the main components of MLflow?",
    ]

    # Evaluate using MLflow
    with mlflow.start_run():
        total_sources = 0
        for q_idx, question in enumerate(eval_questions):
            answer, sources = evaluate_rag(question)
            total_sources += len(sources)

            # Parameter keys must be unique within a run, so index them
            mlflow.log_param(f"question_{q_idx}", question)
            mlflow.log_metric("num_sources", len(sources), step=q_idx)

            # Use indices in file names; raw questions contain spaces and "?"
            mlflow.log_text(answer, f"answer_{q_idx}.txt")
            for s_idx, source in enumerate(sources):
                mlflow.log_text(source, f"source_{q_idx}_{s_idx}.txt")

        # Log a custom metric without querying the chain a second time
        mlflow.log_metric("avg_sources_per_question", total_sources / len(eval_questions))
This example sets up a RAG system using LangChain and Chroma and then evaluates it by recording the questions, answers, retrieved sources, and custom metrics in MLflow.
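Counting retrieved sources says little about whether the right documents were retrieved. One common way to assess the retrieval component is precision at k. The sketch below assumes you have hand-labelled sets of relevant document IDs for each question. The document IDs, the helper names, and the metric name precision_at_3 are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved_ids)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalises retrievers that return too few results.
    return hits / k


def log_retrieval_precision(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 3) -> float:
    """Compute precision at k and log it to the active MLflow run."""
    import mlflow  # imported lazily so precision_at_k works without MLflow installed

    score = precision_at_k(retrieved_ids, relevant_ids, k)
    mlflow.log_metric(f"precision_at_{k}", score)
    return score


# Hypothetical IDs: the retriever returned four chunks, two of them relevant.
retrieved = ["doc-tracking", "doc-install", "doc-models", "doc-faq"]
relevant = {"doc-tracking", "doc-models"}
print(precision_at_k(retrieved, relevant, k=4))
```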
The way you divide your documents can significantly affect RAG’s performance. MLflow can help you evaluate different chunking strategies:
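The following is a minimal sketch of such a comparison, not a definitive implementation. It uses a simple hand-written splitter (split_text) rather than LangChain's splitters, and logs only structural statistics for each configuration. In a real comparison you would presumably rebuild the vector store per configuration and log answer-quality metrics as well. All helper names, run names, and metric names here are hypothetical.

```python
import itertools
import statistics
from typing import Dict, List, Sequence


def split_text(text: str, chunk_size: int, overlap: int, method: str = "chars") -> List[str]:
    """Split text into windows of chunk_size units that share `overlap` units.

    method="chars" windows over characters, method="words" over
    whitespace-separated words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if method == "chars":
        units, joiner = list(text), ""
    elif method == "words":
        units, joiner = text.split(), " "
    else:
        raise ValueError(f"unknown method: {method}")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(joiner.join(units[start:start + chunk_size]))
        if start + chunk_size >= len(units):
            break  # the last window already reached the end of the text
    return chunks


def chunk_stats(chunks: Sequence[str]) -> Dict[str, float]:
    """Structural statistics that can be logged as MLflow metrics."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "num_chunks": float(len(chunks)),
        "mean_chunk_chars": float(statistics.mean(lengths)) if lengths else 0.0,
    }


def compare_chunking(text: str, chunk_sizes, overlaps, methods=("chars", "words")) -> None:
    """Log one nested MLflow run per (method, chunk_size, overlap) combination."""
    import mlflow  # imported lazily so the helpers above work without MLflow

    with mlflow.start_run(run_name="chunking-comparison"):
        for method, size, overlap in itertools.product(methods, chunk_sizes, overlaps):
            if overlap >= size:
                continue  # skip invalid combinations
            with mlflow.start_run(run_name=f"{method}-{size}-{overlap}", nested=True):
                mlflow.log_params(
                    {"method": method, "chunk_size": size, "chunk_overlap": overlap}
                )
                mlflow.log_metrics(chunk_stats(split_text(text, size, overlap, method)))


print(split_text("abcdefghij", chunk_size=4, overlap=2))
```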
This script evaluates different combinations of chunk sizes, overlaps, and splitting methods, and records the results in MLflow for easy comparison.
MLflow offers several ways to visualize your LLM evaluation results. Here are some techniques:
You can create custom visualizations of your evaluation results using libraries such as Matplotlib or Plotly, and then log them as artifacts:
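As a minimal sketch using Matplotlib, the function below assumes the runs have already been collected into a list of dictionaries with "name" and "metrics" keys. That structure, the function names, and the default artifact file name are hypothetical choices made for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple


def metric_series(runs: Sequence[Dict[str, Any]], metric: str) -> Tuple[List[str], List[float]]:
    """Return (run names, metric values) for the runs that recorded `metric`."""
    names: List[str] = []
    values: List[float] = []
    for run in runs:
        if metric in run["metrics"]:
            names.append(run["name"])
            values.append(run["metrics"][metric])
    return names, values


def plot_metric_comparison(
    runs: Sequence[Dict[str, Any]],
    metric: str,
    artifact_file: str = "metric_comparison.png",
) -> None:
    """Draw a line chart of `metric` across runs and log it to the active MLflow run."""
    import matplotlib.pyplot as plt  # imported lazily: only needed for plotting
    import mlflow

    names, values = metric_series(runs, metric)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(names, values, marker="o")
    ax.set_xlabel("Run")
    ax.set_ylabel(metric)
    ax.set_title(f"{metric} across runs")
    fig.tight_layout()
    mlflow.log_figure(fig, artifact_file)
    plt.close(fig)


# Hypothetical run summaries, for illustration only.
example_runs = [
    {"name": "chunk-500", "metrics": {"num_chunks": 42.0}},
    {"name": "chunk-1000", "metrics": {"num_chunks": 21.0}},
    {"name": "no-metric", "metrics": {}},
]
print(metric_series(example_runs, "num_chunks"))
```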
This function creates a line chart comparing a specific metric across multiple runs and records it as an artifact.