Tracking Large Language Models (LLMs) with MLflow: A Complete Guide

As Large Language Models (LLMs) grow in complexity and size, tracking their performance, experiments, and deployments becomes increasingly challenging. This is where MLflow comes in: it provides a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.
In this guide, we explore how to use MLflow to track, evaluate, and deploy LLMs. We’ll cover everything from setting up your environment to advanced evaluation techniques, with plenty of code examples and best practices along the way.
Set up your environment
Before we dive into tracking LLMs with MLflow, let’s set up our development environment. We need to install MLflow and several other key libraries:
pip install "mlflow>=2.8.1"
pip install openai
pip install chromadb==0.4.15
pip install langchain==0.0.348
pip install tiktoken
pip install 'mlflow[genai]'
pip install databricks-sdk --upgrade
After installation, it is good practice to restart your Python environment so that all libraries are loaded correctly. In a Jupyter notebook, you can then check the installed versions:
import mlflow
import chromadb
print(f"MLflow version: {mlflow.__version__}")
print(f"ChromaDB version: {chromadb.__version__}")
This will confirm the versions of the key libraries we will use.
Understanding MLflow’s LLM tracking capabilities
MLflow’s LLM tracking system builds on existing tracking capabilities and adds features specifically designed for the unique aspects of LLMs. Let’s break down the main components:
Runs and experiments
In MLflow, a ‘run’ represents a single execution of your model code, while an ‘experiment’ is a collection of related runs. For LLMs, a run can represent a single query or a set of prompts processed by the model.
Key tracking components
- Parameters: These are input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log them with mlflow.log_param() or mlflow.log_params().
- Metrics: Quantitative measures of your LLM’s performance, such as accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track them.
- Predictions: For LLMs, it is critical to log both the input prompts and the model’s outputs. mlflow.log_table() saves these as a table artifact in JSON format.
- Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to save them.
Let’s look at a simple example of logging an LLM run. It logs parameters, a metric, and the prompt/response pair as a table artifact:
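import mlflow
from openai import OpenAI

# Requires openai>=1.0; the client reads OPENAI_API_KEY from the environment
client = OpenAI()
# text-davinci-002 has been retired; gpt-3.5-turbo-instruct is assumed here
# as a completions-API replacement
MODEL = "gpt-3.5-turbo-instruct"

def query_llm(prompt, max_tokens=100):
    response = client.completions.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=max_tokens
    )
    return response.choices[0].text.strip()

with mlflow.start_run():
    prompt = "Explain the concept of machine learning in simple terms."
    max_tokens = 100
    # Log parameters
    mlflow.log_param("model", MODEL)
    mlflow.log_param("max_tokens", max_tokens)
    # Query the LLM and log the result
    result = query_llm(prompt, max_tokens=max_tokens)
    mlflow.log_metric("response_length", len(result))
    # Log the prompt and response (data first, then the artifact file name)
    mlflow.log_table(
        data={"prompt": [prompt], "response": [result]},
        artifact_file="prompt_responses.json",
    )
    print(f"Response: {result}")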
Deploying LLMs with MLflow
MLflow provides powerful capabilities for deploying LLMs, making it easier to operate your models in production environments. Let’s see how you can deploy an LLM using MLflow’s deployment features.
Create an endpoint
First, we create an endpoint for our LLM using MLflow’s deployment client:
import mlflow
from mlflow.deployments import get_deploy_client

# Initialize the deployment client
client = get_deploy_client("databricks")

# Define the endpoint configuration
endpoint_name = "llm-endpoint"
endpoint_config = {
    "served_entities": [{
        "name": "gpt-model",
        "external_model": {
            "name": "gpt-3.5-turbo",
            "provider": "openai",
            "task": "llm/v1/completions",
            "openai_config": {
                "openai_api_type": "azure",
                "openai_api_key": "{{secrets/scope/openai_api_key}}",
                "openai_api_base": "{{secrets/scope/openai_api_base}}",
                "openai_deployment_name": "gpt-35-turbo",
                "openai_api_version": "2023-05-15",
            },
        },
    }],
}

# Create the endpoint
client.create_endpoint(name=endpoint_name, config=endpoint_config)
This code sets up an endpoint for a GPT-3.5 turbo model using Azure OpenAI. Note the use of Databricks secrets for secure API key management.
Test the endpoint
Once the endpoint is created, we can test it:
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "prompt": "Explain the concept of neural networks briefly.",
        "max_tokens": 100,
    },
)
print(response)
This will send a prompt to our deployed model and return the generated response.
Evaluating LLMs with MLflow
Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including both built-in and custom metrics.
Preparing your LLM for evaluation
To evaluate your LLM with mlflow.evaluate(), your model must be in one of these forms:
- A mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged or registered MLflow model.
- A Python function that takes string inputs and outputs a single string.
- An MLflow Deployments endpoint URI.
- model=None, with the model outputs included in the evaluation data.
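As a rough sketch of the second and fourth forms, the snippet below uses a stub function in place of a real LLM call. The function name, the sample questions, and the "predictions" column name are hypothetical, and evaluate_both_forms() is only defined here, not called, because the default question-answering metrics may need extra packages to be installed.

```python
import pandas as pd


def answer_questions(inputs: pd.DataFrame) -> list:
    """Hypothetical stand-in for a real LLM call: one string per input row."""
    return [f"Stub answer to: {question}" for question in inputs["question"]]


eval_data = pd.DataFrame(
    {
        "question": ["What is MLflow?", "What is a run?"],
        "ground_truth": [
            "MLflow is an open-source platform for the ML lifecycle.",
            "A run is a single execution of model code.",
        ],
    }
)

# Static-dataset form: the outputs are computed up front and stored in a column.
static_data = eval_data.assign(predictions=answer_questions(eval_data))


def evaluate_both_forms():
    """Run mlflow.evaluate() once per form; call this where MLflow is configured."""
    import mlflow  # imported here so the data preparation above runs without MLflow

    with mlflow.start_run():
        # Python-function form: pass the callable itself as the model.
        function_results = mlflow.evaluate(
            model=answer_questions,
            data=eval_data,
            targets="ground_truth",
            model_type="question-answering",
        )
        # model=None form: outputs are read from the "predictions" column.
        static_results = mlflow.evaluate(
            data=static_data,
            predictions="predictions",
            targets="ground_truth",
            model_type="question-answering",
        )
    return function_results.metrics, static_results.metrics
```

The static form is handy when generating outputs is slow or expensive, since the same saved outputs can be re-scored with different metrics.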
Let’s look at an example with a logged MLflow model:
import mlflow
import openai
import pandas as pd

with mlflow.start_run():
    system_prompt = "Answer the following question concisely."
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Prepare evaluation data
    eval_data = pd.DataFrame({
        "question": ["What is machine learning?", "Explain neural networks."],
        "ground_truth": [
            "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
            "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information."
        ]
    })

    # Evaluate the model
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"Evaluation metrics: {results.metrics}")
This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow’s built-in metrics for question answering.
Custom evaluation metrics
MLflow allows you to define custom metrics for LLM evaluation. Here’s an example of creating a custom metric for evaluating the professionalism of responses:
import mlflow
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism = make_genai_metric(
    name="professionalism",
    definition="Measure of formal and appropriate communication style.",
    grading_prompt=(
        "Score the professionalism of the answer on a scale of 0-4:\n"
        "0: Extremely casual or inappropriate\n"
        "1: Casual but respectful\n"
        "2: Moderately formal\n"
        "3: Professional and appropriate\n"
        "4: Highly formal and expertly crafted"
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
            score=1,
            justification="The response is casual and uses informal language."
        ),
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
            score=4,
            justification="The response is formal, concise, and professionally worded."
        )
    ],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

# Use the custom metric in evaluation
# (reuses logged_model_info and eval_data from the previous example)
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism]
)
# GenAI metric keys include the metric version, e.g. "professionalism/v1/mean"
print(f"Professionalism score: {results.metrics['professionalism/v1/mean']}")
This custom metric uses GPT-3.5 turbo to assess the professionalism of responses, and shows how you can use LLMs themselves for evaluation.
Advanced LLM Evaluation Techniques
As LLMs become more sophisticated, so do the techniques for evaluating them. Let’s explore some advanced evaluation methods using MLflow.
Evaluating Retrieval-Augmented Generation (RAG)
RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and generation components. Here’s how to set up and evaluate a RAG system with MLflow:
import mlflow
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and preprocess documents
loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Evaluation function
def evaluate_rag(question):
    result = qa_chain({"query": question})
    return result["result"], [doc.page_content for doc in result["source_documents"]]

# Prepare evaluation data
eval_questions = [
    "What is MLflow?",
    "How does MLflow handle experiment tracking?",
    "What are the main components of MLflow?"
]

# Evaluate using MLflow
with mlflow.start_run():
    source_counts = []
    for i, question in enumerate(eval_questions):
        answer, sources = evaluate_rag(question)
        source_counts.append(len(sources))
        # A parameter key can only be logged once per run, so index the keys
        mlflow.log_param(f"question_{i}", question)
        mlflow.log_metric("num_sources", len(sources), step=i)
        # Use the index in file names; the question text contains spaces and "?"
        mlflow.log_text(answer, f"answer_{i}.txt")
        for j, source in enumerate(sources):
            mlflow.log_text(source, f"source_{i}_{j}.txt")
    # Log custom metrics, reusing the counts instead of querying the chain again
    mlflow.log_metric("avg_sources_per_question", sum(source_counts) / len(source_counts))
This example sets up a RAG system using LangChain and Chroma and then evaluates it by logging the questions, answers, retrieved sources, and custom metrics to MLflow.
The way you split your documents into chunks can significantly affect a RAG system’s performance. MLflow can help you evaluate different chunking strategies:
This script evaluates different combinations of chunk sizes, overlaps, and splitting methods, and records the results in MLflow for easy comparison.
MLflow offers several ways to visualize your LLM evaluation results. Here are some techniques:
You can create custom visualizations of your evaluation results using libraries such as Matplotlib or Plotly, and then log them as artifacts:
This function creates a line chart comparing a specific metric across multiple runs and records it as an artifact.





