Tracking Large Language Models (LLMs) with MLflow: A Complete Guide

August 6, 2024

As Large Language Models (LLMs) grow in complexity and size, tracking their performance, experimentation, and implementations becomes increasingly challenging. This is where MLflow comes into the picture: it provides a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.

In this comprehensive guide, we explore how to use MLflow to track, evaluate, and implement LLMs. We’ll cover everything from setting up your environment to advanced evaluation techniques, with plenty of code examples and best practices along the way.

MLflow functionality in large language models (LLMs)

MLflow has become a critical tool in the machine learning and data science community, especially for managing the lifecycle of machine learning models. For Large Language Models (LLMs), MLflow provides a robust suite of tools that streamline developing, tracking, evaluating, and deploying these models. Here’s an overview of how MLflow functions within the LLM space and the benefits it brings to engineers and data scientists.

Track and manage LLM interactions

MLflow’s LLM tracking system is an enhancement to existing tracking capabilities, tailored to the unique needs of LLMs. It enables comprehensive tracking of model interactions, including the following key aspects:

Parameters: Key-value pairs that detail the input parameters for the LLM, including model-specific parameters such as top_k and temperature. This provides context and configuration for each run, ensuring that all aspects of the model’s configuration are captured.
Metrics: Quantitative measurements that provide insight into the performance and accuracy of the LLM. These can be updated dynamically as the run progresses, providing real-time or post-process insights.
Predictions: Capture the input sent to the LLM and its output, which are stored as artifacts in a structured format for easy retrieval and analysis.
Artifacts: In addition to predictions, MLflow can store various output files, such as visualizations, serialized models, and structured data files, allowing detailed documentation and analysis of model performance.

This structured approach ensures that all interactions with the LLM are rigorously recorded, providing comprehensive lineage and quality tracking for text-generating models.

Evaluation of LLMs

Evaluating LLMs presents unique challenges due to their generative nature and lack of a single ground truth. MLflow simplifies this with specialized evaluation tools designed for LLMs. Key features include:

Versatile model evaluation: Supports evaluation of different types of LLMs, whether it is an MLflow pyfunc model, a URI pointing to a registered MLflow model, or a callable Python function representing your model.
Extensive metrics: Provides a range of metrics tailored to LLM evaluation, including both SaaS model-dependent metrics (e.g., answer relevance) and function-based metrics (e.g., ROUGE, Flesch-Kincaid).
Predefined metric collections: Depending on the use case, such as question answering or text summarization, MLflow provides predefined metrics to simplify the evaluation process.
Custom metric creation: Allows users to define and implement custom metrics to meet specific evaluation needs, increasing the flexibility and depth of model evaluation.
Evaluation with static datasets: Allows evaluation of static datasets without specifying a model, which is useful for quick assessments without re-running model inference.
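
The static-dataset option can be sketched roughly as follows, assuming MLflow 2.8 or later and pandas are installed. The column names (question, model_output, ground_truth) and the helper names are illustrative assumptions, not part of MLflow. The outputs are precomputed, so no model is called during evaluation.

```python
REQUIRED_COLUMNS = ("question", "model_output", "ground_truth")  # assumed names


def missing_columns(records):
    """Return the required column names absent from at least one record."""
    return sorted({c for r in records for c in REQUIRED_COLUMNS if c not in r})


def evaluate_static(records):
    """Score precomputed outputs with mlflow.evaluate(); no model is invoked."""
    # Imported lazily so missing_columns() works without MLflow or pandas.
    import mlflow
    import pandas as pd

    missing = missing_columns(records)
    if missing:
        raise ValueError(f"records are missing columns: {missing}")

    eval_df = pd.DataFrame(records)
    with mlflow.start_run():
        results = mlflow.evaluate(
            data=eval_df,
            predictions="model_output",  # column holding the stored outputs
            targets="ground_truth",
            model_type="question-answering",
        )
    return results.metrics


# Illustrative record; in practice these come from earlier inference runs.
records = [
    {
        "question": "What is MLflow?",
        "model_output": "MLflow is an open-source platform for the ML lifecycle.",
        "ground_truth": "MLflow is an open-source platform for managing the ML lifecycle.",
    },
]
```

Calling evaluate_static(records) in an environment with MLflow installed should return the metrics dictionary for the stored outputs.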

Deployment and integration

MLflow also supports seamless deployment and integration of LLMs:

MLflow Deployments Server: Acts as a unified interface for interacting with multiple LLM providers. It simplifies integrations, securely manages credentials, and provides a consistent API experience. This server supports a range of foundation models from popular SaaS vendors, as well as self-hosted models.
Unified endpoint: Facilitates easy switching between providers without code changes, minimizing downtime and increasing flexibility.
Integrated results display: Provides comprehensive evaluation results, which can be accessed directly in the code or through the MLflow user interface for detailed analysis.

MLflow offers a comprehensive set of tools and integrations, making it invaluable for engineers and data scientists working with advanced NLP models.

Set up your environment

Before we delve into tracking LLMs with MLflow, let’s set up our development environment. We need to install MLflow and several other key libraries:

pip install "mlflow>=2.8.1"
pip install openai
pip install chromadb==0.4.15
pip install langchain==0.0.348
pip install tiktoken
pip install 'mlflow[genai]'
pip install databricks-sdk --upgrade

After installation, it is good practice to restart your Python environment so that all libraries are loaded correctly. Once it has restarted (for example, after restarting the kernel in a Jupyter notebook), you can check the installed versions:

import mlflow
import chromadb
print(f"MLflow version: {mlflow.__version__}")
print(f"ChromaDB version: {chromadb.__version__}")

This will confirm the versions of the key libraries we will use.

Understanding MLflow’s LLM tracking capabilities

MLflow’s LLM tracking system builds on existing tracking capabilities and adds features specifically designed for the unique aspects of LLMs. Let’s break down the main components:

Runs and experiments

In MLflow, a ‘run’ represents a single execution of your model code, while an ‘experiment’ is a collection of related runs. For LLMs, a run can represent a single query or a set of prompts processed by the model.
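
As a rough sketch of the run/experiment relationship, the helper below logs one run for a whole set of prompts under a named experiment. The experiment name, the generate callable, and the helper names are hypothetical. generate stands in for whatever function calls your LLM.

```python
def response_stats(responses):
    """Aggregate simple length metrics for a batch of LLM responses."""
    lengths = [len(r) for r in responses]
    return {
        "num_responses": len(lengths),
        "avg_response_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_response_chars": max(lengths, default=0),
    }


def log_prompt_batch(prompts, generate, params, experiment_name="llm-tracking-demo"):
    """Log one MLflow run that covers a whole set of prompts."""
    import mlflow  # lazy import: response_stats() works without MLflow

    mlflow.set_experiment(experiment_name)  # created if it does not exist yet
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        responses = [generate(p) for p in prompts]
        mlflow.log_metrics(response_stats(responses))
        mlflow.log_table(
            data={"prompt": list(prompts), "response": responses},
            artifact_file="prompt_responses.json",
        )
    return run.info.run_id
```

Every call to log_prompt_batch() adds another run to the same experiment, so runs with different parameters can be compared side by side in the MLflow UI.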

Key tracking components

Parameters: These are input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log them via mlflow.log_param() or mlflow.log_params().
Metrics: Quantitative measures of your LLM’s performance, such as accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track them.
Predictions: For LLMs, it is critical to log both the input prompts and the model’s outputs. MLflow saves these as a table artifact (JSON) when you call mlflow.log_table().
Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to save them.

Let’s look at a simple example of logging an LLM run. It demonstrates logging parameters, a metric, and the prompt and response as a table artifact:

import mlflow
from openai import OpenAI

# Assumes openai>=1.0 and OPENAI_API_KEY set in the environment.
client = OpenAI()

# text-davinci-002 has been retired; gpt-3.5-turbo-instruct is the
# completions-style replacement.
MODEL = "gpt-3.5-turbo-instruct"

def query_llm(prompt, max_tokens=100):
    response = client.completions.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=max_tokens,
    )
    return response.choices[0].text.strip()

with mlflow.start_run():
    prompt = "Explain the concept of machine learning in simple terms."
    max_tokens = 100

    # Log parameters
    mlflow.log_param("model", MODEL)
    mlflow.log_param("max_tokens", max_tokens)

    # Query the LLM and log the result
    result = query_llm(prompt, max_tokens=max_tokens)
    mlflow.log_metric("response_length", len(result))

    # Log the prompt and response as a table artifact
    mlflow.log_table(
        data={"prompt": [prompt], "response": [result]},
        artifact_file="prompt_responses.json",
    )

    print(f"Response: {result}")

Deploying LLMs with MLflow

MLflow provides powerful capabilities for deploying LLMs, making it easier to operate your models in production environments. Let’s see how you can deploy an LLM using MLflow’s deployment features.

Create an endpoint

First, we create an endpoint for our LLM using MLflow’s deployment client:

import mlflow
from mlflow.deployments import get_deploy_client
# Initialize the deployment client
client = get_deploy_client("databricks")
# Define the endpoint configuration
endpoint_name = "llm-endpoint"
endpoint_config = {
    "served_entities": [{
        "name": "gpt-model",
        "external_model": {
            "name": "gpt-3.5-turbo",
            "provider": "openai",
            "task": "llm/v1/completions",
            "openai_config": {
                "openai_api_type": "azure",
                "openai_api_key": "{{secrets/scope/openai_api_key}}",
                "openai_api_base": "{{secrets/scope/openai_api_base}}",
                "openai_deployment_name": "gpt-35-turbo",
                "openai_api_version": "2023-05-15",
            },
        },
    }],
}
# Create the endpoint
client.create_endpoint(name=endpoint_name, config=endpoint_config)

This code sets up an endpoint for a GPT-3.5 turbo model using Azure OpenAI. Note the use of Databricks secrets for secure API key management.

Test the endpoint

Once the endpoint is created, we can test it:

response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "prompt": "Explain the concept of neural networks briefly.",
        "max_tokens": 100,
    },
)
print(response)

This will send a prompt to our deployed model and return the generated response.

Evaluating LLMs with MLflow

Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including both built-in and custom metrics.

Preparing your LLM for evaluation

To evaluate your LLM with mlflow.evaluate(), your model must take one of these forms:

A mlflow.pyfunc.PyFuncModel instance or a URI pointing to a registered MLflow model.
A Python function that takes string input and outputs a single string.
An MLflow Deployments endpoint URI.
No model at all: set model=None and include the model outputs in the evaluation data.

Let’s look at an example with a logged MLflow model:

import mlflow
import openai
import pandas as pd
with mlflow.start_run():
    system_prompt = "Answer the following question concisely."
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
# Prepare evaluation data
eval_data = pd.DataFrame({
    "question": ["What is machine learning?", "Explain neural networks."],
    "ground_truth": [
        "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
        "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information."
    ]
})
# Evaluate the model
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
print(f"Evaluation metrics: {results.metrics}")

This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow’s built-in metrics for question answering.
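
The callable form from the list above can be sketched in a similar way. Here answer_question is a hypothetical stand-in for a real LLM call, and the column names are assumptions. MLflow passes the input columns of the evaluation data to the callable and expects one output per row.

```python
def answer_question(question):
    """Hypothetical stand-in for a real LLM call; returns a canned answer."""
    return f"Answer to: {question}"


def model_fn(inputs):
    """Callable model: receives the evaluation inputs, returns one string per row."""
    return [answer_question(q) for q in inputs["question"]]


def evaluate_callable(questions, ground_truth):
    """Evaluate model_fn with MLflow's question-answering metric collection."""
    # Imported lazily so model_fn() can be tried without MLflow or pandas.
    import mlflow
    import pandas as pd

    eval_df = pd.DataFrame({"question": questions, "ground_truth": ground_truth})
    with mlflow.start_run():
        results = mlflow.evaluate(
            model=model_fn,
            data=eval_df,
            targets="ground_truth",
            model_type="question-answering",
        )
    return results.metrics
```

Because nothing is logged as a model here, this form suits quick experiments with a prompt or wrapper function before committing to a logged model.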

Custom evaluation metrics

MLflow allows you to define custom metrics for LLM evaluation. Here’s an example of creating a custom metric for evaluating the professionalism of responses:

from mlflow.metrics.genai import EvaluationExample, make_genai_metric
professionalism = make_genai_metric(
    name="professionalism",
    definition="Measure of formal and appropriate communication style.",
    grading_prompt=(
        "Score the professionalism of the answer on a scale of 0-4:\n"
        "0: Extremely casual or inappropriate\n"
        "1: Casual but respectful\n"
        "2: Moderately formal\n"
        "3: Professional and appropriate\n"
        "4: Highly formal and expertly crafted"
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
            score=1,
            justification="The response is casual and uses informal language."
        ),
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
            score=4,
            justification="The response is formal, concise, and professionally worded."
        )
    ],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
# Use the custom metric in evaluation
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism]
)
# GenAI metrics are reported as "<name>/<version>/<aggregation>"; the version defaults to "v1"
print(f"Professionalism score: {results.metrics['professionalism/v1/mean']}")

This custom metric uses GPT-3.5 turbo to assess the professionalism of responses, and shows how you can use LLMs themselves for evaluation.
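
Custom metrics do not have to call an LLM. As a minimal sketch of a function-based metric, the example below wraps a plain Python scoring function with mlflow.metrics.make_metric. The metric name response_words and the helper names are illustrative, and treating shorter answers as better is an arbitrary choice made for the example.

```python
def word_counts(predictions):
    """Per-row score: number of whitespace-separated words in each response."""
    return [len(str(p).split()) for p in predictions]


def build_word_count_metric():
    """Wrap word_counts() as an MLflow metric via mlflow.metrics.make_metric."""
    from mlflow.metrics import MetricValue, make_metric  # lazy import

    def eval_fn(predictions, targets, metrics, **kwargs):
        scores = word_counts(predictions)
        mean = sum(scores) / len(scores) if scores else 0.0
        return MetricValue(scores=scores, aggregate_results={"mean": mean})

    return make_metric(
        eval_fn=eval_fn,
        greater_is_better=False,  # arbitrary: shorter answers count as better
        name="response_words",
    )
```

The returned metric can be passed to mlflow.evaluate() through extra_metrics, in the same way as the professionalism metric.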

Advanced LLM Evaluation Techniques

As LLMs become more sophisticated, the techniques for evaluating them also become more sophisticated. Let’s explore some advanced evaluation methods using MLflow.

Evaluation of Retrieval-Augmented Generation (RAG)

RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and generation components. Here’s how to set up and evaluate a RAG system with MLflow:

import mlflow
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
# Load and preprocess documents
loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)
# Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Evaluation function
def evaluate_rag(question):
    result = qa_chain({"query": question})
    return result["result"], [doc.page_content for doc in result["source_documents"]]
# Prepare evaluation data
eval_questions = [
    "What is MLflow?",
    "How does MLflow handle experiment tracking?",
    "What are the main components of MLflow?"
]
# Evaluate using MLflow
with mlflow.start_run():
    source_counts = []
    for i, question in enumerate(eval_questions):
        answer, sources = evaluate_rag(question)
        source_counts.append(len(sources))

        # Param keys must be unique within a run, so index them
        mlflow.log_param(f"question_{i}", question)
        mlflow.log_metric("num_sources", len(sources), step=i)
        mlflow.log_text(answer, f"answer_{i}.txt")

        for j, source in enumerate(sources):
            mlflow.log_text(source, f"question_{i}_source_{j}.txt")

    # Log custom metrics (reuse the counts instead of re-querying the chain)
    mlflow.log_metric("avg_sources_per_question", sum(source_counts) / len(source_counts))

This example sets up a RAG system using LangChain and Chroma and then evaluates it by recording the questions, answers, retrieved sources, and custom metrics in MLflow.
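
The retrieval component can also be scored on its own. The sketch below is a hand-rolled precision@k, not an MLflow built-in, and it assumes you have labelled relevant document IDs for each question, which the example above does not provide. All names are illustrative.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved_ids)[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def log_retrieval_precision(examples, k=3):
    """Log mean precision@k over (retrieved_ids, relevant_ids) pairs to MLflow."""
    import mlflow  # lazy import: precision_at_k() works without MLflow

    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in examples]
    mean_score = sum(scores) / len(scores) if scores else 0.0
    with mlflow.start_run():
        mlflow.log_param("k", k)
        mlflow.log_metric(f"precision_at_{k}", mean_score)
    return mean_score
```

Tracking this number alongside the generation metrics helps separate retrieval failures from generation failures when answer quality drops.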

The way you split your documents can significantly affect RAG performance. MLflow can help you compare chunking strategies: evaluate each combination of chunk size, overlap, and splitting method as its own run, and record the results in MLflow for easy comparison.
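
One way to set up such a comparison, as a minimal sketch: the code below covers chunk sizes and overlaps only, uses a plain character splitter as a stand-in for a LangChain splitter, and logs simple chunk statistics rather than answer quality. All function names are illustrative.

```python
from itertools import product


def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def chunking_grid(chunk_sizes, chunk_overlaps):
    """All valid (size, overlap) combinations as parameter dicts."""
    return [
        {"chunk_size": size, "chunk_overlap": overlap}
        for size, overlap in product(chunk_sizes, chunk_overlaps)
        if overlap < size
    ]


def log_chunking_runs(text, chunk_sizes, chunk_overlaps):
    """Log one MLflow run per chunking configuration with simple chunk statistics."""
    import mlflow  # lazy import: the helpers above work without MLflow

    for config in chunking_grid(chunk_sizes, chunk_overlaps):
        chunks = chunk_text(text, **config)
        with mlflow.start_run():
            mlflow.log_params(config)
            mlflow.log_metric("num_chunks", len(chunks))
            mlflow.log_metric("avg_chunk_chars", sum(map(len, chunks)) / max(len(chunks), 1))
```

In a real comparison you would rebuild the vector store for each configuration and log answer-quality metrics in place of the chunk statistics.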

MLflow offers several ways to visualize your LLM evaluation results. One technique is to create custom visualizations with libraries such as Matplotlib or Plotly and then log them as artifacts, for example a line chart that compares a specific metric across multiple runs.
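
A chart of this kind could be produced along the following lines, assuming Matplotlib is installed. The function names, the artifact file name, and the run name are illustrative. Runs that never logged the metric are skipped.

```python
def metric_points(runs, metric_name):
    """Keep (label, value) pairs for runs that recorded metric_name."""
    return [(label, metrics[metric_name]) for label, metrics in runs if metric_name in metrics]


def plot_metric_across_runs(experiment_name, metric_name, artifact_file="metric_comparison.png"):
    """Draw a line chart of one metric across an experiment's runs and log it."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt
    import mlflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=["attributes.start_time ASC"],
        output_format="list",
    )
    points = metric_points(
        ((run.info.run_name or run.info.run_id, run.data.metrics) for run in runs),
        metric_name,
    )

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot([label for label, _ in points], [value for _, value in points], marker="o")
    ax.set_xlabel("Run")
    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} across runs")
    ax.tick_params(axis="x", rotation=45)  # tilt run labels so they stay readable

    # Store the chart as an artifact of a separate comparison run
    with mlflow.start_run(run_name="metric-comparison"):
        mlflow.log_figure(fig, artifact_file)
    plt.close(fig)
    return points
```

Run names are used as x-axis labels, so giving runs distinct names keeps the chart easy to read.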
