The Art of Fine-Tuning: A Guide for ChatGPT & Claude

Summary

This article provides a comprehensive guide on fine-tuning large language models (LLMs) like OpenAI’s ChatGPT and Anthropic’s Claude for specific tasks. It outlines the fine-tuning process, including data preparation, model selection, parameter configuration, validation, iteration, and deployment, offering detailed steps for using the OpenAI API and Amazon Bedrock for these models.

Key insights:
  • Purpose of Fine-Tuning: Fine-tuning adapts pre-trained LLMs to specific tasks, enhancing performance while retaining general language understanding. This process is essential for improving reliability, handling edge cases, and performing new tasks.

  • Steps in Fine-Tuning: The process includes data preparation, model selection, parameter configuration, validation, iteration, and deployment. Each step is critical for ensuring the model performs optimally for the intended task.

  • Data Preparation: Curating and preprocessing task-specific datasets is crucial for successful fine-tuning. Properly formatted data ensures the model learns the necessary patterns and nuances required for the task.

  • Model Selection and Configuration: Choosing the right pre-trained model and configuring parameters like learning rates and batch sizes are vital. Layer freezing helps prevent overfitting and maintains model stability.

  • Evaluation and Iteration: Regularly validating the model using metrics like accuracy and loss is important. Iterating on the model based on evaluation results ensures continuous improvement and fine-tuning success.

  • Deployment and Pricing: Deploying the model into the desired environment and considering cost implications are essential steps. Detailed pricing considerations help manage resources effectively, ensuring cost-efficient model deployment.

  • Fine-Tuning GPTs and Claude: The workflow above applies to both model families, but the implementations differ: GPT models are fine-tuned through the OpenAI API, while Claude 3 Haiku is fine-tuned through Amazon Bedrock.

Introduction

Large language models (LLMs) have revolutionized the field of natural language processing, offering advanced capabilities in tasks such as text generation, translation, summarization, and answering questions. Despite their power, LLMs often need adaptation to perform optimally in specific tasks or domains. Fine-tuning addresses this need by refining pre-trained LLMs on task-specific data, enhancing their performance while retaining their general language understanding. 

This article explores the fine-tuning process, best practices, and applications, focusing on optimizing models like OpenAI’s ChatGPT and Anthropic’s Claude.

Overview of Fine-Tuning

Fine-tuning involves adapting a pre-trained LLM to a specific task by training it on a smaller, task-specific dataset. This process improves the model’s performance for the targeted application while leveraging its existing language knowledge. Key steps and best practices in fine-tuning include:

1. Data Preparation

Curate and preprocess the dataset to ensure its quality and relevance. This may involve cleaning the data, handling missing values, and formatting it to fit the model’s requirements. Data augmentation techniques can enhance the dataset’s robustness, contributing to improved model performance.
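For example, a minimal cleaning pass with pandas might look like the following sketch; the column names match the CSV schema used later in this guide and are otherwise arbitrary:

import pandas as pd

# Hypothetical columns; adapt to your dataset's schema.
df = pd.read_csv("raw_dataset.csv")

# Basic cleaning: drop rows with missing values and exact duplicates,
# then strip stray whitespace from the text fields.
df = df.dropna(subset=["messages", "model-generated"]).drop_duplicates()
df["messages"] = df["messages"].str.strip()
df["model-generated"] = df["model-generated"].str.strip()
df.to_csv("dataset.csv", index=False)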

2. Choosing the Right Pre-Trained Model

Select a pre-trained model that aligns with the specific task’s requirements. Consider factors such as model architecture, size, training data, and performance on similar tasks to ensure a seamless integration into the fine-tuning workflow.

3. Identifying the Right Parameters for Fine-Tuning

Configure key parameters like learning rate, number of training epochs, and batch size. Additionally, freezing particular layers while training others can prevent overfitting and maintain the model’s generalization capabilities.
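Hosted fine-tuning APIs such as OpenAI's and Bedrock's handle layer management internally, but when fine-tuning an open-weight model yourself, freezing layers takes only a few lines of PyTorch. A minimal sketch with Hugging Face Transformers (gpt2 is just a stand-in model):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in open-weight model

# Freeze everything, then unfreeze only the last transformer block and the
# LM head, so earlier layers keep their general language knowledge.
for param in model.parameters():
    param.requires_grad = False
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")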

4. Validation

Evaluate the fine-tuned model’s performance using a validation set. Metrics such as accuracy, loss, precision, and recall provide insights into its effectiveness, guiding adjustments to enhance performance.
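For classification-style tasks, a quick scikit-learn check on a held-out set could look like the following sketch (the labels below are made up):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical binary labels from a held-out validation set.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))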

5. Model Iteration

Refine the model based on evaluation results. Adjust fine-tuning parameters and explore different strategies, such as regularization techniques or architectural changes, to iteratively improve the model.

6. Model Deployment

Integrate the fine-tuned model into the specific environment, considering hardware, software, scalability, real-time performance, and security requirements for successful deployment.

When to Use Fine-Tuning

Fine-tuning LLMs significantly enhances their performance in specialized tasks. For instance, it improves sentiment analysis accuracy by tailoring models to specific data. Fine-tuned chatbots generate more contextually relevant and engaging conversations, enhancing customer service across various industries. By adapting LLMs to particular use cases, fine-tuning maximizes their effectiveness and applicability in real-world scenarios. Consider fine-tuning when:

  • Prompt engineering, prompt chaining, and function calling do not suffice.

  • Setting the style, tone, format, or other qualitative aspects.

  • Improving reliability and handling edge cases.

  • Performing new tasks that are hard to articulate in prompts.

Fine-Tuning vs. RAG

Retrieval-Augmented Generation (RAG) is a hybrid Artificial Intelligence (AI) technique that combines retrieval-based and generative models to generate contextually rich responses using real-time insights from diverse data sources. Use RAG when you need to query large databases for up-to-date information and provide detailed, context-aware answers. However, managing frequently changing data sources can be complex and challenging. Fine-tuning is ideal for customizing AI models to perform specific tasks or exhibit particular behaviors by refining them with task-specific data, enhancing model precision and relevance for specialized applications. Still, it can be resource-intensive and may amplify biases present in the training data. These approaches can complement each other, enhancing AI’s overall effectiveness.

Fine-Tuning GPT Models

1. Prerequisites

OpenAI API Key: Required for making API calls to OpenAI services.

Python: Ensure Python is installed on your machine.

Basic Python Programming Knowledge: Familiarity with Python scripting is necessary for data preparation and model fine-tuning.

2. Prepare the Data

Gather Your Custom Dataset: Collect a dataset relevant to your task or domain. Ensure it contains both user messages and corresponding model-generated responses. The dataset should be in a text format such as CSV or TXT.

Organize your dataset into "messages" and "model-generated" columns. Each row should include a conversation snippet with the user’s message and the model's response.

Convert Dataset to JSONL Format: If required, use a Python script to convert your dataset into JSON Lines (JSONL) format. Here is a simple example:

import pandas as pd
import json

df = pd.read_csv('dataset.csv')

with open('dataset.jsonl', 'w') as f:
    for _, row in df.iterrows():
        json.dump({"messages": [{"role": "user", 
                                 "content": row['messages']}, 
                                {"role": "assistant", 
                                 "content": row['model-generated']}]}, 
                  f)
        f.write('\n')

For fine-tuning GPT-4o-mini and GPT-3.5-turbo models, each training example should be a conversation formatted as a list of messages, including role, content, and optional name. The training data should address cases where the model’s responses are not as desired, with provided assistant messages being the ideal responses.

Example format for GPT-4o-mini and GPT-3.5-turbo:

{"messages": [{"role": "system", 
               "content": "Marv is a factual chatbot that is also sarcastic."}, 
              {"role": "user", 
               "content": "What's the capital of France?"}, 
              {"role": "assistant", 
               "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", 
               "content": "Marv is a factual chatbot that is also sarcastic."}, 
              {"role": "user", 
               "content": "Who wrote 'Romeo and Juliet'?"}, 
              {"role": "assistant", 
               "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", 
               "content": "Marv is a factual chatbot that is also sarcastic."}, 
              {"role": "user", 
               "content": "How far is the Moon from Earth?"}, 
              {"role": "assistant", 
               "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

For babbage-002 and davinci-002, use the prompt-completion pair format.

Example format for babbage-002 and davinci-002:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Data Limits: To fine-tune a model effectively, you should provide at least 10 examples, though clear improvements are generally seen with 50 to 100 well-crafted examples. Starting with 50 examples can help you gauge if the model shows signs of improvement. If the model is not yet production-ready, more data might be needed, and a lack of improvement could indicate a need to revise your data or setup.

Token limits vary by model. For instance, gpt-4o-mini-2024-07-18 supports up to 128,000 tokens for inference and 65,536 tokens for training (with a higher limit coming soon). In comparison, gpt-3.5-turbo models generally support up to 16,385 tokens for both inference and training, except for the 0613 version, which has a training limit of 4,096 tokens. It is essential to ensure that your training examples fit within these token limits to avoid truncation, and you can use tools to compute token counts accurately.

Check Formatting: Before creating a fine-tuning job, verifying your dataset's formatting is crucial. To assist with this, OpenAI provides a straightforward Python script that helps identify potential errors, review token counts, and estimate the cost of the fine-tuning process.
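If you want a quick sanity check of your own before running OpenAI's script, a minimal sketch using the tiktoken library (pip install tiktoken) could look like this; the token counts are approximations since per-message overhead is ignored:

import json
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

with open("dataset.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        assert "messages" in example, f"line {i}: missing 'messages' key"
        for message in example["messages"]:
            assert message.get("role") in {"system", "user", "assistant"}, \
                f"line {i}: unexpected role"
            assert isinstance(message.get("content"), str), \
                f"line {i}: content must be a string"
        # Rough per-example count; OpenAI's script also adds message overhead.
        tokens = sum(len(encoding.encode(m["content"])) for m in example["messages"])
        print(f"example {i}: ~{tokens} tokens")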

3. Fine-Tune the Model

Install OpenAI Python Library and Set Up Environment Variable: Install the OpenAI Python library using pip.
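pip install --upgrade openai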

Set up your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your-api-key"

Choose Your Model: Fine-tuning is available for the following models:

  • Recommended: gpt-4o-mini-2024-07-18

  • gpt-3.5-turbo models: 0125, 1106, 0613

  • Other models: babbage-002, davinci-002

  • Experimental: gpt-4-0613, gpt-4o-2024-05-13

You can also fine-tune previously fine-tuned models if you have additional data. gpt-4o-mini is generally recommended for most users due to its performance, cost, and ease of use.

Upload a Training File: After validating your dataset, upload it using the OpenAI Files API with the following code. This is where you will need to use your OpenAI API key. This step is necessary before creating a fine-tuning job:

from openai import OpenAI
client = OpenAI()

# Keep the returned file object; its id (e.g. "file-abc123") is needed
# when creating the fine-tuning job in the next step.
training_file = client.files.create(
  file=open("mydata.jsonl", "rb"),
  purpose="fine-tune"
)
print(training_file.id)

Create a Fine-Tuned Model: Start a fine-tuning job with the OpenAI SDK. Replace "file-abc123" with your file ID and "gpt-4o-mini" with your chosen model. Fine-tuning may take some time to complete.

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-abc123", 
  model="gpt-4o-mini"
)
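You can also pin training hyperparameters instead of accepting the service defaults. A sketch with illustrative values (parameter names follow the OpenAI fine-tuning API at the time of writing):

client.fine_tuning.jobs.create(
  training_file="file-abc123",
  model="gpt-4o-mini",
  hyperparameters={
    "n_epochs": 3,                      # illustrative; "auto" lets the API decide
    "batch_size": "auto",
    "learning_rate_multiplier": "auto"
  }
)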

Manage Fine-Tuning Jobs: You can list, retrieve the status, cancel, or delete fine-tuning jobs using the following commands.

# List fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the status of a specific job
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Cancel a fine-tuning job
client.fine_tuning.jobs.cancel("ftjob-abc123")

# List events related to a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)

# Delete a fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")

Using the Fine-Tuned Model: Once the fine-tuning job is completed, make requests to the newly fine-tuned model by specifying its name. It may take a few minutes for the model to be ready.

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini:my-org:custom_suffix:id",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)
print(completion.choices[0].message)

4. Analyze the Model

Training metrics provided during the fine-tuning process include:

Training Loss: Loss on the training batch.

Training Token Accuracy: The percentage of tokens that were correctly predicted by the model from the training batch.

Validation Loss: Loss on the validation batch.

Validation Token Accuracy: The percentage of tokens that were correctly predicted by the model from the validation batch.

Validation metrics are calculated on small batches during training and on the entire validation split at the end of each epoch, with the full metrics offering the most accurate performance indication.

While a fine-tuning job is active, you can view metrics in real-time through event objects, such as:

{
    "object": "fine_tuning.job.event",
    "id": "ftevent-abc-123",
    "created_at": 1693582679,
    "level": "info",
    "message": "Step 300/300: training loss=0.15, validation loss=0.27, full validation loss=0.40",
    "data": {
        "step": 300,
        "train_loss": 0.1499,
        "valid_loss": 0.2657,
        "total_steps": 300,
        "full_valid_loss": 0.4033,
        "train_mean_token_accuracy": 0.9444,
        "valid_mean_token_accuracy": 0.9565,
        "full_valid_mean_token_accuracy": 0.9090
    },
    "type": "metrics"
}

After job completion, you can analyze training by retrieving a CSV file from the result files, containing columns like step, train_loss, and valid_mean_token_accuracy.
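As a sketch (the job ID is a placeholder), you can pull the first result file and load it with pandas:

import io
import pandas as pd
from openai import OpenAI

client = OpenAI()

# A completed job lists the IDs of its result files.
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
result_file_id = job.result_files[0]

# Download the metrics CSV and inspect the last few steps.
csv_bytes = client.files.content(result_file_id).read()
df = pd.read_csv(io.BytesIO(csv_bytes))
print(df.tail())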

For the most relevant evaluation, compare samples from the fine-tuned and base models using a test set representing the full input distribution. Manual evaluation can be supplemented with the Evals library for automation.

5. Pricing

Until September 23, 2024, fine-tuning GPT-4o mini is free for up to 2M tokens per 24-hour period. Any additional tokens are charged at $3.00 per million tokens.

For detailed training and deployment costs, visit the pricing page. Note that training validation tokens are not charged. 

For example, a 100,000-token training file run for 3 epochs means 300,000 billed training tokens. At $3.00 per million tokens, GPT-4o mini would cost approximately $0.90 after September 23, 2024 (0.3M × $3.00); gpt-3.5-turbo-0125, at $8.00 per million training tokens, would be roughly $2.40.

Fine-Tuning Claude 3 Haiku

Fine-tuning Anthropic’s Claude 3 Haiku can be done on Amazon Bedrock by following the steps outlined below.

1. Prerequisites

AWS Account: Ensure you have an active Amazon Web Services (AWS) account.

Access: Confirm access to the Anthropic Claude 3 Haiku model in Amazon Bedrock and request access to the fine-tuning preview if needed.

Training Data: Prepare and store your training and optional validation datasets in Amazon S3.

IAM Role: Create an AWS Identity and Access Management (IAM) role with the necessary permissions:

  • Trust relationship for Amazon Bedrock

  • Access to S3 buckets

  • Optionally, decryption permissions for KMS keys

Example trust relationship:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "account-id"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:us-west-2:account-id:model-customization-job/*"
                }
            }
        }
    ]
}

2. Prepare the Data

Format Data: Convert your data to JSONL format:

{"system": string, "messages": [{"role": "user", "content": string}, 
                                {"role": "assistant", "content": string}]}

Example for text summarization:

{
  "system": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.",
  "messages": [
    {"role": "user", 
     "content": "instruction: Summarize the news article provided below. input: Supermarket customers in France can add airline tickets to their shopping lists thanks to a unique promotion by a budget airline. ... Based at the airport, new airline launched in 2007 and is a low-cost subsidiary of the airline."},
    {"role": "assistant", 
     "content": "New airline has included voucher codes with the branded products ... to pay a booking fee and checked baggage fees ."}
  ]
}

Convert Data: Use the following Python script if needed:

import json

system_string = ""
input_file = "Orig-FT-Data.jsonl"
output_file = "Haiku-FT-Data.jsonl"

with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    for line in f_in:
        data = json.loads(line)
        prompt = data["prompt"]
        completion = data["completion"]

        new_data = {}
        if system_string:
            new_data["system"] = system_string
        new_data["messages"] = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion}
        ]

        f_out.write(json.dumps(new_data) + "\n")

print("Conversion completed!")

Data Limits: Training data should not exceed 10,000 records; validation data should not exceed 1,000 records.

3. Fine-Tune the Model

Configure Hyperparameters:

  • epochCount: Number of training epochs (default 2, range 1-10)

  • batchSize: Number of samples processed before updating model parameters (default 32, range 4-256)

  • learningRateMultiplier: Multiplier for learning rate (default 1, range 0.1-2)

  • earlyStoppingThreshold: Minimum improvement required to prevent early stopping (default 0.001, range 0-0.1)

  • earlyStoppingPatience: Tolerance for validation loss stagnation (default 2, range 1-10)

Example of early stopping behavior: If earlyStoppingThreshold is 0.001 and earlyStoppingPatience is 2, training stops if validation loss does not improve significantly for 2 epochs.

Run Fine-Tuning Job via Console:

  • Navigate to Foundation Models in the Amazon Bedrock console.

  • Choose Custom Models and then Create Fine-tuning Job.

  • Configure the model, encryption, tags, job name, and input data locations.

  • Set hyperparameters and enable early stopping if using a validation dataset.

  • Start the fine-tuning job and monitor its progress.

Run Fine-Tuning Job via API:

import boto3
from datetime import datetime

bedrock = boto3.client(service_name="bedrock")
base_model_id = "anthropic.claude-3-haiku-20240307-v1:0:200k"
ts = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
customization_job_name = f"model-finetune-job-{ts}"
custom_model_name = f"finetuned-model-{ts}"
customization_role = "arn:aws:iam::<YOUR_AWS_ACCOUNT_ID>:role/<YOUR_IAM_ROLE_NAME>"

# Note: Bedrock expects hyperparameter values as strings, and the training,
# validation, and output locations each take an S3 URI.
response = bedrock.create_model_customization_job(
    jobName=customization_job_name,
    customModelName=custom_model_name,
    roleArn=customization_role,
    baseModelIdentifier=base_model_id,
    customizationType="FINE_TUNING",
    hyperParameters={
        "epochCount": "2",
        "batchSize": "32",
        "learningRateMultiplier": "1",
        "earlyStoppingThreshold": "0.001",
        "earlyStoppingPatience": "2"
    },
    trainingDataConfig={"s3Uri": "s3://your-bucket/training-data/train.jsonl"},
    validationDataConfig={
        "validators": [{"s3Uri": "s3://your-bucket/validation-data/validation.jsonl"}]
    },
    outputDataConfig={"s3Uri": "s3://your-bucket/output-data/"}
)

print(response)

4. Evaluation and Deployment

  • Monitor the fine-tuning job status on the Amazon Bedrock console or via the API (see the sketch after this list).

  • Review training and validation metrics available in the S3 bucket specified for output data.

  • Deploy the fine-tuned model for use in your applications by purchasing Provisioned Throughput, after which you can invoke it through the Messages API just as you would the base model.
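A minimal boto3 sketch of both steps; the job name and provisioned-model ARN below are placeholders:

import json
import boto3

bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Poll the customization job by name or ARN.
job = bedrock.get_model_customization_job(jobIdentifier="model-finetune-job-...")
print(job["status"])  # e.g. "InProgress", "Completed", "Failed"

# After purchasing Provisioned Throughput, invoke the custom model through its
# provisioned-model ARN using the Anthropic Messages API body format.
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Summarize the article below. ..."}]
})
response = bedrock_runtime.invoke_model(
    modelId="arn:aws:bedrock:us-west-2:account-id:provisioned-model/abc123",
    body=body
)
print(json.loads(response["body"].read()))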

5. Pricing

Apart from the pricing for your AWS account and Amazon Bedrock, you must purchase Provisioned Throughput, which is billed hourly, if you want to deploy your model. The cost depends on:

Base Model: The original model from which your fine-tuned model was customized.

Model Units (MUs): Each MU defines the throughput capacity, indicating the number of tokens the model can process and generate per minute.

Commitment Duration: Options include no commitment, 1 month, or 6 months. Longer commitments offer discounted hourly rates.

You can check current rates on the official Amazon Bedrock pricing page.

Conclusion

In conclusion, fine-tuning large language models like OpenAI’s ChatGPT and Anthropic’s Claude allows organizations to tailor these powerful tools to specific tasks and domains, enhancing their effectiveness and relevance. Users can significantly improve model performance by meticulously preparing and formatting data, choosing appropriate models, and carefully adjusting fine-tuning parameters. For more detail, consult OpenAI’s and Amazon Bedrock’s fine-tuning documentation.

While fine-tuning requires careful consideration of data quality and model evaluation, it offers substantial benefits in terms of precision and applicability.

Optimize Your AI Solutions with Walturn

Transform your AI capabilities with Walturn's expertise in fine-tuning large language models like OpenAI’s ChatGPT and Anthropic’s Claude. We specialize in customizing models for specific tasks and domains, ensuring they deliver precise, context-aware results tailored to your needs. Partner with Walturn to leverage cutting-edge NLP technologies, optimize performance, and drive innovation in your business.

References

“The Contrast Between RAG and Fine-Tuning Models for Tech Enthusiasts — AI Simplified.” GeekyAnts, geekyants.com/blog/the-contrast-between-rag-and-fine-tuning-models-for-tech-enthusiasts--ai-simplified

“Fine-tune Anthropic’s Claude 3 Haiku in Amazon Bedrock to Boost Model Accuracy and Quality | Amazon Web Services.” Amazon Web Services, 10 July 2024, aws.amazon.com/blogs/machine-learning/fine-tune-anthropics-claude-3-haiku-in-amazon-bedrock-to-boost-model-accuracy-and-quality

“Fine-tuning.” OpenAI Platform, platform.openai.com/docs/guides/fine-tuning

“Fine-Tuning LLMs: Overview, Methods and Best Practices.” Turing, www.turing.com/resources/finetuning-large-language-models

“Loss Function in Fine Tuning.” OpenAI Developer Forum, 29 Sept. 2023, community.openai.com/t/loss-function-in-fine-tuning/403653

“Provisioned Throughput for Amazon Bedrock - Amazon Bedrock.” Amazon Web Services, docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html

R2consulting. “A Step-by-Step Guide to Custom Fine-Tuning With ChatGPT’s API Using a Custom Dataset.” Medium, 9 Feb. 2024, medium.com/@r2consultingcloud/a-step-by-step-guide-to-custom-fine-tuning-with-chatgpts-api-using-a-custom-dataset-54dae6c055ce
