Fine-Tuning the Qwen2.5-VL-7B-Instruct Model: A Comprehensive Guide

In this blog post, we explore the intricacies of fine-tuning the Qwen2.5-VL-7B-Instruct model, a state-of-the-art multimodal transformer designed for both text and image understanding. We will delve into the model's architecture and its applications, then break down the training and inference code in detail.

About Qwen2.5

Qwen2.5-VL represents a significant advancement in multimodal AI, seamlessly integrating vision and language processing capabilities. Developed by the Qwen team at Alibaba Cloud, this model builds upon its predecessor, Qwen2-VL, introducing enhanced features and architectural innovations that broaden its applicability across various domains.

Architectural Innovations

The architecture of Qwen2.5-VL incorporates several key enhancements designed to improve performance and versatility. It consists of two main components:

  1. Vision Encoder – Responsible for processing images and videos.

  2. Qwen2.5 LM Decoder – A multimodal language model that integrates visual and textual inputs to generate meaningful responses.


1. Vision Encoder

The Vision Encoder processes image and video inputs while maintaining their native resolution to preserve critical details. It follows a hierarchical structure with the following key features:

A. Native Resolution Input

  • The encoder takes in images and videos at their original resolution rather than resizing them, which helps retain fine details.

  • It processes multiple images and videos with varying heights and widths.

B. Temporal Processing for Video Understanding

  • Videos are handled with dynamic frame rate sampling, meaning frames are selected at different intervals depending on context needs.

  • The architecture aligns sampled relative time (0, 5, 10, 15, etc.) with absolute time in the video for better temporal reasoning.

  • A Conv3D patch embedding (2×14×14 patches) and temporal merging mechanisms allow the model to efficiently analyze videos across different frames.

C. Attention Mechanisms

The vision encoder employs a Transformer-based architecture with attention mechanisms:

  • Full Attention: applied in a small number of layers to maintain a global understanding of the visual content.

  • Window Attention: applied in the remaining (majority of) layers, restricting attention to localized regions for efficiency.

  • FFN with SwiGLU (Feed-Forward Network with Swish-Gated Linear Units) enhances performance by improving non-linearity in processing.

D. Normalization Techniques

  • RMSNorm (Root Mean Square Normalization) is applied across different layers to stabilize training and improve efficiency.
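
To make these two building blocks concrete, here is a minimal PyTorch sketch of an RMSNorm layer and a SwiGLU feed-forward block. The layer sizes are illustrative only and are not the actual Qwen2.5-VL dimensions.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescales activations by their RMS (no mean subtraction)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFFN(nn.Module):
    """Feed-forward block with a Swish (SiLU) gated linear unit."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

# Illustrative dimensions only, not the real Qwen2.5-VL sizes
x = torch.randn(1, 16, 1024)
out = SwiGLUFFN(1024, 2048)(RMSNorm(1024)(x))
print(out.shape)  # torch.Size([1, 16, 1024])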

2. Qwen2.5 LM Decoder

The Qwen2.5 LM Decoder is a transformer-based multimodal language model that integrates textual, image, and video inputs.

A. Token Processing

  • Each input type (image, video, text) is tokenized:

    • Picture 1: 11,427 tokens

    • Picture 2: 8 tokens

    • Picture 3: 1,125 tokens

    • Video 1: 644/1288/2576 tokens (depending on the frame sample rate)

  • These tokens are processed by the decoder to generate a coherent response.

B. Multimodal Fusion

  • The LM decoder fuses text and visual information from different sources and generates outputs accordingly.

  • The architecture supports long video comprehension (exceeding 1 hour) by effectively aligning textual queries with video frames.

Use Case

Now, let's discuss the task I'm working on. I aim to fine-tune a Vision-Language Model (VLM) for data extraction and Optical Character Recognition (OCR) from complex PowerPoint (PPT) slides, converting the extracted information into specific JSON formats. To achieve this, I have prepared a small dataset comprising intricate PPT slides along with their corresponding expected output format.
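
To make the task concrete, here is a hypothetical example of what a single training pair might look like. The field names mirror the format_data function used later (an image plus system/user/assistant texts), but the JSON schema for the shapes is illustrative rather than my exact target format.

# Hypothetical training pair (illustrative JSON schema)
sample = {
    "images": "<PIL.Image of a rendered slide>",
    "texts": {
        "system": "You extract shapes from PowerPoint slides and return them as JSON.",
        "user": "Extract every shape on this slide with its text and bounding box.",
        "assistant": '{"shapes": [{"type": "text_box", "text": "Q3 Revenue", '
                     '"left": 0.08, "top": 0.05, "width": 0.84, "height": 0.12}]}',
    },
}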

Preparation

  • Data Preparation: Formatting input samples into a structured conversation format with system, user (with image and text), and assistant messages.

  • Model and Processor Loading: Utilizing pre-trained weights for both the vision-language model and its corresponding processor.

  • Quantization: Employing 4-bit quantization with BitsAndBytesConfig to manage memory footprint and speed up training.

  • PEFT via LoRA: Applying parameter-efficient fine-tuning (PEFT) using LoRA configuration to update only a subset of parameters.

  • Training Setup: Using the SFTTrainer from the TRL library to handle training loops, logging, and evaluation.

  • Saving Artifacts: Finally, saving the fine-tuned model along with its tokenizer and processor for later use in inference.

Hardware

  • Instance: g5.xlarge (AWS EC2)

  • Specs:

  • vCPUs: 4
  • Memory (GiB): 16.0
  • Memory per vCPU (GiB): 4.0
  • Physical Processor: AMD EPYC 7R32
  • Clock Speed (GHz): 2.8
  • CPU Architecture: x86_64
  • GPUs: 1
  • GPU Architecture: NVIDIA A10G
  • Video Memory (GiB): 24
  • GPU Compute Capability: 8.6
  • FPGAs: 0

Python Packages

pip install torch==2.6.0 wandb==0.19.6 datasets==3.3.1 \
transformers==4.50.0.dev0 peft==0.14.0 trl==0.15.0 \
bitsandbytes==0.45.2 qwen-vl-utils==0.0.10
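
Before launching training, an optional sanity check (not part of the original scripts) confirms that the GPU and bfloat16 support assumed by the training configuration are available:

import torch

print(torch.__version__)                # expect 2.6.0
print(torch.cuda.is_available())        # expect True on the g5.xlarge
print(torch.cuda.get_device_name(0))    # expect "NVIDIA A10G"
print(torch.cuda.is_bf16_supported())   # the A10G (compute capability 8.6) supports bfloat16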

Training Script (train.py)

import torch
import wandb
from datasets import load_dataset
from transformers import (
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2VLProcessor,
)
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer
from qwen_vl_utils import process_vision_info

First, we import all the libraries required for fine-tuning.

def format_data(sample):
    """
    Format a single dataset sample into the required structure.
    """
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": sample["texts"]["system"]}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": sample["images"]},
                {"type": "text", "text": sample["texts"]["user"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["texts"]["assistant"]}],
        },
    ]

Then we define the format_data(sample) function, which takes a dataset sample as input and formats it into a structured list of dictionaries. The structure follows a conversational layout with three roles: system, user, and assistant. The dataset is prepared in this format and later converted to the model's required input format using the chat template.
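
For illustration, the formatted output for one sample has the following shape (placeholder values):

# Illustrative result of format_data(sample) with placeholder values
[
    {"role": "system",    "content": [{"type": "text", "text": "<system prompt>"}]},
    {"role": "user",      "content": [{"type": "image", "image": "<PIL.Image>"},
                                      {"type": "text", "text": "<user instruction>"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "<expected JSON answer>"}]},
]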

def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    """
    Generate output text from a single sample using the model and processor.

    Parameters:
        model: The vision-language generation model.
        processor: The processor to apply chat templates and tokenize inputs.
        sample: The input sample containing text and image data.
        max_new_tokens: Maximum number of new tokens to generate.
        device: Device to perform inference on.

    Returns:
        A string containing the generated output text.
    """
    # Apply chat template to the user message only (skip the system and assistant turns)
    text_input = processor.apply_chat_template(
        sample[1:2], tokenize=False, add_generation_prompt=True
    )
    )

    # Process visual inputs from the sample
    image_inputs, _ = process_vision_info(sample)

    # Prepare model inputs with text and image data, and move to the specified device
    model_inputs = processor(
        text=[text_input],
        images=image_inputs,
        return_tensors="pt",
    ).to(device)

    # Generate tokens with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Remove input tokens from generated output tokens
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the generated tokens into text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]

Next, we define the generate_text_from_sample function, which generates text output from the vision-language model using the given processor and sample. It integrates text and image inputs for multimodal processing.
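
As a usage sketch (assuming the model, processor, and formatted datasets defined later in the script are already in memory), the function can be used to spot-check the base model before fine-tuning:

# Spot-check the base model on the first formatted training sample
baseline_output = generate_text_from_sample(model, processor, train_dataset[0])
print(baseline_output)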

def collate_fn(examples):
    """
    Data collator to prepare a batch of examples.

    This function applies the chat template to texts, processes the images,
    tokenizes the inputs, and creates labels with proper masking.
    """
    # Apply chat template to each example (no tokenization here)
    texts = [processor.apply_chat_template(example, tokenize=False) for example in examples]
    # Process visual inputs for each example
    image_inputs = [process_vision_info(example)[0] for example in examples]

    # Tokenize texts and images into tensors with padding
    batch = processor(
        text=texts,
        images=image_inputs,
        return_tensors="pt",
        padding=True,
    )

    # Create labels by cloning input_ids and mask the pad tokens
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100

    # Determine image token IDs to mask in the labels (model specific)
    if isinstance(processor, Qwen2VLProcessor):
        image_tokens = [151652, 151653, 151655]
    else:
        image_tokens = [processor.tokenizer.convert_tokens_to_ids(processor.image_token)]

    # Mask image token IDs in the labels
    for image_token_id in image_tokens:
        labels[labels == image_token_id] = -100

    batch["labels"] = labels
    return batch

Next, we define collate_fn, a data collator that prepares a batch of examples for the vision-language model. It processes both text and image inputs, applies the necessary transformations, and ensures the data is properly formatted for training or inference; a small sanity check follows the list below.

  1. Chat Template Application: It applies the chat template to each example without tokenizing, ensuring that the text input follows a structured conversational format.

  2. Image Processing: It extracts and processes image data from each example, making it compatible with the model's vision-processing capabilities.

  3. Tokenization and Padding: It tokenizes both the text and image inputs while ensuring uniform tensor sizes using padding.

  4. Label Creation with Masking:

    • Clones the input_ids to create a label tensor.

    • Replaces pad tokens with -100 to avoid loss computation on them.

    • Identifies and masks image-related token IDs to prevent them from affecting loss calculations, ensuring that only textual tokens contribute to the learning process.

  5. Returns a Processed Batch: The function outputs a structured batch containing tokenized inputs, images, and properly masked labels, making it suitable for training vision-language models.
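
As a sanity check (a sketch, assuming the formatted train_dataset defined below is in memory), the collator can be run on two samples to inspect the resulting batch; the exact keys depend on the processor, but typically include input_ids, attention_mask, pixel_values, image_grid_thw, and labels:

batch = collate_fn([train_dataset[0], train_dataset[1]])
print(batch.keys())
print(batch["input_ids"].shape)            # (2, padded_sequence_length)
print((batch["labels"] == -100).sum())     # number of masked pad/image positions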

# Load and format the dataset
dataset_id = "codewithaman/ppt_shapes_extraction"
train_dataset = load_dataset(dataset_id, split="train")
eval_dataset = load_dataset(dataset_id, split="validation")
test_dataset = load_dataset(dataset_id, split="test")

train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]

Now we can load and format our custom dataset for the vision-language model, specifically for the PPT shapes-extraction task.
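
A quick inspection (optional) confirms the splits were loaded and formatted as expected; the printed sizes depend on the dataset version:

print(len(train_dataset), len(eval_dataset), len(test_dataset))  # number of formatted samples per split
print(train_dataset[0][0])  # the system message of the first formatted conversation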

# Model and processor configuration
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained(model_id)

Then we can set up the model and processor configuration for running a Qwen2.5-VL-7B-Instruct vision-language model with 4-bit quantization to optimize memory and performance.
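
Optionally, we can check how much memory the quantized model occupies; get_memory_footprint is a standard transformers helper, and the exact numbers will vary with the environment:

print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")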

# Configure LoRA for model adaptation
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Apply PEFT model adaptation and print trainable parameters
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

Next we configure LoRA (Low-Rank Adaptation) to fine-tune the Qwen2.5-VL-7B-Instruct model efficiently. LoRA is a parameter-efficient technique that adapts pre-trained models without modifying all parameters, significantly reducing computational and memory costs.
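
The configuration above adapts only the query and value projections, which keeps the trainable-parameter count very small. As a hypothetical variation, if the 24 GB of GPU memory allows, LoRA can also target the remaining attention and MLP projections of the language model, which can help on structured-extraction tasks at the cost of more trainable parameters:

# Hypothetical broader LoRA configuration (more trainable parameters, more memory)
peft_config_wide = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)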

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    output_dir="finetuned",  # Directory to save the model
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=1,  # Training batch size per device
    per_device_eval_batch_size=1,  # Evaluation batch size per device
    gradient_accumulation_steps=4,  # Number of steps to accumulate gradients
    gradient_checkpointing=True,  # Enable gradient checkpointing for memory efficiency
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=2e-4,  # Learning rate for training
    lr_scheduler_type="constant",  # Learning rate scheduler type
    logging_steps=10,  # Interval (in steps) for logging
    eval_steps=10,  # Interval (in steps) for evaluation
    eval_strategy="steps",  # Evaluation strategy
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Interval (in steps) for saving
    metric_for_best_model="eval_loss",  # Metric to evaluate the best model
    greater_is_better=False,  # Lower metric values are better
    load_best_model_at_end=True,  # Load the best model after training
    bf16=True,  # Use bfloat16 precision
    tf32=True,  # Use TensorFloat-32 precision
    max_grad_norm=0.3,  # Maximum gradient norm for clipping
    warmup_ratio=0.03,  # Warmup ratio for learning rate scheduler
    report_to="wandb",  # Reporting via Weights & Biases
    push_to_hub=False,  # Do not push the model to Hugging Face Hub
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Gradient checkpointing options
    dataset_text_field="",  # Text field in the dataset (if applicable)
    dataset_kwargs={"skip_prepare_dataset": True},  # Additional dataset options
    # max_seq_length=1024  # Uncomment to set maximum sequence length for input
)
training_args.remove_unused_columns = False  # Do not remove unused columns from the dataset

Now we configure the training arguments using SFTConfig for supervised fine-tuning (SFT) of the Qwen2.5-VL-7B-Instruct model. This setup aims for an efficient and stable training process on limited resources by leveraging LoRA and quantization.

Key Training Configurations Explained

1. Output and Training Parameters

  • output_dir="finetuned" → Saves the fine-tuned model in the "finetuned" directory.

  • num_train_epochs=3 → The model will train for 3 full epochs over the dataset.

  • per_device_train_batch_size=1 → Each device (GPU/CPU) processes 1 sample per batch during training.

  • per_device_eval_batch_size=1 → Each device processes 1 sample per batch during evaluation.

  • gradient_accumulation_steps=4 → Accumulates gradients over 4 steps before updating weights, reducing memory usage (effective batch size = 1 × 4 = 4).

2. Memory Optimization

  • gradient_checkpointing=True → Saves memory by recomputing activations instead of storing them.

  • bf16=True → Uses bfloat16 precision to optimize training while maintaining stability.

  • tf32=True → Uses TensorFloat-32 for faster computations on newer NVIDIA GPUs.

3. Optimizer and Learning Rate Schedule

  • optim="adamw_torch_fused" → Uses the fused AdamW optimizer implementation in PyTorch.

  • learning_rate=2e-4 → Sets the learning rate to 0.0002.

  • lr_scheduler_type="constant" → Keeps the learning rate constant throughout training.

  • warmup_ratio=0.03 → Reserves 3% of total training steps for warmup (gradually increasing the learning rate); note that the "constant" scheduler effectively ignores warmup, so use "constant_with_warmup" if warmup is desired.

4. Logging and Evaluation

  • logging_steps=10 → Logs training metrics every 10 steps.

  • eval_steps=10 → Runs evaluation every 10 steps.

  • eval_strategy="steps" → Evaluation is step-based (not epoch-based).

  • metric_for_best_model="eval_loss" → Selects the best model based on lowest evaluation loss.

  • greater_is_better=False → Since lower loss is better, this is set to False.

5. Model Saving Strategy

  • save_strategy="steps" → Saves the model at specific steps.

  • save_steps=20 → Saves the model every 20 steps.

  • load_best_model_at_end=True → Loads the best-performing model at the end of training.

6. Logging and Monitoring

  • report_to="wandb" → Logs training progress to Weights & Biases (W&B) for visualization.

  • push_to_hub=False → Does not upload the fine-tuned model to Hugging Face Hub automatically.

7. Gradient Clipping and Miscellaneous

  • max_grad_norm=0.3 → Clips gradients to 0.3 to prevent instability.

  • gradient_checkpointing_kwargs={"use_reentrant": False} → Configures gradient checkpointing to prevent memory issues.

  • dataset_text_field="" → Placeholder for specifying a text field in the dataset (not needed here).

  • dataset_kwargs={"skip_prepare_dataset": True} → Skips dataset preprocessing (assumes it's already formatted).

  • remove_unused_columns = False → Ensures all dataset columns are retained (useful for multimodal training).

# Initialize Weights & Biases for experiment tracking
wandb.init(
    project="ppt-slide-parser",  # Update project name as needed
    config=training_args,
)

Next, we initialize Weights & Biases (W&B) for experiment tracking during the fine-tuning of the Qwen2.5-VL-7B-Instruct model.

# Create the trainer for fine-tuning the model
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    tokenizer=processor.tokenizer,
)

This code initializes an SFTTrainer for fine-tuning the Qwen2.5-VL-7B-Instruct model using supervised fine-tuning (SFT).

Key Components Explained:

  1. Trainer Initialization (SFTTrainer)

    • model=model → Uses the Qwen2.5-VL-7B-Instruct model wrapped with LoRA for efficient adaptation.

    • args=training_args → Supplies training configurations from SFTConfig, including learning rate, batch size, logging, and optimization strategies.

    • train_dataset=train_dataset → Loads the formatted training dataset.

    • eval_dataset=eval_dataset → Loads the formatted evaluation dataset.

  2. Data Handling

    • data_collator=collate_fn → Uses the custom collation function to correctly format text and image inputs before training.

    • tokenizer=processor.tokenizer → Uses the processor's tokenizer to convert text into model-compatible tokens.

  3. LoRA Integration

    • peft_config=peft_config → Integrates LoRA (Low-Rank Adaptation) to reduce memory usage and speed up training by fine-tuning only specific layers.

# Start training
trainer.train()

Now we can start the fine-tuning process for the Qwen2.5-VL-7B-Instruct model using LoRA and Supervised Fine-Tuning (SFT).

# Save the model checkpoint (with sharding as needed)
model.save_pretrained(training_args.output_dir, max_shard_size="4GB")
# Save the tokenizer and processor configurations
processor.tokenizer.save_pretrained(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)

Finally, we save the fine-tuned model and processor to the specified output directory, ensuring that the trained model can be reloaded and used later.

Inference Script (infer.py)

Once the model has been fine-tuned and saved locally, our next step is to run inference with it.

import torch
import os
import json
from transformers import (
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
from datasets import load_dataset
from qwen_vl_utils import process_vision_info
from peft import PeftModel

First, we import all the necessary packages.

# Model and processor setup
model_id = 'folder_to_saved_model'

# BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(model, model_id)
processor = AutoProcessor.from_pretrained(model_id)

Then we set up the model and processor again, but this time we load them from the directory where the fine-tuned model was saved.

# Load dataset
dataset_id = "codewithaman/ppt_shapes_extraction"
dataset = load_dataset(dataset_id, split="test")

# Output directory
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

def format_data(sample):
    """Format dataset samples for Qwen2VL."""
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": sample["texts"]["system"]}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["images"],
                },
                {
                    "type": "text",
                    "text": sample["texts"]["user"],
                },
            ],
        },
    ]

Then we load the test split, which contains data the model has not seen during training.

# Process each sample
for idx, sample in enumerate(dataset):
    test_data = format_data(sample)

    try:
        text = processor.apply_chat_template(test_data[:2], tokenize=False, add_generation_prompt=True)

        # Ensure image token is in text
        image_token = processor.image_token
        if image_token not in text:
            text += f" {image_token}"

        # Process image input and ensure correct format
        image_inputs, _ = process_vision_info(test_data)

        if not image_inputs:
            print(f"⚠️ No images found for sample {idx}. Skipping.")
            continue  # Skip if no valid image

        # Prepare inputs
        inputs = processor(
            text=[text],
            images=image_inputs,
            return_tensors="pt",
        ).to("cuda")

        # Generate response
        generated_ids = model.generate(**inputs, max_new_tokens=1024)  # Cap the generation length
        generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

        # Decode output
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        # Remove image from sample before saving output
        sample_data = {k: v for k, v in sample.items() if k != "images"}

        # Save each output as a JSON file
        output_filepath = os.path.join(output_dir, f"output_{idx}_v2.json")
        with open(output_filepath, "w", encoding="utf-8") as f:
            json.dump({"input": sample_data, "output": output_text[0]}, f, ensure_ascii=False, indent=4)

        print(f"βœ… Saved output to {output_filepath}")

    except ValueError as e:
        print(f"❌ Error at index {idx}: {e}")

Finally, we process each test sample, run inference with the fine-tuned Qwen2.5-VL-7B-Instruct model, and save each output as a JSON file.
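
Since the expected outputs are JSON, a small post-processing step (a sketch, not part of the original script) can verify that each generated string actually parses as JSON:

import os
import json
import glob

valid, invalid = 0, 0
for path in sorted(glob.glob(os.path.join("output", "*.json"))):
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    try:
        json.loads(record["output"])  # the generated text itself should be valid JSON
        valid += 1
    except json.JSONDecodeError:
        invalid += 1
        print(f"Malformed JSON in {path}")
print(f"{valid} valid / {invalid} invalid generations")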

Model Performance

Here are the training and evaluation metrics from fine-tuning Qwen2.5-VL-7B-Instruct for my use case.

[Training metrics plot (W&B)]

[Evaluation metrics plot (W&B)]

Thank you! 🙌😊

Feel free to reach out if you encounter any challenges or obstacles while fine-tuning an LLM or VLM model.