📚 Reference Code Available: All code examples from this blog series are available in the GitHub repository. Clone it to follow along!
Fine-Tuning Small LLMs on a Desktop - Part 1: Setup and Environment
Welcome to our comprehensive 6-part series on fine-tuning small language models on a desktop! In this first part, I’ll establish the foundation by setting up your development environment and understanding the key concepts.
Series Overview
This series will take you from zero to having a production-ready fine-tuned language model:
- Part 1: Setup and Environment (This post)
- Part 2: Data Preparation and Model Selection
- Part 3: Fine-Tuning with Unsloth
- Part 4: Evaluation and Testing
- Part 5: Deployment with Ollama and Docker
- Part 6: Production, Monitoring, and Scaling
Why Fine-Tune Small Language Models?
Before diving into setup, let’s understand why fine-tuning small language models has become increasingly popular:
Cost Efficiency
Using API-based models like GPT-4 can quickly become expensive for high-volume applications. A fine-tuned 8B parameter model running locally eliminates per-token costs and can handle thousands of requests per day at zero marginal cost.
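To make this concrete, here's a quick back-of-envelope comparison. Every price and volume in this sketch is a hypothetical placeholder, so plug in your own provider's rates and traffic numbers:

```python
# Back-of-envelope cost comparison: hosted API vs. a local fine-tuned model.
# All prices and volumes below are hypothetical placeholders -- substitute
# your provider's actual pricing and your real traffic.
API_PRICE_PER_1K_TOKENS = 0.01   # USD, hypothetical blended input/output rate
TOKENS_PER_REQUEST = 1_000       # prompt + completion, hypothetical average
REQUESTS_PER_DAY = 5_000

daily_api_cost = REQUESTS_PER_DAY * TOKENS_PER_REQUEST / 1_000 * API_PRICE_PER_1K_TOKENS
monthly_api_cost = daily_api_cost * 30

# Local inference: electricity is the main marginal cost (hardware is a sunk cost).
GPU_POWER_KW = 0.35              # ~350 W under load, hypothetical
HOURS_PER_DAY = 8
PRICE_PER_KWH = 0.15             # USD, hypothetical
monthly_local_cost = GPU_POWER_KW * HOURS_PER_DAY * 30 * PRICE_PER_KWH

print(f"Hosted API:  ~${monthly_api_cost:,.0f}/month")
print(f"Local model: ~${monthly_local_cost:,.2f}/month in electricity")
```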
Data Privacy and Control
Your sensitive data never leaves your infrastructure. This is crucial for:
- Healthcare applications with patient data
- Financial services with confidential information
- Internal corporate tools with proprietary data
- Government and defense applications
Specialized Performance
A small model fine-tuned on your specific domain often outperforms larger general-purpose models. For example:
- A 7B model fine-tuned on SQL data can outperform GPT-4 for database queries
- A model fine-tuned on medical text excels at clinical documentation
- Code-specific models generate better domain-specific solutions
Reduced Latency
Local inference eliminates network latency, providing:
- Sub-second response times
- Offline capability
- Consistent performance regardless of internet connectivity
Vendor Independence
Avoid vendor lock-in and maintain control over your AI capabilities regardless of external service changes, pricing models, or availability.
Prerequisites and System Requirements
Hardware Requirements
Minimum Configuration:
- 8GB RAM (16GB strongly recommended)
- Modern CPU with at least 4 cores
- 50GB free disk space
- Stable internet connection for initial downloads
Recommended Configuration:
- 16GB+ RAM (32GB ideal for larger models)
- NVIDIA GPU with 8GB+ VRAM (RTX 3060 or better; see the memory sizing sketch below)
- SSD storage for faster data loading and model access
- 100GB+ free disk space for models and datasets
GPU Compatibility:
- NVIDIA: RTX 20/30/40 series, Tesla, Quadro
- AMD: Limited support through ROCm (experimental)
- Apple Silicon: Supported via PyTorch's Metal Performance Shaders (MPS) backend
- Intel Arc: Emerging support (check latest compatibility)
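The VRAM recommendations above follow directly from how much memory the model weights occupy: roughly parameter count × bytes per parameter, plus overhead for activations and the KV cache. Here's a rough sizing sketch; the 20% overhead factor is an assumption for illustration, not a measured value:

```python
# Rough VRAM estimate for holding model weights at different precisions.
# The 20% overhead factor for activations/KV cache is a rough assumption.
def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return weights_gb * (1 + overhead)

for params in (3, 7, 8):
    fp16 = estimate_vram_gb(params, 2.0)   # 16-bit weights
    int4 = estimate_vram_gb(params, 0.5)   # 4-bit quantized weights
    print(f"{params}B model: ~{fp16:.1f} GB at FP16, ~{int4:.1f} GB at 4-bit")
```

This is why an 8B model is comfortable on an 8GB GPU only when loaded in 4-bit, which is exactly what we'll do with Unsloth later in the series.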
Software Prerequisites
Operating System Support:
- macOS: macOS 12+ (Apple Silicon strongly recommended for GPU acceleration)
- Windows: Windows 10/11 with WSL2 enabled
- Linux: Ubuntu 20.04+, RHEL 8+, or equivalent distributions
Required Software:
- Docker Desktop (latest version)
- Python 3.8+ (Python 3.10 recommended)
- Git for version control
- VS Code with Python extension (recommended IDE)
Setting Up Docker Desktop
Step 1: Install Docker Desktop
1. Download Docker Desktop from docker.com/products/docker-desktop
2. Platform-specific installation:
macOS:
# Using Homebrew (recommended)
brew install --cask docker
# Or download the DMG and install manually
Windows:
# Ensure WSL2 is installed first
wsl --install
# Then install Docker Desktop from the downloaded installer
Linux:
# Ubuntu/Debian
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Add user to docker group
sudo usermod -aG docker $USER
3. Start Docker Desktop and ensure it's running
4. Verify installation:
docker --version
docker run hello-world
Step 2: Enable Docker Model Runner
Docker Model Runner is a new feature that makes running LLMs locally incredibly simple. Here’s how to enable it:
- Open Docker Desktop
- Navigate to Settings → Beta Features
- Enable the following options:
- ✅ “Docker Model Runner”
- ✅ “Host-side TCP support” (for API access)
- ✅ “GPU-backed inference” (if you have a compatible GPU)
- Click Apply & Restart
After enabling these features, you’ll see a new “Models” tab in Docker Desktop.
Step 3: Verify Docker Model Runner
# Check if Model Runner is available
docker model --help
# You should see output like:
# Usage: docker model COMMAND
#
# Manage models
#
# Commands:
# list List models
# pull Pull a model
# push Push a model to a registry
# remove Remove a model
# run Run a model
Step 4: Check GPU Availability
If you have an NVIDIA GPU, you can check that it's properly configured by running the nvidia-smi command:
nvidia-smi
This command should output a table with information about your GPU, including the driver version, CUDA version, and a list of running processes.
Step 5: Pull Your First Model
Let’s test the setup by pulling a small model:
# Pull a lightweight model for testing
docker model pull ai/smollm2:360M-Q4_K_M
# List available models
docker model list
# Test the model
docker model run ai/smollm2:360M-Q4_K_M "Hello, how are you?"
If this works, congratulations! Your Docker Model Runner is properly configured.
Setting Up Python Environment
Step 1: Create Project Structure
# Create project directory
mkdir llm-fine-tuning-workshop
cd llm-fine-tuning-workshop
# Create subdirectories
mkdir -p {data,models,notebooks,scripts,configs,logs}
# Create virtual environment
python -m venv venv
# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
Step 2: Install Core Dependencies
# Upgrade pip first
pip install --upgrade pip
# Install PyTorch (choose based on your system)
# For CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# For CPU only:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# For Apple Silicon Mac:
pip install torch torchvision torchaudio
Step 3: Install Unsloth
Unsloth is our secret weapon for efficient fine-tuning. It provides up to 80% memory reduction and 2x speed improvements:
# For NVIDIA CUDA systems (Linux/Windows with NVIDIA GPU):
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
# For Apple Silicon Macs (M1/M2/M3 chips):
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
# For CPU-only systems (Intel Macs, older hardware):
pip install "unsloth[cpu] @ git+https://github.com/unslothai/unsloth.git"
Note for Apple Silicon Mac users: make sure you have the latest PyTorch installed with MPS support, so PyTorch can use Metal Performance Shaders (MPS) for GPU acceleration. Apple Silicon support in Unsloth is still maturing, so if installation or training fails, see the Apple Silicon troubleshooting section later in this post.
Step 4: Install Additional Dependencies
# Core ML libraries
pip install transformers accelerate datasets peft bitsandbytes
# Experiment tracking and monitoring
pip install wandb tensorboard
# Development and analysis tools
pip install jupyter notebook ipywidgets
pip install pandas numpy matplotlib seaborn plotly
# Evaluation libraries
pip install evaluate rouge-score sacrebleu
# API and web interface tools
pip install fastapi uvicorn streamlit gradio
# Utility libraries
pip install tqdm rich typer click python-dotenv
Step 5: Create Requirements File
# Generate requirements file
pip freeze > requirements.txt
Step 6: Verify Installation
Create a test script to verify everything is working:
# test_setup.py
import sys
import torch
import transformers
import unsloth
import pandas as pd
import numpy as np

print("🚀 Installation Verification")
print("=" * 40)
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")

# Check GPU support
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        memory_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"  Memory: {memory_gb:.1f} GB")
elif torch.backends.mps.is_available():
    print("✅ Apple Silicon GPU (MPS) available")
    print("   Training will use Metal Performance Shaders")
else:
    print("⚠️ No GPU acceleration available - will use CPU")

# Test Unsloth import
try:
    from unsloth import FastLanguageModel
    print("✅ Unsloth imported successfully")
except ImportError as e:
    print(f"❌ Unsloth import failed: {e}")

# Test device selection
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"🎯 Recommended device for training: {device}")
print("\n🎉 Setup verification complete!")
Run the test:
python test_setup.py
Installing Ollama
Ollama will help us serve our fine-tuned models locally with a simple API:
Installation by Platform
macOS:
# Using Homebrew
brew install ollama
# Or download the macOS app from https://ollama.com/download
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows: Download the installer from ollama.ai and run it.
Verify Ollama Installation
# Check version
ollama --version
# Start Ollama service (in one terminal)
ollama serve
# In another terminal, test with a simple model
ollama run llama3.1:8b "Hello, world!"
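Beyond the CLI, Ollama exposes a REST API on port 11434, which is how we'll call it from application code later in the series. Here's a minimal Python check, assuming the llama3.1:8b model pulled above and a running `ollama serve`:

```python
# Minimal call to Ollama's REST API (requires `ollama serve` to be running).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # any model you've pulled locally
        "prompt": "Hello, world!",
        "stream": False,          # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```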
Understanding Docker Model Runner vs Ollama
You might wonder why we’re using both Docker Model Runner and Ollama. Here’s the key difference:
Docker Model Runner
- Native Integration: Built into Docker Desktop
- Performance: Models run directly on host system (better performance)
- Ecosystem: Seamless integration with Docker containers and Compose
- Port: Uses port 12434
- Best for: Development workflows, containerized applications
Ollama
- Standalone Tool: Independent model server
- Community: Larger model repository and community
- Flexibility: More customization options
- Port: Uses port 11434
- Best for: Production deployment, model experimentation
We’ll use Docker Model Runner for development and Ollama for serving our final fine-tuned models.
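Because we enabled "Host-side TCP support" earlier, Docker Model Runner also exposes an OpenAI-compatible HTTP API on port 12434. Here's a quick sanity check from Python; note that the `/engines/v1` path reflects recent Docker Desktop releases and may differ in your version, so consult the Docker Model Runner docs if you get a 404:

```python
# Sanity check against Docker Model Runner's OpenAI-compatible endpoint.
# Assumes "Host-side TCP support" is enabled; the /engines/v1 path may vary
# by Docker Desktop version -- adjust if the request returns 404.
import requests

resp = requests.post(
    "http://localhost:12434/engines/v1/chat/completions",
    json={
        "model": "ai/smollm2:360M-Q4_K_M",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```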
Project Structure Setup
Let’s create a well-organized project structure:
# Create comprehensive project structure (run from inside llm-fine-tuning-workshop/)
mkdir -p data/{raw,processed,datasets} \
         models/{base,fine-tuned,quantized} \
         notebooks/{exploration,training,evaluation} \
         scripts/{preprocessing,training,evaluation,deployment} \
         configs/{training,evaluation,deployment} \
         logs/{training,evaluation,monitoring} \
         tests/{unit,integration} \
         docker/{images,compose} \
         docs/{guides,api}
# Create configuration files
touch .env .gitignore README.md requirements.txt Dockerfile docker-compose.yml
Create .gitignore
# .gitignore
cat > .gitignore << 'EOF'
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
*.egg-info/
dist/
build/
# Models and Data
models/
*.bin
*.safetensors
*.gguf
*.ggml
data/raw/
data/processed/
*.csv
*.json
*.jsonl
# Logs and Outputs
logs/
outputs/
wandb/
*.log
# Environment
.env
.env.local
# IDE
.vscode/
.idea/
*.swp
*.swo
# System
.DS_Store
Thumbs.db
# Jupyter
.ipynb_checkpoints/
# Docker
.dockerignore
EOF
Create Environment Configuration
# .env
cat > .env << 'EOF'
# Model Configuration
BASE_MODEL="unsloth/llama-3.1-8b-instruct-bnb-4bit"
MAX_SEQ_LENGTH=2048
LOAD_IN_4BIT=true
# Training Configuration
LEARNING_RATE=2e-4
BATCH_SIZE=2
GRADIENT_ACCUMULATION_STEPS=4
MAX_STEPS=500
WARMUP_STEPS=50
# API Configuration
OLLAMA_HOST=localhost
OLLAMA_PORT=11434
DOCKER_MODEL_RUNNER_PORT=12434
# Paths
DATA_DIR=./data
MODELS_DIR=./models
LOGS_DIR=./logs
# Weights & Biases (optional)
WANDB_PROJECT=llm-fine-tuning
WANDB_ENTITY=your-username
# Hugging Face (optional)
HF_TOKEN=your-hf-token
EOF
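With the .env file in place, any script in the project can pull its configuration from the environment instead of hard-coding values. Here's a minimal sketch using python-dotenv (installed in Step 4); the defaults simply mirror the .env values above:

```python
# config.py -- minimal sketch for loading settings from .env with python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

BASE_MODEL = os.getenv("BASE_MODEL", "unsloth/llama-3.1-8b-instruct-bnb-4bit")
MAX_SEQ_LENGTH = int(os.getenv("MAX_SEQ_LENGTH", "2048"))
LOAD_IN_4BIT = os.getenv("LOAD_IN_4BIT", "true").lower() == "true"
LEARNING_RATE = float(os.getenv("LEARNING_RATE", "2e-4"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "2"))

if __name__ == "__main__":
    print(f"Base model:    {BASE_MODEL}")
    print(f"Max seq len:   {MAX_SEQ_LENGTH}")
    print(f"4-bit loading: {LOAD_IN_4BIT}")
```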
Testing the Complete Setup
Let’s create a comprehensive test to ensure everything is working:
# test_complete_setup.py
import os
import sys
import requests
import subprocess
from pathlib import Path

import torch
from dotenv import load_dotenv

# Load environment variables
load_dotenv()


def test_python_environment():
    """Test Python environment and packages"""
    print("🐍 Testing Python Environment")
    print("-" * 30)

    required_packages = [
        'torch', 'transformers', 'datasets', 'accelerate',
        'unsloth', 'bitsandbytes', 'peft', 'wandb'
    ]

    missing_packages = []
    for package in required_packages:
        try:
            __import__(package)
            print(f"✅ {package}")
        except ImportError:
            print(f"❌ {package}")
            missing_packages.append(package)

    if missing_packages:
        print(f"\n⚠️ Missing packages: {', '.join(missing_packages)}")
        return False

    return True


def test_gpu_setup():
    """Test GPU availability and configuration"""
    print("\n🖥️ Testing GPU Setup")
    print("-" * 30)

    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        print(f"✅ CUDA available with {gpu_count} GPU(s)")
        for i in range(gpu_count):
            name = torch.cuda.get_device_name(i)
            memory = torch.cuda.get_device_properties(i).total_memory / 1e9
            print(f"  GPU {i}: {name} ({memory:.1f} GB)")
        return True
    else:
        print("⚠️ CUDA not available - will use CPU")
        return True  # Not a failure, just different configuration


def test_docker_setup():
    """Test Docker and Docker Model Runner"""
    print("\n🐳 Testing Docker Setup")
    print("-" * 30)

    try:
        # Test basic Docker
        result = subprocess.run(['docker', '--version'],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"✅ Docker: {result.stdout.strip()}")
        else:
            print("❌ Docker not available")
            return False

        # Test Docker Model Runner
        result = subprocess.run(['docker', 'model', '--help'],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print("✅ Docker Model Runner available")
        else:
            print("❌ Docker Model Runner not enabled")
            return False

        return True
    except FileNotFoundError:
        print("❌ Docker not found in PATH")
        return False


def test_ollama_setup():
    """Test Ollama installation and service"""
    print("\n🦙 Testing Ollama Setup")
    print("-" * 30)

    try:
        # Test Ollama binary
        result = subprocess.run(['ollama', '--version'],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"✅ Ollama: {result.stdout.strip()}")
        else:
            print("❌ Ollama not available")
            return False

        # Test Ollama service
        try:
            response = requests.get("http://localhost:11434/api/tags", timeout=5)
            if response.status_code == 200:
                print("✅ Ollama service running")
                models = response.json().get("models", [])
                print(f"  {len(models)} models available")
            else:
                print("⚠️ Ollama service not responding (run 'ollama serve')")
        except requests.exceptions.ConnectionError:
            print("⚠️ Ollama service not running (run 'ollama serve')")

        return True
    except FileNotFoundError:
        print("❌ Ollama not found in PATH")
        return False


def test_project_structure():
    """Test project directory structure"""
    print("\n📁 Testing Project Structure")
    print("-" * 30)

    required_dirs = [
        'data', 'models', 'notebooks', 'scripts',
        'configs', 'logs', 'tests', 'docker'
    ]

    missing_dirs = []
    for dir_name in required_dirs:
        if Path(dir_name).exists():
            print(f"✅ {dir_name}/")
        else:
            print(f"❌ {dir_name}/")
            missing_dirs.append(dir_name)

    if missing_dirs:
        print(f"\n⚠️ Missing directories: {', '.join(missing_dirs)}")
        return False

    return True


def test_model_loading():
    """Test loading a small model"""
    print("\n🤖 Testing Model Loading")
    print("-" * 30)

    try:
        from unsloth import FastLanguageModel

        print("Loading small test model...")
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name="unsloth/tinyllama-bnb-4bit",
            max_seq_length=512,
            dtype=None,
            load_in_4bit=True,
        )
        print("✅ Model loaded successfully")

        # Test tokenization
        test_text = "Hello, world!"
        tokens = tokenizer(test_text, return_tensors="pt")
        print(f"✅ Tokenization test passed ({len(tokens['input_ids'][0])} tokens)")

        return True
    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        return False


def main():
    """Run all tests"""
    print("🧪 Complete Setup Testing")
    print("=" * 50)

    tests = [
        test_python_environment,
        test_gpu_setup,
        test_docker_setup,
        test_ollama_setup,
        test_project_structure,
        test_model_loading
    ]

    results = []
    for test in tests:
        try:
            result = test()
            results.append(result)
        except Exception as e:
            print(f"❌ Test failed with error: {e}")
            results.append(False)

    # Summary
    print("\n📊 Test Summary")
    print("=" * 50)
    passed = sum(results)
    total = len(results)
    print(f"Tests passed: {passed}/{total}")

    if passed == total:
        print("🎉 All tests passed! Your environment is ready for fine-tuning.")
    else:
        print("⚠️ Some tests failed. Please address the issues above.")
        print("\n💡 Common fixes:")
        print("- Install missing Python packages: pip install -r requirements.txt")
        print("- Enable Docker Model Runner in Docker Desktop settings")
        print("- Start Ollama service: ollama serve")
        print("- Create missing directories: mkdir -p data models notebooks scripts configs logs tests docker")

    return passed == total


if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)
Run the complete test:
python test_complete_setup.py
Troubleshooting Common Issues
Docker Issues
Issue: Docker Model Runner not available
# Solution: Enable in Docker Desktop
# Settings → Beta Features → Enable "Docker Model Runner"
Issue: Permission denied (Linux)
# Add user to docker group
sudo usermod -aG docker $USER
# Log out and back in
GPU Issues
Issue: CUDA out of memory
# Solution: Use smaller batch sizes and 4-bit quantization
load_in_4bit = True
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
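For context, here's roughly where those knobs plug in once training starts in Part 3: load_in_4bit goes to the model loader, while the batch-size settings go to the trainer's arguments. This is only a sketch of the shape (model name and step counts are placeholders); the full training setup comes later in the series:

```python
# Sketch: where the memory-saving settings live (full training code in Part 3).
from transformers import TrainingArguments
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/tinyllama-bnb-4bit",  # placeholder small model
    max_seq_length=512,
    load_in_4bit=True,              # 4-bit weights cut memory roughly 4x vs FP16
)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,  # smaller per-step memory footprint
    gradient_accumulation_steps=8,  # keeps the effective batch size at 8
    max_steps=500,
)
```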
Issue: GPU not detected
# Check CUDA installation
nvidia-smi
# Reinstall PyTorch with correct CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu121
Apple Silicon Mac Issues
Issue: MPS not available on Apple Silicon
# Ensure you have the latest PyTorch
pip install --upgrade torch torchvision torchaudio
# Check macOS version (MPS requires macOS 12.3+)
sw_vers
# Verify MPS in Python
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
Issue: Unsloth installation fails on Apple Silicon
# Install Xcode command line tools
xcode-select --install
# Install without extras first
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
# If compilation issues persist, try:
export PYTORCH_ENABLE_MPS_FALLBACK=1
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
Issue: Training crashes with MPS backend
# Solution: Add MPS fallback for unsupported operations
import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
# Alternative: Force CPU for problematic operations
device = "cpu" # Instead of "mps" if issues persist
Network Issues
Issue: Ollama service not starting
# Check port availability
sudo lsof -i :11434
# Kill conflicting processes if needed
sudo kill -9 <PID>
# Restart Ollama
ollama serve
📁 Reference Code Repository
All code examples from this blog series are available in the GitHub repository:
# Clone the repository to follow along
git clone https://github.com/saptak/fine-tuning-small-llms.git
cd fine-tuning-small-llms
# Run the Part 1 setup script
./part1-setup/scripts/setup_environment.sh
The repository includes:
- Complete setup scripts and system checks
- Docker configurations and environment templates
- All code examples organized by blog post parts
- Documentation and usage guides
- Requirements files and dependencies
What’s Next?
Congratulations! You’ve successfully set up your development environment for fine-tuning small language models with Docker Desktop. In the next part of our series, we’ll dive into:
Part 2: Data Preparation and Model Selection
- Creating high-quality training datasets
- Data formatting and preprocessing
- Choosing the right base model for your use case
- Understanding different model architectures
Quick Preview
In Part 2, you’ll learn how to:
- Format data for different fine-tuning approaches
- Create synthetic datasets for specialized domains
- Select optimal base models (Llama 3.1, Phi-3, Mistral)
- Implement data quality validation
Resources and References
- Docker Model Runner Documentation
- Unsloth Documentation
- Ollama Model Library
- Hugging Face Transformers Guide
Join the Discussion: Share your setup experience and questions in the comments below!
This is Part 1 of our 6-part series on fine-tuning small LLMs with Docker Desktop. Each part builds upon the previous ones, so make sure to follow along!
Saptak Sen
If you enjoyed this post, you should check out my book: Starting with Spark.