Fine-Tuning Small LLMs on your Desktop - Part 2: Data Preparation and Model Selection

📚 Reference Code Available: All code examples from this blog series are available in the GitHub repository. See part2-data-preparation/ for the complete data preparation toolkit!

Welcome back! In Part 1, we set up our development environment with Docker Desktop, CUDA support, and all necessary tools. Now we dive into the foundation of any successful fine-tuning project: data preparation and model selection.

This is where the magic begins—the quality of your training data will ultimately determine the success of your fine-tuned model. We’ll explore advanced techniques for creating, validating, and optimizing datasets that produce exceptional results.

Series Navigation

  1. Part 1: Setup and Environment
  2. Part 2: Data Preparation and Model Selection (This post)
  3. Part 3: Fine-Tuning with Unsloth
  4. Part 4: Evaluation and Testing
  5. Part 5: Deployment with Ollama and Docker
  6. Part 6: Production, Monitoring, and Scaling

The Data Quality Imperative

Before we dive into code, let’s understand why data quality is paramount:

“Garbage in, garbage out” - This age-old adage is especially true for LLM fine-tuning. A model trained on 500 high-quality, diverse examples will consistently outperform one trained on 5,000 mediocre, repetitive samples.

Key Principles for High-Quality Training Data

  1. Diversity Over Volume: Cover edge cases and variations
  2. Consistency: Maintain uniform formatting and style
  3. Accuracy: Ensure all examples are factually correct
  4. Relevance: Every example should serve your specific use case
  5. Balance: Avoid overrepresenting any single pattern (a quick duplicate check is sketched below)
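
Principles 1 and 5 are easy to make measurable. The sketch below (an illustrative helper, not part of the repository; it assumes the Alpaca-style JSON list of instruction/input/output records we build later in this post) flags exact duplicates and the most repeated instruction patterns:

# duplicate_check.py (illustrative sketch)
import json
from collections import Counter

def check_duplicates(dataset_path: str, top_n: int = 5):
    """Report exact duplicates and the most repeated instructions."""
    with open(dataset_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Exact duplicates across instruction + input + output
    keys = [(ex.get("instruction", ""), ex.get("input", ""), ex.get("output", ""))
            for ex in data]
    duplicate_count = len(keys) - len(set(keys))

    # Over-represented instruction patterns
    instruction_counts = Counter(ex.get("instruction", "") for ex in data)

    print(f"Total examples: {len(data)}")
    print(f"Exact duplicates: {duplicate_count}")
    print("Most repeated instructions:")
    for instruction, count in instruction_counts.most_common(top_n):
        print(f"  {count}x  {instruction[:60]}")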

Understanding Data Formats for Fine-Tuning

Different training approaches require different data formats. Let’s explore the most effective ones:

Alpaca Format (Instruction-Following)

{
  "instruction": "Generate SQL to find users registered in the last 30 days",
  "input": "Table Schema: users (id, name, email, registration_date)",
  "output": "SELECT * FROM users WHERE registration_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);"
}

Chat Format (Conversational)

{
  "messages": [
    {"role": "system", "content": "You are an expert SQL developer."},
    {"role": "user", "content": "Write a query to find recent users"},
    {"role": "assistant", "content": "SELECT * FROM users WHERE registration_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);"}
  ]
}

Completion Format (Text Generation)

{
  "prompt": "### SQL Query Request:\nFind users registered recently\n\n### SQL:\n",
  "completion": "SELECT * FROM users WHERE registration_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);"
}

Model Selection Strategy

Choosing the right base model is crucial for success. Here’s our decision framework:

Model Evaluation Matrix

Model | Size | Memory (GB) | Speed | Use Case | Best For
Phi-3-Mini | 3.8B | 4-6 | Fast | General, Coding | Resource-constrained environments
Llama-3.1-8B | 8B | 8-12 | Medium | General, Reasoning | Balanced performance/resources
Mistral-7B | 7B | 6-10 | Medium | Code, Technical | Programming tasks
Qwen2-7B | 7B | 6-10 | Medium | Multilingual | International applications
CodeLlama-7B | 7B | 8-12 | Medium | Programming | Pure code generation

Smart Model Selection Function

# model_selection.py
import torch
import psutil
from typing import Dict, List, Tuple, Optional

def analyze_system_resources() -> Dict[str, float]:
    """Analyze available system resources"""

    # Get CPU information
    cpu_count = psutil.cpu_count(logical=True)
    cpu_freq = psutil.cpu_freq()
    memory = psutil.virtual_memory()

    # Get GPU information if available
    gpu_memory_gb = 0
    gpu_count = 0

    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

    return {
        "cpu_cores": cpu_count,
        "cpu_frequency_ghz": cpu_freq.current / 1000 if cpu_freq else 0,
        "ram_gb": memory.total / 1e9,
        "available_ram_gb": memory.available / 1e9,
        "gpu_count": gpu_count,
        "gpu_memory_gb": gpu_memory_gb
    }

def estimate_model_requirements(model_size_billion: float,
                              quantization: str = "4bit") -> Dict[str, float]:
    """Estimate resource requirements for a model"""

    # Base memory requirements (rough estimates)
    base_memory_gb = {
        "fp16": model_size_billion * 2,
        "8bit": model_size_billion * 1.2,
        "4bit": model_size_billion * 0.75,
        "fp32": model_size_billion * 4
    }

    model_memory = base_memory_gb.get(quantization, base_memory_gb["4bit"])

    # Add overhead for training (LoRA adapters, optimizer states, etc.)
    training_overhead = model_memory * 0.3

    return {
        "inference_memory_gb": model_memory,
        "training_memory_gb": model_memory + training_overhead,
        "minimum_ram_gb": model_memory * 1.5,  # For CPU fallback
        "recommended_vram_gb": model_memory + training_overhead
    }

def select_optimal_model(use_case: str,
                        memory_constraint_gb: float,
                        performance_priority: str = "balanced") -> str:
    """
    Select optimal model based on requirements
    """

    recommendations = {
        "general": {
            "high_memory": "unsloth/llama-3.1-8b-instruct-bnb-4bit",
            "low_memory": "unsloth/Phi-3-mini-4k-instruct-bnb-4bit"
        },
        "coding": {
            "high_memory": "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
            "low_memory": "unsloth/CodeLlama-7b-instruct-bnb-4bit"
        },
        "multilingual": {
            "high_memory": "unsloth/Qwen2-7B-Instruct-bnb-4bit",
            "low_memory": "unsloth/Phi-3-mini-4k-instruct-bnb-4bit"
        },
        "conversational": {
            "high_memory": "unsloth/llama-3.1-8b-instruct-bnb-4bit",
            "low_memory": "unsloth/Phi-3-mini-4k-instruct-bnb-4bit"
        }
    }

    memory_tier = "high_memory" if memory_constraint_gb >= 8 else "low_memory"

    if use_case in recommendations:
        return recommendations[use_case][memory_tier]
    else:
        return recommendations["general"][memory_tier]

# Example usage
recommended_model = select_optimal_model(
    use_case="coding",
    memory_constraint_gb=16,
    performance_priority="balanced"
)
print(f"Recommended model: {recommended_model}")

Creating High-Quality Training Datasets

You have two excellent options for creating training datasets: using public datasets for quick setup or creating custom datasets for specific needs.

Option 1: Public Datasets (Recommended for Desktop)

For faster setup and efficient desktop training, you can leverage high-quality public datasets:

# Quick Start with HuggingFace Datasets

# Load 1K examples for fast training (30min-1hr on desktop)
python src/dataset_creation.py --source huggingface --num-examples 1000 --format alpaca

# Load 5K examples for comprehensive training (1-2hr on desktop)  
python src/dataset_creation.py --source huggingface --num-examples 5000 --format alpaca

# Use different HuggingFace dataset
python src/dataset_creation.py --source huggingface --hf-dataset "spider" --num-examples 500

Desktop Training Recommendations:

  • 1K examples: ~30min training, 2-4GB memory, ideal for testing
  • 5K examples: ~1-2hr training, 4-6GB memory, good balance
  • 10K+ examples: 3hr+ training, 8GB+ memory, comprehensive but slow

Available Public Datasets:

  • b-mc2/sql-create-context: 78K examples, professionally curated, best for general SQL
  • spider: 10K examples, complex cross-domain queries, good for advanced SQL
  • wikisql: 80K examples, simpler single-table queries, good for beginners
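
Before committing to one of these, it helps to preview a few records and confirm the field names you will map from. A quick sketch using the datasets library (the question/context/answer fields shown are what b-mc2/sql-create-context uses; other datasets may name them differently):

# Preview a public dataset before converting (pip install datasets)
from datasets import load_dataset

preview = load_dataset("b-mc2/sql-create-context", split="train[:3]")
print(preview.column_names)  # e.g. ['answer', 'question', 'context']

for record in preview:
    print(record["question"])
    print(record["context"])
    print(record["answer"])
    print("-" * 40)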

Option 2: Manual Dataset Creation

Let’s create a practical example with SQL generation - a common and valuable use case:

Step 1: Define Your Dataset Structure

# dataset_creation.py
import pandas as pd
import json
from typing import List, Dict
from pathlib import Path

class SQLDatasetCreator:
    def __init__(self, output_dir: str = "./data"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.examples = []

    def add_example(self, instruction: str, table_schema: str, sql_query: str,
                   explanation: str = "", difficulty: str = "medium"):
        """Add a training example to the dataset"""

        example = {
            "instruction": instruction,
            "input": f"Table Schema: {table_schema}",
            "output": sql_query,
            "explanation": explanation,
            "difficulty": difficulty,
            "id": len(self.examples)
        }

        self.examples.append(example)
        return example

    def create_basic_examples(self):
        """Create fundamental SQL examples"""

        # Basic SELECT operations
        self.add_example(
            instruction="Select all columns from the users table",
            table_schema="users (id, name, email, created_at)",
            sql_query="SELECT * FROM users;",
            explanation="Basic SELECT statement to retrieve all columns and rows",
            difficulty="easy"
        )

        self.add_example(
            instruction="Find all users who registered in the last 30 days",
            table_schema="users (id, name, email, created_at)",
            sql_query="SELECT * FROM users WHERE created_at >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);",
            explanation="Uses DATE_SUB function to filter recent registrations",
            difficulty="medium"
        )

        # Aggregation queries
        self.add_example(
            instruction="Count the total number of orders per customer",
            table_schema="orders (id, customer_id, amount, order_date)",
            sql_query="SELECT customer_id, COUNT(*) as order_count FROM orders GROUP BY customer_id;",
            explanation="Groups by customer and counts orders using COUNT(*)",
            difficulty="medium"
        )

        self.add_example(
            instruction="Find the average order amount per month",
            table_schema="orders (id, customer_id, amount, order_date)",
            sql_query="SELECT DATE_FORMAT(order_date, '%Y-%m') as month, AVG(amount) as avg_amount FROM orders GROUP BY DATE_FORMAT(order_date, '%Y-%m');",
            explanation="Uses DATE_FORMAT to group by month and AVG for average calculation",
            difficulty="medium"
        )

        # JOIN operations
        self.add_example(
            instruction="Show customer names with their total order amounts",
            table_schema="customers (id, name, email), orders (id, customer_id, amount, order_date)",
            sql_query="SELECT c.name, SUM(o.amount) as total_amount FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id, c.name;",
            explanation="INNER JOIN between customers and orders with SUM aggregation",
            difficulty="hard"
        )

        # Complex analytical queries
        self.add_example(
            instruction="Find customers who have spent more than the average customer spending",
            table_schema="customers (id, name, email), orders (id, customer_id, amount)",
            sql_query="""SELECT c.name, SUM(o.amount) as total_spent
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name
HAVING SUM(o.amount) > (
    SELECT AVG(customer_total)
    FROM (
        SELECT SUM(amount) as customer_total
        FROM orders
        GROUP BY customer_id
    ) as customer_totals
);""",
            explanation="Complex query with subquery to find above-average spenders",
            difficulty="expert"
        )

    def create_advanced_examples(self):
        """Create advanced SQL examples"""

        # Window functions
        self.add_example(
            instruction="Rank customers by their total spending within each region",
            table_schema="customers (id, name, region), orders (id, customer_id, amount)",
            sql_query="""SELECT
    c.name,
    c.region,
    SUM(o.amount) as total_spent,
    RANK() OVER (PARTITION BY c.region ORDER BY SUM(o.amount) DESC) as spending_rank
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.region;""",
            explanation="Uses window function RANK() with PARTITION BY for regional rankings",
            difficulty="expert"
        )

        # Common Table Expressions (CTEs)
        self.add_example(
            instruction="Find the second highest order amount for each customer",
            table_schema="orders (id, customer_id, amount, order_date)",
            sql_query="""WITH ranked_orders AS (
    SELECT
        customer_id,
        amount,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) as rn
    FROM orders
)
SELECT customer_id, amount as second_highest_amount
FROM ranked_orders
WHERE rn = 2;""",
            explanation="Uses CTE with ROW_NUMBER() to find second highest values",
            difficulty="expert"
        )
    
    def load_huggingface_dataset(self, dataset_name: str = "b-mc2/sql-create-context", 
                               num_examples: int = 1000):
        """Load dataset from HuggingFace Hub"""
        try:
            from datasets import load_dataset
        except ImportError:
            print("❌ Error: datasets library not found. Install with: pip install datasets")
            return False
        
        print(f"🔄 Loading {num_examples} examples from {dataset_name}...")
        
        try:
            # Load dataset with specified number of examples
            dataset = load_dataset(dataset_name, split=f"train[:{num_examples}]")
            
            # Convert HuggingFace format to internal format
            for i, item in enumerate(dataset):
                # Handle different possible field names
                instruction = item.get('question', item.get('instruction', ''))
                context = item.get('context', item.get('input', ''))
                answer = item.get('answer', item.get('output', ''))
                
                if instruction and answer:
                    self.add_example(
                        instruction=instruction,
                        table_schema=context,
                        sql_query=answer,
                        difficulty="medium"
                    )
                    
                    if (i + 1) % 100 == 0:
                        print(f"  Loaded {i + 1}/{num_examples} examples...")
            
            print(f"✅ Successfully loaded {len(self.examples)} examples from {dataset_name}")
            return True
            
        except Exception as e:
            print(f"❌ Error loading dataset: {e}")
            print("💡 Falling back to manual dataset creation...")
            return False

    def format_for_training(self, format_type: str = "alpaca") -> List[Dict]:
        """Format examples for different training approaches"""

        formatted_examples = []

        for example in self.examples:
            if format_type == "alpaca":
                formatted = {
                    "instruction": example["instruction"],
                    "input": example["input"],
                    "output": example["output"]
                }

            elif format_type == "chat":
                formatted = {
                    "messages": [
                        {"role": "system", "content": "You are an expert SQL developer who generates accurate and efficient SQL queries."},
                        {"role": "user", "content": f"{example['instruction']}\n\n{example['input']}"},
                        {"role": "assistant", "content": example["output"]}
                    ]
                }

            elif format_type == "completion":
                formatted = {
                    "prompt": f"### SQL Request:\n{example['instruction']}\n\n{example['input']}\n\n### SQL Query:\n",
                    "completion": example["output"]
                }

            formatted_examples.append(formatted)

        return formatted_examples

    def save_dataset(self, filename: str = "sql_training_data", format_type: str = "alpaca"):
        """Save dataset in specified format"""

        formatted_data = self.format_for_training(format_type)

        # Save as JSON
        json_path = self.output_dir / f"{filename}_{format_type}.json"
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(formatted_data, f, indent=2, ensure_ascii=False)

        # Save as CSV (for Alpaca format)
        if format_type == "alpaca":
            df = pd.DataFrame(formatted_data)
            csv_path = self.output_dir / f"{filename}_alpaca.csv"
            df.to_csv(csv_path, index=False)

        print(f"Dataset saved: {json_path}")
        print(f"Total examples: {len(formatted_data)}")

        return json_path

# Create comprehensive SQL dataset
def create_sql_dataset(source: str = "manual", hf_dataset: str = "b-mc2/sql-create-context", 
                      num_examples: int = 1000):
    creator = SQLDatasetCreator(output_dir="./data/datasets")
    
    if source == "huggingface":
        # Load from HuggingFace
        success = creator.load_huggingface_dataset(hf_dataset, num_examples)
        if not success:
            print("⚠️  Falling back to manual dataset creation...")
            source = "manual"
    
    if source == "manual":
        # Create manual examples
        creator.create_basic_examples()
        creator.create_advanced_examples()

        # Add domain-specific examples
        creator.add_example(
            instruction="Create a query to find the top 10 products by sales volume",
            table_schema="products (id, name, category), order_items (id, product_id, quantity, order_id)",
            sql_query="SELECT p.name, SUM(oi.quantity) as total_sold FROM products p JOIN order_items oi ON p.id = oi.product_id GROUP BY p.id, p.name ORDER BY total_sold DESC LIMIT 10;",
            difficulty="medium"
        )

        creator.add_example(
            instruction="Calculate monthly revenue growth rate",
            table_schema="orders (id, amount, order_date)",
            sql_query="""WITH monthly_revenue AS (
    SELECT
        DATE_FORMAT(order_date, '%Y-%m') as month,
        SUM(amount) as revenue
    FROM orders
    GROUP BY DATE_FORMAT(order_date, '%Y-%m')
),
revenue_with_lag AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    ROUND(((revenue - prev_revenue) / prev_revenue) * 100, 2) as growth_rate_percent
FROM revenue_with_lag
WHERE prev_revenue IS NOT NULL;""",
            difficulty="expert"
        )

    # Save in multiple formats
    dataset_name = "sql_dataset_hf" if source == "huggingface" else "sql_dataset"
    creator.save_dataset(dataset_name, "alpaca")
    creator.save_dataset(dataset_name, "chat")

    return creator

# Usage
if __name__ == "__main__":
    # Option 1: Use HuggingFace dataset (recommended for desktop)
    dataset_creator = create_sql_dataset(source="huggingface", num_examples=1000)
    
    # Option 2: Create manual dataset
    # dataset_creator = create_sql_dataset(source="manual")
    
    print(f"Created dataset with {len(dataset_creator.examples)} examples")

Step 2: Data Quality Validation

# data_validation.py
import json
from typing import Dict, Tuple

import sqlparse

class DataQualityValidator:
    def __init__(self):
        self.errors = []
        self.warnings = []
        self.stats = {}

    def validate_sql_syntax(self, sql_query: str) -> Tuple[bool, str]:
        """Validate SQL syntax using sqlparse"""
        try:
            parsed = sqlparse.parse(sql_query)
            if not parsed or not sql_query.strip():
                return False, "Empty or invalid SQL"

            # Note: sqlparse is a non-validating parser, so this is a lenient
            # structural check rather than full syntax validation
            sqlparse.format(sql_query, reindent=True, keyword_case='upper')
            return True, "Valid SQL syntax"

        except Exception as e:
            return False, f"SQL parsing error: {str(e)}"

    def validate_dataset(self, dataset_path: str, format_type: str = "alpaca") -> Dict:
        """Comprehensive dataset validation"""

        # Load dataset
        with open(dataset_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        validation_results = {
            "total_examples": len(data),
            "valid_examples": 0,
            "errors": [],
            "warnings": [],
            "statistics": {}
        }

        for i, example in enumerate(data):
            example_errors = []
            example_warnings = []

            if format_type == "alpaca":
                # Check required fields
                required_fields = ["instruction", "input", "output"]
                for field in required_fields:
                    if field not in example or not example[field].strip():
                        example_errors.append(f"Missing or empty {field}")

                # Validate SQL in output
                if "output" in example:
                    is_valid, message = self.validate_sql_syntax(example["output"])
                    if not is_valid:
                        example_errors.append(f"Invalid SQL: {message}")

                # Check instruction quality
                if "instruction" in example:
                    if len(example["instruction"]) < 10:
                        example_warnings.append("Instruction too short")
                    if not example["instruction"].endswith(('?', '.')):
                        example_warnings.append("Instruction should end with punctuation")

            # Record results
            if not example_errors:
                validation_results["valid_examples"] += 1
            else:
                validation_results["errors"].append({
                    "example_index": i,
                    "errors": example_errors
                })

            if example_warnings:
                validation_results["warnings"].append({
                    "example_index": i,
                    "warnings": example_warnings
                })

        # Calculate statistics
        validation_results["statistics"] = {
            "success_rate": validation_results["valid_examples"] / len(data) * 100,
            "error_rate": len(validation_results["errors"]) / len(data) * 100,
            "warning_rate": len(validation_results["warnings"]) / len(data) * 100
        }

        return validation_results

    def generate_quality_report(self, validation_results: Dict) -> str:
        """Generate human-readable quality report"""

        report = f"""
📊 Dataset Quality Report
========================

Total Examples: {validation_results['total_examples']}
Valid Examples: {validation_results['valid_examples']}
Success Rate: {validation_results['statistics']['success_rate']:.1f}%

❌ Errors: {len(validation_results['errors'])}
⚠️  Warnings: {len(validation_results['warnings'])}

"""

        if validation_results['errors']:
            report += "🔍 Error Details:\n"
            for error in validation_results['errors'][:5]:  # Show first 5
                report += f"  Example {error['example_index']}: {', '.join(error['errors'])}\n"

        if validation_results['warnings']:
            report += "\n⚠️  Warning Details:\n"
            for warning in validation_results['warnings'][:5]:  # Show first 5
                report += f"  Example {warning['example_index']}: {', '.join(warning['warnings'])}\n"

        return report

# Usage example
def validate_sql_dataset():
    validator = DataQualityValidator()

    # Validate the dataset
    results = validator.validate_dataset("./data/datasets/sql_dataset_alpaca.json")

    # Generate report
    report = validator.generate_quality_report(results)
    print(report)

    # Save validation report
    with open("./data/validation_report.txt", "w") as f:
        f.write(report)

    return results

if __name__ == "__main__":
    validation_results = validate_sql_dataset()

Step 3: Data Augmentation Techniques

# data_augmentation.py
import json
import re
from typing import List, Dict

class SQLDataAugmenter:
    def __init__(self):
        self.table_variations = {
            'users': ['customers', 'clients', 'members', 'accounts'],
            'orders': ['purchases', 'transactions', 'sales', 'bookings'],
            'products': ['items', 'goods', 'services', 'offerings'],
            'categories': ['types', 'groups', 'classes', 'segments']
        }

        self.column_variations = {
            'id': ['id', 'user_id', 'customer_id', 'primary_key'],
            'name': ['name', 'full_name', 'title', 'label'],
            'email': ['email', 'email_address', 'contact_email'],
            'created_at': ['created_at', 'created_date', 'registration_date', 'signup_date'],
            'amount': ['amount', 'price', 'cost', 'value', 'total']
        }

    def augment_table_names(self, sql_query: str, schema: str) -> List[Dict]:
        """Create variations by changing table names"""
        variations = []

        # Extract original table names from schema
        tables = self.extract_tables_from_schema(schema)

        for table in tables:
            if table in self.table_variations:
                for variation in self.table_variations[table]:
                    new_sql = sql_query.replace(table, variation)
                    new_schema = schema.replace(table, variation)
                    variations.append({
                        'sql': new_sql,
                        'schema': new_schema,
                        'variation_type': f'table_name_{variation}'
                    })

        return variations

    def augment_conditions(self, base_example: Dict) -> List[Dict]:
        """Create variations with different WHERE conditions"""
        variations = []

        # Time-based variations
        time_conditions = [
            "WHERE created_at >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)",
            "WHERE created_at >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)",
            "WHERE created_at >= '2024-01-01'",
            "WHERE YEAR(created_at) = 2024"
        ]

        for condition in time_conditions:
            new_example = base_example.copy()
            new_example['instruction'] = new_example['instruction'].replace('30 days', self.extract_time_period(condition))
            new_example['output'] = new_example['output'].replace(
                "WHERE created_at >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)",
                condition
            )
            variations.append(new_example)

        return variations

    def extract_tables_from_schema(self, schema: str) -> List[str]:
        """Extract table names from schema string"""
        tables = re.findall(r'(\w+)\s*\(', schema)
        return tables

    def extract_time_period(self, condition: str) -> str:
        """Extract time period description from SQL condition"""
        if '7 DAY' in condition:
            return '7 days'
        elif '90 DAY' in condition:
            return '90 days'
        elif '2024' in condition:
            return '2024'
        return 'specified period'

    def generate_variations(self, original_dataset: List[Dict], augmentation_factor: int = 2) -> List[Dict]:
        """Generate augmented dataset"""
        augmented_data = original_dataset.copy()

        for example in original_dataset:
            # Generate table name variations
            if 'output' in example and 'SELECT' in example['output'].upper():
                table_variations = self.augment_table_names(
                    example['output'],
                    example.get('input', '')
                )

                for var in table_variations[:augmentation_factor]:
                    new_example = example.copy()
                    new_example['output'] = var['sql']
                    new_example['input'] = var['schema']
                    new_example['variation_type'] = var['variation_type']
                    augmented_data.append(new_example)

        return augmented_data

# Usage
def augment_sql_dataset():
    # Load original dataset
    with open('./data/datasets/sql_dataset_alpaca.json', 'r') as f:
        original_data = json.load(f)

    # Create augmenter
    augmenter = SQLDataAugmenter()

    # Generate variations
    augmented_data = augmenter.generate_variations(original_data, augmentation_factor=3)

    # Save augmented dataset
    with open('./data/datasets/sql_dataset_augmented.json', 'w') as f:
        json.dump(augmented_data, f, indent=2)

    print(f"Original dataset: {len(original_data)} examples")
    print(f"Augmented dataset: {len(augmented_data)} examples")
    print(f"Augmentation ratio: {len(augmented_data) / len(original_data):.1f}x")

if __name__ == "__main__":
    augment_sql_dataset()

Data Format Conversion

Let’s create utilities to convert between different formats:

# format_converter.py
import json
from typing import List, Dict

class DatasetFormatConverter:
    def __init__(self):
        pass

    @staticmethod
    def alpaca_to_chat(alpaca_data: List[Dict]) -> List[Dict]:
        """Convert Alpaca format to Chat format"""
        chat_data = []

        for example in alpaca_data:
            chat_example = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are an expert SQL developer who generates accurate and efficient SQL queries based on user requirements."
                    },
                    {
                        "role": "user",
                        "content": f"{example['instruction']}\n\n{example.get('input', '')}"
                    },
                    {
                        "role": "assistant",
                        "content": example['output']
                    }
                ]
            }
            chat_data.append(chat_example)

        return chat_data

    @staticmethod
    def alpaca_to_completion(alpaca_data: List[Dict]) -> List[Dict]:
        """Convert Alpaca format to Completion format"""
        completion_data = []

        for example in alpaca_data:
            prompt = f"### Instruction:\n{example['instruction']}\n\n"
            if example.get('input', '').strip():
                prompt += f"### Input:\n{example['input']}\n\n"
            prompt += "### Response:\n"

            completion_example = {
                "prompt": prompt,
                "completion": example['output']
            }
            completion_data.append(completion_example)

        return completion_data

    @staticmethod
    def chat_to_alpaca(chat_data: List[Dict]) -> List[Dict]:
        """Convert Chat format to Alpaca format"""
        alpaca_data = []

        for example in chat_data:
            messages = example.get('messages', [])

            # Find user and assistant messages
            user_msg = next((msg for msg in messages if msg['role'] == 'user'), None)
            assistant_msg = next((msg for msg in messages if msg['role'] == 'assistant'), None)

            if user_msg and assistant_msg:
                # Try to split user message into instruction and input
                user_content = user_msg['content']
                lines = user_content.split('\n\n')

                if len(lines) >= 2 and lines[1].startswith(('Table', 'Schema', 'Context')):
                    instruction = lines[0]
                    input_text = lines[1]
                else:
                    instruction = user_content
                    input_text = ""

                alpaca_example = {
                    "instruction": instruction,
                    "input": input_text,
                    "output": assistant_msg['content']
                }
                alpaca_data.append(alpaca_example)

        return alpaca_data

    def convert_dataset(self, input_path: str, output_path: str,
                       from_format: str, to_format: str):
        """Convert dataset from one format to another"""

        # Load input data
        with open(input_path, 'r', encoding='utf-8') as f:
            input_data = json.load(f)

        # Convert data
        if from_format == "alpaca" and to_format == "chat":
            output_data = self.alpaca_to_chat(input_data)
        elif from_format == "alpaca" and to_format == "completion":
            output_data = self.alpaca_to_completion(input_data)
        elif from_format == "chat" and to_format == "alpaca":
            output_data = self.chat_to_alpaca(input_data)
        else:
            raise ValueError(f"Conversion from {from_format} to {to_format} not supported")

        # Save output data
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)

        print(f"Converted {len(input_data)} examples from {from_format} to {to_format}")
        print(f"Saved to: {output_path}")

        return output_data

# Usage example
def convert_formats():
    converter = DatasetFormatConverter()

    # Convert Alpaca to Chat format
    converter.convert_dataset(
        input_path="./data/datasets/sql_dataset_alpaca.json",
        output_path="./data/datasets/sql_dataset_chat.json",
        from_format="alpaca",
        to_format="chat"
    )

    # Convert Alpaca to Completion format
    converter.convert_dataset(
        input_path="./data/datasets/sql_dataset_alpaca.json",
        output_path="./data/datasets/sql_dataset_completion.json",
        from_format="alpaca",
        to_format="completion"
    )

if __name__ == "__main__":
    convert_formats()

Best Practices for Dataset Creation

1. Quality Over Quantity

  • Minimum viable dataset: 100-500 high-quality examples
  • Production dataset: 1,000-5,000 examples
  • Research dataset: 10,000+ examples

2. Balanced Distribution

# Check dataset balance
import json
import numpy as np
import pandas as pd

def analyze_dataset_distribution(dataset_path: str):
    with open(dataset_path, 'r') as f:
        data = json.load(f)

    # Analyze difficulty distribution
    difficulties = [ex.get('difficulty', 'unknown') for ex in data]
    difficulty_counts = pd.Series(difficulties).value_counts()

    # Analyze length distribution
    output_lengths = [len(ex['output']) for ex in data]

    print("📊 Dataset Distribution Analysis")
    print("=" * 40)
    print(f"Total examples: {len(data)}")
    print(f"\nDifficulty distribution:")
    print(difficulty_counts)
    print(f"\nOutput length statistics:")
    print(f"Mean: {np.mean(output_lengths):.1f} characters")
    print(f"Median: {np.median(output_lengths):.1f} characters")
    print(f"Min: {min(output_lengths)} characters")
    print(f"Max: {max(output_lengths)} characters")

3. Domain-Specific Considerations

For SQL Generation:

  • Cover all major SQL operations (SELECT, INSERT, UPDATE, DELETE); a quick coverage check is sketched after this list
  • Include various JOIN types
  • Add window functions and CTEs for advanced cases
  • Validate all SQL queries for syntax correctness
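
The coverage check mentioned in the first point can be as simple as counting keyword occurrences across the outputs. A rough sketch (keyword matching will overcount terms that appear inside strings or comments), assuming the Alpaca-format JSON produced earlier:

# sql_coverage_check.py (rough keyword-based sketch)
import json

def check_sql_coverage(dataset_path: str):
    """Count how many examples exercise each major SQL construct."""
    with open(dataset_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    constructs = ["SELECT", "INSERT", "UPDATE", "DELETE", "JOIN",
                  "GROUP BY", "OVER (", "WITH "]
    counts = {construct: 0 for construct in constructs}

    for example in data:
        sql_text = example.get("output", "").upper()
        for construct in constructs:
            if construct in sql_text:
                counts[construct] += 1

    for construct, count in counts.items():
        print(f"{construct.strip():10} {count:5} / {len(data)}")

check_sql_coverage("./data/datasets/sql_dataset_alpaca.json")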

For Code Generation:

  • Include multiple programming languages
  • Cover different complexity levels
  • Add error handling examples
  • Include best practices and common patterns

For Text Analysis:

  • Ensure diverse text types and domains
  • Include various output formats
  • Cover different analysis types (sentiment, summarization, etc.)

Preparing Data for Training

Loading and Inspecting Datasets

Before we start training, it’s important to load and inspect our datasets to ensure they are in the correct format. We can use the datasets library to do this:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("json", data_files="./data/datasets/sql_dataset_alpaca.json")

# Inspect the dataset
print(dataset)

# Print the first example
print(dataset["train"][0])

Final Dataset Preparation Script

# prepare_training_data.py
import json
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
import os

def prepare_final_dataset(dataset_path: str, test_size: float = 0.1,
                         format_type: str = "alpaca"):
    """Prepare final dataset for training"""

    # Load dataset
    with open(dataset_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"📊 Dataset Statistics:")
    print(f"Total examples: {len(data)}")

    # Split into train/validation
    train_data, val_data = train_test_split(
        data,
        test_size=test_size,
        random_state=42,
        shuffle=True
    )

    print(f"Training examples: {len(train_data)}")
    print(f"Validation examples: {len(val_data)}")

    # Format for Unsloth training
    def format_example(example):
        if format_type == "alpaca":
            if example.get('input', '').strip():
                text = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are an expert SQL developer.<|eot_id|><|start_header_id|>user<|end_header_id|>{example['instruction']}\n\n{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{example['output']}<|eot_id|><|end_of_text|>"
            else:
                text = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are an expert SQL developer.<|eot_id|><|start_header_id|>user<|end_header_id|>{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{example['output']}<|eot_id|><|end_of_text|>"

            return {"text": text}

    # Format training data
    formatted_train = [format_example(ex) for ex in train_data]
    formatted_val = [format_example(ex) for ex in val_data]

    # Create Hugging Face datasets
    train_dataset = Dataset.from_list(formatted_train)
    val_dataset = Dataset.from_list(formatted_val)

    # Save processed datasets
    output_dir = "./data/processed"
    os.makedirs(output_dir, exist_ok=True)

    train_dataset.save_to_disk(f"{output_dir}/train_dataset")
    val_dataset.save_to_disk(f"{output_dir}/val_dataset")

    # Also save as JSON for backup
    with open(f"{output_dir}/train_data.json", 'w') as f:
        json.dump(formatted_train, f, indent=2)

    with open(f"{output_dir}/val_data.json", 'w') as f:
        json.dump(formatted_val, f, indent=2)

    print(f"✅ Datasets saved to {output_dir}")

    # Show example
    print(f"\n📝 Training Example Preview:")
    print(formatted_train[0]['text'][:300] + "...")

    return train_dataset, val_dataset

# Usage
if __name__ == "__main__":
    train_ds, val_ds = prepare_final_dataset(
        "./data/datasets/sql_dataset_alpaca.json",
        test_size=0.15,
        format_type="alpaca"
    )
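
Before moving on to training in Part 3, it is worth a quick sanity check that the processed datasets reload cleanly from disk. A minimal check using datasets.load_from_disk against the paths written above:

# Verify the processed datasets reload correctly
from datasets import load_from_disk

train_check = load_from_disk("./data/processed/train_dataset")
val_check = load_from_disk("./data/processed/val_dataset")

print(f"Train examples: {len(train_check)}, validation examples: {len(val_check)}")
print(train_check[0]["text"][:200])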

📁 Reference Code Repository

All data preparation code and examples are available in the GitHub repository:

🔗 fine-tuning-small-llms/part2-data-preparation

# Clone the repository if you haven't already
git clone https://github.com/saptak/fine-tuning-small-llms.git
cd fine-tuning-small-llms

# Quick Start: Use HuggingFace dataset (recommended for desktop)
python part2-data-preparation/src/dataset_creation.py --source huggingface --num-examples 1000 --format alpaca

# Alternative: Create manual dataset  
python part2-data-preparation/src/dataset_creation.py --source manual --format alpaca

# Quick script for common sizes
chmod +x part2-data-preparation/scripts/quick_dataset.sh
./part2-data-preparation/scripts/quick_dataset.sh --size medium  # 1K examples

# Validate the dataset
python part2-data-preparation/src/data_validation.py --dataset ./data/datasets/sql_dataset_hf_alpaca.json

The Part 2 directory includes:

  • src/dataset_creation.py - Complete dataset creation toolkit with HuggingFace integration
  • src/data_validation.py - Quality validation framework
  • src/format_converter.py - Format conversion utilities
  • src/model_selection.py - Smart model recommendation system
  • scripts/quick_dataset.sh - One-command dataset generation script
  • examples/ - Sample datasets and templates
  • Documentation and usage guides

New Features Added:

  • HuggingFace Integration: Load popular public datasets directly
  • Desktop Optimization: Configurable dataset sizes for memory constraints
  • Quick Setup Script: Generate datasets with single command
  • Backward Compatibility: All original manual creation still works

What’s Next?

Excellent work! You now have a comprehensive understanding of data preparation and model selection. In our next installment, we’ll put this knowledge to work:

Part 3: Fine-Tuning with Unsloth

In Part 3, you’ll learn:

  • Setting up Unsloth for efficient training
  • Configuring LoRA adapters for parameter-efficient fine-tuning
  • Running the complete training pipeline
  • Monitoring training progress with Weights & Biases
  • Saving and managing model checkpoints

Key Takeaways from Part 2

  1. Quality > Quantity: 500 high-quality examples beat 5,000 mediocre ones
  2. Format Matters: Choose the right format for your training approach
  3. Model Selection: Match your base model to your use case and resources
  4. Validation: Always validate your data before training
  5. Preparation: Proper formatting saves debugging time later

Continue to Part 3: Fine-Tuning with Unsloth to start training your model!

Saptak Sen

If you enjoyed this post, you should check out my book: Starting with Spark.
