Building a Gambling Model: Part 1 - Data Lake Architecture

18th April 2025

In this series, I'll walk through how I built a statistical model to calculate fair gambling odds using large amounts of data. This first article focuses on the data lake architecture that forms the foundation of our system. For a detailed introduction to the mathematics behind profitable gambling, including concepts like expected value, overround, and the Kelly criterion, check out my previous article An Introduction to Probability for Gambling.

Introduction

When building a model to predict gambling odds, the quality and quantity of data are crucial. Our goal is to create a comprehensive dataset that combines multiple sources of information:

  1. Historical Race Results: Detailed data about past races, including:

    • Race outcomes and finishing positions
    • Track conditions and race times
    • Individual performance metrics
    • Trainer statistics
  2. Historical Betting Odds: Market data that includes:

    • Opening and closing odds
    • Price movements throughout the betting period
    • Market liquidity and volume
    • Starting prices and final dividends
  3. Environmental Data: External factors that might influence race outcomes:

    • Weather conditions (temperature, humidity, wind speed)
    • Track surface conditions
    • Time of day and seasonal variations
    • Venue-specific characteristics

By combining these diverse data sources, we aim to identify patterns and correlations that can help predict fair odds for future races. The data lake architecture provides the perfect foundation for this kind of project, allowing us to store and process these different types of data efficiently.
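
Purely as an illustration of where we're heading, a single combined training record might eventually look something like the sketch below. Every field name here is a hypothetical placeholder; the real schema will emerge once we've built out the ML tier.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class RunnerRecord:
    """One runner in one race, combining features from all three sources.

    All field names are illustrative placeholders, not a final schema.
    """
    event_id: str
    runner_id: str
    start_time: datetime
    # Historical race results
    recent_finish_positions: list[int]
    trainer_win_rate: float
    # Historical betting odds
    opening_odds: float
    starting_price: float
    traded_volume: float
    # Environmental data
    temperature_c: float
    wind_speed_kph: float
    track_condition: str
    # Target: did this runner win?
    won: bool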

Why Betting Markets?

While financial markets have been extensively studied and modeled, betting markets present unique challenges and opportunities. The most fundamental difference is that we're betting on actual outcomes rather than market sentiment:

Outcome-Based vs. Sentiment-Based Markets

In financial markets, success depends on predicting how other market participants will behave. The price of a stock, for example, is determined by the collective expectations of buyers and sellers rather than by any single measurable outcome.

In contrast, betting markets are outcome-based. When you place a bet, the result is settled by what actually happens in the race, not by how other participants later value your position.

This fundamental difference means our modeling approach must focus on:

  1. Physical and environmental factors that affect performance
  2. Historical patterns in race outcomes
  3. Individual and team statistics
  4. Track and weather conditions

While market sentiment still affects the odds (and thus potential profits), it doesn't influence the actual outcome we're trying to predict, unlike in the stock market, where sentiment itself moves the price.

Other Key Differences

  1. Market Sophistication: Unlike financial markets dominated by sophisticated algorithmic traders and institutional investors, betting markets offer unique opportunities:

    • Many participants make decisions based on emotion rather than data
    • Limited competition from quantitative models
    • Opportunities to exploit systematic biases in the market
    • Less efficient price discovery mechanisms
    While this presents advantages, it also means we face:
    • Limited historical data compared to financial markets
    • Gaps in data during off-seasons
    • Need to carefully validate our assumptions with smaller datasets
  2. Market Inefficiencies: Betting markets often exhibit:

    • Less sophisticated participants compared to financial markets
    • Systematic psychological biases (for example, the well-documented favorite-longshot bias)
    • Regional variations in market efficiency
  3. Data Quality Challenges:

    • Inconsistent data collection across different venues
    • Missing or incomplete historical records
    • Changes in data collection methods over time
  4. Unique Market Dynamics:

    • Each event is unique and non-replicable
    • Market sentiment can be heavily influenced by public opinion

These challenges make modeling betting markets particularly interesting from a machine learning perspective. With smaller and noisier datasets than financial markets provide, we need to be especially careful about validating our assumptions and avoiding overfitting.

Why a Data Lake?

Traditional data warehouses, while powerful, can be expensive and rigid. They require predefined schemas and often struggle with unstructured data. A data lake, on the other hand, stores data cheaply in its original format, scales with commodity object storage, and lets us defer schema decisions until we actually need to query the data.

The Data Lake Architecture

Our data lake follows a multi-tier architecture:

  1. Raw Tier: Initial landing zone for all data in its original format

    • Historical race results from the racing API
    • Weather data from meteorological services
    • Betting market data from various sources
  2. Structured Tier: Data organized into a more queryable format

    • Normalized race results
    • Standardized weather measurements
    • Consolidated betting odds
  3. Cleaned Tier: Data validated and cleaned for analysis

    • Removed outliers and anomalies
    • Handled missing values
    • Standardized formats and units
  4. ML Tier: Processed features ready for model training

    • Combined features from all data sources
    • Time-based aggregations
    • Derived metrics and indicators

This tiered approach allows us to trace any processed value back to its raw source, reprocess downstream tiers whenever our cleaning or feature logic changes, and keep experimental ML features separate from validated data.
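
To make the tier layout concrete, here is a minimal sketch of the S3 key convention the tiers can follow. The helper and the structured/cleaned/ml prefixes are illustrative; the raw racing layout matches the Lambda shown below.

from datetime import date

BUCKET = 'gambling-model-data-lake'
TIERS = ('raw', 'structured', 'cleaned', 'ml')

def tier_key(tier: str, source: str, table: str, day: date) -> str:
    """Build a Hive-style partitioned S3 key for a tier, data source and table."""
    if tier not in TIERS:
        raise ValueError(f"Unknown tier: {tier}")
    return (
        f"{tier}/{source}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{table}.json"
    )

# Example: the raw-tier key for today's racing meetings
print(f"s3://{BUCKET}/{tier_key('raw', 'racing', 'meetings', date.today())}")

Partitioning keys as key=value pairs like this lets query engines such as Athena prune by date instead of scanning the whole bucket.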

Implementation

Infrastructure as Code

First, let's set up our S3 bucket using Terraform. We keep the configuration simple since we want to retain all historical data for our model:

resource "aws_s3_bucket" "gambling_data_lake" {
  bucket = "gambling-model-data-lake"
}

Data Collection Lambda

We use a Lambda function to periodically scrape and store the racing data. The data is organized into several key entities:

import json
import boto3
from datetime import datetime
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket_name = 'gambling-model-data-lake'
    
    # Fetch data from the racing API
    raw_data = fetch_racing_data()
    
    # Process the data into structured DataFrames
    processed_data = {
        'meetings': process_meetings(raw_data),
        'events': process_events(raw_data),
        'competitors': process_competitors(raw_data),
        'trainers': process_trainers(raw_data),
        'selections': process_selections(raw_data)
    }
    
    # Upload each DataFrame to S3 with partitioning
    date_str = datetime.now().strftime('%Y-%m-%d')
    year, month, day = date_str.split('-')
    
    for table_name, df in processed_data.items():
        # Convert DataFrame to JSON
        json_data = df.to_json(orient='records', date_format='iso')
        
        # Create partitioned S3 key
        s3_key = f"raw/racing/year={year}/month={month}/day={day}/{table_name}.json"
        
        # Upload to S3
        s3.put_object(
            Bucket=bucket_name,
            Key=s3_key,
            Body=json_data,
            ContentType='application/json'
        )
    
    return {
        'statusCode': 200,
        'body': json.dumps('Data collection completed successfully')
    }

def fetch_racing_data():
    # Implementation of racing data fetching
    # This would include the actual API calls and data retrieval
    pass

def process_meetings(data):
    # Process meeting data into a DataFrame
    pass

def process_events(data):
    # Process event data into a DataFrame
    pass

def process_competitors(data):
    # Process competitor data into a DataFrame
    pass

def process_trainers(data):
    # Process trainer data into a DataFrame
    pass

def process_selections(data):
    # Process selection data into a DataFrame
    pass

The Lambda function:

  1. Fetches data from the racing API
  2. Processes the raw data into structured DataFrames
  3. Organizes the data by entity type (meetings, events, competitors, etc.)
  4. Uploads each entity to S3 with year/month/day partitioning
  5. Maintains data lineage through the partitioning structure
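
Once data lands in these partitions, pulling a single day back out for inspection is straightforward. The helper below is a small sketch rather than part of the pipeline; the bucket and key layout match the Lambda above, but the function name and example date are just illustrative.

import boto3
import pandas as pd

def load_partition(table_name: str, year: str, month: str, day: str) -> pd.DataFrame:
    """Load one entity table for a single day from the raw tier."""
    s3 = boto3.client('s3')
    key = f"raw/racing/year={year}/month={month}/day={day}/{table_name}.json"
    obj = s3.get_object(Bucket='gambling-model-data-lake', Key=key)
    return pd.read_json(obj['Body'], orient='records')

# Example: inspect the meetings collected on a given day
meetings = load_partition('meetings', '2025', '04', '18')
print(meetings.head())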

Financial Market Data Collection

In addition to race results and environmental data, we also collect historical betting market data from Betfair, the world's largest betting exchange. Betfair provides historical data downloads through their website, offering a comprehensive view of market movements and odds changes.

The data comes in Betfair's PRO format, a detailed dataset of market information for each betting market. Instead of processing these files locally, we've implemented a cloud-native approach using AWS services. This approach ensures our data never leaves AWS datacenters and scales automatically with our needs.

Our implementation uses a Lambda function that processes new data as it arrives:

import os
import tarfile
import tempfile

import boto3

def lambda_handler(event, context):
    """
    Lambda function triggered when new tar files are uploaded to the landing zone.
    Extracts the contents and organizes them in the data lake.
    """
    s3 = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Only process files from the landing zone
    if not key.startswith('landing/betfair-pro/'):
        return
    
    with tempfile.TemporaryDirectory() as temp_dir:
        # Download the tar file from landing zone
        local_path = os.path.join(temp_dir, 'archive.tar')
        s3.download_file(bucket, key, local_path)
        
        # Extract the PRO data archive
        with tarfile.open(local_path, 'r:*') as tar:
            tar.extractall(path=temp_dir)
        
        # Process and upload the extracted .bz2 files
        pro_dir = os.path.join(temp_dir, 'PRO')
        if os.path.exists(pro_dir):
            for root, _, files in os.walk(pro_dir):
                for file in files:
                    if file.endswith('.bz2'):
                        # Maintain the original directory structure in S3,
                        # avoiding a stray "./" segment for files at the top of PRO/
                        relative_path = os.path.relpath(root, pro_dir)
                        if relative_path == '.':
                            s3_key = f"raw/betfair-pro/{file}"
                        else:
                            s3_key = f"raw/betfair-pro/{relative_path}/{file}"
                        
                        # Upload the .bz2 file directly to the raw tier
                        with open(os.path.join(root, file), 'rb') as f:
                            s3.upload_fileobj(f, bucket, s3_key)
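
For completeness, the trigger is nothing more than an S3 event notification scoped to the landing prefix. The sketch below shows one way to configure it with boto3; the Lambda ARN is a placeholder, and S3 must separately be granted permission to invoke the function.

import boto3

s3 = boto3.client('s3')

# Hypothetical ARN of the extraction Lambda shown above
LAMBDA_ARN = 'arn:aws:lambda:eu-west-1:123456789012:function:betfair-pro-extractor'

s3.put_bucket_notification_configuration(
    Bucket='gambling-model-data-lake',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': LAMBDA_ARN,
                'Events': ['s3:ObjectCreated:*'],
                # Only fire for tar files dropped into the landing zone
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'landing/betfair-pro/'},
                            {'Name': 'suffix', 'Value': '.tar'},
                        ]
                    }
                },
            }
        ]
    },
)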

This cloud-native architecture provides several key advantages over local processing:

  1. Landing Zone Pattern:

    • Dedicated S3 prefix (landing/betfair-pro/) for new tar files
    • Lambda function triggered automatically on upload
    • Clear separation between incoming data and processed data
    • Easy monitoring of new data arrivals
  2. Serverless Processing:

    • No need for local compute resources
    • Automatic scaling based on upload volume
    • Pay only for actual processing time
    • Built-in retry mechanisms and error handling
  3. Data Locality:

    • Data never leaves AWS datacenters
    • Reduced network transfer costs
    • Lower latency for processing
    • Better security through reduced data movement

The resulting data lake structure is organized as follows:

s3://gambling-model-data-lake/
├── landing/
│   └── betfair-pro/           # New tar files arrive here
│       └── 2024-04-18.tar
└── raw/
    └── betfair-pro/           # Extracted .bz2 files
        ├── 2024/
        │   ├── 04/
        │   │   └── 18/
        │   │       └── market_data.bz2
        │   └── ...
        └── ...

This structure supports our market data processing needs in several ways:

  1. Exchange-Based Data: Unlike traditional bookmakers, Betfair operates as an exchange where:

    • Users can both back (bet for) and lay (bet against) outcomes
    • Odds are determined by market participants
    • We can see the actual volume of money at each price point
    • The data reflects true market sentiment rather than bookmaker margins
  2. Raw Data Structure:

    • Compressed .bz2 files containing market movements
    • Each file represents a specific betting market
    • Data includes odds changes, market liquidity, and betting volumes
    • Files are organized by date in the data lake
  3. Batch Processing:

    • Historical data is uploaded in bulk to the landing zone (see the upload sketch after this list)
    • Files are partitioned by year/month/day for efficient querying
    • Parallel uploads optimize the ingestion process
    • Consistent naming and organization enable easy tracking
  4. Future Real-time Collection:

    • In a later article, we'll implement a Lambda architecture
    • This will enable real-time market data collection
    • Combining batch and real-time processing for comprehensive coverage
    • Ensuring we capture both historical patterns and current market dynamics
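
The bulk upload itself can be as simple as pushing the downloaded archives into the landing prefix in parallel. Here is a minimal sketch, assuming the tar files from Betfair have already been downloaded to a local folder; the folder path and worker count are arbitrary.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = 'gambling-model-data-lake'
LOCAL_DIR = Path('downloads/betfair-pro')  # hypothetical local download folder

# boto3 clients are thread-safe, so a single client can be shared across workers
s3 = boto3.client('s3')

def upload_archive(path: Path) -> str:
    """Upload one tar archive into the landing zone, keyed by its file name."""
    key = f"landing/betfair-pro/{path.name}"
    s3.upload_file(str(path), BUCKET, key)
    return key

# Upload archives in parallel; each arrival triggers the extraction Lambda
with ThreadPoolExecutor(max_workers=8) as pool:
    for key in pool.map(upload_archive, sorted(LOCAL_DIR.glob('*.tar'))):
        print(f"uploaded s3://{BUCKET}/{key}")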

The raw market data will be processed through our data lake tiers:

  1. Raw Tier: Original .bz2 files with market movements
  2. Structured Tier: Extracted and normalized market data
  3. Cleaned Tier: Validated and processed market information
  4. ML Tier: Features derived from market movements and patterns
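
As a first step toward the structured tier, the compressed files can be pulled from S3 and decoded line by line. The sketch below assumes each .bz2 file contains newline-delimited JSON messages (the format Betfair uses for its historical stream data); the key shown is just the example from the tree above.

import bz2
import json

import boto3

def read_market_file(bucket: str, key: str) -> list[dict]:
    """Decompress one raw-tier market file and parse its JSON messages."""
    s3 = boto3.client('s3')
    raw_bytes = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    lines = bz2.decompress(raw_bytes).splitlines()
    return [json.loads(line) for line in lines if line.strip()]

# Example: peek at the first message of the file from the tree above
messages = read_market_file(
    'gambling-model-data-lake',
    'raw/betfair-pro/2024/04/18/market_data.bz2',
)
print(messages[0])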

This market data, combined with race results and environmental factors, will help us estimate fair probabilities for each runner and spot where the market's prices diverge from those estimates.

Next Steps

In the next article, we'll explore how to process this raw data into our structured and cleaned tiers, preparing it for model training. We'll cover:

  1. Data quality checks and cleaning procedures
  2. Feature engineering techniques for combining different data sources
  3. Time-series processing for historical data analysis
  4. Creating training datasets for the machine learning model

Stay tuned for the next part of this series, where we'll dive into data processing and feature engineering for our gambling model.