Building a Gambling Model: Part 1 - Data Lake Architecture

18th April 2025

In this series, I'll walk through how I built a statistical model to calculate fair gambling odds using large amounts of data. This first article focuses on the data lake architecture that forms the foundation of our system. For a detailed introduction to the mathematics behind profitable gambling, including concepts like expected value, overround, and the Kelly criterion, check out my previous article An Introduction to Probability for Gambling.

Introduction

When building a model to predict gambling odds, the quality and quantity of data are crucial. Our goal is to create a comprehensive dataset that combines multiple sources of information:

  1. Historical Race Results: Detailed data about past races, including:

    • Race outcomes and finishing positions
    • Track conditions and race times
    • Individual performance metrics
    • Trainer statistics
  2. Historical Betting Odds: Market data that includes:

    • Opening and closing odds
    • Price movements throughout the betting period
    • Market liquidity and volume
    • Starting prices and final dividends
  3. Environmental Data: External factors that might influence race outcomes:

    • Weather conditions (temperature, humidity, wind speed)
    • Track surface conditions
    • Time of day and seasonal variations
    • Venue-specific characteristics

By combining these diverse data sources, we aim to identify patterns and correlations that can help predict fair odds for future races. The data lake architecture provides the perfect foundation for this kind of project, allowing us to store and process these different types of data efficiently.
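
Purely as an illustration of where we're heading, a single combined training record might eventually look something like the sketch below. Every field name here is a hypothetical placeholder; the real schema will emerge once we've built out the ML tier.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class RunnerRecord:
    """One runner in one race, combining features from all three sources.

    All field names are illustrative placeholders, not a final schema.
    """
    event_id: str
    runner_id: str
    start_time: datetime
    # Historical race results
    recent_finish_positions: list[int]
    trainer_win_rate: float
    # Historical betting odds
    opening_odds: float
    starting_price: float
    traded_volume: float
    # Environmental data
    temperature_c: float
    wind_speed_kph: float
    track_condition: str
    # Target: did this runner win?
    won: bool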

Why Betting Markets?

While financial markets have been extensively studied and modeled, betting markets present unique challenges and opportunities. The most fundamental difference is that we're betting on actual outcomes rather than market sentiment:

Outcome-Based vs. Sentiment-Based Markets

In financial markets, success depends on predicting how other market participants will behave. The price of a stock, for example, is determined by the collective expectations of buyers and sellers rather than by any single measurable outcome.

In contrast, betting markets are outcome-based. When you place a bet, the result is settled by what actually happens in the race, not by how other participants later value your position.

This fundamental difference means our modeling approach must focus on:

  1. Physical and environmental factors that affect performance
  2. Historical patterns in race outcomes
  3. Individual and team statistics
  4. Track and weather conditions

While market sentiment still affects the odds (and thus potential profits), it doesn't influence the actual outcome we're trying to predict, unlike in the stock market, where sentiment itself moves the price.

Other Key Differences

  1. Market Sophistication: Unlike financial markets dominated by sophisticated algorithmic traders and institutional investors, betting markets offer unique opportunities:

    • Many participants make decisions based on emotion rather than data
    • Limited competition from quantitative models
    • Opportunities to exploit systematic biases in the market
    • Less efficient price discovery mechanisms
    While this presents advantages, it also means we face:
    • Limited historical data compared to financial markets
    • Gaps in data during off-seasons
    • Need to carefully validate our assumptions with smaller datasets
  2. Market Inefficiencies: Betting markets often exhibit:

    • Less sophisticated participants compared to financial markets
    • Systematic psychological biases (for example, the well-documented favorite-longshot bias)
    • Regional variations in market efficiency
  3. Data Quality Challenges:

    • Inconsistent data collection across different venues
    • Missing or incomplete historical records
    • Changes in data collection methods over time
  4. Unique Market Dynamics:

    • Each event is unique and non-replicable
    • Market sentiment can be heavily influenced by public opinion

These challenges make modeling betting markets particularly interesting from a machine learning perspective. With smaller and noisier datasets than financial markets provide, we need to be especially careful about validating our assumptions and avoiding overfitting.

Why a Data Lake?

Traditional data warehouses, while powerful, can be expensive and rigid. They require predefined schemas and often struggle with unstructured data. A data lake, on the other hand, stores data cheaply in its original format, scales with commodity object storage, and lets us defer schema decisions until we actually need to query the data.

The Data Lake Architecture

Our data lake follows a multi-tier architecture:

  1. Raw Tier: Initial landing zone for all data in its original format

    • Historical race results from the racing API
    • Weather data from meteorological services
    • Betting market data from various sources
  2. Structured Tier: Data organized into a more queryable format

    • Normalized race results
    • Standardized weather measurements
    • Consolidated betting odds
  3. Cleaned Tier: Data validated and cleaned for analysis

    • Removed outliers and anomalies
    • Handled missing values
    • Standardized formats and units
  4. ML Tier: Processed features ready for model training

    • Combined features from all data sources
    • Time-based aggregations
    • Derived metrics and indicators

This tiered approach allows us to trace any processed value back to its raw source, reprocess downstream tiers whenever our cleaning or feature logic changes, and keep experimental ML features separate from validated data.
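
To make the tier layout concrete, here is a minimal sketch of the S3 key convention the tiers can follow. The helper and the structured/cleaned/ml prefixes are illustrative; the raw racing layout matches the Lambda shown below.

from datetime import date

BUCKET = 'gambling-model-data-lake'
TIERS = ('raw', 'structured', 'cleaned', 'ml')

def tier_key(tier: str, source: str, table: str, day: date) -> str:
    """Build a Hive-style partitioned S3 key for a tier, data source and table."""
    if tier not in TIERS:
        raise ValueError(f"Unknown tier: {tier}")
    return (
        f"{tier}/{source}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{table}.json"
    )

# Example: the raw-tier key for today's racing meetings
print(f"s3://{BUCKET}/{tier_key('raw', 'racing', 'meetings', date.today())}")

Partitioning keys as key=value pairs like this lets query engines such as Athena prune by date instead of scanning the whole bucket.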

Implementation

Infrastructure as Code

First, let's set up our S3 bucket using Terraform. We keep the configuration simple since we want to retain all historical data for our model:

resource "aws_s3_bucket" "gambling_data_lake" {
  bucket = "gambling-model-data-lake"
}

Data Collection Lambda

We use a Lambda function to periodically scrape and store the racing data. The data is organized into several key entities:

import json
import boto3
from datetime import datetime
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket_name = 'gambling-model-data-lake'
    
    # Fetch data from the racing API
    raw_data = fetch_racing_data()
    
    # Process the data into structured DataFrames
    processed_data = {
        'meetings': process_meetings(raw_data),
        'events': process_events(raw_data),
        'competitors': process_competitors(raw_data),
        'trainers': process_trainers(raw_data),
        'selections': process_selections(raw_data)
    }
    
    # Upload each DataFrame to S3 with partitioning
    date_str = datetime.now().strftime('%Y-%m-%d')
    year, month, day = date_str.split('-')
    
    for table_name, df in processed_data.items():
        # Convert DataFrame to JSON
        json_data = df.to_json(orient='records', date_format='iso')
        
        # Create partitioned S3 key
        s3_key = f"raw/racing/year={year}/month={month}/day={day}/{table_name}.json"
        
        # Upload to S3
        s3.put_object(
            Bucket=bucket_name,
            Key=s3_key,
            Body=json_data,
            ContentType='application/json'
        )
    
    return {
        'statusCode': 200,
        'body': json.dumps('Data collection completed successfully')
    }

def fetch_racing_data():
    # Implementation of racing data fetching
    # This would include the actual API calls and data retrieval
    pass

def process_meetings(data):
    # Process meeting data into a DataFrame
    pass

def process_events(data):
    # Process event data into a DataFrame
    pass

def process_competitors(data):
    # Process competitor data into a DataFrame
    pass

def process_trainers(data):
    # Process trainer data into a DataFrame
    pass

def process_selections(data):
    # Process selection data into a DataFrame
    pass

The Lambda function:

  1. Fetches data from the racing API
  2. Processes the raw data into structured DataFrames
  3. Organizes the data by entity type (meetings, events, competitors, etc.)
  4. Uploads each entity to S3 with year/month/day partitioning
  5. Maintains data lineage through the partitioning structure
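
Once data lands in these partitions, pulling a single day back out for inspection is straightforward. The helper below is a small sketch rather than part of the pipeline; the bucket and key layout match the Lambda above, but the function name and example date are just illustrative.

import boto3
import pandas as pd

def load_partition(table_name: str, year: str, month: str, day: str) -> pd.DataFrame:
    """Load one entity table for a single day from the raw tier."""
    s3 = boto3.client('s3')
    key = f"raw/racing/year={year}/month={month}/day={day}/{table_name}.json"
    obj = s3.get_object(Bucket='gambling-model-data-lake', Key=key)
    return pd.read_json(obj['Body'], orient='records')

# Example: inspect the meetings collected on a given day
meetings = load_partition('meetings', '2025', '04', '18')
print(meetings.head())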

Financial Market Data Collection

In addition to race results and environmental data, we also collect historical betting market data from Betfair, the world's largest betting exchange. Betfair provides historical data downloads through their website, offering a comprehensive view of market movements and odds changes.

The data comes in Betfair's PRO format, a detailed dataset of market information for each betting market. Instead of processing these files locally, we've implemented a cloud-native approach using AWS services. This approach ensures our data never leaves AWS datacenters and scales automatically with our needs.

Our implementation uses a Lambda function that processes new data as it arrives:

import os
import tarfile
import tempfile

import boto3

def lambda_handler(event, context):
    """
    Lambda function triggered when new tar files are uploaded to the landing zone.
    Extracts the contents and organizes them in the data lake.
    """
    s3 = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Only process files from the landing zone
    if not key.startswith('landing/betfair-pro/'):
        return
    
    with tempfile.TemporaryDirectory() as temp_dir:
        # Download the tar file from landing zone
        local_path = os.path.join(temp_dir, 'archive.tar')
        s3.download_file(bucket, key, local_path)
        
        # Extract the PRO data archive
        with tarfile.open(local_path, 'r:*') as tar:
            tar.extractall(path=temp_dir)
        
        # Process and upload the extracted .bz2 files
        pro_dir = os.path.join(temp_dir, 'PRO')
        if os.path.exists(pro_dir):
            for root, _, files in os.walk(pro_dir):
                for file in files:
                    if file.endswith('.bz2'):
                        # Maintain the original directory structure in S3,
                        # avoiding a stray "./" segment for files at the top of PRO/
                        relative_path = os.path.relpath(root, pro_dir)
                        if relative_path == '.':
                            s3_key = f"raw/betfair-pro/{file}"
                        else:
                            s3_key = f"raw/betfair-pro/{relative_path}/{file}"
                        
                        # Upload the .bz2 file directly to the raw tier
                        with open(os.path.join(root, file), 'rb') as f:
                            s3.upload_fileobj(f, bucket, s3_key)
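
For completeness, the trigger is nothing more than an S3 event notification scoped to the landing prefix. The sketch below shows one way to configure it with boto3; the Lambda ARN is a placeholder, and S3 must separately be granted permission to invoke the function.

import boto3

s3 = boto3.client('s3')

# Hypothetical ARN of the extraction Lambda shown above
LAMBDA_ARN = 'arn:aws:lambda:eu-west-1:123456789012:function:betfair-pro-extractor'

s3.put_bucket_notification_configuration(
    Bucket='gambling-model-data-lake',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': LAMBDA_ARN,
                'Events': ['s3:ObjectCreated:*'],
                # Only fire for tar files dropped into the landing zone
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'landing/betfair-pro/'},
                            {'Name': 'suffix', 'Value': '.tar'},
                        ]
                    }
                },
            }
        ]
    },
)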

This cloud-native architecture provides several key advantages over local processing:

  1. Landing Zone Pattern:

    • Dedicated S3 prefix (landing/betfair-pro/) for new tar files
    • Lambda function triggered automatically on upload
    • Clear separation between incoming data and processed data
    • Easy monitoring of new data arrivals
  2. Serverless Processing:

    • No need for local compute resources
    • Automatic scaling based on upload volume
    • Pay only for actual processing time
    • Built-in retry mechanisms and error handling
  3. Data Locality:

    • Data never leaves AWS datacenters
    • Reduced network transfer costs
    • Lower latency for processing
    • Better security through reduced data movement

The resulting data lake structure is organized as follows:

s3://gambling-model-data-lake/
├── landing/
│   └── betfair-pro/           # New tar files arrive here
│       └── 2024-04-18.tar
└── raw/
    └── betfair-pro/           # Extracted .bz2 files
        ├── 2024/
        │   ├── 04/
        │   │   └── 18/
        │   │       └── market_data.bz2
        │   └── ...
        └── ...

This structure supports our market data processing needs in several ways:

  1. Exchange-Based Data: Unlike traditional bookmakers, Betfair operates as an exchange where:

    • Users can both back (bet for) and lay (bet against) outcomes
    • Odds are determined by market participants
    • We can see the actual volume of money at each price point
    • The data reflects true market sentiment rather than bookmaker margins
  2. Raw Data Structure:

    • Compressed .bz2 files containing market movements
    • Each file represents a specific betting market
    • Data includes odds changes, market liquidity, and betting volumes
    • Files are organized by date in the data lake
  3. Batch Processing:

    • Historical data is uploaded in bulk to the landing zone (see the upload sketch after this list)
    • Files are partitioned by year/month/day for efficient querying
    • Parallel uploads optimize the ingestion process
    • Consistent naming and organization enable easy tracking
  4. Future Real-time Collection:

    • In a later article, we'll implement a Lambda architecture
    • This will enable real-time market data collection
    • Combining batch and real-time processing for comprehensive coverage
    • Ensuring we capture both historical patterns and current market dynamics
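
The bulk upload itself can be as simple as pushing the downloaded archives into the landing prefix in parallel. Here is a minimal sketch, assuming the tar files from Betfair have already been downloaded to a local folder; the folder path and worker count are arbitrary.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = 'gambling-model-data-lake'
LOCAL_DIR = Path('downloads/betfair-pro')  # hypothetical local download folder

# boto3 clients are thread-safe, so a single client can be shared across workers
s3 = boto3.client('s3')

def upload_archive(path: Path) -> str:
    """Upload one tar archive into the landing zone, keyed by its file name."""
    key = f"landing/betfair-pro/{path.name}"
    s3.upload_file(str(path), BUCKET, key)
    return key

# Upload archives in parallel; each arrival triggers the extraction Lambda
with ThreadPoolExecutor(max_workers=8) as pool:
    for key in pool.map(upload_archive, sorted(LOCAL_DIR.glob('*.tar'))):
        print(f"uploaded s3://{BUCKET}/{key}")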

The raw market data will be processed through our data lake tiers:

  1. Raw Tier: Original .bz2 files with market movements
  2. Structured Tier: Extracted and normalized market data
  3. Cleaned Tier: Validated and processed market information
  4. ML Tier: Features derived from market movements and patterns
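
As a first step toward the structured tier, the compressed files can be pulled from S3 and decoded line by line. The sketch below assumes each .bz2 file contains newline-delimited JSON messages (the format Betfair uses for its historical stream data); the key shown is just the example from the tree above.

import bz2
import json

import boto3

def read_market_file(bucket: str, key: str) -> list[dict]:
    """Decompress one raw-tier market file and parse its JSON messages."""
    s3 = boto3.client('s3')
    raw_bytes = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    lines = bz2.decompress(raw_bytes).splitlines()
    return [json.loads(line) for line in lines if line.strip()]

# Example: peek at the first message of the file from the tree above
messages = read_market_file(
    'gambling-model-data-lake',
    'raw/betfair-pro/2024/04/18/market_data.bz2',
)
print(messages[0])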

This market data, combined with race results and environmental factors, will help us estimate fair probabilities for each runner and spot where the market's prices diverge from those estimates.

Next Steps

In the next article, we'll explore how to process this raw data into our structured and cleaned tiers, preparing it for model training. We'll cover:

  1. Data quality checks and cleaning procedures
  2. Feature engineering techniques for combining different data sources
  3. Time-series processing for historical data analysis
  4. Creating training datasets for the machine learning model

Stay tuned for the next part of this series, where we'll dive into data processing and feature engineering for our gambling model.