Building a Gambling Model: Part 1 - Data Lake Architecture
18th April 2025
In this series, I'll walk through how I built a statistical model to calculate fair gambling odds using large amounts of data. This first article focuses on the data lake architecture that forms the foundation of our system. For a detailed introduction to the mathematics behind profitable gambling, including concepts like expected value, overround, and the Kelly criterion, check out my previous article An Introduction to Probability for Gambling.
Introduction
When building a model to predict gambling odds, the quality and quantity of data are crucial. Our goal is to create a comprehensive dataset that combines multiple sources of information:
- Historical Race Results: Detailed data about past races, including:
  - Race outcomes and finishing positions
  - Track conditions and race times
  - Individual performance metrics
  - Trainer statistics
- Historical Betting Odds: Market data that includes:
  - Opening and closing odds
  - Price movements throughout the betting period
  - Market liquidity and volume
  - Starting prices and final dividends
- Environmental Data: External factors that might influence race outcomes:
  - Weather conditions (temperature, humidity, wind speed)
  - Track surface conditions
  - Time of day and seasonal variations
  - Venue-specific characteristics
By combining these diverse data sources, we aim to identify patterns and correlations that can help predict fair odds for future races. The data lake architecture provides the perfect foundation for this kind of project, allowing us to store and process these different types of data efficiently.
Why Betting Markets?
While financial markets have been extensively studied and modeled, betting markets present unique challenges and opportunities. The most fundamental difference is that we're betting on actual outcomes rather than market sentiment:
Outcome-Based vs. Sentiment-Based Markets
In financial markets, success depends on predicting how other market participants will behave. The price of a stock, for example, is determined by:
- Market sentiment and psychology
- Institutional trading patterns
- News and media influence
- Economic indicators
- The actions of other traders
In contrast, betting markets are outcome-based. When you place a bet:
- The outcome is determined by physical performance
- Market sentiment has no direct effect on the actual result
- The odds reflect the probability of a real-world event
- Success depends on understanding the underlying factors that influence performance
This fundamental difference means our modeling approach must focus on:
- Physical and environmental factors that affect performance
- Historical patterns in race outcomes
- Individual and team statistics
- Track and weather conditions
While market sentiment still affects the odds (and thus the potential profits), it doesn't influence the actual outcome we're trying to predict, as it does in the stock market.
Other Key Differences
- Market Sophistication: Unlike financial markets dominated by sophisticated algorithmic traders and institutional investors, betting markets offer unique opportunities:
  - Many participants make decisions based on emotion rather than data
  - Limited competition from quantitative models
  - Opportunities to exploit systematic biases in the market
  - Less efficient price discovery mechanisms

  While this presents advantages, it also means we face:
  - Limited historical data compared to financial markets
  - Gaps in data during off-seasons
  - Need to carefully validate our assumptions with smaller datasets
- Market Inefficiencies: Betting markets often exhibit:
  - Less sophisticated participants compared to financial markets
  - Interesting psychological biases
  - Regional variations in market efficiency
- Data Quality Challenges:
  - Inconsistent data collection across different venues
  - Missing or incomplete historical records
  - Changes in data collection methods over time
- Unique Market Dynamics:
  - Each event is unique and non-replicable
  - Market sentiment can be heavily influenced by public opinion
These challenges make modeling betting markets particularly interesting from a machine learning perspective. We need to be especially careful about:
- Handling sparse data effectively
- Dealing with missing values and outliers
- Accounting for market inefficiencies
- Validating model assumptions with limited data
Why a Data Lake?
Traditional data warehouses, while powerful, can be expensive and rigid. They require predefined schemas and often struggle with unstructured data. A data lake, on the other hand, offers several advantages:
- Flexibility: Store any type of data (structured, semi-structured, or unstructured) without predefined schemas
- Cost-effective: Raw storage is cheaper than processed warehouse storage
- Scalability: Easily handle growing data volumes
- Multi-tier processing: Process data incrementally as needed
The Data Lake Architecture
Our data lake follows a multi-tier architecture:
- Raw Tier: Initial landing zone for all data in its original format
  - Historical race results from the racing API
  - Weather data from meteorological services
  - Betting market data from various sources
- Structured Tier: Data organized into a more queryable format
  - Normalized race results
  - Standardized weather measurements
  - Consolidated betting odds
- Cleaned Tier: Data validated and cleaned for analysis
  - Removed outliers and anomalies
  - Handled missing values
  - Standardized formats and units
- ML Tier: Processed features ready for model training
  - Combined features from all data sources
  - Time-based aggregations
  - Derived metrics and indicators
This tiered approach allows us to:
- Maintain data lineage
- Process data incrementally
- Keep raw data for reprocessing if needed
- Optimize storage costs
- Enable different access patterns for different use cases
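To make this concrete, one way the tiers might map onto S3 prefixes is sketched below. The raw/ prefix is the one actually used by the collection code later in this article; the structured/, cleaned/ and ml/ names are illustrative placeholders for later articles in this series:

s3://gambling-model-data-lake/
├── raw/          # original payloads, partitioned by source and date
├── structured/   # normalized, queryable tables
├── cleaned/      # validated data with outliers and gaps handled
└── ml/           # model-ready feature sets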
Implementation
Infrastructure as Code
First, let's set up our S3 bucket using Terraform. We keep the configuration simple since we want to retain all historical data for our model:
resource "aws_s3_bucket" "gambling_data_lake" {
  bucket = "gambling-model-data-lake"
}
Data Collection Lambda
We use a Lambda function to periodically scrape and process the racing data. The data is organized into several key entities:
import json
import boto3
from datetime import datetime
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket_name = 'gambling-model-data-lake'

    # Fetch data from the racing API
    raw_data = fetch_racing_data()

    # Process the data into structured DataFrames
    processed_data = {
        'meetings': process_meetings(raw_data),
        'events': process_events(raw_data),
        'competitors': process_competitors(raw_data),
        'trainers': process_trainers(raw_data),
        'selections': process_selections(raw_data)
    }

    # Upload each DataFrame to S3 with partitioning
    date_str = datetime.now().strftime('%Y-%m-%d')
    year, month, day = date_str.split('-')

    for table_name, df in processed_data.items():
        # Convert DataFrame to JSON
        json_data = df.to_json(orient='records', date_format='iso')

        # Create partitioned S3 key
        s3_key = f"raw/racing/year={year}/month={month}/day={day}/{table_name}.json"

        # Upload to S3
        s3.put_object(
            Bucket=bucket_name,
            Key=s3_key,
            Body=json_data,
            ContentType='application/json'
        )

    return {
        'statusCode': 200,
        'body': json.dumps('Data collection completed successfully')
    }

def fetch_racing_data():
    # Implementation of racing data fetching
    # This would include the actual API calls and data retrieval
    pass

def process_meetings(data):
    # Process meeting data into a DataFrame
    pass

def process_events(data):
    # Process event data into a DataFrame
    pass

def process_competitors(data):
    # Process competitor data into a DataFrame
    pass

def process_trainers(data):
    # Process trainer data into a DataFrame
    pass

def process_selections(data):
    # Process selection data into a DataFrame
    pass
The Lambda function:
- Fetches data from the racing API
- Processes the raw data into structured DataFrames
- Organizes the data by entity type (meetings, events, competitors, etc.)
- Uploads each entity to S3 with year/month/day partitioning
- Maintains data lineage through the partitioning structure
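As a quick sanity check on this layout, here is a minimal sketch of reading one day's data back out of the raw tier with boto3 and pandas. The bucket name and key pattern match the Lambda above; the load_raw_table helper itself is purely illustrative and not part of the pipeline:

import boto3
import pandas as pd

def load_raw_table(table_name, year, month, day):
    """Illustrative helper: load one entity's JSON file for a given day from the raw tier."""
    s3 = boto3.client('s3')
    key = f"raw/racing/year={year}/month={month}/day={day}/{table_name}.json"
    obj = s3.get_object(Bucket='gambling-model-data-lake', Key=key)
    # The collection Lambda wrote records-oriented JSON, so read it back the same way
    return pd.read_json(obj['Body'], orient='records')

# Example: all selections collected on 18 April 2025
selections = load_raw_table('selections', '2025', '04', '18')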
Betting Market Data Collection
In addition to race results and environmental data, we also collect historical betting market data from Betfair, the world's largest betting exchange. Betfair provides historical data downloads through their website, offering a comprehensive view of market movements and odds changes.
The data comes in Betfair's PRO format, a detailed dataset of odds movements and volumes for each betting market. Instead of processing these files locally, we've implemented a cloud-native approach using AWS services. This ensures our data never leaves AWS datacenters and scales automatically with our needs.
Our implementation uses a Lambda function that processes new data as it arrives:
import os
import tarfile
import tempfile

import boto3

def lambda_handler(event, context):
    """
    Lambda function triggered when new tar files are uploaded to the landing zone.
    Extracts the contents and organizes them in the data lake.
    """
    s3 = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Only process files from the landing zone
    if not key.startswith('landing/betfair-pro/'):
        return

    with tempfile.TemporaryDirectory() as temp_dir:
        # Download the tar file from landing zone
        local_path = os.path.join(temp_dir, 'archive.tar')
        s3.download_file(bucket, key, local_path)

        # Extract the PRO data archive
        with tarfile.open(local_path, 'r:*') as tar:
            tar.extractall(path=temp_dir)

        # Process and upload the extracted .bz2 files
        pro_dir = os.path.join(temp_dir, 'PRO')
        if os.path.exists(pro_dir):
            for root, _, files in os.walk(pro_dir):
                for file in files:
                    if file.endswith('.bz2'):
                        # Maintain the original directory structure in S3
                        relative_path = os.path.relpath(root, pro_dir)
                        s3_key = f"raw/betfair-pro/{relative_path}/{file}"

                        # Upload the .bz2 file directly to the raw tier
                        with open(os.path.join(root, file), 'rb') as f:
                            s3.upload_fileobj(f, bucket, s3_key)
This cloud-native architecture provides several key advantages over local processing:
- Landing Zone Pattern:
  - Dedicated S3 prefix (landing/betfair-pro/) for new tar files
  - Lambda function triggered automatically on upload (see the wiring sketch after this list)
  - Clear separation between incoming data and processed data
  - Easy monitoring of new data arrivals
- Serverless Processing:
  - No need for local compute resources
  - Automatic scaling based on upload volume
  - Pay only for actual processing time
  - Built-in retry mechanisms and error handling
- Data Locality:
  - Data never leaves AWS datacenters
  - Reduced network transfer costs
  - Lower latency for processing
  - Better security through reduced data movement
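The trigger that makes this automatic is simply an S3 event notification pointing the landing prefix at the Lambda. In practice this wiring would sit in the Terraform alongside the bucket; the boto3 sketch below just shows the shape of the configuration, with a placeholder Lambda ARN:

import boto3

s3 = boto3.client('s3')

# Placeholder ARN for the extraction Lambda; the real value would come from Terraform outputs.
# Note that S3 also needs lambda:InvokeFunction permission on this function.
lambda_arn = 'arn:aws:lambda:eu-west-2:123456789012:function:betfair-pro-extract'

# Invoke the Lambda whenever a new object lands under landing/betfair-pro/
s3.put_bucket_notification_configuration(
    Bucket='gambling-model-data-lake',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': lambda_arn,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'landing/betfair-pro/'}
            ]}}
        }]
    }
)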
The resulting data lake structure is organized as follows:
s3://gambling-model-data-lake/
├── landing/
│   └── betfair-pro/            # New tar files arrive here
│       └── 2024-04-18.tar
└── raw/
    └── betfair-pro/            # Extracted .bz2 files
        ├── 2024/
        │   ├── 04/
        │   │   └── 18/
        │   │       └── market_data.bz2
        │   └── ...
        └── ...
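With that in place, backfilling a historical archive is just an upload to the landing prefix; the file name here is the example from the tree above:

import boto3

s3 = boto3.client('s3')

# Uploading the archive to the landing prefix triggers the extraction Lambda
s3.upload_file('2024-04-18.tar', 'gambling-model-data-lake',
               'landing/betfair-pro/2024-04-18.tar')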
This structure supports our market data processing needs in several ways:
- Exchange-Based Data: Unlike traditional bookmakers, Betfair operates as an exchange where:
  - Users can both back (bet for) and lay (bet against) outcomes
  - Odds are determined by market participants
  - We can see the actual volume of money at each price point
  - The data reflects true market sentiment rather than bookmaker margins
- Raw Data Structure:
  - Compressed .bz2 files containing market movements
  - Each file represents a specific betting market
  - Data includes odds changes, market liquidity, and betting volumes
  - Files are organized by date in the data lake
- Batch Processing:
  - Historical archives are uploaded to the landing zone in bulk and extracted automatically by the Lambda above
  - Files are partitioned by year/month/day for efficient querying
  - Parallel uploads optimize the ingestion process
  - Consistent naming and organization enable easy tracking
- Future Real-time Collection:
  - In a later article, we'll implement a Lambda architecture
  - This will enable real-time market data collection
  - Combining batch and real-time processing for comprehensive coverage
  - Ensuring we capture both historical patterns and current market dynamics
The raw market data will be processed through our data lake tiers:
- Raw Tier: Original .bz2 files with market movements (see the parsing sketch after this list)
- Structured Tier: Extracted and normalized market data
- Cleaned Tier: Validated and processed market information
- ML Tier: Features derived from market movements and patterns
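As a small preview of what the structured tier has to work with, here is a minimal sketch of reading one raw market file. It assumes each .bz2 file is newline-delimited JSON of Betfair market change messages, which is how the historical exchange data is distributed; everything beyond basic parsing is left for the next article:

import bz2
import json

def read_market_file(path):
    """Illustrative: decompress one market file and parse each line as a JSON message."""
    messages = []
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                messages.append(json.loads(line))
    return messages

# Example: count the messages in one extracted market file
messages = read_market_file('market_data.bz2')
print(f"{len(messages)} market change messages")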
This market data, combined with race results and environmental factors, will help us:
- Identify market inefficiencies
- Track odds movements
- Analyze betting patterns
- Detect potential arbitrage opportunities
Next Steps
In the next article, we'll explore how to process this raw data into our structured and cleaned tiers, preparing it for model training. We'll cover:
- Data quality checks and cleaning procedures
- Feature engineering techniques for combining different data sources
- Time-series processing for historical data analysis
- Creating training datasets for the machine learning model
Stay tuned for the next part of this series, where we'll dive into data processing and feature engineering for our gambling model.