Learning Reinforcement Learning Through Ad Bidding
This tutorial teaches reinforcement learning (RL) from first principles, using Promovolve’s bid optimization system as the running example. Every concept is grounded in real, production Scala code — no toy problems, no gym environments, no Python notebooks.
By the end, you’ll understand:
- What reinforcement learning is and when to use it
- How to model a real-world problem as an RL task (states, actions, rewards)
- How neural networks learn to approximate value functions
- How DQN and Double DQN work, and why they matter
- How experience replay and target networks stabilize training
- How to deploy RL in a production system that handles real money
Why ad bidding is a great RL problem
Most RL tutorials use video games or grid worlds. These are fine for building intuition, but they don’t teach you how RL behaves when:
- The environment is non-stationary — traffic patterns change by hour, day, and season
- Episodes have real economic consequences — overspending wastes advertiser money, underspending leaves revenue on the table
- Observations are noisy and delayed — you only see aggregated metrics every 15 minutes, not instant per-action feedback
- The state space is continuous — not a grid, but a blend of rates, fractions, and normalized signals
- Multiple agents interact — hundreds of campaigns compete simultaneously, each learning its own policy
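To make the "continuous state" point concrete, here is a hedged sketch of what such a state might look like in Scala. The field names are illustrative assumptions, not Promovolve's actual schema — the real agent's state is covered later in the tutorial.

```scala
// Hypothetical bid-optimization state: a blend of rates, fractions, and
// normalized signals rather than discrete grid cells.
// Field names are illustrative assumptions, not Promovolve's actual schema.
final case class BidState(
  winRate: Double,       // fraction of auctions won in the last window, 0.0-1.0
  spendFraction: Double, // budget spent so far / daily budget, 0.0-1.0
  hourOfDay: Double,     // hour normalized to 0.0-1.0
  avgCpcNorm: Double     // cost per click relative to a campaign baseline
) {
  // Flatten into the vector a value network would consume as input.
  def toVector: Array[Double] =
    Array(winRate, spendFraction, hourOfDay, avgCpcNorm)
}

object BidStateDemo {
  def main(args: Array[String]): Unit = {
    val s = BidState(winRate = 0.42, spendFraction = 0.3,
                     hourOfDay = 14.0 / 24.0, avgCpcNorm = 0.8)
    println(s.toVector.mkString(", "))
  }
}
```

Because every component is a real number, a Q-table over discrete states cannot enumerate this space — which is exactly why the tutorial moves from tables to function approximation.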
Promovolve’s bid optimization agent faces all of these. It’s a ~400-line pure Scala implementation with no ML framework dependencies — you can read every line of the forward pass, backpropagation, and training loop.
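To give a flavor of what "no ML framework dependencies" means in practice, here is a minimal sketch of a dense-layer forward pass with ReLU in plain Scala. This is an illustrative toy, not Promovolve's actual implementation; the real forward pass is walked through line by line in the neural network chapter.

```scala
// A single dense layer computed by hand: y = relu(W x + b).
// Plain Scala, no ML libraries — an illustrative sketch, not the
// production code discussed in this tutorial.
object TinyForwardPass {
  def dense(weights: Array[Array[Double]],
            bias: Array[Double],
            x: Array[Double]): Array[Double] =
    weights.zip(bias).map { case (row, b) =>
      // Dot product of one weight row with the input, plus the bias term.
      val z = row.zip(x).map { case (w, xi) => w * xi }.sum + b
      // ReLU activation: clamp negative pre-activations to zero.
      math.max(0.0, z)
    }

  def main(args: Array[String]): Unit = {
    val w = Array(Array(0.5, -0.25), Array(1.0, 1.0))
    val b = Array(0.1, -0.2)
    println(dense(w, b, Array(1.0, 2.0)).mkString(", "))
  }
}
```

Stacking a few calls like this one, plus the matching gradient computation, is essentially all a from-scratch backpropagation implementation needs.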
Prerequisites
You should be comfortable with:
- Basic programming concepts (loops, arrays, functions)
- High school math (derivatives, basic probability)

No prior ML/RL knowledge is required — we build everything from scratch.
Chapters
- The Problem: Why Bid Optimization Needs RL
- RL Fundamentals: Agent, Environment, Reward
- Building a Neural Network From Scratch
- From Q-Tables to Deep Q-Networks
- Experience Replay: Learning From the Past
- Double DQN: Fixing Overestimation
- Putting It Together: The BidOptimizationAgent
- Training in Production: Episodes, Persistence, and Day Resets