Why Promovolve?
The Problem: Programmatic Advertising Wasn’t Built for Publishers
The modern programmatic advertising stack — SSPs, DSPs, exchanges, DMPs — was designed to solve a problem for large advertisers: how to reach the right user across millions of websites in real time. It succeeded spectacularly at that. But in doing so, it created a system that works against the interests of most publishers.
Publishers lost control of their own inventory
When a user visits a publisher’s site, the ad decision happens somewhere else entirely. The publisher’s SSP fires a bid request to an exchange, which broadcasts it to dozens of DSPs, which consult their user profiles, run their bidding algorithms, and return a response — all within 100 milliseconds. The publisher never sees the bids. They never choose which ad runs. They receive a creative URL and a clearing price, and they serve it.
This architecture optimizes for advertiser reach, not publisher value. The publisher’s content — the reason the user is there — is reduced to a signal in someone else’s targeting model. A carefully researched article about travel in Kyoto and a clickbait slideshow about celebrity gossip are, to the exchange, just two different URLs carrying a user with a cookie.
Ads became something people hate
Nobody hates ads in a travel magazine. But people install ad blockers, pay for premium subscriptions to avoid ads, and describe web advertising as the worst part of the internet. What changed?
The difference isn’t the existence of ads — it’s what they became. A magazine ad for hiking boots next to a trail guide feels natural. A web ad for the shoes you looked at yesterday, following you to an unrelated news article, feels invasive. The first is a recommendation in context. The second is surveillance.
Traditional ad tech targets users, not content. A user who visited a car dealership website yesterday gets retargeted with car ads on every site they visit today — a cooking blog, a news site, a forum. The publisher’s content is irrelevant. The ad is chasing the user. The result is an experience that feels wrong to everyone involved: readers feel stalked, publishers lose control of what appears on their pages, and advertisers pay to annoy people in the wrong context.
This isn’t a privacy problem to be solved with consent banners and cookie policies. It’s a design problem. Web advertising chose to target people instead of content, and in doing so, it turned ads from something readers could tolerate — even appreciate — into something they actively resist.
Small and medium publishers are left behind
The programmatic stack has enormous fixed costs. Integrating with an SSP requires technical sophistication. Meeting exchange quality thresholds requires minimum traffic volumes. The revenue share stacks up: the exchange takes a cut, the SSP takes a cut, the DSP takes a cut, the DMP takes a cut. Each intermediary skims from the transaction, and the publisher receives whatever is left.
For niche publishers — a Japanese travel blog, a specialized cooking site, a local news outlet — the economics don’t work. Their traffic is too small for exchanges to care about. Their content is too specific for broad behavioral targeting. Their readers are too privacy-conscious for invasive tracking. These publishers either accept pennies from bottom-tier ad networks, plaster their sites with low-quality ads, or give up on monetization entirely.
The latency tax
Real-time bidding imposes a latency floor on every page load. The browser must wait for the ad request, the exchange round-trip, and the creative download before the ad slot renders. On a fast site, the ads are the slowest thing on the page. Header bidding — the industry’s attempt to improve publisher yield — made this worse by running multiple auctions sequentially or in parallel, each adding network round-trips.
Users notice. Ad-blocker adoption correlates directly with page load degradation caused by ad tech. Publishers who care about user experience are penalized for participating in the system that’s supposed to monetize that experience.
The Opportunity: What Magazines Got Right
Pick up any well-produced magazine — a travel magazine, a cooking publication, an architecture journal. Look at the ads. They’re relevant. A travel magazine shows ads for airlines, luggage, and hotels. A cooking magazine shows kitchen equipment, specialty ingredients, and culinary schools. The ads feel like they belong there. They complement the content instead of interrupting it.
And people actually looked at them. When the page wasn’t cluttered with ads, when they weren’t fighting for attention with pop-ups and auto-playing videos, when there were just a few well-placed ads that matched what you were reading — you noticed them. Sometimes you tore out a page and kept it. The ads had value because they were relevant, restrained, and respectful of the reader’s attention.
Nobody found these ads creepy. Nobody wondered how the magazine knew they were interested in travel. The answer was obvious: you bought a travel magazine. The content was the signal.
This model worked for over a century. Advertisers paid a premium for placement in publications whose readers matched their audience. Publishers curated which ads appeared, maintaining quality and relevance. Readers accepted ads as part of the experience — sometimes even valued them — because the ads were for things they actually cared about, presented in a way that respected the editorial context.
Then the internet threw it all away.
Instead of matching ads to content, the web advertising industry decided to match ads to users. Instead of asking “what is this person reading?” it asks “who is this person and what have they done?” The content of the page became irrelevant. The reader became a target. And the entire experience — for publishers, readers, and even most advertisers — got worse.
Promovolve is an attempt to get back what magazines had. Relevant ads, matched to content, with no tracking, no surveillance, and no degradation of the reading experience.
Why now?
Content-based advertising isn’t a new idea, but doing it automatically at web scale was impractical until recently. Classifying page content into advertising categories required either manual curation (expensive) or primitive keyword matching (inaccurate).
LLMs changed this. A single API call to Gemini Flash costs a fraction of a cent and returns IAB Content Taxonomy categories with confidence scores — accurately classifying a page about hiking in the Japanese Alps into Travel, Outdoor Recreation, and Asia destinations. This wasn’t economically viable five years ago. It is now.
What Promovolve Does Differently
Promovolve rebuilds ad serving from the publisher’s perspective, using content as the sole targeting signal.
Auctions happen before users arrive
Traditional systems run an auction on every page load because they need to evaluate the user in real time. Promovolve doesn’t need to — the content doesn’t change between page loads. So the auction runs when content is published or updated (via scheduled crawl), and the results are cached in a replicated in-memory index. When a user arrives, the ad is already chosen. Serve latency drops from 50-200ms to under 1ms.
Multiple candidates, not a single winner
Instead of picking one winner per auction, Promovolve shortlists multiple candidates per ad slot and caches them all. At serve time, Thompson Sampling selects among them, balancing exploration (trying new creatives to learn their click-through rate) against exploitation (serving the creative that performs best). The system continuously learns which ads work best on which content, without A/B test infrastructure or manual optimization.
Each campaign learns to bid
A per-campaign reinforcement learning agent (Double DQN) observes its own performance metrics every 15 minutes — click-through rate, win rate, budget utilization, pacing — and adjusts its bid multiplier. Over days, each campaign learns its own bidding strategy: aggressive campaigns that convert well bid higher; campaigns burning budget too fast learn to pull back. No manual bid management required.
Budget pacing adapts to reality
A PI controller with self-tuning gains, traffic shape learning, and oscillation detection smooths budget delivery across the day. It learns that traffic peaks at 10am and dips at 3pm, that weekends have a different shape than weekdays, and adjusts automatically. Publishers see steady ad delivery instead of budgets that exhaust by noon and leave empty slots all afternoon.
No user tracking — because it’s not needed
Promovolve stores no user profiles, sets no tracking cookies, and collects no cross-site identifiers. Targeting is based entirely on the content of the page being viewed. This isn’t a sacrifice for privacy compliance — it’s a consequence of the design. When you match ads to content, there’s nothing about the user you need to know. The content tells you everything.
Who Promovolve Is For
Publishers
Promovolve is for publishers who:
- Own their relationship with readers and won’t compromise it with invasive tracking
- Create quality content in specific verticals where content-based targeting is naturally strong
- Want sub-millisecond ad serving that doesn’t degrade their site performance
- Prefer simplicity over the operational complexity of SSP/exchange integrations
- Need fair economics without the intermediary tax of the programmatic supply chain
Advertisers — from local businesses to global brands
The traditional programmatic stack has a minimum viable scale. Setting up DSP campaigns, managing bid strategies, and meeting exchange minimums requires budgets and expertise that exclude most businesses. The vast majority of businesses in the world — the local restaurant, the neighborhood bookshop, the regional tour operator, the community event organizer — cannot participate.
Promovolve lowers that minimum to zero. An advertiser is anyone with an image and a landing page. That could be:
- A local hiking gear shop placing an ad on a regional outdoor recreation blog
- A community festival announcing dates on a local news site
- A cooking class promoting on a food blog in their city
- A small hotel in Kyoto reaching readers of a travel article about their neighborhood
- A global brand running a campaign across a network of niche publishers
There’s no DSP to integrate with. No bid strategy to configure manually — the RL agent learns the right price. No user profiles to buy. Just: “here’s my ad, here’s my budget, here’s what my product is.” The advertiser picks an ad product category (e.g., “Travel” or “Kitchen Equipment”), and the system automatically derives which content categories match — using the official IAB mapping between Ad Product Taxonomy 2.0 and Content Taxonomy 2.1. The system handles the rest.
This is how magazine advertising worked. A local restaurant could buy a quarter-page in a neighborhood magazine. The scale matched the business. Promovolve brings that accessibility to the web.
Advertising Agencies
Advertising agencies don’t own publisher platforms — they manage campaigns on behalf of their clients and place ads across publishers’ sites. In the traditional programmatic world, this means agencies must navigate a maze of DSP contracts, exchange seat IDs, and platform-specific bid management tools, each taking a cut along the way.
With Promovolve, agencies can own the ad-serving infrastructure itself. An agency can run its own Promovolve instance, build a network of publisher relationships, and manage all of its clients’ campaigns through a system it controls end-to-end. No DSP middleman. No exchange fees. No dependency on someone else’s platform. The agency becomes the platform.
The RL agent handles bid optimization per campaign, so agencies spend their time on strategy and creative rather than manual bid tuning. It’s closer to the magazine ad sales model agencies grew up with — pick the publications, choose the placements, manage client budgets — except now the agency owns the technology that makes it all work.
How This Book Is Organized
The rest of this book documents every algorithmic detail, derived from the source code:
- Architecture — Pekko cluster topology, entity hierarchy, and data flow
- Auction — The five phases of the periodic batch auction
- Serving — Thompson Sampling, cold start, and fair selection at serve time
- Pacing — PI control, self-tuning, traffic shape learning
- RL — The DQN agent’s state space, reward function, and training loop
- Distributed State — ServeIndex replication and consistency
- Comparison — Point-by-point mapping against traditional SSP/DSP/exchange patterns
Each chapter is self-contained. All formulas, thresholds, and constants come directly from the Scala source.
Chapter 1: A Publisher Joins
Yuki runs a travel blog from Kyoto. She writes about temples, hiking trails, seasonal festivals, and hidden restaurants. Her readers are people planning trips to Kansai — engaged, curious, spending real time on each article. She publishes two or three articles a week, each one carefully researched.
She wants to monetize her site without ruining it. No pop-ups. No auto-playing video ads. No “you looked at sneakers yesterday” retargeting that has nothing to do with her content. She wants ads that feel like they belong — a ryokan in Arashiyama, a hiking gear shop, a regional train pass.
She signs up with Promovolve and adds a small JavaScript snippet to her site. It creates two ad slots: a 300×250 rectangle in the sidebar and a 728×90 banner between article sections.
That’s all she does on day one. The system takes over from here.
The Crawler Wakes Up
At 2am, Promovolve’s crawler visits Yuki’s site. It’s a Playwright-based headless browser — it renders JavaScript, scrolls the page, and extracts the visible text content. It also detects the ad slots Yuki placed and records their sizes and positions.
The crawler follows links from the homepage to recent articles, up to a configurable depth (default: 2 levels). For Yuki’s blog, that means the homepage, the article listing page, and each individual article.
For each page, the crawler captures the visible text — the article body, headings, captions. This raw text is what the system will use to understand what the page is about.
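As a rough illustration, the render-and-extract step might look like the sketch below, using Playwright's Java API from Scala. This is a simplified sketch rather than the actual crawler: the object name and body selector are assumptions, and link-following and slot detection are omitted.

```scala
import com.microsoft.playwright.{BrowserType, Playwright}

// Minimal sketch of rendering a page and capturing its visible text.
// The real crawler also follows links to the configured depth and records ad slots.
object CrawlSketch {
  def extractVisibleText(url: String): String = {
    val playwright = Playwright.create()
    try {
      val browser = playwright.chromium().launch(
        new BrowserType.LaunchOptions().setHeadless(true))
      val page = browser.newPage()
      page.navigate(url)
      // innerText approximates what a reader can see: body text, headings, captions
      val text = page.innerText("body")
      browser.close()
      text
    } finally playwright.close()
  }
}
```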
The LLM Classifies the Content
The crawler’s text goes to an LLM — by default, Google’s Gemini Flash, chosen for cost (a fraction of a cent per call). The prompt asks the model to classify the content into IAB Content Taxonomy 2.1 categories, which is the ad industry’s standard vocabulary for describing what a page is about.
For Yuki’s article “Autumn Foliage Hikes in Eastern Kyoto,” the LLM returns:
{
"selected_taxonomy_ids": [
{"id": "596", "confidence": 0.95},
{"id": "564", "confidence": 0.85},
{"id": "483", "confidence": 0.70}
]
}
Translated: Travel (596) with high confidence, Hiking/Camping (564), and East Asian Culture (483). The system now knows what this page is about — in a language that advertisers understand.
If the LLM is down, the circuit breaker trips after 5 consecutive failures and stops trying for 30 seconds. The page just doesn’t get classified this crawl cycle. It’ll be picked up next time. No crash, no degradation of the serving path.
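A minimal sketch of how such a breaker can be wired with Pekko's CircuitBreaker, using the thresholds above; the client class, per-call timeout, and call signature are assumptions, not the actual code:

```scala
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.pattern.CircuitBreaker
import scala.concurrent.Future
import scala.concurrent.duration._

// Sketch: open after 5 consecutive failures, stay open for 30 seconds,
// then allow a trial call. A failed classification is simply skipped this cycle.
class ClassifierClient(system: ActorSystem) {
  import system.dispatcher

  private val breaker = new CircuitBreaker(
    system.scheduler,
    maxFailures = 5,
    callTimeout = 20.seconds,   // assumed per-call timeout
    resetTimeout = 30.seconds)

  // callGemini stands in for the real HTTP request to the LLM
  def classify(pageText: String)(callGemini: String => Future[String]): Future[String] =
    breaker.withCircuitBreaker(callGemini(pageText))
}
```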
Category Ranking: What Works on This Site?
The classification isn’t the end of the story. Not all categories perform equally on Yuki’s site. “Travel” ads might get a 4% click rate, while “East Asian Culture” ads get 0.5%. Over time, the system learns this.
Each (category, site) pair has a TaxonomyRankerEntity — a tiny actor that maintains a Beta distribution of click-through performance for that category on that specific site. Early on, with no data, all categories start with Beta(1, 1) — total ignorance. As impressions and clicks accumulate, the distributions sharpen.
The ranker uses Thompson Sampling (the same algorithm used at serve time — more on that later) to assign weights to each category. Categories with strong proven performance get high weights. Categories with little data get variable weights — sometimes high (exploration), sometimes low.
The top 3 categories by weight are forwarded to the auction. For Yuki’s site with some history, that might be Travel (high proven CTR), Hiking/Camping (decent CTR), and a newer category like Food & Drink (uncertain, worth exploring).
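A minimal sketch of that ranking step, with the Beta draw done via Apache Commons Math (the case class and raw counters are illustrative; the real entity keeps decayed, time-weighted counts):

```scala
import org.apache.commons.math3.distribution.BetaDistribution

// Sketch: Thompson Sampling over per-(category, site) click statistics.
object CategoryRankingSketch {
  // Beta(1 + clicks, 1 + non-clicks); with no data this is the uniform Beta(1, 1)
  final case class CategoryStats(categoryId: String, clicks: Long, impressions: Long) {
    def sampleWeight(): Double =
      new BetaDistribution(1.0 + clicks, 1.0 + (impressions - clicks)).sample()
  }

  def rankCategories(stats: Seq[CategoryStats], top: Int = 3): Seq[String] =
    stats
      .map(s => s.categoryId -> s.sampleWeight())
      .sortBy { case (_, weight) => -weight }
      .take(top)
      .map(_._1)
}
```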
The Page Is Ready for Auction
At this point, Promovolve knows:
- What the page is about: Travel, Hiking, East Asian Culture (from the LLM)
- Which categories perform well on this site: Travel and Hiking (from the ranker)
- What ad slots are available: 300×250 sidebar, 728×90 banner (from the crawler)
This information is stored in the AuctioneerEntity for Yuki’s site — an actor sharded by site ID that remembers the last classification for each URL. The page is now ready for advertisers to bid on.
Nothing has happened on the user-facing side yet. No reader has been slowed down. No ad has been shown. The auction infrastructure was built in the background, on a schedule, while Yuki and her readers were asleep.
The next chapter: an advertiser discovers Yuki’s site and wants to place an ad.
Technical deep dives: Page Classification · Category Ranking · System Architecture
Chapter 2: An Advertiser Joins
Takeshi runs a small ryokan in Hakone. It’s a family business — 8 rooms, a natural hot spring, views of the mountains. His guests are mostly travelers who found him through word of mouth or travel blogs. He wants to reach more of those readers.
He’s tried Google Ads. The interface was overwhelming. Keywords, bid strategies, quality scores, ad groups, campaign types — he spent more time learning the system than running the business. And the ads followed people around the internet: someone who Googled “Hakone ryokan” once would see his ad on cooking websites and news portals. That felt wrong.
With Promovolve, Takeshi uploads a photo of his ryokan (a 300×250 JPEG), enters a landing page URL, sets a daily budget of $20, a maximum CPM of $5, and selects his ad product category: Travel.
That’s it. No keywords. No audience targeting. No bid strategy to configure. Promovolve automatically figures out which content categories match his product — articles about destinations, hiking, cultural tourism — using the official IAB mapping between ad product and content taxonomies. His ad will appear next to those articles, the exact context where someone would be interested in a ryokan.
What Happens Behind the Scenes
When Takeshi creates his campaign, several things happen in the cluster:
A CampaignEntity is born. An actor, sharded by advertiser ID and campaign ID, springs to life. It holds Takeshi’s campaign state: max CPM ($5), daily budget ($20), creatives, status (active), and a fresh RL agent. The agent’s bid multiplier starts at 1.0 — meaning the first bids will be at the full $5 CPM.
The creative is stored. The ryokan photo is uploaded to R2 (Cloudflare’s S3-compatible storage), hashed by SHA-256 for deduplication, and recorded in the creative repository with its dimensions, MIME type, and Takeshi’s landing page URL.
Categories are derived. Takeshi chose “Travel” as his ad product category (IAB Ad Product Taxonomy 2.0). The system calls ContentToAdProductMapping.getContentForAdProduct() to derive the matching content categories (IAB Content Taxonomy 2.1) — a set of numeric IDs representing destinations, outdoor recreation, cultural tourism, and other content topics that match a travel product. If no direct mapping exists, it walks up the taxonomy’s parent chain until it finds one. Takeshi doesn’t need to know any of this — he just said “Travel.”
The CampaignDirectory is notified. A cluster singleton maintains a reverse index: category → set of campaigns. It registers Takeshi’s campaign under each of its derived content categories, then fans out the update to CategoryBidderEntity shards via CampaignDistributor (8 workers). Now, whenever a page is classified into one of those categories, the auction knows to ask Takeshi’s campaign for a bid.
The RL agent initializes. A small neural network (8→64→64→5, about 4,800 parameters) is created inside the CampaignEntity. It knows nothing yet — its weights are randomly initialized, its epsilon is 1.0 (fully random exploration). It won’t start learning until impressions flow in.
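The "Categories are derived" step above roughly amounts to a lookup with a parent-chain fallback. A minimal sketch, with placeholder tables standing in for the real IAB mapping data and illustrative names:

```scala
// Sketch: derive content taxonomy IDs for an ad product, falling back to the
// product's ancestors when no direct mapping exists. The maps are placeholders;
// the real system loads the official IAB Ad Product 2.0 -> Content 2.1 mapping.
object ContentMappingSketch {
  val adProductParent: Map[Int, Int]       = Map.empty  // child -> parent in the ad product taxonomy
  val productToContent: Map[Int, Set[Int]] = Map.empty  // ad product id -> content taxonomy ids

  @annotation.tailrec
  def contentForAdProduct(adProductId: Int): Set[Int] =
    productToContent.get(adProductId) match {
      case Some(contentIds) => contentIds
      case None =>
        adProductParent.get(adProductId) match {
          case Some(parent) => contentForAdProduct(parent)  // walk up the parent chain
          case None         => Set.empty                    // root reached, no mapping found
        }
    }
}
```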
Publisher Approval
Here’s something that doesn’t exist in traditional programmatic advertising: the publisher gets to say no.
Before Takeshi’s ryokan ad can appear on Yuki’s travel blog, Yuki reviews it. She sees the creative, the landing page, and the advertiser’s information in her publisher dashboard. She can:
- Approve — the creative enters the ServeIndex and can be shown to readers
- Reject — the creative is removed and the next candidate moves up
- Flag — mark for review later
This approval workflow is why Promovolve runs multi-candidate auctions. If the auction only picked one winner and the publisher rejected it, the slot would be empty. With multiple candidates, rejecting one just promotes the next.
Yuki approves the ryokan ad. It fits her site perfectly.
Ready to Bid
Takeshi’s campaign is now in the system:
- Creative uploaded and approved for Yuki’s site
- Ad product category: Travel (content categories derived automatically)
- Budget: $20/day, max CPM: $5
- RL agent: initialized, bid multiplier = 1.0
The next time the AuctioneerEntity for Yuki’s site runs an auction — either from a fresh crawl or the 5-minute re-auction timer — Takeshi’s campaign will be among the bidders.
But Takeshi isn’t the only advertiser. A regional JR rail pass campaign is also targeting Travel with a $8 CPM. A hiking gear company targets Hiking/Camping at $4 CPM. A new cooking class in Kyoto targets Food & Drink at $3 CPM.
How does the system decide who gets which slot? That’s the auction.
Technical deep dives: Entity Hierarchy · Bid Collection · Candidate Shortlisting
Chapter 3: The First Auction
It’s 2:07am. The crawler has just finished classifying Yuki’s latest article — “Autumn Foliage Hikes in Eastern Kyoto.” The AuctioneerEntity for Yuki’s site receives the classification: Travel (0.95), Hiking/Camping (0.85), East Asian Culture (0.70).
Three ad slots need filling. Four campaigns are in the system. The auction begins.
Phase 1: Category Ranking
The AuctioneerEntity asks the TaxonomyRankerEntity for each category: “What’s your weight for this site?”
Each ranker samples from its Beta distribution — Thompson Sampling at the category level:
| Category | Distribution | Sample | Rank |
|---|---|---|---|
| Travel | Beta(12, 88) — proven performer | 0.14 | 2nd |
| Hiking/Camping | Beta(3, 47) — decent, some data | 0.08 | 3rd |
| East Asian Culture | Beta(1, 1) — brand new, no data | 0.61 | 1st (exploration!) |
East Asian Culture ranks first despite having no data — the uniform Beta(1, 1) distribution happened to sample high. This is exploration: the system will try this category to learn if it works on Yuki’s site. Most of the time, the established categories win. Occasionally, a new one gets a chance.
The top 3 categories advance to bidding.
Phase 2: Bid Collection
For each ranked category, the AuctioneerEntity asks the CategoryBidderEntity: “Who wants to bid on Travel for this site?”
The CategoryBidderEntity fans out to all campaigns registered for that category. Each CampaignEntity evaluates whether it should bid:
Takeshi’s Ryokan (Travel, Hiking): Budget remaining? Yes ($20). Campaign active? Yes. Creative approved for this site? Yes. Bid: $5.00 × 1.0 (RL multiplier) = $5.00 CPM.
JR Rail Pass (Travel): Budget remaining? Yes. Bid: $8.00 × 1.0 = $8.00 CPM.
Hiking Gear Co (Hiking): Budget remaining? Yes. Bid: $4.00 × 1.0 = $4.00 CPM.
Kyoto Cooking Class (Food & Drink): This campaign isn’t registered for Travel, Hiking, or East Asian Culture. It doesn’t bid.
Three bids collected. All above the floor price ($0.50). All pass eligibility: active status, positive budget, creative size matches at least one slot.
Phase 3: Fair Candidate Selection
Now the system has to assign candidates to slots. This is where Promovolve diverges from traditional auctions.
A traditional auction would give all three slots to JR Rail Pass — they bid highest. But that’s terrible for everyone: the publisher shows the same ad three times (bad UX), the other advertisers never get a chance (no exploration), and the system never learns if Takeshi’s ryokan ad might actually get more clicks.
Promovolve uses fair selection: each campaign gets one slot before any campaign gets a second.
Slot 1 (banner): JR Rail Pass — $8.00 CPM (highest bidder, first pick)
Slot 2 (sidebar): Takeshi's Ryokan — $5.00 CPM (second highest, one slot each first)
Slot 3 (sidebar): Hiking Gear Co — $4.00 CPM (third)
Each slot gets multiple candidates (not just one), ordered by CPM but guaranteed to include at least one creative from each bidding campaign. This candidate list is what gets cached for serve-time selection.
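A minimal sketch of the fairness rule: one slot per campaign before any campaign gets a second, highest CPM first. It simplifies the real shortlisting, which keeps several candidates per slot and checks creative sizes, but it shows the ordering:

```scala
// Sketch: assign a primary candidate to each slot so that every bidding campaign
// appears once before any campaign appears twice; order by CPM within each round.
object FairSelectionSketch {
  final case class Bid(campaignId: String, creativeId: String, cpm: BigDecimal)

  def assignSlots(slots: Seq[String], bids: Seq[Bid]): Seq[(String, Bid)] = {
    val perCampaign = bids
      .groupBy(_.campaignId).values
      .map(_.maxBy(_.cpm))          // best creative per campaign
      .toSeq
      .sortBy(b => -b.cpm)          // highest bidder picks first

    if (perCampaign.isEmpty) Seq.empty
    else {
      val roundRobin = Iterator.continually(perCampaign).flatten
      slots.zip(roundRobin.take(slots.size).toSeq)
    }
  }
}
```

With the three bids above, this yields exactly the assignment shown: JR Rail Pass, then Takeshi’s ryokan, then Hiking Gear Co.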
Phase 4: Caching in the ServeIndex
The auction results are written to the ServeIndex — a replicated in-memory store backed by Pekko’s Distributed Data (DData).
Each slot gets an entry:
Key: "yuki-site|banner-top|bucket-12"
Value: [
{creative: jrpass-ad, cpm: 8.00, campaign: jrpass, advertiser: jr-west, ...},
{creative: ryokan-ad, cpm: 5.00, campaign: takeshi, advertiser: takeshi, ...}
]
Key: "yuki-site|sidebar-1|bucket-7"
Value: [
{creative: ryokan-ad, cpm: 5.00, campaign: takeshi, advertiser: takeshi, ...},
{creative: hiking-boots, cpm: 4.00, campaign: hikegear, advertiser: hikegear-co, ...}
]
The write is WriteLocal — it completes instantly on the node running the AuctioneerEntity. Within 2 seconds, gossip propagates the data to every other node in the cluster. Every API node now has these candidates in local memory.
The candidates have a TTL of 120 minutes. If no re-auction refreshes them, they expire and the slot goes empty. But re-auctions run every 5 minutes, so in practice candidates are always fresh.
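A minimal sketch of the shape of one cached entry and its expiry check (the field names are illustrative; the real store is a Pekko Distributed Data map keyed as shown above):

```scala
import java.time.Instant
import scala.concurrent.duration._

// Sketch: one ServeIndex entry, a candidate list for a (site, slot, bucket) key, plus a TTL.
final case class Candidate(creativeId: String, campaignId: String, advertiserId: String, cpm: BigDecimal)

final case class ServeIndexEntry(
    key: String,                      // e.g. "yuki-site|sidebar-1|bucket-7"
    candidates: Vector[Candidate],
    writtenAt: Instant,
    ttl: FiniteDuration = 120.minutes) {

  // If no re-auction has refreshed the entry within the TTL, it no longer serves
  def isExpired(now: Instant): Boolean =
    now.isAfter(writtenAt.plusSeconds(ttl.toSeconds))
}
```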
What Just Happened
In about 4 seconds of background processing:
- An LLM classified the page content into advertising categories
- Thompson Sampling ranked those categories by historical performance on this site
- Eligible campaigns placed bids based on their max CPM and RL multiplier
- Fair selection ensured each campaign got representation
- Multiple candidates per slot were cached in replicated memory across the cluster
No reader was involved. No page load was delayed. The entire auction happened in the background, and the results are sitting in memory, waiting.
Now a reader arrives.
Technical deep dives: Periodic Batch Auction · Why Multi-Candidate? · ServeIndex Caching
Chapter 4: A Reader Arrives
It’s 10:15am. A traveler in Singapore is planning a November trip to Kyoto and finds Yuki’s article through a search engine: “Autumn Foliage Hikes in Eastern Kyoto.” The page loads. The article renders. And in the sidebar, an ad slot needs filling.
Yuki’s JavaScript snippet fires a request to Promovolve:
GET /v1/serve?pub=yuki-site&url=https://yukiblog.jp/autumn-hikes&slot=sidebar-1
What happens next takes less than a millisecond.
Step 1: Local DData Lookup
The API node handling this request looks up yuki-site|sidebar-1 in its local DData replica. This is a hash map lookup in the JVM’s memory — no network call, no database query. The candidates from last night’s auction (refreshed by the 5-minute re-auction cycle) are right there:
Candidates:
1. Takeshi's Ryokan — $5.00 CPM, Beta(1, 1) [new, no impressions yet]
2. Hiking Gear Co — $4.00 CPM, Beta(1, 1) [new, no impressions yet]
Both creatives were approved by Yuki. Both have budget remaining. Both are ready to serve.
Step 2: Content Recency Check
The system checks: is this page still fresh? The classification timestamp says it was crawled 8 hours ago. The recency window for Yuki’s site is 48 hours. Eight hours is well within that — the page is fresh. Proceed.
If the article were from last week and hadn’t been re-crawled, the candidates might have expired (TTL 120 minutes) and the response would be 204 No Content — an empty ad slot. This is by design: stale content doesn’t get monetized.
Step 3: The Pacing Gate
Before choosing a creative, the PI controller decides: should we serve at all?
It’s 10:15am. Takeshi’s ryokan campaign has a $20 daily budget. Based on the traffic shape the system has learned for Yuki’s site (most traffic between 8am-11am JST, another peak at 8pm), the ideal spend by 10:15am is about 35% of the budget — $7. The campaign has spent $4 so far. It’s slightly underpacing.
The PI controller computes a throttle probability: 0.12 (skip 12% of requests to stay on pace). A random number is drawn: 0.47. That’s above 0.12, so this request passes the gate.
If the campaign were overspending — say, $12 spent by 10:15am — the throttle would be higher, maybe 0.65, and more requests would be skipped. The campaign would slow down, stretching its remaining budget across the afternoon.
This happens before creative selection. If the gate rejects the request, Thompson Sampling never runs — no exploration is wasted on a throttled impression.
Step 4: Thompson Sampling
Now the system picks which creative to show. Both candidates are brand new — zero impressions, zero clicks. Their Beta distributions are both Beta(1, 1), the uniform distribution.
Thompson Sampling draws a random sample from each:
Takeshi's Ryokan: sample from Beta(1, 1) → 0.72
score = 0.72 × log(1 + 5.00) = 0.72 × 1.79 = 1.29
Hiking Gear Co: sample from Beta(1, 1) → 0.34
score = 0.34 × log(1 + 4.00) = 0.34 × 1.61 = 0.55
Takeshi’s ryokan wins this round. Not because it’s better — nobody knows yet — but because its random sample happened to be higher. Next time, the hiking gear ad might win. With Beta(1, 1) on both sides, it’s nearly a coin flip, weighted slightly by CPM.
This is pure exploration. Over the next hundred impressions, the system will learn which creative readers actually click on, and the randomness will give way to informed selection.
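A minimal sketch of that selection, matching the score formula above (the Beta draw uses Apache Commons Math here; the stats type is an illustrative stand-in for the rolling-window counters described later):

```scala
import org.apache.commons.math3.distribution.BetaDistribution

// Sketch: Thompson Sampling over the cached candidates.
// score = sampledCTR * log(1 + CPM)
object ServeSelectionSketch {
  final case class CreativeStats(creativeId: String, cpm: Double, clicks: Long, impressions: Long)

  def selectCreative(candidates: Seq[CreativeStats]): Option[CreativeStats] =
    candidates.maxByOption { c =>
      val sampledCtr =
        new BetaDistribution(1.0 + c.clicks, 1.0 + (c.impressions - c.clicks)).sample()
      sampledCtr * math.log1p(c.cpm)   // log(1 + CPM)
    }
}
```

With both candidates at Beta(1, 1), this reduces to the near coin flip described above, tilted slightly toward the higher CPM.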
Step 5: Budget Reservation
Before serving the ad, the system reserves the spend. The CampaignEntity for Takeshi’s campaign receives a TryReserve request:
- Amount: $5.00 / 1000 = $0.005 (one impression at $5 CPM)
- Budget remaining: $16.00
- Result: Reserved. Budget is now $15.995.
If the budget were exhausted, the response would be InsufficientBudget, and Thompson Sampling would try the next candidate (Hiking Gear Co). If all candidates are exhausted, the response is 204 No Content. Graceful degradation, no errors.
The reservation is recorded in the CampaignEntity’s ephemeral buffer. The RL agent notes the impression: recordImpression($0.005) and recordBidOpportunity(won: true).
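A minimal sketch of the reservation itself, stripped down to a plain class. The real logic lives inside the sharded CampaignEntity actor, which also buffers the reservation and notifies the RL agent:

```scala
// Sketch of TryReserve: deduct one impression's cost if the remaining budget allows it.
sealed trait ReserveResult
case object Reserved           extends ReserveResult
case object InsufficientBudget extends ReserveResult

final class BudgetLedger(dailyBudget: BigDecimal) {
  private var remaining: BigDecimal = dailyBudget

  // CPM is dollars per 1000 impressions, so one impression costs cpm / 1000
  def tryReserve(cpm: BigDecimal): ReserveResult = synchronized {
    val cost = cpm / 1000
    if (remaining >= cost) { remaining -= cost; Reserved }
    else InsufficientBudget
  }
}
```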
Step 6: The Response
The API returns a JSON response in under a millisecond:
{
"assetUrl": "https://cdn.promovolve.dev/ryokan-hakone-300x250.jpg",
"mime": "image/jpeg",
"width": 300,
"height": 250,
"clickUrl": "https://api.promovolve.dev/v1/click?pub=yuki-site&...",
"impUrl": "https://api.promovolve.dev/v1/imp?pub=yuki-site&...",
"creativeId": "ryokan-001",
"version": 1711090800000
}
The JavaScript snippet renders the image and fires the impression pixel (impUrl). The tracking URL is HMAC-signed — it can’t be forged or tampered with.
What the Reader Sees
A photo of a ryokan nestled in autumn mountains, next to an article about autumn hikes in Kyoto. The ad is relevant. It fits the context. The reader might not even register it as an “ad” in the intrusive sense — it feels like a recommendation.
This is what we set out to build: the magazine experience, on the web.
The reader reads the article. They notice the ryokan photo. They think, “that looks nice for our trip.” They click.
Technical deep dives: Thompson Sampling · Scoring Formula · Pacing Overview · Fair Selection
Chapter 5: The Click
The reader clicks on Takeshi’s ryokan ad. Their browser follows the click URL:
GET /v1/click?pub=yuki-site&url=...&cid=ryokan-001&tok=a8f3...&rid=01HX...
This single click sets off a cascade of learning across the system.
Click Validation
The TrackRoutes handler validates the click:
- HMAC verification: The tok parameter is checked against a signature computed from the publisher’s secret, the URL parameters, and the request ID. If anyone tampered with the URL — changed the creative ID, the campaign, or the CPM — the signature won’t match. 403 Forbidden.
- Freshness check: The b parameter encodes a time bucket (1-minute granularity). If the click URL is more than 3 minutes old, it’s rejected. This limits the replay window.
- Replay guard: The canonical URL (including the unique request ID) is checked against a sharded bloom filter. If this exact click was already recorded, it’s a duplicate. 409 Conflict.
All three pass. The click is legitimate. 204 No Content — acknowledged.
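A minimal sketch of the first two checks: an HMAC-SHA256 signature over the canonical parameters and a minute-bucket freshness comparison. Only the 1-minute bucket and the 3-minute window come from the description above; the helper names and payload format are assumptions:

```scala
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import java.nio.charset.StandardCharsets.UTF_8

// Sketch: verify the tok signature and the b time bucket of a click URL.
object ClickValidationSketch {
  private def hmacHex(secret: String, payload: String): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes(UTF_8), "HmacSHA256"))
    mac.doFinal(payload.getBytes(UTF_8)).map("%02x".format(_)).mkString
  }

  // canonicalParams would be the signed query parameters (publisher, creative, request id, bucket, ...)
  def signatureValid(publisherSecret: String, canonicalParams: String, tok: String): Boolean =
    hmacHex(publisherSecret, canonicalParams) == tok   // production code would compare in constant time

  // b is the minute bucket the URL was generated in; reject anything older than 3 minutes
  def fresh(bucket: Long, nowMillis: Long, maxAgeMinutes: Long = 3): Boolean =
    (nowMillis / 60000L) - bucket <= maxAgeMinutes
}
```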
Four Systems Learn from This Click
The LearningEventLog routes the click event to four different parts of the system, each learning at a different timescale.
1. TaxonomyRankerEntity — Category Ranking (Days)
The click event reaches the ranker for Travel on Yuki’s site:
Before: Beta(12, 88) — mean CTR: 12%
After: Beta(13, 88) — mean CTR: 12.9%
One more click in the numerator. The Travel category’s score on Yuki’s site ticks up slightly. Over weeks, this shapes which categories get prioritized in auctions for this site. A category that consistently gets clicks rises; one that doesn’t fades.
The ranker uses a 7-day half-life decay — old impressions and clicks gradually lose weight. This means the ranking adapts to seasonal changes: Travel might dominate in autumn when people plan trips, while Food & Drink rises in December when people search for holiday dining.
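The decay itself is simple: an observation’s weight halves every 7 days before it is folded into the Beta counts. A minimal sketch:

```scala
// Sketch: exponential decay with a 7-day half-life.
// A click from 7 days ago counts as half a click; from 14 days ago, a quarter.
object DecaySketch {
  def decayWeight(ageDays: Double, halfLifeDays: Double = 7.0): Double =
    math.pow(0.5, ageDays / halfLifeDays)
}
```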
2. AdServer — Creative Thompson Sampling (Minutes)
The click reaches Takeshi’s creative stats on the AdServer:
Before: Beta(1, 1) — uniform, no data
After: Beta(2, 1) — heavily skewed toward high CTR
This is one impression, one click — a 100% click rate. Obviously that won’t last, but it gives Takeshi’s creative a strong initial signal. The Beta distribution is now Beta(2, 1), which samples high most of the time. For the next few impressions, this creative will be favored by Thompson Sampling.
After 20 more impressions and 1 more click, it’ll be Beta(3, 20) — about 13% CTR. Still good, but more realistic. The distribution is narrowing toward the truth.
The stats use a 60-minute rolling window with 1-minute buckets. This creative’s strong early performance will influence serving decisions for the next hour, then the data starts aging out and the system stays responsive to changes.
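A minimal sketch of those buckets: a per-minute map of counts whose last 60 minutes are summed into the Beta parameters (the names and pruning strategy are illustrative):

```scala
import java.time.Instant
import scala.collection.mutable

// Sketch: 1-minute buckets, 60-minute rolling window of impressions and clicks per creative.
final class RollingCreativeStats(windowMinutes: Int = 60) {
  private final case class Bucket(var impressions: Long = 0, var clicks: Long = 0)
  private val buckets = mutable.Map.empty[Long, Bucket]

  private def minuteOf(t: Instant): Long = t.getEpochSecond / 60

  def recordImpression(t: Instant): Unit = bucket(t).impressions += 1
  def recordClick(t: Instant): Unit      = bucket(t).clicks += 1

  private def bucket(t: Instant): Bucket =
    buckets.getOrElseUpdate(minuteOf(t), Bucket())

  // Beta(1 + clicks, 1 + impressions - clicks) over the last `windowMinutes` buckets
  def betaParams(now: Instant): (Double, Double) = {
    val cutoff = minuteOf(now) - windowMinutes
    buckets.filterInPlace((minute, _) => minute > cutoff)  // age out old buckets
    val impressions = buckets.values.map(_.impressions).sum
    val clicks      = buckets.values.map(_.clicks).sum
    (1.0 + clicks, 1.0 + (impressions - clicks))
  }
}
```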
3. CampaignEntity — RL Agent (15 Minutes)
The click is recorded in the RL agent’s window counter:
windowClicks: 0 → 1
This won’t trigger any RL action immediately — the agent only observes every 15 minutes. But when the next RLObserveTick fires, this click will be part of the reward:
reward = windowClicks - overspendPenalty = 1 - 0 = 1.0
A positive reward. The agent will store the transition: “I was in state S (full budget, morning, 1.0 multiplier), I chose to hold my bid (action 2), and I got reward 1.0.” Over hundreds of these transitions, the agent learns which states and actions lead to clicks.
4. Dashboard Projection (Seconds)
The click is written to the tracking journal — a buffered Pekko Stream that batches events and writes them to PostgreSQL. Within a few seconds, Takeshi’s advertiser dashboard updates:
Impressions: 1 | Clicks: 1 | CTR: 100% | Spend: $0.005
Obviously these numbers will normalize as more data comes in. But Takeshi can see that his campaign is live and getting engagement.
The Compound Effect
This is one click. But notice what it touched:
| System | What it learned | Timescale | Effect |
|---|---|---|---|
| Category Ranker | Travel works on this site | Days-weeks | Travel ads get more auction weight |
| Creative Stats | This creative gets clicks | Minutes-hours | Thompson Sampling favors it |
| RL Agent | This bid level produces clicks | 15-min windows | Bid multiplier adjusts over days |
| Dashboard | Campaign is performing | Seconds | Advertiser sees results |
Each system learns at its own pace. Thompson Sampling reacts within minutes — the next reader might see a different ad mix because of this click. The RL agent reacts within hours — the bid multiplier might shift by the end of the day. The category ranker reacts over weeks — Travel’s weight on Yuki’s site gradually increases.
Four layers of learning, all from one click, at four different timescales. No manual optimization. No “let me adjust the bid.” The system converges toward the right answer on its own.
Meanwhile, the Other Creative
While Takeshi’s ryokan ad got a click, the hiking gear ad has had 3 impressions and zero clicks. Its distribution is now Beta(1, 4) — mean CTR about 20%, but the distribution is starting to lean toward lower values.
Thompson Sampling will still occasionally select it — Beta(1, 4) mostly produces low samples, but it can still sample high, just with lower probability. If it gets a click on its next impression, it recovers immediately: Beta(2, 4) is a much more competitive distribution.
If it continues to get no clicks, it fades out naturally. By the time it has 50 impressions and 0 clicks — Beta(1, 51) — its samples will almost always be near zero. It effectively stops being shown, without anyone making a decision to stop it.
This is the beauty of Thompson Sampling: bad creatives don’t need to be manually paused. They extinguish themselves.
Technical deep dives: Beta-Bernoulli Model · Reward Function · Learning Mechanisms
Chapter 6: A Day in the Life
Let’s follow Takeshi’s ryokan campaign through its first full day.
Morning: The Grace Period (8:00-8:02am)
Yuki’s site gets its first traffic of the day. The PI pacing controller has just started a new day — it doesn’t know the request rate yet. For the first 10 seconds (or 10 requests, whichever is later), the controller is in grace period: it throttles at 99%, serving almost nothing.
Why? Because the controller needs to measure the traffic rate before it can regulate it. Serving aggressively without knowing the rate could blow the budget in the first few minutes. Better to be cautious for 10 seconds and get a baseline.
After 10 requests, the TrafficObserver has computed an exponentially-weighted moving average of the request rate: about 2 requests per second at this hour. The PI controller calculates a base throttle:
ideal_serve_rate = budget_remaining / time_remaining / avg_cpm × 1000
= $20 / 86400s / $5 × 1000 = 0.046 serves/second
throttle = 1 - (ideal_serve_rate / observed_rate) = 1 - (0.046 / 2.0) = 0.977
That’s aggressive throttling — skip 97.7% of requests. But that’s correct: $20 of budget at $5 CPM is only 4,000 impressions across the entire day. At 2 requests per second, that’s about 2,000 seconds of full serving — but the day is 86,400 seconds long. The campaign needs to spread thin.
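A minimal sketch of that base calculation, using the same numbers. This is only the feed-forward part; the full controller layers proportional and integral correction, self-tuned gains, and the grace period on top of it:

```scala
// Sketch: feed-forward throttle from budget, time, CPM and the observed request rate.
object PacingSketch {
  def baseThrottle(
      budgetRemaining: Double,   // dollars left today, e.g. 20.0
      secondsRemaining: Double,  // seconds left in the day, e.g. 86400.0
      avgCpm: Double,            // dollars per 1000 impressions, e.g. 5.0
      observedRps: Double        // EWMA of the request rate, e.g. 2.0
  ): Double = {
    val idealServesPerSecond = budgetRemaining / secondsRemaining / avgCpm * 1000
    val throttle = 1.0 - idealServesPerSecond / observedRps
    math.min(1.0, math.max(0.0, throttle))  // clamp to [0, 1]
  }
}

// PacingSketch.baseThrottle(20.0, 86400.0, 5.0, 2.0) is about 0.977, i.e. skip roughly 97.7% of requests
```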
Grace period ends. Normal serving begins.
Mid-Morning: Thompson Sampling Converges (8:00-11:00am)
Three hours in. Takeshi’s ryokan has been shown 15 times, getting 2 clicks. The hiking gear ad has been shown 12 times, zero clicks.
The Thompson Sampling distributions have diverged:
Ryokan: Beta(3, 14) — mean ~18%, samples usually between 5-35%
Hiking Gear: Beta(1, 13) — mean ~7%, samples usually between 0-20%
The ryokan ad is winning most selections now. Not every time — Thompson Sampling still occasionally picks the hiking gear ad (when its sample happens to beat the ryokan’s). But the ratio has shifted from 50/50 to roughly 70/30.
If the hiking gear ad gets a click in its next few impressions, the ratio will tighten. If it doesn’t, it’ll fade further. No one needs to decide when to stop testing. The system self-regulates.
Noon: The RL Agent’s First Observation (12:00pm)
Four hours in. The RL agent’s 15-minute timer has fired 16 times since the day started. Each time, it observes:
state = [effectiveCpm, ctr, winRate, budgetRemaining,
timeRemaining, spendRate, impressionRate, costPerClick]
At noon, the observation looks like:
state = [1.0, — bidding at full CPM (multiplier = 1.0)
0.13, — 13% CTR (2 clicks / 15 impressions this window)
0.82, — winning 82% of bid opportunities
0.75, — 75% budget remaining ($15 out of $20)
0.50, — 50% time remaining (noon)
0.90, — slightly underpacing (spending at 90% of ideal rate)
0.15, — low impression volume
0.38] — cost per click / maxCpm
The reward from the last window: 1 click, no overspend penalty.
reward = 1.0 - 0.0 = 1.0
The agent stores this transition in its replay buffer and trains: sample 32 random transitions, compute Double DQN targets, one SGD step. Then it selects the next action.
The agent’s epsilon is still high (about 0.92 on day 1 — almost fully random). It randomly picks action 3 (bid 10% higher): multiplier = 1.0 × 1.1 = 1.1. The effective CPM for the next 15 minutes will be $5.50.
Will this help? Maybe. Higher bids mean winning more auctions against the JR Rail Pass ($8 CPM). Or maybe the extra cost isn’t worth it. The agent doesn’t know yet. It’s exploring.
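A minimal sketch of one observe tick, with the network hidden behind a small trait. The five bid multipliers (the 0.9× value for action 1 is an inference), the window reward, and the Double DQN target rule follow the descriptions in this book; the class shape, replay handling, and names are illustrative:

```scala
import scala.util.Random

// Sketch of one RLObserveTick: store the transition, train on a minibatch, pick the next action.
final case class Transition(state: Array[Double], action: Int, reward: Double, next: Array[Double])

trait QNetwork {
  def qValues(state: Array[Double]): Array[Double]           // one Q-value per action
  def train(batch: Seq[(Array[Double], Int, Double)]): Unit  // (state, action, target) triples
}

final class BidAgentSketch(online: QNetwork, target: QNetwork, gamma: Double = 0.99) {
  private val multipliers = Array(0.8, 0.9, 1.0, 1.1, 1.2)   // actions 0..4
  private val replay = scala.collection.mutable.ArrayBuffer.empty[Transition]
  var epsilon = 1.0
  var bidMultiplier = 1.0

  // Window reward as described: clicks in the window minus an overspend penalty
  def windowReward(windowClicks: Long, overspendPenalty: Double): Double =
    windowClicks - overspendPenalty

  def observe(completed: Transition, rng: Random): Unit = {
    replay += completed

    // Double DQN targets: the online net chooses the next action, the target net scores it
    val batch = rng.shuffle(replay).take(32).map { t =>
      val bestNext = online.qValues(t.next).zipWithIndex.maxBy(_._1)._2
      val targetQ  = t.reward + gamma * target.qValues(t.next)(bestNext)
      (t.state, t.action, targetQ)
    }.toSeq
    online.train(batch)  // one SGD step on the minibatch

    // Epsilon-greedy choice of the next bid adjustment
    val action =
      if (rng.nextDouble() < epsilon) rng.nextInt(multipliers.length)
      else online.qValues(completed.next).zipWithIndex.maxBy(_._1)._2
    bidMultiplier *= multipliers(action)
  }
}
```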
Afternoon: Pacing Adjusts (2:00-5:00pm)
Traffic on Yuki’s site shifts. The morning peak (8-11am) is over. Afternoon traffic is lighter — about 0.8 requests per second instead of 2. The PI controller detects the drop through its rate tracker and adjusts:
Previous throttle: 0.977 (skip 97.7%)
New throttle: 0.943 (skip 94.3%)
Less throttling because the traffic rate dropped. The campaign serves a larger fraction of the smaller number of requests, maintaining a steady spend rate.
But there’s more: the traffic shape tracker has been learning Yuki’s hourly traffic pattern. After a few days (not the first day — the tracker needs data), it will know:
Hour 8: 12% of daily traffic
Hour 9: 11%
Hour 10: 10%
...
Hour 14: 4%
Hour 15: 3%
...
Hour 20: 8% (evening peak)
Instead of assuming linear time = linear spend, the pacing target will follow this shape. “Spend 12% of budget during the 8am hour, 3% during the 3pm hour.” This prevents the common failure mode of conventional pacing: spending too much during peaks and running dry, or throttling too hard during peaks and having leftover budget at night.
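A minimal sketch of how an hourly shape becomes a pacing target. The dayAlpha = 0.2 blending constant matches the value mentioned at the end of this chapter; the class shape and names are illustrative:

```scala
// Sketch: learn an hourly traffic shape and use its cumulative sum as the pacing target.
final class TrafficShapeSketch(dayAlpha: Double = 0.2) {
  // Fraction of daily traffic expected in each hour (0-23); starts uniform
  private var shape: Vector[Double] = Vector.fill(24)(1.0 / 24)

  // At end of day, blend today's observed hourly fractions into the stored profile
  def blendDay(observedHourlyFractions: Vector[Double]): Unit =
    shape = shape.zip(observedHourlyFractions).map { case (old, obs) =>
      (1 - dayAlpha) * old + dayAlpha * obs
    }

  // "By this point in the day, what fraction of traffic has typically passed?"
  def cumulativeFraction(hour: Int, minute: Int): Double =
    shape.take(hour).sum + shape(hour) * (minute / 60.0)

  // Pacing target: budget x traffic fraction, instead of budget x (time elapsed / 24h)
  def idealSpendSoFar(dailyBudget: Double, hour: Int, minute: Int): Double =
    dailyBudget * cumulativeFraction(hour, minute)
}
```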
Evening: A Re-Auction (7:00pm)
A re-auction fires for Yuki’s site. What’s changed since 2am?
- JR Rail Pass campaign ran out of budget at 4pm. Its RL agent bid too aggressively (multiplier reached 1.4) and burned through the daily budget by mid-afternoon.
- A new advertiser appeared: a Kyoto pottery workshop, targeting East Asian Culture, $3 CPM.
The auction re-runs with the updated participants:
Slot 1 (banner): Takeshi's Ryokan — $5.50 CPM (multiplier bumped to 1.1)
Slot 2 (sidebar): Hiking Gear Co — $4.00 CPM
Slot 3 (sidebar): Pottery Workshop — $3.00 CPM (new!)
JR Rail Pass is gone — budget exhausted. But its creatives stay in the ServeIndex with a refreshed TTL (they’ll be there when budget resets tomorrow). Takeshi’s ryokan, which was the second-highest bidder this morning, is now the top bidder.
The re-auction takes about 3 seconds. The new candidates propagate to all nodes within 2 seconds of gossip. The next reader sees the updated lineup.
End of Day: Reset
At midnight (or the configured day boundary), the day resets.
CampaignEntity: Budget resets to $20. Spend counter goes to zero. The RL agent’s resetDay() fires: it stores a terminal transition (the final reward from the last window), resets the bid multiplier to 1.0, and clears the window counters. But the DQN weights survive — everything it learned today carries into tomorrow.
TrafficShapeTracker: Today’s hourly traffic volumes are blended into the stored profile with dayAlpha = 0.2. After 5 days, the profile is a smoothed average of observed traffic patterns.
Thompson Sampling stats: The 60-minute rolling window means the last hour of creative stats carries into the new day. Older stats have already aged out. The system doesn’t need an explicit reset.
Budget event published: A CampaignBudgetReset event tells the AuctioneerEntity that Takeshi’s campaign has fresh budget. A debounced re-auction fires within 1 second, and the ryokan ad is back in the candidate pool at full strength.
Day 1 is done. The system served relevant ads, learned which creatives work, paced budgets smoothly, and adapted to traffic patterns — all automatically.
Tomorrow it will be slightly smarter.
Technical deep dives: PI Control Loop · Traffic Shape Learning · Grace Periods · Re-Auction
Chapter 7: A Week Later
Seven days have passed. Let’s see what the system has learned.
Takeshi’s RL Agent Has a Strategy
On day 1, the agent’s epsilon was 0.92 — nearly every action was random. By day 7, epsilon has decayed to about 0.15. The agent is now 85% exploitation, 15% exploration.
Its learned policy, visible in the Q-values:
When budget is above 60% and it’s before noon: Bid aggressively (action 4: multiply by 1.2×). Morning traffic on Yuki’s site has high CTR for travel content. Winning these impressions is worth paying more.
When budget is below 30% with time remaining: Pull back (action 0: multiply by 0.8×). Conserve what’s left. Spreading thin across the afternoon produces more total clicks than burning out early.
When CTR is high and spend rate is normal: Hold (action 2: multiply by 1.0×). Things are working. Don’t change what isn’t broken.
When overpacing (spend rate above 1.5×): Reduce immediately (action 0 or 1). The overspend penalty in the reward function trained this response. The agent learned that overpacing leads to negative rewards.
None of these rules were programmed. They emerged from the reward signal — clicks minus overspend penalty — and thousands of training steps on the replay buffer.
Meanwhile, the JR Rail Pass agent learned a different strategy. With a higher CPM ($8) and a larger budget, it can afford to bid aggressively in the morning and let pacing handle the afternoon. Its multiplier swings more widely: up to 1.4 in morning peaks, down to 0.7 in the evening. It learned that its high base CPM means it wins most auctions even at 0.7×.
Different campaigns, different budgets, different learned strategies. Each agent adapts to its own situation.
Thompson Sampling Has Converged
After hundreds of impressions, the creative stats tell a clear story:
| Creative | Impressions | Clicks | Distribution | Mean CTR |
|---|---|---|---|---|
| Takeshi’s Ryokan | 312 | 14 | Beta(15, 299) | 4.8% |
| JR Rail Pass | 287 | 8 | Beta(9, 280) | 3.1% |
| Hiking Gear Co | 89 | 1 | Beta(2, 89) | 2.2% |
| Pottery Workshop | 45 | 3 | Beta(4, 43) | 8.5% |
Takeshi’s ryokan gets the most impressions — it has a proven CTR and a decent CPM. JR Rail Pass has a higher CPM but lower CTR; the scoring formula sampledCTR × log(1 + CPM) keeps them competitive but Takeshi’s CTR advantage matters.
The pottery workshop is interesting. It has fewer impressions (it started later in the week) but its CTR is the highest — 8.5%. Its Beta(4, 43) distribution is still fairly wide, though. Thompson Sampling is giving it more exploration to confirm whether this high CTR is real or noise.
The hiking gear ad has mostly faded out. Beta(2, 89) samples near zero most of the time. It gets about 5% of impressions — just enough exploration to detect if something changes (new creative, seasonal shift). If the advertiser uploaded a better creative, the system would detect the improvement within hours.
Nobody made any of these allocation decisions. No one paused the hiking gear campaign or boosted the pottery workshop. The system found the right distribution through pure learning.
The Category Ranker Has Opinions
The TaxonomyRankerEntity for Yuki’s site has accumulated a week of data:
Travel: Beta(45, 355) — 11.3% CTR, tight distribution
Hiking/Camping: Beta(8, 192) — 4.0% CTR, fairly confident
East Asian Culture: Beta(5, 45) — 10.0% CTR, still exploring
Food & Drink: Beta(2, 28) — 6.7% CTR, early data
Travel dominates — it gets the highest weight in most auctions. But East Asian Culture is a surprise performer. The pottery workshop (East Asian Culture category) is driving this. The ranker is giving East Asian Culture more auction weight, which means more bidding opportunities for advertisers in that category.
This creates a virtuous cycle: good category performance → more auction weight → more candidates → more data → better Thompson Sampling → better ads → higher CTR → higher category performance.
Traffic Shapes Are Calibrated
The TrafficShapeTracker now has 7 days of hourly data for Yuki’s site, blended with dayAlpha = 0.2:
Weekday profile:
Hour 0-6: 1-2% each (late night, minimal traffic)
Hour 7: 4% (morning commute)
Hour 8-10: 10-12% each (peak reading time)
Hour 11-13: 6-7% each (lunch)
Hour 14-17: 3-4% each (afternoon lull)
Hour 18-20: 7-9% each (evening reading)
Hour 21-23: 3-4% each (winding down)
The PI controller uses this shape instead of a linear time fraction. At 10am, it knows 32% of daily traffic has typically passed (not 42% if you assumed linear). This means the pacing target at 10am is “have spent about 32% of budget” — not 42%. The result: budgets stretch correctly across the day’s actual traffic pattern, not an imaginary uniform distribution.
Pacing Has Self-Tuned
The PI controller has been adjusting itself:
- Overpace multiplier: Started at 2.0×. After detecting that Takeshi’s campaign occasionally overpaced by 20% in the morning (the RL agent bidding up), it increased to 2.8×. This means the controller responds more aggressively to overspending — a correction learned from experience.
- Spend ratio smoothing: The adaptive EMA alpha settled at 0.25. The traffic on Yuki’s site is moderately volatile (it spikes when she publishes a new article and posts to social media). The controller learned to smooth more than the default to avoid overreacting to these spikes.
What Yuki Sees
Yuki checks her publisher dashboard:
This Week:
Impressions served: 1,847
Revenue: $9.24
Active advertisers: 4
Top category: Travel (58% of impressions)
Approval queue: 2 new creatives pending review
The revenue isn’t life-changing — it’s a small blog. But the ads are relevant, the site is fast, and her readers haven’t complained. She approves the two pending creatives (a Kyoto walking tour and a Japanese language school) and goes back to writing her next article.
What Takeshi Sees
Takeshi checks his advertiser dashboard:
This Week:
Impressions: 312
Clicks: 14
CTR: 4.5%
Spend: $1.56
Avg CPC: $0.11
Fourteen people clicked through to his ryokan’s booking page from a travel blog — exactly the kind of reader he wanted to reach. His cost per click is $0.11. He didn’t have to manage bids, adjust targeting, or learn a DSP interface. He uploaded a photo, set a budget, and the system did the rest.
He increases his daily budget to $30.
The System Keeps Learning
Day 8 begins. Epsilon decays to 0.05 — the RL agent’s floor. From now on, 95% of actions follow the learned policy, 5% continue to explore. The agent isn’t done learning — it will adapt to seasonal changes, new competitors, and traffic shifts — but the wild random exploration of the first week is over.
The system has found its rhythm: a travel blog with relevant ads, a local ryokan reaching interested travelers, creative performance continuously optimized, budgets paced smoothly, all running on sub-millisecond serving with no user tracking.
This is what advertising looks like when you start with the content instead of the user.
Technical deep dives: RL Training Loop · State Space · Traffic Shape Learning · Key Innovations
Technical Introduction
This chapter provides a concise technical overview of Promovolve’s architecture and algorithms. For the motivation behind these design choices, see Why Promovolve?.
The Five Key Mechanisms
1. Periodic Batch Auction
Auctions happen when content is published or updated (scheduled crawl + 5-minute re-auctions), not on every page load. An LLM classifies page content into IAB categories, TaxonomyRankerEntity ranks categories by site-specific performance, and CategoryBidderEntity collects bids from eligible campaigns. Results are cached in DData.
2. Multi-Candidate Caching
Instead of a single auction winner, multiple candidates per ad slot are shortlisted with per-campaign diversity guarantees and stored in the ServeIndex (replicated in-memory via DData). This enables exploration at serve time without re-running the auction.
3. Thompson Sampling at Serve Time
When a user loads a page, Thompson Sampling selects among cached candidates:
score = sampledCTR × log(1 + CPM)
CTR is sampled from a Beta-Bernoulli posterior using time-bucketed statistics (1-minute granularity, 60-minute rolling window). The log(1 + CPM) term ensures higher bids have an advantage but CTR dominates — a creative that users actually click beats one that merely bids high.
4. Double DQN Bid Optimization
Each campaign runs its own reinforcement learning agent (an 8→64→64→5 neural network) that observes performance every 15 minutes and adjusts a bid multiplier. The agent learns to balance click maximization against budget pacing over multi-day episodes.
5. Self-Tuning PI Pacing
A PI controller with adaptive gains, traffic shape learning (separate weekday/weekend 24-hour profiles), oscillation detection, and leaky integrator anti-windup smooths budget delivery. It learns that traffic peaks at 10am and dips at 3pm, and adjusts automatically.
The Result
Sub-millisecond ad serving. Continuous learning at five layers (per-request Thompson Sampling, 15-minute RL, per-auction category ranking, daily traffic shapes, continuous PI tuning). Graceful degradation when budgets exhaust. No user tracking. Publisher approval over every creative.
Navigating This Book
- Architecture — Pekko cluster topology, entity hierarchy, data flow
- Auction — The five phases of the periodic batch auction
- Serving — Thompson Sampling, cold start, fair selection
- Pacing — PI control, self-tuning, traffic shape learning
- RL — DQN state space, reward function, training loop
- Distributed State — ServeIndex replication and consistency
- Comparison — Point-by-point mapping against traditional ad tech
Each chapter is self-contained. All formulas, thresholds, and constants come from the Scala source code.
How Ad Tech Works (and Where Promovolve Diverges)
If you’ve never worked in ad tech, the alphabet soup of SSPs, DSPs, DMPs, and RTB can be impenetrable. This chapter explains the traditional programmatic advertising stack from the ground up, then shows how Promovolve makes different choices at each layer.
The Simplest Version: A Magazine
Before the internet, advertising was straightforward.
A magazine about cooking has readers who care about cooking. A kitchen equipment company wants to reach people who care about cooking. The magazine’s ad sales team calls the kitchen equipment company and says: “We’ll put your ad on page 47, next to our article about French sauces, for $5,000.” They shake hands. Done.
Three participants. One transaction. Everyone understands what they’re getting:
- The advertiser gets their ad in front of relevant readers
- The publisher gets paid for their audience’s attention
- The reader sees an ad that relates to what they’re already reading
This is direct sales. It works beautifully when the publisher and advertiser know each other. It doesn’t scale to millions of websites and millions of advertisers who have never met.
The Internet Problem: Too Many Strangers
A small travel blog in Kyoto has readers who love Japanese travel. A ryokan in Hakone would love to reach those readers. But the blog owner doesn’t know the ryokan exists, and the ryokan owner doesn’t know the blog exists. Neither has a sales team.
Multiply this by millions of websites and millions of businesses. The matching problem — connecting the right ad to the right page — is too large for humans to solve one deal at a time.
The ad tech industry’s answer was to automate the matching with machines. But the system they built optimized for a particular set of goals, and those goals don’t serve everyone.
The Traditional Stack: Who Does What
The Publisher’s Side: SSP (Supply-Side Platform)
The publisher (our Kyoto travel blog) signs up with an SSP — companies like Google Ad Manager, Magnite, or PubMatic. The SSP provides a piece of JavaScript that goes on every page. When a reader loads the page, this script calls the SSP: “I have an ad slot, 300x250 pixels, on this URL. Who wants it?”
The SSP’s job is to get the highest price for this impression. It does this by offering it to exchanges and DSPs.
The Advertiser’s Side: DSP (Demand-Side Platform)
The ryokan in Hakone signs up with a DSP — companies like The Trade Desk, DV360, or Amazon DSP. The ryokan uploads its ad creative, sets a budget ($50/day), defines a target audience (“people interested in travel to Japan”), and sets a maximum bid ($3 CPM).
The DSP’s job is to find the right impressions to buy and bid the right price. It does this by listening for bid requests from exchanges and deciding, in real time, whether this particular impression is worth bidding on.
The Middle: The Ad Exchange
The exchange (Google AdX, OpenX, etc.) sits between SSPs and DSPs. When the SSP says “I have an impression,” the exchange broadcasts it to every connected DSP: “Who wants this? You have 100 milliseconds to decide.”
Each DSP evaluates the impression:
- Does this page match the advertiser’s targeting criteria?
- How much is this user worth, based on their profile?
- What’s the right bid price?
The DSPs that want it send back bids. The exchange picks the highest bidder. The winning DSP’s ad is served.
The Invisible Layer: DMP (Data Management Platform)
How does the DSP know “how much this user is worth”? It consults user profiles — built from cookies, device IDs, and cross-site tracking data aggregated by DMPs. These profiles say things like: “This user visited car dealership websites last week” or “This user is 25-34, lives in Tokyo, and recently searched for flights.”
This is where the magazine model breaks down completely. The DSP isn’t asking “what is this person reading?” It’s asking “who is this person?” The travel blog’s content about Kyoto temples is irrelevant. The ad is targeting the user, not the page.
What Goes Wrong
This system works — in the narrow sense that money flows and ads get served. But it has structural problems.
The publisher becomes a commodity
In this model, the publisher’s content doesn’t matter. What matters is the user sitting on the page and the cookie in their browser. A thoughtfully researched article about Kyoto architecture and a hastily assembled listicle about “10 things in Japan” are, to the exchange, interchangeable: they carry the same user with the same cookie.
Publishers who invest in quality content get paid the same as those who don’t. The incentive is to maximize page views, not quality — because the ad system doesn’t value quality.
The user experience degrades
Each ad slot triggers a cascade of network requests. The SSP calls the exchange. The exchange calls multiple DSPs. Each DSP calls its DMP. The responses flow back. On a page with five ad slots, this happens five times in parallel. Header bidding (the publisher’s attempt to get better prices) adds another round. The result: ad-related requests often take longer than the page content itself.
And the ads themselves: because they target users, not content, they feel random and intrusive. You read about temple architecture, you see an ad for the shoes you browsed last night. The disconnect is jarring.
Small players can’t participate
This infrastructure has minimum viable scale. SSPs have traffic minimums. DSPs require campaign management expertise. The exchange’s auction mechanics favor large bidders with sophisticated real-time bidding algorithms. The ryokan in Hakone and the Kyoto travel blog — the exact pair that should be connected — can’t afford to play.
How Promovolve Rethinks Each Layer
Promovolve doesn’t try to improve the traditional stack. It replaces the fundamental assumptions.
Instead of targeting users → target content
The entire SSP/DSP/DMP chain exists because the system decided to target users. Remove that decision, and most of the machinery becomes unnecessary.
Promovolve classifies page content using an LLM into IAB Content Taxonomy 2.1 categories: “This article is about Destinations, Outdoor Recreation, Cultural Tourism.” Meanwhile, an advertiser says: “My product is Travel” (IAB Ad Product Taxonomy 2.0). The system automatically derives which content categories match that product using the official IAB mapping — no manual configuration. The match happens between content and product, not content and user. No user profile needed. No DMP. No cookies.
This is the magazine model, automated. The technology that makes it work at scale — cheap, accurate LLM classification — didn’t exist five years ago.
Instead of real-time auctions → periodic batch auctions
Traditional auctions run on every page load because they need to evaluate the user in real time. Promovolve doesn’t need to — the content doesn’t change between page loads.
The auction runs when content is crawled (scheduled + re-auctions every 5 minutes). Multiple candidates per slot are cached in a replicated in-memory store (Pekko DData). When a user loads the page, the ad is already there. No network round-trip, no exchange, no 100ms wait.
Serve latency drops from 50-200ms to under 1ms.
Instead of a single winner → multiple candidates with exploration
A traditional auction picks one winner: the highest bidder. That’s it. If a new advertiser with a potentially better creative enters, they lose to the established high bidder and never get a chance to prove themselves.
Promovolve caches multiple candidates and uses Thompson Sampling to choose at serve time. A new creative with no track record gets explored — shown to some users to learn its click-through rate. If it performs well, it earns more impressions. If not, it fades out naturally. No A/B test configuration needed. The system learns automatically.
Instead of DSP bid algorithms → per-campaign RL agents
In the traditional stack, each DSP runs sophisticated bid optimization across all its campaigns. The ryokan in Hakone doesn’t have a DSP; it can’t participate.
In Promovolve, each campaign has its own RL agent (a small neural network with ~4,800 parameters) that learns to adjust bid levels based on performance. The ryokan sets a maximum CPM and a daily budget. The agent figures out the rest: bid aggressively when the campaign is pacing well, pull back when budget is running low, learn day-over-day which bid levels produce clicks.
No DSP integration. No bid management expertise. The agent handles it.
Instead of per-impression database writes → buffered spend tracking
Traditional systems write to a database on every impression to track spend. At scale, this becomes a bottleneck.
Promovolve buffers spend events in the campaign actor (flush every 500ms or 20 events), deduplicates with a Bloom filter, and persists atomically. This reduces database writes dramatically while maintaining correctness through idempotency guarantees.
Instead of intermediary fees → direct connection
In the traditional stack, money passes through multiple intermediaries: DSP, exchange, SSP. Each takes a percentage.
Promovolve connects advertisers and publishers directly. The advertiser’s budget goes to the publisher, minus the platform’s single fee. There’s no exchange, no DSP, no DMP taking a cut.
What Promovolve Gives Up
These trade-offs are real and worth understanding:
No cross-publisher reach. A DSP campaign can target users across thousands of websites simultaneously. Promovolve works per-publisher (or per-publisher-network). An advertiser who wants broad reach across unrelated sites needs the traditional stack.
No user-level targeting. If an advertiser specifically wants to reach “women aged 25-34 in Tokyo who recently searched for hotels,” Promovolve can’t help. It can reach “readers of content about hotels in Tokyo,” which may overlap significantly, but it’s a different kind of targeting.
No real-time price discovery. Traditional exchanges reveal market-clearing prices through competitive bidding. Promovolve’s first-price model with RL adjustment doesn’t discover “what is this impression worth to the market?” — it discovers “what bid level maximizes my clicks within my budget?”
Stale auction results. Traditional RTB reflects the state of the world right now. Promovolve’s cached candidates can be up to 5 minutes old (the re-auction interval). A campaign that paused 2 minutes ago might still be served until the next re-auction.
No user retargeting. The “you looked at shoes, now see shoe ads everywhere” pattern is impossible in Promovolve. For some, this is a feature.
When Promovolve Makes Sense
Promovolve is the right choice when:
- The publisher’s content is the value proposition, not access to trackable users
- Advertisers want contextual relevance — their ad next to related content
- Page performance matters — sub-millisecond serving vs. 200ms ad waterfalls
- The publisher wants control over what appears on their site (approval workflow)
- Privacy is a genuine concern, not just a compliance checkbox
- Participants include small advertisers — local businesses, community announcements — who can’t access the programmatic stack
It’s the wrong choice when:
- The advertiser needs cross-publisher user retargeting
- Market-clearing price discovery is important for the business model
- The publisher’s value is user data, not content quality
The next chapters examine each of these differences in technical detail.
System Architecture
Promovolve runs as an Apache Pekko Cluster with three distinct node roles, Cluster Sharding for entity distribution, and Distributed Data (DData) for replicated in-memory state. Persistence uses PostgreSQL via JDBC (Slick).
High-Level Components
graph TB
subgraph Cluster["Pekko Cluster (promovolve system)"]
subgraph Singleton["Singleton Role"]
CD["Campaign Directory"]
Sched["Scheduler"]
end
subgraph Entity["Entity Role"]
Camp["Campaign"]
Auct["Auctioneer"]
CatBid["CategoryBidder"]
TaxRnk["TaxonomyRanker"]
Adv["Advertiser"]
end
subgraph API["API Role"]
HTTP["HTTP API"]
AdSrv["AdServer"]
Evt["Events"]
end
DData["DData: ServeIndex, PacingConfig, Blocklist<br/>Replicated across all nodes via gossip<br/>Durable: LMDB for shard-* and exhausted-*"]
PG["PostgreSQL: durable_state, snapshot tables<br/>20 connections, 5 min pool"]
end
Cluster Configuration
From application.conf:
| Setting | Value |
|---|---|
| Cluster roles | singleton, entity, api (env: PEKKO_CLUSTER_ROLES) |
| Number of shards | 100 |
| Remember entities | on (via DData store) |
| Passivation timeout | 5 minutes |
| Split-brain strategy | keep-majority (stable after 20s) |
| Heartbeat interval | 1s, threshold 12.0, acceptable pause 10s |
| Remote frame size | 256 KiB |
| Seed node | pekko://promovolve@127.0.0.1:25520 |
DData Configuration
| Setting | Value |
|---|---|
| Gossip interval | 2s |
| Notify subscribers interval | 500ms |
| Max delta elements | 500 |
| Durable keys | shard-*, exhausted-campaigns |
| Durable store | LMDB (100 MiB map, 200ms write-behind) |
| Pruning interval | 120s (dissemination: 300s) |
Key Design Decisions
- 100 shards with remember-entities-via-DData ensures entities survive node restarts and are automatically rebalanced (rebalance-absolute-limit: 20, relative: 0.1).
- DData for ServeIndex means every API node has a local replica of all active ad candidates. Serve-time lookups never cross the network.
- LMDB durability covers shard metadata and exhausted-campaigns state, but ServeIndex itself is ephemeral — rebuilt from auctions on restart.
- Separation of roles allows scaling read (API) and write (entity) workloads independently. The singleton role hosts cluster-wide coordinators like CampaignDirectory.
Entity Hierarchy & Cluster Roles
Entity Relationship Map
Advertiser (sharded by advertiserId)
├── Budget: dailyBudget, spendToday, lastResetEpochDay
├── Creatives: Map[CreativeId, Creative]
├── Site blocklist: Set[SiteId]
└── Campaigns: Set[CampaignId]
└── Campaign (sharded by advertiserId|campaignId)
├── Budget: dailyBudget, spendToday, maxCpm
├── RL Agent: BidOptimizationAgent (DQN snapshot)
├── Creative assignments: Set[CreativeId]
├── Spend buffer: 500ms / 20 events batching
├── Idempotency: BloomFilter (50K entries, 0.01% FPP)
└── Categories: Set[CategoryId]
Publisher
└── Site (sharded by siteId)
├── Config: domain, seedUrl, cronSchedule, maxDepth
├── PacingConfig: dayDuration, traffic shapes, warmupMode
├── Ad product blocklist: Set[AdProductCategoryId]
└── Slots: List[AdSlotConfig(slotId, width, height)]
AuctioneerEntity (sharded by siteId)
├── Page classifications: Map[URL, Classification]
├── Participating campaigns: Map[CampaignId, Set[URL]]
├── TaxonomyRankerEntity (sharded by category|siteId)
│ └── Thompson Sampling weights, half-life decay
└── CategoryBidderEntity (sharded by category|siteId|shard)
└── Virtual sharding: hash(siteId) % 5
CampaignDirectory (ClusterSingleton)
├── Reverse index: CategoryId → Map[CampaignId, AdvertiserId]
├── Routes updates via CampaignDistributor (8 workers)
│ └── Fan-out to CategoryBidderEntity shards
└── 60-second reconciliation cycle
Sharding Strategy
Each entity type uses a different shard key optimized for its access pattern:
| Entity | Shard Key | Shards | Rationale |
|---|---|---|---|
| AuctioneerEntity | siteId | 100 | All pages on a site auction together |
| CategoryBidderEntity | category|siteId|shard | 100 × 5 virtual | Distributes load within popular categories |
| TaxonomyRankerEntity | category|siteId | 100 | Co-located with bidder for low-latency |
| CampaignEntity | advertiserId|campaignId | 100 | Independent lifecycle, RL state per campaign |
| AdvertiserEntity | advertiserId | 100 | Budget and frequency caps per advertiser |
| CampaignDistributor | N/A | 8 workers | Routes by hash(categoryId) % 8 |
Entity Lifecycle
CampaignEntity
- Status enum: `Active`, `Paused`
- Active: Responds to bid requests, RL agent adjusts multiplier every 15 min
- Paused: Stops responding, creatives removed from ServeIndex
- Budget exhausted: Stops bidding, creatives remain in ServeIndex (budget resets daily)
- Day reset guard: `lastRolledEpochDay` prevents double-roll on same calendar day
- Passivation: After 5 minutes of inactivity
CampaignEntity Spend Recording
The spend path is carefully designed for correctness:
- Buffered: 500ms timer OR batch of 20 events (whichever fires first)
- Idempotency: 50K-entry Bloom filter (0.01% FPP) + 50K Scaffeine cache (5min TTL)
- At-least-once: Pending reports retry with exponential backoff (100ms → 5s, max 5 attempts)
- Persist-then-publish: State saved before `SpendUpdate` event published
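To make the batching concrete, here is a minimal sketch of the buffer-and-flush idea in plain Scala. The names (`SpendEvent`, `SpendBuffer`, `record`, `flush`) are invented for illustration, and a plain `Set` stands in for the Bloom filter plus Scaffeine cache; the real CampaignEntity does this inside a Pekko behavior with a timer.

```scala
import java.time.Instant
import scala.collection.mutable

// Hypothetical event shape: one spend report per served impression.
final case class SpendEvent(eventId: String, campaignId: String, amountMicros: Long)

// Minimal sketch of buffered spend recording: flush on 20 events or 500 ms,
// dedup by event ID. A plain Set stands in for the Bloom filter + Scaffeine cache.
final class SpendBuffer(persist: Vector[SpendEvent] => Unit,
                        maxBatch: Int = 20,
                        maxAgeMillis: Long = 500) {
  private val pending = mutable.ArrayBuffer.empty[SpendEvent]
  private val seenIds = mutable.Set.empty[String]
  private var oldestAt = 0L

  def record(event: SpendEvent, nowMs: Long = Instant.now.toEpochMilli): Unit = {
    if (seenIds.add(event.eventId)) {        // idempotency: duplicates are dropped
      if (pending.isEmpty) oldestAt = nowMs
      pending += event
    }
    if (pending.size >= maxBatch || (pending.nonEmpty && nowMs - oldestAt >= maxAgeMillis))
      flush()                                // in the actor, the 500 ms timer also triggers this
  }

  def flush(): Unit =
    if (pending.nonEmpty) {
      persist(pending.toVector)              // persist first, then publish SpendUpdate
      pending.clear()
    }
}
```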
AuctioneerEntity
- Activated on first crawl of a site’s page
- Tracks which campaigns participated in recent auctions (for targeted re-auction)
- Periodic re-auction: Every 5 minutes (`promovolve.auction.reauction-interval`)
- Cleanup: Removes classifications older than 48 hours every 5 minutes
- Passivates after 5 minutes of inactivity
AdvertiserEntity
- Tracks: Set of campaigns, Map of creatives, daily budget/spend
- Flush ID dedup: Maintains last 1000 processed flush IDs (`MaxProcessedFlushIds`)
- Day reset: Based on `lastResetEpochDay` comparison with current epoch day
Data Flow: Crawl vs Serve
Promovolve separates its workload into two distinct phases with fundamentally different performance characteristics.
Crawl Phase (Write Path)
The crawl phase runs on a configurable schedule (default: Quartz cron "0 0 2 * * ?" — 2am daily) and is the “heavy” computation path. Crawl configuration per site includes maxDepth (default: 2) and concurrency (default: 5), running on a dedicated crawler-dispatcher with 4 fixed threads.
graph TD
Crawler["External Crawler<br/>(4-thread pool)"] --> Classification["Page Classification<br/>(LLM: Gemini/OpenAI/Anthropic)<br/>categories + confidence scores"]
Classification --> Auctioneer["AuctioneerEntity<br/>(sharded by siteId)"]
Auctioneer --> Taxonomy["TaxonomyRankerEntity<br/>(800ms timeout)<br/>Thompson-sampled weights, 7-day half-life<br/>site-blend threshold: 20.0, min imps: 100"]
Auctioneer --> CatBid["CategoryBidderEntity fan-out<br/>(5 virtual shards)"]
CatBid --> CampDist["CampaignDistributor (8 workers)"]
CampDist --> CampResp["CampaignEntity bid responses<br/>bidCpm = max(maxCpm × multiplier, floor)"]
Auctioneer --> ServeIndex["Candidate shortlisting → ServeIndex<br/>(DData, WriteLocal, 120-min TTL)"]
Serve Phase (Read Path)
The serve phase handles every ad request and must be extremely fast.
graph TD
User["User Request (page load)"] --> API["API Node (HTTP, port 8080)"]
API --> Lookup["Lookup ServeIndex from local DData<br/>Key: siteId|slotId → Vector of CandidateView"]
Lookup --> Recency["Content Recency Filter<br/>classifiedAtMs within 48h window"]
Recency --> FreqCap["Frequency Cap Check<br/>(100ms timeout, fail-open)<br/>query AdvertiserEntity per user"]
FreqCap --> Rate["Rate Tracking<br/>(synchronous EMA, 1s window, α=0.3)"]
Rate --> Pacing["Pacing Gate (PI control)<br/>aggregate budget from CachedSpendInfo<br/>throttle probability 0.0–0.99"]
Pacing -->|"random() < throttle"| Skip["Skip (204)"]
Pacing -->|pass| TS["Thompson Sampling Selection<br/>sample Beta(clicks+1, non_clicks+1)<br/>score = sampledCTR × log(1 + CPM)<br/>argmax"]
TS --> Budget["Budget Reservation<br/>CampaignEntity.Reserve +<br/>AdvertiserEntity.GetBudgetStatus"]
Budget -->|failure| Next["Try next-best by Thompson score"]
Budget -->|success| Serve["Serve ad"]
Next -->|all exhausted| NoCandidates["NoCandidates (204)"]
Why Two Phases?
| Concern | Crawl Phase | Serve Phase |
|---|---|---|
| Latency | Seconds OK | Must be < 1ms |
| Computation | Full auction, LLM classification | Cache lookup + Beta sampling |
| Fan-out | Many entities | Zero (local DData) |
| Failure mode | Retry on next crawl | Serve cached candidates |
| Scaling | Add entity nodes | Add API nodes |
| Dispatcher | crawler-dispatcher (4 threads) | Default Pekko dispatcher |
This separation means:
- Auction complexity doesn’t affect serve latency — LLM classification and multi-entity fan-out happen in the background
- Serve capacity scales independently — adding API nodes increases request throughput without affecting auction load
- Temporary failures are invisible to users — cached candidates remain in ServeIndex until their 120-minute TTL expires
Periodic Batch Auction
The defining architectural choice of Promovolve is that auctions run ahead of time, not per-request. When content is crawled (default schedule: 2am daily via Quartz cron), the system runs a full multi-phase auction and caches results in DData for instant serve-time lookups.
Auction Pipeline
┌─────────────────────────┐
│ Page Classification │ LLM-based (Gemini/OpenAI/Anthropic)
│ │ → IAB categories + confidence scores
└────────┬────────────────┘
▼
┌─────────────────────────┐
│ Category Ranking │ TaxonomyRankerEntity per (category, site)
│ │ → Thompson-sampled weights, 7-day half-life
└────────┬────────────────┘
▼
┌─────────────────────────┐
│ Bid Collection │ CategoryBidderEntity (5 virtual shards)
│ │ → CampaignDistributor (8 workers)
│ │ → CampaignEntity bid responses
└────────┬────────────────┘
▼
┌─────────────────────────┐
│ Candidate Shortlisting │ Fair selection: 1 per campaign, fill remainder
│ │ → Top K per slot (default K=3)
└────────┬────────────────┘
▼
┌─────────────────────────┐
│ ServeIndex Caching │ DData WriteLocal, 120-minute TTL
│ │ → Replicated to all API nodes via gossip
└─────────────────────────┘
Periodic Re-Auction
Between crawl cycles, the system runs periodic re-auctions every 5 minutes (promovolve.auction.reauction-interval) for recent content within the 48-hour recency window. Additionally, event-driven re-auctions trigger on campaign/advertiser state changes.
Content Recency Window
Only pages classified within the last 48 hours participate in auctions. Every 5 minutes, AuctioneerEntity runs cleanup to remove classifications older than 48 hours.
Key Configuration
| Parameter | Value | Env Var |
|---|---|---|
| Re-auction interval | 5 minutes | REAUCTION_INTERVAL |
| Content recency | 48 hours | — |
| Crawl cron schedule | "0 0 2 * * ?" | Per-site config |
| Crawl max depth | 2 | Per-site config |
| Crawl concurrency | 5 | Per-site config |
| ServeIndex TTL | 120 minutes | — |
| Taxonomy ask timeout | 800ms | — |
Phase 1: Page Classification
Before any auction can run, the system must understand what a page is about. Page classification maps URLs to IAB Content Taxonomy 2.1 categories with confidence scores using LLM-based analysis.
Two Taxonomies, One Match
Promovolve uses two distinct IAB taxonomies that meet at auction time:
| Taxonomy | Version | Who sets it | Purpose |
|---|---|---|---|
| Ad Product Taxonomy | 2.0 | Advertiser | “What is my product?” (e.g., Travel, Kitchen Equipment) |
| Content Taxonomy | 2.1 | LLM classifier | “What is this page about?” (e.g., Destinations, Outdoor Recreation) |
The advertiser never sees content categories. They pick their product category, and ContentToAdProductMapping derives the matching content categories using the official IAB mapping file (content_2.1_to_ad_product_2.0.tsv). If no direct mapping exists for a product category, the system walks up the taxonomy’s parent chain until it finds one.
At auction time, matching is exact: the page’s content category must be in the campaign’s derived content category set. There is no fuzzy or hierarchical matching at bid time — the hierarchy is resolved once, at campaign setup.
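A minimal sketch of that fallback walk, with tiny hypothetical stand-ins for the mapping file and the taxonomy parent table (the real lookup lives in ContentToAdProductMapping and reads the full IAB TSV):

```scala
// Tiny hypothetical stand-ins for the IAB mapping TSV and the Ad Product parent tree;
// only the lookup-then-walk-up shape mirrors ContentToAdProductMapping.
object AdProductToContentSketch {
  private val directMapping: Map[String, Set[String]] =
    Map("1510" -> Set("225", "227")) // adProductId -> content category IDs
  private val parent: Map[String, String] =
    Map("1512" -> "1510")            // adProductId -> parent adProductId

  @annotation.tailrec
  def contentFor(adProductId: String): Set[String] =
    directMapping.get(adProductId) match {
      case Some(categories) => categories
      case None =>
        parent.get(adProductId) match {
          case Some(p) => contentFor(p)   // walk up the taxonomy's parent chain
          case None    => Set.empty       // no mapping anywhere on the chain
        }
    }
}

// AdProductToContentSketch.contentFor("1512") == Set("225", "227"), via parent 1510.
```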
Classification Pipeline
Promovolve supports multiple LLM providers for classification, configured in application.conf:
| Provider | Config Key | Env Var |
|---|---|---|
| Gemini | promovolve.gemini.api-key | GEMINI_API_KEY |
| OpenAI | promovolve.openai.api-key | OPENAI_API_KEY |
| Anthropic | promovolve.anthropic.api-key | ANTHROPIC_API_KEY |
Gemini is enabled by default (promovolve.gemini.enabled = true).
Classification Output
The LLM returns category IDs which are normalized to IAB Content Taxonomy 2.1 numeric IDs. Legacy IAB 1.0 format IDs (e.g., "IAB17") are converted via TieredCategory.normalize() to their 2.1 equivalents (e.g., "483"). The result is a map of category-to-confidence:
{
"url": "https://example.com/sports/nba-finals-recap",
"categories": {
"483": 0.92,
"484": 0.85,
"393": 0.45
}
}
Each Confidence value is an opaque Double in [0, 1]. All downstream matching uses these numeric Content Taxonomy 2.1 IDs.
Top-K Category Selection
AuctioneerEntity selects the top K categories (default K=3) by confidence score. Only these categories proceed to ranking and bidding.
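The selection itself is a straightforward sort-and-take over the confidence map; a sketch using the example output above:

```scala
// Sort-and-take over the classifier output shown above (keys are Content Taxonomy 2.1
// IDs, values are confidence scores). K defaults to 3, as in the auction config.
object TopKCategories {
  def select(confidences: Map[String, Double], k: Int = 3): Vector[(String, Double)] =
    confidences.toVector.sortBy { case (_, confidence) => -confidence }.take(k)
}

// TopKCategories.select(Map("483" -> 0.92, "484" -> 0.85, "393" -> 0.45))
//   == Vector(("483", 0.92), ("484", 0.85), ("393", 0.45))
```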
Classification Storage
Classifications are stored in AuctioneerEntity’s state as a Map[URL, Classification], keyed by page URL and timestamped with classifiedAtMs. Every 5 minutes, a cleanup task removes entries older than the 48-hour recency window.
Role in Scoring
The confidence score feeds into category ranking:
categoryScore = classifierConfidence × rankerWeight
This composite score is stored in CandidateView.categoryScore and used as a prior for Thompson Sampling during cold start.
Phase 2: Category Ranking
After page classification identifies the top K categories, each category is assigned a ranker weight that reflects how well ads in that category have historically performed on this specific site.
TaxonomyRankerEntity
Each (category, siteId) pair has its own TaxonomyRankerEntity. Configuration from application.conf:
| Parameter | Default | Env Var |
|---|---|---|
| Half-life | 7 days | TAXONOMY_RANKER_HALF_LIFE |
| Prior α | 1.0 | TAXONOMY_RANKER_PRIOR_ALPHA |
| Prior β | 1.0 | TAXONOMY_RANKER_PRIOR_BETA |
| Flush interval | 5 seconds | TAXONOMY_RANKER_FLUSH_EVERY |
| Site blend threshold | 20.0 | — |
| Site min impressions | 100.0 | TAXONOMY_RANKER_SITE_MIN_IMPRESSIONS |
| Site stats max age | 14 days | — |
| Max sites per category | 5000 | TAXONOMY_RANKER_MAX_SITES |
Weight Calculation
The ranker uses Thompson Sampling with a Beta-Bernoulli model:
- Maintain per-category click/impression counts for this site
- Model CTR as `Beta(prior_α + clicks, prior_β + non_clicks)` — default prior is `Beta(1, 1)` (uniform)
- Sample from the Beta distribution to get a weight
- Return sampled weight to AuctioneerEntity
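The sketch below illustrates the posterior sampling. It uses an order-statistic Beta sampler (the a-th smallest of a + b − 1 uniforms is distributed Beta(a, b)), which is valid here because the default priors are 1.0 and the counts are integers; the production code uses a Marsaglia-Tsang sampler instead (see the Thompson Sampling chapter), so treat this as illustrative only.

```scala
import scala.util.Random

// Illustrative Beta-Bernoulli sampling for a (category, site) pair. Uses the
// order-statistic identity, valid for integer parameters; not the production sampler.
object RankerWeightSketch {
  def sampleBetaInt(alpha: Int, beta: Int, rng: Random = new Random): Double = {
    require(alpha >= 1 && beta >= 1)
    val uniforms = Vector.fill(alpha + beta - 1)(rng.nextDouble()).sorted
    uniforms(alpha - 1)
  }

  // Posterior with the default Beta(1, 1) prior from the table above.
  def sampleWeight(clicks: Int, impressions: Int): Double =
    sampleBetaInt(alpha = 1 + clicks, beta = 1 + (impressions - clicks))
}

// RankerWeightSketch.sampleWeight(clicks = 12, impressions = 400)
// usually lands near 12/400 ≈ 0.03, but varies draw to draw (that is the point).
```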
Site Blending
When a specific site has fewer than site-min-impressions (100) observations, the ranker blends site-specific statistics with global category statistics using the site-blend-threshold (20.0). This prevents new sites from suffering cold-start issues.
Fan-Out and Timeout
AuctioneerEntity queries all K TaxonomyRankerEntities in parallel with an 800ms timeout.
If a ranker doesn’t respond within 800ms:
- Use cached weight with half-life decay: `weight × 0.5^(ageSeconds / halfLifeSeconds)`
- Where `halfLifeSeconds = 7 days = 604800s` by default
- Fall back to prior weight (0.5) if no cached data exists
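A small sketch of this fallback, assuming the cached weight is stored alongside its timestamp:

```scala
// Sketch of the timeout fallback: decay the last cached weight by its age, halving it
// every halfLifeSeconds, and use the 0.5 prior when nothing is cached.
object RankerFallback {
  val HalfLifeSeconds: Double = 7 * 24 * 3600 // 604800s, the 7-day default

  def decayedWeight(cached: Option[(Double, Long)], nowMs: Long): Double =
    cached match {
      case Some((weight, cachedAtMs)) =>
        val ageSeconds = (nowMs - cachedAtMs) / 1000.0
        weight * math.pow(0.5, ageSeconds / HalfLifeSeconds)
      case None => 0.5
    }
}
```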
Stats Lifecycle
- Stats older than `site-stats-max-age` (14 days) are pruned
- Per-category site count is capped at `max-sites-per-category` (5000)
- Stats are flushed to persistence every `flush-every` (5 seconds)
Final Category Score
categoryScore = classifierConfidence × rankerWeight
This score propagates to CandidateView.categoryScore and serves as the Thompson Sampling prior during cold start at serve time.
Phase 3: Bid Collection
For each selected category, the system fans out to all active campaigns and collects bids. This is the most distributed phase of the auction.
End-to-End Matching: From Ad Product to Page Content
To understand bid collection, it helps to see the full chain that connects an advertiser’s product to a publisher’s page. Here’s a concrete example with a gym campaign:
Campaign setup (happens once):
- Advertiser creates a campaign and selects ad product: “Gyms and Health Clubs” (IAB Ad Product 1512)
- `ContentToAdProductMapping.getContentForAdProduct("1512")` looks up the IAB mapping
- No direct mapping for 1512 → walks up to parent 1510 (Fitness Activities)
- 1510 maps to content categories {225, 227} (Fitness and Exercise, Running and Jogging)
- Campaign stores `categories = Set(225, 227)` — these are the content types this campaign will bid on
- CampaignDirectory registers the campaign under categories 225 and 227
- CategoryBidderEntity for categories 225 and 227 now knows this campaign exists
Page crawl (happens per page):
- SiteEntity collects demand categories from all active campaigns → {225, 227}
- `buildTaxonomyCandidates` expands these with descendants → {226, 227} (Participant Sports, Running and Jogging)
- This becomes the candidate list sent to Gemini — the LLM only sees categories that active campaigns are targeting
- Gemini classifies the page text using only those categories
- If it returns 225 or 227 with sufficient confidence, AuctioneerEntity fans out bid requests to CategoryBidderEntity
- CategoryBidderEntity routes to the gym campaign
- Campaign bids → candidate created → queued for publisher approval
Key design decisions:
- The LLM prompt is constrained to demand categories — it only classifies into categories that have active campaigns. This saves tokens and avoids classifying content nobody is advertising for.
- Hallucinated category IDs (where the LLM returns an ID not in the candidate list) are filtered out — only valid matches produce auctions.
- The advertiser never sees content categories. They pick their product; the IAB mapping handles the rest.
CategoryBidderEntity
Each (category, siteId) pair uses 5 virtual shards to distribute load. The shard is selected by hash(siteId) % 5, so the actual entity key is category|siteId|shardIndex.
CampaignDistributor
Within each CategoryBidderEntity, a CampaignDistributor manages fan-out to individual campaigns using 8 worker actors, routed by hash(categoryId) % 8.
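The two routing hashes can be sketched as plain functions; the exact hash used in the Scala source may differ, so the helpers below are assumptions about shape, not the actual implementation:

```scala
// Illustrative versions of the two routing hashes. The exact hash function in the
// Scala source may differ; floorMod keeps the result non-negative.
object RoutingSketch {
  val VirtualShards      = 5
  val DistributorWorkers = 8

  // CategoryBidderEntity key: category|siteId|shard, with shard = hash(siteId) % 5
  def categoryBidderEntityId(categoryId: String, siteId: String): String =
    s"$categoryId|$siteId|${Math.floorMod(siteId.hashCode, VirtualShards)}"

  // CampaignDistributor worker index: hash(categoryId) % 8
  def workerFor(categoryId: String): Int =
    Math.floorMod(categoryId.hashCode, DistributorWorkers)
}
```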
Bid Request → Response
Each CampaignEntity evaluates the request and responds with eligible creatives. The bid CPM is computed as:
bidCpm = max(maxCpm × bidMultiplier, floorCpm)
Where:
- `maxCpm`: The campaign’s configured maximum CPM (default: $5.00)
- `bidMultiplier`: The RL agent’s current multiplier, clamped to `[minMultiplier, maxMultiplier]`
- `floorCpm`: The system floor price (default: $0.50)
Taking the maximum against `floorCpm` guarantees the bid never falls below the system floor, even when the RL agent sets a very low multiplier.
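Written out as Scala, the computation is a clamp followed by a floored multiply (an illustrative transcription of the formula, not the production method):

```scala
// Illustrative transcription of the formula above; not the production method.
def bidCpm(maxCpm: Double, bidMultiplier: Double, floorCpm: Double,
           minMultiplier: Double, maxMultiplier: Double): Double = {
  val clamped = bidMultiplier.max(minMultiplier).min(maxMultiplier)
  math.max(maxCpm * clamped, floorCpm)
}

// With a very aggressive RL cut: bidCpm(2.00, 0.10, 0.50, 0.1, 2.0) == 0.50 (the floor wins)
```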
Eligibility Filters (Campaign-Side)
A CampaignEntity will not respond if any of these checks fail:
- Category mismatch: The page category is not in the campaign’s `categories` set — this is the primary filter. The campaign’s categories are derived from its Ad Product Taxonomy 2.0 ID via `ContentToAdProductMapping`, which maps to a set of Content Taxonomy 2.1 IDs. Matching is exact: `state.categories.contains(pageCategory)`
- Category blocklisted: The category is in the campaign’s `categoryBlocklist` (explicit exclusions)
- Status paused: Campaign `status != Active`
- Budget exhausted: `dailyBudget - (spendToday + bufferedSpend) <= 0`
- Day-aware check: If the calendar day changed since `lastResetInstant`, the budget is treated as fresh (reset happens lazily)
- Site blocklisted: Publisher’s site is on the advertiser’s `siteBlacklist`
- No matching sizes: None of the campaign’s `allowedSizes` fit the slot’s `AdSlotConfig` (width, height)
Aggregation Rules
The CategoryBidderEntity aggregates responses:
- CPM threshold: Only candidates within the top 20% of the highest CPM are kept: `cpm ≥ maxCpm × (1.0 - 0.20)`
- Campaign cap: Maximum 50 campaigns per category (`maxCampaignsPerCategory`), ranked by CPM descending
- One creative per campaign: The highest-CPM creative wins if a campaign has multiple eligible creatives
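Put together, the aggregation step is a group-reduce-filter over the bid responses. The sketch below uses a pared-down candidate stand-in and the constants named above:

```scala
// Pared-down stand-in for the real Candidate; only what the aggregation needs.
final case class BidCandidate(campaignId: String, creativeId: String, cpm: Double)

def aggregate(responses: Vector[BidCandidate],
              maxCampaignsPerCategory: Int = 50,
              cpmWindow: Double = 0.20): Vector[BidCandidate] = {
  // One creative per campaign: keep each campaign's highest-CPM creative.
  val bestPerCampaign = responses.groupBy(_.campaignId).values.map(_.maxBy(_.cpm)).toVector
  if (bestPerCampaign.isEmpty) Vector.empty
  else {
    // CPM threshold: keep candidates within the top 20% of the highest CPM ...
    val topCpm = bestPerCampaign.map(_.cpm).max
    bestPerCampaign
      .filter(_.cpm >= topCpm * (1.0 - cpmWindow))
      .sortBy(c => -c.cpm)                 // ... ranked by CPM descending
      .take(maxCampaignsPerCategory)       // campaign cap
  }
}
```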
Response Structure
Each eligible creative is wrapped in a Candidate:
Candidate(
creativeId: CreativeId,
campaignId: CampaignId,
advertiserId: AdvertiserId,
cpm: CPM, // bidCpm from above
category: CategoryId,
creativeHash: String,
landingDomain: String,
preApproved: Boolean,
frequencyCap: Option[Int],
adProductCategory: Option[AdProductCategoryId]
)
Phase 4: Candidate Shortlisting
This is the critical phase where Promovolve diverges from traditional auctions. Instead of selecting a single winner, it creates a shortlist of top-K candidates per slot for serve-time exploration, using a fair selection algorithm that guarantees per-campaign diversity.
Fair Candidate Selection Algorithm
The shortlisting algorithm ensures each campaign gets representation before any campaign gets a second slot:
1. Collect all CampaignBidResponses across all categories
2. Group by campaign → pick best creative per campaign (by CPM)
3. If #campaigns ≥ #slots:
Take top campaigns by CPM, one creative each
4. Else (fewer campaigns than slots):
Each campaign gets 1 slot (guaranteed representation)
Fill remaining slots with next-best creatives from existing campaigns
5. Record participating campaigns → Map[CampaignId, Set[URL]]
Why This Algorithm?
This ensures that 3 campaigns with 1 creative each will all be represented in a 3-slot configuration, rather than having one high-CPM campaign fill all 3 slots. Only when there are fewer campaigns than slots does any campaign get multiple creatives in the shortlist.
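The same walk-through in Scala, using a simplified bid type (the real code operates on CampaignBidResponse objects; this is only a sketch of the selection order):

```scala
// Simplified bid type; the real code works on CampaignBidResponse objects.
final case class Bid(campaignId: String, creativeId: String, cpm: Double)

def shortlist(bids: Vector[Bid], slots: Int): Vector[Bid] = {
  // Best creatives per campaign, campaigns ordered by their best CPM.
  val byCampaign = bids.groupBy(_.campaignId).values.toVector
    .map(_.sortBy(b => -b.cpm))
    .sortBy(group => -group.head.cpm)

  if (byCampaign.size >= slots)
    byCampaign.take(slots).map(_.head)              // top campaigns, one creative each
  else {
    val firstRound = byCampaign.map(_.head)         // every campaign is represented
    val fillers = byCampaign.flatMap(_.tail)        // next-best creatives
      .sortBy(b => -b.cpm)
      .take(slots - firstRound.size)
    firstRound ++ fillers
  }
}
```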
Campaign Participation Tracking
AuctioneerEntity maintains:
participatingCampaigns: Map[CampaignId, Set[URL]]
This enables targeted re-auction: when a campaign’s state changes, the system knows exactly which pages are affected.
CandidateView Structure
Each shortlisted candidate is stored as a CandidateView:
CandidateView(
creativeId: CreativeId,
campaignId: CampaignId,
advertiserId: AdvertiserId,
assetUrl: CDNPath, // URI to CDN-hosted creative asset
mime: MimeType, // imageJpeg, imagePng, imageGif, imageWebp, videoMp4
width: Int,
height: Int,
category: CategoryId,
cpm: CPM,
classifiedAtMs: Long, // when the page content was classified
categoryScore: Double, // classifierConfidence × rankerWeight (default 0.5)
frequencyCap: Option[Int],
adProductCategory: Option[AdProductCategoryId],
landingDomain: String
)
Note: impression and click statistics are tracked separately in CreativeStats at the AdServer level, not stored in the CandidateView itself. This allows stats to accumulate across auction cycles.
Standard Ad Sizes
Promovolve supports IAB standard sizes defined as AdSize opaque type (Int, Int):
| Name | Size |
|---|---|
| Medium Rectangle | 300 × 250 |
| Leaderboard | 728 × 90 |
| Wide Skyscraper | 160 × 600 |
| Mobile Banner | 320 × 50 |
| Billboard | 970 × 250 |
| Half Page | 300 × 600 |
| Large Mobile Rectangle | 320 × 100 |
Image assets are subject to the IAB LEAN Ad limit: max file size 50 KiB (promovolve.image-limits.max-file-size, configurable via IMAGE_MAX_FILE_SIZE).
Phase 5: ServeIndex Caching
The final auction phase stores shortlisted candidates in the distributed in-memory cache (ServeIndex) for instant retrieval at serve time.
ServeIndex Write
After shortlisting, AuctioneerEntity writes the candidate set to ServeIndex:
Key: siteId|slotId
Value: ServeView(
candidates: Vector[CandidateView],
version: Long, // auction timestamp
expiresAtMs: Long // currentTimeMillis + TTL
)
Write Semantics
| Operation | Consistency | Use Case |
|---|---|---|
| Put (full replacement) | WriteLocal | Fresh auction results |
| Append (single candidate) | WriteLocal + dedup by creativeId | Adding orphaned creative |
| Remove | WriteMajority(800ms) + retry (max 5, backoff 200ms) | Creative/campaign takedown |
| CPM update | WriteLocal | Best-effort CPM refresh |
| FilterByCreativeIds | WriteLocal | Keep only valid creatives |
TTL
Each entry has a default TTL of 120 minutes. On budget exhaustion events, TTL is refreshed to dayDurationSeconds × 1.1 × 1000ms to ensure entries survive until the next daily budget reset.
Replication
ServeIndex uses Pekko DData with gossip-based replication:
- Gossip interval: 2 seconds
- Notify subscribers: 500ms
- Max delta elements: 500 per gossip round
Every API node gets a complete local copy within seconds of a write.
Bucketing
Entries are partitioned into 32 buckets (power-of-2) by hash of the key. Each bucket is an independent LWWMap[String, ServeView]. This keeps CRDT delta sizes small — an update to one bucket doesn’t generate deltas for entries in other buckets.
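For a power-of-2 bucket count, the bucket index is just the low bits of the key hash. A sketch, with a hypothetical DData key naming scheme:

```scala
// With 32 buckets, the bucket index is the low 5 bits of the key's hash. The DData key
// naming below is a hypothetical illustration, not the actual key scheme.
object ServeIndexBuckets {
  val BucketCount = 32                              // must stay a power of two

  def bucketFor(key: String): Int =
    key.hashCode & (BucketCount - 1)                // always in 0..31, even for negative hashes

  def ddataKeyFor(siteId: String, slotId: String): String =
    s"serve-index-${bucketFor(s"$siteId|$slotId")}"
}
```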
Removal Operations
ServeIndex supports granular removal:
- `RemoveCampaignFromKey`: Remove all candidates from a specific campaign across slots
- `RemoveCreativeFromKey`: Remove a specific creative across all slots
- `RemoveBySite`: Batch removal for all slots on a site
All removals use WriteMajority with retries for durability.
Re-Auction & Event Triggers
Between crawl cycles, the system keeps the ServeIndex fresh through periodic and event-driven re-auctions.
Periodic Re-Auction
AuctioneerEntity runs a full re-auction every 5 minutes (promovolve.auction.reauction-interval, env: REAUCTION_INTERVAL) for all pages within the 48-hour content recency window.
Event-Driven Re-Auction
Campaign-Level Events (Targeted)
These trigger re-auction only for pages where the affected campaign participated (using the participatingCampaigns map):
| Event | ServeIndex Action | Re-Auction Scope |
|---|---|---|
| `CampaignBudgetExhausted` | Refresh TTL (keep entry) | Participating pages |
| `CampaignBudgetReset` | Refresh TTL | Participating pages |
| `CampaignPaused` | Remove from ServeIndex | Participating pages |
| `CampaignAdProductChanged` | Remove from ServeIndex | Participating pages |
| `CpmUpdated` | Update CPM in ServeIndex | Participating pages |
| `CreativeStatusChanged(isActive=false)` | Remove creative | Participating pages |
Advertiser-Level Events (Full Site)
These affect all campaigns under an advertiser:
| Event | ServeIndex Action | Re-Auction Scope |
|---|---|---|
| `AdvertiserBudgetExhausted` | Refresh TTL (keep entry) | All recent pages on site |
| `AdvertiserBudgetReset` | Refresh TTL | All recent pages on site |
| `AdvertiserSuspended` | Remove from ServeIndex | All recent pages on site |
Budget Exhaustion: Keep, Don’t Remove
When a campaign or advertiser budget is exhausted, creatives are not removed from ServeIndex. Instead:
- TTL is refreshed to `dayDurationSeconds × 1.1 × 1000ms` (extends past next budget reset)
- The serve-time pacing gate checks budget before serving
- If budget is exhausted, the candidate is skipped and Thompson Sampling selects another
- When budget resets (next day), the creative resumes serving without re-auction
Why? Budget exhaustion is temporary — budgets reset daily. Removing and re-inserting entries would:
- Create unnecessary DData churn (WriteMajority removes are expensive)
- Lose the creative’s approval status
- Require a full re-auction to restore the entry
Permanent Removal
Only these events warrant actual removal from ServeIndex:
- Creative paused/deactivated
- Campaign paused
- Campaign ad product category changed (may violate publisher blocklist)
- Advertiser suspended
These use WriteMajority consistency with up to 5 retries and 200ms initial backoff.
Content Cleanup
Every 5 minutes, AuctioneerEntity prunes classifications older than 48 hours from its internal Map[URL, Classification], ensuring stale content naturally ages out.
Published Events
Re-auction and budget events are published as domain events (extending CborSerializable) for cross-entity coordination:
- `SpendUpdate`: Published every ~500ms or 20 events from CampaignEntity, includes `dailyBudget`, `todaySpend`, `dayStart`
- `PendingCreativesQueued`: Triggers SSE notifications for publisher approval workflow
Why Multi-Candidate?
The decision to keep multiple candidates per slot — rather than selecting a single auction winner — is the most important architectural choice in Promovolve.
The Problem with Single-Winner Auctions
In a traditional ad exchange, each auction produces one winner:
- Exploitation trap: The highest bidder always wins, even with terrible CTR
- No exploration: No mechanism to discover if a lower-bidding creative performs better
- Fragile serving: If the winner’s budget runs out, the system must re-auction or show nothing
- Misaligned incentives: Exchange optimizes for revenue, not user experience
How Multi-Candidate Solves This
Promovolve’s fair selection algorithm guarantees per-campaign diversity (one creative per campaign first, then fill remainder), and Thompson Sampling explores among them at serve time.
The Scoring Formula
score = sampledCTR × log(1 + CPM)
Where sampledCTR is drawn from Beta(clicks + 1, non_clicks + 1) using time-bucketed statistics (1-minute granularity, 60-minute rolling window).
Exploration in Action
Slot candidates after fair selection:
Campaign A: CPM $5.00, Beta(6, 146) → sample: 0.032
Campaign B: CPM $4.20, Beta(3, 19) → sample: 0.091
Campaign C: CPM $3.80, Beta(1, 1) → sample: 0.647
Scores:
A: 0.032 × log(6.00) = 0.032 × 1.79 = 0.057
B: 0.091 × log(5.20) = 0.091 × 1.65 = 0.150
C: 0.647 × log(4.80) = 0.647 × 1.57 = 1.016
→ C wins (exploration of unknown creative)
Graceful Degradation
When Campaign A exhausts its budget:
- Pacing gate checks budget before Thompson Sampling
- Campaign A is filtered out
- Thompson Sampling runs on B and C only
- No re-auction needed — no DData operations
- When A’s budget resets next day, it resumes serving (entry was kept in ServeIndex)
Publisher Alignment
The sampledCTR factor naturally favors creatives users actually click on. High-CPM but low-CTR creatives lose to engaging ones over time, aligning publisher interests (engagement, user trust) with advertiser interests (actual clicks).
The Trade-off
Multi-candidate selection means the highest bidder doesn’t always win. This reduces short-term CPM revenue but increases:
- Long-term revenue: Better CTR → more clicks → better campaign ROI → higher advertiser retention
- System resilience: Fallback candidates reduce re-auction frequency
- Learning: Thompson Sampling converges to the best performer without any exploration rate to tune
Publisher Creative Approval
In traditional ad tech, publishers have no say over what appears on their site. The exchange picks a winner, and the publisher’s ad server renders it — sight unseen. If an inappropriate ad slips through, the publisher’s only recourse is to file a complaint after the fact.
Promovolve inverts this. Every creative must be approved by the publisher before it can be shown to readers. This isn’t a bolt-on compliance feature — it’s a core design constraint that shapes the auction system, the multi-candidate architecture, and the serving pipeline.
Why Approval Matters
Magazine advertising always had publisher approval. An editor at a cooking magazine would review every ad before it ran — no gambling ads next to a recipe, no competitor ads next to a feature article. The publisher’s editorial judgment was part of the product.
Promovolve restores this for the web. A publisher running a Japanese travel blog can:
- Approve a ryokan ad that complements their Kyoto temple article
- Reject a fast-food chain ad that doesn’t fit their editorial voice
- Block entire ad product categories (gambling, alcohol) site-wide
- Revoke a previously approved creative if their standards change
This is also why Promovolve uses multi-candidate auctions. If the system only picked one winner and the publisher rejected it, the slot would be empty. With multiple candidates queued, rejecting one simply promotes the next.
The Approval Lifecycle
A creative goes through distinct states as it moves through the system:
stateDiagram-v2
[*] --> Pending: Auction Result
Pending --> Approved: Publisher approves
Approved --> Serving: Added to ServeIndex
Pending --> Rejected: Publisher rejects
Approved --> Revoked: Publisher revokes
1. Auction produces candidates
The AuctioneerEntity shortlists multiple candidates per ad slot and sends them to the AdServer. Each candidate carries a preApproved flag — but the AdServer doesn’t trust it blindly.
2. AdServer determines actual approval status
Instead of relying on the preApproved flag (which comes from a probabilistic Cuckoo filter and can have false positives), the AdServer queries the ServeIndex to see which creatives are actually serving:
existingCreativeIds =
creatives in this slot's ServeIndex
+ creatives approved at any other slot on this site (inverted index)
+ creatives loaded from DB on startup (persisted approvals)
This three-source merge means:
- A creative approved at one slot is recognized site-wide
- Approvals survive process restarts (loaded from PostgreSQL)
- Re-auctions don’t re-queue already-approved creatives
3. Partition: approved vs pending
The AdServer partitions candidates into two groups:
approved = candidates whose creativeId is in existingCreativeIds
pending = everything else
Approved creatives go straight to the ServeIndex — they’re already trusted. The AdServer fetches their category scores from the TaxonomyRankerEntity, builds CandidateView objects with CDN asset URLs and dimensions, and writes them to DData. They can be served immediately.
Pending creatives are queued in the PendingSelectionStore (PostgreSQL) for the publisher to review. They cannot be served until approved.
4. Blocklist filtering
Before any of this happens, candidates are filtered against two blocklists:
- Domain blocklist: Publishers can block specific landing domains. A creative linking to a competitor’s site is filtered out before it ever reaches the pending queue.
- Ad product category blocklist: Publishers can block entire product categories (e.g., gambling, alcohol, firearms). Distributed via DData, this filter runs at auction time.
Blocked creatives are silently dropped — they never appear in the publisher’s approval queue.
The Pending Queue
The pending queue is the publisher’s inbox for new creatives. It’s persisted in PostgreSQL (table: pending_selection) so it survives restarts.
Data model
Each pending entry is a Selection — an ordered list of candidates for a specific (publisher, URL, slot) combination:
Selection
publisherId: String
url: String
slotId: String
ordered: Vector[Candidate] — ranked by CPM
idx: Int — index of current candidate being reviewed
state: Pending
expiresAt: Instant — TTL-based expiration
The idx pointer tracks which candidate the publisher is currently reviewing. When a creative is rejected, the pointer advances to the next candidate.
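A pared-down sketch of the Selection model and the promote step (field names follow the listing above; the persistence layer is omitted):

```scala
import java.time.Instant

// Pared-down Selection model; field names follow the listing above, persistence omitted.
final case class PendingCandidate(creativeId: String, cpm: Double)

final case class Selection(publisherId: String,
                           url: String,
                           slotId: String,
                           ordered: Vector[PendingCandidate], // ranked by CPM
                           idx: Int,                          // candidate under review
                           expiresAt: Instant) {
  def current: Option[PendingCandidate] = ordered.lift(idx)

  // rejectAndPromote: advance the pointer; None means the queue is exhausted
  // and the slot should be re-auctioned.
  def rejectAndPromote: Option[Selection] =
    if (idx + 1 < ordered.size) Some(copy(idx = idx + 1)) else None
}
```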
Key operations
| Operation | What happens |
|---|---|
| `upsertPending` | Write/overwrite a pending selection for a slot |
| `getPending` | Fetch current pending for a slot |
| `pendingQueue` | List all pending items for a publisher (for the dashboard) |
| `removeCreativeFromPending` | Remove a specific creative after approval, keep the rest |
| `rejectAndPromote` | Reject current candidate, advance to next in queue |
| `purgeExpired` | Clean up expired selections (TTL-based) |
| `flagCreative` | Quarantine a creative with a reason (for later review) |
| `unflagCreative` | Return a quarantined creative to the pending queue |
Budget exhaustion cleanup
When a campaign or advertiser runs out of budget, their creatives are removed from the pending queue — there’s no point asking the publisher to review an ad that can’t pay:
| Event | Cleanup |
|---|---|
| Campaign budget exhausted | removeByCampaignId — remove all pending creatives for this campaign |
| Advertiser budget exhausted | removeByAdvertiserId — remove all pending creatives for this advertiser |
| Creative paused | removeCreativeFromAll — remove from all pending slots |
| Landing domain blocked | removeByLandingDomain — remove all creatives with this domain |
| Ad product category blocked | removeByAdProductCategory — remove all creatives in this category |
The Three Publisher Actions
Approve
The publisher reviews a pending creative and approves it:
- Validate the creative ID matches the current candidate in the queue
- Fetch category scores from TaxonomyRankerEntity
- Build a `CandidateView` with CDN asset URL, dimensions, and metadata
- Append to ServeIndex via DData — the creative is now live
- Persist approval to PostgreSQL (`insertApproved`) — survives restarts
- Update AdvertiserEntity with `ApprovalStatus.Approved`
- Remove from pending queue
- Broadcast SSE event: `approved`
The creative begins serving to readers on the next page load.
Reject
The publisher reviews a pending creative and rejects it:
- Update AdvertiserEntity with `ApprovalStatus.Rejected` — recorded in a Bloom filter so the creative won’t be re-submitted in future auctions for this site
- Remove from ServeIndex (if it was somehow there)
- Call `rejectAndPromote` to advance the queue to the next candidate
- If the queue is exhausted (no more candidates), trigger a re-auction so other campaigns can fill the slot
- Broadcast SSE event: `rejected`
Rejection is permanent for this site — the Bloom filter prevents the same creative from appearing in future pending queues.
Revoke
The publisher changes their mind about a previously approved creative:
- Remove from ServeIndex — the creative stops serving immediately
- Clear from both approved and rejected filters in AdvertiserEntity
- Broadcast SSE event: `revoked`
Unlike rejection, revocation is reversible — the creative can be re-queued for approval later (e.g., after the advertiser updates it).
Bulk approve
For publishers who trust an advertiser or want to quickly clear their queue:
POST /v1/publishers/{publisherId}/sites/{siteId}/creatives/bulk-approve
Approves all pending creatives for a slot in one operation. Each creative goes through the same approval flow (ServeIndex update, DB persistence, AdvertiserEntity notification). A single SSE event (bulk-approved) is broadcast with the count.
Real-Time Notifications (SSE)
Publishers don’t have to poll for new creatives. Promovolve streams events in real time via Server-Sent Events:
GET /v1/publishers/{publisherId}/sites/{siteId}/events
Event types
| Event | When | Payload |
|---|---|---|
| `pending-updated` | New creatives queued for review | siteId, url, slotId, count, topCreativeId |
| `approved` | Creative approved and now serving | siteId, url, slotId, creativeId |
| `rejected` | Creative rejected | siteId, url, slotId, creativeId |
| `bulk-approved` | Multiple creatives approved at once | siteId, url, slotId, approvedCount |
| `revoked` | Approval revoked, creative removed from serving | siteId, creativeId |
| `creative-status-changed` | Creative paused or reactivated by advertiser | creativeId, campaignId, status |
| `campaign-status-changed` | Campaign status changed | campaignId, status |
| `heartbeat` | Keep-alive ping | (empty, every 30 seconds) |
Architecture
The PendingEventHub actor manages SSE subscribers grouped by site:
PendingEventHub
└── subscribers: Map[siteId → Set[ActorRef[PendingEvent]]]
- Site-specific events (pending, approved, rejected) go to subscribers for that site
- Cross-site events (creative-status-changed, campaign-status-changed) broadcast to all subscribers
- Subscribers auto-unsubscribe when the SSE stream terminates
- Stale subscribers are cleaned up via actor death-watch
Pre-Approved: The Auction Tiebreaker
When the AuctioneerEntity sorts candidates, pre-approved creatives get a tiebreaker advantage:
sort key = (-CPM, if preApproved then 0 else 1)
At equal CPM, a pre-approved creative ranks higher than an unapproved one. This has two effects:
- Faster time-to-serve: Pre-approved creatives skip the pending queue and go straight to the ServeIndex, so they start earning impressions sooner
- Re-auction stability: When a re-auction runs, already-approved creatives maintain their position rather than being displaced by new, unapproved ones that would sit in the queue
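In code, the tiebreaker is just a tuple sort key; a minimal sketch with an illustrative candidate type:

```scala
// Tuple sort key: negate CPM for descending order, then 0 beats 1, so a pre-approved
// creative wins ties at equal CPM.
final case class Ranked(creativeId: String, cpm: Double, preApproved: Boolean)

def orderCandidates(cs: Vector[Ranked]): Vector[Ranked] =
  cs.sortBy(c => (-c.cpm, if (c.preApproved) 0 else 1))
```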
How Approval Enables Multi-Candidate Auctions
The approval workflow is the reason Promovolve uses multi-candidate auctions in the first place. Consider the alternative:
Single-winner auction without approval: The exchange picks one winner. It starts serving immediately. The publisher sees an ad for online gambling on their children’s education blog. Damage done.
Single-winner auction with approval: The exchange picks one winner. The publisher rejects it. The slot is empty until the next auction. Readers see no ad. Revenue is zero.
Multi-candidate auction with approval: The auction shortlists three candidates. The publisher rejects the first one. The second candidate is already queued and ready. The slot is never empty. Revenue continues. The publisher maintains editorial control.
This is the design that makes approval practical at scale — without it, publisher approval would mean empty slots and lost revenue every time a creative is rejected.
Approval Persistence
Approvals are stored in two places for different purposes:
| Storage | Purpose | Survives restart? |
|---|---|---|
| ServeIndex (DData) | Fast serve-time lookups | No (ephemeral, rebuilt from auctions) |
| PostgreSQL (`approved_creatives`) | Approval state of record | Yes |
| `keysByCreative` (in-memory inverted index) | Site-wide approval recognition | No (rebuilt from ServeIndex on startup) |
| `persistedApprovedIds` (loaded from DB) | Bootstrap approvals on startup | Yes (loaded from PostgreSQL) |
On startup, the AdServer loads persistedApprovedIds from PostgreSQL. When a re-auction runs, creatives in this set are recognized as already approved and skip the pending queue — the publisher doesn’t have to re-approve creatives they already reviewed.
Thompson Sampling from Scratch
You have three ads for a travel blog. You don’t know which one readers prefer. How do you find out — without wasting thousands of impressions on the worst one?
This is the multi-armed bandit problem, and Thompson Sampling is Promovolve’s answer. This chapter builds the intuition from zero.
The Problem: Explore vs Exploit
Imagine you’re in a casino with three slot machines. Each has a different (unknown) payout rate. You have 1,000 coins. How do you maximize your winnings?
- Pure exploitation: Play the first machine that pays out, stick with it forever. Problem: you might have gotten lucky. The other machines might be better.
- Pure exploration: Play all three equally, 333 times each. Problem: you’re wasting coins on the worst machine long after you know it’s bad.
The optimal strategy is somewhere in between: explore early to learn which machine is best, then gradually shift to exploiting the best one. This is the explore-exploit trade-off.
In Promovolve, the “slot machines” are ad creatives. The “payout” is whether the user clicks. The “coins” are impressions — each one costs the advertiser money and uses the publisher’s ad slot. You want to show the best-performing creative most of the time, but you also need to try new or uncertain ones to learn if they’re better.
A Bad Solution: A/B Testing
The standard approach to “which creative is best?” is A/B testing: show each creative to an equal number of users, wait for statistical significance, then declare a winner and show it to everyone.
This works, but it’s wasteful. If creative A has a 5% click rate and creative B has a 0.5% click rate, equal splitting means half your traffic sees the bad creative for the entire test duration. And when the test ends, you stop learning — if a new creative arrives, you need to start a new test.
What you really want is a system that:
- Tries each creative a few times
- Quickly figures out which ones are good
- Shifts traffic toward the good ones
- Never completely stops trying — in case a creative’s performance changes
Thompson Sampling does all of this automatically.
The Key Idea: Uncertainty as Exploration
Here’s the core insight. Instead of tracking a single number (“creative A has a 5% click rate”), track your uncertainty about that number.
After 2 impressions and 1 click, you think creative A has about a 50% click rate — but you’re not very sure. It could be anywhere from 10% to 90%.
After 200 impressions and 10 clicks, you think it has about a 5% click rate — and you’re fairly confident. It’s probably between 3% and 8%.
Thompson Sampling uses this uncertainty directly: sample a random value from each creative’s uncertainty distribution, then pick the creative with the highest sample.
A creative you know little about has a wide distribution — sometimes it samples high, sometimes low. So it gets tried occasionally (exploration). A creative with lots of data has a narrow distribution centered on its true click rate. It samples consistently near its actual performance (exploitation).
Exploration happens naturally, proportional to uncertainty. No tuning parameters. No explicit explore/exploit switch.
Beta Distributions: Modeling Click Rates
A click rate is a probability: a number between 0 and 1. The Beta distribution is the natural way to represent uncertainty about a probability.
A Beta distribution has two parameters: α (alpha) and β (beta).
- α counts “successes” (clicks) plus a prior
- β counts “failures” (impressions without clicks) plus a prior
Starting from Beta(1, 1) — a uniform distribution, meaning “I have no idea, any click rate from 0% to 100% is equally possible” — each observation updates the distribution:
Start: Beta(1, 1) — uniform, total ignorance
1 click: Beta(2, 1) — probably high CTR, but uncertain
1 no-click: Beta(2, 2) — back toward 50%, still very uncertain
8 no-clicks: Beta(2, 10) — probably low CTR (~17%), getting more sure
2 clicks: Beta(4, 10) — ~29%, narrowing
The mean of Beta(α, β) is α / (α + β). But Thompson Sampling doesn’t use the mean — it samples a random value from the distribution. That’s what makes it work.
A Worked Example
Three creatives cached for a travel blog ad slot. After some impressions:
| Creative | Impressions | Clicks | Distribution | Mean CTR |
|---|---|---|---|---|
| A (hotel ad) | 150 | 6 | Beta(7, 145) | 4.6% |
| B (tour ad) | 20 | 2 | Beta(3, 19) | 13.6% |
| C (new airline ad) | 0 | 0 | Beta(1, 1) | 50.0% |
A reader loads the page. Thompson Sampling draws one random sample from each:
A: sample from Beta(7, 145) → 0.038 (probably near its true ~5%)
B: sample from Beta(3, 19) → 0.091 (wide distribution, sampled somewhat high)
C: sample from Beta(1, 1) → 0.647 (uniform distribution, sampled very high)
Creative C wins this round — not because we think it has a 65% click rate, but because we know nothing about it and the sample happened to be high. This is exploration.
If C turns out to have low CTR, after 20 impressions its distribution narrows (say, Beta(2, 19)) and it stops sampling high. If C actually has great CTR, its distribution stays high and it earns more impressions.
The system converges on the truth without ever deciding to “start a test” or “end a test.”
Scoring: Combining CTR with Bid Price
Click rate isn’t the only thing that matters. The publisher also cares about revenue. A creative with 2% CTR at $8 CPM might be more valuable than one with 3% CTR at $2 CPM.
Promovolve’s scoring formula balances both:
score = sampledCTR × log(1 + CPM)
Why log(1 + CPM) instead of just CPM?
Consider two creatives:
- A: $2 CPM, 4% CTR → score = 0.04 × log(3) = 0.04 × 1.10 = 0.044
- B: $10 CPM, 1% CTR → score = 0.01 × log(11) = 0.01 × 2.40 = 0.024
The logarithm compresses the CPM range. Bidding 5× more doesn’t give you 5× the score — it gives you about 2× the score. This means a creative has to actually perform well (high CTR) to win consistently. You can’t just outbid everyone with a terrible ad.
This aligns publisher and advertiser incentives: the publisher gets revenue AND engaged readers, not just the highest bidder’s money.
Cold Start: What About Brand New Creatives?
When a creative has zero impressions, its distribution is Beta(1, 1) — uniform. It can sample anywhere from 0 to 1. This gives it a natural exploration boost, but it’s random.
Promovolve adds structured cold start strategies:
Full cold start (all creatives have 0 impressions): Use the category score from the auction as a prior, plus small random noise. This bootstraps from the content classification — a travel ad on a travel page starts with a reasonable guess.
Warmup (all creatives have fewer than 10 impressions): Round-robin by impression count. The creative with the fewest impressions goes next. This ensures every creative gets a minimum number of trials before Thompson Sampling takes over.
Partial cold start (mix of new and established creatives): New creatives get a 30% exploration boost — they’re selected with epsilon-greedy probability even if their sample is low.
Steady state (all creatives have 10+ impressions): Pure Thompson Sampling. The distributions are informative enough to drive good decisions.
These strategies are in ThompsonSampling.scala, specifically the select() method.
Time-Bucketed Statistics
Click rates change over time. An ad that performed well last week might be stale now. Promovolve uses a 60-minute rolling window with 1-minute granularity:
- Impressions and clicks are recorded in 1-minute buckets
- When scoring, only the last 60 buckets are counted
- Older data automatically falls off
This means:
- The system adapts within an hour if a creative’s performance changes
- A creative that was good this morning but bad this afternoon gets corrected quickly
- The Beta distribution is always based on recent, relevant data
Where Thompson Sampling Sits in the Pipeline
Thompson Sampling doesn’t run on every request. It sits behind several gates:
Request arrives
→ Content recency check (is the page fresh enough?)
→ Frequency cap check (has this user seen this ad too many times?)
→ Rate tracking (record this request for pacing)
→ Pacing gate (PI controller: should we serve or skip?)
→ Thompson Sampling (which creative to show?)
→ Budget reservation (does the winning campaign have budget?)
→ Serve the ad
The pacing gate runs before Thompson Sampling. This is important: if the pacing controller decides to skip this request (to conserve budget), Thompson Sampling never runs. This prevents exploration from being wasted on throttled requests.
Why Thompson Sampling Over Other Approaches
vs Epsilon-Greedy (ε-greedy): Show the best creative 90% of the time, random 10%. Simple, but the exploration rate is fixed and doesn’t adapt. A creative you’ve shown 10,000 times still gets explored at the same rate as one you’ve shown 10 times. Thompson Sampling naturally explores uncertain creatives more.
vs UCB (Upper Confidence Bound): Pick the creative with the highest mean + confidence_bonus. Deterministic — same state always picks the same creative. Thompson Sampling’s randomness is a feature: two users loading the same page at the same time might see different creatives, which produces diverse data faster.
vs A/B Testing: Fixed allocation, fixed duration, manual setup. Thompson Sampling is continuous, adaptive, and automatic.
From Theory to Code
| Concept | File | Key method |
|---|---|---|
| Beta sampling (Marsaglia-Tsang) | ThompsonSampling.scala | sampleBeta() |
| Score = sampledCTR × log(1+CPM) | ThompsonSampling.scala | score() |
| Cold start strategies | ThompsonSampling.scala | select() |
| Time-bucketed creative stats | AdServer.scala | CreativeStats |
| Pacing gate before TS | SelectionLogic.scala | shouldServe() then select() |
The next chapters cover each of these components in detail with exact formulas and configuration values.
Thompson Sampling (MAB)
Thompson Sampling is the core serve-time algorithm in Promovolve. It selects which creative to show from the shortlisted candidates, balancing exploration of uncertain options with exploitation of known performers.
The Algorithm
For each serve request (after pacing gate and frequency cap filtering):
For each candidate c in the slot:
stats = creativeStats[c.creativeId] // 1-minute bucketed, 60-min window
impressions = stats.totalImpressions
clicks = stats.totalClicks
if impressions == 0:
sampledCTR = categoryScore + random(-0.15, +0.15)
else:
α = clicks + 1
β = impressions - clicks + 1
sampledCTR = sampleBeta(α, β)
score = sampledCTR × log(1 + CPM)
Select candidate with highest score
The log(1 + CPM) factor ensures bid price matters with diminishing returns — a $10 CPM is not 10x better than a $1 CPM.
Time-Bucketed Statistics
Unlike simple counters, Promovolve tracks impressions and clicks in 1-minute time buckets over a 60-minute rolling window:
case class CreativeStats(
buckets: Map[Long, (Int, Int)] = Map.empty, // minute → (impressions, clicks)
windowMinutes: Int = 60
)
On each impression or click:
val minute = now.getEpochSecond / 60
val (imps, clks) = buckets.getOrElse(minute, (0, 0))
// Update the relevant counter, then prune old buckets:
buckets.filter { case (min, _) => min > cutoffMinute }
Why time-bucketed?
- Automatic recency: old data prunes naturally, no manual decay needed
- Late click handling: a click at 10:22 for an impression at 10:15 creates a new bucket entry — both contribute to totals
- Clean window: exactly 60 minutes of data, not “all time” which would make exploration decay too slowly
- Persistence: stats snapshot to DB hourly, loaded on startup via CreativeStatsLoaded
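Putting the case class and update logic above together, a minimal self-contained sketch might look like the following. Method names here (recordImpression, recordClick, the total* accessors) are illustrative and not necessarily the exact AdServer.scala API.

// Sketch of time-bucketed stats: record into the current minute bucket, prune old
// buckets, and aggregate over the window. Method names are illustrative.
final case class CreativeStats(
    buckets: Map[Long, (Int, Int)] = Map.empty, // minute → (impressions, clicks)
    windowMinutes: Int = 60
) {
  private def bump(epochSecond: Long)(f: ((Int, Int)) => (Int, Int)): CreativeStats = {
    val minute  = epochSecond / 60
    val updated = buckets.updated(minute, f(buckets.getOrElse(minute, (0, 0))))
    val cutoff  = minute - windowMinutes
    copy(buckets = updated.filter { case (min, _) => min > cutoff }) // prune old buckets
  }

  def recordImpression(epochSecond: Long): CreativeStats =
    bump(epochSecond) { case (imps, clks) => (imps + 1, clks) }

  def recordClick(epochSecond: Long): CreativeStats =
    bump(epochSecond) { case (imps, clks) => (imps, clks + 1) }

  // These aggregates feed Beta(clicks + 1, impressions - clicks + 1).
  def totalImpressions: Int = buckets.values.map(_._1).sum
  def totalClicks: Int      = buckets.values.map(_._2).sum
}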
Selection Pipeline Position
Thompson Sampling runs after the pacing gate and frequency cap filter:
ServeIndex lookup → Content recency → Frequency cap → Pacing gate → Thompson Sampling → Budget reservation
This ordering is critical — the pacing gate decides whether to serve at all (volume gating), while Thompson Sampling decides which creative to show (choice). Running pacing before TS prevents exploration bias.
Sub-chapters
- Beta-Bernoulli Model — the probabilistic model behind Thompson Sampling
- Scoring Formula — why sampledCTR × log(1 + CPM) and not alternatives
- Cold Start Strategies — handling candidates with zero or few impressions
- Beta Distribution Sampling — the Marsaglia-Tsang method used in production
Beta-Bernoulli Model
Thompson Sampling in Promovolve uses a Beta-Bernoulli conjugate model to represent uncertainty about each candidate’s click-through rate (CTR).
The Model
Each ad impression is a Bernoulli trial: click (success) or no click (failure). The unknown CTR p is represented by a Beta distribution.
Conjugacy
The Beta distribution is the conjugate prior for the Bernoulli likelihood:
Prior: Beta(α, β)
Likelihood: Bernoulli(p)
Posterior: Beta(α + clicks, β + non_clicks)
Updates are trivial — just add counts. No MCMC, no variational inference, no gradient descent. Critical for serve-time performance.
Prior
Promovolve uses Beta(1, 1) — uniform over [0, 1]:
Beta(1, 1) = Uniform(0, 1)
Mean: 0.5
Variance: 0.083
→ Maximum uncertainty
Posterior from Time-Bucketed Stats
The posterior uses aggregated statistics from the 60-minute rolling window of 1-minute buckets:
impressions = sum of all bucket impression counts
clicks = sum of all bucket click counts
Posterior: Beta(clicks + 1, impressions - clicks + 1)
Posterior Evolution
After 0 impressions: Beta(1, 1) mean=0.500 — wide, pure exploration
After 10 imp, 1 click: Beta(2, 10) mean=0.167 — starting to narrow
After 100 imp, 3 clk: Beta(4, 98) mean=0.039 — fairly confident
After 1000 imp, 30 clk: Beta(31, 971) mean=0.031 — very confident
As data accumulates, the variance shrinks and samples cluster near the true CTR. This automatically reduces exploration for well-known creatives and maintains exploration for uncertain ones.
60-Minute Window Effect
Because stats are windowed to 60 minutes, the posterior resets as old data prunes. A creative that performed well an hour ago but has no recent data returns to higher uncertainty, enabling re-exploration. This is appropriate because CTR can vary by time of day, competing content, and audience composition.
Why Not Just Use Mean CTR?
Using the mean (greedy strategy) would never explore. Once a creative gets lucky with early clicks, it dominates forever. Thompson Sampling uses the full distribution — the variance captures uncertainty and drives exploration proportionally.
Scoring Formula
The Thompson Sampling score from the source code:
score = sampledCTR * math.log(1.0 + cpm)
Why Multiply CTR and CPM?
The score represents expected value combining:
- sampledCTR: Likelihood of engagement (publisher value)
- CPM: Advertiser willingness to pay (revenue value)
Both dimensions must be reasonable. A $100 CPM with 0.001% CTR scores poorly. A 10% CTR with $0.10 CPM also scores poorly.
Why log(1 + CPM)?
Diminishing Returns
$10 / $1 = 10x advantage (linear)
log(11) / log(2) = 2.40 / 0.69 = 3.5x advantage (log)
The log compresses CPM differences, preventing high bidders from overwhelming better-performing creatives.
The +1 Offset
log(CPM) would be undefined at CPM=0 and negative for CPM < 1. The +1 ensures:
- CPM = 0 → log(1) = 0 (free ads score zero)
- CPM = 1 → log(2) = 0.69
- CPM = 10 → log(11) = 2.40
Numerical Examples
| Candidate | CPM | True CTR | Sample | log(1+CPM) | Score |
|---|---|---|---|---|---|
| A | $8.00 | 2.0% | 0.025 | 2.20 | 0.055 |
| B | $3.50 | 4.5% | 0.038 | 1.50 | 0.057 |
| C | $1.20 | 7.0% | 0.082 | 0.79 | 0.065 |
Despite Campaign C paying 6.7x less than A, it wins because its CTR advantage outweighs the log-compressed CPM difference. This is the publisher-aligned outcome.
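These scores are straightforward to reproduce (natural log, as throughout this chapter):

// Recomputing the table: score = sample × log(1 + CPM)
val rows = List(("A", 8.00, 0.025), ("B", 3.50, 0.038), ("C", 1.20, 0.082))
rows.foreach { case (name, cpm, sample) =>
  println(f"$name: score = ${sample * math.log(1.0 + cpm)}%.3f")
}
// A: 0.055, B: 0.057, C: 0.065 → C wins despite the lowest CPM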
Cold Start Variant
When impressions == 0 for a candidate, the score uses categoryScore instead of Beta sampling:
sampledCTR = categoryScore + random(-0.15, +0.15)
score = sampledCTR * math.log(1.0 + cpm)
The ±0.15 noise range ensures cold candidates still have variance for exploration. See Cold Start Strategies for details.
Cold Start Strategies
New candidates enter the system with zero impressions. Promovolve uses three distinct strategies depending on the state of the candidate pool.
Strategy 1: Full Cold Start
Condition: All candidates in the slot have 0 impressions.
Algorithm: Use categoryScore from the auction phase as a prior, with noise:
sampledCTR = categoryScore + random(-0.1, +0.1)
score = sampledCTR × log(1 + CPM)
The categoryScore = classifierConfidence × rankerWeight provides a signal from the TaxonomyRankerEntity. The ±0.1 noise ensures different candidates are selected across requests even when they have identical category scores.
Strategy 2: Warmup Phase
Condition: All candidates have fewer than 10 impressions (WarmupImpressions = 10).
Algorithm: Round-robin — always select the candidate with the fewest impressions:
select = argmin(candidate.impressions)
No Thompson Sampling runs during warmup. This guarantees every candidate gets at least 10 impressions before exploitation begins.
Why 10? At 10 impressions with a typical 2-5% CTR, the expected number of clicks is 0-1. The resulting posterior, Beta(1, 11) with zero clicks or Beta(2, 10) with one, has sufficient shape to distinguish different CTRs but is still wide enough for continued exploration after warmup ends.
Strategy 3: Partial Cold Start
Condition: Some candidates have data (≥ 10 impressions) and some are new (0 impressions).
Algorithm: Epsilon-greedy with ExplorationRate = 0.30:
if random() < 0.30:
select randomly from cold candidates (impressions == 0)
else:
run Thompson Sampling on all candidates
The 30% rate is aggressive by design — new candidates need data quickly. Once they accumulate impressions, Thompson Sampling’s Beta posterior handles exploration naturally.
Note: When Thompson Sampling runs in the else branch, it runs on all candidates including cold ones. Cold candidates use categoryScore + random(-0.15, +0.15) as their sampled CTR, so they still have a chance to win through the normal scoring mechanism.
Strategy Selection Flow
Are all candidates at 0 impressions?
└── Yes → Full Cold Start (categoryScore ± 0.1 noise)
└── No → Are all candidates under 10 impressions?
└── Yes → Warmup (round-robin by fewest impressions)
└── No → Are some candidates at 0 impressions?
└── Yes → Partial Cold Start (30% epsilon-greedy)
└── No → Standard Thompson Sampling
Key Constants
| Constant | Value | Location |
|---|---|---|
| ExplorationRate | 0.30 | ThompsonSampling.scala |
| WarmupImpressions | 10 | ThompsonSampling.scala |
| Cold noise range | ±0.15 | ThompsonSampling.scala |
| Full cold noise range | ±0.1 | ThompsonSampling.scala |
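As a rough sketch, the decision tree above might be expressed as follows. The Candidate fields and the sampleBeta helper are assumptions for illustration; this is not the exact select() from ThompsonSampling.scala.

// Illustrative dispatch over the four strategies. Not the production select().
final case class Candidate(id: String, impressions: Int, clicks: Int,
                           cpm: Double, categoryScore: Double)

object ColdStartSelect {
  val ExplorationRate   = 0.30
  val WarmupImpressions = 10

  def select(candidates: Vector[Candidate], rng: scala.util.Random,
             sampleBeta: (Double, Double) => Double): Candidate = {
    def score(c: Candidate, noise: Double): Double = {
      val sampledCtr =
        if (c.impressions == 0) c.categoryScore + (rng.nextDouble() * 2 - 1) * noise
        else sampleBeta(c.clicks + 1.0, c.impressions - c.clicks + 1.0)
      sampledCtr * math.log(1.0 + c.cpm)
    }

    if (candidates.forall(_.impressions == 0))
      candidates.maxBy(c => score(c, noise = 0.1))            // full cold start
    else if (candidates.forall(_.impressions < WarmupImpressions))
      candidates.minBy(_.impressions)                         // warmup: round-robin
    else if (candidates.exists(_.impressions == 0) && rng.nextDouble() < ExplorationRate)
      rng.shuffle(candidates.filter(_.impressions == 0)).head // partial cold start
    else
      candidates.maxBy(c => score(c, noise = 0.15))           // standard Thompson Sampling
  }
}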
Beta Distribution Sampling
Thompson Sampling requires drawing random samples from Beta distributions on every serve request. The implementation uses the Marsaglia-Tsang method for Gamma variates, then converts to Beta.
Beta from Gamma
Beta(α, β) = X / (X + Y)
where X ~ Gamma(α, 1) and Y ~ Gamma(β, 1)
Gamma Sampling: Marsaglia-Tsang
Case 1: shape ≥ 1 (Rejection Sampling)
d = shape - 1/3
c = 1 / sqrt(9 × d)
repeat:
x ~ Normal(0, 1)
v = (1 + c × x)³
u ~ Uniform(0, 1)
until v > 0 AND log(u) < 0.5 × x² + d - d × v + d × log(v)
return d × v
Acceptance rate is ~98% for shape ≥ 1, making this efficient for production use.
Case 2: shape < 1 (Recursion + Power Trick)
Gamma(shape, 1) = Gamma(shape + 1, 1) × U^(1/shape)
where U ~ Uniform(0, 1)
Reduces to Case 1 since shape + 1 ≥ 1.
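A self-contained Scala sketch of the method as described above; the clamping of edge cases in the production sampleBeta() may differ.

// Marsaglia-Tsang Gamma sampling plus the Gamma → Beta conversion described above.
object BetaSampler {
  import scala.util.Random

  def sampleGamma(shape: Double, rng: Random): Double =
    if (shape < 1.0) {
      // Case 2: Gamma(shape) = Gamma(shape + 1) × U^(1/shape)
      sampleGamma(shape + 1.0, rng) * math.pow(rng.nextDouble(), 1.0 / shape)
    } else {
      // Case 1: rejection sampling with d = shape - 1/3, c = 1 / sqrt(9d)
      val d = shape - 1.0 / 3.0
      val c = 1.0 / math.sqrt(9.0 * d)
      var result = -1.0
      while (result < 0.0) {
        val x = rng.nextGaussian()
        val v = math.pow(1.0 + c * x, 3)
        if (v > 0.0) {
          val u = rng.nextDouble()
          if (math.log(u) < 0.5 * x * x + d - d * v + d * math.log(v)) result = d * v
        }
      }
      result
    }

  def sampleBeta(alpha: Double, beta: Double, rng: Random): Double = {
    val x = sampleGamma(alpha, rng)
    val y = sampleGamma(beta, rng)
    x / (x + y)
  }
}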
Why Marsaglia-Tsang?
| Alternative | Problem |
|---|---|
| Inverse CDF | Beta quantile function requires regularized incomplete beta — expensive |
| Pre-computed tables | Unbounded (α, β) pairs as stats change per impression |
| Normal approximation | Breaks for small α + β — exactly the exploration-critical case |
Numerical Stability
The implementation handles edge cases:
- α or β very small (< 0.01): clamped to avoid division by zero in power trick
- Very large shape: Marsaglia-Tsang is naturally stable
- Sample = 0 or 1: clamped to [ε, 1-ε] to avoid log(0) in downstream scoring
Performance
| Operation | Cost |
|---|---|
| One Beta sample | ~3 uniform random draws + arithmetic |
| Per-candidate scoring | 1 Beta sample + 1 log + 1 multiply |
| Full selection (K=3) | 3 Beta samples + argmax |
Total overhead: negligible compared to the DData lookup. The sampling is synchronous and runs on the Pekko dispatcher thread handling the serve request.
Fair Candidate Selection
The serve-time selection pipeline applies multiple filters before Thompson Sampling runs, ensuring only eligible candidates are considered.
Complete Selection Pipeline
From the AdServer source code, the exact order of operations:
1. Lookup ServeView from DData
Key: "siteId|slotId"
→ Vector[CandidateView]
2. Content Recency Filter
Keep if: (now - classifiedAtMs) ≤ contentRecencyWindowMs (48h)
3. Frequency Cap Check (if userId provided AND any caps exist)
→ Group candidates by advertiserId
→ Query AdvertiserEntity for user impression counts (100ms timeout)
→ Filter: keep if impressions < frequencyCap
→ Fail open on timeout (include all)
4. Rate Tracking (synchronous)
→ TrafficObserver.recordRequest(nowMs)
→ Update EMA-smoothed request rate (1s window, α=0.3)
→ BEFORE any async operations
5. Pacing Gate (BEFORE Thompson Sampling)
→ Fetch CachedSpendInfo for all participating campaigns
→ Compute aggregate PacingContext
→ PacingStrategy.throttleProbability(ctx) → [0.0, 0.99]
→ if random() < throttleProb: return NoCandidates (204)
→ Pacing gates VOLUME, not CHOICE
6. Thompson Sampling Selection
→ Cold start strategy selection (full cold / warmup / partial / standard)
→ Score: sampledCTR × log(1 + CPM)
→ Select argmax
7. Budget Reservation
→ CampaignEntity.Reserve(spend estimate)
→ AdvertiserEntity.GetBudgetStatus()
→ On failure: loop to next-best Thompson score candidate
→ All exhausted: return NoCandidates
Why Pacing Before Thompson Sampling?
If pacing ran after TS:
- TS picks a creative → pacing throttles → wasted exploration (we learned nothing)
- TS would consistently select a high-CTR creative that gets throttled, biasing future selection
With pacing before TS:
- Throttle decision is independent of creative choice
- When a request passes the gate, TS explores the full eligible set
- Every Thompson Sampling decision contributes useful data
Campaign Mix Change Detection
When the set of participating campaigns changes between requests:
if lastCampaignSet.nonEmpty && currentCampaignSet != lastCampaignSet:
log campaign mix changed (added/removed)
pacingStrategy.reset() // Don't let PI compensate for mix changes
This prevents the PI controller from making corrections based on stale campaign data.
Orphaned Creative Preservation
When new auction results arrive, creatives from the previous auction that aren’t in the new set are preserved as “orphaned”:
orphanedCreatives = existingCandidates.filterNot(c =>
newAuctionCreativeIds.contains(c.creativeId)
)
mergedCandidates = (newCandidates ++ orphanedCreatives).distinctBy(_.creativeId)
This ensures multi-campaign diversity survives across auction cycles and approval status is preserved.
Per-Campaign Diversity
Promovolve ensures diversity through two mechanisms: the auction-time fair selection algorithm and serve-time aggregate pacing.
Auction-Time Diversity
The candidate shortlisting algorithm (in AuctioneerEntity) guarantees per-campaign representation:
1. Group candidates by campaign
2. Pick best creative per campaign (by CPM)
3. If #campaigns ≥ #slots:
Take top campaigns by CPM → one creative each
4. Else:
Each campaign gets 1 slot (guaranteed)
Fill remaining slots with next-best creatives
This means 3 campaigns competing for 3 slots each get exactly 1 slot, rather than a single high-CPM campaign filling all 3.
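A sketch of this shortlisting rule, with illustrative field names rather than the exact AuctioneerEntity types:

// Per-campaign shortlisting sketch: best creative per campaign first, then fill.
final case class Bid(campaignId: String, creativeId: String, cpm: Double)

def shortlist(bids: Vector[Bid], slots: Int): Vector[Bid] = {
  // Best creative per campaign, ranked by CPM.
  val bestPerCampaign = bids
    .groupBy(_.campaignId).values
    .map(_.maxBy(_.cpm)).toVector
    .sortBy(-_.cpm)

  if (bestPerCampaign.size >= slots)
    bestPerCampaign.take(slots)                    // top campaigns, one creative each
  else {
    // Every campaign keeps its slot; fill the remainder with next-best creatives.
    val taken     = bestPerCampaign.map(_.creativeId).toSet
    val remaining = bids.filterNot(b => taken.contains(b.creativeId)).sortBy(-_.cpm)
    bestPerCampaign ++ remaining.take(slots - bestPerCampaign.size)
  }
}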
Serve-Time Aggregate Pacing
The pacing gate operates on aggregate campaign metrics, not per-campaign:
PacingContext(
dailyBudget = sum of all participating campaign budgets,
todaySpend = sum of all campaign spends (including pending),
avgCpm = CPM-weighted average across campaigns,
competingCampaigns = count of campaigns with budget remaining,
...
)
Why Aggregate?
Per-campaign pacing would allow a high-budget campaign to crowd out a low-budget one:
- Campaign A ($1000/day): barely paced, always serving
- Campaign B ($10/day): heavily paced, rarely serving
Aggregate pacing asks: “Given the total budget of all campaigns here, is the combined spend rate appropriate?” This naturally balances delivery.
Thompson Sampling as Natural Diversifier
Thompson Sampling itself provides diversity without explicit constraints:
- Each creative has its own Beta(clicks+1, impressions-clicks+1) posterior
- Sampling from Beta naturally introduces variance — even a dominant creative samples low sometimes
- New creatives have wide distributions → high variance → get explored
- Per-creative independence means every creative gets its own learning trajectory
Ad Product Blocklist
Publishers can configure per-site ad product category blocklists:
adProductBlocklist: Set[AdProductCategoryId]
Distributed via DData (AdProductBlocklistKey), this filter runs at auction time to exclude entire categories of ads (e.g., gambling, alcohol) from the publisher’s inventory.
Creative Deduplication
When merging new auction results with existing candidates:
mergedViews = (newCandidates ++ orphanedCreatives).distinctBy(_.creativeId)
This prevents the same creative from appearing multiple times, which would bias Thompson Sampling.
Frequency Capping
Frequency capping limits how many times a single user sees ads from the same advertiser, preventing ad fatigue.
How It Works
Per-User, Per-Creative Caps
Each campaign can specify a frequencyCap: Option[Int] — the maximum number of impressions per user for creatives from that campaign.
Check Process
At serve time, before the pacing gate:
// 1. Filter candidates with frequency caps
val cappedCandidates = candidates.filter(_.frequencyCap.isDefined)
// 2. Group by advertiser
val byAdvertiser = cappedCandidates.groupBy(_.advertiserId)
// 3. Query each AdvertiserEntity for user impression counts
// Timeout: 100ms, fail-open
// 4. Filter
filtered = candidates.filter { c =>
c.frequencyCap match {
case None => true // No cap, always eligible
case Some(cap) =>
val impressions = impressionCountsMap.getOrElse(c.creativeId, 0)
impressions < cap
}
}
Fail-Open Semantics
If the AdvertiserEntity doesn’t respond within 100ms:
On timeout → include all candidates from that advertiser
Why fail-open? Frequency capping is a quality optimization. The alternative (fail-closed) would mean network issues cause no ads to show. It’s better to occasionally over-serve than to block serving entirely.
Pipeline Position
Frequency capping runs after content recency but before the pacing gate and Thompson Sampling:
Content recency → Frequency cap → Rate tracking → Pacing gate → Thompson Sampling
Running before TS ensures:
- TS never wastes exploration on capped candidates
- The filtered pool may be smaller but TS works correctly with any size ≥ 1
- If all candidates are capped, no ad is shown (NoCandidates)
Interaction with Pacing
Frequency capping and pacing are independent filters. A candidate must pass both:
Candidates → Frequency Filter → Pacing Gate → Thompson Sampling
Running frequency cap first reduces the number of candidates the pacing gate needs to evaluate.
Pacing Overview
Budget pacing ensures campaigns spend their daily budgets smoothly throughout the day. Promovolve uses a self-tuning PI control loop with traffic shape awareness, leaky integrator anti-windup, oscillation detection, and cross-day learning.
Why Pacing Matters
Without pacing, a campaign with a $100 daily budget and a $5 CPM would exhaust its budget after 20,000 impressions. If those impressions all arrive during the morning peak, the campaign goes dark for the rest of the day.
Why PI Control Works Here
PI (Proportional-Integral) control is a technique from industrial process control — thermostats, motor speed regulation, chemical plant flow rates. It works well when the controller can observe the system’s output and adjust a single input to drive it toward a target.
Budget pacing in Promovolve fits this model because it’s a closed system. The platform controls both sides of the equation:
- The input: the throttle probability (what fraction of ad requests to serve)
- The output: the spend rate (how fast budget is consumed)
- The target: even delivery across the day (spend rate = budget / time remaining)
The controller observes the spend rate, compares it to the target, and adjusts the throttle. Overspending? Throttle up (skip more requests). Underspending? Throttle down (serve more). The feedback loop is tight and the response is predictable.
This wouldn’t work in the traditional programmatic stack. In RTB, the publisher submits impressions to an exchange and has no control over whether a campaign wins any given auction — that depends on competing bids from unknown DSPs. The campaign’s delivery rate is a function of market dynamics the pacing controller can’t observe or influence. You can adjust your bid, but you can’t control the outcome.
In Promovolve, there’s no external auction to compete in at serve time. The candidates are already cached. The pacing gate is a simple yes/no decision on each request, and the controller has full authority over that decision. This makes the system controllable in the control-theory sense — the input (throttle) directly determines the output (delivery rate), with no unobservable external disturbances.
That’s why a PI controller — a well-understood, stable, analytically tractable technique — works for a problem that the traditional ad tech industry solves with heuristic rules and hope.
Promovolve’s Approach
graph LR
TO["Traffic Observer<br/>(EMA rate, 1s)"] --> PI["PI Control<br/>(self-tuning)"]
PI --> Throttle["Throttle Prob<br/>[0.0, 0.99]"]
PI --> Shape["Traffic Shape<br/>(weekday/weekend<br/>24h buckets)"]
Shape --> TO
Throttle --> Bernoulli["Bernoulli<br/>Serve or Skip"]
Key Components
- Rate Tracking (EMA): Synchronous, 1-second window, α=0.3
- PI Control Loop: Self-tuning gains, asymmetric response, leaky integrator
- Traffic Shape Learning: Separate weekday/weekend 24-hour profiles
- Grace Periods: Startup protection with MaxThrottleProb (0.99)
Pipeline Position
Pacing operates as a volume gate before Thompson Sampling:
Content recency → Frequency cap → Rate tracking → Pacing gate → Thompson Sampling
The pacing gate makes a Bernoulli decision: if random() < throttleProbability → skip (204). This gates volume, not choice — Thompson Sampling only runs for requests that pass the gate.
Key Constants (from AdaptivePacing.scala)
| Constant | Value |
|---|---|
| MaxThrottleProb | 0.99 (1.0 reserved for hard-stop) |
| DefaultKp | 0.5 |
| DefaultKi | 0.3 |
| BaseOverpaceGainMultiplier | 2.0 |
| IntegralDecayFactor | 0.995 (leaky integrator) |
| SpendRatioSmoothingAlpha | 0.3 |
| DefaultAvgCpm | $5.00 |
Rate Tracking (EMA)
Accurate rate measurement is the foundation of the pacing system. Promovolve uses a synchronous Exponential Moving Average (EMA) with a 1-second sliding window.
TrafficObserver
From pacing/TrafficObserver.scala:
class TrafficObserver(
rateWindowMs: Long = 1000, // 1-second window
rateEmaAlpha: Double = 0.3 // EMA smoothing factor
)
Recording (Synchronous)
Called on every Select request, before any async operations:
recordRequest(nowMs):
if windowStartMs == 0: windowStartMs = nowMs
requestsInWindow += 1
windowElapsed = nowMs - windowStartMs
if windowElapsed >= rateWindowMs: // window closed
windowSec = windowElapsed / 1000.0
instantRate = requestsInWindow / windowSec
smoothedRate = α × instantRate + (1 - α) × smoothedRate
windowStartMs = nowMs
requestsInWindow = 0
return smoothedRate
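The same logic as a minimal mutable Scala sketch (field names are illustrative, not necessarily the exact TrafficObserver internals):

// Windowed EMA rate tracker: count requests per ~1s window, fold each closed
// window into the exponential moving average.
final class EmaRateTracker(rateWindowMs: Long = 1000L, alpha: Double = 0.3) {
  private var windowStartMs    = 0L
  private var requestsInWindow = 0L
  private var smoothedRate     = 0.0

  def recordRequest(nowMs: Long): Double = {
    if (windowStartMs == 0L) windowStartMs = nowMs
    requestsInWindow += 1
    val windowElapsed = nowMs - windowStartMs
    if (windowElapsed >= rateWindowMs) {           // window closed
      val instantRate = requestsInWindow / (windowElapsed / 1000.0)
      smoothedRate = alpha * instantRate + (1 - alpha) * smoothedRate
      windowStartMs = nowMs
      requestsInWindow = 0
    }
    smoothedRate
  }
}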
EMA Behavior
With α = 0.3:
Window 1: instant=100, smoothed = 0.3×100 + 0.7×0 = 30
Window 2: instant=120, smoothed = 0.3×120 + 0.7×30 = 57
Window 3: instant=110, smoothed = 0.3×110 + 0.7×57 = 73
Window 4: instant=105, smoothed = 0.3×105 + 0.7×73 = 83
Window 5: instant=100, smoothed = 0.3×100 + 0.7×83 = 88
Converges within ~5 windows. Spikes are dampened:
Window 6: instant=500, smoothed = 0.3×500 + 0.7×88 = 212 (spike dampened)
Window 7: instant=100, smoothed = 0.3×100 + 0.7×212 = 178 (recovering)
Why Synchronous?
The rate tracking call is synchronous and runs on the same thread handling the serve request. This ensures:
- Every request is counted exactly once
- No race conditions from async updates
- Rate is always current when the pacing gate runs
Stabilization
The grace period requires EmaStabilizationWindows = 3 windows of data before the EMA is considered stable. During these initial windows, the grace period remains active to prevent PI corrections based on noisy rate estimates.
Usage in PI Control
The smoothed rate feeds into the base throttle calculation:
baseTargetImpsPerSec = (dailyBudget / dayDurationSeconds) / (avgCpm / 1000.0)
baseThrottle = 1.0 - (baseTargetImpsPerSec / requestRate)
Where requestRate is the EMA-smoothed rate from TrafficObserver.
PI Control Loop
Promovolve uses a self-tuning Proportional-Integral (PI) controller with adaptive gains, asymmetric response, a leaky integrator, and oscillation detection.
Core Algorithm (from AdaptivePacing.scala)
// 1. Hard stops
if remainingBudget ≤ 0: return 1.0
if remainingHours ≤ 0: return 1.0
// 2. Base throttle from target impressions per second
baseTargetImpsPerSec = (dailyBudget / dayDurationSec) / (avgCpm / 1000.0)
// 3. Apply traffic shape multiplier (if available)
if trafficShape exists:
shapeMultiplier = trafficShape.relativeVolumeWithFeedforward(elapsed, feedforwardWindow)
baseTargetImpsPerSec *= shapeMultiplier
baseThrottle = 1.0 - (baseTargetImpsPerSec / requestRate)
// 4. Compute error
error = 1.0 - spendRatio
// positive → under-spending, negative → over-spending
// 5. Asymmetric gains
if error < 0 (over-pacing):
effectiveKp = kp × overpaceGainMultiplier // default: kp × 2.0
effectiveKi = ki × overpaceGainMultiplier
else:
effectiveKp = kp
effectiveKi = ki
// 6. Leaky integrator (anti-windup)
integralError *= IntegralDecayFactor // 0.995 per update
integralError += error × dt
integralError = clamp(integralError, -1.0, 1.0)
// 7. PI adjustment
adjustment = effectiveKp × error + effectiveKi × integralError
// 8. Final throttle
finalThrottle = clamp(baseThrottle - adjustment, 0.0, MaxThrottleProb)
// MaxThrottleProb = 0.99 (1.0 reserved for hard-stop)
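A condensed sketch of steps 2 through 8 (omitting the traffic shape multiplier), with the leaky integral carried between updates. Constants match the Pacing Overview table; this is an illustration, not AdaptivePacing.scala itself.

// PI throttle sketch: base throttle from the target rate, then a PI correction with
// asymmetric gains and a leaky, clamped integral.
final case class PiState(integralError: Double = 0.0)

def throttleProbability(
    dailyBudget: Double, avgCpm: Double, dayDurationSec: Double,
    requestRate: Double, spendRatio: Double, dtSec: Double, state: PiState,
    kp: Double = 0.5, ki: Double = 0.3, overpaceGain: Double = 2.0
): (Double, PiState) = {
  val maxThrottleProb = 0.99
  val baseTargetImpsPerSec = (dailyBudget / dayDurationSec) / (avgCpm / 1000.0)
  val baseThrottle = 1.0 - (baseTargetImpsPerSec / requestRate)

  val error = 1.0 - spendRatio                     // negative when over-spending
  val (effKp, effKi) =
    if (error < 0) (kp * overpaceGain, ki * overpaceGain) else (kp, ki)

  // Leaky integrator with hard clamp (anti-windup).
  val integral =
    math.max(-1.0, math.min(1.0, state.integralError * 0.995 + error * dtSec))

  val adjustment = effKp * error + effKi * integral
  val throttle = math.max(0.0, math.min(maxThrottleProb, baseThrottle - adjustment))
  (throttle, PiState(integral))
}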
Spend Ratio Smoothing
Raw spend ratio is noisy. The system applies EMA smoothing:
smoothedSpendRatio = α × rawSpendRatio + (1 - α) × previousSmoothed
Default SpendRatioSmoothingAlpha = 0.3, but the alpha itself is self-tuned:
- If oscillation detected (stddev > 0.08): decrease alpha toward MinSmoothingAlpha (0.1) — more dampening
- If stable (stddev < 0.04): increase alpha toward MaxSmoothingAlpha (0.5) — more responsive
Self-Tuning Overpace Multiplier
The asymmetric gain multiplier is not fixed — it adapts over time:
Every 20 samples (and at least 500ms apart):
if persistent overspend (avg spendRatio > 1.05):
overpaceMultiplier *= OverspendBoostFactor (1.15)
capped at MaxOverpaceGainMultiplier (5.0)
elif well-paced (avg spendRatio < 1.02):
overpaceMultiplier *= WellPacedDecayFactor (0.95)
floored at MinOverpaceGainMultiplier (1.5)
This means the system becomes progressively more aggressive at correcting overspend if it keeps recurring, and relaxes when pacing is good.
Adaptive Gains by Traffic Volatility
PI gains scale with the coefficient of variation (CV) of request rates:
| Volatility (CV) | Kp | Ki | Behavior |
|---|---|---|---|
| 0.0 (flat) | 0.3 | 0.2 | Gentle corrections for uniform traffic |
| 0.5 (typical) | 0.5 | 0.3 | Moderate response |
| 1.0+ (spiky) | 1.0 | 0.6 | Aggressive corrections for bursty traffic |
Gains are linearly interpolated between these points based on the observed CV from the TrafficShapeTracker.
Leaky Integrator
The integral term decays by IntegralDecayFactor = 0.995 on every update. This prevents windup — where a prolonged error accumulates a large integral that then overshoots when conditions change.
The integral is also hard-clamped to [-1.0, 1.0] as a safety bound.
Cross-Day Learning
At day rollover, the system checks if the budget was exhausted too early:
prepareForRollover(budgetExhausted, remainingFraction):
if budgetExhausted && remainingFraction > EarlyExhaustionThreshold (0.05):
overpaceMultiplier *= (1.0 + remainingFraction)
// If exhausted with 30% of day remaining → boost by 1.3x
This carries forward the lesson: “I should have been more conservative” into the next day’s pacing, even though the PI state itself resets.
Traffic Shape Learning
Web traffic follows daily patterns. Promovolve learns these patterns separately for weekdays and weekends using a TrafficShapeTracker with 24 hourly buckets.
TrafficShapeTracker (from source)
class TrafficShapeTracker(
bucketCount: Int = 24, // hourly buckets
alpha: Double = 0.1, // EMA learning rate
interpolateVolumes: Boolean = false // sharp vs smooth peaks
)
Separate Weekday/Weekend Profiles
private val weekdayShape: Array[Double] = Array.fill(24)(1.0)
private val weekendShape: Array[Double] = Array.fill(24)(1.0)
private val todayCount: Array[Long] = Array.fill(24)(0L) // reset daily
The active shape is selected via setDayType(isWeekend: Boolean) at the start of each day.
Recording & Learning
Per-Request Recording
recordRequest(bucket, time):
todayCount[bucket] += 1
On Bucket Boundary Change
When traffic moves to a new hour:
observation = requestsInBucket / max(1.0, emaBucketRequests)
shape[bucket] = α × observation + (1 - α) × shape[bucket]
emaBucketRequests = α × requestsInBucket + (1 - α) × emaBucketRequests
Day Rollover Blending
At end of day:
rolloverDay(dayAlpha = 0.2):
todayNormalized[i] = todayCount[i] / avgCount
shape[i] = 0.2 × todayNormalized[i] + 0.8 × shape[i]
reset todayCount
The 0.2 blend rate means it takes about 5 days of data to significantly influence the profile.
CDF for Expected Spend
The traffic shape provides a cumulative distribution function that replaces the linear time fraction in expected spend calculations:
cumulativeFractionAtTime(elapsedSeconds):
bucket = floor(elapsedSeconds / bucketDurationSec)
fractionIntoBucket = (elapsedSeconds % bucketDurationSec) / bucketDurationSec
prevCumulative = sum(shape[0..bucket-1])
currentContribution = shape[bucket] × fractionIntoBucket
return (prevCumulative + currentContribution) / sum(all buckets)
Without traffic shape: expectedSpendFraction = elapsedTime / totalTime (linear)
With traffic shape: expectedSpendFraction = cumulativeFractionAtTime(elapsed) (shaped)
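A sketch of this calculation for an hourly shape array (illustrative, not the exact TrafficShapeTracker method):

// Fraction of the day's expected traffic that should have arrived by elapsedSeconds,
// according to the learned shape.
def cumulativeFractionAtTime(shape: Array[Double], elapsedSeconds: Double,
                             dayDurationSec: Double = 86400.0): Double = {
  val bucketDurationSec  = dayDurationSec / shape.length
  val bucket             = math.min(shape.length - 1, (elapsedSeconds / bucketDurationSec).toInt)
  val fractionIntoBucket = (elapsedSeconds % bucketDurationSec) / bucketDurationSec
  val prevCumulative     = shape.take(bucket).sum
  (prevCumulative + shape(bucket) * fractionIntoBucket) / shape.sum
}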
Relative Volume (for Base Target)
The base target impressions-per-second is scaled by the current hour’s relative volume:
relativeVolumeWithFeedforward(elapsedSeconds, feedforwardWindow):
bucket = current hour
currentVol = shape[bucket]
nextVol = shape[(bucket + 1) % 24]
if feedforwardWindow > 0 AND near end of bucket:
// Smooth transition using ease-in-out curve
blendFactor = position within feedforward window [0, 1]
smoothBlend = blendFactor² × (3 - 2 × blendFactor)
effectiveVol = currentVol + smoothBlend × (nextVol - currentVol)
else:
effectiveVol = currentVol
avgVol = sum(all buckets) / 24
return effectiveVol / avgVol
The feedforward window (default: 0.0 = disabled) allows the system to anticipate the next hour’s traffic pattern and begin adjusting before the bucket boundary.
Volatility Measurement
The coefficient of variation (CV = stddev / mean) of the shape buckets is used to auto-tune PI gains:
- Low CV → uniform traffic → gentle PI gains
- High CV → spiky traffic → aggressive PI gains
Site-Level Configuration
Per-site traffic shapes can be pre-configured via PacingConfig:
PacingConfig(
weekdayShapeVolumes: Option[Vector[Double]], // 24 hourly values
weekendShapeVolumes: Option[Vector[Double]],
dayDurationSeconds: Int = 86400,
warmupMode: Boolean = false
)
When warmupMode = true, the system records traffic patterns but does not serve ads — useful for learning the traffic shape of a new site before enabling monetization.
Grace Periods & Hybrid Modes
PI control needs stable input signals. During startup and after periods of inactivity, signals are noisy or meaningless. Grace periods protect the controller during these transients.
Grace Period: No Serving
During grace, the throttle probability is set to MaxThrottleProb = 0.99 — effectively no serving. This is the opposite of what you might expect: rather than serving freely during warmup, Promovolve suppresses serving until it has enough data to pace correctly.
Why suppress, not serve freely? Free serving during startup could exhaust the budget before the PI controller activates. The 0.99 throttle (not 1.0) lets through ~1% of requests as “sensors” to build rate data.
Grace Period Conditions
Grace is active until both conditions are met:
graceSeconds = max(MinGraceSeconds, DefaultGracePeriodFraction × dayDurationSeconds)
= max(10.0, 0.01 × dayDurationSeconds)
graceRequests = MinGraceRequests = 10
Grace ends when:
elapsedSeconds >= graceSeconds AND requestCount >= graceRequests
Additionally, the EMA needs EmaStabilizationWindows = 3 windows of data to stabilize.
Staleness Reset
If no requests arrive for a configurable period, grace re-enters:
staleThreshold = BaseStaleRateThresholdMs = 30,000ms
(scaled proportionally for simulated short days)
(min: MinStaleRateThresholdMs = 1,000ms)
if (nowMs - lastRequestMs) > staleThreshold:
resetGracePeriod()
Why? After 30 seconds of silence, the EMA-smoothed rate has decayed and no longer represents current traffic. PI corrections with stale rate data would produce erratic throttle swings.
Grace Period Constants
| Constant | Value | Purpose |
|---|---|---|
| DefaultGracePeriodFraction | 0.01 (1% of day) | Base grace duration |
| MinGraceSeconds | 10.0 | Minimum grace regardless of day length |
| MinGraceRequests | 10 | Minimum requests before PI activates |
| MaxGraceRequests | 50 | (Not currently used as lower bound) |
| EmaStabilizationWindows | 3 | EMA warmup windows |
| BaseStaleRateThresholdMs | 30,000 | Staleness detection |
| MinStaleRateThresholdMs | 1,000 | Min staleness for short simulated days |
| MaxThrottleProb | 0.99 | Throttle during grace |
Grace Period Timeline
Time Event Mode Throttle
00:00 Site pacing starts Grace 0.99 (~1% of requests pass)
00:05 5 requests arrived Grace (count) 0.99
00:10 10s elapsed, 10+ requests Grace (EMA) 0.99
00:13 3 EMA windows stable PI active Computed
...
01:30 No requests for 35s Stale reset 0.99
01:31 Requests resume Grace 0.99
01:41 Grace conditions met PI active Computed
Simulated Days
For testing and simulation, dayDurationSeconds can be set shorter than 86400 (e.g., 600 seconds for a 10-minute “day”). Grace periods, stale thresholds, and the RL observation interval all scale proportionally, ensuring the system behaves consistently regardless of the simulated time scale.
Reinforcement Learning from Scratch
This chapter teaches RL from zero, using Promovolve’s bid optimization as the running example. By the end, you’ll understand every line of DQNAgent.scala and BidOptimizationAgent.scala. No prior machine learning knowledge required.
The Problem: Too Many Decisions, Too Little Time
Imagine you’re managing a single advertising campaign. You have a daily budget of $100 and a maximum bid of $5 CPM (cost per thousand impressions). Every 15 minutes, you look at how things are going and decide: should I bid higher to win more auctions, or bid lower to conserve budget?
You could try to figure this out manually. But the ad marketplace is a living thing — competitors change their bids, user traffic surges and dips, some content drives more clicks than others. Any fixed rule you write today will be wrong tomorrow.
What you really want is a system that learns from experience, the same way a new employee gets better at their job over the first few weeks. That’s reinforcement learning.
The Setup: Agent, Environment, Reward
Every RL problem has the same three ingredients.
The agent is the decision-maker. In Promovolve, the agent is BidOptimizationAgent — the piece of code that decides how aggressively to bid. It’s not an actor — it’s a plain Scala object embedded inside the CampaignEntity actor, called synchronously during each observation.
The environment is everything the agent can’t control: other advertisers’ bids, how many users are browsing, what content they’re viewing. The agent can observe the environment, but can’t change it directly. It can only respond to it.
The reward is the signal that tells the agent how it’s doing. After each decision, the environment hands back a single number: positive if things went well, negative if they didn’t. The agent’s entire goal is to maximize the total reward it collects over time.
That’s it. Everything in this chapter is just machinery for making this loop work.
Why Not Just Write Rules?
The first instinct is to hard-code the logic: “if budget is more than 60% remaining and it’s past noon, bid higher.” But rules like this break for a few reasons:
- You don’t know the right thresholds in advance. Is 60% the right cutoff? Maybe it should be 55%, or 70%.
- The optimal rule depends on your competitors’ behavior, which you can’t see and which changes constantly.
- Traffic patterns vary by day, season, and content type.
- Different campaigns have different optimal strategies.
RL sidesteps all of this. Instead of writing rules, you define what “good” means (the reward), then let the agent figure out the rules on its own through trial and error.
What the Agent Observes
Every 15 minutes, the agent looks at the world and builds a snapshot of the current situation. In Promovolve, that snapshot — called the state — contains eight numbers:
| Measurement | Example |
|---|---|
| Current effective CPM | $3.20 |
| Click-through rate this window | 0.03 (3%) |
| Bid win rate this window | 0.65 (65%) |
| Fraction of daily budget remaining | 0.70 (70%) |
| Fraction of day remaining | 0.50 (50%) |
| Current spend rate vs. ideal pace | 1.10 (10% ahead of pace) |
| Impression rate | 0.80 |
| Cost per click | $0.40 |
These eight numbers, taken together, describe the situation the agent is in. The same way a chess player looks at the board position before deciding their move, the agent looks at this state before deciding its action.
What the Agent Can Do
Given a state, the agent picks one of seven actions:
| Action | Multiplier | Meaning |
|---|---|---|
| 0 | 0.7× | Bid much lower — conserve budget aggressively |
| 1 | 0.8× | Bid lower |
| 2 | 0.9× | Bid slightly lower |
| 3 | 1.0× | Hold steady |
| 4 | 1.1× | Bid slightly higher |
| 5 | 1.2× | Bid higher |
| 6 | 1.4× | Bid much higher — be aggressive |
The multiplier is applied cumulatively to the current bid level and clamped to [0.5, 2.0]. The agent can’t change the market. It can only adjust how aggressively it bids, and observe what happens next.
The Reward Signal
After each 15-minute window, the agent receives a reward. In Promovolve, it’s calculated as:
pacingScore = max(0.1, 1.0 - |spendRate - 1.0|)
reward = clicks × pacingScore - exhaustionPenalty × timeRemaining
In plain language:
- Clicks scaled by pacing quality. When the campaign is perfectly on pace (spendRate = 1.0), every click counts fully. When overspending at 2×, each click is worth only 10%. When underspending at 0.5×, each click is worth 50%. Both directions are penalized — the agent can’t game it by overbidding.
- Budget exhaustion is penalized proportionally. Running out of budget with 80% of the day left is 4× worse than running out with 20% left. This teaches the agent that early exhaustion wastes more potential clicks than late exhaustion.
This single formula is the only definition of “success” the agent ever gets. Everything it learns flows from this signal. If you change the reward function, the agent will learn a completely different strategy.
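A small sketch of this reward, assuming the exhaustion penalty applies only when the budget is actually exhausted and using an illustrative penalty constant (not the production value):

// Pacing-scaled reward sketch. The exhaustion check and penalty constant are
// assumptions for illustration.
def computeReward(clicks: Int, spendRate: Double,
                  budgetExhausted: Boolean, timeRemaining: Double,
                  exhaustionPenalty: Double = 10.0): Double = {
  val pacingScore = math.max(0.1, 1.0 - math.abs(spendRate - 1.0))
  val penalty     = if (budgetExhausted) exhaustionPenalty * timeRemaining else 0.0
  clicks * pacingScore - penalty
}

// 5 clicks, perfectly on pace (spendRate = 1.0), budget left:  5 × 1.0 = 5.0
// Same 5 clicks while overspending at 2× pace:                 5 × 0.1 = 0.5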
Learning from Experience: The Core Idea
Here’s the fundamental question: how does the agent learn which action to take in which situation?
The short answer is: by remembering what happened and adjusting its expectations.
Suppose the agent is at 70% budget with a high click-through rate, and it decides to bid higher. It wins more auctions, gets more clicks, and earns a good reward. Now it knows that “bid higher in this situation” was a profitable move.
But here’s the subtlety: the agent doesn’t just care about the immediate reward. It cares about the total reward it’ll collect for the rest of the day. Bidding aggressively might earn a lot of clicks right now, but if it burns through the budget by 2pm, the rest of the day earns nothing — and the pacing penalty kicks in hard.
This is why RL cares about sequences of decisions, not just individual ones.
Q-Values: Scoring Every Option
To reason about sequences, the agent uses a concept called a Q-value.
Q(state, action) is the agent’s estimate of the answer to this question: “If I’m in this situation and I take this action, then play optimally for the rest of the day — how much total reward will I collect?”
For example:
| Action | Multiplier | Q-value |
|---|---|---|
| 0 | 0.7× | 7.3 |
| 1 | 0.8× | 9.1 |
| 2 | 0.9× | 10.0 |
| 3 | 1.0× | 11.2 |
| 4 | 1.1× | 12.5 |
| 5 | 1.2× | 11.8 |
| 6 | 1.4× | 9.4 |
Given these Q-values, the agent picks action 4 (bid slightly higher) — Q-value 12.5, the best estimated outcome. Notice that the most aggressive action (1.4×) has a lower Q-value than moderate ones — the agent has learned that overbidding hurts pacing.
The agent doesn’t need to think about the future explicitly. It just looks up its Q-values and picks the highest one. All the long-term reasoning is baked into the Q-values themselves.
The hard part, of course, is having accurate Q-values in the first place. That’s what training is for.
The Bellman Equation: Where Q-Values Come From
How do Q-values get calculated? Through a beautifully simple observation called the Bellman equation.
Think of it this way: the value of an action equals two things added together:
- The immediate reward you get right now
- The best you can do from the next state onward
Written as a formula:
Q(state, action) = immediate_reward + γ × max Q(next_state, any_action)
The γ (gamma) is a number between 0 and 1 — in Promovolve, it’s 0.99. It represents how much the agent values future rewards compared to immediate ones.
Why discount the future at all? Because a reward now is worth more than the same reward later — you might run out of budget before you get there. With γ = 0.99, a reward one step from now is worth 0.99 of a reward right now. A reward 100 steps from now is worth only 0.99^100 ≈ 0.37. The agent prefers results sooner.
The Bellman equation turns a hard problem (predict total future reward) into a tractable one: predict immediate reward, then bootstrap from the next state.
The Approximation Problem
If states were a small, fixed set — like the 64 squares of a chessboard — you could store a Q-value table in memory. One row per state, one column per action.
But Promovolve’s state is eight continuous numbers. Budget remaining can be 0.70, or 0.701, or 0.7001. There are infinitely many possible states. A table is out of the question.
This is where a neural network comes in — not because neural networks are magic, but because they’re very good at one specific thing: learning a function from examples.
Neural Networks: A Function Approximator
A neural network is just a mathematical function with a lot of adjustable dials.
You give it some numbers as input. It does a sequence of calculations. It produces some numbers as output. The dials — called weights — determine what those calculations are.
In Promovolve, the network looks like this:
Input: [0.7, 0.03, 0.65, 0.70, 0.50, 1.1, 0.8, 0.4]
(8 numbers describing the current state)
↓
[64 intermediate values]
↓
[64 intermediate values]
↓
Output: [7.3, 9.1, 10.0, 11.2, 12.5, 11.8, 9.4]
(Q-values for each of the 7 actions)
The intermediate layers exist to let the network learn non-obvious patterns. Maybe “high win rate AND low budget remaining” signals something that neither measurement reveals on its own. The intermediate layers give the network room to discover those combinations.
This is DenseNetwork.forward() in the code. The network has roughly 5,200 parameters total.
The key insight: this network generalizes. Once trained, it can produce reasonable Q-value estimates for states it has never seen before, by interpolating from similar states it has seen. That’s the whole point of using a neural network instead of a lookup table.
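A minimal forward pass over this 8 → 64 → 64 → 7 layout. ReLU hidden activations and random initialization are assumptions for illustration, not necessarily how DenseNetwork is written.

// Tiny MLP forward pass: ReLU on hidden layers, linear output (one Q-value per action).
final case class Layer(weights: Array[Array[Double]], biases: Array[Double]) {
  def apply(input: Array[Double], relu: Boolean): Array[Double] =
    weights.zip(biases).map { case (row, b) =>
      val z = row.zip(input).map { case (w, x) => w * x }.sum + b
      if (relu) math.max(0.0, z) else z
    }
}

final class QNetwork(sizes: List[Int], rng: scala.util.Random) {
  private val layers = sizes.zip(sizes.tail).map { case (in, out) =>
    Layer(Array.fill(out, in)(rng.nextGaussian() * 0.1), Array.fill(out)(0.0))
  }
  def forward(state: Array[Double]): Array[Double] =
    layers.zipWithIndex.foldLeft(state) { case (x, (layer, i)) =>
      layer(x, relu = i < layers.size - 1)
    }
}

// val qValues = new QNetwork(List(8, 64, 64, 7), new scala.util.Random(42)).forward(state)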
Training: Adjusting the Dials
Training is the process of adjusting the network’s weights so that its Q-value estimates become accurate.
After every action the agent takes, it observes a transition:
(state, action, reward, next_state)
For example: “I was at state [0.7, 0.03, …], I picked action 4 (bid higher slightly), I earned a reward of 2.7, and now I’m at state [0.65, 0.028, …].”
From this transition, training works in three steps:
1. Predict. Run the current state through the network to get the predicted Q-value for the action taken.
2. Calculate what it should be. Using the Bellman equation:
target = 2.7 + 0.99 × (best Q-value from next state)
3. Adjust. The weights are nudged slightly so that the prediction moves closer to the target. If the prediction was 10.0 and the target is 12.5, every weight is adjusted a tiny amount in the direction that would have pushed the output toward 12.5.
This nudging process — called gradient descent — is what DenseNetwork.train() implements. Repeat this tens of thousands of times, and the network’s estimates gradually converge toward accurate Q-values.
Experience Replay: Breaking Correlation
There’s a practical problem with learning from transitions one at a time.
The state at 2:15pm is very similar to the state at 2:30pm. The state at 2:30pm is similar to the state at 2:45pm. If the agent learns from each transition as it comes in, it ends up adjusting the weights based on a highly repetitive stream of near-identical situations. The weights get tuned specifically for the last hour and “forget” what they learned about other situations.
Experience replay fixes this with a simple trick: don’t learn immediately. Instead, store each transition in a buffer:
buffer = [
(state from day 1, action, reward, next_state),
(state from day 3, action, reward, next_state),
(state from day 1 again, different action, ...),
...
]
Then, for each training step, pull a random batch of 32 transitions from anywhere in the buffer. This batch might include a transition from this morning, one from three days ago, and one from a completely different budget level. The variety helps the network learn broadly, not just for the current moment.
In the code, this is ReplayBuffer — a circular buffer that holds up to 10,000 transitions. When full, new arrivals overwrite the oldest ones. buffer.sample(32, rng) pulls a random batch.
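A sketch of such a circular buffer: overwrite the oldest entry when full, sample uniformly at random (with replacement, for simplicity). The real ReplayBuffer may differ in details.

// Circular replay buffer sketch.
final case class Transition(state: Array[Double], action: Int, reward: Double,
                            nextState: Array[Double], done: Boolean)

final class ReplayBuffer(capacity: Int = 10000) {
  private val buffer  = new Array[Transition](capacity)
  private var nextIdx = 0
  private var count   = 0

  def store(t: Transition): Unit = {
    buffer(nextIdx) = t                         // overwrites the oldest entry once full
    nextIdx = (nextIdx + 1) % capacity
    count = math.min(count + 1, capacity)
  }

  def sample(batchSize: Int, rng: scala.util.Random): Vector[Transition] =
    Vector.fill(math.min(batchSize, count))(buffer(rng.nextInt(count)))

  def size: Int = count
}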
Double DQN: Correcting a Bias
Standard DQN has a subtle flaw. Look at how the training target is calculated:
target = reward + γ × max(network.forward(next_state))
The max operation picks the best-looking action from the next state. But the estimates coming out of the network are noisy, especially early in training when the weights are barely tuned. When estimates are noisy, the highest one tends to be an overestimate — it happens to be high partly because it’s genuinely good, and partly due to random noise in the current weight settings.
By always taking the max, the agent systematically overestimates Q-values. This makes the agent overconfident and can destabilize training. In ad bidding, this manifests as the agent thinking aggressive bidding is more rewarding than it actually is — leading to budget blowouts.
Double DQN fixes this by splitting the job between two separate networks:
- Q-network: The main network, updated after every training step.
- Target network: A copy of the Q-network, updated only every 100 steps.
Instead of using the same network to both select the best action and evaluate it, Double DQN separates these:
best_action = argmax(q_network.forward(next_state)) // Q-net selects
target = reward + γ × target_network.forward(next_state)[best_action] // target-net evaluates
The target network is a stable reference point. Because it only updates every 100 steps, it doesn’t chase its own tail. The Q-network learns against a slowly-moving target rather than a constantly shifting one.
This is implemented in lines 83–93 of DQNAgent.trainStep().
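A sketch of that target calculation, reusing the Transition shape from the replay buffer sketch above; the real trainStep() additionally clips Q-values and gradients.

// Double DQN target: the Q-network selects the next action, the target network evaluates it.
def doubleDqnTarget(t: Transition,
                    qNetwork: Array[Double] => Array[Double],
                    targetNetwork: Array[Double] => Array[Double],
                    gamma: Double = 0.99): Double =
  if (t.done) t.reward
  else {
    val nextQ      = qNetwork(t.nextState)
    val bestAction = nextQ.indices.maxBy(i => nextQ(i))            // Q-net selects
    t.reward + gamma * targetNetwork(t.nextState)(bestAction)      // target-net evaluates
  }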
Exploration vs. Exploitation
If the agent always picks the action with the highest Q-value, it never discovers whether some other action might be even better. It could settle into a mediocre routine and never escape.
This tension — use what you know vs. try something new — is called the explore/exploit tradeoff. It’s one of the fundamental problems in RL.
Promovolve solves it with epsilon-greedy exploration. The parameter ε (epsilon) is a probability:
if (rng.nextDouble() < epsilon)
take a random action // explore
else
take the best-known action // exploit
Epsilon starts at 1.0 (fully random) and decays multiplicatively by 0.995 after every training step, down to a floor of 0.05.
In practice this means:
- Day 1: Almost entirely random. The agent knows nothing and is gathering raw experience.
- Days 3–5: Mostly exploiting its learned policy, but still 15–30% random. The agent has useful knowledge but is still refining it.
- Day 8+: 95% exploitation, 5% exploration. The agent trusts what it’s learned, but keeps a small random element to catch opportunities its current policy might be missing.
The decay schedule is deliberate: explore aggressively when ignorant, exploit confidently once informed.
Putting It All Together
Here’s the complete cycle as it runs inside each CampaignEntity actor.
Every bid request (fast path)
bid_cpm = campaign.max_cpm × agent.bid_multiplier
The multiplier is just a cached number. No neural network computation happens here. This path is sub-microsecond.
Every impression
agent.record_impression(cpm)
agent.record_bid_opportunity(won=true)
Increments window counters. Also sub-microsecond.
Every click
agent.record_click()
Increments the window click counter.
Every 15 minutes (slow path)
1. Build state vector from window metrics:
[effective_cpm, ctr, win_rate, budget_remaining,
time_remaining, spend_rate, impression_rate, cost_per_click]
2. If we have a previous state:
a. Compute pacing-scaled reward
b. Store (prev_state, prev_action, reward, state, done) in replay buffer
c. Sample batch of 32 from buffer
d. For each sample: compute Double DQN target, train network
3. Select action: epsilon-greedy (one of 7 multiplier adjustments)
4. Apply: bid_multiplier *= action_adjustment (e.g., 1.1×)
5. Clamp: bid_multiplier stays in [0.5, 2.0]
6. Reset window counters
At day end
1. Store terminal transition (done=true)
2. Reset bid_multiplier to 1.0
3. Keep all network weights — the learned policy carries over to tomorrow
On entity restart
1. Restore network weights from persisted snapshot
2. Replay buffer is empty — agent resumes with learned policy but needs
to re-accumulate experience before it can train again
What the Agent Actually Learns
After several days of operation, a well-trained agent discovers patterns like these — without ever being told they exist:
- Early in the day with full budget: Bid moderately to capture impressions while maintaining pace.
- Budget running low with time remaining: Pull back before the pacing score tanks the reward.
- High click-through rate content: Worth bidding up — clicks are the reward, and good pacing amplifies them.
- Overspending: The pacing score immediately reduces the value of every click — the agent learns to self-correct within one or two observations.
- End of day with leftover budget: Bid up to use remaining budget productively — underspending also reduces the pacing score.
None of these are programmed in. They emerge from the reward signal, through thousands of training steps.
Key Hyperparameters
These are all found in DQNAgent.Config and BidOptimizationAgent.Config.
| Parameter | Value | What it does |
|---|---|---|
| γ (gamma) | 0.99 | How much future rewards matter. High = patient. |
| Learning rate | 0.001 | How aggressively to adjust weights per training step. |
| ε start | 1.0 | Initial exploration rate (fully random). |
| ε end | 0.05 | Minimum exploration (always 5% random). |
| ε decay | 0.995 | How fast to shift from explore to exploit. |
| Buffer size | 10,000 | How much experience to remember. |
| Batch size | 32 | Transitions pulled per training step. |
| Target sync | 100 steps | How often the target network copies the Q-network. |
| Hidden layers | [64, 64] | Size of intermediate layers in the network. |
| Actions | 7 | Multipliers: [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4] |
| Multiplier range | [0.5, 2.0] | Hard bounds on the cumulative bid multiplier. |
| Q-clip | ±100 | Prevents extreme Q-value estimates. |
| Grad clip | ±5.0 | Prevents catastrophically large weight adjustments. |
From Theory to Code
| Concept | File | Key method |
|---|---|---|
| Neural network (forward + training) | DenseNetwork.scala | forward(), train() |
| Experience replay buffer | ReplayBuffer.scala | store(), sample() |
| Double DQN + epsilon-greedy | DQNAgent.scala | selectAction(), trainStep() |
| State / reward / action design | BidOptimizationAgent.scala | toState(), computeReward(), observe() |
| Integration with campaign actor | CampaignEntity.scala | RLObserveTick, TryReserve |
The next chapters cover each of these components in detail with exact formulas and configuration values.
DQN Agent Overview
Each campaign in Promovolve has its own Double DQN reinforcement learning agent that learns to adjust the campaign’s bid multiplier over time. The agent observes campaign performance metrics every 15 minutes and outputs an action that adjusts the bid.
Architecture
graph TB
subgraph CampaignEntity
subgraph BidOptimizationAgent
subgraph DQNAgent
QNet["Q-network: 64, 64"]
Target["Target network (copy)"]
Replay["ReplayBuffer (10,000)"]
Eps["ε-greedy (1.0 → 0.05)"]
end
BidMult["bidMultiplier: min, max"]
Window["Window: imps, clicks, spend"]
end
BidCpm["bidCpm = max(maxCpm × mult, floor)"]
end
Two Speed Paths
Fast Path (Per Bid Request)
1. Receive CampaignBidRequest
2. Check budget, eligibility, size match
3. Compute: bidCpm = max(maxCpm × bidMultiplier, floorCpm)
4. Return CampaignBidResponse with eligible creatives
No RL computation — bidMultiplier is a cached scalar.
Slow Path (Every 15 Minutes)
1. Timer fires (rlObserveInterval = 15 minutes)
2. Compute timeRemaining = max(0, 1.0 - elapsed / rlDayDurationSeconds)
3. Call bidOptAgent.observe(observation)
a. Build 8-dimensional state vector
b. Compute reward from previous window
c. Store transition (s, a, r, s') in replay buffer
d. Select action via ε-greedy
e. Apply action: adjust bidMultiplier
f. Train DQN on batch from replay buffer
4. Reset window counters
Event Recording (Per Request)
recordImpression(spendAmount) → accumulates window metrics
recordClick() → increments window click counter
recordBidOpportunity(won) → tracks win rate
Day Reset
At daily budget rollover:
- Store terminal transition with done=true (final reward = last window clicks)
- Reset bidMultiplier to 1.0
- Reset window counters and day stats
- Preserve DQN weights — learned policy carries across days
- Guard: lastRolledEpochDay prevents double-roll on same calendar day
Inference Mode
val action = if (inferenceOnly)
dqn.selectGreedy(state) // Pure exploitation, no exploration
else
dqn.selectAction(state) // ε-greedy (training mode)
The inferenceOnly flag allows deploying a trained agent without further exploration.
Persistence
The DQN agent’s state is serialized as a DQNAgent.Snapshot (weights, biases, epsilon, step counters) and stored in CampaignEntity.State.rlSnapshot. This survives process crashes and restarts — the agent resumes training from where it left off.
State Space
The DQN agent observes an 8-dimensional state vector computed from the 15-minute observation window.
State Dimensions (from BidOptimizationAgent.scala)
| Index | Name | Formula | Range | Signal |
|---|---|---|---|---|
| 0 | effectiveCpm | clamp(2.0, (maxCpm × bidMultiplier) / maxCpm) | [0, 2.0] | Current bid level |
| 1 | ctr | min(1.0, windowClicks / windowImpressions) | [0, 1.0] | Engagement quality |
| 2 | winRate | windowWins / windowBidOpportunities (default 0.5) | [0, 1] | Competitive position |
| 3 | budgetRemaining | clamp(1.0, budgetRemaining / dailyBudget) | [0, 1.0] | Budget utilization |
| 4 | timeRemaining | clamp(1.0, 1.0 - elapsed / rlDayDurationSeconds) | [0, 1.0] | Time pressure |
| 5 | spendRate | min(3.0, actualSpend / expectedSpend) | [0, 3.0] | Pacing accuracy |
| 6 | impressionRate | min(2.0, windowImpressions / 100.0) | [0, 2.0] | Delivery volume |
| 7 | costPerClick | min(2.0, (windowSpend / windowClicks) / maxCpm) | [0, 2.0] | Efficiency |
Dimension Details
effectiveCpm (index 0)
The normalized bid level: bidMultiplier itself, since maxCpm × bidMultiplier / maxCpm = bidMultiplier. Clamped to [0, 2.0]. Tells the agent what it decided last time.
ctr (index 1)
Click-through rate in the current 15-minute observation window. Zero if no impressions. Provides immediate feedback on creative quality in the current traffic mix.
winRate (index 2)
Fraction of bid opportunities that resulted in the creative being shortlisted. If no bid opportunities occurred, defaults to 0.5 (neutral). Low win rate → bid too low relative to competition.
budgetRemaining (index 3)
Remaining budget as a fraction of daily budget. Combined with timeRemaining, this tells the agent whether it’s on pace:
- High budget + low time → can bid aggressively
- Low budget + high time → must conserve
timeRemaining (index 4)
Fraction of the delivery day remaining. Computed as 1.0 - elapsedSeconds / rlDayDurationSeconds where rlDayDurationSeconds defaults to 86400 (real 24h day) but can be configured shorter for simulation via RL_DAY_DURATION_SECONDS.
spendRate (index 5)
Ratio of actual to expected spend, capped at 3.0x. Expected spend assumes even linear distribution: dailyBudget × (elapsed / totalTime). A spend rate of 1.0 = perfect pacing. Above 1.0 = over-spending.
impressionRate (index 6)
Impressions per 15-minute window, normalized by a baseline of 100 impressions. Capped at 2.0x. Independent of spend — captures delivery volume.
costPerClick (index 7)
Spend per click normalized by maxCpm. Only meaningful when clicks > 0 (returns 0.0 otherwise). High CPC relative to maxCpm suggests the bid is too high for the achieved CTR.
Why These 8 Dimensions?
The state captures minimal sufficient statistics for bidding:
| Pair | Signal |
|---|---|
| Budget + Time | Should I be aggressive or conservative? |
| Win Rate + CPM | Am I competitive? |
| CTR + CPC | Am I getting good value? |
| Spend Rate + Impression Rate | Am I on pace? |
Normalization
All dimensions are bounded (mostly [0, 1] or [0, 2-3]) via min() or clamp(). This is critical for the neural network — unbounded features cause gradient issues. The capping prevents outliers from destabilizing Q-value estimates.
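A small sketch of the kind of bounding involved (the helper names and numbers here are hypothetical, not taken from the codebase):

// Hypothetical helpers illustrating how each state dimension stays bounded.
def cap(upper: Double, x: Double): Double = math.min(upper, x)
def clamp(lower: Double, upper: Double, x: Double): Double =
  math.max(lower, math.min(upper, x))

// e.g. CTR capped at 1.0, spend rate capped at 3.0, budget fraction clamped to [0, 1]
val ctr        = cap(1.0, 7.0 / 250.0)          // 0.028
val spendRate  = cap(3.0, 180.0 / 45.0)         // 3.0 (outlier capped)
val budgetFrac = clamp(0.0, 1.0, 32.5 / 100.0)  // 0.325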
Action Space
The DQN agent selects from 5 discrete actions (configurable), each representing a multiplicative adjustment to the current bid multiplier.
Default Actions (from BidOptimizationAgent.scala)
| Action Index | Adjustment Factor | Effect |
|---|---|---|
| 0 | 0.8x | Bid 20% less — conserve budget |
| 1 | 0.9x | Bid 10% less — slight reduction |
| 2 | 1.0x | Hold — no change |
| 3 | 1.1x | Bid 10% more — slight increase |
| 4 | 1.2x | Bid 20% more — aggressive bidding |
Cumulative Application
Actions are applied cumulatively to the existing multiplier:
newMultiplier = clamp(
minMultiplier,
maxMultiplier,
_bidMultiplier × actionMultipliers(action)
)
Example Sequence
Step 0: multiplier = 1.0 (start of day)
Step 1: action=4 (1.2x) → 1.0 × 1.2 = 1.20
Step 2: action=3 (1.1x) → 1.2 × 1.1 = 1.32
Step 3: action=0 (0.8x) → 1.32 × 0.8 = 1.056
Step 4: action=4 (1.2x) → 1.056 × 1.2 = 1.267
Multiplier Bounds
The multiplier is clamped to [minMultiplier, maxMultiplier] (configurable per agent):
- Minimum: Prevents the bid from becoming uncompetitive
- Maximum: Prevents overpaying
The effective bid is always floored at floorCpm:
bidCpm = max(maxCpm × bidMultiplier, floorCpm)
Why Discrete Actions?
Alternative: Continuous Actions
Continuous actions (e.g., output multiplier directly) would require DDPG or SAC:
- More complex, separate actor and critic networks
- Harder to explore in bounded spaces
- Not worth the complexity for this problem size
Advantages of Discrete
- DQN is well-understood and stable
- 5 actions provide sufficient granularity (10-20% adjustments per step)
- Each action has a clear semantic interpretation
- Cumulative application allows reaching any multiplier within bounds over multiple steps
Symmetric Design
Unlike some bid optimization systems that use asymmetric action ranges, Promovolve’s 5 actions are symmetric around the hold action (1.0x):
- Two decrease levels: 0.8x, 0.9x
- Two increase levels: 1.1x, 1.2x
- One hold: 1.0x
This symmetry means the agent has equal capacity to increase or decrease bids, with no built-in bias toward either direction.
Reward Function
The reward function defines what the DQN agent optimizes for. From BidOptimizationAgent.scala:
Formula
reward = clickReward - overspendPenalty
where:
clickReward = windowClicks.toDouble
overspendPenalty = if (spendRate > 1.5)
config.overspendPenalty × (spendRate - 1.5)
else 0.0
Default penalty factor: overspendPenalty = 2.0
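A small self-contained sketch of this reward (the function name and defaults are illustrative); it reproduces the example values shown further below:

// Sketch of the window reward: clicks minus a linear penalty above 1.5x pacing.
def reward(windowClicks: Int, spendRate: Double, overspendPenalty: Double = 2.0): Double = {
  val penalty = if (spendRate > 1.5) overspendPenalty * (spendRate - 1.5) else 0.0
  windowClicks.toDouble - penalty
}

reward(3, 1.0) // 3.0 (on pace)
reward(5, 1.8) // 4.4 (penalty 2.0 × 0.3 = 0.6)
reward(2, 3.0) // -1.0 (penalty 2.0 × 1.5 = 3.0)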
Component Breakdown
Clicks (Primary Signal)
Raw number of clicks in the 15-minute observation window. This is the positive signal — the agent maximizes clicks because that’s what advertisers care about.
Why clicks, not impressions?
- Impressions don’t indicate value — they’re “free” from the user’s perspective
- Clicks represent actual engagement
- Maximizing clicks naturally selects for high-CTR placements
Why clicks, not revenue?
- Revenue (CPM × impressions) would incentivize bidding as high as possible
- This contradicts the advertiser’s interest in efficient spending
- Clicks align the agent with advertiser ROI
Overspend Penalty
overspendPenalty = 2.0 × max(0, spendRate - 1.5)
- Threshold at 1.5x: No penalty for spending up to 50% faster than target. Gives the agent freedom to bid aggressively when opportunities are good.
- 2.0x factor: Each unit of overspend above 1.5x costs 2.0 reward points
- Continuous: Allows the agent to learn the trade-off rather than hitting a hard wall
Examples:
spendRate = 1.0 → penalty = 0 (on pace)
spendRate = 1.5 → penalty = 0 (at threshold)
spendRate = 2.0 → penalty = 1.0 (moderate overspend)
spendRate = 3.0 → penalty = 3.0 (severe overspend)
Episode Termination
The episode terminates when:
done = (budgetRemaining <= 0.0) || (timeRemaining <= 0.0)
At termination, a special terminal transition is stored:
val terminalState = Array.fill(stateSize)(0.0) // Zero vector
val terminalReward = windowClicks.toDouble // Final clicks (no penalty)
dqn.store(prevState, prevAction, terminalReward, terminalState, done = true)
The done = true flag tells DQN not to bootstrap future rewards beyond the episode boundary.
Reward Examples
| Window | Clicks | spendRate | Penalty | Reward |
|---|---|---|---|---|
| Normal pacing | 3 | 1.0 | 0 | 3.0 |
| Good CTR | 8 | 1.2 | 0 | 8.0 |
| Slight overspend | 5 | 1.8 | 0.6 | 4.4 |
| Severe overspend | 2 | 3.0 | 3.0 | -1.0 |
| At threshold | 4 | 1.5 | 0 | 4.0 |
Design Simplicity
Note what the reward function does not include:
- No exhaustion penalty — the episode simply ends when budget hits zero
- No CPA signal — conversion tracking is sparse, clicks are a sufficient proxy
- No win-rate bonus — win rate is in the state space, letting the agent learn its own trade-offs
This simplicity makes the reward signal clean and easy to interpret. The agent learns that clicks are good and overspending is bad — everything else it figures out from the state space.
Double DQN Architecture
Promovolve uses Double DQN (Van Hasselt et al., 2016) with a custom pure-Scala neural network implementation — no external ML framework dependencies.
The Overestimation Problem
Standard DQN uses the same network to both select and evaluate the best action:
target = reward + γ × max_a Q(s', a; θ)
The max operator introduces positive bias: noisy Q-values get selected at their peaks, systematically overestimating. Over many updates, this compounds into over-optimistic Q-values and suboptimal policies (e.g., over-bidding).
Double DQN Solution
Decouple selection from evaluation:
a* = argmax_a Q(s', a; θ) ← Q-network selects action
target = reward + γ × Q(s', a*; θ⁻) ← Target network evaluates it
Since θ and θ⁻ have different parameters, their noise is independent, breaking the correlation.
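A minimal sketch of that target computation for a single transition, assuming forward functions that return one Q-value per action (as DenseNetwork's forward() does):

// Sketch of the Double DQN target for one transition (s, a, r, s', done).
def doubleDqnTarget(
    reward: Double,
    nextState: Array[Double],
    done: Boolean,
    gamma: Double,
    qForward: Array[Double] => Array[Double],      // online network θ
    targetForward: Array[Double] => Array[Double]  // target network θ⁻
): Double =
  if (done) reward
  else {
    val qNext = qForward(nextState)
    val aStar = qNext.indices.maxBy(i => qNext(i))   // selection by θ
    reward + gamma * targetForward(nextState)(aStar) // evaluation by θ⁻
  }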
Network Architecture (DenseNetwork.scala)
graph TD
Input["Input Layer: 8 neurons<br/>(state dimensions)"] --> H1["Hidden Layer 1: 64 neurons<br/>ReLU activation"]
H1 --> H2["Hidden Layer 2: 64 neurons<br/>ReLU activation"]
H2 --> Output["Output Layer: 5 neurons<br/>(Q-value per action, linear)"]
Both Q-network and target network share this architecture.
Weight Initialization
Xavier initialization with Gaussian sampling:
scale = sqrt(2.0 / fanIn)
weight = rng.nextGaussian() × scale
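A small sketch of that initialization for one layer, using scala.util.Random (the helper name is illustrative):

import scala.util.Random

// Sketch: a weight matrix of shape (fanOut, fanIn) filled with Gaussian
// samples scaled by sqrt(2 / fanIn), per the formula above.
def initWeights(fanIn: Int, fanOut: Int, rng: Random): Array[Array[Double]] = {
  val scale = math.sqrt(2.0 / fanIn)
  Array.fill(fanOut, fanIn)(rng.nextGaussian() * scale)
}

val hidden1 = initWeights(fanIn = 8, fanOut = 64, new Random(42)) // 64 × 8 matrix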
Forward Pass
Sequential layer computation:
- Hidden layers: output = ReLU(W × input + bias), where ReLU(x) = max(0, x)
- Output layer: output = W × input + bias (linear, no activation)
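A compact sketch of that computation (illustrative, not the DenseNetwork code itself); weights(l) holds the matrix for layer l and biases(l) its bias vector:

// Dense forward pass sketch: ReLU on hidden layers, linear output layer.
def forward(
    input: Array[Double],
    weights: Seq[Array[Array[Double]]], // one (out × in) matrix per layer
    biases: Seq[Array[Double]]          // one bias vector per layer
): Array[Double] = {
  def layer(x: Array[Double], w: Array[Array[Double]], b: Array[Double], relu: Boolean): Array[Double] =
    w.indices.map { j =>
      val z = b(j) + w(j).indices.map(k => w(j)(k) * x(k)).sum
      if (relu) math.max(0.0, z) else z
    }.toArray

  val lastIdx = weights.size - 1
  weights.indices.foldLeft(input) { (x, l) =>
    layer(x, weights(l), biases(l), relu = l < lastIdx)
  }
}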
Backpropagation
Standard SGD with MSE loss:
loss = sum((output[i] - target[i])²) / outputSize
gradient_output: delta[i] = 2.0 × (output[i] - target[i]) / outputSize
gradient_hidden: delta[k] = if (activation[k] > 0) Σ_j w[j][k] × nextDelta[j] else 0 (ReLU derivative gates the error backpropagated through the next layer's weights)
weight_update: w[j][k] -= learningRate × delta[j] × activation[k]
bias_update: b[j] -= learningRate × delta[j]
Loss is applied only to the taken action (one-hot).
Target Network Sync
if (trainSteps % targetSyncInterval == 0):
targetNetwork.copyFrom(qNetwork) // Full weight copy via System.arraycopy
Initial sync on agent creation ensures both networks start identical.
Q-Value Clipping
target[action] = clamp(-qClip, qClip, target[action])
Default qClip = 100.0. Safety measure against divergence during early training.
Why Pure Scala?
The DQN implementation doesn’t depend on TensorFlow, PyTorch, or DL4J:
- Deployment simplicity: No native library dependencies, runs on any JVM
- Integration: Lives inside the Pekko actor system, no inter-process communication
- Scale: The network is tiny (8→64→64→5 ≈ 5,000 parameters) — framework overhead would dominate
- Persistence: Weights serialize as Array[Double], stored in Pekko’s durable state alongside campaign data
Training Loop & Hyperparameters
Hyperparameters (from DQNAgent.scala and BidOptimizationAgent.scala)
| Parameter | Default | Source |
|---|---|---|
| Hidden layers | [64, 64] | DQNAgent.Config |
| Learning rate | 0.001 | DQNAgent.Config |
| Gamma (discount) | 0.99 | DQNAgent.Config |
| Replay buffer size | 10,000 | DQNAgent.Config |
| Batch size | 32 | DQNAgent.Config |
| Min buffer size | 100 | DQNAgent.Config (before first training step) |
| Target sync interval | 100 steps | DQNAgent.Config |
| Epsilon start | 1.0 | DQNAgent.Config |
| Epsilon end | 0.05 | DQNAgent.Config |
| Epsilon decay | 0.995 | DQNAgent.Config |
| Q-value clip | [-100, 100] | DQNAgent.Config |
| Observation interval | 15 minutes | promovolve.rl.observe-interval |
| Day duration | 86400s | promovolve.rl.day-duration-seconds |
Training Schedule
Every 15 minutes (96 times per real day):
┌───────────────────────────────┐
│ 1. Timer fires in Campaign │ rlObserveInterval = 15 min
│ 2. Compute timeRemaining │ 1.0 - elapsed / dayDuration
│ 3. Build observation │ windowImps, clicks, spend, etc.
│ 4. Call bidOptAgent.observe() │
│ a. Build 8-dim state │
│ b. Compute reward │ clicks - overspendPenalty
│ c. Store (s,a,r,s',done) │ → replay buffer
│ d. ε-greedy action select │ random if ε, else argmax Q(s)
│ e. Apply action │ bidMultiplier *= adjustment
│ f. Sample batch (32) │ from replay buffer
│ g. Compute Double DQN loss │
│ h. Backprop + weight update │
│ i. Maybe sync target net │ every 100 train steps
│ 5. Reset window counters │
└───────────────────────────────┘
Epsilon-Greedy Exploration
if (rng.nextDouble() < epsilon):
action = rng.nextInt(actionSize) // random exploration
else:
action = argmax(qNetwork.forward(state)) // exploitation
Epsilon decays after each training step:
epsilon = max(epsilonEnd, epsilon × epsilonDecay)
Decay Timeline (96 steps/day)
Day 1: ε ≈ 1.00 → 100% random (pure exploration)
Day 2: ε ≈ 0.62 → 62% random
Day 3: ε ≈ 0.38 → 38% random
Day 5: ε ≈ 0.15 → 15% random
Day 8: ε ≈ 0.05 → 5% random (hits floor)
Day 8+: ε = 0.05 → 5% random (steady-state)
The 5% floor ensures ongoing exploration to adapt to changing conditions.
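The timeline follows directly from the decay formula; a quick check in plain Scala:

// Epsilon after n training steps, with decay 0.995 and a 0.05 floor.
def epsilonAfter(steps: Int, start: Double = 1.0, decay: Double = 0.995, floor: Double = 0.05): Double =
  math.max(floor, start * math.pow(decay, steps))

epsilonAfter(96)     // ≈ 0.62 after day 1 (start of day 2)
epsilonAfter(96 * 2) // ≈ 0.38
epsilonAfter(96 * 4) // ≈ 0.15
epsilonAfter(96 * 7) // 0.05 (floor reached)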
Replay Buffer (ReplayBuffer.scala)
Structure
private val states: Array[Array[Double]]
private val actions: Array[Int]
private val rewards: Array[Double]
private val nextStates: Array[Array[Double]]
private val dones: Array[Boolean]
Mechanics
- Capacity: 10,000 transitions (~104 days at 96 steps/day)
- Circular buffer: writeIdx = (writeIdx + 1) % capacity
- Sampling: Uniform random (indices.map(rng.nextInt(currentSize)))
- Min size: Training starts only after 100 transitions are stored
- No prioritization: All transitions are equally likely to be sampled
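For intuition, here is a minimal circular-buffer sketch (simplified; the real ReplayBuffer keeps the parallel arrays shown above rather than one array of transition objects):

import scala.util.Random

// Minimal circular replay buffer: fixed capacity, overwrite oldest,
// uniform random sampling over whatever has been stored so far.
final class MiniReplayBuffer[T: scala.reflect.ClassTag](capacity: Int, rng: Random) {
  private val data = new Array[T](capacity)
  private var writeIdx = 0
  private var size = 0

  def store(t: T): Unit = {
    data(writeIdx) = t
    writeIdx = (writeIdx + 1) % capacity
    size = math.min(size + 1, capacity)
  }

  def sample(batchSize: Int): Array[T] =
    Array.fill(batchSize)(data(rng.nextInt(size)))
}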
Why Uniform?
- State space is small (8-D) — network learns quickly from uniform samples
- Prioritized Experience Replay adds complexity (sum trees, importance sampling) for marginal benefit
- 15-minute windows already provide stable, low-noise transitions
Terminal Transitions
At day rollover:
if prevState exists:
terminalState = Array.fill(stateSize)(0.0) // zero vector
terminalReward = windowClicks.toDouble
dqn.store(prevState, prevAction, terminalReward, terminalState, done = true)
The done = true flag prevents Q-value bootstrapping across day boundaries:
if done:
target = reward // no future rewards
else:
target = reward + γ × Q(s', argmax Q(s'; θ); θ⁻) // Double DQN
Convergence Characteristics
- Days 1-3: Mostly random (ε > 0.38), building replay buffer, Q-network starts distinguishing good/bad actions
- Days 4-7: Agent develops basic policy (bid up when budget ample, down when overspending)
- Day 8+: ε hits floor (0.05), policy stabilizes with 5% ongoing exploration
- Cross-day: Weights persist, but multiplier resets to 1.0 daily — the agent re-learns optimal trajectory each day using its accumulated Q-network knowledge
Monitoring
// Q-values for inspection
def qValues(state: Array[Double]): Array[Double] = qNetwork.forward(state)
// Day statistics
DayStats(impressions, clicks, spend, observations, totalReward)
Available via DQNAgent.Snapshot: epsilon, totalSteps, trainSteps, network weights.
Distributed State from Scratch
When a user loads a page, Promovolve needs to find the right ad in under a millisecond. Why not just query a database?
This chapter explains why Promovolve uses replicated in-memory state instead of a database, how it keeps multiple copies in sync without a leader, and why “eventually consistent” is not just acceptable but actually the right choice for ad serving.
The Latency Problem
A PostgreSQL query takes 1-5 milliseconds on a fast local connection. That sounds fast. But the ad serving path runs on every page load, for every ad slot, for every user. At 1,000 requests per second with 3 ad slots each, that’s 3,000 database queries per second — each adding latency to the user’s page load.
Worse, database latency has a long tail. The median might be 2ms, but the 99th percentile might be 20ms. One slow query blocks the response. Under load, connection pool contention adds more. For a publisher who cares about page performance, every millisecond of ad serving latency is a tax on their readers.
Promovolve’s target: serve an ad in under 1 millisecond. No database can deliver that consistently under load. The data needs to be in memory, on the same machine that handles the request.
The Obvious Solution (and Why It Doesn’t Work)
“Just put it in a local cache.” Load the auction results into a HashMap on each API node. Reads are nanoseconds. Problem solved.
Not quite. Promovolve runs as a cluster of multiple nodes (for reliability and throughput). If node A runs an auction and updates its local cache, nodes B and C don’t know about it. A user whose request lands on node B gets stale data — or no data at all.
You need the data replicated across all nodes. The question is: how?
Option 1: Leader-Based Replication
The traditional approach: one node is the “leader” (or “primary”). All writes go through the leader. The leader replicates to followers.
Write → Leader → Follower 1
→ Follower 2
→ Follower 3
This is how PostgreSQL replication, Redis Sentinel, and most databases work. It provides strong consistency — all nodes see the same data after each write.
The problems:
- The leader is a bottleneck. All writes go through one node. If that node is slow or down, writes stall.
- Leader failure requires election. Detecting a dead leader, electing a new one, and catching up takes seconds. During that time, writes fail.
- Network partitions are ugly. If the leader can’t reach some followers, it must choose: keep accepting writes (risking divergence) or stop accepting writes (sacrificing availability).
For a database backing a billing system, these trade-offs are worth it. For an ad serving cache that refreshes every 5 minutes, they’re overkill.
Option 2: Replicate Without a Leader
What if every node can write, and writes automatically propagate to all other nodes?
Node A writes → gossip → Node B receives
Node B writes → gossip → Node A receives
No leader. No election. No single point of failure. Every node accepts writes locally (fast) and syncs with others in the background (eventually).
The problem: what happens when two nodes write different values for the same key at the same time? With a leader, this can’t happen — all writes go through one place. Without a leader, you need a way to resolve conflicts automatically.
CRDTs: Data Structures That Merge Themselves
A CRDT (Conflict-free Replicated Data Type) is a data structure designed so that concurrent writes can always be merged without conflict. The merge is deterministic — no matter what order updates arrive, all nodes converge to the same result.
The simplest example: a counter. Instead of storing “count = 5”, each node stores “my contribution is X”:
Node A: my_count = 3
Node B: my_count = 2
Merge: total = 3 + 2 = 5
Both nodes increment their own counter. The merge just sums them. No conflict possible.
Promovolve uses a more sophisticated CRDT: LWWMap (Last-Writer-Wins Map). It’s a key-value map where each entry has a timestamp. When two nodes write different values for the same key, the one with the later timestamp wins:
Node A writes: key="ad-slot-1" → creative_X at t=1000
Node B writes: key="ad-slot-1" → creative_Y at t=1003
Merge: creative_Y wins (later timestamp)
This is simple and predictable. The “conflict resolution” is just “most recent write wins” — the same semantics as overwriting a variable.
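To make the merge rule concrete, here is a toy last-writer-wins value in plain Scala. This is not Pekko's implementation, just the idea; LWWMap applies the same rule independently per key:

// Toy LWW register: the entry with the higher timestamp wins on merge.
final case class LWW[A](value: A, timestampMs: Long) {
  def merge(other: LWW[A]): LWW[A] =
    if (other.timestampMs > timestampMs) other else this
}

val nodeA = LWW("creative_X", timestampMs = 1000L)
val nodeB = LWW("creative_Y", timestampMs = 1003L)

nodeA.merge(nodeB) // LWW("creative_Y", 1003): same result on every node
nodeB.merge(nodeA) // LWW("creative_Y", 1003): merge order doesn't matter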
How Promovolve Uses DData
Pekko’s Distributed Data (DData) implements CRDTs with a gossip protocol. Every few seconds, each node shares its data changes with random peers. Changes propagate through the cluster like gossip in a social network — eventually reaching every node.
Promovolve’s ServeIndex stores the auction results that the serve path needs:
Key: "site-123|slot-banner-top|bucket-7"
Value: LWWMap of creative candidates
"creative-abc" → ServeView(assetUrl, cpm, ctr_stats, expires_at, ...)
"creative-def" → ServeView(...)
When the AuctioneerEntity completes an auction, it writes results to DData with WriteLocal — the write completes immediately on the local node. Within 2 seconds (the gossip interval), other nodes receive the update.
When the serve path needs to select an ad, it reads from the local DData replica — a local in-memory lookup, no network hop.
Why “Eventually Consistent” Is Fine Here
“Eventually consistent” sounds scary. What if a user gets stale data?
For ad serving, consider what “stale” means in practice:
A creative was updated 2 seconds ago but this node hasn’t received the gossip yet. The user sees the previous creative. Is this a problem? No — the creative was valid 2 seconds ago. It’s still a legitimate ad with a valid tracking URL. The user has no way to notice the difference.
A campaign ran out of budget but the serve node still shows its creative. The pacing gate and budget reservation catch this. Even if the ServeIndex has a stale entry, the budget check (which queries the CampaignEntity directly) will reject the serve and fall back to the next candidate. The stale cache entry is harmless.
An auction ran and produced new candidates, but this node still has the old ones. The old candidates are at most 5 minutes stale (the re-auction interval). They were valid winners of the previous auction. Serving them for 2 more seconds until gossip arrives is fine.
The key insight: the serve path is approximate by design. Thompson Sampling adds randomness. Pacing throttles probabilistically. Click-through rates are estimates. Adding 2 seconds of gossip delay to a system that already operates on statistical estimates doesn’t meaningfully degrade the outcome.
Strong consistency would give you a guarantee you don’t need, at a cost (leader bottleneck, cross-node coordination latency) that directly hurts the thing you do need: speed.
Bucketing: Keeping CRDTs Small
One problem with LWWMap: if you put thousands of entries in a single map, every gossip cycle transmits the entire delta (all changes since last sync). With frequent updates across many creatives, deltas grow large.
Promovolve splits each namespace into 32 buckets by hashing the key:
bucket = hash(creativeId) % 32
key = "site-123|slot-banner|bucket-7"
Each bucket is a separate LWWMap. An auction that updates 10 creatives touches maybe 8-10 buckets, not all 32. Gossip only transmits the buckets that changed. This keeps delta sizes small and gossip efficient.
Why 32? It’s a balance. More buckets means smaller deltas but more DData keys to manage. Fewer means larger deltas but simpler bookkeeping. 32 works well for the typical case of dozens to hundreds of creatives per site.
Write Consistency: Fast Writes, Safe Deletes
Not all writes are equal. Promovolve uses different consistency levels depending on the operation:
Writes (Put, Append, Update CPM): WriteLocal
The write succeeds immediately on the local node. Gossip propagates it. If the local node crashes before gossip, the write is lost — but the next auction (within 5 minutes) will repopulate it.
This is the right trade-off for the hot path. Auction results are ephemeral and refreshed frequently. Speed matters more than durability.
Deletes (Remove campaign, Remove creative): WriteMajority
Removing a creative should be seen by all nodes quickly — you don’t want a paused campaign’s ad to keep serving because one node missed the delete. WriteMajority waits for acknowledgment from a majority of nodes (e.g., 2 out of 3) before confirming.
If WriteMajority times out (800ms), Promovolve retries up to 5 times with 200ms backoff. Removing an ad that shouldn’t serve is more important than speed.
What About Node Restarts?
DData is in-memory. If a node restarts, its local replica is empty. What happens?
In a multi-node cluster: The restarted node receives the full state from other nodes via DData’s anti-entropy protocol. Within one gossip cycle (2 seconds), it has a complete replica.
In a single-node cluster (development): The data is gone. The next PeriodicReauction timer (within 5 minutes) re-runs auctions and repopulates the ServeIndex. During the gap, the serve path returns NoContent (HTTP 204) — the ad slot is empty. Not ideal, but bounded.
Promovolve deliberately does not persist ServeIndex to disk (unlike the shard-* keys that use LMDB). Persisting the hot serve path would add disk I/O to every auction write, which defeats the purpose of in-memory state for sub-millisecond reads. The 5-minute recovery window is an acceptable trade-off.
The Full Picture
Auction completes
→ AuctioneerEntity writes to ServeIndex (WriteLocal, ~0ms)
→ Gossip propagates to all nodes (~2 seconds)
→ Every node has candidates in local memory
User loads page
→ API node reads local ServeIndex replica (~0.1ms)
→ Pacing gate + Thompson Sampling (~0.1ms)
→ Ad response sent (<1ms total)
No database in the serve path. No network hop for reads. No leader to bottleneck. No election to delay. The system trades strong consistency (which it doesn’t need) for speed (which it does).
From Theory to Code
| Concept | File | Key method |
|---|---|---|
| ServeIndex DData actor | ServeIndexDData.scala | Put, Append, Remove commands |
| Bucketed LWWMap keys | ServeIndexDData.scala | mapKey(pub, bucket) |
| WriteLocal vs WriteMajority | ServeIndexDData.scala | Replicator.WriteLocal, Replicator.WriteMajority |
| TTL sweep (expire stale entries) | ServeIndexDData.scala | Sweep command |
| Gossip and replication config | application.conf | pekko.cluster.distributed-data |
| DData adapter in serve path | AdServer.scala | ServeIndexDData lookup |
The next chapters cover the bucketed LWWMap design, TTL expiration, and write consistency levels in detail.
ServeIndex & DData
The ServeIndex is Promovolve’s distributed in-memory cache storing auction results for instant serve-time lookups, built on Pekko Distributed Data (DData).
Why DData?
Every API node must serve ads without network round-trips:
| Alternative | Problem |
|---|---|
| Database (PostgreSQL) | 1-10ms per query |
| Remote cache (Redis) | ~0.5ms network hop |
| Sharded in-memory | Requires request routing |
| DData | Local replica on every node, gossip replication |
Data Model
ServeIndex
└── Bucket[0..31] (32 buckets, power-of-2)
└── LWWMap[String, ServeView]
Key: "siteId|slotId"
Value: ServeView
ServeView
case class ServeView(
candidates: Vector[CandidateView],
version: Long, // e.g., auction timestamp
expiresAtMs: Long // epoch millis; for TTL sweep
) extends CborSerializable
CandidateView
case class CandidateView(
creativeId: CreativeId,
campaignId: CampaignId,
advertiserId: AdvertiserId,
assetUrl: CDNPath, // URI to CDN-hosted asset
mime: MimeType, // imageJpeg, imagePng, imageGif, imageWebp, videoMp4
width: Int,
height: Int,
category: CategoryId,
cpm: CPM,
classifiedAtMs: Long, // when page content was classified
categoryScore: Double = 0.5, // classifierConfidence × rankerWeight
frequencyCap: Option[Int] = None,
adProductCategory: Option[AdProductCategoryId] = None,
landingDomain: String = ""
) extends CborSerializable
DData Configuration (from application.conf)
| Setting | Value |
|---|---|
| Gossip interval | 2s |
| Notify subscribers | 500ms |
| Max delta elements | 500 |
| Durable keys | shard-*, exhausted-campaigns |
| Durable store | LMDB (100 MiB, 200ms write-behind) |
| Pruning interval | 120s |
Note: ServeIndex entries are not in the durable keys list — they are ephemeral and rebuilt from auctions on restart. Only shard metadata and exhausted-campaign flags are LMDB-durable.
Bucketed LWWMap Design
The ServeIndex partitions entries into 32 buckets to keep CRDT delta sizes small.
Why Buckets?
A single LWWMap containing all entries would produce large deltas on any change. Bucketing partitions the keyspace:
bucket = abs(key.hashCode) % 32
Each bucket is an independent LWWMap. An update to bucket 7 only produces a delta for bucket 7.
LWWMap (Last-Writer-Wins Map)
Conflicts are resolved by timestamp — the value with the higher timestamp wins. This is safe because:
- Auction results are timestamped by the auction itself
- Newer auctions should always override older ones
- Concurrent auctions for the same slot are impossible (AuctioneerEntity is sharded by siteId)
Bucket Count: Why 32?
- Too few (4): ~25% of entries per bucket → large deltas
- Too many (1024): CRDT management overhead outweighs savings
- 32: With 10,000 entries, ~312 per bucket. Balanced.
Per-Publisher Namespace
The composite key "siteId|slotId" naturally partitions entries by publisher. Slots from different sites usually land in different buckets, thanks to the hash distribution.
DData Gossip Impact
With 32 buckets and max-delta-elements of 500:
- Each gossip round can propagate up to 500 changes across all buckets
- A single auction updating 10 slots affects at most 10 buckets
- Other buckets’ gossip is unaffected
TTL Sweep & Expiration
ServeIndex entries have a time-to-live to prevent stale ads from serving indefinitely.
TTL Assignment
When writing auction results:
expiresAtMs = System.currentTimeMillis() + ttlDurationMs
Default TTL: 120 minutes. Under normal operation, the next auction refreshes the entry before TTL expires.
Budget Exhaustion TTL Refresh
On CampaignBudgetExhausted or AdvertiserBudgetExhausted:
expiresAtMs = System.currentTimeMillis() + (dayDurationSeconds × 1.1 × 1000)
The 1.1x factor ensures the entry survives until well past the next daily budget reset.
Periodic Sweep
From ServeIndexDData.scala:
SweepInterval = 2.minutes
MaxKeysRemovePerRun = 500
Every 2 minutes, each node scans all 32 buckets:
for each bucket:
entries = bucket.entries
expired = entries.filter(e => now > e.expiresAtMs)
remove up to 500 expired entries from this bucket
Bounded Removals
The 500-per-bucket limit prevents a large batch of expirations from overwhelming DData:
- 32 buckets × 500 = up to 16,000 entries per sweep
- In practice, expirations are spread across time, so batches are smaller
Why Not Instant Expiration?
| Approach | Problem |
|---|---|
| Instant expiry | Clock skew between nodes → entries flicker |
| Individual removes | Many small deltas → gossip overhead |
| Batched sweep | Predictable load, clock-skew tolerant |
The 2-minute sweep interval means an expired entry might serve for up to 2 extra minutes. This is acceptable — the pacing gate and budget checks provide additional safety at serve time.
Write Consistency Levels
DData supports different consistency levels. Promovolve uses different levels depending on operation criticality.
Consistency Choices (from ServeIndexDData.scala)
| Operation | Consistency | Timeout | Retries | Rationale |
|---|---|---|---|---|
| Put (full replacement) | WriteLocal | — | — | Speed; next auction refreshes |
| Append (single candidate) | WriteLocal | — | — | Speed; dedup prevents issues |
| CPM update | WriteLocal | — | — | Best-effort price refresh |
| FilterByCreativeIds | WriteLocal | — | — | Batch cleanup |
| Remove (takedown) | WriteMajority | 800ms | 5 (200ms backoff) | Must be durable |
| RemoveCampaignFromKey | WriteMajority | 800ms | 5 | Must be durable |
| RemoveCreativeFromKey | WriteMajority | 800ms | 5 | Must be durable |
| RemoveBySite | WriteMajority | 800ms | 5 | Must be durable |
Why WriteLocal for Puts?
Auction results are written frequently and losing one write is not catastrophic:
- The next crawl cycle produces fresh results
- Gossip replicates to other nodes within seconds (2s gossip interval)
- Stale data is caught by the TTL sweep
Why WriteMajority for Removes?
Removes must be durable. If a remove only reaches one node and that node crashes:
- The entry reappears on restart from other nodes’ copies
- A “zombie” creative that was supposed to be taken down continues serving
- This is a compliance/safety concern (paused campaigns, suspended advertisers)
WriteMajority ensures the remove is acknowledged by a majority of nodes before returning.
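A rough sketch of what the two consistency levels look like with Pekko's classic Distributed Data API. The key id and the String value are placeholders, and the actual ServeIndexDData code may differ:

import scala.concurrent.duration._
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.cluster.ddata.{DistributedData, LWWMap, LWWMapKey, SelfUniqueAddress}
import org.apache.pekko.cluster.ddata.Replicator

// Assumes a cluster-enabled ActorSystem; the value type is a placeholder String.
val system: ActorSystem = ActorSystem("promovolve")
val replicator = DistributedData(system).replicator
implicit val node: SelfUniqueAddress = DistributedData(system).selfUniqueAddress

val key = LWWMapKey[String, String]("site-123|slot-banner|bucket-7")

// Put: local write, gossip propagates within ~2 seconds.
replicator ! Replicator.Update(key, LWWMap.empty[String, String], Replicator.WriteLocal)(
  _ :+ ("creative-abc" -> "serialized-serve-view")
)

// Remove: wait for a majority of nodes to acknowledge, with an 800ms timeout.
replicator ! Replicator.Update(key, LWWMap.empty[String, String], Replicator.WriteMajority(800.millis))(
  _.remove(node, "creative-abc")
)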
Retry Strategy
MaxRemoveRetries = 5
InitialRetryBackoff = 200.millis
If WriteMajority times out (800ms), the remove is retried with exponential backoff. After 5 failures, the removal is logged and will be caught by the next TTL sweep.
Eventual Consistency Window
WriteLocal operations have a brief window (typically <2s, matching gossip interval) where different API nodes see different ServeIndex contents. This means:
- Two concurrent requests to different nodes might get different creatives
- A just-written entry might not be visible everywhere immediately
These are acceptable because:
- Thompson Sampling already introduces per-request randomness
- The 15-minute RL window averages over many decisions
- Budget and pause checks at serve time catch any “shouldn’t serve” cases
Learning Reinforcement Learning Through Ad Bidding
This tutorial teaches reinforcement learning (RL) from first principles, using Promovolve’s bid optimization system as the running example. Every concept is grounded in real, production Scala code — no toy problems, no gym environments, no Python notebooks.
By the end, you’ll understand:
- What reinforcement learning is and when to use it
- How to model a real-world problem as an RL task (states, actions, rewards)
- How neural networks learn to approximate value functions
- How DQN and Double DQN work, and why they matter
- How experience replay and target networks stabilize training
- How to deploy RL in a production system that handles real money
Why ad bidding is a great RL problem
Most RL tutorials use video games or grid worlds. These are fine for building intuition, but they don’t teach you how RL behaves when:
- The environment is non-stationary — traffic patterns change by hour, day, and season
- Episodes have real economic consequences — overspending wastes advertiser money, underspending leaves revenue on the table
- Observations are noisy and delayed — you only see aggregated metrics every 15 minutes, not instant per-action feedback
- The state space is continuous — not a grid, but a blend of rates, fractions, and normalized signals
- Multiple agents interact — hundreds of campaigns compete simultaneously, each learning its own policy
Promovolve’s bid optimization agent faces all of these. It’s a ~400-line pure Scala implementation with no ML framework dependencies — you can read every line of the forward pass, backpropagation, and training loop.
Prerequisites
You should be comfortable with:
- Basic programming concepts (loops, arrays, functions)
- High school math (derivatives, basic probability)
- No prior ML/RL knowledge required — we build everything from scratch
Chapters
- The Problem: Why Bid Optimization Needs RL
- RL Fundamentals: Agent, Environment, Reward
- Building a Neural Network From Scratch
- From Q-Tables to Deep Q-Networks
- Experience Replay: Learning From the Past
- Double DQN: Fixing Overestimation
- Putting It Together: The BidOptimizationAgent
- Training in Production: Episodes, Persistence, and Day Resets
The Problem: Why Bid Optimization Needs RL
Imagine you are running an online ad campaign. You have a daily budget of $100, a maximum CPM (cost per thousand impressions) of $5, and a full day to spend that budget wisely. Your goal is simple: get as many clicks as possible without blowing through the money too early or leaving it unspent at the end of the day.
How hard can it be?
The naive approach: bid the same amount all day
The simplest strategy is to bid the same amount on every auction throughout the day. Set your CPM to, say, $3 and let it ride.
The problem is that web traffic is not uniform. At 10am, when office workers are browsing news sites, traffic surges. There are more impressions available but also more competing advertisers, which drives up prices. Your $3 bid might win almost nothing during this peak because others are bidding $4 or $5.
Then at 3am, the audience shrinks but so does the competition. Cheap inventory is available at $1 CPM, but you are still bidding $3 — overpaying for impressions you could have won for less.
A flat bidding strategy fails because the market is not flat.
The rule-based approach: manually adjust throughout the day
You could try writing rules:
- “If we’ve spent more than half the budget before noon, reduce the bid by 20%.”
- “If traffic is low and we’re under-spending, increase the bid by 10%.”
This works better than the naive approach, but you quickly run into problems:
- The thresholds are arbitrary. Why 20% and not 15%? Why noon and not 1pm? You picked these numbers by intuition, and different campaigns will need different ones.
- The market changes. A new competitor launches a campaign on Tuesday that was not there on Monday. Your carefully tuned rules are now wrong.
- Interactions are complex. Raising your bid affects your win rate, which affects your spend rate, which triggers your spend-pacing rule, which lowers your bid, which drops your win rate. Rule-based systems tend to oscillate or get stuck.
- There are too many variables. Budget remaining, time remaining, current win rate, click-through rate, cost per click, impression volume — these all interact in non-obvious ways. Writing rules that handle every combination correctly is a combinatorial nightmare.
The environment is non-stationary
This is the deeper issue. The advertising marketplace is not a static puzzle you solve once. It is a living system:
- Traffic patterns shift. A viral news story drives a spike of readers to a publisher site. A holiday changes user behavior.
- Competitors enter and leave. Another advertiser launches a campaign targeting the same audience, driving prices up. An hour later, they exhaust their budget and disappear, and prices drop.
- Content changes. The topics on a publisher’s site affect what ads perform well. A sports article attracts different clicks than a finance article.
- Your own actions change the environment. If you bid more aggressively, you win more auctions, which depletes your budget faster, which means you need to bid less aggressively later.
No static set of rules can keep up. What you need is a system that adapts — one that observes its own performance and adjusts its strategy continuously.
The key insight: learn from experience
Here is the core idea behind using reinforcement learning for this problem:
The agent does not know the optimal bidding strategy in advance. It must learn from experience by observing the results of its own actions.
What if we had a system that, every 15 minutes, looked at how the campaign is doing — how fast it is spending, whether it is getting clicks, what the win rate looks like — and then adjusted the bid amount? Not by following a fixed rule, but by choosing the adjustment that its accumulated experience suggests will lead to the best outcome over the rest of the day?
That is exactly what Promovolve’s BidOptimizationAgent does.
Two speeds: fast path and slow path
A real-time bidding system has to respond to ad requests in milliseconds. You cannot run a neural network inference for every single bid. Instead, Promovolve splits the work into two layers:
CampaignEntity (fast path, per-request):
- Eligibility checks (canBid)
- Budget reservation (TryReserve)
- Bid response: maxCpm * bidMultiplier
BidOptimizationAgent (slow path, every 15 minutes):
- Observes: spend, clicks, impressions, win rate, time/budget remaining
- Outputs: new bidMultiplier
- Trains DQN on accumulated experience
The CampaignEntity handles every incoming ad request. It runs on the fast path — simple arithmetic, no ML involved. For each request, it checks whether the campaign is eligible to bid, reserves budget, and responds with a bid price. That bid price is just maxCpm * bidMultiplier.
The BidOptimizationAgent runs on the slow path. Every 15 minutes, it wakes up, looks at the campaign’s performance metrics from the last window, and decides how to adjust the bidMultiplier. It also uses this experience to train a neural network so that future decisions improve.
This separation is critical. The fast path is lightweight and deterministic — it just multiplies two numbers. All the learning and adaptation happens offline, on a 15-minute cadence.
The bidMultiplier: one number to rule them all
The RL agent’s output is remarkably simple: a single number called the bidMultiplier.
The campaign already has a maxCpm set by the advertiser (say, $5). The bidMultiplier scales that value:
| bidMultiplier | Effective CPM | Meaning |
|---|---|---|
| 0.5 | $2.50 | Bid conservatively — save budget for later |
| 1.0 | $5.00 | Bid at the full max — default starting point |
| 1.4 | $7.00 | Bid aggressively — win auctions now |
| 2.0 | $10.00 | Maximum aggression — ceiling |
The multiplier is clamped between 0.5 and 2.0. This means the agent can never bid more than double or less than half the advertiser’s max CPM. These guardrails keep the agent from doing anything catastrophic.
Why a multiplier instead of an absolute bid amount? Because different campaigns have different base CPMs. A multiplier of 0.8 means “bid 20% less than the max” regardless of whether the max is $2 or $20. The agent learns a strategy (when to be aggressive, when to conserve) that transfers across campaigns.
What is ahead
In the next chapter, we will formalize this setup using the language of reinforcement learning: states, actions, rewards, and episodes. You will see exactly how Promovolve encodes the campaign’s situation into numbers the agent can reason about, and how the reward function encourages the behavior we want — maximizing clicks while pacing the budget smoothly through the day.
RL Fundamentals: Agent, Environment, Reward
Reinforcement learning has a clean conceptual framework. Once you understand its four key pieces, everything else — Q-learning, neural networks, exploration strategies — becomes a detail of how rather than what.
Here is the loop:
graph LR
A[Agent] -->|action| E[Environment]
E -->|reward + new state| A
- The agent observes the current state of the world.
- The agent chooses an action.
- The environment responds with a reward (a number: higher is better) and a new state.
- Repeat.
The agent’s goal is to choose actions that maximize the total reward it accumulates over time. Not just the immediate reward — the cumulative reward, including what it expects to earn in the future.
That is it. That is the entire framework. Let us now map each piece to the ad bidding problem.
Agent: BidOptimizationAgent
In Promovolve, each campaign gets its own BidOptimizationAgent. This is the agent. It wakes up every 15 minutes, looks at how the campaign has been performing, and decides whether to increase, decrease, or hold the bid multiplier.
The agent does not see individual ad requests. It does not know which users clicked or which publishers served the ads. It sees only aggregate metrics from the last 15-minute window: total impressions, total clicks, total spend, win rate. This is deliberate — the agent operates on a coarse time scale, making strategic decisions, not tactical ones.
Environment: the ad marketplace
The environment is everything outside the agent’s control: other advertisers’ bids, publisher traffic volume, user behavior, time of day. The agent cannot observe the environment directly; it only sees the environment’s effects through the metrics it receives.
This is a crucial distinction. The agent does not know that a competitor just increased their bid by 30%. What it does know is that its win rate dropped from 0.6 to 0.3 in the last window. It must figure out the right response from that indirect signal.
State: what the agent observes
Every 15 minutes, the agent constructs an 8-dimensional state vector from the campaign’s current metrics. Each dimension is a number, typically normalized to a range around 0 to 2. Here is the toState() method from BidOptimizationAgent.scala:
private def toState(obs: Observation): Array[Double] = {
val maxCpm = if (obs.maxCpm > 0) obs.maxCpm else 1.0
val dailyBudget = if (obs.dailyBudget > 0) obs.dailyBudget else 1.0
Array(
// 0: effective CPM (normalized)
math.min(2.0, (obs.maxCpm * _bidMultiplier) / maxCpm),
// 1: CTR in window
if (windowImpressions > 0) math.min(1.0, windowClicks.toDouble / windowImpressions)
else 0.0,
// 2: win rate
if (windowBidOpportunities > 0) windowWins.toDouble / windowBidOpportunities
else 0.5,
// 3: budget remaining fraction
math.max(0.0, math.min(1.0, obs.budgetRemaining / dailyBudget)),
// 4: time remaining fraction
math.max(0.0, math.min(1.0, obs.timeRemaining)),
// 5: spend rate vs ideal (1.0 = on pace)
spendRate(obs),
// 6: impression rate (normalized by expected)
normalizedImpressionRate(obs),
// 7: CPC (normalized)
if (windowClicks > 0) math.min(2.0, (windowSpend / windowClicks) / maxCpm)
else 0.0
)
}
Let us walk through each dimension and why it matters.
Dimension 0: effectiveCpm
This is the agent’s current bid level — maxCpm * bidMultiplier, normalized by maxCpm so it always sits around 1.0. This tells the agent “where is my bid right now?” If the multiplier is 1.0, this value is 1.0. If the multiplier is 0.7, this is 0.7.
Why it matters: The agent needs to know its own current bid to reason about whether to raise or lower it. Without this, it would have no memory of its previous decisions.
Dimension 1: ctr (click-through rate)
The ratio of clicks to impressions in the last 15-minute window. If the campaign served 200 impressions and got 4 clicks, CTR is 0.02.
Why it matters: CTR is the direct signal of ad quality. If CTR is high, the current bid level is winning good inventory. If CTR drops, the agent might be winning low-quality impressions (cheap but unclicked) or the audience mix has changed.
Dimension 2: winRate
The fraction of bid opportunities that resulted in a win. If the campaign bid on 500 auctions and won 300, win rate is 0.6.
Why it matters: Win rate tells the agent how competitive it is. A low win rate (say 0.1) means the agent is being outbid most of the time — it might need to increase the multiplier. A very high win rate (say 0.95) might mean the agent is overbidding — it could lower the multiplier and save budget while still winning plenty of auctions.
Dimension 3: budgetRemaining
The fraction of the daily budget that has not yet been spent. Starts at 1.0, drops toward 0.0.
Why it matters: This is the scarcity signal. If it is 2pm and the budget is already at 0.2, the agent needs to conserve. If it is 2pm and the budget is at 0.8, the agent can afford to be more aggressive.
Dimension 4: timeRemaining
The fraction of the delivery period remaining. Starts at 1.0 at the beginning of the day, drops to 0.0 at the end.
Why it matters: Time context is critical. Having 50% of the budget left means very different things depending on whether 50% or 10% of the day remains. The agent needs both dimensions to reason about pacing.
Dimension 5: spendRate
The ratio of actual spending pace to the ideal even pace. A value of 1.0 means the campaign is spending exactly on track. A value of 2.0 means it is spending twice as fast as it should. A value of 0.5 means it is under-spending.
This is computed by comparing how much has been spent to how much should have been spent given the elapsed time:
private def spendRate(obs: Observation): Double = {
if (obs.dailyBudget <= 0 || obs.timeRemaining >= 1.0) return 1.0
val elapsed = 1.0 - obs.timeRemaining
if (elapsed <= 0) return 1.0
val expectedSpend = obs.dailyBudget * elapsed
if (expectedSpend <= 0) return 1.0
val actualSpend = obs.dailyBudget - obs.budgetRemaining
math.min(3.0, actualSpend / expectedSpend) // cap at 3x overspend
}
Why it matters: This is the single most important pacing signal. It directly tells the agent whether it needs to speed up or slow down. The budgetRemaining and timeRemaining dimensions give raw position; spendRate gives velocity.
Dimension 6: impressionRate
The number of impressions in the last window, normalized by a baseline expectation of 100 impressions per window.
Why it matters: This tells the agent about traffic volume. If impression rate is 2.0, the market is busy — lots of inventory available. If it is 0.1, traffic is thin. The agent can learn to bid differently depending on market conditions: for example, saving budget during low-traffic periods (when impressions are scarce and expensive) and spending during high-traffic periods (when cheap inventory is plentiful).
Dimension 7: costPerClick
The average cost per click in the last window, normalized by maxCpm.
Why it matters: This is an efficiency metric. A high CPC means the agent is paying a lot for each click — maybe it should bid lower to find cheaper inventory. A low CPC means clicks are coming cheaply — a good time to be aggressive.
Why normalization matters
Notice that every dimension is capped and normalized. CTR is capped at 1.0. Effective CPM is capped at 2.0. Spend rate is capped at 3.0. This is important because neural networks work best when inputs are in similar numeric ranges. If one dimension ranged from 0 to 1 and another from 0 to 10,000, the large dimension would dominate training and the small one would be ignored.
Action: 7 discrete bid adjustments
When the agent observes a state, it must choose one of 7 actions. Each action maps to a multiplicative adjustment applied to the current bid multiplier:
| Action | Adjustment | Meaning |
|---|---|---|
| 0 | 0.7x | Bid 30% less — strong pullback |
| 1 | 0.8x | Bid 20% less — conserve budget |
| 2 | 0.9x | Bid 10% less — slight reduction |
| 3 | 1.0x | Hold current bid |
| 4 | 1.1x | Bid 10% more — slight increase |
| 5 | 1.2x | Bid 20% more — be aggressive |
| 6 | 1.4x | Bid 40% more — strong push |
These adjustments are cumulative. If the multiplier is currently 1.0 and the agent picks action 5 (1.2x), the new multiplier becomes 1.2. If the agent then picks action 4 (1.1x), the multiplier becomes 1.2 * 1.1 = 1.32.
The multiplier is clamped to [0.5, 2.0]:
val adjustment = config.dqnConfig.multiplierForAction(action)
_bidMultiplier = math.max(
config.minMultiplier,
math.min(config.maxMultiplier, _bidMultiplier * adjustment)
)
Notice the action space is asymmetric — there is no 1.3x on the upward side, but there is a 0.7x on the downward side. In multiplicative terms, the strongest downward action (0.7x, which divides the bid by roughly 1.43) is a slightly sharper move than the strongest upward action (1.4x). This design reflects a practical observation: it is usually more urgent to reduce spending (budget exhaustion is irreversible) than to increase it (under-spending can be corrected later). A single “emergency brake” action of 0.7x can cut the bid sharply when needed.
Why discrete actions instead of a continuous output? Two reasons. First, discrete action spaces are simpler to learn with DQN (the algorithm Promovolve uses). Second, discrete actions make the agent’s behavior interpretable — you can look at a log and see “the agent chose action 0 (cut bid by 30%)” rather than “the agent output 0.7134.”
Reward: what success looks like
The reward function is where you encode what you want the agent to optimize for. Get it right, and the agent learns useful behavior. Get it wrong, and the agent finds creative ways to maximize the number you gave it while doing something you did not intend.
Here is the computeReward() method:
private def computeReward(obs: Observation): Double = {
// Primary reward: clicks achieved in this window
val clickReward = windowClicks.toDouble
// Penalty for overspending (spend rate > 1.5x means burning too fast)
val rate = spendRate(obs)
val overspendPenalty = if (rate > 1.5) config.overspendPenalty * (rate - 1.5) else 0.0
// Penalty for budget exhaustion
val exhaustionPenalty =
if (obs.budgetRemaining <= 0 && obs.timeRemaining > 0.1)
config.exhaustionPenalty
else 0.0
clickReward - overspendPenalty - exhaustionPenalty
}
The reward has three components:
Primary reward: clicks
The number of clicks in the last 15-minute window. This is the main objective. More clicks = higher reward.
Why clicks and not impressions? Because impressions are easy to buy — just bid high and you win every auction. But that burns budget on impressions nobody clicks. Clicks are a better proxy for advertiser value.
Penalty: overspending
If the spend rate exceeds 1.5x the ideal pace, the agent receives a penalty proportional to the excess. The penalty factor is 2.0 by default, so:
- Spend rate of 1.5x: no penalty
- Spend rate of 2.0x: penalty of 2.0 * (2.0 - 1.5) = 1.0
- Spend rate of 3.0x: penalty of 2.0 * (3.0 - 1.5) = 3.0
Notice the threshold of 1.5x. The agent is allowed to spend somewhat faster than ideal — maybe there is good inventory available right now and it makes sense to grab it. But once spending exceeds 1.5x the ideal rate, the penalty kicks in, growing linearly. This gives the agent a “soft budget” rather than a hard constraint: overspend a little if the clicks are worth it, but not too much.
Penalty: budget exhaustion
If the budget hits zero while more than 10% of the delivery period remains, the agent receives a flat penalty of 5.0. This is the “hard lesson” — you ran out of money with hours left in the day. A penalty of 5.0 is severe (it might take several windows of good clicks to make up for it), which teaches the agent to avoid this outcome.
The 10% threshold prevents penalizing the agent for running out of budget in the last few minutes of the day, which is often fine or even desirable — you want to use the full budget.
Putting it together
The reward function says: “Get as many clicks as you can, but don’t burn the budget too fast, and definitely don’t exhaust it with a lot of time remaining.” The agent learns to balance these competing objectives.
graph TD
R[Reward] --> C[+ Clicks in window]
R --> O[- Overspend penalty<br/>if spendRate > 1.5x]
R --> E[- Exhaustion penalty<br/>if budget=0 and time > 10% left]
The discount factor: valuing the future
The agent does not just maximize the reward from the current 15-minute window. It maximizes the discounted sum of all future rewards:
total = r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 + ...
Promovolve uses gamma = 0.99. This means a reward one step in the future is worth 99% of a reward right now. A reward two steps away is worth 0.99 * 0.99 = 98%. Twenty steps away: 0.99^20 = 82%.
With gamma = 0.99, the agent strongly considers future consequences. If bidding aggressively now gets 5 extra clicks but causes budget exhaustion 6 windows from now (losing, say, 20 clicks worth of opportunity), the agent learns to avoid that trade. A lower gamma (say 0.9) would make the agent more myopic — caring mostly about the next few windows. A higher gamma (say 0.999) would make the agent treat rewards an hour from now almost identically to rewards right now.
The value 0.99 is a good default for this domain, where a day has roughly 96 fifteen-minute windows. After 96 steps, 0.99^96 = 0.38, so the agent still cares meaningfully about what happens at the end of the day when making decisions at the beginning.
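These weights are easy to verify with a line of Scala:

// Discounted weight of a reward k steps in the future, with gamma = 0.99.
def discount(gamma: Double, k: Int): Double = math.pow(gamma, k)

discount(0.99, 1)  // 0.99
discount(0.99, 20) // ≈ 0.82
discount(0.99, 96) // ≈ 0.38 (end of a 96-window day, seen from the start)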
Episodes: one day, one episode
In RL terminology, an episode is a complete sequence from start to finish. In Promovolve, one day equals one episode.
At the start of the day, the agent begins with a bidMultiplier of 1.0 and the campaign has its full daily budget. Over the course of the day, the agent makes approximately 96 decisions (one per 15-minute window), each time adjusting the multiplier based on what it observes.
At the end of the day, resetDay() is called:
def resetDay(): Unit = {
// Store terminal transition
for {
ps <- prevState
pa <- prevAction
} {
val terminalState = Array.fill(config.dqnConfig.stateSize)(0.0)
val terminalReward = windowClicks.toDouble
dqn.store(ps, pa, terminalReward, terminalState, done = true)
}
_bidMultiplier = 1.0
prevObservation = None
prevState = None
prevAction = None
// ... reset all window and day counters ...
}
Two things happen here:
- Terminal transition. The agent stores a final experience with done = true. This tells the learning algorithm that there is no “next state” — the episode is over. Future rewards after this point are zero. Without this signal, the agent would think the episode continues forever and assign inflated values to end-of-day states.
- Reset to defaults. The bid multiplier goes back to 1.0. All counters reset to zero. The agent starts fresh, but — and this is important — it keeps its learned weights. The neural network retains everything it learned from previous days. Tomorrow, the agent starts from the same bid multiplier (1.0) but makes better decisions because it has more experience to draw on.
Over many episodes (days), the agent’s policy improves. Early on, it explores randomly and makes mistakes — overspending, underspending, missing cheap inventory. Over time, it learns patterns: “when budget is at 60% and time is at 40%, I should bid conservatively” or “when win rate drops below 0.2, I should increase the multiplier.”
The full picture
Here is how all the pieces connect in a single 15-minute cycle:
sequenceDiagram
participant CE as CampaignEntity<br/>(fast path)
participant BOA as BidOptimizationAgent<br/>(slow path)
participant Env as Ad Marketplace
Note over CE: Handles bid requests<br/>using current bidMultiplier
CE->>Env: bid = maxCpm * bidMultiplier
Env-->>CE: win/loss, impressions, clicks
Note over BOA: Every 15 minutes
CE->>BOA: Observation(maxCpm, budget, time, ...)
BOA->>BOA: toState() → 8-dim vector
BOA->>BOA: computeReward() → clicks - penalties
BOA->>BOA: Store (state, action, reward, nextState)
BOA->>BOA: Train DQN on replay buffer
BOA->>BOA: Select new action (epsilon-greedy)
BOA->>BOA: Apply action → new bidMultiplier
BOA-->>CE: updated bidMultiplier
Note over CE: Next 15 min: uses<br/>new bidMultiplier
In the next chapter, we will look inside the DQN — the neural network and learning algorithm that powers the agent’s decision-making.
Building a Neural Network From Scratch
In Chapter 2, we explored how an agent learns Q-values – estimates of how good each action is in a given state. We stored those Q-values in a table, one entry per state-action pair. That works when the world is small and discrete: a grid with a handful of squares, a game with a few dozen positions.
But Promovolve’s bid optimization agent lives in a continuous world. Its state is a vector of 8 floating-point numbers – CTR, win rate, budget remaining, spend rate, and more. Each of those can take on essentially infinite values. There is no table large enough to hold a Q-value for every possible combination of those 8 numbers.
We need a function that can take any 8-number state as input and produce a Q-value for each possible action. That function is a neural network.
What a neural network does
Think of a neural network as a programmable formula. You feed in numbers on one side, and it produces numbers on the other. In between, it has thousands of adjustable knobs (called weights and biases) that determine exactly how the input maps to the output. Training the network means finding the right settings for all those knobs.
In Promovolve’s case:
- Input: 8 numbers describing the campaign’s current state (effective CPM, CTR, win rate, budget remaining, time remaining, spend rate, impression rate, cost per click)
- Output: 7 numbers, one per possible action (each action adjusts the bid multiplier: 0.7x, 0.8x, 0.9x, 1.0x, 1.1x, 1.2x, 1.4x)
The output number for each action is the network’s estimate of the Q-value for that action. The agent picks the action with the highest Q-value.
The architecture
Promovolve’s network is defined with a single line:
val layerSizes = Vector(8, 64, 64, 7)
This describes four layers of neurons:
- Input layer (8 neurons): one for each state feature
- First hidden layer (64 neurons): an internal processing stage
- Second hidden layer (64 neurons): a second processing stage
- Output layer (7 neurons): one Q-value per action
The word “hidden” just means these layers are internal – they are not directly visible as input or output. Their job is to discover useful intermediate representations.
How many adjustable knobs does this create? Between each pair of adjacent layers, every neuron in the first layer connects to every neuron in the next. Each connection has a weight, and each neuron in the receiving layer has a bias. So:
| Connection | Weights | Biases | Total |
|---|---|---|---|
| Input to Hidden 1 | 8 x 64 = 512 | 64 | 576 |
| Hidden 1 to Hidden 2 | 64 x 64 = 4,096 | 64 | 4,160 |
| Hidden 2 to Output | 64 x 7 = 448 | 7 | 455 |
| Total | | | 5,191 |
Roughly 5,000 parameters. The network will learn values for all of them.
A concrete example: a tiny network
Before we look at the real code, let’s walk through a network small enough to compute by hand. Suppose we have a 2-input, 3-hidden, 1-output network.
Input (2) --> Hidden (3, ReLU) --> Output (1, linear)
Say the weights and biases are:
Layer 1 (input to hidden):
| | input_0 | input_1 | bias |
|---|---|---|---|
| hidden_0 | 0.5 | -0.3 | 0.1 |
| hidden_1 | -0.2 | 0.8 | 0.0 |
| hidden_2 | 0.4 | 0.4 | -0.5 |
Layer 2 (hidden to output):
| | hidden_0 | hidden_1 | hidden_2 | bias |
|---|---|---|---|---|
| output_0 | 0.6 | -0.4 | 0.9 | 0.2 |
Now feed in the input [1.0, 2.0].
Step 1 – hidden layer (before activation):
hidden_0 = 0.5*1.0 + (-0.3)*2.0 + 0.1 = 0.5 - 0.6 + 0.1 = 0.0
hidden_1 = (-0.2)*1.0 + 0.8*2.0 + 0.0 = -0.2 + 1.6 + 0.0 = 1.4
hidden_2 = 0.4*1.0 + 0.4*2.0 + (-0.5) = 0.4 + 0.8 - 0.5 = 0.7
Step 2 – apply ReLU activation:
ReLU means: if the value is negative, replace it with zero. Otherwise, keep it.
hidden_0 = max(0, 0.0) = 0.0
hidden_1 = max(0, 1.4) = 1.4
hidden_2 = max(0, 0.7) = 0.7
Step 3 – output layer (linear, no ReLU):
output_0 = 0.6*0.0 + (-0.4)*1.4 + 0.9*0.7 + 0.2
= 0.0 - 0.56 + 0.63 + 0.2
= 0.27
The network outputs 0.27. If this were a Q-value, the agent would use it to decide whether this action is worth taking. The entire forward pass is just multiply, add, and clamp-to-zero, repeated layer by layer.
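If you want to verify the arithmetic, the tiny network fits in a few lines. This is a standalone sketch, not the production DenseNetwork shown later; the weight layout (one row per receiving neuron) mirrors the tables above.

```scala
// Standalone sketch of the 2-3-1 example above; not the production DenseNetwork.
val w1 = Array(Array(0.5, -0.3), Array(-0.2, 0.8), Array(0.4, 0.4))  // hidden weights
val b1 = Array(0.1, 0.0, -0.5)                                       // hidden biases
val w2 = Array(Array(0.6, -0.4, 0.9))                                // output weights
val b2 = Array(0.2)                                                  // output bias

def layer(w: Array[Array[Double]], b: Array[Double], in: Array[Double], relu: Boolean): Array[Double] =
  w.indices.map { j =>
    val sum = b(j) + w(j).zip(in).map { case (wk, xk) => wk * xk }.sum
    if (relu && sum < 0.0) 0.0 else sum
  }.toArray

val hidden = layer(w1, b1, Array(1.0, 2.0), relu = true)   // Array(0.0, 1.4, 0.7)
val output = layer(w2, b2, hidden, relu = false)           // Array(0.27)
```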
Xavier initialization: starting with sensible weights
Before the network has learned anything, what should the initial weights be? If they are all zero, every neuron computes the same thing – the network cannot learn. If they are huge random numbers, the signals explode through the layers. If they are tiny, the signals shrink to nothing.
Xavier initialization — strictly speaking, its ReLU-adapted variant known as He initialization, which is what this code uses — picks random weights from a normal distribution scaled to the number of inputs each neuron receives:
val scale = math.sqrt(2.0 / fanIn)
Array.tabulate(fanOut) { _ =>
Array.tabulate(fanIn) { _ => rng.nextGaussian() * scale }
}
fanIn is the number of inputs to the layer. For the first hidden layer in Promovolve’s network, fanIn = 8, so scale = sqrt(2/8) = 0.5. Weights are drawn from a normal distribution with mean 0 and standard deviation 0.5.
Why does this work? Each neuron sums up fanIn weighted inputs. If each weight has variance 2/fanIn, the sum has variance roughly 2 – not too big, not too small. This keeps the signal magnitude stable as it passes through many layers. Without this, deep networks are extremely difficult to train: the signals either explode (and the weights diverge) or vanish (and the weights never update).
Biases start at zero:
private val biases: Array[Array[Double]] =
Array.tabulate(numLayers) { l => Array.fill(layerSizes(l + 1))(0.0) }
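The “variance roughly 2” claim is easy to check numerically. The following is an illustrative sketch, not part of the codebase: it draws unit-variance inputs and He-initialized weights, and measures the variance of the resulting pre-activation sums.

```scala
// Illustrative check: with unit-variance inputs and weights drawn with
// standard deviation sqrt(2 / fanIn), each neuron's weighted sum has
// variance close to 2.
import scala.util.Random

val rng = new Random(0)
val fanIn = 64
val scale = math.sqrt(2.0 / fanIn)

val sums = Array.fill(100000) {
  (0 until fanIn).map { _ =>
    val weight = rng.nextGaussian() * scale   // one initialized weight
    val input  = rng.nextGaussian()           // one unit-variance input
    weight * input
  }.sum
}
val mean     = sums.sum / sums.length
val variance = sums.map(s => (s - mean) * (s - mean)).sum / sums.length
println(f"variance of the pre-activation sum ≈ $variance%.2f")   // ≈ 2.0
```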
The forward pass: how the network computes output from input
Here is the actual layerForward method from Promovolve’s DenseNetwork.scala:
private def layerForward(
w: Array[Array[Double]],
b: Array[Double],
input: Array[Double],
relu: Boolean
): Array[Double] = {
val out = new Array[Double](w.length)
var j = 0
while (j < w.length) {
var sum = b(j)
val wj = w(j)
var k = 0
while (k < wj.length) {
sum += wj(k) * input(k)
k += 1
}
out(j) = if (relu && sum < 0.0) 0.0 else sum
j += 1
}
out
}
For each neuron j in the output:
- Start with the bias: sum = b(j)
- For each input k, add weight * input: sum += wj(k) * input(k)
- Apply ReLU if this is a hidden layer: if sum < 0, set it to 0
The full forward pass chains this through all layers:
def forward(input: Array[Double]): Array[Double] = {
var activation = input
var l = 0
while (l < numLayers) {
activation = layerForward(weights(l), biases(l), activation,
relu = l < numLayers - 1)
l += 1
}
activation
}
Notice relu = l < numLayers - 1 – every layer gets ReLU except the last one. The output layer is linear because Q-values can be negative, and ReLU would clamp them to zero.
Why ReLU?
ReLU (Rectified Linear Unit) is the activation function f(x) = max(0, x). It does one simple thing: it kills negative values.
Why not just use the raw weighted sum everywhere? Without a nonlinear activation, stacking multiple layers is pointless. A linear function of a linear function is still just a linear function – you get no more expressive power from depth. ReLU introduces nonlinearity, which lets the network learn curved decision boundaries instead of just flat planes.
Why ReLU specifically? There are many activation functions (sigmoid, tanh, etc.), but ReLU is popular for practical reasons:
- Cheap to compute: just a comparison and possibly a zero assignment
- Avoids vanishing gradients: for positive inputs, the gradient is always 1.0. Sigmoid and tanh squash large values into a flat region where the gradient approaches zero, which can stall learning in deep networks.
- Works well in practice: decades of empirical evidence across many problem domains
Backpropagation: teaching the network
Forward pass gives us an output. But how do we improve the weights so the output gets closer to what we want?
The idea is straightforward: measure how wrong the output is, then trace backward through the network to figure out which weights contributed most to the error, and nudge each weight in the direction that reduces the error.
MSE loss: measuring how wrong we are
Promovolve uses Mean Squared Error (MSE):
loss = sum((output_i - target_i)^2) / n
If the network outputs [0.3, -0.1, 0.5] and the target is [0.3, 0.2, 0.5], the loss is:
loss = ((0.3-0.3)^2 + (-0.1-0.2)^2 + (0.5-0.5)^2) / 3
= (0 + 0.09 + 0) / 3
= 0.03
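The same calculation as a small helper, for readers who want to try their own values. This is an illustrative sketch; the codebase's train method computes the loss inline rather than calling a helper.

```scala
// Reproduces the MSE calculation above (illustrative helper, not the codebase's method).
def mse(output: Array[Double], target: Array[Double]): Double =
  output.zip(target).map { case (o, t) => (o - t) * (o - t) }.sum / output.length

mse(Array(0.3, -0.1, 0.5), Array(0.3, 0.2, 0.5))   // = 0.03
```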
The chain rule: tracing blame backward
Backpropagation uses the chain rule from calculus. If you vaguely remember that from school, great – if not, here is the intuition.
Imagine you are in a factory assembly line. The final product has a defect. You need to trace back through the line to find which stations caused the problem and by how much. Backpropagation does exactly this, but with math: it propagates the error signal backward through each layer, computing how much each weight contributed to the final error.
The actual backprop code
Here is the train method from DenseNetwork.scala, with annotations:
def train(input: Array[Double], target: Array[Double], lr: Double): Double = {
val activations = forwardFull(input)
val output = activations(numLayers)
val outputSize = output.length
// MSE loss
var loss = 0.0
var i = 0
while (i < outputSize) {
val diff = output(i) - target(i)
loss += diff * diff
i += 1
}
loss /= outputSize
First, run the forward pass and save every layer’s activations (we need them to compute gradients). Compute the MSE loss.
// Output layer gradient: dL/dz = 2(output - target) / n, clipped
var delta = new Array[Double](outputSize)
i = 0
while (i < outputSize) {
delta(i) = math.max(-GradClip, math.min(GradClip,
2.0 * (output(i) - target(i)) / outputSize))
i += 1
}
The output gradient tells us, for each output neuron, which direction and how much we need to adjust. The formula 2 * (output - target) / n is the derivative of the MSE loss. If the output is too high, the gradient is positive, pushing the output down. If too low, the gradient is negative, pushing it up.
The gradient is clipped to the range [-5.0, 5.0] (GradClip = 5.0) to prevent extreme values from destabilizing training. This is called gradient clipping.
// Backprop through layers (reverse order)
var l = numLayers - 1
while (l >= 0) {
val act = activations(l)
val w = weights(l)
val b = biases(l)
val nextDelta = if (l > 0) new Array[Double](w(0).length) else null
var j = 0
while (j < w.length) {
// Update bias
b(j) -= lr * delta(j)
val wj = w(j)
var k = 0
while (k < wj.length) {
// Accumulate gradient for previous layer
if (nextDelta != null) {
nextDelta(k) += delta(j) * wj(k)
}
// Update weight
wj(k) -= lr * delta(j) * act(k)
k += 1
}
j += 1
}
This is the heart of backpropagation. For each layer, going from the output backward:
- Bias update: b(j) -= lr * delta(j) – the bias moves in the opposite direction of the gradient, scaled by the learning rate.
- Weight update: wj(k) -= lr * delta(j) * act(k) – each weight is updated proportionally to (a) how much the output needs to change (delta(j)) and (b) how active the input was (act(k)). If the input neuron was not active (zero), the weight does not change – it did not contribute to the error.
- Gradient propagation: nextDelta(k) += delta(j) * wj(k) – the error signal for the previous layer is the sum of all the deltas from this layer, weighted by the connections. This is the chain rule in action.
// Apply ReLU derivative for hidden layers, with gradient clipping
if (nextDelta != null) {
val prevAct = activations(l)
delta = new Array[Double](nextDelta.length)
var k = 0
while (k < nextDelta.length) {
val g = if (prevAct(k) > 0.0) nextDelta(k) else 0.0
delta(k) = math.max(-GradClip, math.min(GradClip, g))
k += 1
}
}
l -= 1
}
loss
}
The ReLU derivative is simple: if the neuron’s activation was positive during the forward pass, the gradient passes through unchanged. If the activation was zero or negative, the gradient is killed – set to zero. This makes intuitive sense: if a neuron did not fire (ReLU set it to zero), it did not contribute to the output, so adjusting its incoming weights would have no effect.
Again, gradient clipping is applied after the ReLU derivative, bounding each gradient to [-5.0, 5.0].
Walking through the example
Let’s trace backpropagation through our tiny 2-3-1 network. Suppose:
- Input: [1.0, 2.0]
- Target: [1.0]
- Forward pass output: 0.27 (as we computed earlier)
- Learning rate: 0.1
Output gradient:
delta_output = 2 * (0.27 - 1.0) / 1 = -1.46
The network output 0.27 but should have output 1.0. The negative gradient says “push the output higher.”
Update output layer weights and biases:
Recall the hidden activations were [0.0, 1.4, 0.7].
bias_0 -= 0.1 * (-1.46) --> bias increases by 0.146
w(0,0) -= 0.1 * (-1.46) * 0.0 --> no change (hidden_0 was zero)
w(0,1) -= 0.1 * (-1.46) * 1.4 --> weight increases by 0.2044
w(0,2) -= 0.1 * (-1.46) * 0.7 --> weight increases by 0.1022
Notice that hidden_0 was zero (killed by ReLU), so its weight does not change. Only active neurons get their weights updated.
Propagate gradient to hidden layer:
nextDelta(0) = (-1.46) * 0.6 = -0.876
nextDelta(1) = (-1.46) * (-0.4) = 0.584
nextDelta(2) = (-1.46) * 0.9 = -1.314
Apply ReLU derivative:
hidden_0 activation was 0.0 --> delta(0) = 0.0 (gradient dies)
hidden_1 activation was 1.4 --> delta(1) = 0.584 (gradient passes)
hidden_2 activation was 0.7 --> delta(2) = -1.314 (gradient passes)
Then the first layer’s weights update using these deltas and the original input [1.0, 2.0]. The process is the same: weight -= lr * delta * input.
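The whole trace can be condensed into a few lines of Scala. This is an illustrative continuation of the tiny-network sketch, not the codebase's train method; gradient clipping is omitted because every value here is well inside [-5, 5].

```scala
// One manual update of the output layer and the deltas propagated to the
// hidden layer, using the formulas above (lr = 0.1, target = 1.0).
val lr = 0.1
val hiddenActs = Array(0.0, 1.4, 0.7)                   // saved from the forward pass
val deltaOut   = 2.0 * (0.27 - 1.0) / 1                 // -1.46

val newOutputBias = 0.2 - lr * deltaOut                 // 0.346
val newOutputWeights = Array(0.6, -0.4, 0.9).zip(hiddenActs).map {
  case (w, a) => w - lr * deltaOut * a                  // Array(0.6, -0.1956, 1.0022)
}

// Propagate the error signal backward, then apply the ReLU derivative mask.
val nextDelta   = Array(0.6, -0.4, 0.9).map(_ * deltaOut)   // -0.876, 0.584, -1.314
val hiddenDelta = nextDelta.zip(hiddenActs).map {
  case (d, a) => if (a > 0.0) d else 0.0                    // 0.0, 0.584, -1.314
}
// The first layer's weights then update with: weight -= lr * hiddenDelta(j) * input(k)
```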
Learning rate: the step size
The learning rate (lr) controls how much the weights change in each update. Promovolve uses 0.001.
- Too high (e.g., 0.1): the weights overshoot the optimum and oscillate wildly, possibly diverging to infinity.
- Too low (e.g., 0.00001): the weights barely move and learning takes forever.
- Just right (e.g., 0.001): the weights converge smoothly toward good values.
There is no formula for the “right” learning rate – it depends on the problem. 0.001 is a common starting point that works well across many domains.
Target network sync: copyFrom
In DQN (covered in the next chapter), we maintain two copies of the network: a Q-network that is actively learning, and a target network that provides stable training targets. Periodically, we copy the Q-network’s weights into the target network:
def copyFrom(other: DenseNetwork): Unit = {
var l = 0
while (l < numLayers) {
var j = 0
while (j < weights(l).length) {
System.arraycopy(other.weights(l)(j), 0, weights(l)(j), 0,
weights(l)(j).length)
j += 1
}
System.arraycopy(other.biases(l), 0, biases(l), 0, biases(l).length)
l += 1
}
}
This is a deep copy using System.arraycopy – fast, element-by-element duplication of every weight and bias array. After this call, the target network is an exact clone of the Q-network, but subsequent training updates only affect the Q-network.
Serialization: saving and restoring the network
A trained network is useless if it vanishes when the process restarts. Promovolve serializes the network by flattening all weights and biases into simple one-dimensional arrays:
def serialize(): DenseNetwork.Snapshot = {
val flatWeights = weights.flatMap(_.flatMap(_.toSeq)).toArray
val flatBiases = biases.flatMap(_.toSeq).toArray
DenseNetwork.Snapshot(layerSizes, flatWeights, flatBiases)
}
The Snapshot case class holds three things: the layer sizes (so we know the architecture), the flat weight array, and the flat bias array. To restore, we create a new network with the same architecture and copy the weights back into the nested array structure:
def restore(snapshot: DenseNetwork.Snapshot): Unit = {
require(snapshot.layerSizes == layerSizes, "Layer sizes mismatch")
var wi = 0
var l = 0
while (l < numLayers) {
var j = 0
while (j < weights(l).length) {
System.arraycopy(snapshot.weights, wi, weights(l)(j), 0,
weights(l)(j).length)
wi += weights(l)(j).length
j += 1
}
l += 1
}
// (similar loop for biases)
}
This is integrated with Pekko Persistence – when the campaign entity takes a snapshot, the DQN agent’s network weights are included and survive process restarts.
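A minimal round-trip looks like this. The constructor shown here (taking just the layer sizes) is an assumption for illustration; the real class may also take a random seed or configuration object.

```scala
// Hypothetical snapshot round trip, assuming a DenseNetwork(layerSizes) constructor.
val trained = new DenseNetwork(Vector(8, 64, 64, 7))
// ... training happens over many days ...
val snapshot = trained.serialize()        // layer sizes + flat weights + flat biases

val restored = new DenseNetwork(Vector(8, 64, 64, 7))
restored.restore(snapshot)                // now predicts identically to `trained`
```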
Summary
A neural network is a learnable function made of layers of simple operations: multiply, add, and clamp-to-zero. Promovolve’s DenseNetwork implements one in roughly 200 lines of pure Scala, with no external ML libraries:
- Xavier initialization keeps signals stable across layers
- Forward pass chains matrix multiplications with ReLU activations
- Backpropagation traces errors backward to compute weight updates
- Gradient clipping prevents numerical instability
- Serialization enables persistence across restarts
With this network in hand, we are ready to use it as a Q-value approximator. In the next chapter, we will combine it with experience replay and target networks to build a full Deep Q-Network.
From Q-Tables to Deep Q-Networks
In the previous chapters, we learned two things: Q-values tell an agent how good each action is in a given state, and neural networks can approximate functions over continuous inputs. Now we put them together.
This chapter covers the Deep Q-Network (DQN) algorithm as implemented in Promovolve’s DQNAgent.scala. By the end, you will understand how the agent learns to adjust ad bids from raw experience, and why several seemingly arbitrary design choices – target networks, experience replay, epsilon decay – are each solving a specific problem.
Q-values: the value of an action
A Q-value, written Q(state, action), answers a precise question: “If I am in this state and take this action, then act optimally from here on, what total reward can I expect?”
In Promovolve’s ad bidding system, the state describes a campaign’s current situation (8 numbers: effective CPM, CTR, win rate, budget remaining, time remaining, spend rate, impression rate, cost per click). The actions are 7 bid multiplier adjustments (0.7x through 1.4x). So Q(state, action=3) means: “Given the campaign’s current metrics, if I hold the bid steady at 1.0x and then act optimally for the rest of the day, how many clicks will I get?”
The agent’s policy is simple: always pick the action with the highest Q-value. If the Q-values are accurate, this is optimal behavior.
The Q-table: why it can’t work here
In a small, discrete world – say, 10 states and 5 actions – you could store Q-values in a 2D table:
| State | A0 | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|
| S0 | 0.2 | 0.5 | 0.1 | 0.3 | 0.0 |
| S1 | 0.4 | 0.1 | 0.6 | 0.2 | 0.8 |
| … | … | … | … | … | … |
With 10 states and 5 actions, the table has 50 entries. Each time the agent takes action a in state s, receives reward r, and lands in state s', it updates the entry using:
Q(s,a) <- Q(s,a) + alpha * [reward + gamma * max_a' Q(s',a') - Q(s,a)]
The term in brackets is the TD error – the difference between what the agent expected and what it actually observed. The learning rate alpha controls how much to adjust. Given enough visits to every state-action pair, the values converge to the true Q-values and the agent behaves optimally.
But Promovolve’s state space has 8 continuous dimensions. A CTR of 0.031 is different from 0.032. A budget remaining fraction of 0.7234 is different from 0.7235. The number of possible states is effectively infinite. No table can hold them all, and even if it could, the agent would never visit most states twice, so the values would never converge.
We need a function that generalizes: given a state the network has never seen before, it should produce reasonable Q-values by interpolating from similar states it has seen. This is exactly what a neural network does.
The DQN idea
The key insight from DeepMind’s 2015 paper (Mnih et al., “Human-level control through deep reinforcement learning”) is straightforward: replace the Q-table with a neural network.
- Input: the state vector (8 numbers)
- Output: a Q-value for each action (7 numbers)
- Architecture: the DenseNetwork from Chapter 3 – Input(8) -> Hidden(64, ReLU) -> Hidden(64, ReLU) -> Output(7, linear)
To select an action, run the forward pass and pick the output with the highest value:
def selectGreedy(state: Array[Double]): Int =
argMax(qNetwork.forward(state))
Where argMax returns the index of the largest element:
private def argMax(arr: Array[Double]): Int = {
var bestIdx = 0
var bestVal = arr(0)
var i = 1
while (i < arr.length) {
if (arr(i) > bestVal) {
bestVal = arr(i)
bestIdx = i
}
i += 1
}
bestIdx
}
If the Q-network outputs [1.2, 0.8, 2.1, 3.0, 1.5, 0.9, 2.3], the agent picks action 3 (Q-value 3.0, which maps to a 1.0x multiplier – hold the current bid).
Simple enough. But two problems immediately arise: how do we train this network, and how do we make sure the agent explores enough to discover good strategies?
Epsilon-greedy exploration
If the agent always picks the action with the highest Q-value, it will never try anything new. In the early stages of learning, the Q-values are essentially random (the network was just initialized with random weights). If the agent latches onto whichever random action happened to look best, it will never discover that a different action might yield better results.
This is the explore-exploit tradeoff. Exploiting means using your current best knowledge. Exploring means trying something new to potentially find something better. A good agent needs both.
Promovolve uses epsilon-greedy exploration: with probability epsilon, pick a random action; otherwise, pick the best known action.
def selectAction(state: Array[Double]): Int = {
totalSteps += 1
if (rng.nextDouble() < epsilon) {
rng.nextInt(config.actionSize) // explore: random action
} else {
argMax(qNetwork.forward(state)) // exploit: best known action
}
}
The key is that epsilon changes over time:
- Start: epsilon = 1.0 (every action is random – pure exploration)
- Each training step: epsilon *= 0.995 (gradually shift toward exploitation)
- Floor: epsilon = 0.05 (always keep 5% exploration)
// Decay epsilon
epsilon = math.max(config.epsilonEnd, epsilon * config.epsilonDecay)
Why start fully random? Because the initial Q-values are meaningless. Acting greedily on garbage values wastes time reinforcing an arbitrary strategy. Better to explore widely first, collect diverse experience, and let the network learn from all of it.
Why keep a 5% floor? Because the environment can change. In ad bidding, competitive dynamics shift throughout the day. A strategy that was optimal at 9 AM may be suboptimal at 3 PM. A small amount of ongoing exploration lets the agent detect and adapt to these shifts.
Why not explore more? Every exploratory action is potentially a wasted bid – the agent tries something it “knows” is suboptimal. Too much exploration squanders the campaign’s budget on bad bids. The gradual decay from 1.0 to 0.05 balances this: explore heavily when ignorant, exploit increasingly as knowledge improves.
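The schedule is easy to tabulate. The sketch below is illustrative (the agent's actual epsilon lives in DQNAgent and decays inside trainStep, as shown later); it just evaluates the start-1.0, decay-0.995, floor-0.05 schedule described above.

```scala
// Epsilon after a given number of training steps, per the schedule above.
def epsilonAt(step: Int, start: Double = 1.0, decay: Double = 0.995, floor: Double = 0.05): Double =
  math.max(floor, start * math.pow(decay, step))

epsilonAt(0)      // 1.00 -- pure exploration
epsilonAt(100)    // ≈ 0.61
epsilonAt(500)    // ≈ 0.08
epsilonAt(1000)   // 0.05 -- pinned at the floor from roughly step 600 onward
```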
The Bellman equation: what the Q-values should be
We want the Q-network to output accurate Q-values. But what does “accurate” mean? What is the correct Q-value for a given state and action?
The Bellman equation provides the answer:
Q(s, a) = reward + gamma * max_a' Q(s', a')
In words: the value of taking action a in state s equals the immediate reward you get, plus the discounted value of the best action in the next state s'.
Let’s make this concrete with an ad bidding example. Suppose the agent is in state s (budget 60% remaining, time 50% remaining, CTR 0.03). It takes action a = “bid 1.2x”. Over the next observation window, it gets 3 clicks (reward = 3.0) and transitions to state s' (budget 55% remaining, time 45% remaining, CTR 0.035). The discount factor gamma is 0.99.
The Bellman equation says:
Q(s, bid_1.2x) = 3.0 + 0.99 * max_a' Q(s', a')
If the best Q-value in state s' is 15.0, then the target Q-value is 3.0 + 0.99 * 15.0 = 17.85.
The discount factor gamma = 0.99 means the agent values future rewards almost as much as immediate ones. A gamma of 0 would make the agent completely myopic – only caring about the next window’s clicks. A gamma of 1 would weight all future rewards equally, which can cause numerical instability. 0.99 is a practical middle ground: the agent plans ahead but slightly prefers sooner rewards.
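Plugging the example's numbers into the equation makes the target concrete. All values below are the illustrative ones from the example, not real campaign data.

```scala
// Bellman target for the example above.
val reward    = 3.0     // clicks earned in the last window
val gamma     = 0.99
val bestNextQ = 15.0    // highest Q-value the network reports for state s'
val target    = reward + gamma * bestNextQ   // 3.0 + 14.85 = 17.85
```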
The training loop
Now we have all the pieces. Here is how Promovolve’s DQNAgent.trainStep() puts them together:
def trainStep(): Option[Double] = {
if (replayBuffer.size < config.minBufferSize) return None
val batch = replayBuffer.sample(config.batchSize, rng)
var totalLoss = 0.0
var i = 0
while (i < batch.size) {
val state = batch.states(i)
val action = batch.actions(i)
val reward = batch.rewards(i)
val nextState = batch.nextStates(i)
val done = batch.dones(i)
// Current Q-values
val currentQ = qNetwork.forward(state)
// Double DQN target:
// 1. Q-network selects best action for next state
// 2. Target network evaluates that action
val target = currentQ.clone()
if (done) {
target(action) = reward
} else {
val nextQOnline = qNetwork.forward(nextState)
val bestNextAction = argMax(nextQOnline)
val nextQTarget = targetNetwork.forward(nextState)
target(action) = reward + config.gamma * nextQTarget(bestNextAction)
}
// Clip target to prevent extreme Q-values
target(action) = math.max(-config.qClip, math.min(config.qClip,
target(action)))
totalLoss += qNetwork.train(state, target, config.learningRate)
i += 1
}
trainSteps += 1
// Sync target network periodically
if (trainSteps % config.targetSyncInterval == 0) {
targetNetwork.copyFrom(qNetwork)
}
// Decay epsilon
epsilon = math.max(config.epsilonEnd, epsilon * config.epsilonDecay)
Some(totalLoss / batch.size)
}
Let’s walk through this step by step.
Step 1: Wait for enough experience
if (replayBuffer.size < config.minBufferSize) return None
The agent does not start training until it has collected at least minBufferSize (32) transitions in its replay buffer. This ensures there is enough variety in the training batch.
Step 2: Sample a batch
val batch = replayBuffer.sample(config.batchSize, rng)
Randomly sample 32 transitions from the replay buffer. Each transition is a tuple of (state, action, reward, nextState, done). We will cover the replay buffer in detail in the next chapter – for now, think of it as a big bag of past experiences that we draw from randomly.
Step 3: Compute the target Q-value
For each transition in the batch:
val currentQ = qNetwork.forward(state)
val target = currentQ.clone()
First, get the current Q-values for the state. Clone them into a target array. We will only modify the Q-value for the action that was actually taken – the other values stay the same, so the network receives zero gradient for actions it did not take.
If the episode is done (budget exhausted or time expired):
if (done) {
target(action) = reward
}
There is no future, so the Q-value is just the immediate reward.
Otherwise, apply the Bellman equation:
val nextQOnline = qNetwork.forward(nextState)
val bestNextAction = argMax(nextQOnline)
val nextQTarget = targetNetwork.forward(nextState)
target(action) = reward + config.gamma * nextQTarget(bestNextAction)
This is the Double DQN variant (Van Hasselt et al., 2016). Standard DQN would use the target network for both selecting and evaluating the best next action, which tends to overestimate Q-values. Double DQN decouples these: the Q-network picks the best action, but the target network evaluates how good that action actually is. This small change significantly reduces overestimation bias.
The target is also clipped to [-100, 100] to prevent runaway values:
target(action) = math.max(-config.qClip, math.min(config.qClip,
target(action)))
Step 4: Train the Q-network
totalLoss += qNetwork.train(state, target, config.learningRate)
This calls the DenseNetwork.train() method from Chapter 3. It runs a forward pass, computes MSE loss between the network’s current output and the target, and backpropagates to update the weights. The learning rate is 0.001.
Step 5: Decay epsilon
epsilon = math.max(config.epsilonEnd, epsilon * config.epsilonDecay)
After each training step, reduce the exploration rate by multiplying by 0.995. This gradually shifts the agent from exploration toward exploitation.
How fast does this decay? After 100 training steps: 1.0 * 0.995^100 = 0.606. After 500 steps: 0.995^500 = 0.082. After about 600 steps, epsilon hits the floor of 0.05 and stays there. In practice, this means the agent explores heavily for the first few hundred observation windows, then settles into mostly exploiting what it has learned.
Step 6: Sync the target network
if (trainSteps % config.targetSyncInterval == 0) {
targetNetwork.copyFrom(qNetwork)
}
Every 100 training steps, copy all weights from the Q-network into the target network. Between syncs, the target network is frozen – it provides stable targets for training.
The instability problem
Here is the fundamental challenge of DQN, and the reason for the two complications we just introduced (target network and experience replay).
When you train a supervised learning model, the training targets are fixed. If you are classifying images of cats and dogs, the label “cat” does not change as the model learns. The model is trying to hit a stationary target.
In DQN, the targets are computed from the network itself:
target = reward + gamma * max_a' Q_target(s', a')
As the Q-network improves, the target values change. It is like trying to hit a moving bullseye – every time you get closer, the bullseye shifts. This can cause wild oscillations: the network overshoots in one direction, which changes the targets, which causes it to overcorrect in the other direction, and so on.
Two mechanisms stabilize this:
- Target network: by freezing a copy of the Q-network and only updating it every 100 steps, the targets remain stable for stretches of training. The bullseye moves, but it moves in discrete jumps rather than continuously, giving the Q-network time to converge toward each set of targets.
- Experience replay (covered in the next chapter): instead of training only on the most recent transition, the agent samples randomly from all past experience. This breaks the correlation between consecutive training samples. Without replay, the agent would train on a sequence of highly correlated transitions (all from the same part of the state space), which tends to cause the network to overfit to recent experience and forget what it learned earlier.
Together, these two ideas transformed DQN from an unstable curiosity into a practical algorithm. Promovolve implements both.
The full picture
Let’s summarize how all the pieces fit together in Promovolve’s bid optimization:
- Every observation window (~15 minutes), the campaign entity sends the agent an observation of its current state.
- The agent converts the observation into an 8-dimensional state vector (effective CPM, CTR, win rate, budget remaining, time remaining, spend rate, impression rate, cost per click).
- If there was a previous state, the agent stores the transition (previous state, action taken, reward received, current state) in the replay buffer.
- The agent runs trainStep(): sample a batch from the replay buffer, compute Bellman targets, train the Q-network, decay epsilon, and periodically sync the target network.
- The agent selects an action using epsilon-greedy: either a random action (exploration) or the action with the highest Q-value (exploitation).
- The selected action maps to a bid multiplier adjustment (e.g., action 5 -> 1.2x), which is applied to the campaign’s base CPM.
- Over the next 15 minutes, the adjusted bid competes in auctions. The results (impressions, clicks, spend) become the next observation, and the cycle repeats.
Over the course of a day, the agent collects dozens of transitions, gradually refines its Q-value estimates, and shifts from random exploration to informed bidding. Over multiple days (with weights persisted via snapshots), the agent accumulates thousands of transitions and develops a nuanced bidding strategy that balances click maximization against budget pacing.
What’s next
We have glossed over two important components: the experience replay buffer (how past transitions are stored and sampled) and the target network (why freezing a separate copy stabilizes training). The next chapter dives deep into both, showing how Promovolve’s ReplayBuffer works and walking through the stability dynamics that make DQN practical.
Experience Replay: Learning From the Past
Imagine you are learning to cook. If you only practiced the dish you made five minutes ago, you would get very good at that one recipe and completely forget everything else. Worse, if your last five dishes were all pasta (because you went on a pasta streak), you would start to think cooking is pasta. You need to flip back through your recipe journal at random – revisiting last Tuesday’s stir fry, last month’s soup, yesterday’s salad – to become a well-rounded cook.
A DQN agent faces exactly this problem. This chapter introduces the replay buffer, a deceptively simple data structure that makes deep reinforcement learning work.
The Problem With Learning Online
Without a replay buffer, the agent would train on each experience exactly once, in the order it happened. Two things go wrong:
Catastrophic forgetting. The agent trains on the most recent experiences, and the neural network’s weights shift to fit them. Older lessons fade. If the agent spent the morning learning to conserve budget and the afternoon learning to bid aggressively, the afternoon training overwrites the morning’s lessons.
Correlated samples. Consecutive experiences look almost identical. If the agent is overspending right now, the next five observations will all show a high spend rate, declining budget, and similar state vectors. Training on five near-identical examples in a row is like studying the same flashcard five times and calling it a productive session. The neural network overfits to the current situation instead of learning general patterns.
Both problems have the same fix: store every experience, then sample randomly for training.
What Is a Transition?
Every time the agent makes a decision and observes the outcome, it produces one transition – the complete record of a single decision:
| Field | What It Means | Ad Bidding Example |
|---|---|---|
state | What the world looked like | Budget 60% remaining, CTR 0.02, win rate 0.4 |
action | What the agent did | Action 4 = multiply bid by 1.1x (bid more aggressively) |
reward | What the agent got | 3 clicks, minus a small overspend penalty = 2.7 |
nextState | What the world looked like afterward | Budget 55% remaining, CTR 0.025, win rate 0.5 |
done | Is the episode over? | false (budget not exhausted, still time left in the day) |
A transition is a before-and-after snapshot of one decision. The agent stores thousands of these and trains on random handfuls at a time.
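In code, handing a transition like the row above to the buffer looks like this. The ReplayBuffer class is shown in full in the next section; the state vectors here are illustrative 8-element arrays in the feature order used elsewhere in the book (effective CPM, CTR, win rate, budget remaining, time remaining, spend rate, impression rate, cost per click), and every value not shown in the table is made up.

```scala
// Storing one illustrative transition in the buffer described below.
val buffer = new ReplayBuffer(capacity = 10000)
buffer.store(
  state     = Array(1.0, 0.02, 0.4, 0.60, 0.5, 1.1, 0.8, 0.5),
  action    = 4,      // multiply bid by 1.1x
  reward    = 2.7,    // 3 clicks minus a small overspend penalty
  nextState = Array(1.0, 0.025, 0.5, 0.55, 0.45, 1.0, 0.9, 0.4),
  done      = false   // budget not exhausted, still time left in the day
)
```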
The Full ReplayBuffer Code
Promovolve’s replay buffer is only 64 lines. Here it is in its entirety:
package promovolve.rl
import scala.util.Random
/** Fixed-size circular experience replay buffer for DQN training.
*
* Stores (state, action, reward, nextState, done) transitions.
* Supports uniform random sampling for mini-batch training.
*/
final class ReplayBuffer(val capacity: Int) {
private val states = new Array[Array[Double]](capacity)
private val actions = new Array[Int](capacity)
private val rewards = new Array[Double](capacity)
private val nextStates = new Array[Array[Double]](capacity)
private val dones = new Array[Boolean](capacity)
private var writeIdx = 0
private var currentSize = 0
def size: Int = currentSize
def store(
state: Array[Double],
action: Int,
reward: Double,
nextState: Array[Double],
done: Boolean
): Unit = {
states(writeIdx) = state.clone()
actions(writeIdx) = action
rewards(writeIdx) = reward
nextStates(writeIdx) = nextState.clone()
dones(writeIdx) = done
writeIdx = (writeIdx + 1) % capacity
if (currentSize < capacity) currentSize += 1
}
/** Sample a random mini-batch. */
def sample(batchSize: Int, rng: Random): ReplayBuffer.Batch = {
require(currentSize >= batchSize,
s"Not enough experiences: $currentSize < $batchSize")
val indices = Array.fill(batchSize)(rng.nextInt(currentSize))
ReplayBuffer.Batch(
states = indices.map(states),
actions = indices.map(actions),
rewards = indices.map(rewards),
nextStates = indices.map(nextStates),
dones = indices.map(dones)
)
}
}
object ReplayBuffer {
final case class Batch(
states: Array[Array[Double]],
actions: Array[Int],
rewards: Array[Double],
nextStates: Array[Array[Double]],
dones: Array[Boolean]
) {
def size: Int = states.length
}
}
Let’s walk through each part.
Storage: Five Parallel Arrays
private val states = new Array[Array[Double]](capacity)
private val actions = new Array[Int](capacity)
private val rewards = new Array[Double](capacity)
private val nextStates = new Array[Array[Double]](capacity)
private val dones = new Array[Boolean](capacity)
The buffer uses a struct-of-arrays layout rather than an array-of-structs. Instead of storing 10,000 Transition objects (each with five fields), it stores five arrays of 10,000 elements. This is a common pattern in performance-sensitive code – it avoids object allocation overhead and keeps memory access patterns cache-friendly when iterating over a single field.
Each array holds one component of the transition tuple. Position i across all five arrays represents a single complete transition.
The Circular Buffer
private var writeIdx = 0
private var currentSize = 0
Two variables track the buffer’s state:
- writeIdx – where the next transition will be written.
- currentSize – how many valid transitions are stored (capped at capacity).
When a new transition arrives:
writeIdx = (writeIdx + 1) % capacity
if (currentSize < capacity) currentSize += 1
The modulo operator (% capacity) is the key. When writeIdx reaches 10,000, it wraps around to 0 and starts overwriting the oldest transitions. This is the “circular” part – the buffer is a ring, not a growing list.
Here is what happens as transitions arrive in a buffer with capacity = 5:
Store #1: [T1, _, _, _, _ ] writeIdx=1, size=1
Store #2: [T1, T2, _, _, _ ] writeIdx=2, size=2
Store #5: [T1, T2, T3, T4, T5] writeIdx=0, size=5 (full!)
Store #6: [T6, T2, T3, T4, T5] writeIdx=1, size=5 (T1 overwritten)
Store #7: [T6, T7, T3, T4, T5] writeIdx=2, size=5 (T2 overwritten)
The oldest experience is always the one that gets replaced. No resizing, no shifting, no garbage collection pressure. Constant-time insertion, every time.
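The index arithmetic can be exercised in isolation. This is an illustrative sketch with a tiny capacity of 5, mirroring only the writeIdx and currentSize bookkeeping from the class above.

```scala
// Wraparound of the circular indices, with capacity 5 (illustrative).
val capacity = 5
var writeIdx = 0
var currentSize = 0
(1 to 7).foreach { n =>
  // a transition would be written to slot `writeIdx` here
  writeIdx = (writeIdx + 1) % capacity
  if (currentSize < capacity) currentSize += 1
  println(s"Store #$n: next writeIdx=$writeIdx, size=$currentSize")
}
// Store #5: next writeIdx=0, size=5  -- full; store #6 overwrites the oldest slot
```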
Why Clone Arrays
Notice this line in store:
states(writeIdx) = state.clone()
nextStates(writeIdx) = nextState.clone()
Why clone()? Because arrays in Scala (and Java) are mutable reference types. The caller passes in a state array, but they might reuse that same array for the next observation – overwriting its contents. Without cloning, the buffer would hold a reference to the caller’s array, and every stored transition would silently change whenever the caller mutates it.
This is a subtle but critical correctness issue. The clone ensures that the buffer owns its own independent copy of every state vector.
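A tiny illustration of the bug the clone prevents. The names here are made up; the point is only that an array stored by reference changes under the buffer's feet.

```scala
// Why the clone matters (illustrative): without it, the buffer holds a
// reference to the caller's array and sees every later mutation.
val callerState = Array(0.6, 0.02, 0.4)

val storedByReference = callerState          // what would happen without clone()
val storedByClone     = callerState.clone()  // what the buffer actually does

callerState(0) = 0.1                         // caller reuses the array next window

storedByReference(0)  // 0.1 -- the "stored" transition silently changed
storedByClone(0)      // 0.6 -- independent copy, still correct
```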
Uniform Random Sampling
val indices = Array.fill(batchSize)(rng.nextInt(currentSize))
When it is time to train, the buffer picks batchSize random indices (with replacement) and gathers the corresponding transitions:
ReplayBuffer.Batch(
states = indices.map(states),
actions = indices.map(actions),
rewards = indices.map(rewards),
nextStates = indices.map(nextStates),
dones = indices.map(dones)
)
“With replacement” means the same transition could appear twice in one batch. In practice, with 10,000 stored transitions and a batch size of 32, duplicates are rare (about a 5% chance of any duplicate), and they do not cause problems.
The Batch Case Class
final case class Batch(
states: Array[Array[Double]],
actions: Array[Int],
rewards: Array[Double],
nextStates: Array[Array[Double]],
dones: Array[Boolean]
) {
def size: Int = states.length
}
The Batch is just a container that holds the sampled transitions in array form, ready for the training loop to iterate over. Each index i across the five arrays is one complete transition.
Why 10,000 Capacity?
From Promovolve’s configuration:
bufferSize = 10_000
This is a tradeoff. Too small, and the agent forgets useful old experiences – it might lose the lessons from a budget crisis two weeks ago. Too large, and the agent trains on stale data from a world that no longer exists (campaigns end, traffic patterns shift, competitors change their bids).
At 15-minute observation intervals with 96 observations per day, 10,000 transitions represent roughly 104 days of experience. That is plenty of history to learn from, while still being recent enough to reflect the current advertising landscape.
The memory cost is modest. Each transition stores two 8-element Double arrays (state and nextState), one Int, one Double, and one Boolean. That comes out to about 150 bytes per transition, so the full buffer uses approximately 1.5 MB. Negligible on any modern server.
Why Wait Until 32 Experiences?
minBufferSize = 32
The agent refuses to train until the buffer holds at least 32 transitions. Why?
If you trained on 3 experiences, the neural network would memorize those 3 examples perfectly and learn nothing general. It would be like studying for an exam by reading only three questions and assuming those are the only three questions that could possibly appear.
With 32 diverse experiences – some from periods of overspending, some from underspending, some from high-CTR traffic, some from low – the network sees enough variety to extract patterns rather than memorize specifics. The number 32 is also the batch size, so the very first training step can already fill a complete batch from the available experience (sampled with replacement, so a few transitions may repeat).
Why Batch Size 32?
batchSize = 32
Each training step samples 32 random transitions from the buffer. Why not 1? Why not 1,000?
Batch size 1 (online learning): each weight update is based on a single example. The gradient points in roughly the right direction but with enormous noise. Training is erratic – the network lurches from one example to the next.
Batch size 1,000: the gradient is an average over 1,000 examples, so it is very smooth and stable. But each training step is 1,000 times more expensive, and in the early stages when the buffer is small, you would need 1,000 experiences before you could train at all.
Batch size 32 is a common sweet spot. It averages out enough noise to give the optimizer a reliable gradient direction, while remaining cheap enough to run on every observation step. This value is so standard in deep learning that it is almost the “default” choice. Larger models on GPUs sometimes benefit from bigger batches, but for Promovolve’s small 8-input, 2-hidden-layer network running on CPU, 32 is more than adequate.
The Key Insight
By sampling randomly from a large buffer, the replay buffer breaks the temporal correlation between training examples.
Without replay:
Training step 1: experience from 10:00am
Training step 2: experience from 10:15am
Training step 3: experience from 10:30am
Training step 4: experience from 10:45am
All four experiences come from the same morning, with similar traffic, similar budget levels, similar everything. The network thinks the world always looks like late morning.
With replay:
Training step 1 (batch of 32):
- experience from Monday 10:00am
- experience from Friday 3:45pm
- experience from Tuesday 8:15am
- experience from Thursday 11:30pm
- ... 28 more random picks
A Monday morning experience sits next to a Friday afternoon one. An experience where budget was nearly exhausted sits next to one where the day had just started. The network sees the full diversity of situations it might encounter, every single training step.
This is why experience replay was one of the key innovations that made DQN work in the original Atari paper (Mnih et al., 2015). Without it, the neural network is unstable. With it, training converges reliably. And as you have seen, the implementation is nothing more than a circular array and a random number generator.
Next chapter: we have a buffer full of experiences and a way to sample them. Now we need to compute the right training targets – and avoid a subtle trap where the agent systematically overestimates how good its actions are.
Double DQN: Fixing Overestimation
The previous chapter gave us a replay buffer full of past experiences. Now we need to train the neural network on those experiences. This chapter covers the core training loop in Promovolve’s DQN agent – and a subtle but devastating bug in standard DQN that Double DQN fixes.
The Overestimation Problem
Recall from earlier chapters that the DQN training target for a transition (state, action, reward, nextState) is:
target = reward + gamma * max Q(nextState, a')
^^^
over all actions a'
The agent asks: “What is the best possible future value from nextState?” and uses the answer to update the Q-value of the action it took.
The problem is the max operator. Here is why.
Noise Becomes Optimism
Early in training, Q-values are essentially random numbers. The network has not learned anything yet, so its predictions are noise. Suppose the agent is evaluating nextState and there are 7 possible actions. The Q-network produces these estimates:
Q(nextState, action0) = 0.3 (true value: 0.5)
Q(nextState, action1) = 1.8 (true value: 0.4) <-- overestimated!
Q(nextState, action2) = -0.2 (true value: 0.6)
Q(nextState, action3) = 0.7 (true value: 0.5)
Q(nextState, action4) = 0.1 (true value: 0.3)
Q(nextState, action5) = 0.4 (true value: 0.4)
Q(nextState, action6) = 0.9 (true value: 0.7)
The true best action is action 6 (true value 0.7), but max picks action 1 (estimated value 1.8) because it has the highest noise spike. The max operator acts as a noise amplifier – it always selects the most overestimated value.
If Q-values were perfectly accurate, max would correctly identify the best action. But Q-values always have estimation error, especially early in training. And max systematically exploits that error in the upward direction. It can never underestimate (it picks the highest value), so the bias is always positive.
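You can see the bias with a short simulation. This is an illustrative sketch, not from the codebase: every action has the same true value, the estimates are just that value plus zero-mean noise, and yet the max of the estimates lands consistently far above the truth.

```scala
// Max over noisy estimates is biased upward even when the noise is unbiased.
import scala.util.Random

val rng = new Random(42)
val trueValue  = 0.5
val numActions = 7
val trials     = 100000

val averageMax = (1 to trials).map { _ =>
  val noisyEstimates = Array.fill(numActions)(trueValue + rng.nextGaussian() * 0.5)
  noisyEstimates.max
}.sum / trials

println(f"true value of every action: $trueValue%.2f")
println(f"average of max over noisy estimates: $averageMax%.2f")   // ≈ 1.2
```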
Overestimation Compounds
This would be manageable if it only happened once. But the overestimated target gets baked into the Q-network’s weights through training. On the next training step, the Q-network produces slightly higher values (because it was trained toward the inflated target), which produce an even more inflated max, which produces an even higher target. The overestimation feeds on itself.
After thousands of training steps, the agent can end up with Q-values that bear no relationship to reality. In ad bidding, this manifests as the agent believing that aggressive bidding (action 6 = 1.4x multiplier) is always the best choice, because the noisy early Q-values made it look good, and that initial overestimation compounded over time. The result: the agent overspends the campaign budget and burns through it in the first few hours.
The Double DQN Fix
The insight behind Double DQN (Van Hasselt et al., 2016) is to decouple action selection from action evaluation using two separate networks.
Standard DQN uses the same network for both steps:
- Select the best action: which action has the highest Q-value?
- Evaluate that action: what is that Q-value?
When the same noisy network both picks the action and evaluates it, the noise reinforces itself. The network picks the action with the highest noise spike, then reports that inflated value as the evaluation.
Double DQN splits these responsibilities:
- Q-network (online network) selects the best action for the next state.
- Target network (a frozen copy) evaluates that action’s value.
Even if the Q-network is wrong about which action is best, the target network’s evaluation of that action is likely more conservative – because the target network has different noise patterns (it is a snapshot from 100 training steps ago). The two networks’ errors are unlikely to align.
Here is the difference in formulas:
Standard DQN:
target = reward + gamma * max Q_target(nextState, a')
(target network picks AND evaluates)
Double DQN:
bestAction = argmax Q_online(nextState, a') -- online network picks
target = reward + gamma * Q_target(nextState, bestAction) -- target network evaluates
The change is small – just one line of code – but it dramatically reduces overestimation.
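As a sketch, the two targets can be written side by side as small functions. In the codebase this logic lives inline in DQNAgent.trainStep() (shown below); the functions here assume the DenseNetwork and argMax helper from earlier chapters and exist only to highlight the difference.

```scala
// Standard DQN: the target network both selects and evaluates (overestimation-prone).
def standardDqnTarget(reward: Double, nextState: Array[Double], gamma: Double,
                      targetNetwork: DenseNetwork): Double =
  reward + gamma * targetNetwork.forward(nextState).max

// Double DQN: the online network selects, the target network evaluates.
def doubleDqnTarget(reward: Double, nextState: Array[Double], gamma: Double,
                    qNetwork: DenseNetwork, targetNetwork: DenseNetwork): Double = {
  val bestAction = argMax(qNetwork.forward(nextState))
  reward + gamma * targetNetwork.forward(nextState)(bestAction)
}
```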
The Code
Here is the training loop from Promovolve’s DQNAgent.trainStep():
def trainStep(): Option[Double] = {
if (replayBuffer.size < config.minBufferSize) return None
val batch = replayBuffer.sample(config.batchSize, rng)
var totalLoss = 0.0
var i = 0
while (i < batch.size) {
val state = batch.states(i)
val action = batch.actions(i)
val reward = batch.rewards(i)
val nextState = batch.nextStates(i)
val done = batch.dones(i)
// Current Q-values for this state
val currentQ = qNetwork.forward(state)
// Double DQN target:
// 1. Q-network selects best action for next state
// 2. Target network evaluates that action
val target = currentQ.clone()
if (done) {
target(action) = reward
} else {
val nextQOnline = qNetwork.forward(nextState)
val bestNextAction = argMax(nextQOnline)
val nextQTarget = targetNetwork.forward(nextState)
target(action) = reward + config.gamma * nextQTarget(bestNextAction)
}
// Clip target to prevent extreme Q-values
target(action) = math.max(-config.qClip, math.min(config.qClip, target(action)))
totalLoss += qNetwork.train(state, target, config.learningRate)
i += 1
}
trainSteps += 1
// Sync target network periodically
if (trainSteps % config.targetSyncInterval == 0) {
targetNetwork.copyFrom(qNetwork)
}
// Decay epsilon
epsilon = math.max(config.epsilonEnd, epsilon * config.epsilonDecay)
Some(totalLoss / batch.size)
}
Let’s break this down piece by piece.
Guard: Don’t Train Too Early
if (replayBuffer.size < config.minBufferSize) return None
If the replay buffer has fewer than 32 experiences, skip training entirely. As discussed in the previous chapter, training on a tiny number of examples produces memorization, not learning. The method returns None to signal “no training happened.”
Sample a Batch
val batch = replayBuffer.sample(config.batchSize, rng)
Pull 32 random transitions from the 10,000-capacity replay buffer. This is where experience replay breaks the temporal correlation – we might get a transition from Monday morning alongside one from Thursday evening.
Build the Training Target
For each transition in the batch, the code needs to construct a target Q-value vector that the network will be trained toward. This is the heart of Double DQN:
val currentQ = qNetwork.forward(state)
val target = currentQ.clone()
First, compute the Q-network’s current predictions for the state. Then clone that array to create the target. The target starts as a copy of the current predictions – this is important because we only want to change the Q-value for the action that was actually taken. All other Q-values in the target remain equal to the current predictions, so they produce zero gradient (no learning signal).
Terminal States
if (done) {
target(action) = reward
}
When done is true, the episode is over – the campaign’s budget was exhausted or the delivery day ended. There is no future state to consider, so the target is simply the reward received. No discounting, no future value. The episode is finished.
The Double DQN Core
val nextQOnline = qNetwork.forward(nextState)
val bestNextAction = argMax(nextQOnline)
val nextQTarget = targetNetwork.forward(nextState)
target(action) = reward + config.gamma * nextQTarget(bestNextAction)
This is where Double DQN differs from standard DQN. Four lines that are worth reading carefully:
- nextQOnline = qNetwork.forward(nextState) – The Q-network (the one being trained) evaluates all actions for the next state. These values are used only to decide which action looks best.
- bestNextAction = argMax(nextQOnline) – Pick the action with the highest Q-value according to the Q-network. This is the “selection” step.
- nextQTarget = targetNetwork.forward(nextState) – The target network (frozen copy) also evaluates all actions for the next state. These values are used only to evaluate the selected action.
- target(action) = reward + config.gamma * nextQTarget(bestNextAction) – The final target combines the actual reward with the target network’s evaluation of the action that the Q-network selected. Selection and evaluation are decoupled.
If the Q-network’s noise causes it to select the wrong “best” action, the target network’s evaluation of that wrong action is likely to be mediocre (not inflated), which limits the damage.
Q-Value Clipping
target(action) = math.max(-config.qClip, math.min(config.qClip, target(action)))
Even with Double DQN, Q-values can occasionally grow large during training instabilities. This safety clamp keeps targets within [-100, 100] (the default qClip value). It is a simple guardrail that prevents a single runaway target from destabilizing the entire network.
Train the Network
totalLoss += qNetwork.train(state, target, config.learningRate)
This passes the (state, target) pair to the neural network for one step of backpropagation with stochastic gradient descent. The network adjusts its weights to make its predictions closer to the target. The method returns the MSE (mean squared error) loss, which we accumulate for monitoring.
Note that only the Q-network is trained. The target network’s weights are never updated by backpropagation.
Sync the Target Network
if (trainSteps % config.targetSyncInterval == 0) {
targetNetwork.copyFrom(qNetwork)
}
Every 100 training steps, the target network’s weights are replaced with a copy of the Q-network’s current weights. This is the “hard sync” approach – the target network is an exact copy, just 100 steps behind.
Why not sync every step? If the target network updated continuously with the Q-network, the training targets would shift with every weight update. The network would be chasing a moving target – literally. Stable targets produce stable training.
Think of it this way: the Q-network is a student taking a 100-question test. The target network is the answer key. If someone kept changing the answer key while the student was working, the student could never converge on the right answers. By freezing the answer key for 100 questions at a time, the student can make meaningful progress before the key is updated.
With targetSyncInterval = 100 and the agent training once per 15-minute observation, the target network updates roughly once per day (100 observations = 25 hours). This gives the Q-network a full day of stable targets to learn against before they shift.
Epsilon Decay
epsilon = math.max(config.epsilonEnd, epsilon * config.epsilonDecay)
After each training step, the exploration rate decays. With epsilonDecay = 0.995, epsilon drops from 1.0 (pure exploration) to 0.05 (5% exploration) over about 600 training steps. The agent gradually shifts from random exploration to exploiting what it has learned.
Walking Through a Concrete Example
Let’s trace one complete training step with a batch of 3 transitions (the real batch size is 32, but 3 is easier to follow). The config uses gamma = 0.99.
The Batch
| i | state (abbreviated) | action | reward | nextState (abbreviated) | done |
|---|---|---|---|---|---|
| 0 | [1.0, 0.02, 0.4, 0.6, 0.5, 1.1, 0.8, 0.5] | 3 (hold at 1.0x) | 2.0 | [1.0, 0.03, 0.5, 0.55, 0.45, 1.0, 0.9, 0.4] | false |
| 1 | [1.2, 0.01, 0.3, 0.2, 0.1, 1.8, 0.5, 1.0] | 5 (bid 1.2x) | -1.0 | [1.4, 0.01, 0.2, 0.0, 0.05, 2.5, 0.3, 1.2] | true |
| 2 | [0.8, 0.04, 0.6, 0.8, 0.7, 0.8, 1.2, 0.3] | 1 (bid 0.8x) | 4.0 | [0.7, 0.03, 0.5, 0.75, 0.65, 0.9, 1.0, 0.3] | false |
Transition 0: Normal Mid-Day Observation
The agent held steady (action 3 = 1.0x multiplier) and earned 2 clicks.
currentQ = qNetwork.forward(state_0)
= [-0.5, 0.3, 1.2, 0.8, 0.6, 0.4, 0.1]
The Q-network currently thinks action 2 (0.9x) is the best choice from this state. Now we compute the Double DQN target for action 3 (the action that was actually taken):
nextQOnline = qNetwork.forward(nextState_0)
= [0.1, 0.5, 1.5, 1.0, 0.7, 0.3, 0.2]
bestNextAction = argMax(nextQOnline) = 2 (action 2, value 1.5)
nextQTarget = targetNetwork.forward(nextState_0)
= [0.2, 0.4, 1.1, 0.9, 0.8, 0.5, 0.3]
Notice: the Q-network selected action 2, but the target network’s value for action 2 is 1.1, not 1.5. The target network’s estimate is more conservative.
target[3] = reward + gamma * nextQTarget[bestNextAction]
= 2.0 + 0.99 * 1.1
= 2.0 + 1.089
= 3.089
With standard DQN (using the target network for both selection and evaluation), the target would have used max(nextQTarget) = 1.1 here too. But imagine if the target network had a different noise pattern where action 4 had value 1.3 – standard DQN would use 1.3 (the maximum), while Double DQN still uses 1.1 (the value of the Q-network’s choice, as evaluated by the target network).
The final target vector for transition 0:
target = [-0.5, 0.3, 1.2, 3.089, 0.6, 0.4, 0.1]
^^^^^
only this changed (action 3)
The network trains to push its Q-value for (state_0, action 3) from 0.8 toward 3.089.
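For readers who want to check the arithmetic, here is the same transition-0 target computed as a standalone sketch; the network outputs are hardcoded from the numbers above rather than produced by a real forward pass:

```scala
val gamma       = 0.99
val reward      = 2.0
val actionTaken = 3                                            // the action that was actually taken

val currentQ    = Array(-0.5, 0.3, 1.2, 0.8, 0.6, 0.4, 0.1)    // qNetwork.forward(state_0)
val nextQOnline = Array(0.1, 0.5, 1.5, 1.0, 0.7, 0.3, 0.2)     // qNetwork.forward(nextState_0)
val nextQTarget = Array(0.2, 0.4, 1.1, 0.9, 0.8, 0.5, 0.3)     // targetNetwork.forward(nextState_0)

val bestNextAction = nextQOnline.indexOf(nextQOnline.max)      // 2: selected by the online network
val target = currentQ.clone()                                  // start from the current predictions
target(actionTaken) = reward + gamma * nextQTarget(bestNextAction)  // 2.0 + 0.99 * 1.1 = 3.089
// target == [-0.5, 0.3, 1.2, 3.089, 0.6, 0.4, 0.1]
```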
Transition 1: Terminal State (Budget Exhausted)
The agent bid aggressively (action 5 = 1.2x) while the budget was already low (20% remaining, spend rate at 1.8x ideal pace). It received reward -1.0 (overspend penalty exceeded clicks). The next state shows budget at 0% with time remaining – done = true.
target[5] = reward = -1.0 (no future discounting -- episode over)
This is the simplest case. The agent exhausted the budget with time left in the day, and it learns that aggressive bidding in this situation leads to a negative outcome. No need to estimate future value – there is no future.
Transition 2: Conservative Bidding, Good Outcome
The agent bid conservatively (action 1 = 0.8x) with plenty of budget and time remaining, and earned 4 clicks.
nextQOnline = qNetwork.forward(nextState_2)
= [0.3, 0.8, 2.0, 1.5, 1.1, 0.6, 0.4]
bestNextAction = argMax(nextQOnline) = 2
nextQTarget = targetNetwork.forward(nextState_2)
= [0.4, 0.7, 1.6, 1.2, 0.9, 0.5, 0.3]
target[1] = 4.0 + 0.99 * 1.6 = 5.584
The agent learns that conservative bidding with a healthy budget produces good outcomes with promising future value.
After All Three
Each of the three (state, target) pairs is fed through qNetwork.train(...), which runs backpropagation and nudges the Q-network’s weights. The total loss is averaged and returned:
totalLoss / batchSize = average MSE across the 3 transitions
This is one training step. In Promovolve, this happens every 15 minutes when the BidOptimizationAgent.observe() method is called.
Putting It All Together
Here is the full picture of how Double DQN training works in Promovolve, from observation to weight update:
- Every 15 minutes, BidOptimizationAgent.observe() is called with the current campaign metrics.
- The agent computes the reward from the previous action and stores the transition (prevState, prevAction, reward, currentState, done) in the replay buffer.
- DQNAgent.trainStep() samples 32 random transitions from the buffer.
- For each transition, it computes a Double DQN target: the Q-network selects the best next action, the target network evaluates it.
- Terminal transitions (budget exhausted or day ended) use the raw reward as the target.
- Q-value targets are clipped to [-100, 100] for stability.
- The Q-network is trained via backpropagation on each (state, target) pair.
- Every 100 training steps, the target network is synced with the Q-network.
- Epsilon decays slightly, shifting the balance from exploration toward exploitation.
The result is an agent that steadily improves its bid multiplier decisions over days and weeks of operation, without the runaway overestimation that would cause it to blow through campaign budgets.
Next chapter: we have covered the neural network, the replay buffer, and the Double DQN training loop. Now it is time to see how these components wire together in the BidOptimizationAgent and connect to the live auction system.
Putting It Together: The BidOptimizationAgent
We have built every piece individually: a neural network that estimates Q-values, a replay buffer that stores experience, and a Double DQN training loop that learns from that experience. Now it is time to see how Promovolve wires them into a single, working bid optimization agent.
The class is BidOptimizationAgent, and it lives in modules/core/src/main/scala/promovolve/rl/BidOptimizationAgent.scala. We will walk through it top to bottom.
The architecture
The nesting looks like this:
BidOptimizationAgent (one per campaign)
└── DQNAgent
├── qNetwork (DenseNetwork: 8 → 64 → 64 → 7)
├── targetNetwork (DenseNetwork: 8 → 64 → 64 → 7, periodically synced)
└── replayBuffer (ReplayBuffer: capacity 10,000)
BidOptimizationAgent is the outer shell that knows about campaigns, budgets, and ad serving. It translates real-world campaign metrics into the abstract language of states, actions, and rewards that the inner DQNAgent understands. The DQNAgent in turn owns the two neural networks and the replay buffer we built in earlier chapters.
One line creates the entire inner stack:
private val dqn = DQNAgent(config.dqnConfig, rng)
Everything else in BidOptimizationAgent is bookkeeping: tracking window counters, computing states, computing rewards, and applying actions back to the bid multiplier.
Configuration
Here is the actual default configuration that Promovolve ships with:
final case class Config(
dqnConfig: DQNAgent.Config = DQNAgent.Config(
stateSize = 8,
actionSize = 7,
hiddenSizes = Vector(64, 64),
gamma = 0.99,
learningRate = 0.001,
epsilonStart = 1.0,
epsilonEnd = 0.05,
epsilonDecay = 0.995,
bufferSize = 10_000,
minBufferSize = 32,
batchSize = 32,
targetSyncInterval = 100,
actionMultipliers = Vector(0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4)
),
minMultiplier: Double = 0.5,
maxMultiplier: Double = 2.0,
overspendPenalty: Double = 2.0,
exhaustionPenalty: Double = 5.0,
inferenceOnly: Boolean = false
)
Let’s unpack the important choices.
7 actions, asymmetric. The action multipliers are [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4]. Notice they are not symmetric around 1.0. There are more options for bidding up (1.1, 1.2, 1.4) than for bidding down (0.7, 0.8, 0.9), and the most aggressive upward option (1.4x) is a bigger jump than the most aggressive downward option (0.7x). This reflects a deliberate design choice: in a competitive auction, missing out on impressions is often worse than slightly overpaying. The agent can slam the brakes hard when it needs to (0.7x), but it also has a “turbo” option (1.4x) for when it is underspending and needs to catch up fast.
Hard bounds: 0.5 to 2.0. No matter what sequence of actions the agent takes, the cumulative bid multiplier is clamped to this range. A campaign will never bid less than half its base CPM, and never more than double. This is a safety rail that prevents the RL agent from doing anything catastrophic.
Discount factor (gamma = 0.99). The agent values future rewards almost as much as immediate ones. This makes sense for daily budget pacing: you do not want an agent that burns through the budget in the first hour just because it got a few clicks early on.
Exploration (epsilon: 1.0 to 0.05, decay 0.995). The agent starts fully random and slowly shifts toward exploiting what it has learned. With a decay of 0.995 per training step, after 100 training steps epsilon is about 0.61, after 300 steps about 0.22, and after 600 steps about 0.05 (the floor). Each observation triggers at most one training step, and observations happen every 15 minutes, so reaching the floor takes roughly a week of continuous operation.
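The schedule is easy to reproduce; here is a small sketch of the decay arithmetic (assuming exactly one training step per 15-minute observation):

```scala
// Epsilon after n training steps: multiplicative decay with a floor.
def epsilonAfter(steps: Int, start: Double = 1.0, floor: Double = 0.05, decay: Double = 0.995): Double =
  math.max(floor, start * math.pow(decay, steps))

// epsilonAfter(100) ≈ 0.61
// epsilonAfter(300) ≈ 0.22
// epsilonAfter(600) ≈ 0.05 (the floor)
// At 4 observations per hour, 600 steps ≈ 150 hours, i.e. roughly a week of continuous operation.
```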
Penalties. overspendPenalty = 2.0 and exhaustionPenalty = 5.0 shape the reward signal to discourage burning through the budget too fast. We will see exactly how these work when we look at the reward function.
Window counters
Between observations (every 15 minutes), the agent accumulates raw events:
def recordImpression(spendAmount: Double): Unit = {
windowImpressions += 1
windowSpend += spendAmount
dayImpressions += 1
daySpend += spendAmount
}
def recordClick(): Unit = {
windowClicks += 1
dayClicks += 1
}
def recordBidOpportunity(won: Boolean): Unit = {
windowBidOpportunities += 1
if (won) windowWins += 1
}
The window* counters track what happened since the last observation. The day* counters track the entire day for monitoring. The CampaignEntity calls these methods as impressions, clicks, and bid opportunities flow through the system. By the time observe() is called, the window counters contain a summary of the last 15 minutes of activity.
The state: translating campaign metrics into numbers
The toState method converts an Observation plus the window counters into an 8-dimensional array that the neural network can process:
private def toState(obs: Observation): Array[Double] = {
val maxCpm = if (obs.maxCpm > 0) obs.maxCpm else 1.0
val dailyBudget = if (obs.dailyBudget > 0) obs.dailyBudget else 1.0
Array(
// 0: effective CPM (normalized)
math.min(2.0, (obs.maxCpm * _bidMultiplier) / maxCpm),
// 1: CTR in window
if (windowImpressions > 0) math.min(1.0, windowClicks.toDouble / windowImpressions)
else 0.0,
// 2: win rate
if (windowBidOpportunities > 0) windowWins.toDouble / windowBidOpportunities
else 0.5,
// 3: budget remaining fraction
math.max(0.0, math.min(1.0, obs.budgetRemaining / dailyBudget)),
// 4: time remaining fraction
math.max(0.0, math.min(1.0, obs.timeRemaining)),
// 5: spend rate vs ideal (1.0 = on pace)
spendRate(obs),
// 6: impression rate (normalized by expected)
normalizedImpressionRate(obs),
// 7: CPC (normalized)
if (windowClicks > 0) math.min(2.0, (windowSpend / windowClicks) / maxCpm)
else 0.0
)
}
Each dimension is normalized to a small range (roughly 0 to 2) so the neural network can learn effectively. Here is what each one tells the agent:
| Index | Feature | What it means |
|---|---|---|
| 0 | Effective CPM | How much we are currently bidding, relative to the base price. Equals the bid multiplier itself. |
| 1 | CTR | Click-through rate in the last 15 minutes. Higher is better. |
| 2 | Win rate | Fraction of auctions we won. Low means we are being outbid. |
| 3 | Budget remaining | How much money is left today (1.0 = full, 0.0 = empty). |
| 4 | Time remaining | How much of the day is left (1.0 = start, 0.0 = end). |
| 5 | Spend rate | Current spending speed vs. ideal even pace. 1.0 means on track, 2.0 means spending twice as fast as we should. |
| 6 | Impression rate | How many impressions we got, normalized by a baseline of 100 per window. |
| 7 | CPC | Cost per click, normalized by the base CPM. Lower is better. |
The Observation case class provides the campaign-level data that is not directly available from window counters:
final case class Observation(
maxCpm: Double, // Campaign's base max CPM (before multiplier)
dailyBudget: Double, // Total daily budget in dollars
budgetRemaining: Double, // Remaining budget in dollars
timeRemaining: Double, // Fraction of delivery period remaining
timestamp: Instant // When this observation was taken
)
The spend rate calculation deserves a closer look:
private def spendRate(obs: Observation): Double = {
if (obs.dailyBudget <= 0 || obs.timeRemaining >= 1.0) return 1.0
val elapsed = 1.0 - obs.timeRemaining
if (elapsed <= 0) return 1.0
val expectedSpend = obs.dailyBudget * elapsed
if (expectedSpend <= 0) return 1.0
val actualSpend = obs.dailyBudget - obs.budgetRemaining
math.min(3.0, actualSpend / expectedSpend) // cap at 3x overspend
}
If 40% of the day has passed and we have spent 40% of the budget, the spend rate is 1.0 – perfectly on pace. If we have spent 60% of the budget in that same time, the rate is 1.5 – we are overspending. This single number gives the agent a strong signal about whether it should bid more or less aggressively.
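Plugging the second scenario into the Observation case class makes the arithmetic explicit (the dollar amounts and maxCpm here are hypothetical):

```scala
import java.time.Instant

// 40% of the day has elapsed, but 60% of a $100 daily budget is already spent.
val obs = Observation(
  maxCpm          = 5.0,      // illustrative base CPM
  dailyBudget     = 100.0,
  budgetRemaining = 40.0,     // $60 spent so far
  timeRemaining   = 0.6,      // 40% of the day elapsed
  timestamp       = Instant.now()
)
// elapsed       = 1.0 - 0.6    = 0.4
// expectedSpend = 100.0 * 0.4  = 40.0
// actualSpend   = 100.0 - 40.0 = 60.0
// spendRate     = 60.0 / 40.0  = 1.5  -> spending 50% faster than the ideal pace
```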
The reward: what the agent optimizes for
private def computeReward(obs: Observation): Double = {
val clickReward = windowClicks.toDouble
val rate = spendRate(obs)
val overspendPenalty =
if (rate > 1.5) config.overspendPenalty * (rate - 1.5) else 0.0
val exhaustionPenalty =
if (obs.budgetRemaining <= 0 && obs.timeRemaining > 0.1)
config.exhaustionPenalty
else 0.0
clickReward - overspendPenalty - exhaustionPenalty
}
The reward is simple: clicks minus penalties. The agent gets +1 for each click in the window. But two things reduce the reward:
- Overspend penalty. If the spend rate exceeds 1.5x (spending 50% faster than ideal), a penalty kicks in proportional to how far over 1.5 it is. With overspendPenalty = 2.0, a spend rate of 2.5 costs a penalty of 2.0 * (2.5 - 1.5) = 2.0 – equivalent to losing 2 clicks worth of reward.
- Exhaustion penalty. If the budget hits zero while more than 10% of the day remains, a flat penalty of 5.0 is applied. Running out of budget at 3pm when ads should run until midnight is a serious failure; this penalty makes sure the agent learns to avoid it.
The combination encourages the agent to maximize clicks while spending at a sustainable pace throughout the day.
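As a quick sanity check of the arithmetic, consider a hypothetical window with 3 clicks, a spend rate of 2.5, and budget still remaining:

```scala
// Reward for one hypothetical 15-minute window (values chosen for illustration).
val clickReward       = 3.0                     // 3 clicks in the window
val overspendPenalty  = 2.0 * (2.5 - 1.5)       // spend rate 2.5 > 1.5, so the penalty applies: 2.0
val exhaustionPenalty = 0.0                     // budget not exhausted
val reward = clickReward - overspendPenalty - exhaustionPenalty   // 3.0 - 2.0 - 0.0 = 1.0
```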
The observation loop
Here is the core method, called every 15 minutes:
def observe(obs: Observation): (Double, Option[Double]) = {
val state = toState(obs)
// If we have a previous state, store the transition and learn
val loss = prevState match {
case Some(ps) =>
val reward = computeReward(obs)
dayRewardSum += reward
val done = obs.budgetRemaining <= 0 || obs.timeRemaining <= 0
dqn.store(ps, prevAction.get, reward, state, done)
dqn.trainStep()
case None => None
}
dayObservations += 1
// Select next action
val action =
if (config.inferenceOnly) dqn.selectGreedy(state)
else dqn.selectAction(state)
// Apply action: adjust multiplier
val adjustment = config.dqnConfig.multiplierForAction(action)
_bidMultiplier = math.max(
config.minMultiplier,
math.min(config.maxMultiplier, _bidMultiplier * adjustment)
)
// Save state for next observation
prevObservation = Some(obs)
prevState = Some(state)
prevAction = Some(action)
// Reset window counters
windowImpressions = 0
windowClicks = 0
windowSpend = 0.0
windowBidOpportunities = 0
windowWins = 0
(_bidMultiplier, loss)
}
Let’s trace through what happens on each call, step by step.
Step 1: Convert observation to state. The toState method combines the Observation with window counters to produce an 8-element array of normalized features.
Step 2: Learn from the previous action. If this is not the first observation, we now know the result of the action we chose last time. We compute the reward (clicks minus penalties), build a transition (prevState, prevAction, reward, currentState, done), store it in the replay buffer, and run one training step. The done flag is true if the budget is exhausted or the day is over.
Step 3: Choose the next action. If we are in inference-only mode, pick the action with the highest Q-value. Otherwise, use epsilon-greedy: with probability epsilon pick a random action, otherwise pick the greedy best.
Step 4: Apply the action. Look up the chosen action’s multiplier (e.g., action 4 maps to 1.1x), multiply it into the current bid multiplier, and clamp the result to [0.5, 2.0].
Step 5: Save state for next time. Store the current state and action so that on the next observation we can compute the reward and build a transition.
Step 6: Reset window counters. Clear all the impression, click, spend, and bid opportunity counters so they are fresh for the next 15-minute window.
The method returns the new bid multiplier and the training loss (if training happened). The CampaignEntity uses the bid multiplier for all bid responses until the next observation.
The cumulative multiplier
A subtle but important point: actions do not set the multiplier directly. They scale it. Each action is a relative adjustment to whatever the current multiplier is.
Here is a concrete example of how the multiplier evolves through a day:
| Observation | Action chosen | Multiplier before | Calculation | Result |
|---|---|---|---|---|
| 1 (9:00 AM) | 0.9x | 1.0 | 1.0 x 0.9 | 0.9 |
| 2 (9:15 AM) | 1.2x | 0.9 | 0.9 x 1.2 | 1.08 |
| 3 (9:30 AM) | 0.7x | 1.08 | 1.08 x 0.7 | 0.756 |
| 4 (9:45 AM) | 1.4x | 0.756 | 0.756 x 1.4 | 1.058 |
| 5 (10:00 AM) | 1.4x | 1.058 | 1.058 x 1.4 | 1.482 |
| 6 (10:15 AM) | 1.4x | 1.482 | 1.482 x 1.4 | 2.0 (clamped) |
Notice observation 6: the raw result would be 2.074, but it exceeds the maximum and is clamped to 2.0. The hard bounds always apply. This is the safety mechanism in code:
_bidMultiplier = math.max(
config.minMultiplier,
math.min(config.maxMultiplier, _bidMultiplier * adjustment)
)
This cumulative design means the agent can make large adjustments over several steps (0.7 x 0.7 = 0.49, clamped to 0.5) while each individual step is a moderate change. It also means the agent has to learn to “undo” previous decisions – if it overbid in the morning, it needs to choose multipliers below 1.0 in the afternoon to bring the overall multiplier back down.
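The table above can be replayed as a simple fold over the chosen multipliers (a sketch; the clamp bounds are the configured 0.5 and 2.0):

```scala
// Start at 1.0, apply each chosen action multiplier, clamp to [0.5, 2.0] after every step.
val chosen = List(0.9, 1.2, 0.7, 1.4, 1.4, 1.4)

val trajectory = chosen.scanLeft(1.0) { (multiplier, adjustment) =>
  math.max(0.5, math.min(2.0, multiplier * adjustment))
}
// trajectory ≈ List(1.0, 0.9, 1.08, 0.756, 1.058, 1.482, 2.0)
//                                                         ^^^ clamped from 2.074
```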
Day stats for monitoring
The agent tracks cumulative daily metrics for monitoring dashboards:
def dayStats: BidOptimizationAgent.DayStats = BidOptimizationAgent.DayStats(
impressions = dayImpressions,
clicks = dayClicks,
spend = daySpend,
observations = dayObservations,
totalReward = dayRewardSum
)
The DayStats case class also provides derived metrics:
final case class DayStats(
impressions: Long,
clicks: Long,
spend: Double,
observations: Int,
totalReward: Double
) {
def ctr: Double = if (impressions > 0) clicks.toDouble / impressions else 0.0
def costPerClick: Double = if (clicks > 0) spend / clicks else 0.0
}
These numbers let operators see how the agent is performing day over day. Is it getting more clicks? Is the cost per click improving? Is the total reward trending upward? We will see in the next chapter how to use these across days to watch the learning curve.
Recap
BidOptimizationAgent is a thin translation layer. It converts the messy real world – impressions, clicks, budgets, time of day – into the clean abstractions that DQN needs: fixed-size state vectors, discrete actions, and scalar rewards. The actual learning happens inside DQNAgent, which we built in earlier chapters.
The key design decisions are:
- 8-dimensional state that captures everything the agent needs to know about current performance and remaining resources.
- 7 asymmetric actions that give the agent fine-grained control over bid adjustments, with more aggressive options for bidding up.
- Cumulative multiplier with hard bounds, so the agent adjusts incrementally and cannot do anything catastrophic.
- Click-based reward with pacing penalties, so the agent learns to maximize performance while spending sustainably.
- 15-minute observation cycle, slow enough to observe meaningful patterns and fast enough to react to changing conditions.
In the next chapter, we will see how this agent handles the realities of production: day boundaries, persistence across restarts, cold starts, and the fact that many agents are learning simultaneously in the same marketplace.
Training in Production: Episodes, Persistence, and Day Resets
Most RL tutorials end with a training loop and a plot of rising rewards. Real systems have to deal with everything that comes after: what happens at midnight? What happens when the server restarts? What happens on the first day when the agent knows nothing? This chapter covers the production concerns that tutorials usually skip.
Episodes are days
In the textbook formulation, an RL episode is a sequence of states and actions that ends when the agent reaches a terminal state. In Promovolve, each delivery day is one episode. The agent starts the day with a bid multiplier of 1.0, makes observations every 15 minutes throughout the day, and the episode ends at midnight (or when the budget runs out, whichever comes first).
The resetDay() method handles the transition between episodes:
def resetDay(): Unit = {
// Store terminal transition only if we have both a pending state AND a real action
// (prevAction is None if observe() was never called this day — no transition to record)
for {
ps <- prevState
pa <- prevAction
} {
val terminalState = Array.fill(config.dqnConfig.stateSize)(0.0)
val terminalReward = windowClicks.toDouble
dqn.store(ps, pa, terminalReward, terminalState, done = true)
}
_bidMultiplier = 1.0
prevObservation = None
prevState = None
prevAction = None
windowImpressions = 0
windowClicks = 0
windowSpend = 0.0
windowBidOpportunities = 0
windowWins = 0
dayImpressions = 0
dayClicks = 0
daySpend = 0.0
dayObservations = 0
dayRewardSum = 0.0
}
There is a lot happening here, so let’s break it down.
The terminal transition. The last observation of the day chose an action, but we never got to see the result during a normal observe() call. The resetDay() method closes this gap by storing one final transition: the last state and action, a reward based on whatever clicks accumulated in the final window, a zero-filled terminal state, and done = true. Without this, the agent would lose information about the end of every day – which is often the most interesting part, since that is when budget exhaustion or underspend tends to show up.
The for comprehension guards against the case where observe() was never called during the day (perhaps the campaign was paused). In that case there is no pending state or action, so there is nothing to record.
Reset the multiplier. The bid multiplier goes back to 1.0. Each day starts fresh. The agent does not carry forward yesterday’s multiplier because the market conditions, traffic patterns, and competitive landscape may be completely different today.
Clear all counters. Window counters, day counters, and the previous state/action references are all zeroed out.
Keep the learned weights. This is the crucial part: resetDay() does not touch the DQNAgent. The Q-network’s weights, the target network’s weights, the replay buffer’s stored experiences, and the current epsilon – all of these survive across days. The agent forgets what happened today in terms of raw counters, but it remembers everything it has learned about how to bid.
Persistence: surviving restarts
The agent’s learned weights must survive process restarts. A server crash or deployment should not erase weeks of learning. Promovolve handles this through snapshots.
DQNAgent.snapshot() captures the complete learnable state:
def snapshot(): DQNAgent.Snapshot =
DQNAgent.Snapshot(
config = config,
qNetworkSnapshot = qNetwork.serialize(),
targetNetworkSnapshot = targetNetwork.serialize(),
epsilon = epsilon,
totalSteps = totalSteps,
trainSteps = trainSteps
)
Each neural network serializes itself by flattening all weights and biases into one-dimensional arrays:
def serialize(): DenseNetwork.Snapshot = {
val flatWeights = weights.flatMap(_.flatMap(_.toSeq)).toArray
val flatBiases = biases.flatMap(_.toSeq).toArray
DenseNetwork.Snapshot(layerSizes, flatWeights, flatBiases)
}
For our 8-64-64-7 network, that is (8*64) + (64*64) + (64*7) = 512 + 4096 + 448 = 5,056 weight values plus 64 + 64 + 7 = 135 bias values. Two networks means about 10,382 floating-point numbers total. Stored as doubles, that is roughly 83 KB – small enough to persist frequently without concern.
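The parameter count falls out of the layer sizes; a quick sketch of the arithmetic:

```scala
// Learnable parameters for a dense network with the layer sizes from the config.
val layerSizes  = Vector(8, 64, 64, 7)

val weightCount = layerSizes.zip(layerSizes.tail).map { case (in, out) => in * out }.sum  // 512 + 4096 + 448 = 5056
val biasCount   = layerSizes.tail.sum                                                     // 64 + 64 + 7 = 135
val perNetwork  = weightCount + biasCount                                                 // 5191
val totalValues = 2 * perNetwork                                                          // 10382 (Q-network + target network)
val approxBytes = totalValues * 8                                                         // ~83 KB stored as doubles
```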
These snapshots are stored via Pekko’s durable state mechanism, backed by PostgreSQL. On restart, DQNAgent.fromSnapshot() reconstructs the agent:
def fromSnapshot(snapshot: Snapshot, rng: Random): DQNAgent = {
val agent = new DQNAgent(snapshot.config, rng)
agent.restore(snapshot)
agent
}
And restore() puts everything back:
def restore(snap: DQNAgent.Snapshot): Unit = {
qNetwork.restore(snap.qNetworkSnapshot)
targetNetwork.restore(snap.targetNetworkSnapshot)
epsilon = snap.epsilon
totalSteps = snap.totalSteps
trainSteps = snap.trainSteps
}
The replay buffer is not persisted. This is a deliberate choice. The buffer holds up to 10,000 raw transitions – serializing it would be much larger, and more importantly, it is not necessary. The neural network’s weights already contain the distilled knowledge from all those transitions. After a restart, the buffer starts empty and refills naturally from new experience. Training resumes as soon as 32 new transitions accumulate (the minBufferSize), which takes about 8 hours at 15-minute intervals. In the meantime, the agent still makes decisions using its pre-restart weights – it just does not learn from new data until the buffer has enough samples.
Inference-only mode
Sometimes you want an agent that has trained enough and should just run its learned policy without further exploration. The inferenceOnly flag controls this:
val action =
if (config.inferenceOnly) dqn.selectGreedy(state)
else dqn.selectAction(state)
When inferenceOnly is true, the agent calls selectGreedy() instead of selectAction(). The difference is simple: selectGreedy() always picks the action with the highest Q-value, while selectAction() uses epsilon-greedy and will occasionally pick a random action.
def selectGreedy(state: Array[Double]): Int =
argMax(qNetwork.forward(state))
In inference-only mode the agent still calls observe(), still records transitions, and still accumulates day stats. It just never explores. This is useful for campaigns where the advertiser wants predictable, stable behavior and the agent has already learned a good policy.
The cold start problem
A brand new campaign has an agent with:
- Random weights (Xavier initialization). The Q-values it produces are essentially noise. It has no idea which actions are good.
- Epsilon = 1.0. Every action is chosen randomly. The agent never consults its Q-values.
- Empty replay buffer. Even if the agent wanted to train, there are no transitions to learn from.
This sounds bad, but it is actually fine. When you know nothing, random exploration is the right strategy. The agent will try different bid adjustments, observe the results, and start building up experience.
Here is the timeline for a new campaign:
Hours 0-8 (observations 1-32). The agent takes random actions and accumulates transitions. No training happens because the replay buffer has not reached its minimum size of 32. The bid multiplier bounces around randomly within the [0.5, 2.0] bounds. This is wasteful but unavoidable – the agent needs data before it can learn.
Hours 8-24 (observations 33-96). Training begins. Each observation now triggers a training step: sample 32 transitions from the buffer, compute Double DQN targets, update the Q-network. Epsilon is still high (around 0.73 after 64 training steps), so most actions are still random, but the Q-values are starting to become meaningful.
Days 2-7. Epsilon drops from roughly 0.73 to its floor of about 0.05 over the course of the first week. The agent increasingly relies on its learned Q-values rather than random exploration. It starts to exhibit recognizable behavior: conserving budget when ahead of pace, bidding up when underspending, backing off when the win rate is already high.
Week 2 and beyond. Epsilon has hit its floor of 0.05. The agent is 95% exploitation, 5% exploration. It has developed a stable policy tuned to this campaign’s traffic patterns, budget, and competitive environment. The 5% exploration ensures it can adapt if conditions change.
This is slow. It takes days of real data to produce a competent agent. This is one reason why Promovolve does not rely solely on RL for bid optimization. A separate PI (proportional-integral) pacing controller handles real-time delivery smoothing within each observation window. RL optimizes the overall bidding strategy over days; PI keeps delivery smooth within hours.
Multiple agents, one marketplace
Every campaign in Promovolve has its own BidOptimizationAgent. If there are 50 active campaigns, there are 50 independent agents, each with its own Q-network, replay buffer, and epsilon schedule. They do not communicate with each other and have no knowledge of each other’s existence.
This creates an interesting situation. From any one agent’s perspective, the environment is non-stationary: the other 49 agents are also adjusting their bids, which changes auction dynamics, win rates, and effective prices. An action that worked well yesterday might work poorly today because a competitor raised its bids overnight.
Why does this work despite the non-stationarity? Several factors keep things stable:
Small adjustments. Each action changes the multiplier by a factor between 0.7 and 1.4. No single agent can dramatically shift the marketplace in one step.
Slow updates. Agents observe every 15 minutes. The marketplace has time to absorb changes before the next round of adjustments.
Continuous learning. Because epsilon never reaches zero (it floors at 0.05), agents always do some exploration and can adapt to changing conditions. If a competitor starts bidding aggressively, our agent will observe lower win rates and learn to adjust.
PI pacing as a stabilizer. The PI pacing controller handles fast reactions to marketplace changes within each 15-minute window. RL handles the slow, strategic adjustments. This separation of timescales prevents oscillation.
The result is a system where many learning agents coexist, each gradually converging on a good policy for its own campaign. It is not a Nash equilibrium in the game-theoretic sense, but it is stable enough for a real ad marketplace.
What the agent actually learns over weeks
If you deploy a new campaign and watch the agent’s behavior over time, here is roughly what you will see:
Week 1: mostly random. Epsilon starts at 1.0 and only falls below roughly 0.3 around mid-week, so most actions are still exploratory. The agent makes many random bid adjustments. Day-over-day reward is noisy. The multiplier time series looks like a random walk with clamps. You might see the agent exhaust the budget early on some days and underspend on others.
Week 2: starting to exploit. Epsilon has reached, or is close to, its floor of 0.05, so the agent is running its learned policy almost all of the time – but that policy has only a week of data behind it. You can start to see patterns: the multiplier tends to decrease when budget consumption is ahead of pace, and increase when it is behind. The exhaustion penalty kicks in less often. Daily reward starts trending upward.
Week 3-4: stable policy. With epsilon at its floor, the agent is 95% exploitation, and by now the Q-network has seen enough days of data to settle. It has learned the campaign’s traffic pattern – when impressions are plentiful, when they are scarce, how aggressive competitors are at different times of day. The multiplier follows a smooth daily pattern instead of bouncing randomly. Budget pacing is tight. Clicks per dollar are near optimal for the campaign’s targeting and creative quality.
Ongoing. The agent continues to learn and adapt, but changes are incremental. The 5% exploration ensures it notices if traffic patterns shift (seasonal changes, new competitors, different creative rotation). Major changes in campaign configuration (new budget, new targeting) effectively create a new environment that the agent must partially relearn.
The key takeaway: RL is slow. It needs days of real data to produce results. This is fundamentally different from supervised learning, where you can train on a dataset in minutes. The agent must interact with the real marketplace, observe real outcomes, and learn from real consequences. There are no shortcuts.
Monitoring: watching the learning curve
The dayStats method provides a daily summary for dashboards:
def dayStats: BidOptimizationAgent.DayStats = BidOptimizationAgent.DayStats(
impressions = dayImpressions,
clicks = dayClicks,
spend = daySpend,
observations = dayObservations,
totalReward = dayRewardSum
)
By plotting these values across days, you can watch the agent learn:
- Total reward should trend upward over the first 1-2 weeks, then plateau. If it is flat from day one, the agent may not be training (check that inferenceOnly is false and that observe() is being called).
- Clicks should increase as the agent learns to bid more effectively.
- CTR (clicks / impressions) may increase if the agent learns to win auctions at better times of day when users are more engaged.
- Cost per click (spend / clicks) should decrease or stabilize as the agent gets better at bidding just enough to win without overpaying.
- Observations should be around 96 per day (24 hours x 4 observations per hour). Significantly fewer means the campaign is pausing or running out of budget.
For deeper debugging, currentQValues shows what the agent “thinks” about its current state:
def currentQValues: Option[Array[Double]] = prevState.map(dqn.qValues)
This returns the Q-value for each of the 7 actions given the most recent state. If the agent is well-trained, you should see clear differentiation: one or two actions with notably higher Q-values than the rest. If all Q-values are nearly identical, the agent has not yet learned to distinguish between actions – it needs more training time.
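One way to turn that check into a dashboard alert is to look at the spread between the best and worst Q-value. This is a monitoring sketch; the helper names and the 0.1 threshold are made up here:

```scala
// Spread between the highest and lowest Q-value for the most recent state.
def qValueSpread(qValues: Array[Double]): Double = qValues.max - qValues.min

// True when the agent has Q-values for its latest state and they are still nearly flat,
// i.e. it has not yet learned to distinguish between actions.
def looksUndertrained(agent: BidOptimizationAgent, threshold: Double = 0.1): Boolean =
  agent.currentQValues.exists(q => qValueSpread(q) < threshold)
```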
You can also check epsilon to see where the agent is on its exploration schedule:
def epsilon: Double = dqn.currentEpsilon
If epsilon is still high (above 0.5), the agent is mostly exploring and you should not expect consistent behavior yet. If epsilon is at 0.05, the agent is running its learned policy and any erratic behavior is a sign that the policy itself needs more training data – or that the environment has changed.
Recap
Production RL is a different discipline from research RL. The core algorithm (Double DQN) is the same, but the engineering around it matters enormously:
- Episodes map to days. The terminal transition at day’s end closes the learning loop. The multiplier resets; the weights persist.
- Persistence is lightweight. Two networks serialized to flat arrays, plus epsilon and step counts. About 83 KB per campaign. The replay buffer is not persisted because the network weights already encode what it learned.
- Cold start is handled by design. Random exploration with epsilon = 1.0 is the correct strategy when you know nothing. Training begins after about 8 hours of experience.
- Multi-agent stability comes from slow, small adjustments. Each agent learns independently, but the marketplace stays stable because changes are incremental and the PI pacing controller handles fast reactions.
- Monitoring is essential. Daily reward, clicks, CPC, and Q-value inspection let operators verify that learning is happening and catch problems early.
The RL agent is not a standalone system. It is one component in a larger architecture where PI pacing handles short-term delivery smoothing, Thompson Sampling handles creative selection at serve time, and the RL agent optimizes the overall bidding strategy over days and weeks. Each piece operates at a different timescale, and together they produce a system that is both responsive and adaptive.
Promovolve vs SSP/DSP/Exchange
This chapter maps Promovolve’s design choices against the traditional programmatic advertising stack.
Traditional Programmatic Stack
graph LR
Publisher["Publisher<br/>(webpage)"] --> SSP["SSP<br/>(Supply)"]
SSP --> Exchange["Exchange<br/>(auction)"]
Exchange --> DSP1["DSP1"]
Exchange --> DSP2["DSP2"]
Exchange --> DSP3["DSP3"]
Flow: User loads page → SSP sends bid request → Exchange broadcasts to DSPs → DSPs respond within 100ms → Highest bid wins → Ad served.
Promovolve Stack
Promovolve collapses the SSP, DSP, and exchange into a single system with two distinct phases: an offline auction phase that runs ahead of time, and an online serve phase that responds to user requests.
Phase 1: Offline Auction (no user present)
graph LR
Crawler["Crawler<br/>(scheduled)"] --> Auctioneer["Auctioneer<br/>(Pekko Shard)"]
Auctioneer --> ServeIndex["ServeIndex<br/>(DData)"]
- Crawler periodically fetches publisher pages and sends them to an LLM (Gemini Flash) for content classification into IAB taxonomy categories.
- AuctioneerEntity — one per site, sharded across the Pekko cluster — runs a batch auction. It collects bids from all campaigns whose target categories match the page content, applies RL-adjusted bid multipliers and pacing throttles, and shortlists multiple candidates per ad slot (not just a single winner).
- ServeIndex — a replicated in-memory cache built on Pekko Distributed Data (DData) — stores the shortlisted candidates. Every node in the cluster holds a local replica, so no remote call is needed at serve time.
This phase re-runs on a schedule (every 5 minutes by default) and whenever content changes, keeping candidates fresh without waiting for a user to arrive.
Phase 2: Online Serve (user arrives)
graph LR
User["User<br/>(browser)"] --> APINode
subgraph APINode["API Node"]
ServeIndex["ServeIndex<br/>(local DData replica)"] --> TS["Thompson Sampling<br/>+ Pacing"]
end
The ServeIndex is not a separate service — it’s a DData-replicated data structure, and every API node holds a local replica in its own process memory. There is no network call between the API node and the ServeIndex; it’s a local in-memory lookup.
- User requests an ad for the page they’re viewing.
- API Node reads pre-computed candidates directly from its local ServeIndex replica — no network hop, no auction, no external call.
- Thompson Sampling selects among the shortlisted candidates, balancing exploration of new creatives against exploitation of known performers. A pacing check ensures the selected campaign hasn’t exhausted its budget for this time window.
The result: serve latency under 1ms, with no user data collected, no cookies set, and no third-party calls made.
What replaced what
| Traditional role | Promovolve equivalent |
|---|---|
| SSP (supply-side platform) | Crawler + AuctioneerEntity — the publisher’s inventory is discovered by crawling, not by firing bid requests |
| Exchange (auction house) | AuctioneerEntity — runs the auction offline, ahead of user traffic |
| DSP (demand-side platform) | Campaign entities + DQN RL agent — advertisers’ bid strategies are learned automatically, not configured in a separate system |
| Ad server | API Node + local DData replica — serves pre-computed results from memory |
| DMP (data management platform) | Not needed — targeting is content-based, not user-based |
Summary Comparison
| Aspect | Traditional SSP/DSP | Promovolve |
|---|---|---|
| Auction timing | Per-request (realtime) | Per-crawl + 5-min re-auction |
| Serve latency | 50-200ms | < 1ms |
| Winner selection | Highest bid wins | Fair selection → Thompson Sampling |
| Price model | Second-price (GSP) | First-price, RL-adjusted CPM |
| Price discovery | Yes (competitive) | No (RL optimizes pacing) |
| Learning | RTB feedback loops | TS + DQN RL + category ranking |
| Candidate model | Single winner | Multi-candidate with diversity |
| Budget control | Per-campaign throttling | Aggregate PI-controlled pacing |
| State persistence | Database/Redis | DData (replicated in-memory) |
| Content scope | Any page, any time | Recency only (< 48h) |
| Targeting | User profiles, cookies | Content classification (LLM) |
| Failure mode | No ad shown | Serve cached candidates |
| Privacy | User tracking required | No user profiles |
The following sub-chapters explore each difference in detail.
Auction Timing: Periodic vs Realtime
The most fundamental difference between Promovolve and traditional ad tech is when the auction runs.
Traditional: Per-Request Auctions
t=0ms User loads page
t=5ms SSP sends bid request to exchange
t=10ms Exchange broadcasts to DSPs
t=80ms DSPs respond with bids
t=85ms Exchange picks winner
t=90ms Ad creative URL returned
t=200ms Ad renders on page
Advantages: Fresh bidding, competitive price discovery
Disadvantages: 50-200ms latency, auction QPS = page QPS, failure = empty slot
Promovolve: Periodic Batch Auctions
Crawl time (background, 2am daily + 5-min re-auctions):
t=0s Crawler classifies page (LLM)
t=1s AuctioneerEntity starts auction
t=3s Bids collected (800ms timeout for taxonomy)
t=4s Candidates cached in DData
Serve time (user-facing):
t=0.0ms User loads page
t=0.1ms Local DData lookup
t=0.2ms Pacing gate + Thompson Sampling
t=0.3ms Ad response sent
Advantages: Sub-ms serving, bounded compute, graceful failure, exploration
Disadvantages: Stale bids (up to 5 min between re-auctions), no user-level signals
When Periodic Wins
- Content is the signal, not the user: Promovolve targets content categories via LLM classification. Content changes slowly, so periodic auctions suffice.
- Single publisher control: No cross-publisher price discovery needed.
- Serve latency matters: Adding 100ms per ad slot is unacceptable for performance-conscious publishers.
- Exploration has value: The publisher wants to learn which creatives engage users, not just which advertiser pays most.
The Refresh Cycle
Promovolve’s re-auction interval (5 minutes) is a middle ground:
- Fresh enough to react to campaign budget changes
- Infrequent enough to avoid overwhelming entity actors
- Candidates in DData with 120-minute TTL survive multiple re-auction cycles
Winner Selection: MAB vs Highest Bid
Traditional exchanges pick a single winner — the highest bidder — and move on. Promovolve takes a fundamentally different approach: it shortlists multiple candidates at auction time, then uses Thompson Sampling at serve time to learn which creative actually performs best.
Traditional: Highest Bid Wins
In a standard ad exchange, winner selection is simple:
Bids received: $8.00, $5.50, $3.20
Winner: $8.00
Price paid: $5.51 (second-price: winner pays $0.01 above second bid)
The logic is purely financial: whoever is willing to pay the most gets the impression. The creative’s quality, relevance to the page, or likelihood of being clicked plays no role in the decision.
Why this is a problem
CTR is invisible. A campaign bidding $8 CPM with a 0.5% click-through rate beats a campaign bidding $3 CPM with a 5% CTR. The publisher serves a worse ad and makes less money per click. The advertiser with the better creative loses despite offering more value to readers.
There is no learning mechanism. The exchange doesn’t track whether the winning creative gets clicked. It doesn’t know if the $8 bid was worth it. Each auction is independent — the system never gets smarter.
The winner’s curse. In a competitive auction, the highest bidder is statistically the one who overestimated the impression’s value the most. Sophisticated DSPs account for this; small advertisers don’t, and overpay.
New entrants can’t compete. A new advertiser with a potentially excellent creative but a conservative bid never wins, never gets impressions, and therefore never has a chance to prove itself. The system has no exploration — only exploitation of whoever bids highest today.
Promovolve: A Two-Phase Selection System
Promovolve splits winner selection into two phases: fair shortlisting at auction time (when content is crawled) and adaptive selection at serve time (when a user arrives). This separation is the key design difference.
Phase 1: Auction-Time Fair Shortlisting
Instead of picking a single winner, the AuctioneerEntity shortlists multiple candidates per ad slot with a per-campaign diversity guarantee:
3 campaigns, 3 slots → each campaign gets exactly 1 slot
2 campaigns, 3 slots → each gets 1, fill the 3rd with the best remaining creative
4 campaigns, 3 slots → top 3 by CPM each get 1 slot
The algorithm:
- Sort all candidates by CPM descending, with pre-approved creatives preferred as a tiebreaker (publishers can approve creatives before they enter the auction — approved creatives win ties over unapproved ones)
- Group by campaign, pick the best creative per campaign
- If there are more campaigns than slots, the top campaigns by CPM each get one slot
- If there are fewer campaigns than slots, every campaign is guaranteed representation, and remaining slots are filled with next-best creatives
Why this matters: In a traditional exchange, a single high-budget campaign can monopolize every impression. Promovolve’s diversity guarantee ensures that every participating campaign gets representation in the candidate pool, giving Thompson Sampling a diverse set to learn from.
Phase 2: Serve-Time Thompson Sampling
When a user loads a page, Thompson Sampling selects among the shortlisted candidates. The scoring formula:
score = sampledCTR × log(1 + CPM)
Where sampledCTR is drawn from a Beta distribution based on observed performance over a 60-minute rolling window:
sampledCTR ~ Beta(clicks + 1, impressions - clicks + 1)
The log(1 + CPM) term gives higher bids an advantage, but with diminishing returns: a $10 CPM is only 3.5x better than a $1 CPM (not 10x). This means CTR dominates — a creative that readers actually click beats one that merely bids high.
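A sketch of that scoring step, using Apache Commons Math for the Beta draw (the library choice and helper name are assumptions for illustration, not Promovolve’s actual sampler):

```scala
import org.apache.commons.math3.distribution.BetaDistribution

// Thompson Sampling score for one candidate: draw a CTR from the Beta posterior
// over the rolling window's clicks and impressions, then weight by log-compressed CPM.
def thompsonScore(impressions: Long, clicks: Long, cpm: Double): Double = {
  val sampledCtr = new BetaDistribution(clicks + 1.0, impressions - clicks + 1.0).sample()
  sampledCtr * math.log(1.0 + cpm)
}
```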
A worked example
Three campaigns competing for the same slot. Campaign C is brand new with no data:
Campaign A: $5.00 CPM, 150 impressions, 5 clicks
Beta(6, 146) → sample: 0.032
score = 0.032 × log(6.00) = 0.057
Campaign B: $4.20 CPM, 22 impressions, 3 clicks
Beta(4, 20) → sample: 0.091
score = 0.091 × log(5.20) = 0.150
Campaign C: $3.80 CPM, 0 impressions, 0 clicks
Beta(1, 1) → sample: 0.647 (uniform — could be anything)
score = 0.647 × log(4.80) = 1.016 ← wins (exploration)
Campaign C wins this request despite having the lowest CPM and no track record. This is exploration — the system gives the new creative a chance to prove itself. Over the next few dozen impressions, if C’s true CTR turns out to be low, its Beta distribution narrows and it stops winning. If C turns out to be genuinely good, it earns a stable share of impressions.
A traditional exchange would never serve Campaign C. It would never get data. It would never have a chance.
Cold start: getting new creatives off the ground
Thompson Sampling needs data to work, so Promovolve uses specific strategies for new creatives depending on the state of the candidate pool:
| Condition | Strategy | Behavior |
|---|---|---|
| All candidates have 0 impressions | Full cold start | Use categoryScore ± noise as estimated CTR — the auction’s content-relevance signal bootstraps the first selections |
| All candidates have < 10 impressions | Warmup round-robin | Always serve the candidate with the fewest impressions — ensures every creative gets at least 10 impressions before exploitation begins |
| Some candidates are new, some have data | Partial cold start | 30% of the time, randomly pick a new creative; 70% run normal Thompson Sampling on all candidates |
| All candidates have ≥ 10 impressions | Standard | Full Thompson Sampling |
The 30% exploration rate for partial cold starts is aggressive by design — new creatives need data quickly, and Thompson Sampling’s natural exploration handles the rest once they have enough impressions.
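A rough sketch of that dispatch logic, using the thresholds from the table (the Candidate type and strategy labels are illustrative, not the actual Promovolve code):

```scala
// Stand-in for whatever fields the real cached candidate carries.
final case class Candidate(impressions: Long, clicks: Long, cpm: Double, categoryScore: Double)

// Choose a selection strategy based on how much data the candidate pool has.
def selectionStrategy(candidates: Seq[Candidate]): String =
  if (candidates.forall(_.impressions == 0))      "full-cold-start"     // categoryScore ± noise as estimated CTR
  else if (candidates.forall(_.impressions < 10)) "warmup-round-robin"  // serve the fewest-impressions candidate
  else if (candidates.exists(_.impressions < 10)) "partial-cold-start"  // 30% random pick of a new creative
  else                                            "standard"            // full Thompson Sampling
```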
The full selection pipeline
Thompson Sampling doesn’t run in isolation. It’s one step in a pipeline, and its position in that pipeline is deliberate:
1. ServeIndex lookup → fetch cached candidates from local DData replica
2. Content recency → drop candidates if page content is stale (> 48h)
3. Frequency cap → drop candidates the user has seen too many times
4. Pacing gate → probabilistic throttle based on aggregate budget utilization
5. Thompson Sampling → score and select among remaining candidates
6. Budget reservation → reserve spend with the selected campaign
Why pacing runs before Thompson Sampling: If pacing ran after selection, Thompson Sampling would pick a creative, then pacing would sometimes throw it away. That wastes an exploration opportunity — we showed nothing, we learned nothing. By putting the pacing gate first, every request that makes it to Thompson Sampling produces a served impression and useful data.
Budget reservation and graceful fallback
After Thompson Sampling selects a winner, the system attempts to reserve the spend:
1. Reserve spend with CampaignEntity
2. Verify budget status with AdvertiserEntity
3. On failure (budget exhausted) → try the next-best candidate by Thompson score
4. All candidates exhausted → return NoCandidates (HTTP 204)
In a traditional exchange, if the winner can’t pay, the auction fails and the slot goes unfilled (or a low-quality fallback ad appears). Promovolve’s multi-candidate model means there’s always a next-best option waiting — graceful degradation without re-running the auction.
Side-by-Side Comparison
| Dimension | Traditional (Highest Bid) | Promovolve (Thompson Sampling) |
|---|---|---|
| Selection criterion | Price only | CTR × log-compressed price |
| Number of candidates | 1 winner | Multiple shortlisted per slot |
| Learning | None — each auction is independent | Continuous — every impression updates the Beta posterior |
| New creative discovery | Impossible without outbidding the incumbent | Built-in via exploration; 30% forced exploration for brand-new creatives |
| Cold start | New advertiser must bid high to win | Round-robin warmup guarantees initial data collection |
| Failure handling | Slot goes unfilled | Fall through to next-best candidate |
| Publisher alignment | Optimizes for advertiser spend | Optimizes for reader engagement (CTR), weighted by spend |
| Short-term revenue | Higher (always picks highest bid) | Sometimes lower (explores lower-CPM creatives) |
| Long-term revenue | Stagnant (no learning) | Higher (discovers high-CTR creatives that earn more clicks) |
| Advertiser ROI | Favors large budgets | Favors creative quality — a good ad at $3 CPM can beat a bad ad at $8 CPM |
Price Discovery & First-Price Model
Traditional exchanges use second-price auctions for price discovery. Promovolve uses first-price with RL-adjusted bids.
Traditional: Second-Price Auctions
Bids: $8.00, $5.50, $3.20
Winner pays: $5.51 (second-highest + $0.01)
Truthful bidding: Bidders bid true value because overbidding doesn’t increase cost.
Price discovery: Market-clearing price ($5.51) reveals impression value.
In practice, DSPs use bid-shading to game second-price mechanics, and many exchanges have shifted to first-price anyway.
Promovolve: First-Price, RL-Adjusted
Advertisers set maxCpm. The DQN agent adjusts with a bid multiplier:
bidCpm = max(maxCpm × bidMultiplier, floorCpm)
The advertiser pays exactly this CPM — no second-price discount.
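As a small worked example of the first-price formula (the $0.50 floor comes from the comparison table below; the CPM values are hypothetical):

```scala
// First-price bid: the advertiser's maxCpm scaled by the RL multiplier, never below the floor.
def bidCpm(maxCpm: Double, bidMultiplier: Double, floorCpm: Double = 0.50): Double =
  math.max(maxCpm * bidMultiplier, floorCpm)

// bidCpm(5.0, 1.2) == 6.0  -> the agent is bidding up; the advertiser pays $6.00 CPM
// bidCpm(0.6, 0.7) == 0.5  -> the multiplied bid falls below the floor, so the floor applies
```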
Why First-Price Is Acceptable
- Single-publisher platform: No competitive exchange dynamics requiring truthful revelation
- RL handles optimization: The DQN agent learns the right bid level over days (7 actions: 0.7x to 1.4x, applied cumulatively)
- Simpler: No second-bid tracking, reserve prices, or bid-shading countermeasures
- Thompson Sampling redefines “winning”: High CPM alone doesn’t win — creative must also have good CTR
RL as Price Optimization
The DQN agent’s reward function (clicks - overspend penalty) naturally discovers efficient bid levels:
- If overpaying (high CPC, spendRate > 1.5): reward drops → agent reduces multiplier
- If underbidding (low win rate, low impressions): agent increases multiplier
- Converges to the bid level that maximizes clicks within budget
No Market-Clearing Price
Promovolve does not discover market-clearing prices:
- Advertisers set maxCpm based on business judgment
- RL optimizes pacing, not price discovery
- No mechanism to tell advertisers they’re over/under-paying vs competition
This is acceptable for internal advertising where pricing is a business decision, not a market problem.
Comparison
| Aspect | Second-Price (RTB) | First-Price (Promovolve) |
|---|---|---|
| Price paid | Second bid + $0.01 | Exactly the bid |
| Optimization | DSP bid-shading | DQN RL agent |
| Discovery | Market-clearing price | None |
| Complexity | High | Low |
| Floor price | floorCpm (default $0.50) | Same |
Learning Mechanisms
Both traditional ad tech and Promovolve learn from feedback, but at different levels and timescales.
Traditional: RTB Feedback Loops
DSP-Side
- Bid values: Learn from win/loss notifications what impressions are worth
- Audience targeting: Learn from conversions which user segments are valuable
- Learning happens at the bid level, not the creative level
Exchange-Side
- Floor prices: Learn from bid distributions to set minimum acceptable bids
Limitations
- No exploration: only learns from wins, never discovers if alternative creatives would perform better
- User-dependent: requires cookies, profiles, cross-site tracking
Promovolve: Five-Layer Learning
Layer 1: Thompson Sampling (Per-Request)
What: Which creative performs best for a given slot
Timescale: Every impression updates CreativeStats (1-minute buckets, 60-minute window)
How: Bayesian posterior update Beta(clicks+1, impressions-clicks+1)
Before impression: Beta(5, 95) → mean 5%
User clicks: Beta(6, 95) → mean 5.9%
This is the fastest loop — sub-second feedback incorporated into the next serve decision.
Layer 2: DQN Bid Optimization (Every 15 Minutes)
What: Optimal bid multiplier given budget, pacing, and performance
Timescale: 15-minute observation windows, convergence over ~8 days
How: Double DQN with 8-dim state, 7 actions, click reward - overspend penalty
State → Q-network → action → bidMultiplier adjustment
Medium timescale, learning competitive dynamics and pacing patterns.
Layer 3: Category Ranking (Per Auction)
What: Which content categories are valuable for each site
Timescale: Updated every auction, 7-day half-life decay, 14-day max age
How: Thompson Sampling with Beta(prior_α + clicks, prior_β + non_clicks)
Category "sports" on site X:
Prior: Beta(1, 1) → weight sampled ~0.5
After data: Beta(15, 95) → weight sampled ~0.14
Slowest loop, learning site-level category affinities.
Layer 4: Traffic Shape (Per Day)
What: Hourly traffic patterns (separate weekday/weekend)
Timescale: Daily rollover with dayAlpha = 0.2 blend
How: 24-bucket EMA with alpha = 0.1
Adjusts pacing targets to match actual traffic distribution.
Layer 5: PI Self-Tuning (Continuous)
What: Optimal aggressiveness for overpace correction
Timescale: Every 20 samples, min 500ms interval
How: Overpace multiplier adapts between 1.5x and 5.0x based on persistent overspend detection
Comparison
| Dimension | Traditional RTB | Promovolve |
|---|---|---|
| What’s optimized | Bid price | Creative + bid + category + pacing |
| Exploration | None | Built-in (Thompson Sampling) |
| Learning layers | 1 (bid-level) | 5 (per-request through daily) |
| User targeting | Yes (profile) | No (content-based) |
| Privacy impact | High (tracking) | Low (no user profiles) |
| Cold start | Historical bid data | Bayesian priors + round-robin warmup |
| Adaptability | Real-time bids | RL converges over days, TS adapts per-request |
Key Innovations
Promovolve’s design choices form a coherent system where each innovation enables the next.
1. Content-Based, Not User-Based
Traditional: User profiles, cookies, device fingerprinting power ad targeting.
Promovolve: Targeting uses LLM-based content classification (Gemini/OpenAI/Anthropic) to match ads to page topics. No user tracking, no cookies.
Why it matters: Privacy-preserving (no GDPR/CCPA data collection), simpler infrastructure (no profile database), content-value alignment (ads match what the user is currently reading).
2. Recency-Only Monetization
Traditional: Ads on any page, regardless of publication date.
Promovolve: Only content within the 48-hour recency window participates. AuctioneerEntity prunes older classifications every 5 minutes.
Why it matters: Fresh content has higher engagement → higher CTR → better outcomes for all participants. Reduces low-quality inventory.
3. Periodic Batch Auctions
Traditional: One auction per page load.
Promovolve: One auction per crawl (scheduled via Quartz cron) + 5-minute re-auctions.
Why it matters: Decouples auction cost from traffic. Sub-millisecond serving via DData local replica. Enables multi-candidate caching.
4. Fair Selection + Multi-Candidate MAB
Traditional: Single winner per auction.
Promovolve: Per-campaign diversity guarantee at auction time (one creative per campaign first), then Thompson Sampling explores among cached candidates at serve time.
Why it matters: Discovers which creative actually engages users. Graceful degradation on budget exhaustion. Self-correcting (poor creatives lose share naturally).
5. Pure-Scala Double DQN
Traditional: DSPs use bid-shading algorithms (often in Python/C++ ML frameworks).
Promovolve: Each campaign has a dedicated Double DQN agent (8→64→64→7, ~5,200 parameters) implemented in pure Scala. No TensorFlow, no PyTorch.
Why it matters: Lives inside the Pekko actor, no inter-process communication. Weights serialize as Array[Double] in CampaignEntity state. JVM-only deployment.
6. Self-Tuning PI Pacing
Traditional: Simple rules (“spend X% by noon”) or fixed-gain controllers.
Promovolve: PI controller with:
- Adaptive gains scaled by traffic volatility (CV)
- Self-tuning overpace multiplier (1.5x-5.0x, adjusts every 20 samples)
- Oscillation detection (stddev threshold 0.08 → dampening)
- Leaky integrator (decay 0.995, anti-windup)
- Cross-day learning (boosts multiplier if budget exhausted early)
- Traffic shape awareness (separate weekday/weekend 24-hour profiles)
Why it matters: Adapts to any traffic pattern without manual tuning. Learns from past days’ mistakes.
7. Buffered At-Least-Once Spend Recording
Traditional: Database writes per impression.
Promovolve: Spend events buffered (500ms timer OR batch of 20), deduplicated via Bloom filter (50K entries, 0.01% FPP), at-least-once delivery with exponential backoff retries.
Why it matters: Reduces persistence load by ~20x while maintaining correctness guarantees.
The Unified Picture
graph TD
LLM["LLM content classification"] --> Auction["Periodic batch auction<br/>(content changes slowly)"]
Auction --> Fair["Fair selection<br/>(per-campaign diversity guarantee)"]
Fair --> DData["DData cache<br/>(multi-candidate, replicated locally)"]
DData --> TS["Thompson Sampling<br/>(explore among cached candidates)"]
TS --> DQN["DQN bid optimization<br/>(learn to pace across days)"]
DQN --> PI["Self-tuning PI pacing<br/>(smooth delivery within days)"]
PI --> Spend["Buffered spend recording<br/>(correctness at scale)"]
Each choice enables the next. Remove one, and the system loses coherence. Together, they create an ad platform that is fast, learning, privacy-preserving, and publisher-aligned.