Reward Function
The reward function defines what the DQN agent optimizes for. From BidOptimizationAgent.scala:
Formula
```scala
reward = clickReward - overspendPenalty

// where:
clickReward = windowClicks.toDouble
overspendPenalty =
  if (spendRate > 1.5) config.overspendPenalty * (spendRate - 1.5)
  else 0.0
```

Default penalty factor: `config.overspendPenalty = 2.0`
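Putting the formula together, here is a minimal sketch of the reward computation. The parameter name `overspendPenaltyFactor` stands in for `config.overspendPenalty`, and the function signature is an assumption for illustration, not the actual code in BidOptimizationAgent.scala:

```scala
// Sketch of the reward described above. `overspendPenaltyFactor` stands in
// for config.overspendPenalty (default 2.0); names are illustrative.
def reward(windowClicks: Int, spendRate: Double,
           overspendPenaltyFactor: Double = 2.0): Double = {
  val clickReward = windowClicks.toDouble
  val overspendPenalty =
    if (spendRate > 1.5) overspendPenaltyFactor * (spendRate - 1.5)
    else 0.0
  clickReward - overspendPenalty
}
```

For example, `reward(5, 1.8)` yields roughly 4.4: five clicks minus a 0.6 overspend penalty.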
Component Breakdown
Clicks (Primary Signal)
Raw number of clicks in the 15-minute observation window. This is the positive signal — the agent maximizes clicks because that’s what advertisers care about.
Why clicks, not impressions?
- Impressions don’t indicate value — they’re “free” from the user’s perspective
- Clicks represent actual engagement
- Maximizing clicks naturally selects for high-CTR placements
Why clicks, not revenue?
- Revenue (CPM × impressions) would incentivize bidding as high as possible
- This contradicts the advertiser’s interest in efficient spending
- Clicks align the agent with advertiser ROI
Overspend Penalty
```scala
overspendPenalty = 2.0 * max(0, spendRate - 1.5)
```
- Threshold at 1.5x: No penalty for spending up to 50% faster than target. Gives the agent freedom to bid aggressively when opportunities are good.
- Factor of 2.0: each unit of spendRate above the 1.5x threshold costs 2.0 reward points (equivalent to giving up two clicks)
- Continuous: Allows the agent to learn the trade-off rather than hitting a hard wall
Examples:
```scala
spendRate = 1.0 → penalty = 0.0  (on pace)
spendRate = 1.5 → penalty = 0.0  (at threshold)
spendRate = 2.0 → penalty = 1.0  (moderate overspend)
spendRate = 3.0 → penalty = 3.0  (severe overspend)
```
Episode Termination
The episode terminates when:
```scala
done = (budgetRemaining <= 0.0) || (timeRemaining <= 0.0)
```
At termination, a special terminal transition is stored:
```scala
val terminalState  = Array.fill(stateSize)(0.0)  // zero vector
val terminalReward = windowClicks.toDouble       // final clicks (no penalty)
dqn.store(prevState, prevAction, terminalReward, terminalState, done = true)
```
The `done = true` flag tells the DQN not to bootstrap future rewards beyond the episode boundary.
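To illustrate what that flag does during training, here is a hedged sketch of how a stored `done` typically enters the TD target; `gamma` and `nextQValues` are assumed names for illustration, not code from the repo:

```scala
// Illustrative only: how the `done` flag usually affects the DQN TD target.
// `gamma` (discount factor) and `nextQValues` are assumptions, not repo code.
def tdTarget(reward: Double, done: Boolean,
             nextQValues: Array[Double], gamma: Double = 0.99): Double =
  if (done) reward                        // terminal: reward only, no bootstrap
  else reward + gamma * nextQValues.max   // bootstrap from best next-state Q
```

With `done = true` the target collapses to the terminal reward alone, so the zeroed terminal state never contributes value.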
Reward Examples
| Window | Clicks | spendRate | Penalty | Reward |
|---|---|---|---|---|
| Normal pacing | 3 | 1.0 | 0 | 3.0 |
| Good CTR | 8 | 1.2 | 0 | 8.0 |
| Slight overspend | 5 | 1.8 | 0.6 | 4.4 |
| Severe overspend | 2 | 3.0 | 3.0 | -1.0 |
| At threshold | 4 | 1.5 | 0 | 4.0 |
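Each table row can be re-derived from the formula. This sketch simply restates the reward from this page (with the 2.0 default penalty factor hard-coded) and checks every row:

```scala
// Consistency check: recompute each row of the Reward Examples table.
// The helper restates the formula from this page with the 2.0 default factor.
def reward(clicks: Int, spendRate: Double): Double =
  clicks.toDouble - (if (spendRate > 1.5) 2.0 * (spendRate - 1.5) else 0.0)

val rows = Seq(
  (3, 1.0, 3.0),   // normal pacing
  (8, 1.2, 8.0),   // good CTR
  (5, 1.8, 4.4),   // slight overspend
  (2, 3.0, -1.0),  // severe overspend
  (4, 1.5, 4.0)    // at threshold
)
rows.foreach { case (clicks, rate, expected) =>
  assert(math.abs(reward(clicks, rate) - expected) < 1e-9)
}
```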
Design Simplicity
Note what the reward function does not include:
- No exhaustion penalty — the episode simply ends when budget hits zero
- No CPA signal — conversion tracking is sparse, clicks are a sufficient proxy
- No win-rate bonus — win rate is in the state space, letting the agent learn its own trade-offs
This simplicity makes the reward signal clean and easy to interpret. The agent learns that clicks are good and overspending is bad — everything else it figures out from the state space.