Putting It Together: The BidOptimizationAgent
We have built every piece individually: a neural network that estimates Q-values, a replay buffer that stores experience, and a Double DQN training loop that learns from that experience. Now it is time to see how Promovolve wires them into a single, working bid optimization agent.
The class is BidOptimizationAgent, and it lives in modules/core/src/main/scala/promovolve/rl/BidOptimizationAgent.scala. We will walk through it top to bottom.
The architecture
The nesting looks like this:
```
BidOptimizationAgent (one per campaign)
└── DQNAgent
    ├── qNetwork      (DenseNetwork: 8 → 64 → 64 → 7)
    ├── targetNetwork (DenseNetwork: 8 → 64 → 64 → 7, periodically synced)
    └── replayBuffer  (ReplayBuffer: capacity 10,000)
```
BidOptimizationAgent is the outer shell that knows about campaigns, budgets, and ad serving. It translates real-world campaign metrics into the abstract language of states, actions, and rewards that the inner DQNAgent understands. The DQNAgent in turn owns the two neural networks and the replay buffer we built in earlier chapters.
One line creates the entire inner stack:
```scala
private val dqn = DQNAgent(config.dqnConfig, rng)
```
Everything else in BidOptimizationAgent is bookkeeping: tracking window counters, computing states, computing rewards, and applying actions back to the bid multiplier.
Configuration
Here is the actual default configuration that Promovolve ships with:
```scala
final case class Config(
  dqnConfig: DQNAgent.Config = DQNAgent.Config(
    stateSize = 8,
    actionSize = 7,
    hiddenSizes = Vector(64, 64),
    gamma = 0.99,
    learningRate = 0.001,
    epsilonStart = 1.0,
    epsilonEnd = 0.05,
    epsilonDecay = 0.995,
    bufferSize = 10_000,
    minBufferSize = 32,
    batchSize = 32,
    targetSyncInterval = 100,
    actionMultipliers = Vector(0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4)
  ),
  minMultiplier: Double = 0.5,
  maxMultiplier: Double = 2.0,
  overspendPenalty: Double = 2.0,
  exhaustionPenalty: Double = 5.0,
  inferenceOnly: Boolean = false
)
```
Let’s unpack the important choices.
7 actions, asymmetric. The action multipliers are [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4]. Notice they are not symmetric around 1.0. There are more options for bidding up (1.1, 1.2, 1.4) than for bidding down (0.7, 0.8, 0.9), and the most aggressive upward option (1.4x) is a bigger jump than the most aggressive downward option (0.7x). This reflects a deliberate design choice: in a competitive auction, missing out on impressions is often worse than slightly overpaying. The agent can slam the brakes hard when it needs to (0.7x), but it also has a “turbo” option (1.4x) for when it is underspending and needs to catch up fast.
Hard bounds: 0.5 to 2.0. No matter what sequence of actions the agent takes, the cumulative bid multiplier is clamped to this range. A campaign will never bid less than half its base CPM, and never more than double. This is a safety rail that prevents the RL agent from doing anything catastrophic.
Discount factor (gamma = 0.99). The agent values future rewards almost as much as immediate ones. This makes sense for daily budget pacing: you do not want an agent that burns through the budget in the first hour just because it got a few clicks early on.
Exploration (epsilon: 1.0 to 0.05, decay 0.995). The agent starts fully random and slowly shifts toward exploiting what it has learned. With a decay of 0.995 per training step, after 100 training steps epsilon is about 0.61, after 300 steps about 0.22, and after 600 steps about 0.05 (the floor). Each observation triggers at most one training step, and observations happen every 15 minutes, so reaching the floor takes roughly a week of continuous operation.
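Those decay numbers are easy to verify: epsilon after n training steps is epsilonStart × epsilonDecay^n, floored at epsilonEnd. A standalone sketch (not part of the Promovolve code):

```scala
// Epsilon after n training steps: start * decay^n, floored at the end value.
def epsilonAfter(steps: Int, start: Double = 1.0, end: Double = 0.05, decay: Double = 0.995): Double =
  math.max(end, start * math.pow(decay, steps))
```

With one observation every 15 minutes and at most one training step per observation, 600 steps is 600 × 15 minutes, about 6.25 days of continuous operation; hence "roughly a week" to reach the floor.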
Penalties. overspendPenalty = 2.0 and exhaustionPenalty = 5.0 shape the reward signal to discourage burning through the budget too fast. We will see exactly how these work when we look at the reward function.
Window counters
Between observations (every 15 minutes), the agent accumulates raw events:
```scala
def recordImpression(spendAmount: Double): Unit = {
  windowImpressions += 1
  windowSpend += spendAmount
  dayImpressions += 1
  daySpend += spendAmount
}

def recordClick(): Unit = {
  windowClicks += 1
  dayClicks += 1
}

def recordBidOpportunity(won: Boolean): Unit = {
  windowBidOpportunities += 1
  if (won) windowWins += 1
}
```
The window* counters track what happened since the last observation. The day* counters track the entire day for monitoring. The CampaignEntity calls these methods as impressions, clicks, and bid opportunities flow through the system. By the time observe() is called, the window counters contain a summary of the last 15 minutes of activity.
The state: translating campaign metrics into numbers
The toState method converts an Observation plus the window counters into an 8-dimensional array that the neural network can process:
```scala
private def toState(obs: Observation): Array[Double] = {
  val maxCpm = if (obs.maxCpm > 0) obs.maxCpm else 1.0
  val dailyBudget = if (obs.dailyBudget > 0) obs.dailyBudget else 1.0
  Array(
    // 0: effective CPM (normalized)
    math.min(2.0, (obs.maxCpm * _bidMultiplier) / maxCpm),
    // 1: CTR in window
    if (windowImpressions > 0) math.min(1.0, windowClicks.toDouble / windowImpressions)
    else 0.0,
    // 2: win rate
    if (windowBidOpportunities > 0) windowWins.toDouble / windowBidOpportunities
    else 0.5,
    // 3: budget remaining fraction
    math.max(0.0, math.min(1.0, obs.budgetRemaining / dailyBudget)),
    // 4: time remaining fraction
    math.max(0.0, math.min(1.0, obs.timeRemaining)),
    // 5: spend rate vs ideal (1.0 = on pace)
    spendRate(obs),
    // 6: impression rate (normalized by expected)
    normalizedImpressionRate(obs),
    // 7: CPC (normalized)
    if (windowClicks > 0) math.min(2.0, (windowSpend / windowClicks) / maxCpm)
    else 0.0
  )
}
```
Each dimension is normalized to a small range (roughly 0 to 2) so the neural network can learn effectively. Here is what each one tells the agent:
| Index | Feature | What it means |
|---|---|---|
| 0 | Effective CPM | How much we are currently bidding, relative to the base price. Equals the bid multiplier itself. |
| 1 | CTR | Click-through rate in the last 15 minutes. Higher is better. |
| 2 | Win rate | Fraction of auctions we won. Low means we are being outbid. |
| 3 | Budget remaining | How much money is left today (1.0 = full, 0.0 = empty). |
| 4 | Time remaining | How much of the day is left (1.0 = start, 0.0 = end). |
| 5 | Spend rate | Current spending speed vs. ideal even pace. 1.0 means on track, 2.0 means spending twice as fast as we should. |
| 6 | Impression rate | How many impressions we got, normalized by a baseline of 100 per window. |
| 7 | CPC | Cost per click, normalized by the base CPM. Lower is better. |
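To make the table concrete, here is a hedged sketch of one hypothetical window's state vector, with all numbers invented for illustration. The `normalizedImpressionRate` helper is not shown in the source, so it is assumed here to cap impressions / 100 at 2.0, matching the table's description of feature 6:

```scala
// Hypothetical window: 200 impressions, 4 clicks, $1.00 spent,
// 500 bid opportunities, 200 wins; campaign: $5 max CPM, $100/day budget,
// 60% of budget and 60% of the day remaining, bid multiplier 1.0.
val maxCpm = 5.0
val state = Array(
  math.min(2.0, (5.0 * 1.0) / maxCpm),  // 0: effective CPM    -> 1.0
  math.min(1.0, 4.0 / 200),             // 1: CTR              -> 0.02
  200.0 / 500,                          // 2: win rate         -> 0.4
  60.0 / 100,                           // 3: budget remaining -> 0.6
  0.6,                                  // 4: time remaining   -> 0.6
  math.min(3.0, 40.0 / 40.0),           // 5: spend rate       -> 1.0 (on pace: spent $40 of an expected $40)
  math.min(2.0, 200.0 / 100),           // 6: impression rate  -> 2.0 (assumed baseline of 100/window)
  math.min(2.0, (1.0 / 4) / maxCpm)     // 7: CPC              -> 0.05
)
```

Every entry lands in the small range the network expects, roughly 0 to 2.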
The Observation case class provides the campaign-level data that is not directly available from window counters:
```scala
final case class Observation(
  maxCpm: Double,          // Campaign's base max CPM (before multiplier)
  dailyBudget: Double,     // Total daily budget in dollars
  budgetRemaining: Double, // Remaining budget in dollars
  timeRemaining: Double,   // Fraction of delivery period remaining
  timestamp: Instant       // When this observation was taken
)
```
The spend rate calculation deserves a closer look:
```scala
private def spendRate(obs: Observation): Double = {
  if (obs.dailyBudget <= 0 || obs.timeRemaining >= 1.0) return 1.0
  val elapsed = 1.0 - obs.timeRemaining
  if (elapsed <= 0) return 1.0
  val expectedSpend = obs.dailyBudget * elapsed
  if (expectedSpend <= 0) return 1.0
  val actualSpend = obs.dailyBudget - obs.budgetRemaining
  math.min(3.0, actualSpend / expectedSpend) // cap at 3x overspend
}
```
If 40% of the day has passed and we have spent 40% of the budget, the spend rate is 1.0 – perfectly on pace. If we have spent 60% of the budget in that same time, the rate is 1.5 – we are overspending. This single number gives the agent a strong signal about whether it should bid more or less aggressively.
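That arithmetic is easy to check with a standalone copy of the pacing logic (same formula, free of the class's state):

```scala
// Standalone copy of the spend-rate calculation for a quick sanity check.
def pacing(dailyBudget: Double, budgetRemaining: Double, timeRemaining: Double): Double = {
  if (dailyBudget <= 0 || timeRemaining >= 1.0) return 1.0
  val elapsed = 1.0 - timeRemaining                      // fraction of the day gone
  val expectedSpend = dailyBudget * elapsed              // ideal even-paced spend so far
  if (expectedSpend <= 0) return 1.0
  math.min(3.0, (dailyBudget - budgetRemaining) / expectedSpend)
}

// $100 budget, 40% of the day elapsed (timeRemaining = 0.6):
pacing(100, 60, 0.6)  // spent $40 of an expected $40 -> 1.0, on pace
pacing(100, 40, 0.6)  // spent $60 of an expected $40 -> 1.5, overspending
```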
The reward: what the agent optimizes for
```scala
private def computeReward(obs: Observation): Double = {
  val clickReward = windowClicks.toDouble
  val rate = spendRate(obs)
  val overspendPenalty =
    if (rate > 1.5) config.overspendPenalty * (rate - 1.5) else 0.0
  val exhaustionPenalty =
    if (obs.budgetRemaining <= 0 && obs.timeRemaining > 0.1)
      config.exhaustionPenalty
    else 0.0
  clickReward - overspendPenalty - exhaustionPenalty
}
```
The reward is simple: clicks minus penalties. The agent gets +1 for each click in the window. But two things reduce the reward:
- Overspend penalty. If the spend rate exceeds 1.5x (spending 50% faster than ideal), a penalty kicks in proportional to how far over 1.5 it is. With overspendPenalty = 2.0, a spend rate of 2.5 costs a penalty of 2.0 * (2.5 - 1.5) = 2.0 – equivalent to losing two clicks' worth of reward.
- Exhaustion penalty. If the budget hits zero while more than 10% of the day remains, a flat penalty of 5.0 is applied. Running out of budget at 3pm when ads should run until midnight is a serious failure; this penalty makes sure the agent learns to avoid it.
The combination encourages the agent to maximize clicks while spending at a sustainable pace throughout the day.
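A standalone copy of the reward formula makes the numbers above easy to check, with the default penalties from the config:

```scala
// Reward = clicks - overspend penalty - exhaustion penalty (standalone copy).
def reward(clicks: Int, spendRate: Double, budgetRemaining: Double, timeRemaining: Double,
           overspendPenalty: Double = 2.0, exhaustionPenalty: Double = 5.0): Double = {
  val overspend  = if (spendRate > 1.5) overspendPenalty * (spendRate - 1.5) else 0.0
  val exhaustion = if (budgetRemaining <= 0 && timeRemaining > 0.1) exhaustionPenalty else 0.0
  clicks - overspend - exhaustion
}

reward(3, 1.0, 50.0, 0.5)  // 3 clicks, on pace           -> 3.0
reward(3, 2.5, 50.0, 0.5)  // 3 clicks, heavy overspend   -> 3 - 2.0 * (2.5 - 1.5) = 1.0
reward(0, 1.0,  0.0, 0.5)  // budget gone at midday       -> -5.0
```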
The observation loop
Here is the core method, called every 15 minutes:
```scala
def observe(obs: Observation): (Double, Option[Double]) = {
  val state = toState(obs)

  // If we have a previous state, store the transition and learn
  val loss = prevState match {
    case Some(ps) =>
      val reward = computeReward(obs)
      dayRewardSum += reward
      val done = obs.budgetRemaining <= 0 || obs.timeRemaining <= 0
      dqn.store(ps, prevAction.get, reward, state, done)
      dqn.trainStep()
    case None => None
  }
  dayObservations += 1

  // Select next action
  val action =
    if (config.inferenceOnly) dqn.selectGreedy(state)
    else dqn.selectAction(state)

  // Apply action: adjust multiplier
  val adjustment = config.dqnConfig.multiplierForAction(action)
  _bidMultiplier = math.max(
    config.minMultiplier,
    math.min(config.maxMultiplier, _bidMultiplier * adjustment)
  )

  // Save state for next observation
  prevObservation = Some(obs)
  prevState = Some(state)
  prevAction = Some(action)

  // Reset window counters
  windowImpressions = 0
  windowClicks = 0
  windowSpend = 0.0
  windowBidOpportunities = 0
  windowWins = 0

  (_bidMultiplier, loss)
}
```
Let’s trace through what happens on each call, step by step.
Step 1: Convert observation to state. The toState method combines the Observation with window counters to produce an 8-element array of normalized features.
Step 2: Learn from the previous action. If this is not the first observation, we now know the result of the action we chose last time. We compute the reward (clicks minus penalties), build a transition (prevState, prevAction, reward, currentState, done), store it in the replay buffer, and run one training step. The done flag is true if the budget is exhausted or the day is over.
Step 3: Choose the next action. If we are in inference-only mode, pick the action with the highest Q-value. Otherwise, use epsilon-greedy: with probability epsilon pick a random action, otherwise pick the greedy best.
Step 4: Apply the action. Look up the chosen action’s multiplier (e.g., action 4 maps to 1.1x), multiply it into the current bid multiplier, and clamp the result to [0.5, 2.0].
Step 5: Save state for next time. Store the current state and action so that on the next observation we can compute the reward and build a transition.
Step 6: Reset window counters. Clear all the impression, click, spend, and bid opportunity counters so they are fresh for the next 15-minute window.
The method returns the new bid multiplier and the training loss (if training happened). The CampaignEntity uses the bid multiplier for all bid responses until the next observation.
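The selection in step 3 was built in earlier chapters, but it is worth a reminder of how little there is to it. Epsilon-greedy over a Q-vector amounts to this sketch (here the Q-values are passed in directly, standing in for a forward pass through the network):

```scala
import scala.util.Random

// Epsilon-greedy: explore with probability epsilon, otherwise exploit.
def epsilonGreedy(qValues: Array[Double], epsilon: Double, rng: Random): Int =
  if (rng.nextDouble() < epsilon) rng.nextInt(qValues.length)   // random action
  else qValues.indices.maxBy(qValues(_))                        // index of the highest Q-value
```

Inference-only mode is the epsilon = 0 case: always take the greedy branch.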
The cumulative multiplier
A subtle but important point: actions do not set the multiplier directly. They scale it. Each action is a relative adjustment to whatever the current multiplier is.
Here is a concrete example of how the multiplier evolves through a day:
| Observation | Action chosen | Multiplier before | Calculation | Result |
|---|---|---|---|---|
| 1 (9:00 AM) | 0.9x | 1.0 | 1.0 x 0.9 | 0.9 |
| 2 (9:15 AM) | 1.2x | 0.9 | 0.9 x 1.2 | 1.08 |
| 3 (9:30 AM) | 0.7x | 1.08 | 1.08 x 0.7 | 0.756 |
| 4 (9:45 AM) | 1.4x | 0.756 | 0.756 x 1.4 | 1.058 |
| 5 (10:00 AM) | 1.4x | 1.058 | 1.058 x 1.4 | 1.482 |
| 6 (10:15 AM) | 1.4x | 1.482 | 1.482 x 1.4 | 2.0 (clamped) |
Notice observation 6: the raw result would be 2.074, but it exceeds the maximum and is clamped to 2.0. The hard bounds always apply. This is the safety mechanism in code:
```scala
_bidMultiplier = math.max(
  config.minMultiplier,
  math.min(config.maxMultiplier, _bidMultiplier * adjustment)
)
```
This cumulative design means the agent can make large adjustments over several steps (0.7 x 0.7 = 0.49, clamped to 0.5) while each individual step is a moderate change. It also means the agent has to learn to “undo” previous decisions – if it overbid in the morning, it needs to choose multipliers below 1.0 in the afternoon to bring the overall multiplier back down.
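The trajectory in the table can be reproduced by folding the chosen adjustments through the clamp:

```scala
// Replay the table: fold each relative adjustment through the [0.5, 2.0] clamp.
val adjustments = List(0.9, 1.2, 0.7, 1.4, 1.4, 1.4)
val trajectory = adjustments.scanLeft(1.0) { (m, adj) =>
  math.max(0.5, math.min(2.0, m * adj))
}
// trajectory: 1.0, 0.9, 1.08, 0.756, 1.0584, 1.48176, 2.0
// (the table rounds to three decimals; the last step is clamped from 2.074)
```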
Day stats for monitoring
The agent tracks cumulative daily metrics for monitoring dashboards:
```scala
def dayStats: BidOptimizationAgent.DayStats = BidOptimizationAgent.DayStats(
  impressions = dayImpressions,
  clicks = dayClicks,
  spend = daySpend,
  observations = dayObservations,
  totalReward = dayRewardSum
)
```
The DayStats case class also provides derived metrics:
```scala
final case class DayStats(
  impressions: Long,
  clicks: Long,
  spend: Double,
  observations: Int,
  totalReward: Double
) {
  def ctr: Double = if (impressions > 0) clicks.toDouble / impressions else 0.0
  def costPerClick: Double = if (clicks > 0) spend / clicks else 0.0
}
```
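For intuition, here is a worked example of the two derived metrics for a hypothetical day, with all numbers invented for illustration (note that 96 observations is exactly 24 hours of 15-minute windows):

```scala
// Hypothetical day: 50,000 impressions, 600 clicks, $180 spent.
val (impressions, clicks, spend) = (50_000L, 600L, 180.0)
val ctr          = if (impressions > 0) clicks.toDouble / impressions else 0.0  // 0.012 (1.2%)
val costPerClick = if (clicks > 0) spend / clicks else 0.0                      // $0.30 per click
```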
These numbers let operators see how the agent is performing day over day. Is it getting more clicks? Is the cost per click improving? Is the total reward trending upward? We will see in the next chapter how to use these across days to watch the learning curve.
Recap
BidOptimizationAgent is a thin translation layer. It converts the messy real world – impressions, clicks, budgets, time of day – into the clean abstractions that DQN needs: fixed-size state vectors, discrete actions, and scalar rewards. The actual learning happens inside DQNAgent, which we built in earlier chapters.
The key design decisions are:
- 8-dimensional state that captures everything the agent needs to know about current performance and remaining resources.
- 7 asymmetric actions that give the agent fine-grained control over bid adjustments, with more aggressive options for bidding up.
- Cumulative multiplier with hard bounds, so the agent adjusts incrementally and cannot do anything catastrophic.
- Click-based reward with pacing penalties, so the agent learns to maximize performance while spending sustainably.
- 15-minute observation cycle, slow enough to observe meaningful patterns and fast enough to react to changing conditions.
In the next chapter, we will see how this agent handles the realities of production: day boundaries, persistence across restarts, cold starts, and the fact that many agents are learning simultaneously in the same marketplace.