
O15: Metrics for AI Coding Agents

How to choose the North Star Metric for a coding agent

Lewis C. Lin
May 09, 2026

Coding agents produce a lot of measurable signals: lines written, tests passed, PRs opened, tokens consumed. Most teams reach for whichever one is easiest to instrument.

The harder question is whether the metric actually tracks the thing developers care about: a task handed off and finished well. A coding agent can score well on activity metrics while quietly failing on that dimension, shipping code that gets rewritten in review or completing tasks that create production problems downstream.

This post works through the full North Star Metric (NSM) selection process for a coding agent: stakeholders, core value, candidate metrics, and a VITAL framework evaluation. It ends with a primary recommendation, the case against the runners-up, and the supporting and counter metrics that keep the primary honest.

Metrics for a Coding Agent

Identify Key Stakeholders

  • Primary: Individual developers using the agent to write, debug, refactor, and review code — seeking speed and output confidence

  • Secondary: Engineering managers and team leads absorbing downstream effects: code review load, production incident rates, knowledge distribution

  • Tertiary: The product organization — using the agent as a competitive differentiator, engineer retention tool, and shipping force multiplier

Articulate Core Value

  • Primary: Completing real software tasks autonomously — not suggesting lines, but closing tickets

  • Supporting: Developers delegate well-scoped work and redirect cognitive energy toward architecture, product judgment, and complex debugging

  • Business: Compressing time from intent to working, tested, merged code

Define Core Actions

The agent’s core loop:

  1. Understand a task from natural language or a linked issue

  2. Write or modify code across multiple files

  3. Run tests and interpret results

  4. Fix failures

  5. Open a pull request ready for human review

Each step is measurable — and a potential failure point.
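
To make "measurable" concrete, here is a minimal instrumentation sketch, assuming a hypothetical event log; the step names, classes, and fields below are illustrative, not from this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical step names, one per stage of the core loop above.
STEPS = ["understand_task", "edit_code", "run_tests", "fix_failures", "open_pr"]

@dataclass
class StepEvent:
    task_id: str
    step: str                 # one of STEPS
    succeeded: bool
    detail: str = ""          # e.g. a failing test name or the PR URL
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class TaskTrace:
    """Step events for one delegated task, so a stall is attributable to a step."""
    task_id: str
    events: list[StepEvent] = field(default_factory=list)

    def record(self, step: str, succeeded: bool, detail: str = "") -> None:
        self.events.append(StepEvent(self.task_id, step, succeeded, detail))

    def failure_point(self) -> Optional[str]:
        """First step that failed, or None if every recorded step succeeded."""
        return next((e.step for e in self.events if not e.succeeded), None)
```

With traces like this, the candidate metrics in the next section reduce to simple aggregation across tasks.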

Brainstorm Potential NSMs

  • Task Completion Rate: % of agent-initiated tasks reaching a merged PR without the developer abandoning mid-task

  • Weekly Active Developers (WAD): Developers delegating at least one meaningful task per week — sustained adoption vs. novelty usage

  • Agent-to-Merge Time: Median time from task delegation to merged PR (see the sketch after this list for one way to compute all three)
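
To make these candidates concrete, here is a rough computation sketch over a hypothetical per-task record; TaskRecord and every field name are illustrative stand-ins for whatever your issue tracker and version control actually expose, not definitions from this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class TaskRecord:
    # Illustrative fields; real data would come from your issue tracker and VCS.
    developer: str
    delegated_at: datetime
    merged_at: Optional[datetime]   # None if no PR ever merged
    abandoned: bool                 # developer gave up on the agent mid-task

def task_completion_rate(tasks: list[TaskRecord]) -> float:
    """% of delegated tasks that reached a merged PR without being abandoned."""
    done = [t for t in tasks if t.merged_at is not None and not t.abandoned]
    return 100 * len(done) / len(tasks) if tasks else 0.0

def weekly_active_developers(tasks: list[TaskRecord], week_start: datetime) -> int:
    """Developers who delegated at least one task during the given week."""
    week_end = week_start + timedelta(days=7)
    return len({t.developer for t in tasks if week_start <= t.delegated_at < week_end})

def agent_to_merge_time(tasks: list[TaskRecord]) -> Optional[timedelta]:
    """Median time from delegation to merged PR, over tasks that actually merged."""
    durations = [t.merged_at - t.delegated_at for t in tasks if t.merged_at is not None]
    return median(durations) if durations else None
```

Note that statistics.median handles timedelta values directly, so Agent-to-Merge Time needs no unit conversion, and that "meaningful task" in the WAD definition still needs its own filter before anything is counted.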
