
Building visibility into a CI platform

About the Author

Cian Lane is Global Head of Developer Experience at Optiver, where he leads the systems that support the software development lifecycle, including CI infrastructure and build systems used across global engineering teams.

When people talk about developer productivity, they often jump straight to tools: powerful coding agents, faster compilers, smarter automation. These things matter, but they are not the whole story. At Optiver, developer productivity is about how effectively the entire end-to-end pipeline works: turning an idea into code, moving it through build, test, and validation, and eventually getting it running safely in production. If any part of that pipeline becomes a bottleneck, iteration slows down, and we lose our competitive edge.

I’ve been at Optiver for five years and currently lead the Developer Experience team. Our role is to keep that pipeline healthy for engineers across the organisation, from those building low-latency systems and custom hardware to those working on broader infrastructure or cutting-edge research. A big part of that job is keeping our build platform healthy.

This post is about how we run our build platform at Optiver, why some common assumptions about Continuous Integration (CI) don’t quite hold for us, and how we used data to operate a system that cannot simply scale on demand.

More code, broader contributors, same constraints

Over the last few years, the amount of code flowing through our systems has increased significantly, and not just because software engineers are writing more. Researchers and traders have always written code at Optiver, but their capability and scope continue to grow. They build visualisations, run experiments, and automate analysis as part of their day-to-day work, and that code goes through the same build and test pipeline as everything else.

AI has amplified this effect. AI-assisted development has lowered the cost of producing code across the organisation, which means more frequent changes from a broader set of contributors. That’s a net positive, but it does change the shape of the problem. When more people can push changes, keeping a tight feedback loop becomes harder, especially when the underlying infrastructure can’t elastically scale.

CI under real constraints

CI plays a central role in that feedback loop. It verifies changes, catches regressions early, and lets engineers move quickly with confidence. When it works well, you barely notice it. When it doesn’t, everything slows down.

For us, CI is particularly sensitive because of the systems we build. Many of our workloads need to run directly on specific hardware. That can mean testing against FPGA cards, running code that interacts closely with the Linux kernel, benchmarking across CPU architectures, or running system tests with strict latency requirements.

Some parts of our CI can be containerised and run in the cloud, but a significant portion cannot. Those jobs need to run on bare metal, on hardware we own and configure ourselves. Once you have that constraint, the usual assumptions about scaling start to fall away.

When auto scaling isn’t an option

In many environments, increasing CI demand is handled with elastic compute. More pull requests mean spinning up more agents, queues stay short, and the system absorbs bursts without much thought.

That approach doesn’t work for us. If we need another agent that interacts with an FPGA card, someone has to physically install that card. In some cases, it means buying new servers. None of that happens quickly, and none of it happens automatically.

Comparison of a typical CI architecture and Optiver’s fixed-hardware CI platform.

At that point, the question becomes less about how fast we can scale and more about whether we have the right capacity to begin with. That shifts the problem away from orchestration and towards understanding demand.

To operate the system responsibly, we need to be able to answer questions like:

  • Is our capacity meeting demand?
  • How long are jobs actually waiting, especially under load?
  • What sort of bursts do we see through the day?
  • If the queue is growing, what’s actually blocking it?

Those are fundamentally data problems for us.

The GitHub Actions visibility gap

We moved to GitHub Actions because it’s the native CI platform for GitHub, where we store our source code. The developer experience is strong, reusable pipelines are genuinely useful, and adoption friction is low. The trade-off is visibility.

Most CI platforms expose queue times, utilisation, and system-level metrics out of the box. GitHub Actions doesn’t, at least not at the level you need when you own the agent infrastructure. You can inspect individual repositories, but you can’t easily see what’s happening across the system as a whole.

For teams that can auto scale, that’s often acceptable. For us, it isn’t. So we were left with a choice. Either we give up that visibility, or we build it ourselves.

Reconstructing the system from events

GitHub webhooks gave us the raw material we needed. Every time a job is queued, started, or finished, we receive an event, and we ingest and store those events as they arrive. At a basic level, each job becomes a simple sequence of states: queued, started, finished.

On its own, that data isn’t very useful. The value comes from reconstructing state over time. By consuming the event stream and updating the state of each job as new events arrive, we can see what is queued, what is running, and what has finished at any given moment.
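
To make that concrete, here is a minimal Python sketch of the idea. The event and field names (workflow_job, action, created_at, started_at, runner_name, labels) come from GitHub's standard workflow_job webhook payload; the in-memory store and function names are illustrative, not our actual pipeline.

```python
# Minimal sketch: fold "workflow_job" webhook events into a per-job state table.
# The payload fields are GitHub's; the storage here is a plain dict for brevity.
from datetime import datetime
from typing import Optional

jobs: dict[int, dict] = {}  # job id -> latest known state

def parse_ts(value: Optional[str]) -> Optional[datetime]:
    # GitHub timestamps are ISO 8601, e.g. "2024-05-01T09:30:00Z"
    return datetime.fromisoformat(value.replace("Z", "+00:00")) if value else None

def handle_workflow_job(payload: dict) -> None:
    job = payload["workflow_job"]
    state = jobs.setdefault(job["id"], {"name": job["name"]})
    action = payload["action"]  # "queued", "in_progress" or "completed"
    if action == "queued":
        state.update(status="queued", queued_at=parse_ts(job["created_at"]))
    elif action == "in_progress":
        state.update(
            status="running",
            started_at=parse_ts(job["started_at"]),
            runner=job.get("runner_name"),
            labels=job.get("labels", []),
        )
    elif action == "completed":
        state.update(
            status="finished",
            completed_at=parse_ts(job["completed_at"]),
            conclusion=job.get("conclusion"),
        )
```

A table like this is already enough to answer "what is queued right now?" and "what is currently running?", which was the first thing we needed.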

There is some latency between GitHub emitting an event and it flowing through our pipeline, but it’s measurable and small enough to work with. For operational purposes, the view is effectively live. That helped, but when queues started to grow, knowing what was waiting still wasn’t enough.

Understanding what’s blocking the queue

To understand why work is waiting, we needed to look at the runners themselves. In GitHub Actions, each job runs on exactly one runner, and that runner is included in the webhook data. That lets us map running jobs directly to the hardware executing them.

With that information, we can see which runners are busy, which are idle, and how long each job has been running. Because our runner pool is relatively static, bottlenecks show up very quickly. A long-running job tied to a specific class of hardware is no longer hidden somewhere in the system.
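
A sketch of that view, building on the job-state table above. The runner names and the long-running threshold are hypothetical; in practice the pool comes from our own inventory.

```python
# Illustrative snapshot of a fixed runner pool: who is busy, who is idle,
# and which jobs have been running long enough to warrant a closer look.
from datetime import datetime, timezone

RUNNER_POOL = ["fpga-01", "fpga-02", "bench-cpu-01"]  # hypothetical names
LONG_RUNNING_SECONDS = 30 * 60                        # illustrative threshold

def runner_snapshot(jobs: dict[int, dict]) -> dict:
    now = datetime.now(timezone.utc)
    busy = {
        job["runner"]: (now - job["started_at"]).total_seconds()
        for job in jobs.values()
        if job.get("status") == "running" and job.get("runner")
    }
    return {
        "busy": busy,
        "idle": [r for r in RUNNER_POOL if r not in busy],
        "long_running": {r: s for r, s in busy.items() if s > LONG_RUNNING_SECONDS},
    }
```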

At that point, the CI platform stops being a black box. We can see what’s queued, what’s running, and what’s blocking progress right now. That solved the immediate operational problem. The next thing we needed to understand was how the system behaved over time.

From real-time visibility to system behaviour

Every event includes a timestamp, which makes it straightforward to calculate queue time and run time once we attach those timestamps to job state. Those timestamps also allow us to observe how the system behaves over time.

We process completed jobs into time-series metrics, calculating averages, percentiles, and extremes over short intervals. That data feeds dashboards and alerts that show how the system behaves under normal load and during bursts.
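
A simplified version of that aggregation, assuming the queued, started, and completed timestamps captured earlier. The 15-minute buckets and the choice of p50/p90 are illustrative, not a prescription.

```python
# Sketch: turn completed jobs into per-interval queue-time metrics (p50, p90, max).
from datetime import datetime, timedelta
from statistics import quantiles

BUCKET = timedelta(minutes=15)  # illustrative interval

def queue_seconds(job: dict) -> float:
    return (job["started_at"] - job["queued_at"]).total_seconds()

def bucket_of(ts: datetime) -> datetime:
    # Floor the timestamp to the start of its bucket.
    return ts - timedelta(seconds=ts.timestamp() % BUCKET.total_seconds())

def interval_metrics(completed_jobs: list[dict]) -> dict[datetime, dict]:
    waits_per_bucket: dict[datetime, list[float]] = {}
    for job in completed_jobs:
        key = bucket_of(job["completed_at"])
        waits_per_bucket.setdefault(key, []).append(queue_seconds(job))
    metrics = {}
    for bucket, waits in sorted(waits_per_bucket.items()):
        if len(waits) > 1:
            cuts = quantiles(waits, n=10)  # nine cut points: index 4 ~ p50, index 8 ~ p90
            p50, p90 = cuts[4], cuts[8]
        else:
            p50 = p90 = waits[0]
        metrics[bucket] = {"p50": p50, "p90": p90, "max": max(waits)}
    return metrics
```

An alert is then just a threshold check on the latest bucket's p90.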

Clear patterns start to emerge. Queue times rise during the day as people open pull requests, often peaking around lunchtime. Overnight behaviour looks different: less user activity, but some load driven by scheduled workloads.

In one case, we received an alert because the 90th percentile queue time exceeded the threshold we were comfortable with. That alert wasn’t the diagnosis, but it was the trigger to investigate whether we were seeing an anomaly or a genuine capacity issue.

Capacity, utilisation, and confidence

To answer that question, we look at utilisation. At any moment, we know whether each runner is busy or idle, and sampling that over time gives us utilisation metrics for the entire pool, for specific hardware classes, and for individual agent types.

This kind of data lets us reason more clearly about capacity. We can tell whether we’re under-provisioned, over-provisioned, or simply experiencing expected bursts. These are decisions that cost time and money, and having data changes how confidently we can make them.
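
As a rough illustration, the per-class calculation can be as simple as the following. The runner names and hardware classes are hypothetical; in practice the mapping comes from the runner labels.

```python
# Sketch: utilisation per hardware class from periodic samples of busy runners.
from collections import defaultdict

RUNNER_CLASS = {            # hypothetical runner -> hardware class mapping
    "fpga-01": "fpga",
    "fpga-02": "fpga",
    "bench-cpu-01": "bench",
}

def utilisation_by_class(samples: list[set[str]]) -> dict[str, float]:
    """samples: one set of busy runner names per sampling tick."""
    if not samples:
        return {}
    busy_ticks: dict[str, int] = defaultdict(int)
    for sample in samples:
        for runner in sample:
            busy_ticks[RUNNER_CLASS[runner]] += 1
    class_size: dict[str, int] = defaultdict(int)
    for cls in RUNNER_CLASS.values():
        class_size[cls] += 1
    return {
        cls: busy_ticks[cls] / (class_size[cls] * len(samples))
        for cls in class_size
    }
```

A pool-wide number falls out of the same data by ignoring the class grouping.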

Turning CI data into an experimentation platform

Once we had reliable operational data, we started to think about how we could attach additional context to the jobs themselves. During a typical C++ build, we collect metadata about the environment, including the Linux version, compiler version, and build tooling. When the job finishes, that metadata can be published as an artifact of the build.

Ingesting those artifacts and turning them into structured data lets us analyse build performance alongside environment details. That opens the door to experiments that were previously based on intuition.

When rolling out a compiler upgrade across our shared toolchain, we can measure its impact gradually and see whether builds got faster or slower, or whether regressions affected specific workloads. The same applies to OS upgrades. Because we operate close to the kernel, those changes matter, and having concrete data beats relying on anecdotal reports.
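
A sketch of what that analysis can look like. The artifact schema here (compiler, linux, build_seconds) is an assumed example rather than the exact metadata we publish.

```python
# Sketch: join build-metadata artifacts with build durations and compare
# environments, e.g. median build time per compiler version.
import json
from collections import defaultdict
from statistics import median

def load_metadata(artifact_paths: list[str]) -> list[dict]:
    records = []
    for path in artifact_paths:
        with open(path) as f:
            # e.g. {"compiler": "gcc-13", "linux": "5.15", "build_seconds": 412}
            records.append(json.load(f))
    return records

def median_build_time_by(records: list[dict], key: str) -> dict[str, float]:
    groups: dict[str, list[float]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["build_seconds"])
    return {value: median(times) for value, times in groups.items()}
```

Grouping by "compiler" during a staged rollout gives a quick read on whether the new toolchain is actually faster, before anecdotes start to circulate.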

We wouldn’t make trading decisions based on CI metrics alone, but for a system this sensitive, visibility is extremely valuable. It gives us something close to A/B testing for toolchain and OS changes and helps us pinpoint regressions as they appear.

Treating CI like a production system

What started as a lack of visibility in GitHub Actions pushed us to build something more robust than we originally planned. Over time, it changed how we operated the platform.

We moved from reacting to problems to understanding behaviour, from guessing about capacity to measuring it, and from treating CI as a black box to running it much more like a production system.

For us, developer productivity isn’t just about adopting the right tool. It’s about being able to see the impact of each change we make and using that insight to keep refining how we deliver software into production.

