Research at Scale: Orchestrating ML for Systematic Trading
This article is adapted from a technical talk delivered by Rob Hahn, our Global Head of Optiver Research Platform, at Optiver’s Chicago office.
Systematic trading rests on the strength of its research engine. Experiments run today become tomorrow's models, signals, risk checks, and position limits. End-to-end research turnaround time should be a core operating target for any desk aiming to compete. The research cycle is simple to state: form a hypothesis, run experiments, evaluate results, then either refine the hypothesis or translate the results into production decisions. The hard part is running this cycle repeatably across thousands of instruments and decades of market data. Here's how we approach this problem and the engineering choices that made a difference.
What a Research Platform Must Optimize
We keep one goal in front of everything else: shorten the loop from idea to validated result. Cost, capacity, capability, and correctness all serve that goal.
- Cost: HPC is expensive, so efficiency matters. If you can run twice the compute per dollar, that’s a real competitive edge.
- Capacity: Have enough headroom to avoid blocking the business while keeping usage high. Wasting budget is not an option, and neither is throttling research.
- Capability: A good research platform anticipates the direction of the business and what it needs for growth and success, not just today but 12 to 24 months from now.
- Correctness: Reproducibility and traceability are nonnegotiable. If results can’t be recreated, you’re burning cycles, not doing research.
When Pipelines Scale Up
On paper, a research pipeline looks linear: generate features, train models, run simulations. Reality bends it out of shape. If feature generation for a single market day takes one hour, about 1,000 trading days will take roughly six weeks of end-to-end runtime. Shard feature generation by day and the runtime drops back down to about an hour, although total computational time stays the same. Now, attempt to generate features for 100 instruments, and you will soon be looking at 100,000+ feature generation jobs. Serialized, the original six weeks turns into roughly 11 years of runtime by changing just one requirement. Parallelism stops being a choice.
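The arithmetic behind those numbers is worth making explicit. A minimal back-of-the-envelope sketch, using only the figures quoted above (the worker count is an illustrative assumption, not a real cluster size):

```python
# Figures from the text: one hour of feature generation per market day,
# ~1,000 trading days, then scaled out to 100 instruments.
HOURS_PER_JOB = 1
TRADING_DAYS = 1_000
INSTRUMENTS = 100

serial_one_instrument = HOURS_PER_JOB * TRADING_DAYS        # 1,000 hours
print(f"serial, one instrument: {serial_one_instrument / 24 / 7:.1f} weeks")

jobs = TRADING_DAYS * INSTRUMENTS                           # 100,000 jobs
serial_all = HOURS_PER_JOB * jobs
print(f"serial, all instruments: {serial_all / 24 / 365:.1f} years")

# Sharded by (day, instrument) with enough parallel workers, wall-clock
# time collapses back toward the duration of a single job.
workers = 5_000  # illustrative worker count
sharded_hours = serial_all / workers
print(f"sharded across {workers:,} workers: {sharded_hours:.0f} hours")
```

The point of the exercise: total compute is fixed, so the only lever on turnaround time is how widely the work fans out.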
Researchers tend to think about pipelines as mostly linear flows. Once parallelism becomes mandatory, that mental model breaks down. It becomes hard to reason about how the underlying computational graph should be structured for maximum throughput and utilization. Without careful shaping, a pipeline that looks simple on a whiteboard can create sharp hotspots in its resource usage. Layer on other dimensions such as team headcount, model and simulation complexity, and you are quickly dealing with a combinatorial explosion that is impossible to optimize by hand.

That combinatorial explosion doesn’t just make graphs hard to reason about; it also slams into hard capacity limits quickly. Before we simplify how researchers express pipelines, we need to make sure there is enough compute to run them.
Addressing Compute Capacity
In practice, there are three ways to scale compute capacity: go all-in on on-prem, run hybrid on-prem and cloud, or move everything to the cloud.
- Fully on-prem: An obvious option is to buy more hardware and let the research grow into it. This works at modest scale, but cracks appear quickly: new machines can sit underutilized if demand forecasts are wrong, overcommitting can leave you stuck on older hardware generations, supply chain delays become a planning risk, and at large enough scale anything in the cluster can break at any point in time.
- On-prem with cloud spillover: A common alternative is to keep a strong on-prem baseline and spill into the cloud during peaks. That hedges uncertainties around future demand and build-out, but introduces issues of data gravity (moving or replicating hundreds of PBs is slow and expensive), data integrity (lagging or corrupted replicas producing different results), and environment drift (subtle differences in limits and services between on-prem and cloud). Naive hybrids tend to leak into user code as “if cloud, do X; if on-prem, do Y,” which is exactly the complexity we want to remove from researchers.
- Fully cloud: Declaring the cloud home for all workloads avoids hybrid complexity but comes with its own trade-offs: less control over hardware and topology than specialized on-prem builds, ongoing compute and storage costs that accumulate quickly, potential for vendor lock-in and billing surprises, and, at sufficient scale, the same multi-region data-gravity and integrity problems in a different form.
Data gravity and integrity eventually become a problem regardless of whether you choose an on-prem or cloud-first solution. We assume multiple on-prem data centers and multiple cloud regions from the start, and we treat those constraints as first-class rather than as edge cases.
We follow two principles to reduce the data gravity problem: minimize ad-hoc spilling ("just burst this job") and decide placement up front. For each team, workload, and research stage, we decide which cluster it runs on and, where possible, move the workload to the data rather than the data to the workload.
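Deciding placement up front can be as simple as a declarative table consulted at submission time. A minimal sketch of the idea; the table contents, cluster names, and function are all hypothetical, not Optiver's actual configuration:

```python
# Hypothetical placement table: decided up front per (team, research stage),
# rather than ad-hoc "just burst this job" spilling. Entries illustrate
# moving the workload to where the data lives.
PLACEMENT = {
    ("options-desk", "feature-generation"): "on-prem-chi",  # data lives on-prem
    ("options-desk", "model-training"): "cloud-gpu",
    ("futures-desk", "simulation"): "on-prem-ams",
}

def cluster_for(team: str, stage: str) -> str:
    """Resolve the cluster for a workload; fall back to a default home
    cluster when no explicit rule exists."""
    return PLACEMENT.get((team, stage), "on-prem-chi")

print(cluster_for("options-desk", "model-training"))  # cloud-gpu
print(cluster_for("futures-desk", "backtest"))        # on-prem-chi (fallback)
```

The value of making this a static table is that placement becomes reviewable policy rather than logic scattered through user code.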
To make that usable, the platform hides the complexity behind two abstractions.
- Compute is abstracted so clusters, whether on-prem or cloud, sit behind a common interface, and switching placement is a policy or config change, not an application-layer code change.
- Data is exposed through a shared catalog that spans on-prem and cloud and abstracts over storage backends such as S3, NFS, Postgres, and Delta Table.
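To make the catalog abstraction concrete, here is a minimal sketch of what such an interface might look like. Every name here (`Catalog`, `DatasetRef`, the backend strings, the example dataset) is hypothetical, invented for illustration; the real catalog is not described in this level of detail:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetRef:
    """Logical dataset name plus its resolved physical location."""
    name: str
    backend: str  # e.g. "s3", "nfs", "postgres", "delta"
    uri: str
    cluster: str  # cluster hosting the authoritative copy

class Catalog:
    """Maps logical dataset names to physical locations, so user code
    never hardcodes 'if cloud, do X; if on-prem, do Y'."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetRef] = {}

    def register(self, ref: DatasetRef) -> None:
        self._entries[ref.name] = ref

    def resolve(self, name: str) -> DatasetRef:
        return self._entries[name]

# Researchers address data by logical name; the catalog decides where it is.
catalog = Catalog()
catalog.register(DatasetRef("features/example", "delta", "s3://bucket/ex", "cloud-us-east"))
ref = catalog.resolve("features/example")
print(ref.backend, ref.cluster)  # delta cloud-us-east
```

Because the authoritative cluster is part of the resolved reference, the same lookup also feeds the "move the workload to the data" placement decision.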
The result is high utilization of on-prem capacity, “unlimited” headroom through cloud when genuinely needed, and far fewer surprises from data gravity while keeping researchers focused on what to run, not where it runs.
From Scheduler-Owned DAGs to In-Process Orchestration
In the traditional model, a research pipeline builds a computational graph and hands the entire graph to a scheduler such as Slurm. The scheduler owns that graph: it tracks dependencies, decides when nodes are ready, and runs them on the scheduler’s underlying cluster. That works as long as the whole pipeline stays in one environment. It works much less well if you want one stage of the pipeline on an on-prem cluster and another stage on cloud GPUs.
You can approximate that by manually splitting the graph into smaller subgraphs and submitting each one to a different scheduler. In practice, that means asking researchers to think in scheduler boundaries, split pipelines into stages, and predict hotspots up front. Given how much we care about research turnaround time, we do not think it is realistic to expect researchers to spend their day hand-carving DAGs to steer placement.
Instead, we pulled the part that decides “what is ready to run next?” out of the scheduler and into a Python library that lives next to the user’s code. We call that component the GraphRunner.
In this model, the GraphRunner owns the computational graph inside the same Python process that built it. It tracks which nodes are ready, and when a node becomes runnable it decides where to send it: an on-prem batch scheduler, a cloud scheduler, or a local thread or process pool. The schedulers no longer see a monolithic DAG; they see individual tasks. Placement decisions move from “which cluster should run this whole pipeline?” to “which cluster should run this node?”
That shift lets us route different nodes, branches and stages of a single graph to different clusters, without exposing any of that complexity to the researcher. They write one pipeline; the platform decides, node by node, where it runs.
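The in-process model described above can be sketched in a few dozen lines. This is a toy illustration of the technique, not Optiver's GraphRunner: cluster submission is stood in for by local thread pools, and the placement policy is a plain callable:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

class GraphRunner:
    """In-process orchestrator sketch: owns the DAG, tracks which nodes
    are ready, and routes each runnable node to a placement-specific
    executor. Thread pools stand in for real cluster schedulers."""

    def __init__(self, placement_policy):
        self.nodes = {}  # name -> (fn, deps)
        self.policy = placement_policy  # name -> executor key

    def add(self, name, fn, deps=()):
        self.nodes[name] = (fn, tuple(deps))

    def run(self):
        executors = {"on_prem": ThreadPoolExecutor(4), "cloud": ThreadPoolExecutor(4)}
        done, futures, results = set(), {}, {}
        while len(done) < len(self.nodes):
            # Submit every node whose dependencies are all satisfied.
            for name, (fn, deps) in self.nodes.items():
                if name in done or name in futures:
                    continue
                if all(d in done for d in deps):
                    target = self.policy(name)  # node-level placement decision
                    args = [results[d] for d in deps]
                    futures[name] = executors[target].submit(fn, *args)
            # Wait for at least one node to finish, then record it.
            finished, _ = wait(futures.values(), return_when=FIRST_COMPLETED)
            for name, fut in list(futures.items()):
                if fut in finished:
                    results[name] = fut.result()
                    done.add(name)
                    del futures[name]
        for ex in executors.values():
            ex.shutdown()
        return results

# One pipeline, routed node by node: training goes to "cloud", the rest on-prem.
runner = GraphRunner(lambda name: "cloud" if name.startswith("train") else "on_prem")
runner.add("features", lambda: [1, 2, 3])
runner.add("train", lambda feats: sum(feats), deps=["features"])
runner.add("simulate", lambda model: model * 2, deps=["train"])
print(runner.run()["simulate"])  # 12
```

The key property the sketch preserves is that the schedulers (here, executors) never see the DAG, only individual tasks; readiness tracking and placement both live in the researcher's process.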
Closing Thoughts
Research at this scale is a moving target. New models, larger datasets and changing market conditions keep pushing the boundaries, and the platform evolves with them. What matters is keeping the loop fast, predictable and correct so researchers can keep pushing forward.
