About the author
Nicolás is a seasoned software engineer with over 15 years of experience. He joined Optiver in 2019 with the aim of developing an infrastructure management platform and currently leads the Linux, Networks, and Platform infrastructure teams.
This blog post is an overview of a talk that I gave at EuroPython 2023. If you prefer, you can watch the full presentation via the conference site here.
In the high-performance landscape of algorithmic trading, technological infrastructure isn’t just important—it’s critical. While Infrastructure as Code (IaC) is a well-established practice in cloud-based solutions, its application in non-cloud environments presents unique challenges, especially in latency-sensitive environments like ours at Optiver.
In this post, I’ll go into these specific challenges and the solutions we’ve developed at Optiver.
The evolution of trading infrastructure
Optiver’s first-ever trade took place in 1986. In those days, we had a single trader on the floor of the Amsterdam Options Exchange, where brokers communicated orders vocally. Fast forward to the present: trading has evolved into computer signals on wires, and data centres, rather than trading floors, are now the home of the exchanges.
Today, any entity that wants to join an exchange needs a way to connect to the exchange system, which is hosted in a data centre. It’s possible to connect through brokers, but big market-making companies like Optiver must build their own infrastructure to remain competitive.
It would also be possible—theoretically—to use cloud providers and a tool like Terraform to provision infrastructure. For example, you could have a simple Point of Presence (PoP) in the data centre where the exchange is hosted and connect this PoP to infrastructure running on the closest cloud-provider region. But in today’s trading landscape, success is constrained by the frequency of processors and latency of networks. As transactions occur in fractions of a second, even minor delays can translate to missed opportunities and reduced profitability. Competitive participants must operate at these high speeds, or risk falling behind in a market where microseconds, even nanoseconds, can mean the difference between profit and loss.
If you want to be at the top of the industry, building your own custom infrastructure has a lot of competitive advantages.
Modern exchange infrastructure
The exchange system is hosted in a data centre, and every member is connected to it. Using UDP Multicast, the exchange gives you a guarantee that all members will receive information at the same time. For example, the exchange might announce that a stock option traded 10 lots at €5.00; it’s then up to your systems and applications to process the data, decide how to react and potentially send a new order back to the exchange.
This scenario means that we have to build all our networking and compute, including external connectivity, and connectivity to our offices where the traders are.
For each of your colocations, you are required to take care of the physical space, power, racks, cables, switches and servers, connectivity, firmware, OS, configurations, etc. That’s already not a simple problem to solve, but now multiply that by dozens of exchanges around the globe, and you can see the scale of the challenges we face at Optiver.
Infrastructure as Code
In recent decades, Infrastructure as Code (IaC) has taken infrastructure to the next level: no more spreadsheets, manually maintained diagrams or tribal knowledge (like using pet names for servers). As engineers, we don’t want to depend on our memories, a mnemonic rule, or even a spreadsheet to manage our environment. We want a system to take care of that, so that we can scale easily.
IaC also gives you the power to enforce a standard, set up information assurance, build orchestration and have a good interface for applications to grow from. The true test of your infrastructure is the question: “Can you rebuild it from scratch?” If the answer is a fearless yes, you’ve passed the test.
Sadly, there is no “de facto” OSS solution to build on-prem infra. The “K8s of on-prem” has yet to be built. We suggest exploring OpenStack, MAAS and RackN before embarking on a custom-made solution.
NetBox is another interesting choice. After considering these options, we concluded that NetBox was probably the closest match and would solve part of our problem, but it didn’t have the flexibility and functionality we needed. Of course, we still use many open-source technologies to build our stack. NAPALM is a great community-driven library for interacting with switches, and our main system is built on top of Django, FastAPI and Celery.
Our Approach: IaC without the cloud
Implementing a standard
Now back to the original question: “How can we use IaC without the cloud?”
Step 0 is to have a standard. It’s unrealistic to build any platform to manage infrastructure without having a standard to implement. Any code we write should simply be an implementation of the standard. The system should also support more than one standard simultaneously, because retrofitting existing remote data centres will always lag behind the ideal standard to some degree. The standard should dictate everything from the colour of the cables to the networking architecture to the OS configuration.
Our intent systems
In our implementation, our “Infra-Intent” systems represent our infrastructure, and we can model the different realities in a relational database. For example, a data centre has racks, and in these racks we have switches and servers, which are all interconnected with cables. You can compare this system with a cloud provider’s web console, where you can see a VPC that has virtual machines, each of which has interfaces that belong to a subnet.
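To make the relational idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, columns and hostnames are hypothetical and chosen only for illustration; they are not the actual Infra-Intent schema. The sketch assumes, for simplicity, that the switch is always recorded as the “a” end of a cable.

```python
import sqlite3

# Hypothetical schema: a data centre has racks, racks hold devices,
# and cables connect a port on one device to a port on another.
SCHEMA = """
CREATE TABLE data_centre (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE rack (
    id             INTEGER PRIMARY KEY,
    data_centre_id INTEGER NOT NULL REFERENCES data_centre(id),
    name           TEXT NOT NULL
);
CREATE TABLE device (
    id       INTEGER PRIMARY KEY,
    rack_id  INTEGER NOT NULL REFERENCES rack(id),
    hostname TEXT UNIQUE NOT NULL,
    kind     TEXT NOT NULL CHECK (kind IN ('switch', 'server'))
);
CREATE TABLE cable (
    id          INTEGER PRIMARY KEY,
    a_device_id INTEGER NOT NULL REFERENCES device(id),
    a_port      TEXT NOT NULL,
    b_device_id INTEGER NOT NULL REFERENCES device(id),
    b_port      TEXT NOT NULL,
    UNIQUE (a_device_id, a_port),  -- a port can hold only one cable
    UNIQUE (b_device_id, b_port)
);
"""


def build_example() -> sqlite3.Connection:
    """Create an in-memory database with one rack, one switch and two servers."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    db.executescript(SCHEMA)
    db.execute("INSERT INTO data_centre (id, name) VALUES (1, 'example-dc')")
    db.execute("INSERT INTO rack (id, data_centre_id, name) VALUES (1, 1, 'rack-01')")
    db.executemany(
        "INSERT INTO device (id, rack_id, hostname, kind) VALUES (?, ?, ?, ?)",
        [
            (1, 1, "switch-01", "switch"),
            (2, 1, "server-01", "server"),
            (3, 1, "server-02", "server"),
        ],
    )
    db.executemany(
        "INSERT INTO cable (a_device_id, a_port, b_device_id, b_port) VALUES (?, ?, ?, ?)",
        [(1, "Ethernet1", 2, "eth0"), (1, "Ethernet2", 3, "eth0")],
    )
    db.commit()
    return db


def servers_behind(db: sqlite3.Connection, switch_hostname: str) -> list[str]:
    """Return the hostnames cabled to the given switch (switch on the 'a' end)."""
    rows = db.execute(
        """
        SELECT peer.hostname
        FROM cable
        JOIN device AS sw   ON sw.id = cable.a_device_id
        JOIN device AS peer ON peer.id = cable.b_device_id
        WHERE sw.hostname = ?
        ORDER BY peer.hostname
        """,
        (switch_hostname,),
    ).fetchall()
    return [hostname for (hostname,) in rows]
```

With the relationships stored this way, a question such as “which servers hang off this switch?” becomes a join, and the database itself rejects a device that refers to a rack that does not exist.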
Using the web API we can define our infrastructure, and once that’s there, it can be consumed by our provisioning pipelines, which simply read that source and make it a reality. A provisioning pipeline can be a process generating a switch configuration, or a pipeline that knows how to install an OS image on a bare-metal server and configure it. Decoupling these pipelines from our intent system allows us to easily change our provisioning tooling as we see fit.
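As a rough illustration of that decoupling, the sketch below shows a pipeline that only knows how to read an intent record and render a configuration from it. Everything here is an assumption made for the example: the shape of the intent record, the function names, and the configuration syntax, which is loosely modelled on common switch CLIs rather than any particular vendor.

```python
from typing import Callable, TypedDict


class InterfaceIntent(TypedDict):
    name: str
    description: str
    vlan: int


class SwitchIntent(TypedDict):
    hostname: str
    interfaces: list[InterfaceIntent]


def render_switch_config(intent: SwitchIntent) -> str:
    """Render a configuration purely from intent data.

    Interfaces are sorted by name so that the same intent always
    produces byte-identical output, whatever order it arrives in.
    """
    lines = [f"hostname {intent['hostname']}"]
    for iface in sorted(intent["interfaces"], key=lambda i: i["name"]):
        lines += [
            f"interface {iface['name']}",
            f"  description {iface['description']}",
            f"  switchport access vlan {iface['vlan']}",
        ]
    return "\n".join(lines) + "\n"


def run_pipeline(fetch_intent: Callable[[str], SwitchIntent], hostname: str) -> str:
    """The pipeline depends only on a callable that returns intent.

    In a real setup fetch_intent might wrap an HTTP call to the intent
    system; here it is injected so the renderer can be swapped or tested alone.
    """
    return render_switch_config(fetch_intent(hostname))
```

Because the renderer takes plain data and the fetcher is injected, either side can be replaced without touching the other.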
Defining the infrastructure in the Infra-Intent system is a key step for our automation, and also for enforcing our standard. Provisioning all the resources requires hundreds of API calls, which a human would otherwise have to carry out by hand in the required order. Instead, we have code that takes a high-level definition and makes all the required API calls. Think of Terraform making API calls to a cloud provider when you ask it to create a VM. This abstracts away the low-level complexity of device configuration and allows engineers to manage our infrastructure effectively at scale.
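The “required order” part can be sketched with a dependency graph. The example below expands a small high-level definition into individual resources and uses graphlib.TopologicalSorter from the Python standard library (3.9+) to order them so that every resource is created after the things it depends on. The definition format and the resource naming scheme are hypothetical, and a real tool would issue an API call per resource instead of returning names.

```python
from graphlib import TopologicalSorter


def expand(definition: dict) -> dict[str, set[str]]:
    """Expand a high-level definition into resources.

    Returns a mapping of resource name -> set of resources that must
    already exist before it can be created.
    """
    dc = definition["data_centre"]
    dc_name = f"dc/{dc}"
    deps: dict[str, set[str]] = {dc_name: set()}
    for r in range(1, definition["racks"] + 1):
        rack_id = f"{dc}-r{r:02d}"
        rack = f"rack/{rack_id}"
        switch = f"switch/{rack_id}-sw01"
        deps[rack] = {dc_name}   # a rack lives in a data centre
        deps[switch] = {rack}    # a switch is mounted in a rack
        for s in range(1, definition["servers_per_rack"] + 1):
            server = f"server/{rack_id}-srv{s:02d}"
            deps[server] = {rack}
            # A cable needs both of its endpoints to exist first.
            deps[f"cable/{rack_id}-sw01--srv{s:02d}"] = {switch, server}
    return deps


def plan(definition: dict) -> list[str]:
    """Return the resources in an order that respects every dependency."""
    return list(TopologicalSorter(expand(definition)).static_order())
```

For a definition with one rack and two servers, `plan` yields seven resources, starting with the data centre and only reaching each cable once both of its endpoints have been created.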
Truth collectors and audits
Finally, our truth collectors take a snapshot of how our infrastructure looks in reality, and audits compare this reality to our intention. If something is different, we generate an action for a human to check what happened. In effect, we have tests for our infrastructure in the same way we have them for code.
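An audit of this kind boils down to a structured diff. The following is a minimal sketch, assuming both intent and collected truth can be reduced to a mapping of device name to attributes; the Finding type and the attribute names are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One difference between intent and reality, for a human to review."""
    device: str
    attribute: str
    intended: Optional[str]
    observed: Optional[str]


def audit(
    intent: dict[str, dict[str, str]],
    truth: dict[str, dict[str, str]],
) -> list[Finding]:
    """Compare a truth snapshot against intent and list every mismatch."""
    findings: list[Finding] = []
    for device in sorted(intent.keys() | truth.keys()):
        wanted = intent.get(device)
        seen = truth.get(device)
        if wanted is None or seen is None:
            # The device exists on only one side: missing or unexpected.
            findings.append(
                Finding(
                    device,
                    "presence",
                    "present" if wanted is not None else "absent",
                    "present" if seen is not None else "absent",
                )
            )
            continue
        for attribute in sorted(wanted.keys() | seen.keys()):
            if wanted.get(attribute) != seen.get(attribute):
                findings.append(
                    Finding(device, attribute, wanted.get(attribute), seen.get(attribute))
                )
    return findings
```

An empty list plays the role of a green test run; each Finding is a candidate action for someone to investigate.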
Real-life use case
When we create a new colocation, we first write our high-level definition files. Then we run our piece of code that will create all the resources in our intent system; this includes everything from the cables to the firmware version of a server. Once we have that, we can export patching instructions for our Data Centre Engineering team, who will do the physical work and connect everything as expected. Once that’s done, they can run a pipeline to verify that things are connected according to intent (so the same process as for our audits), and when that’s green they can just run the pipelines to provision the devices.
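The hand-off to the physical work can be pictured as two small functions: one that turns the cable records into patching instructions, and one that reports which intended links have not been observed yet. This is only a sketch. The Cable fields, the wording of the instructions and the way observed links are represented are all assumptions, and how the observed links get collected from the devices is left out.

```python
from typing import NamedTuple


class Cable(NamedTuple):
    a_device: str
    a_port: str
    b_device: str
    b_port: str
    colour: str


def patching_instructions(cables: list[Cable]) -> list[str]:
    """Render one numbered, human-readable instruction per intended cable."""
    return [
        f"{n}. Connect {c.a_device}:{c.a_port} to {c.b_device}:{c.b_port} "
        f"using a {c.colour} cable"
        for n, c in enumerate(sorted(cables), start=1)
    ]


def link_key(a_device: str, a_port: str, b_device: str, b_port: str) -> frozenset:
    """Identify a link regardless of which end is listed first."""
    return frozenset({(a_device, a_port), (b_device, b_port)})


def missing_links(cables: list[Cable], observed: set[frozenset]) -> list[Cable]:
    """Return the intended cables that do not appear among the observed links."""
    return [
        c
        for c in sorted(cables)
        if link_key(c.a_device, c.a_port, c.b_device, c.b_port) not in observed
    ]
```

Using a frozenset of endpoints as the key means a link reported from the server side matches the same cable recorded from the switch side, so verification does not depend on the direction in which a link was observed.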
Your future in Infrastructure Software Engineering
I could go on for pages about this topic, but I hope this glimpse is enough to give an idea of our challenges and of what our Infrastructure Software Engineers are building.
If you’re excited about exploring the world of infrastructure as code and contributing to an environment that thrives on creativity and innovation, visit our job post below. We would love to hear from you.