Experienced  · 

Optiver’s approach to network devices and Infrastructure as Code

Carlo Maragno

Part of series:

Carlo is a software engineer with a background in offensive cybersecurity and hardware. They joined Optiver’s Platform team in 2022 to develop our bespoke network abstraction layer.

Some weeks ago, my colleague Nicolás Demarchi wrote a post about Infrastructure as Code (IaC) at Optiver. Building on Nico’s post, I’d like to take the opportunity to go into more depth on the implementation of some of the subsystems, with a focus on networks.

In this post, I’ll lay out some of the challenges, framed in the context of IaC, and how we solved them. Many of our solutions are bespoke implementations of popular concepts, an advantageous approach that allows us great flexibility of implementation.

We’ll cover two areas: device data audits, also known as our secret grey field IaC recipe, and device provisioning.

One of the first topics Nico highlighted is the need for a standard. This is undoubtedly true, especially for new data centres or colocations, but one of the goals is implementing IaC in a grey field environment. In other words, the solution should be able to tolerate some degree of manually executed changes, so that operations can continue while these systems are being built.

This is a priority, because as a trading company, we don’t get to pull out of the markets for extended periods of time while infrastructure is being rebuilt to new, more manageable and automatable standards. Solving this problem involves a concept Nico introduced in his post: automated audits.

When we start managing parts of our infrastructure via code, we need to ensure that the state doesn’t diverge over time. One way to do this would be to periodically redeploy all devices, but that can create other problems. For example, network switches may fail during configuration reloads. This failure can have severe implications, especially when it involves trading-critical applications.

Our solution involves periodically collecting data from devices and comparing it to the intended state in our intent system.

Figure 1: High Level

This requires the following components:

  1. An intent system: At Optiver we call this “Infra intent”, and it’s implemented as a Django application. The idea here is that Infra intent is the sole entity that holds our desired state, or as we say at Optiver, the intended configuration. Given its role of oracle, this system is also responsible for ensuring that the data doesn’t diverge over time.
  2. A hardware abstraction layer for the network devices: At Optiver we call this Network Pipeline, which is a Python application that can interact with a lot of devices from different vendors via a bespoke abstraction layer built on top of NAPALM and some custom pipelines.

The network configuration audit flow relies heavily on these two systems. Let’s dig deeper into how the interaction between these systems works.

Figure 2: Low level

Focusing on Network Pipeline, you can see that workflows are composed of a series of pipelines, of which we will describe two below device config audits and configuration deployment. Network Pipeline supports tasks, defined as a series of pipelines, but currently we don’t rely heavily on this feature.

Defining infra intent

As mentioned in Nico’s blog, everything starts with ‘intent’, because it’s the keeper of the “default state” for our devices. The Django models in intent provide a comprehensive view over our entire infrastructure, making it the perfect system to implement complex business logic. This logic is crucial to check if the state of devices has diverged from what it was intended.

Daily, Infra Intent dispatches a series of tasks to Network Pipeline, the so-called “device truth pipeline”. These pipelines represent one of the core hardware abstraction layer functionalities of Network Pipeline. In short, they provide a unified state representation of the configuration of all network devices in our compute fabric.

Our primary goal is to write only one refined business logic component within our Infra Intent. This logic, referred to as ‘audits’, is crucial as it ensures correctness of our network configurations, while maintaining the flexibility to use the best type of device for each function in our network.

Optiver’s tech stack is so optimized that we can’t afford to make decisions like, “We only buy devices from vendor X, so we only develop integration with those devices.”

Enhancing vendor flexibility with open-source solutions

To achieve this level of flexibility, we integrate two powerful open-source projects. The first is NAPALM, a set of Python libraries that wraps many of the vendor-provided SDKs into one coherent library. We rely heavily on NAPALM, as it’s a convenient way to abstract many of the common interactive tasks without relying on legacy technologies like Expect.

While it’s a great start, NAPALM doesn’t entirely meet our requirements, especially in terms of multi-vendor support and handling complex interaction patterns. For example, NAPALM can parse the configuration of the device interfaces, but it struggles with parsing the BGP configuration. This feature is only partially supported for Arista devices, and we need to support other vendors as well. We spent some time thinking about a holistic way to solve this issue and we came up with a Look Ahead Left to Right (LALR) parser, based on Lark.

Lark is a Python library that implements a LALR parser in pure Python. It has some limitations, for example, only supporting one-token lookaheads, but it’s definitely a great tool to help build a coherent overview of your infrastructure. Before moving on, let’s dive into how parsers work, and this will serve as the context to justify the seemingly more complex implementation.

LALR parsers fundamentally split the tasks of interpreting the data at hand in two steps. The first one is deriving an abstract syntax tree, based on a grammar. This abstract syntax tree provides an abstract syntax representation of each element of the configuration. This is great, because with a good grammar we can encode all states of the switch configuration inside the tree, without having to rely on things like complex regexes.

This tree is still fairly tied to the configuration of the device, with many elements still vendor-dependent, but now we can build a transformer to shape the AST data into a standardised data structure that we can then use in Infra intent. Transformers are amazing, and with Lark they are even better. Effectively they are just a series of operations that you perform on the AST tree, starting from the leaf and ascending in the graph. Each base token is converted to the final representation, and more complex tokens are built by combining together already converted children.

This is particularly handy when parsing BGP peers, or route maps. All the configuration complexity is dealt with in the grammar, and the transformer is flexible enough to be able to lift all of these configurations into a standardised data structure.

We have found this approach amazing for two reasons:

  1. It’s very much rooted in great computer-science principles. Parsers and AST are quite well known in the literature, providing a vast resource of knowledge on how to test and maintain them.
  2. It’s extremely easy to extend for new features. This is great, because alternatives based on regexes tend to increase exponentially in complexity, especially if supporting complex configurations.

This set of open-source projects allows us to effectively gain a coherent view of the state of all of our network devices. The encoding is no longer device-specific, but rather completely agnostic. We call this approach “lifting the device configuration”.

Borrowing from many other applications such as LLVM or QEMU, we also have an intermediate language that we use to represent the configuration of a device. This enables us to have the same exact representation for all devices, and unsurprisingly, it comes in extremely handy when provisioning the devices.

So far we have introduced many concepts, a key one being representing the status of the device in an intermediate form. As we have seen, this allows us to lift the configuration to a higher abstraction level via a mixture of NAPALM and Lark, but this wouldn’t be done if we couldn’t do the opposite as well.

Conceptually this part is much easier, since it’s based on a set of commonly used technologies. We need an extra set of assurances, mainly around how we allow engineers to interact with the provisioning pipelines and how we validate deploys.

One of the main sets of controls we have is the idea of a device state. For the provisioning pipelines to run, the device must be in a maintenance state, and this state can be changed only via an explicit command by an engineer. If the device is in maintenance mode, the provisioning pipeline can be run.

To aid the engineer in validating the changes, we have built a Monaco-based diff visualisation framework, that will show the differences between the running config and the candidate config in a familiar VS Code-like environment.

After the engineer has validated the changes, they can proceed with triggering a deployment to the production devices. This will dispatch netpipe pipeline that will fetch the current intended device state from Infra intent and use a series of bespoke Jinja templates to lower it to the appropriate device configuration.

We then use NAPALM to push this configuration to the device, completing the task of deploying a new device.

I hope this post provided some helpful insights into the implementation challenges we’ve encountered in the context of IaC, how we’ve chosen to solve them within our network, and why we’re enthusiastic about those solutions.

If these sound like the kind of challenges you’re looking for in your own career, take a look at our job posts below or reach out to [email protected].


Related Articles

  • Machine Learning at Optiver
    Experienced, Life at Optiver

    Machine learning opportunities in capital markets

    Solving problems at scale The allure of “problems at scale” is significant for researchers aspiring to transition from academia to the private sector. At Optiver, we are constantly scaling up in every dimension – adding more features, models, financial exchanges on which we trade; and expanding our range of products, asset classes and geographic colocations. […]

    Learn more
  • Experienced, Life at Optiver, Technology

    Behind the scenes: Engineering Optiver’s global trading network

    Optiver's global trading network is a marvel of engineering, ensuring rapid and reliable data transmission essential for electronic trading. Network Engineer Ryan Bennett reveals how dedicated fibre optic cables and meticulous route planning maintain Optiver's competitive edge. Despite challenges like geographical hurdles and fibre cuts, the network's resilience and continuous improvement keep Optiver at the forefront of trading innovation.

    Learn more
    Europe, Global
  • Experienced, Meet the team

    A finance role unlike others

    As a leading proprietary trading firm, Optiver works to make markets more efficient, transparent and stable across the globe. While our commitment to provide liquidity is continuous and our aim is to be a stabilising force, financial markets and our operations are dynamic. For the Finance Team, this requires continuous improvements in finance processes to stay aligned with evolving market conditions and business strategies.

    Learn more
  • Competition, Experienced

    Advent of Code 2023: Clean Code Challenge

    In December last year, Optiver proudly entered its third year as a sponsor of Advent of Code. This annual event, structured like an advent calendar, offers tech enthusiasts from around the world the chance to test and showcase their creative programming skills with a new festive-themed puzzle each day. Our sponsorship reinforces our commitment to fostering technical innovation and a culture of continuous learning.

    Learn more
    Europe, Global
  • Experienced, Life at Optiver

    Risk and reward within a dynamic trading firm: Insights from Optiver’s CRO Europe

    In business, risk management is often thought of as a of back-office support function—the department generally responsible for steering a company away from pitfalls and worse-case scenarios with cautionary, arms-length advice. Not at Optiver. In our high-stakes trading firm environment, it’s a core discipline that directly impacts the success of daily trading operations. As Optiver […]

    Learn more
  • Nicolas_Infrastructure_as_code
    Experienced, Life at Optiver, Technology

    Navigating Infrastructure as Code (IaC) in a non-cloud trading environment

    In the high-performance landscape of algorithmic trading, technological infrastructure isn't just important—it's critical. While Infrastructure as Code (IaC) is a well-established practice in cloud-based solutions, its application in non-cloud environments presents unique challenges, especially in latency-sensitive environments like ours at Optiver.

    Learn more