Innovating at Scale – Practices from Within Nexxen Engineering (Part 1)

By: Chris Trader, Senior Software Engineer

At Nexxen, the stability of our platform is core to our engineering team’s mission, ensuring that our customers have a seamless experience while we continue to innovate at a fast pace. To achieve this, we rely on our ability to make incremental changes, push them to our production systems quickly, and immediately see the impact those changes have on the overall health of our platform. In this series, I will highlight a few practices the Nexxen engineering team uses to innovate quickly at scale while minimizing change risk and keeping our production systems stable. This first article discusses why and how we test in production rather than in staging environments.

Testing in Production

When done with the right safeguards and observability in place, testing in production enables engineers to gain immediate confidence in their changes and ship faster than would otherwise be possible. It is also, arguably, safer than traditional “staging environment” testing, which doesn’t capture the full complexity of real-world conditions and is therefore a less reliable predictor of production readiness.

While staging environments serve a purpose, they are often unable to fully mimic the complexity of live systems, especially at scale. No amount of preparation and testing in a development or staging environment is the same as running your code on a production machine. The hardware is not the same, the network is not the same, the data is not the same, nor are the patterns and behaviors of interactions between different system components.

At Nexxen, we shorten the feedback loop and test directly in production through canary deployments, leveraging the power of Kubernetes to make this process seamless. Canary deployments involve rolling out changes to one or two production servers, limiting exposure to a small percentage of traffic, and closely monitoring the performance and behavior of the canaries before releasing the changes more broadly. Operating at scale both enables and requires us to do this.

For example, Nexxen’s DSP serves millions of requests per second across four datacenters, with an average latency under 80 milliseconds. This, of course, requires a substantial amount of hardware. That scale lets us target a small subset of servers to test new changes, but it also demands a substantial amount of precision: even a small increase in garbage collection time or ten milliseconds of additional latency could be detrimental to overall system performance. At that level of sensitivity, testing anywhere but production doesn’t inspire confidence.
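
To make the idea of a canary check concrete, here is a minimal sketch of how canary metrics might be compared against the rest of the fleet before a change is promoted. It is an illustration only, not a description of Nexxen’s actual tooling: the `HostMetrics` type, the `canary_is_healthy` function, and both threshold values are hypothetical, and a real check would likely look at percentiles, error rates, and many more signals.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class HostMetrics:
    """Samples collected from one group of hosts over the same time window."""
    latencies_ms: list[float]  # per-request latency samples
    gc_pause_ms: list[float]   # garbage collection pause time per interval


def canary_is_healthy(
    baseline: HostMetrics,
    canary: HostMetrics,
    max_latency_increase_ms: float = 10.0,  # hypothetical latency budget
    max_gc_increase_ms: float = 1.0,        # hypothetical GC pause budget
) -> bool:
    """Return True if the canary stays within both budgets relative to baseline."""
    latency_delta = mean(canary.latencies_ms) - mean(baseline.latencies_ms)
    gc_delta = mean(canary.gc_pause_ms) - mean(baseline.gc_pause_ms)
    return latency_delta < max_latency_increase_ms and gc_delta < max_gc_increase_ms


# Example: the baseline fleet versus two candidate builds running on canaries.
baseline = HostMetrics(latencies_ms=[70, 75, 80], gc_pause_ms=[5, 5, 5])
good_canary = HostMetrics(latencies_ms=[72, 76, 81], gc_pause_ms=[5, 5, 6])
slow_canary = HostMetrics(latencies_ms=[85, 88, 90], gc_pause_ms=[5, 5, 5])

print(canary_is_healthy(baseline, good_canary))  # within both budgets
print(canary_is_healthy(baseline, slow_canary))  # exceeds the latency budget
```

Comparing the canary against hosts that are serving the same live traffic at the same time is what makes the result meaningful: both groups see the same hardware, network, and data, so a difference in the metrics can be attributed to the change itself.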

We still follow thorough SDLC (Software Development Lifecycle) procedures before any change is applied to production: passing unit and integration tests, undergoing code reviews, using feature flags to manage new functionality, and obtaining the proper approvals. In addition, Nexxen has invested heavily in modernizing our CI/CD pipelines so that we can rapidly and safely deploy changes, and roll them back, across every part of our production system. This modernization enables us to deliver features faster without compromising the stability of our platform.

Testing in production is the ultimate quality control checkpoint, ensuring that our changes work as intended in the real-world environment where they will ultimately run.

In the next article, we’ll explain further by exploring our observability platform and our culture of ownership.
