At Nexxen, the stability of our platform is core to our engineering team’s mission, ensuring that our customers have a seamless experience while we continue to innovate at a fast pace. To achieve this, we rely on our ability to make small, incremental changes, push them to our production systems quickly, and immediately see the impact those changes have on the overall health of our platform. In my previous post, we discussed why and how we test in production. In this article, we’ll dive into our observability platform and our culture of ownership.
Observability-Driven Development
In a highly concurrent, low-latency system like Nexxen’s, validating a change requires us to examine the production environment holistically. This is where our observability platform, Atlas, comes into play.
Atlas is an internally white-labeled, self-hosted Grafana LGTM stack (Loki for logs, Grafana for dashboards, Tempo for traces, and Mimir for metrics) maintained by our infrastructure team. It gives us real-time visibility into the health and performance of our production systems, enabling us to quickly detect and diagnose issues. With Atlas, every engineer has access to metrics, logs, and traces, which they can use to see how their changes are affecting the system.
At Nexxen, some of the first questions we ask when developing a new feature or making changes to our system are:
These questions are at the heart of our observability-driven development approach. By defining clear metrics upfront and ensuring that we have the necessary telemetry in place to track those metrics, we can quickly assess the impact of our changes once they’re deployed to production. This proactive approach to observability helps us catch potential setbacks early to avoid negative impacts on our customers.
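To make the idea of defining a metric upfront concrete, here is a minimal sketch in Python. It assumes latency samples for the baseline and the changed code path have already been exported from a telemetry store. The names `SuccessMetric`, `p95`, and `within_budget` are hypothetical and are not part of Atlas or Grafana; the 5% budget is an illustrative value, not a Nexxen threshold.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SuccessMetric:
    """A success metric agreed on before the change ships (hypothetical type)."""
    name: str
    max_regression: float  # allowed relative increase, e.g. 0.05 means 5%


def p95(samples: list[float]) -> float:
    """Return the 95th percentile of the samples."""
    # quantiles(..., n=20) yields 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20, method="inclusive")[-1]


def within_budget(metric: SuccessMetric,
                  baseline: list[float],
                  candidate: list[float]) -> bool:
    """Check whether the candidate's p95 stays within the agreed regression budget."""
    return p95(candidate) <= p95(baseline) * (1 + metric.max_regression)


# Illustrative data only: latencies in milliseconds before and after a change.
latency = SuccessMetric(name="request_latency_ms_p95", max_regression=0.05)
baseline_ms = [10.0] * 100
candidate_ms = [10.2] * 100
regressed_ms = [12.0] * 100

ok = within_budget(latency, baseline_ms, candidate_ms)
bad = within_budget(latency, baseline_ms, regressed_ms)
```

In practice a check like this would more likely live in a dashboard or alert rule than in application code; the point of the sketch is only that the metric and its acceptable range are written down before deployment, so the post-deploy comparison is mechanical.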
Observability-driven development not only helps us identify and resolve issues more efficiently, but also enables us to continuously optimize our systems. By analyzing the telemetry data collected by Atlas, we can identify performance bottlenecks, resource inefficiencies, and opportunities for improvement. These insights let us proactively make optimizations and architectural changes that enhance the overall reliability and scalability of our platform.
A Culture of Ownership
Perhaps most importantly, Nexxen has a culture of ownership where every engineer is given the knowledge, tools, responsibility, and trust they need to own their work end-to-end. We all know how our systems work, and nothing is “thrown over the wall” for another team to run or monitor in production.
To support this mindset, we have invested heavily in production-related tooling and practices. Engineers are encouraged to actively engage with production systems daily, as that is where our users interact with our code and infrastructure. We have built robust guardrails and safety nets that enable us to confidently make changes. By fostering a culture of trust, ownership, and continuous improvement, we are able to deliver exceptional value to our customers while maintaining a stable and reliable platform.
Conclusion
At Nexxen, we pride ourselves on our platform’s stability and our ability to keep improving our technology as we grow. By testing realistically in production, tracking success metrics and analyzing performance data, and fostering a culture of ownership throughout our engineering teams, we deliver both innovation and stability.