From Acquisition Fragmentation to Atlas: Building Unified Observability at Scale

By: Yeshwanth Javvaji, Staff DevOps Engineer

Part 1: Our transformation journey from fragmented monitoring to a unified platform supporting 6,000 servers globally

The Challenge: When Growth Creates Complexity

At Nexxen, our platform has been built through both internal development and strategic M&A. While each acquisition brings new talent and technology, it also brings new monitoring tools – InfluxDB, Graphite, scattered Prometheus instances. Prior to our recent phase of rapid expansion, we were primarily a Datadog shop, but as our data center footprint expanded globally, our monitoring costs grew faster than our infrastructure.

Every new server meant new metrics. Every new application meant new dashboards. Every new team meant new monitoring requirements. We faced a choice: accept ever-increasing monitoring expenses or find a better way.

We knew we needed a change. Enter Project Atlas – our ambitious initiative to bring metrics and logs under one unified roof.

The Vision: Best of Both Worlds

But here’s where our story takes an interesting turn. Instead of the typical rip-and-replace approach that many companies take, we chose a different path. We looked at our existing tools with fresh eyes and asked: What if we kept the best parts and built something new for everything else?

Datadog wasn’t the problem – it was excellent at network monitoring, APM, and those out-of-the-box integrations that just work. The problem was using it for everything. So, we made a strategic decision: keep Datadog for what it does best and build Project Atlas to handle the massive scale of metrics and logs that were driving our costs through the roof.

This wasn’t just about consolidation – it was about reimagining what observability could look like. We envisioned a platform that could:

  • Handle massive scale (spoiler: we’re now monitoring 35 million active series)
  • Provide disaster resilience during data center outages
  • Scale cost-effectively as we grow
  • Serve as an internal SaaS for our global teams
  • Complement our existing Datadog investment


Our centerpiece for metrics? Grafana Mimir. It gave us the horizontally scalable, highly available backend needed to handle our ambitious requirements. For logs, we chose Grafana Loki—which we’ll cover in detail in Part 2 of this series.
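To make "horizontally scalable, highly available" concrete, here is a minimal sketch of the two Mimir properties we leaned on hardest: replicated ingest and object-storage-backed blocks. The region and bucket name are illustrative placeholders, not our production values.

```yaml
# Minimal Mimir config sketch (illustrative values, not our production config)
ingester:
  ring:
    replication_factor: 3        # each series is written to 3 ingesters for HA
common:
  storage:
    backend: s3                  # durable object storage scales independently of compute
    s3:
      region: us-east-1              # illustrative region
      bucket_name: atlas-mimir-blocks  # hypothetical bucket name
```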

The Numbers That Tell Our Story

Today, our Atlas platform monitors 6,000+ servers globally, tracking 35+ million active metric series through Grafana Mimir, while ingesting ~340 TB of logs monthly (~470 GB/hr average) through Grafana Loki.

The Technical Journey: Building for Scale

Architecture Decisions That Mattered

We chose to host Mimir and Loki in AWS while keeping our applications in on-premises data centers. This wasn’t just about cloud-first thinking – it was about resilience and scalability.

Why Cloud for Atlas

We deployed Atlas in AWS to deliver a resilient observability experience that scales with our growth. Hosting Mimir and Loki in the cloud keeps visibility intact during on‑prem data center outages, enables dynamic scaling for high‑cardinality workloads, and simplifies global access for teams. To keep performance and costs balanced at scale, we knew we needed smarter caching strategies that could increase capacity while reducing infrastructure spend. Together, these decisions give Atlas a scalable, cost‑effective foundation for both metrics and logs as data volumes continue to grow.
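As an illustration of the hybrid shape this creates, here is a hedged sketch of how an on‑prem collector might ship metrics to a cloud‑hosted Mimir over Prometheus remote write. The endpoint URL, tenant ID, and labels are hypothetical, and the queue settings are examples rather than our tuned values.

```yaml
# prometheus.yml fragment on an on-prem node (hypothetical endpoint and tenant)
global:
  external_labels:
    datacenter: dc-east                  # tags every series with its source site
remote_write:
  - url: https://mimir.atlas.example.com/api/v1/push  # Mimir's remote-write endpoint
    headers:
      X-Scope-OrgID: platform-team       # Mimir tenant for multi-tenant isolation
    queue_config:
      max_shards: 50                     # parallel senders for the WAN hop
      max_samples_per_send: 2000         # larger batches amortize round-trip latency
```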

The Breakthrough: Learning from Giants

Sometimes the best solutions come from unexpected places. While researching caching strategies, we stumbled upon a fascinating blog post from Grafana Labs about how they scaled their cloud logs Memcached cluster to 50 TB. They had faced a similar challenge: massive data volumes, cost pressures, and the need for reliable caching at scale.

Their solution was elegant: instead of throwing expensive RAM at the problem, they used NVMe-backed storage with Memcached. The trade-off was simple but powerful: accept a few extra milliseconds of latency in exchange for dramatically larger cache capacity at a fraction of the cost.

We had our aha moment. This wasn’t just about logs – we could apply the same principles to both our Mimir metrics and Loki logs caching. Instead of building complex data center infrastructure or paying premium prices for massive RAM-only caches, we could deploy Memcached clusters on NVMe-backed EC2 instances.
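To show what this looks like in practice, here is a hedged sketch of pointing Mimir's caches at a Memcached cluster whose nodes spill onto local NVMe via extstore. The addresses and sizes are hypothetical, and Loki's cache blocks are wired up along similar lines.

```yaml
# Mimir cache config sketch (hypothetical addresses, not our production layout).
# Each Memcached node runs extstore to spill onto local NVMe, e.g.:
#   memcached -m 8192 -o ext_path=/mnt/nvme0/extstore:1500G
blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+memcached-index.atlas.internal:11211
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+memcached-chunks.atlas.internal:11211
        max_item_size: 16777216    # chunks can be large; raise the per-item cap
frontend:
  results_cache:
    backend: memcached
    memcached:
      addresses: dns+memcached-results.atlas.internal:11211
```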

The Results of This Approach:

  • Massive capacity and higher hit rates: We achieved dramatically larger cache capacity and increased our cache hit rates at a fraction of the cost of RAM-only solutions.
  • Significant cost savings: We eliminated expensive direct connect costs and removed our reliance on datacenter caching infrastructure.
  • Simplified, reliable architecture: By reducing moving parts and network dependencies, we significantly improved overall system reliability.
  • Unified, scalable foundation: We established a single caching approach for both metrics (Mimir) and logs (Loki) that can easily grow alongside our data volume.

The Reality: Scale Requires Work

Those impressive numbers didn’t happen overnight. Mimir’s distributed architecture is powerful but complex – what works at 1 million series often does not translate cleanly to 35 million. Through methodical testing and iteration, we found the balance of resources, timeouts, and limits that could handle our scale cost-effectively.
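For a sense of the knobs involved, here is a hedged sketch of the kind of per‑tenant limits and timeouts we iterated on. The numbers are illustrative, not our production values, and the right settings depend heavily on cardinality and query patterns.

```yaml
# Mimir limits sketch (illustrative numbers, not our production values)
limits:
  ingestion_rate: 500000               # samples/s per tenant before throttling
  ingestion_burst_size: 5000000        # headroom for short bursts above the steady rate
  max_global_series_per_user: 12000000 # hard cardinality ceiling per tenant
  max_label_names_per_series: 40       # guards against label explosions
querier:
  timeout: 2m                          # long-range queries need headroom at this scale
```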

Real-World Impact

The results speak for themselves: engineers troubleshoot faster with unified data, operations get consistent practices globally, and the business has predictable costs with room to scale. This represents significant progress for a team that started with five disparate monitoring tools and escalating costs.

Lessons Learned: What We’d Tell Our Past Selves

  • Embrace hybrid strategies. Combining build‑and‑buy approaches allowed us to scale observability without sacrificing capabilities where commercial tools excel.
  • Design for scale early. Understanding metric cardinality and growth patterns upfront was critical to building a platform that could grow sustainably.
  • Optimize for efficiency, not perfection. Strategic trade‑offs such as smarter caching enabled us to control costs while maintaining reliability and performance.

The Transformation: Progress, Not Perfection

Today, when that 3 AM alert goes off, our story is different. We’re not claiming observability is solved—rather, Atlas represents meaningful progress. Instead of context-switching between five different platforms, our engineers now work with just two unified tools – Atlas and Datadog. It’s measurable progress, and it’s making a real difference.

Project Atlas has fundamentally changed how we think about observability at Nexxen, but more importantly, it’s reinforced a crucial lesson: every approach has tradeoffs, and there’s no universal “right answer.”

We’re pragmatic, not dogmatic. Atlas handles our massive-scale metrics and logs because we can do it better and cheaper at our volume. Datadog excels at network monitoring and APM, where their specialized expertise outweighs the cost. Together they give us complete visibility without compromise.

The journey from fragmentation to strategic observability has taught us something valuable: the best solutions come from honestly weighing the pros and cons at your specific scale and constraints. Sometimes that means building, sometimes buying, and often, like in our case, both.
