Nexxen’s platforms handle a ton of information, with our SSP alone processing billions of incoming advertising requests per day. Each of those ad requests can potentially be sent to dozens of customers who may want to buy the ad slot. These requests are serviced by hundreds of servers, each processing hundreds of requests per second per core. Operating consistently, especially as Nexxen continues to scale, requires a finely tuned architecture combining efficiency, reliability, and a commitment to minimizing overhead. In this post, I’ll break down the key elements that allow Nexxen to maintain an ambitious level of performance: technology choices, caching strategies, data flow, and quality control.
Here’s a look into how we operate at scale:
Platform: Lean and Efficient
Nexxen’s real-time platform is mostly built using Node.js running without frameworks or virtualization, which reduces computational overhead. This keeps the system lean, with more direct control over processing resources. Node’s single-threaded, event-driven processes are uniquely suited to our heavily asynchronous workload: a single incoming request may generate dozens of real-time outgoing requests to potential buyers (DSPs). On the backend, a hierarchy of non-real-time servers aggregates reporting data for the business and migrates frequent configuration changes to the SSP.
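To make that fan-out concrete, here’s a minimal, hypothetical sketch of how one incoming ad request could be forwarded to several DSP endpoints concurrently on Node’s event loop. The endpoint URLs, timeout value, and function names are illustrative, not Nexxen’s actual code.

```js
// Hypothetical sketch: fan one incoming ad request out to many DSPs.
// Endpoint URLs and the timeout are illustrative only (Node 18+ fetch).
const DSP_ENDPOINTS = [
  'https://dsp-a.example.com/bid',
  'https://dsp-b.example.com/bid',
  'https://dsp-c.example.com/bid',
];

async function fanOutBidRequest(adRequest, timeoutMs = 150) {
  // Fire all outgoing requests concurrently; the event loop handles the I/O.
  const results = await Promise.allSettled(
    DSP_ENDPOINTS.map((url) =>
      fetch(url, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(adRequest),
        signal: AbortSignal.timeout(timeoutMs), // drop bidders that respond too slowly
      }).then((res) => res.json())
    )
  );

  // Keep only the bidders that answered in time with a parseable response.
  return results
    .filter((r) => r.status === 'fulfilled')
    .map((r) => r.value);
}
```

Because the work is almost entirely I/O-bound, a single Node process can keep many of these fan-outs in flight at once without threads.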
Caching: Regional and Fast
Key Value Cache
To ensure the utmost performance, the system is heavily cached, relying on an in-memory data store present in each of Nexxen’s regional data centers. Nexxen’s solution is unique, but the industry is packed with good open-source and commercial options such as Redis, Memcached, Ignite, and Aerospike.
The cache layer serves as a robust real-time cache of temporal request data, such as attributes associated with various domains. It also holds a list of anonymous IDs, ensuring the system can personalize each incoming request, maximizing advertiser ROI and payouts to Nexxen’s customers. Its regional setup ensures low-latency data access, which is crucial for the speed of our platform. Finally, the platform leverages the in-memory store to accumulate real-time stats before they are further aggregated and sent to the global reporting databases. The store handles quick, in-memory operations, making it a natural fit for this use case.
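To illustrate the kind of lookups involved, the sketch below uses Redis, one of the open-source options named above, as a stand-in for Nexxen’s own store; the key names and fields are hypothetical.

```js
// Illustrative only: Redis stands in for the regional in-memory store,
// and the key names / fields are hypothetical.
const { createClient } = require('redis');

const cache = createClient({ url: 'redis://regional-cache.local:6379' });

async function enrichRequest(adRequest) {
  if (!cache.isOpen) await cache.connect();

  // Temporal attributes keyed by the requesting domain.
  const domainAttrs = await cache.hGetAll(`domain:${adRequest.domain}`);

  // Anonymous ID used to personalize the request.
  const anonId = await cache.get(`anon:${adRequest.userKey}`);

  return { ...adRequest, domainAttrs, anonId };
}
```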
Configuration: Optimized Distribution
Smaller configuration changes are distributed and synced to regions via simple tools. Larger configuration changes are shared across thousands of Node processes using a custom-built shared memory module. This approach allows Nexxen to share gigabytes of configuration efficiently without replicating it across hundreds of processes per server.
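Nexxen’s shared memory module is custom-built, so its internals aren’t shown here. The sketch below only illustrates the underlying idea, one copy of a large configuration per server rather than one per worker, using Node’s built-in worker_threads and a SharedArrayBuffer (the real module spans separate processes, and the configuration contents here are made up).

```js
// Minimal sketch, not Nexxen's module: one read-only config buffer shared
// with several worker threads instead of being copied into each of them.
const { Worker, isMainThread, workerData } = require('node:worker_threads');

if (isMainThread) {
  // Hypothetical configuration blob; in production this could be gigabytes.
  const json = JSON.stringify({ maxBidders: 40, bidTimeoutMs: 120 });
  const bytes = Buffer.from(json, 'utf8');

  // Copy the configuration into shared memory exactly once.
  const shared = new SharedArrayBuffer(bytes.length);
  new Uint8Array(shared).set(bytes);

  // Each worker receives a reference to the same memory, not a copy.
  for (let i = 0; i < 4; i++) {
    new Worker(__filename, { workerData: shared });
  }
} else {
  // Workers decode the shared bytes on demand; no per-worker duplicate blob.
  const config = JSON.parse(Buffer.from(workerData).toString('utf8'));
  console.log('worker sees bidTimeoutMs =', config.bidTimeoutMs);
}
```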
Data Output: Logs and Real-time Insights
Logging billions of daily log lines is handled by a multi-stage pipeline: logs of every request are written to the local file system, compressed, and then pushed to both a global NAS and S3 for long-term storage. These logs are regularly consumed by our global reporting databases, ensuring accurate and up-to-date data for analysis.
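Here’s a rough sketch of the compress-and-ship stage of such a pipeline, pushing a rotated log file to S3 with the AWS SDK; the bucket name and paths are hypothetical, and the NAS copy and downstream ingestion are omitted.

```js
// Hypothetical sketch of the compress-and-ship stage of the log pipeline.
// The bucket name and file paths are made up; error handling is minimal.
const fs = require('node:fs');
const zlib = require('node:zlib');
const { pipeline } = require('node:stream/promises');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' });

async function shipLogFile(localPath) {
  const gzPath = `${localPath}.gz`;

  // Compress the rotated log file on the local file system.
  await pipeline(
    fs.createReadStream(localPath),
    zlib.createGzip(),
    fs.createWriteStream(gzPath)
  );

  // Push the compressed file to S3 for long-term storage.
  await s3.send(new PutObjectCommand({
    Bucket: 'example-ssp-logs',
    Key: `requests/${new Date().toISOString().slice(0, 10)}/${gzPath.split('/').pop()}`,
    Body: await fs.promises.readFile(gzPath),
  }));
}
```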
Aggregating real-time stats is also quite a challenge. The system accumulates these stats in the local cache described above, keeping Nexxen’s enterprise reporting up to date with near real-time information.
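A common pattern for this, sketched below with Redis counters standing in for Nexxen’s store and with invented metric names, is to increment cheap in-memory counters on the hot path and have a periodic job read and reset them before forwarding the totals to the reporting databases.

```js
// Illustrative only: Redis counters stand in for the regional store,
// and the metric key names are invented.
const { createClient } = require('redis');

const cache = createClient({ url: 'redis://regional-cache.local:6379' });

// Hot path: cheap in-memory increments, no reporting-database writes per request.
async function recordImpression(publisherId, revenueMicros) {
  if (!cache.isOpen) await cache.connect();
  await cache.incr(`stats:${publisherId}:impressions`);
  await cache.incrBy(`stats:${publisherId}:revenue_micros`, revenueMicros);
}

// Periodic job: read and reset the accumulated counters, then hand them to reporting.
async function flushStats(publisherIds, sendToReporting) {
  if (!cache.isOpen) await cache.connect();
  for (const id of publisherIds) {
    const impressions = await cache.getDel(`stats:${id}:impressions`);
    const revenueMicros = await cache.getDel(`stats:${id}:revenue_micros`);
    if (impressions) {
      await sendToReporting({
        publisherId: id,
        impressions: Number(impressions),
        revenueMicros: Number(revenueMicros),
      });
    }
  }
}
```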
Finally, Kafka plays a vital role in communication across the platform: it streams ad request data to other teams and carries the change events that update the regional caches.
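As a rough illustration, using the open-source kafkajs client with made-up broker addresses and topic names (not necessarily what Nexxen runs), publishing ad request events and consuming cache-update messages might look like this:

```js
// Hypothetical sketch using kafkajs; brokers and topic names are made up.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'ssp-node',
  brokers: ['kafka-1.local:9092', 'kafka-2.local:9092'],
});

// Producer side: stream ad request data to other teams.
// (A real service would keep one long-lived, connected producer.)
async function publishAdRequest(event) {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'ad-requests',
    messages: [{ key: event.requestId, value: JSON.stringify(event) }],
  });
  await producer.disconnect();
}

// Consumer side: apply change events to the regional cache.
async function consumeCacheUpdates(applyToCache) {
  const consumer = kafka.consumer({ groupId: 'regional-cache-updater' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'cache-updates', fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      await applyToCache(JSON.parse(message.value.toString()));
    },
  });
}
```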
Quality: Full Stack Responsibility
The team’s quality process is rooted in developer responsibility and thoroughness: every developer is responsible for running the entire stack, including real-time, batch, caches, and databases, on their development workstation, ensuring they are familiar with the system end-to-end. Developers are also responsible for implementing changes across the complete stack, from the database through the batch servers to the real-time servers and back to reporting, covering both coding and deployment. That means developers frequently switch between SQL, PHP, and Node during development, which can improve development velocity. Every completed change request requires two reviewer approvals before being merged into the platform, catching potential issues early on. Even then, remaining issues can be caught by the platform’s canary test servers, where changes run first before automated deployment to the rest of the real-time servers.
Conclusion
Scaling our platform successfully comes down to strategic architecture and technology choices: a multi-layered caching system, efficient data flows, and a commitment to code quality and uptime. By building with minimal overhead, leveraging fast in-memory caches, and ensuring every part of our infrastructure is robust, Nexxen maintains a system that can handle high throughput while remaining highly available.
About the Author
I’ve been deeply involved in developing the Nexxen SSP for eight years. Every step of the way, whether it’s optimizing performance or ensuring seamless integration as Nexxen expands, has been like tackling a new puzzle. What keeps me engaged is the thrill of solving complex issues that come with scaling. The bigger the challenge, the more interesting it gets – especially when the solution turns out to be something that not only fixes the issue, but also sets the groundwork for future growth. It’s this dynamic nature of scaling that makes the work so rewarding!