Lessons learned operating Redis Clusters at Roblox

By Jan Berktold

At Roblox, we rely heavily on our Redis and Memcached caching layers to support the experiences of our 150 million monthly active users. Early in 2020, we ran into reliability and infrastructure management bottlenecks with our caching deployments and decided to invest in building a fully automated way to operate open-source Redis Clusters via Docker on top of HashiCorp’s Nomad scheduler.

As of today, Roblox engineers have used our self-service APIs to create hundreds of Redis clusters that serve 8 million QPS at peak times. This has drastically simplified management of the infrastructure and made us resilient against large-scale hardware failures.

This talk walks through Roblox’s requirements for its Redis deployments, tells the story of building our Redis orchestration services, and shares lessons learned along the way.