ChakrDB: A Distributed RocksDB Born in the Cloud, Part 1
History, Context, and Motivation
Through its web-scale, scale-out design, the Nutanix cloud operating system offers comprehensive virtualized storage and compute with high availability (HA), resilience, and fault tolerance, giving you a public cloud experience in your on-prem datacenter. Two key technologies underlying our software are Nutanix distributed storage and a Kubernetes-based microservices platform (MSP) for compute management. You’ll find comparable technologies in any public cloud service — for example, the Amazon Web Services (AWS) Elastic Block Store (EBS) or managed disks in Azure for storage and the AWS Elastic Kubernetes Service (EKS), the Azure Kubernetes Service (AKS), or the Google Kubernetes Engine (GKE) for container orchestration.
Today’s distributed databases and data services can be much more lightweight and advanced if they take advantage of these cloud infrastructure constructs. Modern data applications also need to follow certain principles so they can run in any public or private cloud. Making applications cloud-ready usually involves open-source API standards, standardized iSCSI storage endpoints, and an open container orchestration framework like Kubernetes (K8s) for infrastructure operation management.
When the Nutanix Objects team began our work with ChakrDB, Nutanix Engineering was building storage services on our cloud operating system. Nutanix Objects is S3-compliant hybrid cloud storage for unstructured data, built to take advantage of cloud constructs. We selected RocksDB as the foundation for our distributed key-value store (KVS) in the cloud, then built a distribution layer that we named ChakrDB on multiple RocksDB instances to create a standalone multinode cluster system.
This blog has two parts:
- Part 1: Covers the reasoning and theory behind building a distributed KVS in the cloud. We describe how to avoid design complexities while achieving resilience, HA, consistency, seamless dev-ops workflows, and so on.
- Part 2: Provides in-depth coverage of the ChakrDB architecture, benchmarking results, and the projected roadmap for the product.
Objects Metadata Layer Requirement
Objects was the Nutanix product that made a cloud-native scalable KVS necessary. For metadata, Objects needed a KVS that would satisfy several enterprise-grade KVS and cloud-native design requirements, including the following.
- Scale and performance. We built Objects with a very large scale in mind. Because some of the initial use cases for Objects involved secondary storage or backup data, we needed to use deep-storage nodes (120–300 TB capacity per node) and minimal CPU and memory resources to optimize for cost. With 120–300 TB of data, a single node at full capacity would also have multiple terabytes of metadata. But optimizing Objects for capacity and cost didn’t mean that we would compromise on performance. We also wanted to make Objects really fast, so it can run high-performance workloads like artificial intelligence (AI) and machine learning (ML), data lakes, big data analytics, and so on. As virtually all cloud applications use a blob store for persistent storage, we knew that the metadata layer needed to have low latency and high throughput, even at massive scale and with enormous working set sizes. A small-scale minimum viable product (MVP) wouldn’t work at all.
- Scale-out distributed system. Objects started with a stateless microservices model that stores all objects in the underlying highly available Nutanix distributed file system. The metadata layer handles all the state and sharding constructs for Objects. To scale Objects out, we needed a scale-out metadata layer as well, which meant we had to build a distributed KVS or a stateful distribution layer.
- Strict consistency and strong resilience. For Objects to be an enterprise-ready storage system, the metadata layer also needed read-after-read and read-after-write consistency guarantees. At the same time, the system had to be highly resilient in case of any form of hardware domain failure or data loss.
- Transactional capabilities. Objects needed at least a shard-level transactional capability, as there are multiple atomic metadata operations.
- Cloud native. Because we built Objects as a cloud-native service on our K8s-based MSP, our distributed KVS also needed to use K8s constructs for its compute and memory workflows.
- Seamless automated cloud workflows. For Objects to be infrastructure software that can run on any platform, the workflows admins use to scale out instances or scale up resources had to be consumer-grade — a core Nutanix design principle.
- Any cloud or remote storage. We wanted the KVS to be able to run on any cloud, multiple clouds, or any remote storage. This requirement isn’t specifically related to Objects but is part of future-proofing the KVS design for use across all Nutanix cloud products. While designing, it’s always good to consider where else you can use the system in the future.
To fulfill these design requirements, we had to take advantage of cloud storage and compute infrastructure to ensure that the resulting application can run seamlessly on any public cloud, on-prem, or in a hybrid cloud environment as needed without much refactoring or extra work.
Once we’d identified these requirements, we first worked through some theoretical reasoning before trying out a proof of concept (PoC), as even building an MVP to test a hypothesis can require significant effort. We also wanted to make sure that the KVS could satisfy future hybrid cloud and multicloud requirements across our products.
Some of these requirements and challenges — like scale, consistency, performance, resilience, scale-out architecture, and so on — were not new to us. We had spent several years working on the forked version of Cassandra used for our Nutanix distributed storage fabric metadata. We’ve learned a lot over eight years of optimizing and automating the Cassandra data path and workflows in Nutanix storage. For some of the specific lessons we learned, see the blog post The Wonderful World of Distributed Systems and the Art of Metadata Management and the USENIX paper Fine-Grained Replicated State Machines for a Cluster Storage System.
But our extensive experience doesn’t mean we just reused or repackaged past work for the cloud. We had to consider how to best take advantage of cloud infrastructure capabilities and ensure compatibility with hybrid cloud and multicloud environments. A fast, distributed, and consistent cloud-ready KVS would support multiple infrastructure products in development at Nutanix, including our core storage layer, Files, Prism Central, and a long-term secondary storage service project.
As cloud storage and Kubernetes were already providing scale-out, highly available, and consistent storage and compute, we knew we needed to take advantage of them in the KVS. A traditional distributed KVS uses data replication, active-standby, data migration, and so on to achieve HA and resilience, but these constructs seemed to only add unnecessary complexity and resource overhead in the cloud, where the infrastructure is already highly available, resilient, consistent, and scalable. Cloud infrastructure can recover various failure domains (disks, hosts, blocks, and so on) in a few seconds, which is comparable to the delay experienced with most replication-based strategies, where a system’s health check takes a few seconds to detect a failed host and then change the replica leader. In the world of highly available cloud infrastructure, we can easily offload some of the complex replication, HA, consistency, and scaling problems to the underlying cloud platform. It’s time to start thinking differently about distributed databases.
In general, most teams have three choices for fulfilling a KVS requirement:
- Use a new open-source system for the use case.
- Reuse or repackage an existing system.
- Build a new product.
The following sections summarize our thought process in considering each of these options.
Use a New Open-Source System
Bringing in a new open-source system is usually a major undertaking. For a PoC or a small-scale, low-performance use case, it can be easy to get started, but with requirements like ours (very low latency and high throughput with minimal resource consumption), the options narrow considerably.
Building expertise and optimizing a new open-source DB can take months or years. But aside from the time and effort required, very few open-source DBs can realistically claim to have the capabilities we needed. A storage system needs a very lightweight distributed KVS to achieve high throughput with low latency. Most open-source DBs try to balance features, resources, and performance, which makes them heavier. Several distributed KVS options don’t prioritize resilience or consistency. Most systems default to less durable, eventually consistent settings, and any changes to make the DB more durable and strictly consistent can reduce performance significantly. Open-source DBs typically buffer writes in memory and use asynchronous replication models as defaults, which wouldn’t meet the requirements for an enterprise storage product like Objects. Furthermore, most open-source KVS options are built to work either on-prem or on specific cloud platforms; very few are compatible with hybrid clouds.
Reuse Existing Nutanix Cassandra
Another option was to use our forked version of Cassandra. The Cassandra fork we’d used for the previous eight years had served us well in terms of throughput, consistency, resilience, automated workflows, and so on, but it also had a few drawbacks. For one thing, we’d built it for the traditional infrastructure world of disks and virtual machines (VMs). We’d customized it for a cloud infrastructure rather than for applications running on a cloud infrastructure. Because our version didn’t support advanced sharding features like virtual nodes, each shard or node could become very large. Even though we further sharded the shard range of a single node across multiple disks, the I/O path could still encounter a bottleneck at a single commitlog or write-ahead log (WAL) of the log-structured merge (LSM) tree. We believed that basing our sharding scheme on virtual nodes could be much more scalable for throughput and tail latency while also providing better flexibility for data movement and placement. We wanted each virtual node to be a KVS engine so we could fully distribute resources across them.
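The virtual-node idea can be sketched in a few lines: hash each key into a fixed space of virtual nodes, then map each virtual node to a physical node (and, in our design, to its own KVS engine). The hash function, vnode count, and node mapping below are illustrative assumptions, not ChakrDB’s actual scheme:

```python
import hashlib

NUM_VNODES = 64  # fixed virtual-node space (illustrative)


def key_to_vnode(key: bytes, num_vnodes: int = NUM_VNODES) -> int:
    """Hash a key into the virtual-node space deterministically."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_vnodes


# Each vnode maps to one physical node and owns one KVS engine instance.
# Scaling out means reassigning whole vnodes, not resharding key ranges.
vnode_to_node = {v: v % 4 for v in range(NUM_VNODES)}  # 4 physical nodes

key = b"bucket/object-123"
v = key_to_vnode(key)
node = vnode_to_node[v]
```

Because writes to different vnodes land in different engines (each with its own WAL), this layout avoids the single-commitlog bottleneck described above and spreads load evenly as vnodes are rebalanced.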
As load increases, Cassandra slows down because of compaction and heavy Java garbage collection (GC). Significant GC pauses make it difficult to achieve consistently low average and tail latencies in Cassandra. This limitation was critical for many of the tier-1 workloads that Nutanix customers run. Although Nutanix Cassandra could store 100 GB to 1 TB of data per instance with just 3–6 GB of memory, numbers that are quite compelling for the workload size, we had to keep increasing the memory assigned to Cassandra to avoid GC pauses. A growing memory footprint is a drawback for infrastructure software running on-prem.
We designed Nutanix Cassandra for VMs and physical servers rather than for the K8s world, so we had to rethink and optimize its built-in replication and fault tolerance for a new cloud storage and compute infrastructure. The strong Paxos consistency in Cassandra replication seemed like overhead on top of the highly available storage, compute, and memory layer that already existed in the cloud.
Making the same Cassandra code work in both VMs and Kubernetes seemed like a costly re-architecture that would have the same scaling and performance limitations as our original design. Reworking Cassandra would also disrupt the existing release plans for the Nutanix distributed storage fabric that is packaged with our Nutanix operating system (AOS).
Build a New Product
While we were considering these options, RocksDB was showing promising performance results in the autonomous extent store (AES), our AOS disk-local metadata DB. (You can read more about the AES in the previous RocksDB at Nutanix blog post.) Because RocksDB was giving us great results, it didn’t make sense to build another storage engine.
With highly available cloud storage like EBS or the Nutanix storage fabric already implemented, we didn’t have to build complex consistent replication protocols like Paxos or Raft for our distributed KVS; we just needed to add a distribution layer to RocksDB. By letting go of replication, we could get better performance while avoiding the heavy memory and storage resources needed to store replicas.
RocksDB has minimal memory requirements for the large amounts of data it can store. (More about memory in RocksDB in a planned post on compaction.) Theoretically, RocksDB doesn’t need more memory to store more data; more memory just means more performance.
Because we can embed RocksDB in other processes, we get a significant advantage for latency and throughput, allowing us to create a distributed KVS that we could embed into client software. Adding even a single remote procedure call to the data path can increase latency and reduce throughput significantly due to TCP connection and epoll overhead. Cassandra is written in Java, so it can’t be cleanly or easily embedded.
RocksDB also has an Env backend that allows you to plug in any underlying storage layers. With this functionality, we could make the KVS storage-agnostic, supporting various underlying storage systems like Nutanix distributed storage, EBS, and Azure managed disks.
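As a rough analogy to what the Env backend makes possible, here is a toy storage-abstraction sketch in Python. The interface and class names are our own invention for illustration, not RocksDB’s actual C++ `Env` API; the point is that an engine written against the abstract interface runs unchanged on any backend that implements it:

```python
from abc import ABC, abstractmethod


class StorageEnv(ABC):
    """Illustrative analogue of a pluggable storage backend."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class InMemoryEnv(StorageEnv):
    """One possible backend; a cloud-disk or distributed-FS backend
    would implement the same two methods."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]


# The engine only ever talks to StorageEnv, so swapping the backend
# (Nutanix storage, EBS, Azure managed disks) requires no engine changes.
env: StorageEnv = InMemoryEnv()
env.write("/wal/000001.log", b"put k1 v1")
```

In RocksDB itself, the same idea lets an implementation redirect file operations for WALs and SST files to an arbitrary storage layer, which is what makes a storage-agnostic KVS practical.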
The Way Forward
After all this analysis and our own testing, we decided to build a distributed, cloud-native, highly available, and high-performance KVS based on RocksDB and suitable for scales ranging from dozens of terabytes to multiple petabytes.
When we started building ChakrDB, the team was already familiar with RocksDB. We knew it could provide low latency and high performance. We knew that a distributed scale-out design can scale linearly with the number of shards or virtual nodes across instances. We knew we would have to take advantage of cloud infrastructure constructs to redesign the existing KVS to be compatible with multicloud or hybrid cloud environments. And we knew that developing a new DB that could measure up to the standards of enterprise storage would have taken a huge effort.
So we chose to build the lightweight DB distribution layer on RocksDB: ChakrDB. Cloud infrastructure constructs like K8s and highly available block storage solved a lot of the tough problems involved in creating this distributed KVS while delivering all the simplicity and scalability of cloud-native software. In this new era of hybrid cloud and multicloud infrastructure, we need to rethink database software design.
In part 2, we’ll discuss the ChakrDB distributed KVS architecture in detail and review how the reasoning and theory we’ve covered here came to practical fruition.