MongoDB – Introduction to Replication and Sharding

Introduction

MongoDB is a general-purpose, document-oriented NoSQL database built for modern applications. It stores data as JSON-like documents (encoded internally as BSON), a model that is often more expressive than the traditional row/column model.

Performance & Scalability

Rapid data growth and rapid user adoption are now common for modern applications. One example is the popular multiplayer online game PlayerUnknown’s Battlegrounds (PUBG): it was fully released in December 2017, and by June 2018 it had reached over 400 million players worldwide.

Now, let’s consider a scenario where you have an application using MongoDB as a database.

Initially, you had 500-1000 users and your application performed well. Then adoption suddenly spikes to 50,000 users. Assuming you did a basic deployment without planning for scalability, the application will quickly outgrow its initial environment, whether because of physical data storage needs, degraded performance that calls for more resources, or a combination of these.

From the example above, one can conclude that when developing and operating applications with MongoDB, you may need to analyze the performance of both the application and its database. Degraded performance is often a function of database access strategies, hardware availability, and the number of open database connections.

At that point, it is time to scale your database.

MongoDB has built-in support for scaling. Depending on the size of your application and the forecast growth of your data, scaling can be approached in two ways:

  1. Replication
  2. Sharding

Replication

Replication involves keeping copies of the same data on multiple instances. Its main purpose is redundancy and high availability, and it can also help scale reads. It has the following benefits:

  • The data is stored redundantly, so it stays available even if one instance fails.
  • Downtime for maintenance can be largely avoided, and the data can be recovered more easily after a disaster.
  • Read throughput can increase because read traffic can be spread across instances.

Replication is achieved using a MongoDB replica set. A replica set is a group of mongod instances, typically running on different machines, that maintain the same data set.

  • A minimum of three instances is recommended for a production replica set.
  • Out of these three, one is the primary node and the other two are secondary nodes.
  • All write operations are handled by the primary node and recorded in the oplog (operations log), which is a capped collection rather than a file. The secondary nodes replicate the latest data by applying the entries from the primary node’s oplog, which keeps the data in sync across all nodes.
  • Read operations can be directed to the secondary nodes so that the load does not affect the primary node. Because replication is asynchronous, such reads may return slightly stale data.
  • If the primary node fails, the remaining nodes elect one of the secondary nodes as the new primary, and it starts handling all writes. When the failed node is restored, it rejoins as a secondary node.
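
The behavior described in the list above can be illustrated with a small in-memory simulation. This is only a sketch, not MongoDB code: the class and method names (`ReplicaSetSketch`, `sync_secondaries`, `fail_primary`, `restore`) are hypothetical, the oplog is modeled as a single shared list, and the "election" simply promotes the healthy node that has applied the most oplog entries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One replica set member (a stand-in for a mongod instance)."""
    name: str
    data: dict = field(default_factory=dict)
    applied: int = 0      # number of oplog entries applied so far
    healthy: bool = True


class ReplicaSetSketch:
    """Toy model: one primary, the remaining members are secondaries."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.primary = self.nodes[0]
        self.oplog = []   # ordered list of (key, value) write operations

    def _apply_pending(self, node):
        # Replay every oplog entry the node has not applied yet, in order.
        for key, value in self.oplog[node.applied:]:
            node.data[key] = value
        node.applied = len(self.oplog)

    def write(self, key, value):
        # Only the primary accepts writes; each write is recorded in the oplog.
        self.oplog.append((key, value))
        self._apply_pending(self.primary)

    def sync_secondaries(self):
        # Healthy secondaries catch up by replaying the oplog.
        for node in self.nodes:
            if node.healthy and node is not self.primary:
                self._apply_pending(node)

    def fail_primary(self):
        # Simplified election: promote the most up-to-date healthy node.
        self.primary.healthy = False
        candidates = [n for n in self.nodes if n.healthy]
        self.primary = max(candidates, key=lambda n: n.applied)

    def restore(self, name):
        # A restored member rejoins as a secondary and catches up.
        node = next(n for n in self.nodes if n.name == name)
        node.healthy = True
        self._apply_pending(node)


rs = ReplicaSetSketch(["node-a", "node-b", "node-c"])
rs.write("user:1", {"name": "Ada"})
rs.sync_secondaries()
rs.fail_primary()                      # node-a goes down
rs.write("user:2", {"name": "Lin"})    # handled by the new primary
rs.sync_secondaries()
rs.restore("node-a")                   # rejoins as a secondary
print(rs.primary.name, [n.name for n in rs.nodes if n.data == rs.primary.data])
```

In a real replica set each member keeps its own copy of the oplog and elections involve voting between members; the sketch leaves both out to keep the write, sync, and failover steps visible.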

Sharding

Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

A shard holds a subset of the data in a sharded cluster. Together, the cluster’s shards hold the entire data set. Each shard is deployed as a replica set, which provides redundancy and high availability for the data it holds.

Horizontal scaling is achieved through sharding. 

These are some important components of a sharded cluster:

  • Shards: Each shard stores a subset of the sharded data. A single shard can be a replica set.
  • Mongos (query routers): Routers are instances of a MongoDB service called mongos, which acts as the interface between clients and the sharded cluster. The routers are responsible for forwarding database operations from clients to the correct shard.
  • Config servers: The config servers store the metadata and configuration settings for the cluster. They are supposed to be deployed as a replica set.
  • Shard Key: The shard key determines the distribution of the collection’s documents among the cluster’s shards. The shard key is either an indexed field or indexed compound field that exists in every document in the collection.

MongoDB partitions data in the collection using ranges of shard key values. Each range defines a non-overlapping range of shard key values and is associated with a chunk.
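
As a rough illustration of range-based partitioning, the sketch below maps a shard key value to a chunk and then to a shard. The chunk boundaries, the shard names, and the `user_id` shard key are made-up values for the example, and the functions are hypothetical helpers rather than part of any MongoDB API. Each chunk covers a range that includes its lower bound and excludes its upper bound.

```python
import bisect

# Hypothetical chunk layout for a collection sharded on an integer "user_id".
# Each entry is the inclusive lower bound of a chunk; a chunk ends where the
# next one begins, so the ranges never overlap.
CHUNK_LOWER_BOUNDS = [float("-inf"), 1000, 2000, 3000]
CHUNK_OWNERS = ["shardA", "shardB", "shardA", "shardC"]


def chunk_index(shard_key_value):
    """Return the index of the chunk whose range contains the value."""
    return bisect.bisect_right(CHUNK_LOWER_BOUNDS, shard_key_value) - 1


def shard_for(shard_key_value):
    """Return the shard that owns the chunk containing the value."""
    return CHUNK_OWNERS[chunk_index(shard_key_value)]


def shards_for_query(query):
    """Target one shard if the query includes the shard key, else all shards."""
    if "user_id" in query:
        return [shard_for(query["user_id"])]
    return sorted(set(CHUNK_OWNERS))


print(shard_for(999))                        # falls in the first chunk
print(shard_for(1000))                       # lower bound is inclusive
print(shards_for_query({"user_id": 2500}))   # targeted at a single shard
print(shards_for_query({"name": "Ada"}))     # no shard key: sent to every shard
```

The last two calls show why queries that include the shard key tend to be cheaper: the router can send them to a single shard instead of broadcasting them to the whole cluster.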

[Figure: overview of the sharding process]

Sharding is a highly scalable approach for improving the throughput and overall performance of high-transaction, large, database-centric business applications.

These are the benefits of sharding:

  • Data backup/restore: Consider a replica set with a 100 GB database. A full restore might take around 2-3 hours, whereas with the same data split across two shards that are restored in parallel, it might take about 1-1.5 hours, roughly half the time. Adding more shards can reduce the time further.
  • Instance hardware limitations: Through sharding, it is possible to increase the available IOPS (input/output operations per second) by combining the IOPS of multiple machines instead of being limited to those of a single one.
  • Storage engine limitations: Storage engines limit the number of concurrent reads and writes. This limit can be raised for performance, but doing so puts more load on the processor and can degrade I/O performance. Sharding addresses this because the I/O is balanced between the shards.
  • Speed up queries: Queries can take a long time, depending on the number of reads they perform. In a sharded cluster, a query can run in parallel on several shards, which can reduce the response time. For example, a query that takes ten seconds on a single replica set might take around five to six seconds on a cluster with two shards.
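
The figures in the list above follow from simple arithmetic if the work is assumed to divide evenly across shards that run fully in parallel. The sketch below spells out that idealized assumption; the function names are hypothetical, and real clusters usually do somewhat worse than the ideal because of coordination and merge overhead, which is consistent with the five-to-six-second estimate for the query example.

```python
def ideal_per_shard_time(single_node_time, shard_count):
    """Idealized estimate: work divides evenly and shards run in parallel."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return single_node_time / shard_count


def combined_iops(iops_per_machine, shard_count):
    """Idealized aggregate IOPS when load is spread evenly over the shards."""
    return iops_per_machine * shard_count


# Restore-time range from the example: 2-3 hours on a single replica set.
for shards in (1, 2, 4):
    low = ideal_per_shard_time(2.0, shards)
    high = ideal_per_shard_time(3.0, shards)
    print(f"{shards} shard(s): restore in about {low:g}-{high:g} hours")

# Lower bound for the ten-second query on two shards.
print(ideal_per_shard_time(10.0, 2), "seconds at best")
```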

Conclusion

We can conclude that sharding and replication let us scale out MongoDB when the data is about to outgrow a single server. Some key points to consider while scaling are:

  1. The general practice is to shard when the data size approaches the storage capacity of a single server or when I/O performance starts to fall.
  2. A sharded cluster adds infrastructure requirements and complexity, so it requires careful planning, execution, and maintenance.
  3. Make sure you choose a good shard key and stay up to date with database maintenance.
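
One way to reason about what makes a shard key "good" is to look at how many documents share the same key value, since documents with an identical shard key value cannot be separated into different chunks. The sketch below is a hypothetical helper run on made-up sample data (the `user_id` and `country` fields are assumptions for the example); it compares a low-cardinality, skewed field with a high-cardinality one.

```python
from collections import Counter


def largest_unsplittable_share(documents, key_field):
    """Fraction of documents sharing the most common value of key_field.

    Documents with the same shard key value stay in the same chunk, so this
    is a lower bound on the largest chunk's share of the data.
    """
    counts = Counter(doc[key_field] for doc in documents)
    return max(counts.values()) / len(documents)


# Made-up sample: 1,000 users, 80% of them in one country.
docs = [
    {"user_id": i, "country": "US" if i % 5 == 0 else "IN"}
    for i in range(1000)
]

print("country:", largest_unsplittable_share(docs, "country"))
print("user_id:", largest_unsplittable_share(docs, "user_id"))
```

With `country` as the shard key, at least 80% of this sample would end up together on one shard, whereas `user_id` lets the data be split almost arbitrarily. Cardinality is only one consideration; query patterns and write distribution matter as well.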

References:
https://docs.mongodb.com/