Home System Design Tutorial What is Database Sharding and how is it done?

What is Database Sharding and how is it done?

Database sharding is a process of breaking up large tables into multiple smaller tables or chunks called shards and distributing data across multiple machines or clusters. Each shard has the same schema and columns like that of the original table but data stored in each shard is unique and independent of other shards.

Database Sharding

Database sharding is pretty much similar to horizontal scaling(scaling-out). Hence it allows us to add more machines to an existing cluster in order to spread out the load, allow more traffic, and faster processing. Also, sharding helps to make our application distributed thus minimizing a single point of failure.

Database Sharding Techniques

Database sharding needs to be done in such a way that the incoming data should be inserted into a correct shard, there should not be any data loss and the result queries should not be slow. Considering all this in mind, let’s see what techniques we have to shard databases.

1) Hash-Based Sharding

In Hash-based sharding (aka key-based sharding) we take a key value (such as customer Id, client IP address or email id, etc based on criteria we have already decided) from newly inserted data, pass it to the hash function and insert the data into resulting shard number.

Hash Based Sharding

It is the simplest sharding algorithm and can be used to evenly distribute data among shards and prevent the risk of having a database hotspot. The database hotspot problem arises when one shard is accessed more as compared to all other shards and hence, in this case, any benefits of sharding the database are canceled out by the slowdowns and crashes).

The main issue with this approach is that it gets really challenging to dynamically add or remove a database server. Every time this happens, we need to re-shard the database which means we need to update the hash function and rebalance the data. Also, if it happens very frequently, this can cause data loss too.

One solution to the above problem is using consistent hashing. Consistent hashing provides scalability even when we have lots of data among lots of servers(database server), and the number of available servers changes continuously(either a new server is added or a server is removed). See “Consistent Hashing” to learn how.

2) Range-Based Sharding

In range-based sharding, the shard is chosen on the basis of the range of a shard key. The range of sharding is chosen in such a way that as shard key is likely to fall in any one of all the possible ranges.

Let’s say we have a recommender system that stores all the information about a user and recommends user movies based on their age. Hence we can create a few different shards and divide up each user’s information based on which age range they fall into, something like this:

Range Based Sharding

Range-based sharding is easy to implement as we just need to check the range in which our current data falls and insert/read data from the shard corresponding to that shard. Also, every shard holds a different set of data, but the schema of all the shards is the same as that of the original database.

The major drawback of this technique is that if our data is unevenly distributed, again it can lead to database hotspots.

3) Directory-Based Sharding

In directory-based sharding have a lookup table also known as location service. It stores the shard key to keep track of which shard store what entry. To read or write data, the client engine first consults the lookup table to find the shard number for the corresponding data using the shard key and then visits a particular shard to perform further operation.

Directory Based Sharding

This is pretty much similar to range-based sharding except instead of determining which range the shard key’s data falls into, each key is tied to its own specific shard.

Unlike hash-based sharding which uses a fixed hash function and range-based sharding which requires us to specify a range in advance, directory-based sharding allows you to use whatever system or algorithm we want to use to assign data entries to shards, and also it relatively easy to dynamically add shards using this approach.

The main issue with directory-based sharding is we need to consult a lookup table before every read and write query hence it can impact application performance. Also, the lookup table is prone to a single point of failure. One solution to this problem can be using load balancers but again frequently updating the copy of the lookup-table in each server will be overhead.

See how FreshWork is using directory-based sharding “How we scaled Freshdesk“.

4) Geo-Based sharding

Geo Based Sharding is similar to range-based sharding. In Geo-based sharding, the data is processed by a shard corresponding to the user region or location. Tinder use Geo-Based sharding. The tinders geo-bounded sharding has a limit of 100 miles and ensures the production load of the geo-shards is balanced.

Benefits of Sharding

  1. Database sharding helps us to facilitate horizontal scaling. Hence we can add more machines to the existing server and distribute the load to scale up the application.
  2. Faster query response time. Without database sharding, the database needs to compare a query to each and every row and it can be a huge setback. But with sharding instead of traversing all rows, we need to traverse only a few rows.
  3. Sharding makes maintenance becomes easier.
  4. Database sharding eliminates the problem of a single point of failure and makes our application more fault-tolerant.
  5. Reduced cost. Adding more RAM and storage to a single is machine (Vertical Scaling) is expensive while having several nodes with less computation power are cheaper. See “Why is vertical scaling expensive?”  to learn more.

Drawbacks of Sharding

  1. Database sharding becomes complex when it comes to practical implementation. Also, if done incorrectly can lead to data loss or corrupt tables.
  2. Another major issue with sharding is if shards become unbalanced.
  3. Once the database is sharded, it becomes very difficult to return back to the original un-sharded version.
  4. Not all databases natively support sharding. For Example, PostgreSQL doesn’t support automatic sharding features, though it is possible to manually shard it, again it will increase the complexity.

Further Notes:

  1. Sharding vs Partitioning: Partitioning is the distribution of data on the same machine across tables or databases. Sharding is one specific type of partitioning known as horizontal partitioning. In sharding, we distribute data across multiple different servers.

  2. Should I do a Sharding? Sharding should be done only when it’s absolutely necessary because it adds extra operational complexity. Most commonly it is done when we are dealing with a large amount of data. Some scenarios in which sharding might prove helpful:
    1. When the amount of data surpasses the capability of our server and vertical scaling is not possible or overly expensive.
    2. If the number of reading and writing request is more than the capability of a single node and are often uniformly distributed.
    3. Creating more than one read replica or load balancing is also not proving helpful.

Did, we miss something, or do you want to add some other key points?🤔
Please comment.

Do you know System Design is a very important topic in product-based company interviews and almost every tech giant asks about it? At IntMain we have a dedicated section for system design to help you prepare. Read here.  

LEAVE A REPLY

Please enter your comment!
Please enter your name here