Database Sharding

Fundamental principle that very very large scale systems are built on because it allows systems to scale out horizontally as much as they want.

Topics:

1. What is sharding

2. Why is it required

3. What are diff. option in terms of scaling your database

4. How to increase database performance

a. Scaling up your hardware

b. Adding replicas

c. Sharding

5. Pros and cons of sharding

What is Sharding?

- It is optimization technique for achieving horizontal scalability in databases, i.e. splitting our databases into multiple smaller ones.

- Now instead of having one big database you have smaller individual ones and each of those can have same kind of hardware set-up, hence we get better performance and more data storage.

- Sharding is specific type of partitioning

- We’ll be majorly focusing on horizontal sharding, as it is more applicable for large scale problems.

Options available for increasing performance of database

Why?

- May be database is too expensive

o Cz of hardware cost

- May be we are running into the limits in terms of storage on that database

o So we have to dump older data, as we can’t query using older data

Option 1. Scaling up hardware

Ø Increase the RAM

Ø Pick better performance processor

Not a better solution, as its very expensive to keep on scaling up and up, every time when we face performance problems

Option 2. Add replicas

Ø We make copies of database

o Instead of one database that’s serving all the traffic, we have multiple copies of that database

o Copies are also, able to field some of the traffic that’s coming in.

o We have

§ one MASTER NODE (orchestrator)

§ multiple READ REPLICAS (copy of master node)

Ø Master node is responsible for receiving all write requests

Ø Read replicas are not allowed to receive write requests

Ø Master will accept the write request and will propagate the update to that row to read replicas, in eventual propogation

Ø Eventual propagation/ delayed propagation through asynchronous, leads to a problem called “Eventual Consistency”

Ø Eventual Consistency means,

o that there is delay from, when a write request is received and commited to master node

o read replicas get updated with that row

Why Eventual Consistency is a problem?

- Having eventual consistency in a system can result in stale data.

Option 3. Sharding

Ø Splitting of your databases into multiple smaller databases

Ø How will you separate your data?

o Fundamental thing with sharding is that your need some kind of key i.e. predictable and is part of every request that comes into your system.

o We use this key as input to hashing function and that key will tell you which shard the data regarding the key is on.

Method 1 Sharding: Partition Mapping i.e. using independent Shard Databases

Before Sharding

- We have one database, if that goes down the whole application goes down. .

After Sharding

- Now we are saying that we need small table that tells which all customers are on which shard.

- Here, Database Shard 1 and Databse Shard 2, both are on completely different master node in completely different database

- Lets take a scenario where Database Shard 1 goes down, it will definitely impact customers, but we have Database Shard 2, which is storing the subset of your information. Hence, for customer 3, 4 we can still receive the traffic and process the query accordingly.

Method 2 Sharding: using Routing Layer

- So that when a request comes in instead of hitting the database directly, we can introduce another intermediary layer.

- When we query this intermediary layer, then this layer is going to be responsible for looking up which shard each customer is on and then forwarding the request to corresponding underlying database instance.

- This layer should be able to do below mentioned:

o Should know how to route this traffic

o Need to be able to handle single point of failure, if you are using database to store your shard location.

Pros

Cons

Scalability: instead of having one database instance which can lead to single point of failure, we can chunk our data into smaller independent databases.

This splitting into small independent databases can be recursive as per our requirements.

Complexity:

1. Partition Mapping

2. Routing Layer

3. Non-uniformity of data.

Goal of Hashing Function is to split your data evenly.

Let there is a user which has tons of data in your application. Eventually, your shard size can grow and make shard storage disproportionate to one another.

4. Re-sharding/ Re-shuffling

- Lead because of non-uniformity of data

Now we need to orchestrate the whole process

Availability/Fault tolerance: if one shard database goes down, at least we can serve for the remaining subset of customer data.

Analytical type queries restricted

Eg. We’ve all these different shards, that have different databases underneath them.

So you need to have your layer know, to go out and collect all the information from different underlying databases, wrap all that up and then return it to caller called, Scatter Gather.

For analytical queries, it is necessity to go out to each individual shard grab all that data and then return back to your customer.

MUST READ: https://aws.amazon.com/blogs/database/sharding-with-amazon-relational-database-service/

Technology - continuous learning process

Sunday, March 13, 2022

Database Sharding

No comments:

Post a Comment

Popular Posts

Most Featured Post

GitHub: Squash all commits in PR