What is Sharding?
Sharding is the process of dividing one’s data into two or smaller chunks known as logical shards. The logical shards are then distributed across distinct database nodes known as physical shards, each of which can hold multiple logical shards. Regardless, the data contained within all of the shards collectively represents an entire logical dataset. Sharding is also known as Horizontal Partitioning.
MySQL Sharding is the process of partitioning data from a single MySQL database and distributing it across multiple database servers, each with an identical schema.
What factors to consider when selecting a database?
The extreme High Concurrency of different services is one of the most difficult challenges our businesses face. Concurrency Control in Database Management Systems (DBMS) manages simultaneous access to a database. It prevents multiple users from editing the same record concurrently and serializes transactions for backup and recovery.
To deal with these challenges, we have to focus on 4 major aspects:
Multiple services frequently have to go through downtimes. Stability, therefore, should be the first priority to avoid Sharding.
So, the database should:
- Be able to support a multi-active, high-availability architecture.
- Provide simple monitoring and alerting services.
- Carry out a graceful rolling upgrade with no impact on the application.
- Have a few issues with data migration and validation.
- Be efficient when it comes to elastic scaling.
The faster the system, the more the stability, the better the user experience. As a result, the database must be highly efficient and should provide:
- Low Latency
- Pessimistic transactions
- High Queries per Second (QPS)
- Massive Data Processing
The cost factor should also be considered while selecting a database. Cost is associated with 3 things:
- The cost of application: The database should be easily connected to the application so that the application team is not overburdened. This process should, ideally, necessitate little communication or training.
- The cost of CPU, memory, & disks: A machine with a high configuration would significantly increase the hardware costs.
- The cost of network bandwidth: Some networks that connect remote data centers charge by the amount of bandwidth used.
The database should take into consideration the following three features to prevent data breaches and loss:
- Data Auditing: Information about who accessed a specific piece of data and how they operated on it.
- Data Recovery: You should be able to recover the data from backup in case of anomalies.
- Database Privileges: Database privileges should be managed delicately. DBAs and application developers, for example, should not have access to the user ID, phone number, or password.
As a result, a low-cost database that is stable, efficient, and secure should be considered. If it’s an open-source database, you can easily get community support when a problem arises, and also potentially contribute to the community yourself.
TiDB, an open-source, MySQL-compatible database is chosen that provides horizontal scalability and ACID-compliant transactions, after comparing various distributed databases on the market such as Amazon Aurora, Spanner, and CockroachDB.
CASE STUDY: Why is TiDB an Appropriate MYSQL Alternative Database?
TiDB is selected as a suitable MYSQL alternative to avoid Sharding mostly for 3 factors:
- Horizontal Scaling,
- Financial-grade Data Consistency, and
- Data Hub
1. Horizontal Scaling
Distributed databases can easily scale in and out without the need for manual sharding. There are frequent abrupt ups and downs. If more shards aren’t created as traffic increases, latency will rise.
The fact that computation and storage are not separated is a major source of sharding problems. MySQL was unable to replicate MYSQL database in a full-fledged manner. To address this issue, you require a database that separates the computing layer from the storage layer, allowing you to scale storage capacity and computing resources by different amounts as needed. That’s where TiDB comes into the picture!
TiDB’s architecture is multi-layered. It consists of three major components: TiDB for computation, Placement Driver (PD) for metadata storage and providing timestamps, and TiKV for distributed storage. Each layer can scale independently to handle varying traffic without interfering with other components.
2. Financial-grade Data Consistency
The TiDB transaction model is a two-phase commit protocol that is optimized for pessimistic transactions. With TiDB’s distributed transaction model, R&D engineers can focus on their own development rather than the complicated logic of sharding and validating.
3. Data Hub
TiDB Binlog is used to replicate data from one TiDB cluster to another. Because TiDB’s bottom storage engine, RocksDB, uses the log-structured merge-tree (LSM-tree) structure, which is highly efficient for write operations, you can quickly sync data to a TiDB replica. You can use this replica to perform huge calculations and frequent queries, generate reports and even create a search engine.
We have a lot of related data scattered across various systems. If we extract these data and combine them into a single data hub, we can use them for a variety of services, including operations reporting and data subscription.
There are various other factors that add to TiDB’s efficiency. Some of these are as follows:
- Separate hot and cold data.
- Store logs and monitor data.
- Online Data Definition Language (DDL) schema change.
- Migrate from HBase/Elasticsearch (ES).
- HTAP is a growing need.
This post introduces the factors for selecting an appropriate MYSQL alternative database to avoid sharding in detail. It also gives a quick overview of the TiDB database