How database companies keep their data straight

As developers solve ever larger problems, they need to store their data in more complex ways – adding a constellation of computers to house it all.

However, adding more hardware can lead to confusion when different parts of the network must be consulted to answer a particular query, especially when applications expect fast responses. Each database update must be sent to all of the computers – sometimes spread across different data centers – before the update is complete.

Complex data requires complex solutions

Developers like to have a “single source of truth” when building applications: a single, authoritative record of the essential information that can always report the most current values.

It is simple to deliver this consistency with a single computer running one database. When multiple machines run in parallel, defining a single version of the truth becomes complicated. If two or more changes arrive on different machines shortly after each other, there is no easy way for the database to decide which one came first. When computers do their work in milliseconds, the order of such changes can be ambiguous, forcing the database to choose who gets the airline seat or the concert tickets.

The problem only grows with the size of the tasks assigned to a database. More and more jobs require large databases spanning multiple machines. These machines may be located in different data centers around the world to improve response time and add remote redundancy. However, the extra communication time required greatly increases the complexity when updates arrive close together on different machines.

And the problem cannot simply be solved by handing everything over to a high-end cloud provider. The database services offered by giants like Amazon Web Services, Google Cloud, and Microsoft Azure all have limits when it comes to consistency, and they offer several variations of consistency to choose from.

To be sure, some jobs are not affected by this problem. Many applications simply ask databases to track slowly changing or unchanging values – for example, the size of your monthly bill or the winner of last season’s ball game. The information is written once, and all subsequent requests receive the same response.

Other jobs, such as tracking the number of open seats on an aircraft, can be much harder. If two people try to buy the last seat on the plane, both can receive a reply that there is one seat left. The database needs to take extra steps to ensure that the seat is sold only once. (The airline may still choose to overbook the aircraft, but that is a business decision, not a database error.)
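
That extra step is typically a conditional write: the sale succeeds only if the seat is still unsold at the moment the change is applied. Here is a minimal sketch of the pattern in Python, using the standard library’s sqlite3 module and a made-up seats table; it illustrates the technique, not any airline’s actual system.

```python
import sqlite3

# Hypothetical schema for illustration: one row per seat on a flight.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (flight TEXT, seat TEXT, sold_to TEXT)")
conn.execute("INSERT INTO seats VALUES ('VB101', '14C', NULL)")
conn.commit()

def buy_seat(conn, flight, seat, buyer):
    """Sell the seat only if it is still unsold (a conditional write)."""
    cur = conn.execute(
        "UPDATE seats SET sold_to = ? "
        "WHERE flight = ? AND seat = ? AND sold_to IS NULL",
        (buyer, flight, seat),
    )
    conn.commit()
    return cur.rowcount == 1  # True for exactly one buyer, False for the other

print(buy_seat(conn, "VB101", "14C", "alice"))  # True  – seat sold
print(buy_seat(conn, "VB101", "14C", "bob"))    # False – already taken
```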

Databases work hard to maintain consistency as changes pile up by grouping any number of related changes into single packages known as “transactions.” If four people flying together want seats on the same flight, the database can keep the set together and only apply the changes if, for example, four empty seats are available.
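
Continuing the made-up seats table from the previous sketch, the example below shows how a transaction keeps a group of changes together: it tries to book four seats and commits only if every one of them is still available, rolling everything back otherwise. Again, this is an illustrative sketch, not a production booking system.

```python
import sqlite3

# Same hypothetical schema as before: one row per seat on a flight.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (flight TEXT, seat TEXT, sold_to TEXT)")
conn.executemany(
    "INSERT INTO seats VALUES ('VB101', ?, NULL)",
    [("14A",), ("14B",), ("14C",), ("14D",)],
)
conn.commit()

def book_party(conn, flight, seats, buyer):
    """Book every listed seat in one transaction, or none of them."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            for seat in seats:
                cur = conn.execute(
                    "UPDATE seats SET sold_to = ? "
                    "WHERE flight = ? AND seat = ? AND sold_to IS NULL",
                    (buyer, flight, seat),
                )
                if cur.rowcount != 1:
                    raise RuntimeError(f"seat {seat} is no longer available")
        return True
    except RuntimeError:
        return False  # nothing was committed; no seats were taken

print(book_party(conn, "VB101", ["14A", "14B", "14C", "14D"], "the smiths"))
```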

In many cases, database designers have to decide whether they want to trade consistency for speed. Is strong consistency worth slowing down updates until they reach all corners of the database? Or is it better to plow forward because the odds are low that any inconsistency will cause a significant problem? After all, is it really that tragic if someone who clicked to buy a ticket five milliseconds after someone else ends up getting it? You could argue that no one will notice.

The problem only occurs during the time it takes for new versions of the data to spread throughout the network. The databases converge on a correct, consistent answer, so why not take a chance if the stakes are low?

There are now several “eventually consistent” models supported by different databases. The question of how best to tackle the problem has been thoroughly researched over the years. Computer scientists like to talk about the CAP theorem, which describes the trade-off between consistency, availability, and partition tolerance. It is usually relatively easy to deliver two of the three, but difficult to get all three in a working system.

Why is eventual consistency important?

The idea of eventual consistency evolved as a way to soften expectations of accuracy in the moments when it is most difficult to deliver: right after new information has been written to one node but has not yet spread throughout the constellation of machines responsible for storing the data. Database developers often try to be more precise by spelling out the different versions of consistency they are able to offer. Amazon Chief Technology Officer Werner Vogels described five different variations Amazon considered when designing some of the databases that run Amazon Web Services (AWS). The list includes options such as “session consistency,” which promises consistency, but only within a particular session.
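
AWS’s DynamoDB, for instance, exposes this choice per request: reads are eventually consistent by default, and a caller can ask for a strongly consistent read instead. A brief sketch with the boto3 library, using a hypothetical Orders table:

```python
import boto3

# Hypothetical table name, key, and region for illustration only.
table = boto3.resource("dynamodb", region_name="us-east-1").Table("Orders")

# Default read: eventually consistent, cheaper and usually faster.
fast = table.get_item(Key={"order_id": "1234"})

# Strongly consistent read: reflects all writes acknowledged before the request.
accurate = table.get_item(Key={"order_id": "1234"}, ConsistentRead=True)
```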

The approach is closely linked to NoSQL databases because many of these products began by promising only eventual consistency. Over the years, database designers have examined the problem in more detail and developed better models to describe the trade-offs with more precision. The idea still bothers some database administrators, the kind who wear both a belt and suspenders to work, but users who do not need perfect answers appreciate the speed.

How do the established players approach it?

Traditional database companies such as Oracle and IBM continue to be committed to strong consistency, and their key database products continue to support it. Some developers use very large computers with terabytes of RAM to run a single database that maintains a single, consistent record. For banking and warehousing tasks, this may be the simplest way to grow.

Oracle also supports clusters of databases, including MySQL, and these can fall back on delivering eventual consistency for jobs that need scale and speed more than perfection.

Microsoft’s Cosmos DB offers five levels of guarantee, ranging from strong to eventual consistency. Developers can trade speed against accuracy depending on the application.
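
In the Python SDK for Cosmos DB, the level is typically chosen when the client is created. The sketch below is a rough illustration: the endpoint, key, and database names are placeholders, and the consistency_level keyword reflects recent versions of the azure-cosmos package, so treat the exact spelling as an assumption to check against your SDK version.

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint and key for illustration only.
ENDPOINT = "https://example-account.documents.azure.com:443/"
KEY = "<primary-key>"

# Ask for weaker consistency than the account default to favor latency.
# (consistency_level is an assumption about the SDK keyword; verify it locally.)
client = CosmosClient(ENDPOINT, credential=KEY, consistency_level="Eventual")

container = client.get_database_client("shop").get_container_client("orders")
item = container.read_item(item="1234", partition_key="1234")
```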

What are the newcomers doing?

Many of the new NoSQL database services explicitly embrace eventual consistency to simplify development and speed things up. Start-ups may have begun by offering the simplest consistency model, but lately they have given developers more options to trade raw speed for better accuracy when needed.

Cassandra, one of the earliest NoSQL database offerings, now offers nine options for write consistency and 10 options for read consistency. Developers can trade speed for consistency according to application requirements.
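
With the DataStax Python driver for Cassandra, the level can be chosen per statement. A short sketch, using a hypothetical ticketing keyspace and seats table:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical local cluster, keyspace, and table for illustration.
session = Cluster(["127.0.0.1"]).connect("ticketing")

# Fast, loose write: a single replica's acknowledgement is enough.
fast_write = SimpleStatement(
    "UPDATE seats SET sold_to = %s WHERE flight = %s AND seat = %s",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(fast_write, ("alice", "VB101", "14C"))

# Stricter read: a majority of replicas must agree before answering.
safe_read = SimpleStatement(
    "SELECT sold_to FROM seats WHERE flight = %s AND seat = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(safe_read, ("VB101", "14C")).one()
```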

Couchbase, for example, offers what the company calls a “tunable” amount of consistency that can vary from query to query. MongoDB can be configured to offer eventual consistency for reads from replica copies for speed, but it can also be configured with a number of options that deliver stronger consistency. PlanetScale offers a model that balances consistent replication with speed, arguing that banks are not the only ones that have to fight inconsistency.
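
With pymongo, for example, a single collection handle can be cloned with different settings: one tuned for speed by reading from secondary copies, another tuned for safety with majority write and read concerns. The database, collection, and field names below are hypothetical.

```python
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

# Hypothetical replica set, database, and collection for illustration.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
tickets = client.ticketing.seats

# Fast path: read from a secondary, accepting possibly stale data.
fast = tickets.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
fast.find_one({"flight": "VB101", "seat": "14C"})

# Safe path: require majority acknowledgement for writes and majority reads.
safe = tickets.with_options(
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)
safe.update_one(
    {"flight": "VB101", "seat": "14C", "sold_to": None},
    {"$set": {"sold_to": "alice"}},
)
```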

Some companies are building new protocols that come closer to strong consistency. Google’s Spanner, for example, relies on a set of very precise clocks to synchronize the versions running in different data centers. The database uses these timestamps to determine which new block of data arrived first. FaunaDB, on the other hand, uses a protocol that does not rely on very precise clocks. Instead, it creates synthetic timestamps that help decide which version of competing values to keep.
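
Neither Spanner’s TrueTime nor Fauna’s protocol ships as a simple library, but the general idea of synthetic timestamps can be illustrated with a simplified hybrid logical clock: each node pairs its physical clock with a counter so that competing writes can still be ordered even when wall clocks disagree slightly. The sketch below is a toy version of that concept, not either vendor’s implementation.

```python
import time
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Timestamp:
    wall: int      # physical time in milliseconds
    logical: int   # tie-breaker counter for events in the same millisecond

class HybridLogicalClock:
    """Toy hybrid logical clock: orders events even when wall clocks drift."""

    def __init__(self):
        self.last = Timestamp(0, 0)

    def now(self) -> Timestamp:
        """Timestamp for a local event or an outgoing message."""
        wall = int(time.time() * 1000)
        if wall > self.last.wall:
            self.last = Timestamp(wall, 0)
        else:
            self.last = Timestamp(self.last.wall, self.last.logical + 1)
        return self.last

    def observe(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        wall = int(time.time() * 1000)
        if wall > self.last.wall and wall > remote.wall:
            self.last = Timestamp(wall, 0)
        else:
            # Wall clocks lag or disagree: advance past the larger timestamp.
            newest = max(self.last, remote)
            self.last = Timestamp(newest.wall, newest.logical + 1)
        return self.last
```

Comparing the (wall, logical) pairs then gives every replica the same deterministic ordering of competing writes, even if one machine’s clock runs a few milliseconds behind.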

Yugabyte has chosen to embrace consistency and partition tolerance from the CAP theorem and to sacrifice availability: some read queries stall until the database reaches a consistent state. CockroachDB uses a model that it says sometimes offers a serializable version of the data, but not a linearizable one.

The limits of eventual consistency

For critical tasks, such as those involving money, users are willing to wait for answers without inconsistencies. Eventually consistent models may be acceptable for many data-gathering tasks, but they are not suitable for jobs that require a high degree of trust. When companies can afford to run large computers with plenty of RAM, databases that offer strong consistency remain the right choice for anyone who controls scarce resources.
