Over the last few weeks, I've been hearing a lot about Cassandra and other NoSQL solutions that are candidates for one of the projects I am working on, which currently runs on an RDBMS solution, Oracle 11g.
I then decided to take a deeper look at these kinds of solutions and compare them with RDBMS solutions. This article is intended to help you understand NoSQL and pick the solution that best fits your requirements and scenario. It does not cover all the features of any specific NoSQL solution; rather, it shows the general picture.
To fully understand NoSQL, let us first look at some key concepts of distributed computer systems and storage systems.
What is NoSQL?
A NoSQL (or "Not Only SQL") database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. The data structure (e.g. key-value, graph, or document) differs from that of an RDBMS, so some operations are faster in NoSQL and others are faster in an RDBMS. NoSQL databases also differ from one another, and the suitability of a given NoSQL database depends on the problem it must solve (e.g., whether the solution uses graph algorithms).
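To make the difference in data structures concrete, here is a minimal sketch that stores the same facts in four shapes using plain Python containers. The data, the key naming scheme, and the function names are all hypothetical; no real database is involved.

```python
# Hypothetical example data: one user ("alice") who follows two other users.

# Relational shape: flat rows in separate tables, linked by keys.
users_table = [(1, "alice"), (2, "bob"), (3, "carol")]
follows_table = [(1, 2), (1, 3)]  # (follower_id, followed_id)

# Key-value shape: an opaque value looked up by a single key.
key_value_store = {"user:1:name": "alice", "user:1:follows": "2,3"}

# Document shape: one nested, self-contained record per entity.
document_store = {1: {"name": "alice", "follows": [2, 3]}}

# Graph shape: nodes plus an adjacency list of edges.
graph_edges = {1: [2, 3], 2: [], 3: []}


def followed_relational(user_id):
    """Relational style: scan (or join against) the follows table."""
    return [followed for follower, followed in follows_table if follower == user_id]


def followed_document(user_id):
    """Document style: a single lookup returns the nested list."""
    return document_store[user_id]["follows"]
```

Both functions answer the same question, but the document shape answers it with one lookup, while the relational shape can answer many other questions from the same tables without restructuring the data.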
The CAP Theorem
In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
To see where RDBMSs stand relative to NoSQL, take a look at the image below:
This theorem, also known as the CAP principle, says that you can choose only two of the three. Let us understand it better by walking through all the combinations:
CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved).
CP - data is consistent between all nodes, and the system maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down.
AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition).
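The difference between the CP and AP choices can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions (two replicas, one value, a naive "winner overwrites" resync); the names `Replica`, `Cluster`, `heal`, and `demo` are hypothetical and do not reflect how any particular database works.

```python
class Replica:
    """One node holding a copy of a single value."""

    def __init__(self, name):
        self.name = name
        self.value = None


class Cluster:
    """Two replicas whose connecting link can be cut (a partition)."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP"
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than let the copies diverge.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the other copy becomes stale.
            replica.value = value
            return
        # No partition: both copies are updated together.
        replica.value = value
        other.value = value

    def heal(self, winner):
        """End the partition and copy the winner's value to every replica."""
        self.partitioned = False
        for replica in (self.a, self.b):
            replica.value = winner.value


def demo():
    results = {}

    cp = Cluster("CP")
    cp.write(cp.a, "v1")
    cp.partitioned = True
    try:
        cp.write(cp.a, "v2")
        results["cp_write_accepted"] = True
    except RuntimeError:
        results["cp_write_accepted"] = False
    results["cp_values"] = (cp.a.value, cp.b.value)

    ap = Cluster("AP")
    ap.write(ap.a, "v1")
    ap.partitioned = True
    ap.write(ap.a, "v2")
    results["ap_values_during"] = (ap.a.value, ap.b.value)
    ap.heal(winner=ap.a)
    results["ap_values_after"] = (ap.a.value, ap.b.value)
    return results
```

In this sketch the CP cluster stays consistent by rejecting the write, while the AP cluster accepts it and serves different values from each replica until the partition heals.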
This theorem is a very useful starting point for the paradigm of NoSQL and distributed systems.
Some NoSQL designs give up consistency in order to achieve availability and partition tolerance. It is also interesting to note that some NoSQL implementations give up partition tolerance and consistency in order to achieve high performance and minimize latency.
One of the biggest goals of NoSQL is horizontal scalability. NoSQL systems typically accomplish this by relaxing relational abilities and/or loosening transactional semantics.
Now, let us review the ACID properties of RDBMSs.
Atomicity
Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.
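As a small illustration of "all or nothing", the following sketch uses Python's built-in `sqlite3` module as a stand-in for a full RDBMS. The `accounts` table and the `transfer` and `balances` helpers are hypothetical names chosen for this example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()
```

A call such as `transfer(conn, "alice", "bob", 500)` fails the balance check after both updates have run, and the rollback leaves both rows exactly as they were.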
Consistency
The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors do not violate any defined rules.
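The sketch below shows the database itself rejecting writes that break its declared rules, again using Python's `sqlite3` module as a stand-in for an RDBMS. The schema and the `try_insert_order` helper are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.commit()


def try_insert_order(customer_id, quantity):
    """Return True if the row satisfied every constraint and was committed."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders (customer_id, quantity) VALUES (?, ?)",
                (customer_id, quantity),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def order_count():
    return conn.execute("SELECT COUNT(*) FROM orders").fetchall()[0][0]
```

An order for a customer that does not exist, or with a quantity of zero, is refused by the database, so the application cannot move the data into an invalid state even if its own checks are buggy.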
Isolation
The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e. one after the other. Providing isolation is the main goal of concurrency control. Depending on concurrency control method, the effects of an incomplete transaction might not even be visible to another transaction.
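One visible effect of isolation is that other connections do not see uncommitted work. The sketch below demonstrates this with two `sqlite3` connections to the same file. It assumes SQLite's default rollback-journal mode and the `sqlite3` module's default behaviour of opening a transaction implicitly before an `INSERT`; the file name, table, and variable names are hypothetical.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "isolation_demo.db")

writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE events (msg TEXT)")
writer.commit()

reader = sqlite3.connect(db_path)


def count_rows(conn):
    """Count rows as seen by this connection right now."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchall()[0][0]


# The INSERT starts a transaction on the writer that is not yet committed.
writer.execute("INSERT INTO events VALUES ('pending')")
seen_before_commit = count_rows(reader)  # reader still sees the old state

writer.commit()
seen_after_commit = count_rows(reader)  # now the row is visible

writer.close()
reader.close()
```

Here `seen_before_commit` is 0 and `seen_after_commit` is 1: the second connection only observes the insert once the first one has committed it.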
Durability
Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements has executed and committed, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in non-volatile memory.
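A real crash is hard to stage in a short example, so the following sketch only approximates it: it closes the connection and reopens the database file, using Python's `sqlite3` module. Work that was committed is still there afterwards; work that was never committed is gone. The file name, table, and helper functions are hypothetical.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "durability_demo.db")


def write_and_close(path, commit):
    """Insert one row, optionally commit, then drop the connection."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS log (entry TEXT)")
    conn.execute("INSERT INTO log VALUES ('payment received')")
    if commit:
        conn.commit()
    conn.close()  # closing without a commit discards the open transaction


def read_entries(path):
    """Open a fresh connection and return everything stored in the file."""
    conn = sqlite3.connect(path)
    rows = conn.execute("SELECT entry FROM log").fetchall()
    conn.close()
    return [row[0] for row in rows]


write_and_close(db_path, commit=False)
after_uncommitted = read_entries(db_path)

write_and_close(db_path, commit=True)
after_committed = read_entries(db_path)
```

After the first call the table is empty; after the second, the committed row is read back from disk by a completely new connection.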
The C in ACID and CAP
Now let us make an important observation: the 'C' in ACID is not the same as the 'C' in CAP. In ACID, it means being consistent with all the rules defined within the database, including constraints (e.g. foreign keys), triggers, and so on. In CAP, consistency means single-copy consistency, a strict subset of ACID consistency. Another important note is that ACID systems usually do not provide partition tolerance.
Comparisons and Conclusions
We can say that an RDBMS would be your first choice if your application requires ACID transactions and you do not need to scale. Also, if you want to perform more complex queries, NoSQL would usually not be your first choice. However, if scaling is your goal and performance is a major concern, you might consider "going NoSQL".
There is also the case where you might not choose NoSQL because your application needs joins and transactions. However, consider the possibility of implementing these in your application code.
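As a rough idea of what a join written in the application might look like, here is a minimal sketch in plain Python. The data stands in for the results of two separate lookups against a store that has no join operator; the field names and the `join_orders_with_customers` function are hypothetical.

```python
# Hypothetical results of two separate lookups against a non-relational store.
customers = {1: {"name": "alice"}, 2: {"name": "bob"}}
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25},
    {"order_id": 11, "customer_id": 2, "total": 40},
    {"order_id": 12, "customer_id": 1, "total": 15},
]


def join_orders_with_customers(orders, customers):
    """Join in the application: look up each order's customer by key."""
    joined = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            continue  # inner-join semantics: skip orders with no matching customer
        joined.append({**order, "customer_name": customer["name"]})
    return joined
```

The trade-off is that the application now owns what the database used to guarantee: it has to fetch both sides itself, and nothing stops an order from pointing at a customer that no longer exists.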
Now, about the consistency issue in NoSQL systems: you must consider how critical your application is and how available it must be before you choose a NoSQL solution. To better understand this, consider this part of Brewer's 2012 article:
"the '2 of 3' view is misleading on several fronts. First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned. Second, the choice between C and A can occur many times within the same system at very fine granularity; not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved. Finally, all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists."
The final conclusion: it all depends on your application, your project, and even your know-how with the technologies. As a final guideline, if most of your system must be highly consistent and you do not need to scale your application, do not go NoSQL; otherwise, you might find it a very useful solution.
As a simple guideline, analyse the needs of your project:
Scalability, Performance, and High Availability -> NoSQL
Transaction needs, complex queries, and High Consistency -> RDBMS