
When to Use NoSQL - An Opinion-Based Post from 2014

In the last few weeks, I've been hearing a lot about Cassandra and other NoSQL solutions that were candidates for one of the projects I am working on, which currently runs on an RDBMS solution: Oracle 11g.

I then decided to take a deeper look at this kind of solution and compare it with RDBMS solutions. This article is intended to help you understand NoSQL and pick the solution that best fits your requirements and scenario. It does not cover all the features of any specific NoSQL solution; rather, it shows the general picture.

In order to fully understand NoSQL, let us first see some key concepts of distributed computer systems and storage systems.

What is NoSQL?

A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. The data structure (e.g. key-value, graph, or document) differs from that of an RDBMS, and therefore some operations are faster in NoSQL and others in an RDBMS. The suitability of a given NoSQL database depends on the problem it must solve (e.g., whether the solution uses graph algorithms).
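
To make the difference in data modeling concrete, here is a minimal sketch in Python. It stores the same information twice: once as two related tables (using the standard `sqlite3` module as a stand-in for an RDBMS) and once as a single JSON document in a plain dictionary acting as a hypothetical key-value store. The table names, the key `customer:1`, and the data are assumptions made up for illustration.

```python
import json
import sqlite3

# Relational model: the data is split across two tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [(10, 1, "book"), (11, 1, "pen")],
)

# Reading the customer together with the purchases requires a join.
rows = conn.execute(
    "SELECT c.name, p.item FROM customer c "
    "JOIN purchase p ON p.customer_id = c.id ORDER BY p.id"
).fetchall()

# Document model: the same data is stored as one aggregate under one key.
# A dict stands in for a key-value/document store (hypothetical, in-memory only).
document_store = {}
document_store["customer:1"] = json.dumps(
    {"name": "Ana", "purchases": ["book", "pen"]}
)

# Reading the whole aggregate is a single lookup, with no join involved.
doc = json.loads(document_store["customer:1"])
```

The document form makes "fetch everything about customer 1" a single lookup, while the relational form makes ad hoc questions across customers (for example, "who bought a pen?") easier to express.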


The CAP Theorem

In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

To see where RDBMS and NoSQL solutions stand, take a look at the image below:



This theorem, also known as the CAP principle, says that you must choose two out of three. Let us understand it better by going through all the combinations:

CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved).

CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down.

AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition)
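
The difference between CP and AP during a partition can be illustrated with a toy simulation. This is only a sketch under strong assumptions: two in-memory replicas, a boolean flag standing in for a network partition, and hypothetical names (`Replica`, `Cluster`, `write`). It is not how any particular database implements replication.

```python
class Replica:
    """A single node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class Cluster:
    """Two replicas that can be cut off from each other by a partition."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP"
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the replicas now disagree.
            replica.data[key] = value
            return
        # No partition: the write reaches both replicas.
        self.a.data[key] = value
        self.b.data[key] = value


# CP cluster: stays consistent, but rejects writes while partitioned.
cp = Cluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
try:
    cp.write(cp.a, "x", 2)
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

# AP cluster: keeps accepting writes, but the replicas diverge.
ap = Cluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap.write(ap.a, "x", 2)
```

After running this, the CP cluster has rejected the second write and both of its replicas still hold the same value, while the AP cluster has accepted the write on one replica only, so its two replicas hold different values until they resync.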

This theorem is very useful to get us started with the paradigm of NoSQL and distributed systems.

Some NoSQL designs give up consistency in order to achieve availability and partition tolerance. It is also interesting to note that some NoSQL implementations give up partition tolerance and consistency in order to achieve high performance and minimize latency.

One of the biggest goals of NoSQL is horizontal scalability. NoSQL systems typically accomplish this by relaxing relational abilities and/or loosening transactional semantics.
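
One common way to scale horizontally is to spread keys across several nodes by hashing the key. The sketch below is a simplified illustration, not the partitioning scheme of any specific product: the shards are plain dictionaries, and `shard_for`, `put`, and `get` are hypothetical helper names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dict stands in for a separate node (assumption: three nodes).
shards = [dict() for _ in range(3)]


def put(key, value):
    """Store the value on the shard that owns the key."""
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    """Read the value back from the shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key)


for user_id in range(100):
    put(f"user:{user_id}", {"id": user_id})
```

Each key lives on exactly one shard, so single-key reads and writes touch a single node. An operation that spans keys on different shards, such as a join or a multi-row transaction, needs coordination between nodes, which is where the relaxed relational and transactional guarantees come from.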


Now, let us review the ACID properties of an RDBMS.

Atomicity

Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.
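
As a small illustration of "all or nothing", the sketch below uses Python's standard `sqlite3` module as a stand-in for an RDBMS. The `account` table, the `transfer` helper, and the balances are assumptions invented for the example. The transfer credits the destination first and then debits the source; when the debit violates a `CHECK` constraint, the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction; return True on success."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the debit fails and the credit to Bob is undone too.
failed_transfer_ok = transfer(conn, "alice", "bob", 150)
balances_after_failure = dict(conn.execute("SELECT name, balance FROM account"))

# A valid transfer applies both updates together.
valid_transfer_ok = transfer(conn, "alice", "bob", 40)
balances_after_success = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer both balances are unchanged; after the valid one both updates are visible together.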

Consistency

The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors do not violate any defined rules.
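
The following sketch shows the database itself rejecting writes that break defined rules, again using `sqlite3` as a stand-in for an RDBMS. The schema and the `try_execute` helper are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; other database systems typically enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE purchase ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customer(id))"
)
conn.execute("INSERT INTO customer VALUES (1, 'ana@example.com')")
conn.commit()


def try_execute(sql, params=()):
    """Run one statement in its own transaction; report whether it was accepted."""
    try:
        with conn:
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Valid: the referenced customer exists.
valid_purchase = try_execute("INSERT INTO purchase VALUES (?, ?)", (10, 1))

# Invalid: customer 99 does not exist, so the foreign key rule is violated.
orphan_purchase = try_execute("INSERT INTO purchase VALUES (?, ?)", (11, 99))

# Invalid: the email is already taken, so the UNIQUE rule is violated.
duplicate_email = try_execute(
    "INSERT INTO customer VALUES (?, ?)", (2, "ana@example.com")
)

purchase_count = conn.execute("SELECT COUNT(*) FROM purchase").fetchone()[0]
```

Only the valid purchase is stored. The two rule-breaking statements are rejected no matter what the application code intended, which is exactly the guarantee described above.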

Isolation

The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e. one after the other. Providing isolation is the main goal of concurrency control. Depending on concurrency control method, the effects of an incomplete transaction might not even be visible to another transaction.

Durability

Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.

The C in ACID and CAP

Now let us make an important observation: the 'C' in ACID is not the same as the 'C' in CAP. In ACID, it means being consistent with all the rules defined within the database, including constraints (e.g. FKs), triggers, etc. In CAP, consistency means single-copy consistency, a strict subset of ACID consistency. Another important note is that ACID systems usually do not support partition tolerance.

Comparisons and Conclusions

We can say that an RDBMS would be your first choice if your application requires ACID transactions and you do not need to scale. Also, if you want to perform more complex queries, NoSQL would usually not be your first choice. However, if scaling is your goal and you consider performance a major issue, you might consider "going NoSQL".

There is also the case where you might not choose NoSQL because you need joins and transactions in your application. However, consider the possibility of implementing these in your application code.
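
As an example of moving a join into the application, the sketch below combines two collections in plain Python. The collections, their field names, and the `join_purchases_with_customers` helper are assumptions for illustration; in a real system each lookup would be a call to the data store rather than a dictionary access.

```python
# Two "collections" as they might come back from a key-value or document store.
customers = {
    1: {"name": "Ana"},
    2: {"name": "Bruno"},
}
purchases = [
    {"customer_id": 1, "item": "book"},
    {"customer_id": 2, "item": "pen"},
    {"customer_id": 1, "item": "lamp"},
    {"customer_id": 3, "item": "mug"},  # refers to a customer that does not exist
]


def join_purchases_with_customers(purchases, customers):
    """Application-side inner join of purchases with customers on customer_id."""
    joined = []
    for purchase in purchases:
        customer = customers.get(purchase["customer_id"])
        if customer is None:
            # Inner-join semantics: skip purchases without a matching customer.
            continue
        joined.append({"name": customer["name"], "item": purchase["item"]})
    return joined


report = join_purchases_with_customers(purchases, customers)
```

This works, but the application now owns what the database used to do: it has to handle missing references itself, and nothing guarantees that both collections are read from the same consistent snapshot.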

Now, about the consistency issue in NoSQL systems: when you choose a NoSQL solution, you must consider how critical your application is and how much availability it requires. To better understand this, consider this passage from Brewer's 2012 article:

 "the '2 of 3' view is misleading on several fronts. First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned. Second, the choice between C and A can occur many times within the same system at very fine granularity; not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved. Finally, all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists."

The final conclusion: it all depends on your application, your project, and even your know-how of the technologies. As a final guideline, if most of your system must be highly consistent and you don't need to scale your application, do not go NoSQL; otherwise, you might find it a very useful solution.

As a simple guideline, analyse the needs of your project:

Scalability, Performance, and High Availability -> NoSQL


Transaction needs, complex queries, and high consistency -> RDBMS




