Mongo Berlin 2010 and thoughts on NoSQL

NoSQL is currently one of the hottest buzzwords in the software business. Given how hot the topic is and how many new NoSQL variants keep appearing, it was a privilege to be able to attend the Mongo Berlin conference and find out for ourselves what all the buzz is about.

The CAP theorem is a well-known result about distributed systems, stating that only two of the following three properties can be satisfied simultaneously:

  • Consistency
  • Availability
  • Partition tolerance

Traditional RDBMSs satisfy C and P, which is a bummer if you’re developing a high-profile website, so clearly something else is needed. Non-relational databases, or NoSQL for short, come to the rescue. NoSQL relaxes the consistency requirement and instead provides A and P. This means that different users might see a different state of the data at a given time, but the model is much more resilient to network problems and heavy loads. NoSQL DBs usually store documents, as opposed to the normalized data in RDBMSs. Documents can be almost anything, but usually they are JSON documents. Transactions spanning multiple documents, foreign key relations and joins are not supported.
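To make the document model concrete, here is a toy sketch using plain Python dicts to stand in for JSON documents (all names and fields are made up for illustration; this is not the MongoDB API). Data that an RDBMS would normalize into posts, authors and comments tables is embedded in one document, so a single read fetches everything and no join is needed:

```python
# A hypothetical blog post stored as one denormalized document.
post = {
    "_id": 1,
    "title": "Mongo Berlin 2010",
    "author": {"name": "Anna", "email": "anna@example.com"},  # embedded, no join
    "tags": ["nosql", "mongodb", "conference"],
    "comments": [  # one-to-many relation held inline, no foreign keys
        {"user": "Ben", "text": "Nice writeup!"},
        {"user": "Carla", "text": "See you next year."},
    ],
}

# Everything needed to render the page comes back in one read:
print(post["author"]["name"], len(post["comments"]))
```

The flip side is visible here too: with the comments embedded, there is no way to update a comment in one post and another atomically, which is exactly the multi-document transaction NoSQL gives up.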

What makes this interesting is that implementing eventual consistency allows for much easier horizontal scaling, i.e. adding more nodes. RDBMSs can easily be scaled vertically, by adding more powerful hardware, but scaling with more nodes is very complex, and much of the added power is lost in overhead. Vertical scaling also has its limits; there is only so much one can do with a single piece of hardware.
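One common way to add nodes is to partition (shard) the data by key, so each node owns only a slice of it. A minimal sketch of hash-based routing, with hypothetical shard names (real systems such as MongoDB use range-based or consistent hashing schemes to limit data movement when nodes are added):

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical node names

def shard_for(key: str, shards=SHARDS) -> str:
    """Route a document key to a shard by hashing it.

    A stable hash (md5 here, not Python's randomized hash()) keeps
    routing deterministic across processes and restarts.
    """
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[digest % len(shards)]

# Every process agrees on where a given user's documents live:
assert shard_for("user:42") == shard_for("user:42")
```

Each shard then handles only its own slice of reads and writes, which is where the near-linear horizontal scaling comes from.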

NoSQL has already proven itself a good fit for high-traffic websites like Facebook or Foursquare. These sites are built around massive amounts of data generated by tens or even hundreds of millions of users (for example, Facebook gets 100 billion (10^11) hits and stores 130TB of data every single day!). Scaling to such volumes with an RDBMS is next to impossible, since guaranteeing ACID for each transaction would grind the service to a halt under such traffic and data volumes. Horizontal scaling is the only solution. NoSQL can also be applied to other interesting use cases, like logging services, or large-scale data mining and storage, as Google does. The rule of thumb seems to be to apply NoSQL where volumes are high, but losing a record or two every now and then does not make a big difference.

MongoDB is one attempt at creating a production-quality NoSQL DB. The killer features in Mongo are high availability and automatic failover through replica sets, automatic sharding, GridFS file storage and geo indexing. The development pace has been rapid, with version 1.8 planned for release less than two years after the first public release. The target is to release a new stable version every three months, with the developers claiming to implement the most requested features first. Sounds good, doesn't it?
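GridFS is worth a word of explanation: since documents have a size limit, it stores a large file as a metadata document plus a sequence of fixed-size chunk documents. A toy pure-Python sketch of that layout (this is not the pymongo GridFS API; the field names and the roughly 256 KB chunk size are illustrative):

```python
CHUNK_SIZE = 256 * 1024  # GridFS used chunks of roughly this size at the time

def to_gridfs_docs(filename: str, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a blob into a metadata doc plus ordered chunk docs,
    mirroring how GridFS lays a file out across documents."""
    chunks = [
        {"files_id": filename, "n": i, "data": data[off:off + chunk_size]}
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]
    meta = {"_id": filename, "length": len(data),
            "chunkSize": chunk_size, "numChunks": len(chunks)}
    return meta, chunks

# A 600 kB file becomes three ordered chunks that reassemble losslessly:
meta, chunks = to_gridfs_docs("talk.pdf", b"x" * 600_000)
assert b"".join(c["data"] for c in chunks) == b"x" * 600_000
```

The nice property of this layout is that chunks are ordinary documents, so they can be sharded and replicated like everything else.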

As I’m an old and cynical software consultant, I had my doubts about all of this, and after the conference I have to say I still do. Lately there have been a number of incidents (Foursquare, Facebook) where services using NoSQL have failed in spectacular fashion. The attention these failures received is largely due to the fact that so many people use the services; they are bound to get wide publicity, even among non-techies. Even if such failures happen once in a while anyway, one cannot dismiss the significance of the lack of industry-wide best practices: developers have to make up their own as they go. This was brought up several times during the conference as well. The latest information and knowledge is scattered across blog posts, discussion forums and informal conference discussions. The same can be said about tooling, as higher-level frameworks common in the RDBMS world, such as Hibernate and JPA, are either still missing or immature in the NoSQL world.

The biggest hurdle for NoSQL, though, is fully breaking into the corporate world. Corporations tend to value predictability very highly, and choose products based on their long-term costs. Compared to RDBMSs, NoSQL is still very much in its infancy. Corporate knowledge of and interest in the field are still low, whereas the RDBMS has been well established in research, education and industry over the past few decades, so professionals and vendor support for it are readily available. NoSQL is also currently missing a single unified query language, which means a heightened dependency on a single vendor compared to SQL on the RDBMS side. NoSQL is still not the silver bullet of data storage, even if it comes with many benefits.

Nevertheless, the amount of data in corporations is constantly increasing, so even if big corporations might not be among the early adopters, they are bound to follow sooner or later. One has to remember that this is not an either-or decision. NoSQL can be applied where it’s best suited, while keeping the traditional RDBMS where transactions and data normalization are needed. It goes without saying that this adds even more complexity, but it is still a way to use the right tool for the right job.

So what’s our take-away from the conference? At least a lot of interesting ideas and thoughts. Some of the presentations were a bit off topic from our perspective, but all in all I still learned a lot. I strongly encourage everyone to go to a similar conference themselves. As a NoSQL novice, I found that seeing things happen at the conference and meeting the people behind the movement was a real eye-opener, and it also increased my interest in the area. There seems to be much hype around NoSQL, and much of it even seems to be well deserved. NoSQL and MongoDB provide good tools for building highly scalable architectures, but they come at a price: handling massive amounts of users and data will always be complex and non-trivial. To quote my colleague Mike from our conference retrospective, “with big amounts of data even small glitches easily turn into massive problems”.