Guest Post: Unshackling Your Data-driven Web Apps

Written by Ines Sombra, Data Engineer, Engine Yard

It is so easy to start a fight. Ask your friends whose legacy is more timeless: the Beatles or the Rolling Stones. If you’re watching a soccer game, declare that Cristiano Ronaldo is highly overrated. Or, if you’re at a database conference, tell those relational database types that Web apps don’t all need to conform to ACID principles (atomicity, consistency, isolation, and durability).

Yes, the fur will fly. But you’ll have broached a topic that urgently needs more attention. The fact is, the evolution of application design has not kept pace with the explosion of data. In today’s “connected era,” in which billions of devices are now generating, collecting, analyzing, and sharing massive volumes of data with each other, a company’s ability to innovate is largely dependent on its ability to process all this data in a timely fashion. And one frequent obstacle to innovation is the long-standing notion that only ACID-compliant databases will satisfy the stringent requirements of data-driven Web apps. The time has come to investigate other options.

The Limitations of ACID

ACID means that database changes are “all or none” (atomic), that any transaction takes the database from one consistent state to another (consistent), that concurrent changes and reads don’t interfere with one another (isolated), and that once a change is committed it persists, even after a crash (durable). These guarantees come with performance costs, and applications that don’t require every ACID property can trade some of them for higher throughput. Such applications often include those that index large numbers of documents, serve pages on high-traffic websites, or deliver streaming media. In these cases, maintaining the integrity of a transaction is secondary to completing a request as quickly as possible.
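To make the “all or none” property concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The accounts table, the account names, and the make_db, transfer, and balances helpers are assumptions invented for this illustration. The sketch credits one account and then debits another inside a single transaction; if the debit would break the balance rule, the credit is rolled back as well.

```python
import sqlite3

def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-none unit."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

if __name__ == "__main__":
    conn = make_db()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails partway through, yet the balances are unchanged afterward. That bookkeeping is the kind of work an ACID database does on every write, and it is part of the cost that throughput-oriented applications may choose not to pay.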

Strict adherence to ACID also means more development work as database load and data volume grow. Techniques such as sharding (horizontally partitioning your data, often by hashing a key) have emerged over time to address the complications of manipulating large amounts of data, but these solutions unnecessarily influence application architecture and don’t solve the underlying problem. For example, if you’re “lucky” enough to have “big data” and you need to change the schema, a migration could still take months to run!
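As a rough illustration of hash-based sharding, the sketch below maps each key to a shard by hashing it and taking the result modulo the number of shards. The function names shard_for and moved_keys, the "user:N" key format, and the shard counts are assumptions chosen for the example. Real systems often use more elaborate schemes, such as consistent hashing.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def moved_keys(keys, old_shards, new_shards):
    """Return the keys whose shard changes when the shard count changes."""
    return [
        k for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    ]

if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(1000)]
    moved = moved_keys(keys, 4, 5)
    print(f"{len(moved)} of {len(keys)} keys move when going from 4 to 5 shards")
```

With this naive modulo scheme, adding a single shard relocates most keys. The application has to know about that routing and plan for the data movement, which is one way sharding ends up shaping the architecture.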

Figure: Nathan Hurst’s visual guide to NoSQL systems

Beyond ACID

The ACID vs. non-ACID debate is not going to be settled here and now, nor does it need to be. The purpose of this article is simply to point out that there are alternatives to ACID, and that the time has come to explore them. Why now? Several methodologies, practices and technologies have gained enough traction to encourage us to look beyond ACID-only data stores. Here are a few of them:

●      Widespread adoption and validation of agile development methodologies, fostering rapid innovation and an iterative approach to feature development.

●      Mature development frameworks that enable developers to connect alternative data stores more easily. This also means that projects are more likely to use different types of data stores depending on technical requirements.

●      Service-oriented application design – data stores and application logic are increasingly decoupled in modern architectures, with applications developed as a collection of functional modules connected via APIs.

●      The vast amounts of data gathered by social platforms – and most applications today have a social component – can often be represented more efficiently using a non-relational schema.

●      Non-relational databases have matured with the guidance of early adopters and companies whose businesses would not be possible without them. Companies have also emerged to provide commercial support and development for these technologies.

The New Breed of Databases

Thanks to the open source movement and the accelerating pace of commercial development, there is no shortage of non-relational, distributed databases. Of course, each database has advantages and disadvantages, as well as specific use cases.

Before we delve into specific examples, let’s be clear about one key point: it’s a mistake to assume that non-ACID datastores make absolutely no guarantees. They simply provide a different set of them. These guarantees are governed by the “CAP Theorem,” which essentially states that you may choose any two of the following three:

●      Consistency: all nodes have the same view of the data.

●      Availability: every request to a non-failing node returns a response.

●      Partition tolerance: system properties (consistency or availability) hold true even when the system is partitioned.

For example, you can choose to have consistency and availability, while sacrificing partition tolerance. Or choose availability and partition tolerance, so processing can continue even in the case of network failure, but decide to forgo a consistent view of your data.
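The trade-off can be sketched with a toy model. The TwoReplicaStore class below is purely hypothetical and heavily simplified: it keeps two in-memory copies of the data and a flag that simulates a network partition. In "CP" mode it refuses writes while partitioned, giving up availability to keep both copies identical. In "AP" mode it keeps accepting writes on whichever replica is reachable and lets the copies diverge. Reconciling the copies after the partition heals is left out.

```python
class Unavailable(Exception):
    """Raised when a CP-style store refuses a write during a partition."""

class TwoReplicaStore:
    """Toy model of two replicas; not a real replication protocol."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True simulates a network split

    def write(self, replica, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for copy in self.replicas:
                copy[key] = value
            return
        if self.mode == "CP":
            # Stay consistent by refusing the write.
            raise Unavailable("cannot reach the other replica")
        # AP: stay available by accepting the write locally.
        self.replicas[replica][key] = value

    def read(self, replica, key):
        return self.replicas[replica].get(key)

if __name__ == "__main__":
    ap = TwoReplicaStore("AP")
    ap.write(0, "cart", "1 item")
    ap.partitioned = True
    ap.write(0, "cart", "2 items")
    print(ap.read(0, "cart"), "|", ap.read(1, "cart"))
```

In the AP run, the two replicas return different answers for the same key until something reconciles them. That is the “forgo a consistent view of your data” choice described above.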

An excellent summary of this concept, along with a very intuitive and useful visual guide to non-ACID systems, has been developed by Nathan Hurst and others.

An Overview of the Current Players:

The list below summarizes key characteristics and use cases of some of the most popular non-ACID datastores.

●      Riak is a Dynamo-based NoSQL database designed specifically for extreme distribution, fault tolerance and scalability. It shines in applications where even seconds of downtime are unacceptable. It has no single point of failure, scales simply and intelligently, and makes data highly available for use in read and write-intensive Web applications. Commercial support is available from Basho.

●      Redis, sponsored by VMware, is an open source, disk-backed, in-memory data store written in C. It is a datatype server, so it provides highly optimized operations on strings, hashes, lists, sets, and sorted sets. Redis is extremely fast and easy to set up; it may not be best for large databases, but it’s a great choice for rapidly changing data with a DB size that fits in RAM.

●      MongoDB is a document-oriented database written in C++. It provides schema-free databases that store JSON-like documents in a binary format (BSON). It is extremely popular and remarkably easy to get running. MongoDB offers automatic failover when replicated in a set, and its single-master, low-concurrency read performance benchmarks are impressive. MongoDB is a good choice for read-heavy applications where all data fits in RAM. Commercial support is available from 10gen.

●      CouchDB: Another open source, document-oriented store, CouchDB is written mostly in Erlang and is designed for local replication and horizontal scaling across a wide range of devices. Like MongoDB, CouchDB is easy to use, but it has a more robust replication model and greater data consistency guarantees than MongoDB.

●      Membase/Couchbase: Couchbase Server is the result of a Memcached company and a CouchDB company joining forces. In its Membase mode, it is optimized for storing data for highly interactive Web applications. It provides a high-speed distributed key/value store that is extremely performant and very easy to set up. In its Couchbase mode, it offers the same data persistence, clustering, and flexible replication modes as CouchDB. Commercial support is available from Couchbase.

●      Neo4j is a popular open source database that is optimized to represent graph relationships. Implemented in Java, it is an embedded, disk-based, fully transactional persistence engine that stores data structured in graphs rather than tables, which means that operations that traverse a network are fast. Neo4j is developed by Neo Technology, a startup based in Malmö, Sweden and Menlo Park, CA.

●      Cassandra is a column-based DB designed to handle very large amounts of data spread out across many servers with no single point of failure. Cassandra is a NoSQL solution initially developed by Facebook to power its Inbox Search feature.[1] An industrial-strength DB, it is best for write-heavy applications. Cassandra can be a bit cumbersome to set up and manage, but commercial versions make this task easier. Commercial support is available from DataStax and Acunu.

●      Hadoop is an open source software framework with an entire ecosystem of tools, languages and knowledge. Inspired by Google’s MapReduce and Google File System (GFS) papers, Hadoop is best for heavy analytics and processing of vast amounts of data. It is extremely mature and battle-tested. Commercial support is available from Cloudera, Hortonworks and MapR, among others.

Simplifying Exploration of New Database Options

Without question, ACID is an essential requirement for certain types of applications and will continue to find usage. But there is an increased understanding of the potential of non-ACID databases and the opportunities they provide in improving performance in Web applications that don’t require transactional guarantees.

As the evidence continues to accumulate regarding the benefits of non-ACID databases, it’s important to ask yourself, as part of your requirements analysis, whether one of these databases might be a better tool for the job at hand. You should also consider whether a combination of tools is the best solution, mixing ACID and non-ACID data stores as appropriate.
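One way to picture such a combination is the sketch below. Everything in it is an assumption made for illustration: the Shop class, the orders table, and the use of an in-process Counter as a stand-in for a fast key/value store. Orders go through a transactional store because a half-written order is unacceptable, while page-view counts go to the lightweight store because an occasional lost increment is tolerable.

```python
import sqlite3
from collections import Counter

class Shop:
    """Hypothetical app that routes each kind of data to a suitable store."""

    def __init__(self):
        # Transactional (ACID) store for data that must never be half-written.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE orders ("
            "id INTEGER PRIMARY KEY, "
            "item TEXT NOT NULL, "
            "qty INTEGER NOT NULL CHECK (qty > 0))"
        )
        # Stand-in for a non-ACID key/value store holding high-volume counters.
        self.page_views = Counter()

    def place_order(self, item, qty):
        """Insert an order inside a transaction and return its id."""
        with self.db:  # commit on success, roll back on error
            cur = self.db.execute(
                "INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty)
            )
        return cur.lastrowid

    def record_view(self, page):
        """Bump a counter; no transaction, no durability guarantee."""
        self.page_views[page] += 1

    def order_count(self):
        return self.db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

if __name__ == "__main__":
    shop = Shop()
    shop.record_view("/home")
    order_id = shop.place_order("book", 2)
    print(order_id, shop.order_count(), dict(shop.page_views))
```

The requirements analysis then becomes a per-dataset question: which guarantees does this particular piece of data need, and which store provides them at the lowest cost?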

Learn More: Suggested Reading and Upcoming Events

For additional discussion of the topics covered in this article, please read the following publications and blogs.

●      Alex Popescu’s myNoSQL: http://nosql.mypopescu.com/

●      NoSQL Tapes: http://nosqltapes.com/

●      Basho’s video resources (http://basho.com/resources/videos/). The Basho speakers are fantastic. You’ll always learn something new from them.

●      InfoQ NoSQL presentations: a great collection.

●      10gen’s videos and presentations: http://www.10gen.com/presentations

●      Kristóf Kovács’s NoSQL comparison: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

In addition, the following events and meetups are highly recommended:

●      Boundary tech talks (great place to learn about distributed systems): http://www.meetup.com/Boundary-Tech-Talks/

●      Riak Meetup: http://www.meetup.com/San-Francisco-Riak-Meetup/

●      MongoDB Meetups: http://www.meetup.com/San-Francisco-MongoDB-User-Group

●      Cassandra Meetups: http://www.meetup.com/San-Francisco-Cassandra-User-Group/

●      HBase: http://www.meetup.com/hbaseusergroup/

●      SF Graph DBs: http://www.meetup.com/graphdb/

●      Couchbase: http://www.meetup.com/The-San-Francisco-Couchbase-Meetup-Group/

{Editor’s Note: This is a guest post by Ines Sombra, Data Engineer, Engine Yard.

Ines Sombra is a data engineer at Engine Yard, where she and her team are investing heavily in solutions that help customers create, deploy and scale Big Data Web apps. Ines is a co-organizer of the Dallas Ft. Worth Big Data group and a member of RailsBridge in San Francisco. }
