Boston-based software company Sqrrl Data Inc. has been busy lately, building a business around a much-needed market: security for Big Data solutions. With a fresh round of funding and a history of building solutions for government agencies, Sqrrl has a great deal to teach.
The $5.2 million raised in Sqrrl’s recent funding round will be used to accelerate its growth plans and pursue its market opportunity in Big Data security, a sector Sqrrl believes is booming and will soon draw the attention of CIOs and cybercriminals alike.
We’ve been following the startup’s progress over the past year, taking every opportunity to learn from Sqrrl the technical details of security for Big Data. Research firm Wikibon hosted a four-part Whiteboard Series on Apache Accumulo which featured Sqrrl’s founding CTO Adam Fuchs. We’ve created a special collection based on that Whiteboard Series below.
Accumulo is a highly secure, disk-based key-value store that builds on the design of Google’s BigTable storage system and adds innovations Fuchs and his colleagues developed as part of their work for the NSA. It uses a data structure known as the log-structured merge tree to rapidly sort randomly ordered key-value pairs using as little disk space as possible.
So what lessons did Sqrrl impart in this four-part Whiteboard series?
4 Lessons in Big Data Security
- Lesson One: Start small and design for scalability
Fuchs worked at the National Security Agency and one of the lessons he learned there was to create applications that are designed for scalability.
In the past he has seen apps built in various ways. One approach is to gather everything needed to make the app work and spend a huge amount of time on it before finally releasing it. Another is the prototyping approach, in which you build an app and get it to market quickly but later have to take it offline for a redesign, which stalls its growth for some time.
Then there’s the Sqrrl way, in which apps are launched quickly but are designed for scalability from the start, so they don’t need to be taken offline in order to scale. This keeps adoption of the app going without interruption.
- Lesson Two: Cell-Level Big Data Security Controls
Organizations have a difficult time bringing together huge amounts of data for analysis because of safety and security issues, but Sqrrl has found a way to secure Big Data environments.
According to Fuchs, Sqrrl’s cell-level security capabilities can overcome Big Data security issues by applying access controls to every data object. These controls can be integrated with an application’s authorization system, user attributes, internal information, system security policies, auditing and enterprise authentication.
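To make the idea of per-object access controls concrete, here is a minimal Python sketch, not Sqrrl’s or Accumulo’s actual implementation. It assumes each cell carries a visibility expression built from labels joined with `&` (and), `|` (or) and parentheses, and that a scan returns only the cells whose expression is satisfied by the user’s authorizations. The function names, labels and sample data are hypothetical, and treating an empty expression as visible to everyone is a choice made for this sketch.

```python
import re


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    if not tokens:
        return True  # sketch assumption: an unlabeled cell is visible to all
    pos = 0

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        return token in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            # This sketch requires parentheses when mixing the two operators.
            if operator is not None and tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token in expression")
    return result


# Hypothetical cells: (row, column, visibility expression, value)
cells = [
    ("patient42", "name", "", "Alice"),
    ("patient42", "diagnosis", "doctor|nurse", "flu"),
    ("patient42", "billing", "finance&(manager|auditor)", "$120"),
]


def scan(cells, authorizations):
    """Return only the cells the caller is allowed to see."""
    return [
        (row, column, value)
        for row, column, visibility, value in cells
        if is_visible(visibility, authorizations)
    ]
```

Because the check runs per cell, two users scanning the same row can receive different subsets of it: a caller holding only `nurse` would see the name and diagnosis, while one holding `finance` and `auditor` would see the name and billing cells.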
- Lesson Three: Near Real-time Performance
Accumulo’s secret to delivering near real-time performance lies in merging the data held in its tablets into a unified stream of key-value pairs, which makes the data easily accessible to users.
Fuchs explains that Accumulo is made up of tablets where incoming data is partitioned. Incoming data is fed into an in-memory map and then replicated onto HDFS to maximize availability. The latter process involves buffering information into sequential streams that are flushed to disk as soon as they “fill up.”
He goes on to explain that the amount of latency is proportional to the number of tablets, but is greatly reduced by the major compaction that the platform carries out in the background. This operation merges the data into a globally sorted file that iterators can then read through directly.
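The write path and compaction described above can be illustrated with a toy log-structured merge structure in Python. This is a simplified sketch under stated assumptions, not Accumulo’s code: the class name `ToyTablet`, the `memory_limit` parameter and the use of in-process lists in place of files on HDFS are all hypothetical. Writes land in an in-memory map, a full map is flushed as a sorted run, a scan merges every sorted run into one ordered stream, and a major compaction rewrites the runs into a single sorted file.

```python
import heapq


class ToyTablet:
    def __init__(self, memory_limit=3):
        self.memory_limit = memory_limit
        self.in_memory_map = {}
        self.files = []  # sorted runs of (key, value), oldest first

    def write(self, key, value):
        self.in_memory_map[key] = value
        if len(self.in_memory_map) >= self.memory_limit:
            self.flush()

    def flush(self):
        """Write the in-memory map out as one sorted run."""
        if self.in_memory_map:
            self.files.append(sorted(self.in_memory_map.items()))
            self.in_memory_map = {}

    def scan(self):
        """Merge all sorted sources into one sorted stream; newest value wins."""
        sources = self.files + [sorted(self.in_memory_map.items())]
        # Tag entries with the negated source age so that, for equal keys,
        # the entry from the newest source comes out of the merge first.
        tagged = [
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        last_key = object()  # sentinel that equals no real key
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                yield key, value
                last_key = key

    def major_compaction(self):
        """Rewrite every run into a single globally sorted file."""
        self.flush()
        self.files = [list(self.scan())]


tablet = ToyTablet(memory_limit=2)
for key, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4), ("d", 5)]:
    tablet.write(key, value)

before = list(tablet.scan())
files_before = len(tablet.files)
tablet.major_compaction()
after = list(tablet.scan())
```

In this sketch a scan has to merge one stream per run, so the work per read grows with the number of runs; after `major_compaction` there is only one run to read, which is the intuition behind the latency reduction described above.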
- Lesson Four: How to Bring Structure to a Schema-less Database
In this lesson Fuchs explains that Accumulo limits querying to a range within a keyspace, and that range represents a hierarchical structure which follows a row, column, and timestamp format. The row determines how the data is partitioned in the database, the column defines vertical partitioning within the row, and the qualifier denotes the uniqueness of the value stored in the key-value pair. A user can search a specific row, a particular column family within a row, and any value or set of values that may be associated with it.
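A range query over hierarchical keys can be sketched in a few lines of Python. This is an illustration only: it assumes keys are tuples of (row, column family, column qualifier, timestamp) kept in sorted order, and it uses the standard `bisect` module to locate the start and end of a range. The sample rows and the helper names `scan_range`, `scan_row` and `scan_family` are hypothetical, and the sketch sorts timestamps in plain ascending order as a simplification.

```python
import bisect

# Hypothetical entries: ((row, family, qualifier, timestamp), value)
entries = sorted([
    (("user1", "contact", "email", 1), "a@example.com"),
    (("user1", "contact", "phone", 1), "555-0100"),
    (("user1", "orders", "0001", 1), "book"),
    (("user2", "contact", "email", 1), "b@example.com"),
])
keys = [key for key, _ in entries]


def scan_range(start_key, end_key):
    """Return entries with start_key <= key < end_key."""
    low = bisect.bisect_left(keys, start_key)
    high = bisect.bisect_left(keys, end_key)
    return entries[low:high]


def scan_row(row):
    """All entries in one row: a range covering every key with that row."""
    return scan_range((row,), (row + "\x00",))


def scan_family(row, family):
    """All entries in one column family of one row."""
    return scan_range((row, family), (row, family + "\x00"))
```

Because the keys are sorted with the row first, every lookup in this sketch, whether for a whole row or for one column family within it, reduces to reading a contiguous slice of the sorted data.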
To optimize NoSQL databases, Fuchs said, the secret lies in pairing a document table with an inverted index. The document table is organized by universally unique identifiers (UUIDs); each identifier maps to a document’s fields, which in turn contain values that can be retrieved by querying the IDs. The inverted index maps values back to those IDs, so this table design enables users to run queries based on the characteristics of a document (that is, the value or parts of the value they’re looking for) rather than its identifier.
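The pairing of a document table with an inverted index can be sketched as follows. This is a minimal in-memory illustration, not Sqrrl’s schema: the names `document_table`, `inverted_index`, `add_document` and `search` are hypothetical, and splitting values into lowercase words is an assumption made for the example.

```python
import uuid
from collections import defaultdict

document_table = {}                 # document UUID -> {field: value}
inverted_index = defaultdict(set)   # (field, term) -> set of document UUIDs


def add_document(fields):
    """Store a document under a fresh UUID and index each word of each field."""
    doc_id = str(uuid.uuid4())
    document_table[doc_id] = fields
    for field, value in fields.items():
        for term in str(value).lower().split():
            inverted_index[(field, term)].add(doc_id)
    return doc_id


def search(field, term):
    """Find documents by content: look up IDs in the index, then fetch them."""
    doc_ids = inverted_index.get((field, term.lower()), set())
    return [document_table[doc_id] for doc_id in doc_ids]


first_id = add_document({"title": "Accumulo security basics", "author": "Fuchs"})
second_id = add_document({"title": "Scaling Accumulo tables", "author": "Smith"})
```

With both tables in place, `search("title", "accumulo")` finds both documents without the caller knowing either UUID, while `document_table[first_id]` remains the direct lookup by identifier.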