UPDATED 18:27 EST / APRIL 08 2014

Twitter is developing multi-tenancy database to support 6,000 tweets per second

twitter bird caged cloudAs Twitter is now considered the global platform for public conversation, the company’s storage requirements have grown as well. In recent days, Twitter experiences something on order of 5,000 to 10,000 tweets a second (see the Twitter blog) from more than 240 million Twitter accounts (Twitter’s investor report.)

Over the last few years, the company found itself in need of a storage system that could serve millions of queries per second, with extremely low latency in a real-time environment. Availability and speed of the system became the utmost important factor. Not only did it need to be fast; it needed to be scalable across several regions around the world.

This is why at this time, an improvement plan the IT architecture in the long term is drawn. The company prioritizes infrastructure investments around enhancing the simplicity, scalability and performance of its database portfolio.

The Manhattan database system

 .

Last week, Twitter published a blog post detailing its Manhattan database system, built to power a wide variety of applications. The distributed, real-time database was designed to serve multiple teams and applications within the company that existing technologies can no longer handle.

The Manhattan database system is built to cope with the roughly 6,000 tweets, retweets per tweet and replies that flood into its system every second. Manhattan was also built in-house based on Twitter’s internal needs and the company felt as if it could not do that without building a new system from the ground up to operate reliably at scale. The company currently uses open source databases MySQL and Cassandra to run its massive online empire.

“We were spending far too much time firefighting production systems to meet the performance expectations of our various products, and standing up new storage capacity for a use case involved too much manual work and process. Our experience developing and operating production storage at Twitter’s scale made it clear that the situation was simply not sustainable,” Twitter’s program operative Peter Schuller wrote on its website.

Manhattan, which began development two years ago, is designed to be an all-in-one solution. Twitter’s Manhattan storage service is meant to be consumed just like any other cloud storage service. Currently, it uses key-value store to interact with users, but Twitter is looking to scale the database by introducing a graph-based capability.

The architecture

 .

hadoop_logo.jpgThe database system consists of three storage engines that are designed for read-only Hadoop data, write-heavy and read-heavy data, respectively. The engines are powered by numerous services including strong consistency service, allowing customers to have strong consistency when doing certain sets of operations; time-series counters service to handle high volume time-series counters in Manhattan; and for importing Hadoop data, ensuring strong consistency and counting time-series data.

The data captured by the social network is stored on three different systems. The first system is called seadb, which is a read only file format; the second on is sstable, which is a log-structured merge tree for heavy-write workloads; and lastly btree, which is a heavy-read and light-write system.

Manhattan automatically matches the incoming data based on the file format. The output of the workloads is then fed into its Hadoop File System, and Manhattan transforms that information into seadb files so they can then be imported into the cluster for fast serving from SSDs or memory.

Developers can select the consistency of data when reading from or writing to Manhattan, allowing them to create new services with varying tradeoffs between availability and consistency. The company also developed internal APIs to expose this data for cost analysis, which allows developers to determine what use cases are costing the business the most, as well as which ones aren’t being used as often.

“Engineers can provision what their application needs (storage size, queries per second, etc.) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up,” Schuller wrote.

Looking ahead

 .

As for further development, Twitter plans to release a white paper outlining even more technical details on Manhattan. The company is also working to implement secondary indexes that will help developers add an additional set of range keys to the index for a database. The secondary indexes will further speed the database queries, and developers can navigate through large amounts of data.

“The challenges are increasing and the number of features being launched internally on Manhattan is growing at rapid pace. Pushing ourselves harder to be better and smarter is what drives us on the Core Storage team,” writes Peter Schuller by way of conclusion about the Manhattan database. See Twitter’s full rundown of the database for further reading.

Given a company’s gusto for open source, it wouldn’t be startling if it open sourced Manhattan in the coming days. The company recently contributed code to Facebook’s WebScaleSQL open source project to create the perfect database designed to scale to massive proportions.

Photo: HowToStartABlogOnline.net

A message from John Furrier, co-founder of SiliconANGLE:

Your vote of support is important to us and it helps us keep the content FREE.

One click below supports our mission to provide free, deep, and relevant content.  

Join our community on YouTube

Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.

“TheCUBE is an important partner to the industry. You guys really are a part of our events and we really appreciate you coming and I know people appreciate the content you create as well” – Andy Jassy

THANK YOU