Your Guide to Big Data Jargon

Jargon is the secret language of those who are “in the know”: specialized terminology for a group, profession, event, or activity. Those who are not “in the know” have no clue, which can be a big problem – especially when you are the one in charge. Few disciplines have as much jargon as technology. From consultants discussing synergy to vendors describing new offerings, “talking tech” can be like speaking another language. One area where the words are flying particularly fast is big data. This is a brief primer on the key terms.

Speaking Big Data

Big Data: Big data is information that organizations collect and analyze, often in real time, to understand their customers and markets and to make better decisions. That data is increasing in volume, velocity and variety, making it more difficult for organizations and existing products to handle. There is no specific measurement or threshold that qualifies data as big, so many experts say data becomes big when it can no longer be easily managed by traditional data management tools. Big data can provide new business insights, but it can impact everything from data management tools to storage strategies.

Structured Data: According to PC Magazine, structured data is “data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data.” The data doesn’t have to live in a fixed location, though: semi-structured formats such as XML files also qualify, because the data in them is “tagged and can be accurately identified.” Traditionally, IT has focused on managing structured data, but with the rise of social media and other Web 2.0 technologies, unstructured data is becoming an important source for business insight.
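To make the idea concrete, here is a minimal sketch of structured data in a relational database, using Python’s built-in sqlite3 module. The table name, columns and rows are invented for illustration.

```python
import sqlite3

# Structured data lives in fixed, named fields: every record
# has the same columns, so queries can target them precisely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'London')")
conn.execute("INSERT INTO customers VALUES (2, 'Grace', 'New York')")

# A query over a fixed field -- something unstructured data can't offer.
rows = conn.execute("SELECT name FROM customers WHERE city = 'London'").fetchall()
print(rows)  # [('Ada',)]
```

Contrast this with a folder of emails or videos: there is no `city` column to query, which is exactly what makes unstructured data harder to analyze.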

Unstructured Data: Data that falls outside the purview of an organization’s “traditional” databases. Unstructured data includes images, videos, email, documents and text. Businesses are increasingly examining content like emails, social media activity and customer comments to uncover trends and consumer sentiment. ZDNet believes that “organizations that do get their arms around this data will gain significant competitive edge.” However, taking advantage of unstructured data can require new platforms and technology skills.

Big Data Analytics: Big data analytics refers to the practice of examining large volumes of diverse data to find patterns, unknown correlations and other useful information. The enormous volumes and varieties of data examined (e.g., web or call logs) may fall outside the reach of conventional business intelligence initiatives, yet can help businesses make better decisions. However, traditional tools are increasingly unable to handle the processing requirements of big data. These needs have spawned an entirely new generation of technology, including NoSQL, Hadoop and MapReduce.

Cloud: The definition of “the Cloud” is ambiguous. It can range from the narrow (“an updated version of utility computing: basically virtual servers available over the Internet,”) to the extremely broad (“anything you consume outside the firewall is in ‘the cloud’”).  To those not in the IT field, the Cloud is that seemingly invisible location where you can store music, photos, videos, and files without taking up extra space on your hard drive. Cloud computing plays a large role in big data by providing a virtually infinite pool of resources to store and process data.

Big Data Apps: Also known as BDAs, these applications take the information gathered through big data efforts and turn it into easy-to-consume visualizations. This is a broad category that can include everything from end-user analytics to new data management tools like Hadoop.

Data Cholesterol: A condition that affects computers much as cholesterol affects humans—the “excessive buildup of data leads to sluggishness across your systems,” hindering the system’s ability to function.

Data Mining: The process of analyzing data to extract useful information that an organization can then use to cut costs, increase efficiency, and better serve customers. Data mining has grown in popularity as businesses realize that previously untapped data sources may hold secrets that provide a competitive edge.
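As a toy illustration of the kind of pattern data mining looks for, the sketch below counts which pairs of items appear together most often in a set of purchases—a crude form of association mining (“people who buy X also buy Y”). The transaction data is invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; item names are made up for illustration.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

# Count how often each pair of items is bought together.
pairs = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pairs[pair] += 1

print(pairs.most_common(1))  # [(('bread', 'milk'), 3)]
```

Real data mining runs ideas like this over millions of records, but the principle is the same: surface correlations a human would never spot by eye.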

ETL (Extract, Transform, Load) Modeling Tools: ETL modeling tools are programs that let users extract the information they need from their data and outside sources, transform that information to fit the operational needs of the organization, and load it into the end target (usually a database).
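The three ETL steps can be sketched in a few lines of Python. The CSV content, column names and target table here are invented stand-ins for a real source and warehouse.

```python
import csv
import io
import sqlite3

# Extract: read records from a (hypothetical) source file.
raw = "name,amount\nAda, 100 \nGrace, 250 \n"
records = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean up whitespace and convert amounts to integers.
cleaned = [(r["name"], int(r["amount"].strip())) for r in records]

# Load: write the cleaned rows into the end target, here a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

Commercial ETL tools wrap exactly this pipeline in graphical modeling interfaces, schedulers and connectors for dozens of source systems.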

Petabyte: 1,000,000,000,000,000 bytes, or one million gigabytes.

Exabyte: 1,000,000,000,000,000,000 bytes, or one billion gigabytes.

Zettabyte: 1,000,000,000,000,000,000,000 bytes, or one trillion gigabytes. No single storage system can hold this amount of data, yet the International Data Corporation estimates that the total amount of global data will reach 2.5 zettabytes this year.
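The arithmetic behind these units is simple: each step up is a factor of 1,000 in the decimal (SI) definitions used above.

```python
# Decimal (SI) byte units: each is 1,000x the previous power of ten.
gigabyte = 10**9
petabyte = 10**15
exabyte = 10**18
zettabyte = 10**21

print(petabyte // gigabyte)   # 1000000   (one million gigabytes)
print(exabyte // gigabyte)    # 1000000000 (one billion gigabytes)
print(zettabyte // gigabyte)  # 1000000000000 (one trillion gigabytes)
```

Note that operating systems sometimes report sizes in binary units (powers of 1,024 rather than 1,000), so the numbers on screen can differ slightly from the SI figures quoted here.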

Hadoop/MapReduce: No big data discussion is complete without mentioning Hadoop. The open source technology has existed for only a few years, but in that time it has skyrocketed in popularity. Hadoop grew out of MapReduce and distributed file system concepts published by Google, was developed in the open source community with heavy early backing from Yahoo, and is now managed by the Apache Software Foundation. It allows organizations to use commodity hardware to process extremely large volumes of data more quickly than has been historically possible. Internet giants from Twitter to Yahoo leverage Hadoop to uncover answers in their massive data pools.
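The MapReduce model Hadoop implements can be sketched in plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The classic example is counting words; the input text below is invented, and a real Hadoop job would spread these phases across many machines.

```python
from collections import defaultdict

documents = ["big data is big", "data tools for big data"]

# Map: emit a (word, 1) pair for every word in a document.
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

# Reduce: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["big"], counts["data"])  # 3 3
```

The power of the model is that map and reduce are independent per key, so Hadoop can run thousands of these tasks in parallel on commodity hardware.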

NoSQL: Although many believe NoSQL implies no more SQL, the term actually means “not only SQL.” NoSQL is a new class of data management tools that abandons traditional relational data management principles and allows organizations to store massive amounts of diverse data, in many cases with no schema. The term NoSQL has existed since 1998, but it has only recently grown in popularity as businesses embrace the tools. NoSQL data stores include everything from key-value stores to object databases, and few standards exist in the space. NoSQL databases often lack relational database features like ACID transactions, but can easily scale across multiple machines. Open source projects like MongoDB, Cassandra and HBase have gained increasing interest in recent years. The market is still immature, and many products don’t yet have the stability or support necessary for enterprise use, but vendors are moving rapidly to close those holes.
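The schema-less, key-value idea behind many NoSQL stores can be illustrated with a plain Python dict as a stand-in; the keys and records below are invented. Real stores like MongoDB, Cassandra and HBase add persistence, replication and horizontal scaling on top of this basic shape.

```python
# A dict as a toy key-value store: no schema is declared anywhere.
store = {}

# Unlike rows in a relational table, each record can have a different shape.
store["user:1"] = {"name": "Ada", "email": "ada@example.com"}
store["user:2"] = {"name": "Grace", "roles": ["admin"], "last_login": "2013-05-01"}

# Lookups are by key, not by a query over fixed columns.
print(store["user:2"]["roles"])  # ['admin']
```

The flexibility is the trade-off: there is no schema to validate data against and, typically, no joins or multi-record ACID transactions.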

In-Memory Databases: A class of databases that store data in main memory instead of on disk, eliminating the latency associated with disk I/O and making them much faster than traditional databases. In-memory databases have existed for years, but the falling price of memory has renewed interest and led to the emergence of products like SAP HANA and Oracle TimesTen. Enterprises are taking notice because the tools promise performance improvements that can be a thousand times faster than traditional repositories.
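SQLite’s `:memory:` mode is a convenient, small-scale stand-in for the concept: the entire database lives in RAM, so no disk I/O is involved. The table and rows are invented for illustration.

```python
import sqlite3

# ":memory:" creates a database that exists only in RAM -- close the
# connection and it is gone, which is the core in-memory trade-off.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [(i, "click") for i in range(1000)])

count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 1000
```

Products like SAP HANA and Oracle TimesTen pair this speed with durability mechanisms (logging, snapshots, replication) so data survives a power loss, which a bare in-memory store does not.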

NewSQL: Relational databases and NoSQL alone are still not enough to store and manage continually shifting and growing data stores, so yet another technology has emerged: NewSQL, which includes products like SQLFire and StormDB. NewSQL databases can be thought of as a hybrid between NoSQL and traditional SQL-based relational databases. They provide the scalability and performance of NoSQL systems, but also offer the ACID guarantees of a traditional database system. NewSQL systems generally target use cases that involve:

  • a large number of short-lived transactions 
  • the same queries repeatedly with different inputs
  • indexed information in a single structure (no complex joins/full table scans)
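The workload described above can be sketched with Python’s sqlite3 standing in for a NewSQL engine: many short-lived ACID transactions, the same parameterized query with different inputs, and lookups on an indexed key. The table, account IDs and balances are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# PRIMARY KEY gives an indexed, single-structure lookup -- no joins needed.
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

def transfer(src, dst, amount):
    # One short-lived ACID transaction: debit one indexed row, credit
    # another. "with db" commits on success and rolls back on error.
    with db:
        db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                   (amount, src))
        db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                   (amount, dst))

transfer(1, 2, 30)
print(db.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(70,), (80,)]
```

A NewSQL system runs this same pattern, but distributed across many nodes while still guaranteeing that the debit and credit commit or fail together.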

While this is only a small fraction of the jargon that is associated with the quickly growing field of big data, these terms, when used appropriately, will allow even a big data novice to sound like a seasoned professional.