Google has finally taken the wraps off its Cloud Dataflow offering, which is designed to allow developers lacking in Hadoop skills to build sophisticated analytic “pipelines” capable of processing extremely large datasets.
Cloud Dataflow was first introduced last summer, with Google touting it as a next-generation service for building systems that can ingest, transform, normalize, and analyze huge amounts of data, well into the exabyte range. Google had previously been accepting applications for a private alpha of the service, but now anyone can try the data processing system in beta mode. The service relies on Google’s FlumeJava and MillWheel technologies to process and move data within the hosted platform, and there’s not a trace of MapReduce to be seen.
As Google explained last year, the idea behind Dataflow is a simple one: By hiding the complexity of Hadoop behind a bunch of straightforward APIs and SDKs, and hosting everything in Google’s cloud, it enables just about anyone to make use of Big Data analytics, something that’s been the private domain of data scientists up until now.
“Today, nothing stands between you and the satisfaction of seeing your processing logic, applied in streaming or batch mode (your choice), via a fully managed processing service,” wrote Google product manager William Vambenepe in a blog post. “Just write a program, submit it and Cloud Dataflow will do the rest. No clusters to manage, Cloud Dataflow will start the needed resources, autoscale them (within the bounds you choose) and terminate them as soon as the work is done.”
Dataflow relies on Google’s Compute Engine cloud service to provide the raw computing power, while Google Cloud Storage and BigQuery are employed to store and access the data. In short, it makes use of several of the main components found in Google’s Cloud Platform, which competes with Amazon Web Services and Microsoft Azure.
Besides the Dataflow news, Google simultaneously announced an update to its BigQuery service, which provides a Structured Query Language (SQL) interface to help developers delve into large sets of unstructured data. SQL is one of the most common programming languages, used by almost all traditional relational databases, which means it’s well understood by the vast majority of database managers.
With the update, Google has enhanced BigQuery so it can now ingest up to 100,000 rows per second per table. In addition, Google is at last making the service available to European customers. BigQuery data can now be stored in Google’s European-based data centers, which means companies there will now be able to adhere to the EU’s strict data sovereignty regulations. Finally, Google has added new row-level permissions to BigQuery, which can be used to limit data accessibility based on user credentials. This means users can protect sensitive data such as people’s names and addresses while allowing access to other details, for example customers’ anonymized purchasing histories.