At Flink Forward, Ververica evolves Apache Flink into a unified data platform
Stream computing is a key platform for a growing range of data-rich, low-latency applications.
More online apps (in areas such as mobility, the “internet of things,” media, gaming and serverless) require a robust, low-latency data processing backbone. Core features of many streaming apps now include real-time event processing, continuous computation, stateful semantics, publish-and-subscribe messaging, change data capture and ACID transaction features.
Stream computing is rising fast
Over the coming decade, data-at-rest architectures — such as data warehouses, data lakes and transactional data stores — will become less central to enterprise data strategies. In Wikibon’s big data analytics market update a year ago, we uncovered several trends that point toward a new era in which stream computing is the foundation of most data architectures:
- Stream computing is the foundation of many new edge applications, including access by mobile, embedded and “internet of things” devices, with back-end infrastructure providing real-time device management and in-stream analytic processing.
- Enterprises are expanding their investments in in-memory, continuous computing, change data capture and other low-latency solutions while converging those investments with their big data at-rest environments, including Hadoop, NoSQL and RDBMSs.
- Streaming environments are evolving to support low-latency, application-level processing of live data in any volume, variety, frequency, format, payload, order or pattern.
- Stream computing backbones are being deployed to manage more stateful, transactional workloads, execute in-stream machine learning and handle other complex orchestrated scenarios that have heretofore been the province of relational databases and other at-rest repositories.
- Online transactional analytic processing, data transformation, data governance and machine learning are increasingly moving toward low-latency, stateful streaming backbones.
- Vendors are introducing innovative solutions that incorporate streaming platforms and ensure they can serve as a durable source of truth for diverse applications.
- Cloud providers have integrated streaming technologies into the heart of their solution portfolios for mobility, IoT, serverless computing and other key solution patterns.
- Enterprises are migrating more inferencing, training and other workloads toward edge devices that process real-time streams of locally acquired sensor data.
- Open-source streaming environments are becoming important enterprise big-data platforms.
- Batch-oriented big data deployments are giving way to more completely real-time, streaming, low-latency end-to-end environments.
- Most machine learning, deep learning and other artificial intelligence workloads will be processed in stream in real time.
Apache Flink sustains its momentum in stream computing
Over the past several years, the stream computing market has seen a glut of open-source projects come into use. Many of these are now under the Apache Software Foundation. In addition to the many mature commercial stream computing and complex event processing solutions on the market, enterprises can choose from such alternatives as Apache Kafka, Flink, Spark Streaming, Apex, Heron, Samza, Storm, Pulsar and Beam.
Though the functional overlaps among these stream-computing projects are considerable, Wikibon has been seeing a growing number of enterprise implementations that use two or more of them, leveraging the advantages of each. Next to Kafka, Apache Flink is the most popular stream computing open-source project.
Flink is now in the 10th year since its invention and the fifth year since it became an Apache project, and its strong suit is its architectural versatility. Apache Flink can ingest millions of data points per second and do so while keeping track of relevant contextual information. Its most prominent users include Netflix Inc., Uber Technologies Inc., Lyft Inc. and Alibaba Group Holding Ltd.
Though it lacks the publish-and-subscribe features at the heart of Kafka, Flink provides a robust framework and scalable distributed engine for the vast majority of stream computing use cases. In fact, it’s not uncommon to see both Kafka and Flink deployed in complementary fashion in many enterprise stream-computing applications.
As it stands, the core features of the Apache Flink open-source codebase, available in the latest stable release, 1.7.2, are that it:
- Supports stateful, event-driven, high-throughput, continuous-processing applications;
- Performs event-driven computations at in-memory speed;
- Operates at any scale;
- Runs in all common cluster environments, including Kubernetes, Docker, Mesos and YARN;
- Processes unbounded and bounded streams;
- Supports batch and continuous latencies;
- Guarantees exactly-once consistency of very large distributed state persisted across sharded tables on multiple nodes;
- Provides incremental checkpointing;
- Executes sophisticated real-time data processing;
- Supports SQL for querying low-latency applications;
- Supports hybrid cloud distributed deployments through connectors to a wide range of on-premises enterprise database and computing platforms, and to Alibaba and other public clouds;
- Allows developers to build stateful streaming apps for deployment to Flink clusters;
- Can process millions of events per second and save up to terabytes of live state in a back-end RocksDB implementation;
- Supports metrics, logging and operationalization of live data streams;
- Enables forking of live streaming Flink apps and replaying of streams using historical data to guarantee strong data consistency;
- Allows developers to take snapshots of running apps and start new code from those snapshots;
- Integrates with third-party DevOps tools such as Jenkins; and
- Uses a common development abstraction that includes both DataStream and DataSet APIs.
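Several of the features above (stateful processing, checkpointing, and restarting from a snapshot) can be illustrated with a minimal conceptual sketch. The Python below does not use Flink or its APIs; `KeyedCounter`, `run` and the sample events are hypothetical names invented for illustration. It only shows the general idea that capturing operator state together with the stream position lets a job resume and reach the same result as an uninterrupted run.

```python
import copy
from collections import defaultdict


class KeyedCounter:
    """Toy stateful operator (hypothetical): keeps a running total per key."""

    def __init__(self):
        self.state = defaultdict(int)  # per-key state
        self.offset = 0                # position in the input stream

    def process(self, event):
        key, amount = event
        self.state[key] += amount
        self.offset += 1

    def snapshot(self):
        # Capture state and stream position together so they stay consistent.
        return {"state": copy.deepcopy(dict(self.state)), "offset": self.offset}

    @classmethod
    def restore(cls, snap):
        op = cls()
        op.state.update(snap["state"])
        op.offset = snap["offset"]
        return op


def run(operator, events):
    """Feed events to the operator, starting from its recorded offset."""
    for event in events[operator.offset:]:
        operator.process(event)
    return dict(operator.state)


events = [("rider", 1), ("driver", 2), ("rider", 3), ("driver", 4)]

# Uninterrupted run over the whole stream.
expected = run(KeyedCounter(), events)

# A run that stops after two events, then resumes from a snapshot.
op = KeyedCounter()
for e in events[:2]:
    op.process(e)
snap = op.snapshot()
recovered = run(KeyedCounter.restore(snap), events)
```

Because the snapshot records the offset alongside the totals, the resumed run neither skips nor double-counts events, which is the intuition behind exactly-once state consistency. A production engine does this across many distributed operators, which this sketch does not attempt to model.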
This week at the third annual Flink Forward developer conference in San Francisco, attendees learned how the Apache Flink project and the community that uses it are likely to fare now that its principal developer, data Artisans GmbH (recently renamed Ververica), has been acquired by China-based cloud powerhouse Alibaba.
In the conference keynote, executives from Ververica and Alibaba laid out the company’s priorities for the coming decade. What was most noteworthy was how accurate Wikibon’s forecasts for the streaming market — especially its convergence with batch processing and machine learning — truly are.
Growing the Apache Flink open-source community
Apache Flink is on a roll and is becoming indispensable to a growing range of streaming use cases. Adoption, open-source code commitment, and other metrics presented at Flink Forward 2019 show that it’s becoming a key pillar in enterprise data strategies.
Robert Metzger, engineering lead at Ververica, showed stats pointing to Flink’s growing adoption on a global scale, especially in China. So it was no surprise, given Ververica’s new corporate parentage, when Metzger discussed how Ververica is launching a new Chinese-language user-support mailing list for the Apache Flink community. He also discussed the company’s efforts to integrate the substantial Flink user base in China into the open-source project’s Apache community.
To support these and other community members, Metzger discussed Ververica’s investments in improving the Flink website. Key enhancements underway include improving the ability to manage issue and bug tracking, publish community packages, and handle workflows for pull-request review and labeling.
Contributing innovations to the Apache Flink open-source codebase
Ververica plans to continue to evolve Apache Flink from a stream processor into a unified data processing system. To that end, it is focusing on developing Flink’s batch processing, machine learning and streaming analytics, and data warehouse/ETL integration features.
In batch processing, Xiaowei Jiang, senior staff platform engineer at Alibaba, discussed its work with the Ververica team to build out the “Blink” batch-processing capabilities in the open-source platform. To this end, planned additions to the Flink codebase will include a new Table API and an enhanced SQL query processor. According to Ververica CTO Stephan Ewen, Ververica is working with Alibaba on improving the performance and fault tolerance of batch jobs running across distributed nodes.
In machine learning, Ververica CEO Kostas Tzoumas discussed the company’s investments in deepening Apache Flink’s algorithm libraries, utilities, and user interface for serving teams of data scientists who are building artificial intelligence and stream analytics applications for real-time continuous computation. The company is also adding support for development of Flink machine learning apps in Zeppelin notebooks.
In data warehouse and ETL integration, Flink, according to Tzoumas, is being integrated more tightly with Hive’s metastore and data catalog. It’s also seeing performance enhancements in its embedded SQL query processing engine.
In addition, various breakouts during the day focused on ongoing Apache Flink enhancements that will tighten its integration with TensorFlow, Apache Beam and Apache Pulsar.
Taken together, these architectural improvements will enable open-source Apache Flink to support more enterprise use cases that have historically gone to at-rest data platforms such as Apache Hadoop.
Developing the Apache Flink commercial ecosystem
Last year, data Artisans launched a commercial version of Flink aimed at enterprises. The platform includes features for automating the setup and maintenance of large-scale deployments. It also provides support for ACID (atomicity, consistency, isolation, durability) transactions, which make it possible to guarantee the reliability of important information such as financial records.
To sustain the commercial momentum of the Flink ecosystem, Ververica has retained and rebranded all of data Artisans’ products. Formerly known as dA Platform, the newly renamed Ververica Platform, which is delivered as licensed software, includes three core components:
- Apache Flink (the open-source engine for distributed, stateful, real-time in-stream computation);
- Ververica Application Manager (a framework for lifecycle management of live, stateful computing production applications on Flink); and
- Ververica Streaming Ledger (a library on top of Flink for serializable ACID transactions among shared, distributed state tables).
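To make the idea of ACID transactions over shared state more concrete, here is a small conceptual sketch in Python. It is not the Ververica Streaming Ledger API; `apply_transfer`, `process_stream` and the account data are hypothetical and exist only for illustration. The sketch assumes a single-threaded processor that applies one transaction at a time, which trivially produces a serial order, and it shows atomicity: a transfer either updates both rows or leaves the state untouched.

```python
def apply_transfer(accounts, transfer):
    """Apply one transfer atomically: both rows change or neither does."""
    src, dst, amount = transfer
    if accounts.get(src, 0) < amount:
        return False  # precondition failed: leave the state untouched
    accounts[src] -= amount
    accounts[dst] = accounts.get(dst, 0) + amount
    return True


def process_stream(accounts, transfers):
    """Apply transfers one at a time, which yields a serial order."""
    return [apply_transfer(accounts, t) for t in transfers]


accounts = {"alice": 100, "bob": 20}
transfers = [
    ("alice", "bob", 70),   # succeeds
    ("alice", "bob", 50),   # rejected: alice has only 30 left
    ("bob", "alice", 90),   # succeeds
]
results = process_stream(accounts, transfers)
```

The total balance is the same before and after the stream is processed, which is the kind of invariant financial records depend on. A distributed implementation has to achieve the same effect while rows live on different nodes and transactions arrive concurrently, which this sketch deliberately sidesteps.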
According to Tzoumas, Ververica is expanding its Flink training and consulting programs. It is also recruiting new platform and services partners to drive the company’s solutions into more opportunities around the world.
What’s missing from Ververica’s go-to-market strategy?
If it hopes to expand adoption for Flink in the enterprise, Ververica will need to take the following strategic steps:
- Position Apache Flink’s differentiated use cases more succinctly within the growing range of hybrid stream environments in the enterprise, especially with respect to Kafka and Spark Streaming;
- Bring more of an edge focus to Ververica product development in order to gain further footholds for Flink in mobile, embedded, and IoT devices;
- Bring data science toolchains and DevOps vendors into Ververica’s partner ecosystem to ensure that more machine learning applications are built and trained for deployment in distributed Flink environments;
- Take a line-of-business and vertical-industry focus in Ververica’s go-to-market strategy in order to reach more business customers with quick-value streaming solutions that incorporate embedded Flink runners;
- Build out Ververica’s public cloud integrations in order to ensure that Flink can be incorporated as the stream computing platform of choice in more enterprise hybrid cloud deployments; and
- Expose the Ververica platform’s containerized capabilities as serverless functions and plug its engine in as a back-end to Knative so that Flink can more easily be integrated into cloud-native applications.
For further news from Flink Forward 2019, check out the Ververica blog. And to see how far the company has come in the past year, check out what Tzoumas had to say on theCUBE at Flink Forward 2018.
Image: Apache Flink