UPDATED 17:47 EST / MAY 21 2018

EMERGING TECH

Pushing AI performance benchmarks to the edge

You can’t optimize your artificial intelligence applications unless you’ve benchmarked their performance on a range of target hardware and software infrastructures.

As I discussed recently, the AI industry is developing benchmarking suites that will help practitioners determine the target environment in which their machine learning, deep learning or other statistical models might perform best. Increasingly, these frameworks are turning their focus to benchmarking AI workloads that run on edge devices, such as “internet of things” endpoints, smartphones and embedded systems.

There are as yet no widely adopted AI benchmarking suites. Of those under development, the following stand the greatest chance of prevailing down the road:

  • Transaction Processing Performance Council’s AI Working Group: The TPC includes more than 20 top server and software makers. Late last year, the organization formed a working group to define AI hardware and software benchmarks that are agnostic to the underlying chipsets where the workloads are executed. The working group plans to focus its initial benchmark suite on AI-training workloads. However, it hasn’t indicated whether or when it will broaden the initiative’s scope to include AI workloads running on edge nodes.
  • MLPerf: Early this month, Google Inc. and Baidu Inc. announced that they are teaming with chipmakers and academic research centers to create the AI benchmark MLPerf. The benchmark’s first version is expected to be ready for use in August. As I noted here, MLPerf currently provides benchmarks for the ML-training use cases that predominate in today’s AI deployments: computer vision, image classification, object detection, speech recognition, machine translation, recommendation, sentiment analysis and gaming. The suite will initially focus on benchmarking the performance of AI-training jobs on a wide range of systems, from workstations to large data centers. The initial benchmark will measure only the average time to train a model to a minimum quality (a metric sketched in code after this list) and will use Nvidia Corp.’s Pascal-based P100 graphics processing unit as a reference standard because of its widespread use in ML training in data centers. MLPerf’s later releases will measure inferencing performance and be extended to include embedded and other edge-client AI workloads.
  • EEMBC’s Machine Learning Benchmark Suite: Industry alliance EEMBC recently started an effort to define a benchmark suite for ML executing on optimized chipsets in power-constrained edge devices. Chaired by Intel Corp., EEMBC’s Machine Learning Benchmark Suite group will use real-world ML workloads from virtual assistants, smartphones, IoT devices, smart speakers, IoT gateways and other embedded/edge systems to identify the performance potential and power efficiency of processor cores used for accelerating ML inferencing jobs. The EEMBC Machine Learning benchmark will measure inferencing performance, neural-net spin-up time and power efficiency for low-, moderate- and high-complexity inferencing tasks. The benchmark will be agnostic to ML front-end frameworks, back-end runtime environments and hardware-accelerator targets. Hardware-accelerator targets the group is addressing include AImotive aiWare, Cadence Vision P6, Cambricon CPU, Ceva NeuPro, Imagination PowerVR 2NX, Nvidia NVDLA, Synopsys EV64, VeriSilicon VIP and Videantis v-MP6000. So far, the group has about a dozen members from embedded processor providers, including Analog Devices Inc., ARM Holdings plc, Flex Ltd., Green Hills Software Inc., Intel, Nvidia Corp., NXP Semiconductors NV, Samsung Electronics Co. Ltd., STMicroelectronics NV, Synopsys Inc. and Texas Instruments Inc. It is currently working on a proof of concept and plans to release its initial benchmark suite by June 2019, addressing a range of neural-net architectures and use cases for edge-based inferencing.
  • EEMBC’s ADASMARK: This is an application-specific AI benchmarking framework focused on smart vehicles. Separate from its Machine Learning Benchmark Suite effort, EEMBC has for the past two years been developing a performance measurement framework for AI chips embedded in advanced driver assistance systems, or ADAS. Due for release late in the second quarter or in the third quarter of this year and already in beta test with multiple users, EEMBC’s ADASMARK suite will help measure the performance of AI inferencing workloads executing on multidevice, multichip, multiapplication smart-vehicle platforms. ADASMARK will benchmark real-world ADAS inferencing workloads associated with highly parallel applications, such as computer vision, autonomous driving, automotive surround view, image recognition and mobile augmented reality. It will measure ADAS inferencing performance across complex architectures consisting of CPUs, GPUs and other hardware-accelerator chipsets.
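To make MLPerf’s initial “time to train a model to a minimum quality” metric concrete, here is a minimal Python sketch of how such a measurement works. It is not MLPerf reference code: the function arguments, epoch budget and accuracy threshold are all placeholder assumptions.

```python
# Hedged sketch of a "time to train to a minimum quality" measurement.
# train_one_epoch, evaluate and TARGET_ACCURACY are placeholders, not MLPerf code.
import time

TARGET_ACCURACY = 0.749  # assumed quality bar; a real suite defines one per task


def time_to_quality(train_one_epoch, evaluate, max_epochs=100):
    """Return wall-clock seconds needed to reach TARGET_ACCURACY, or None."""
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()                      # one pass over the training data
        if evaluate() >= TARGET_ACCURACY:      # held-out quality check each epoch
            return time.perf_counter() - start
    return None                                # quality bar not reached in budget
```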

Clearly, benchmarking an AI application’s performance can be relatively straightforward if you narrowly constrain the use cases, workloads and hardware targets over which performance measurements are to be made. For example, you might simply benchmark an image-classification workload built on a widely used convolutional neural network such as ResNet-50 v1, using an open image repository such as ImageNet, on a specific cloud architecture incorporating GPUs.
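As a rough illustration of that kind of narrowly scoped measurement, the sketch below times ResNet-50 forward passes on synthetic ImageNet-sized inputs. PyTorch and torchvision are just one possible toolchain here, and the batch size and iteration counts are arbitrary assumptions, not part of any standard suite.

```python
# Minimal throughput sketch: ResNet-50 inference on one GPU (or CPU fallback)
# with dummy ImageNet-shaped inputs. Batch size and loop counts are arbitrary.
import time

import torch
from torchvision.models import resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = resnet50().eval().to(device)
batch = torch.randn(32, 3, 224, 224, device=device)  # ImageNet-shaped dummy batch

with torch.no_grad():
    for _ in range(10):                               # warm-up iterations
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()                      # flush queued GPU work
    start = time.perf_counter()
    for _ in range(100):
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{100 * batch.size(0) / elapsed:.1f} images/sec on {device}")
```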

Most industry efforts will provide benchmarking frameworks for the most common AI use cases and targets. But if you seek industry-standard AI benchmarks that address every possible use case and target architecture, don’t hold your breath. Several factors make it unlikely that one-size-fits-all benchmarks of that sort will emerge:

  • Complexity of AI edge-target platforms: Benchmarks will have a tough time addressing the full range of heterogeneous multidevice system architectures (such as drones, autonomous vehicles, and smart buildings) and commercial systems-on-a-chip platforms (such as smartphones and computer-vision systems) into which AI apps will be deployed in edge scenarios.
  • Proliferation of innovative AI edge apps: Benchmarking suites may not be able to keep pace with the growing assortment of AI apps being deployed to every type of mobile, IoT or embedded device. In addition, innovative edge-based AI inferencing algorithms, such as real-time browser-based human-pose estimation, will continue to emerge and evolve rapidly, not crystallizing into standard approaches long enough to warrant creating standard benchmarks.
  • Diversity of AI cloud-to-edge training and inferencing workflows: Benchmarking suites will generally include a few reference workflows for AI training and inferencing. However, the range of alternative training and inferencing workflows (on the edge, at the gateway, in the data center, etc.) and the diversity of interactions among nodes in these tiers will make it unlikely that any one benchmarking suite can do them all justice. For example, consider the rich workflow supported by Intel’s OpenVINO Toolkit: building and training computer-vision models in the cloud, deploying them across a broad range of edge devices, rapidly analyzing vast streams of data near the edge, responding in real time, and moving only the most relevant insights back to the cloud asynchronously. The general shape of that pattern is sketched in code after this list.
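The sketch below illustrates the edge-side half of that workflow in generic Python: score every frame locally, respond immediately, and push only the most relevant results to the cloud asynchronously. It is not OpenVINO code, and every name in it is a hypothetical stand-in.

```python
# Generic cloud-to-edge inferencing pattern (not OpenVINO API code).
# run_local_inference and publish_to_cloud are hypothetical stand-ins.
import queue
import random
import threading
import time

RELEVANCE_THRESHOLD = 0.8        # assumed cut-off for "most relevant insights"
upload_queue = queue.Queue()


def run_local_inference(frame):
    # stand-in for an accelerator-backed model call on the edge device
    return {"frame": frame, "confidence": random.random()}


def publish_to_cloud(result):
    # stand-in for an asynchronous, batched uplink to the cloud
    print("uploaded frame", result["frame"], round(result["confidence"], 2))


def cloud_uploader():
    # background thread keeps the uplink from blocking the real-time loop
    while True:
        publish_to_cloud(upload_queue.get())
        upload_queue.task_done()


threading.Thread(target=cloud_uploader, daemon=True).start()

for frame in range(20):                          # stand-in for a camera stream
    result = run_local_inference(frame)          # real-time response happens here
    if result["confidence"] >= RELEVANCE_THRESHOLD:
        upload_queue.put(result)                 # only notable insights go upstream
    time.sleep(0.05)

upload_queue.join()                              # let queued uploads drain
```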

In all of these cases, AI developers may need to build bespoke benchmarks to assess the comparative performance of target architectures in scenarios too complex for standardized measurement frameworks. That’s why the MLPerf group was wise to split its benchmarks into two modes:

  • Closed-mode benchmarks: Here the benchmark (for example, sentiment analysis via a Seq-CNN applied to the IMDB dataset) specifies the model and data set to be used and restricts hyperparameters such as batch size and learning rate, along with other implementation details.
  • Open-mode benchmarks: Here the same benchmark imposes fewer implementation restrictions, so users can experiment with benchmarking newer algorithms, models, software configurations and target architectures. The toy sketch below contrasts the two modes.
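As a toy illustration, the following sketch contrasts a closed-mode specification, which pins the model, dataset and key hyperparameters, with an open-mode specification that fixes only the task and quality target. The field names and values are made-up assumptions, not MLPerf rules.

```python
# Toy contrast of closed- vs. open-mode benchmark specifications.
# All field values are placeholders, not MLPerf-defined rules.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClosedModeSpec:
    task: str = "sentiment_analysis"
    dataset: str = "IMDB"
    model: str = "Seq-CNN"           # architecture is fixed
    batch_size: int = 32             # implementation details are pinned...
    learning_rate: float = 1e-3      # ...so results compare systems, not tuning
    target_accuracy: float = 0.905   # assumed quality bar


@dataclass
class OpenModeSpec:
    task: str = "sentiment_analysis"
    dataset: str = "IMDB"
    target_accuracy: float = 0.905
    # model, optimizer, batch size and so on are left to the submitter
    free_choices: dict = field(default_factory=dict)


print(ClosedModeSpec())
print(OpenModeSpec(free_choices={"model": "your-new-architecture"}))
```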

Wikibon expects that it may take two to three years for the disparate industry initiatives discussed earlier in this article to converge into a common architecture for benchmarking AI applications within complex cloud and edge architectures. Here are the likely milestones we see occurring by the end of this decade:

  • Consensus benchmarks for the predominant AI use cases — especially computer vision, speech recognition, machine translation, sentiment analysis, and gaming — will emerge first from cross-industry initiatives.
  • The mobility industry will define common benchmarks for the core AI apps — such as face recognition and digital-assistant recommenders — that are built into smartphones.
  • Specialized AI performance benchmarks for complex edge platforms — especially autonomous vehicles and drones — will probably be hammered out by those particular industries, referencing and extending any cross-industry benchmarks.

Intel has also published a good video tutorial on benchmarking high-performance hardware for distributed deep learning.
