The challenge of finding reliable AI performance benchmarks
Artificial intelligence can be extremely resource-intensive. Generally, AI practitioners seek out the fastest, most scalable, most power-efficient and lowest-cost hardware, software and cloud platforms to run their workloads.
As the AI arena shifts toward workload-optimized architectures, there’s a growing need for standard benchmarking tools to help machine learning developers and enterprise information technology professionals assess which target environments are best suited for any specific training or inferencing job. Historically, the AI industry has lacked reliable, transparent, standard and vendor-neutral benchmarks for flagging performance differences between different hardware, software, algorithms and cloud configurations that might be used to handle a given workload.
In a key AI industry milestone, the newly formed MLPerf open-source benchmark group last week announced the launch of a standard suite for benchmarking the performance of ML software frameworks, hardware accelerators and cloud platforms. The group — which includes Google, Baidu, Intel, AMD and other commercial vendors, as well as research universities such as Harvard and Stanford — is attempting to create an ML performance-comparison tool that is open, fair, reliable, comprehensive, flexible and affordable.
Available on GitHub and currently in preliminary release 0.5, MLPerf provides reference implementations for some bounded use cases that predominate in today’s AI deployments:
- Image classification: ResNet-50 v1 applied to ImageNet.
- Object detection: Mask R-CNN applied to COCO.
- Speech recognition: DeepSpeech2 applied to LibriSpeech.
- Translation: Transformer applied to WMT English-German.
- Recommendation: Neural Collaborative Filtering applied to MovieLens 20 Million (ml-20m).
- Sentiment analysis: Seq-CNN applied to the IMDB dataset.
- Reinforcement learning: Mini-go applied to predicting pro Go game moves.
The first MLPerf release focuses on benchmarks for ML training jobs. Currently, each MLPerf reference implementation addressing a particular AI use case provides the following (a minimal sketch of how these pieces fit together appears after the list):
- Documentation on the dataset, model and machine setup, as well as a user guide;
- Code that implements the model in at least one ML/DL framework and a Dockerfile for running the benchmark in a container; and
- Scripts that download the referenced dataset, train the model and measure its performance against a prespecified target value (aka “quality”).
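Taken together, those three pieces imply a simple flow: fetch the dataset, train until the prespecified quality target is met, and report the elapsed wall-clock time. Below is a minimal Python sketch of that flow; the function names, the simulated training loop and the quality numbers are illustrative stand-ins, not MLPerf’s actual scripts or targets.

```python
import random
import time

# The quality target mirrors MLPerf's rule described further below: the
# original publication's result minus a small delta to absorb run-to-run
# variance. These numbers are illustrative placeholders, not official targets.
REFERENCE_QUALITY = 0.75
RUN_TO_RUN_DELTA = 0.01
QUALITY_TARGET = REFERENCE_QUALITY - RUN_TO_RUN_DELTA


def download_dataset():
    """Stand-in for the benchmark's dataset-download script."""
    print("downloading reference dataset ...")


def train_one_epoch(state):
    """Stand-in for one training epoch; nudges the simulated quality upward."""
    state["quality"] += random.uniform(0.05, 0.10)


def evaluate(state):
    """Stand-in for the evaluation step; returns the current quality metric."""
    return state["quality"]


def run_benchmark():
    download_dataset()
    state = {"quality": 0.0}
    start = time.time()
    while evaluate(state) < QUALITY_TARGET:   # train to quality, not to a fixed epoch count
        train_one_epoch(state)
    elapsed = time.time() - start             # the reported result: wall-clock time to target
    print(f"reached quality {evaluate(state):.3f} in {elapsed:.2f} s")
    return elapsed


if __name__ == "__main__":
    run_benchmark()
```

The key design point is that the loop terminates when the quality target is reached rather than after a fixed number of epochs, which is what makes the elapsed time comparable across implementations.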
The MLPerf group has published a repository of reference implementations for the benchmark suite. These reference implementations are valid starting points for benchmark implementations, but they are not fully optimized and are not intended for performance measurement on production AI systems. Currently, the published benchmarks have been tested on the following reference configuration:
- 16 central processing unit chips and one Nvidia P100 graphics processing unit;
- Ubuntu 16.04, including Docker with Nvidia support;
- 600 gigabytes of disk (though many benchmarks require less disk); and
- Either CPython 2 or CPython 3, depending on benchmark.
The MLPerf group plans to release each benchmark (a specific problem addressed with specific AI models) in two modes; a rough configuration sketch follows the list:
- Closed: In this mode, a benchmark — such as sentiment analysis via Seq-CNN applied to the IMDB dataset — will specify a model and dataset to be used and will restrict hyperparameters, batch size, learning rate and other implementation details.
- Open: In this mode, that same benchmark will have fewer implementation restrictions so that users can experiment with benchmarking newer algorithms, models, software configurations and other AI approaches.
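To make the distinction concrete, here is a rough configuration sketch; the field names and values are hypothetical and do not reflect MLPerf’s actual submission format.

```python
# Hypothetical illustration of the two divisions for one benchmark; these
# dictionaries mimic the spirit of the rules, not MLPerf's real schema.
closed_division = {
    "benchmark": "sentiment_analysis",
    "model": "seq-cnn",       # fixed: submitters may not swap in another architecture
    "dataset": "imdb",        # fixed reference dataset
    "batch_size": 32,         # restricted, along with learning rate and other details
    "learning_rate": 1e-3,
}

open_division = {
    "benchmark": "sentiment_analysis",
    "dataset": "imdb",                     # same task and data keep results broadly comparable
    "model": "submitter's choice",         # newer algorithms and models may be tried
    "batch_size": "submitter's choice",    # hyperparameters left to the submitter
    "learning_rate": "submitter's choice",
}
```

The closed division keeps results directly comparable across hardware and software stacks, while the open division leaves room for algorithmic experimentation.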
Each benchmark runs until the target metric is reached, and the tool then records the result. The MLPerf group currently publishes benchmark metrics in terms of the average “wall clock” time needed to train a model to a minimum quality. The tool also accounts for job cost, under the assumption that prices do not vary with the time of day at which jobs run. For each benchmark, the target metric is based on the original publication result, minus a small delta to allow for run-to-run variance.
The MLPerf group plans to update published benchmark results every three months. It will publish a score that summarizes performance across its entire set of closed and open benchmarks, calculated as the geometric mean of results for the full suite. It will also report the power consumed by mobile devices and on-premises systems executing benchmark tasks, and the cost of cloud-based systems performing those tasks.
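The summary score is easy to reproduce once per-benchmark results are in hand. The sketch below computes a geometric mean over a handful of hypothetical results; the benchmark names and numbers are made up for illustration.

```python
import math

# Hypothetical per-benchmark results, e.g. minutes of wall-clock time to
# reach target quality; names and values are made up for illustration.
results = {
    "image_classification": 512.0,
    "object_detection": 340.0,
    "translation": 128.0,
    "recommendation": 22.0,
}

# Geometric mean: nth root of the product, computed in log space for stability.
summary_score = math.exp(sum(math.log(v) for v in results.values()) / len(results))
print(f"suite summary score: {summary_score:.1f}")
```

A geometric mean is a natural choice for a suite score because it weights a twofold improvement on any benchmark equally, regardless of that benchmark’s absolute scale.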
The next version of the benchmarking suite, slated for release in August, will run on a range of AI frameworks. Subsequent updates will add support for inferencing workloads, eventually extending to those running on embedded client systems. The group plans to fold benchmarking advances developed in the “open” benchmarks into future versions of the “closed” benchmarks, and to evolve the reference implementations toward greater hardware capacity and configurations optimized for a range of workloads.
MLPerf is not the first industry framework for benchmarking AI platforms’ performance on specific workloads, though it certainly has the broadest participation and the most ambitious agenda. Going forward, Wikibon expects these established benchmarking initiatives to converge or align with MLPerf:
- DAWNBench benchmarks end-to-end DL training and inferencing. Developed by MLPerf member Stanford University, DAWNBench provides a reference set of common DL workloads for quantifying training time, training cost, inference latency and inference cost across different optimization strategies, model architectures, software frameworks, clouds and hardware. It supports cross-algorithm benchmarking of image classification and question-answering tasks.
- DeepBench benchmarks the training and inferencing performance of DL frameworks such as TensorFlow, Torch, Theano and PaddlePaddle. Developed by Baidu, also an MLPerf member, the tool benchmarks the performance of basic DL operations (dense matrix multiplies, convolutions and communication) run on different AI-accelerator chipsets. It includes training results for seven hardware platforms (Nvidia’s TitanX, M40, TitanX Pascal, TitanXp, 1080 Ti and P100, plus Intel’s Knights Landing) and inference results for three server platforms (Nvidia’s TitanX Pascal, TitanXp and 1080 Ti) and three mobile devices (iPhone 6, iPhone 7 and Raspberry Pi 3). However, it does not measure the time required to train an entire model. (A kernel-timing sketch in this spirit appears after the list.)
- Microsoft has open-sourced a GitHub repo that creates what it calls a “Rosetta Stone of deep-learning frameworks” to facilitate cross-framework benchmarking of GPU-optimized DL models. The repo includes optimized modeling code that is accessible through up-to-date high-level APIs (Keras and Gluon) supported in various frameworks. For alternative multi-GPU configurations, it publishes performance comparisons of these models: specifically, training-time results for CNN models performing ResNet-50 image classification on the CIFAR-10 dataset and for RNN models doing sentiment analysis on IMDB movie reviews, compared across frameworks and languages. Microsoft has also invited any data scientist to spin up an Azure Deep Learning Virtual Machine and contribute benchmarks for any DL task, framework, API, language and GPU configuration they wish.
- CEA N2D2 is an open-source benchmarking framework that simulates the performance of DL models on various hardware configurations. Built by the Paris-based research institute CEA with industrial and academic partners, N2D2 enables designers to explore and generate DL models. It compares different hardware on the basis of DL model accuracy, processing time, hardware cost and energy consumption. It supports simulated benchmarking on multi- or many-core CPUs, GPUs and field-programmable gate array targets.
- OpenAI’s Universe uses reinforcement learning to benchmark AI apps’ automated performance against training data collected from human interactions in the same application environment. Human-user sessions are recorded organically within Universe’s interactive online app environments, providing interaction-sourced training data against which AI app performance can be benchmarked.
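Returning to the DeepBench entry above: to give a sense of what benchmarking a “basic DL operation” looks like, here is a minimal NumPy sketch that times a dense matrix multiply. It is only an illustration of kernel-level measurement; DeepBench itself exercises vendor libraries such as cuBLAS, cuDNN and MKL on the listed chipsets, and the sizes and run count below are arbitrary.

```python
import time
import numpy as np

# Time a single-precision GEMM, loosely in the spirit of DeepBench's
# dense matrix-multiply benchmarks (sizes chosen arbitrarily).
M, N, K = 1024, 1024, 1024
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

runs = 10
start = time.perf_counter()
for _ in range(runs):
    a @ b
elapsed = (time.perf_counter() - start) / runs

flops = 2 * M * N * K   # one multiply and one add per inner-product term
print(f"avg GEMM time {elapsed * 1e3:.1f} ms, ~{flops / elapsed / 1e9:.1f} GFLOP/s")
```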