UPDATED 12:00 EDT / OCTOBER 09 2019

AI

IBM and MIT break new ground in video recognition model training

IBM Corp. has teamed up with researchers from the Massachusetts Institute of Technology to create a new method for training “video recognition” deep learning models more efficiently.

Deep learning is a branch of machine learning that aims to replicate how the human brain solves problems. It has led to major breakthroughs in areas such as language translation and image and voice recognition.

Video recognition is similar to image classification, in that the deep learning model tries to identify what’s going on in a video, including the objects and people it sees, what they’re doing and so on. The main difference between the two is that videos have a lot more moving parts than a simple, static image, and so training deep learning models to understand them takes much more time and effort.

“By one estimate, training a video recognition model can take up to 50 times more data and eight times more processing power than training an image classification model,” MIT explained in a blog post today.

Of course, no one likes devoting huge amounts of compute resources to such a task because it can often be prohibitively expensive. Moreover, the resources needed make it next to impossible to run video recognition models on low-powered mobile devices, where many AI applications are going.

Those problems are what inspired a research team led by Song Han, an assistant professor at MIT’s Department of Electrical Engineering and Computer Science, to come up with a more efficient model for video recognition training. The new technique dramatically reduces the size of video recognition models in order to speed up training times and improve performance on mobile devices.

“Our goal is to make AI accessible to anyone with a low-power device,” Han said. “To do that we need to design efficient AI models that use less energy and can run smoothly on edge devices where so much of AI is moving.”

Image classification models work by looking for patterns in the pixels of an image in order to build up a representation of what they see. With enough examples, the models can learn to recognize people, objects and the ways they relate to one another.

Video recognition works in a similar way, but the deep learning models go further by using “three-dimensional convolutions” to encode the passage of time in a sequence of images (video frames), which leads to bigger and more computationally intensive models. To reduce the calculations involved, Han and his colleagues designed an operation they call a “temporal shift module,” which shifts the feature maps of a selected video frame to its neighboring frames. By mingling spatial representations of the past, present and future, the model gets a sense of time passing without explicitly representing it.
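
The shifting idea can be illustrated with a minimal NumPy sketch. This is not the researchers' code; it is an illustration under stated assumptions. The function name `temporal_shift`, the `fold_div` parameter, the choice to shift only a fraction of the channels, and the zero fill at the first and last frames are all hypothetical details that the article does not specify.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of feature channels along the time axis.

    x is assumed to have shape (T, C, H, W): T frames, C channels.
    fold_div is a hypothetical knob: 1/fold_div of the channels move
    in each direction, and the rest stay where they are.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)

    # First group of channels: each frame receives the next frame's
    # features (future -> present). The last frame is left as zeros.
    out[:-1, :fold] = x[1:, :fold]

    # Second group: each frame receives the previous frame's features
    # (past -> present). The first frame is left as zeros.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]

    # Remaining channels are copied through unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


# Example: 4 frames, 8 channels, 2x2 feature maps.
frames = np.random.rand(4, 8, 2, 2)
shifted = temporal_shift(frames)
print(shifted.shape)  # (4, 8, 2, 2)
```

The shift only moves existing values between frames, so it adds no multiplications of its own. An ordinary two-dimensional convolution applied to each shifted frame then sees a mix of past, present and future features.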

The new technique resulted in a model that can be trained three times faster than existing models on the Something-Something video dataset, which is a collection of densely labeled video clips that show humans performing predefined basic actions with everyday objects.

The model can even understand people’s movements in real time and is also extremely power-efficient. For example, it enabled a single-board computer rigged to a video camera to instantly classify hand gestures, using the same amount of energy required to power a bike light.

Machine learning is still in its early phases, and so are the gains that can be achieved with innovative approaches such as this, said Holger Mueller, principal analyst and vice president at Constellation Research Inc. “Today it is the turn of MIT and IBM to accelerate video recognition, which happens to be one of the hardest ML jobs there is.”

IBM and MIT say their new video recognition model could have useful applications in a variety of fields. For example, it could be used to help catalog videos on YouTube or a similar service more quickly. It could also enable hospitals to run AI applications locally instead of in the cloud, helping to keep confidential data more secure.

Image: mohamed_hassan/Pixabay
