

Google LLC today debuted VideoBERT, an artificial intelligence model that can watch part of a video and extrapolate what will happen in the next few seconds, much as a human would.
Equipping a computer with the ability to understand and draw correct conclusions from a visual scene requires an incredibly sophisticated algorithm. For Google’s researchers, however, the challenge wasn’t building the algorithm but finding enough data with which to train it. Machine learning models must ingest enormous amounts of information to understand even basic concepts, and that information typically must be prepared by hand.
That wasn’t feasible for VideoBERT, since teaching the model how to predict future events required more sample videos than Google’s researchers could have assembled by hand. They would also have had to write descriptions for each individual frame of every clip just so the AI could follow what was happening. So the team came up with an alternative: freely available instructional videos.
In a video that shows how to cook an omelette or fill a tire, the person demonstrating the task will often explain each step as they perform it. The researchers used that narration as a substitute for the frame-by-frame descriptions they would otherwise have had to create for the AI. The team compiled more than a million clips spanning categories such as cooking and gardening, then fed them to VideoBERT to teach the model how to trace the progress of common activities.
After the training, the model was set loose on a collection of cooking videos it had never seen before. When presented with a video fragment showing a bowl of flour and cocoa powder, VideoBERT astutely predicted that the ingredients would be placed in an oven and become a brownie or a cupcake. The researchers also managed to harness the algorithm’s observation skills to extract a recipe from a video in which a chef explained how to cook a steak.
The methods Google developed to train VideoBERT could eventually find use in far more serious applications. Self-driving cars, for instance, might become safer if they gained the ability to predict accurately where nearby vehicles will be a few seconds into the future. Such foresight can also be a big asset for drones and industrial robots that operate in close proximity to human workers.