Facebook’s DINO enables self-supervised learning for computer vision AI
Facebook Inc.’s artificial intelligence research team today announced more breakthroughs, this time in the areas of self-supervised learning and semi-supervised learning for computer vision.
Self-supervised learning, often described as a form of unsupervised learning, refers to teaching AI models to perform certain tasks without humans having to provide labeled data.
In computer vision, AI models have traditionally been trained on labeled images, such as a picture of a dog accompanied by the label “dog.” With self-supervised learning, the AI model works this out for itself, without the pictures it is shown being labeled.
Facebook’s AI team said in a blog post today that it has successfully used a self-supervised learning method to train what’s known as a “vision transformer model” that can discover and segment the objects it sees in images and video, entirely on its own.
Facebook has christened its new self-supervised learning method “DINO.” It’s used to train vision transformers, which enable AI models to selectively focus on certain parts of their input and thus reason more effectively. DINO’s ability to discover and segment objects by itself has numerous potential applications, Facebook’s researchers said. For example, it could facilitate tasks such as swapping out the background of a video chat or teaching robots to navigate through a cluttered environment.
Object segmentation has long been seen as one of the most difficult challenges in computer vision because it requires the AI to understand everything it sees in an image. It has traditionally required supervised learning with large volumes of annotated examples, the researchers explained.
The DINO model is based on two components from previous self-supervised approaches, known as the “momentum teacher” and “multicrop training.” The researchers said that by combining these with DINO’s “self-attention layers,” the model is capable of building a “high-level understanding” of each scene it is shown.
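To make the “momentum teacher” idea more concrete, here is a minimal sketch in Python. It assumes the common formulation in which a teacher network's weights are kept as an exponential moving average of a student network's weights. The function name ema_update, the toy weight lists and the momentum value of 0.9 are illustrative assumptions, not details taken from Facebook's announcement or code, and multicrop training is not shown.

```python
def ema_update(teacher, student, momentum=0.9):
    """Return new teacher weights as a moving average of the student weights.

    A higher momentum makes the teacher change more slowly.
    (Illustrative helper; not taken from the DINO code base.)
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy weights standing in for a real network's parameters.
teacher = [0.0, 0.0]
student = [1.0, -1.0]

# After each training step the teacher drifts slowly toward the student.
for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, teacher)
```

In this sketch the teacher is never trained directly. It only tracks a smoothed copy of the student, which is what gives the student a slowly moving target to learn from.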
“DINO learns a great deal about the visual world. By discovering object parts and shared characteristics across images, the model learns a feature space that exhibits a very interesting structure,” Facebook’s AI team said. “If we embed ImageNet classes using the features computed using DINO, we see that they organize in an interpretable way, with similar categories landing near one another. This suggests that the model managed to connect categories based on visual properties, a bit like humans do.”
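The notion of similar categories “landing near one another” can be illustrated with a small nearest-neighbor lookup. The sketch below assumes cosine similarity as the distance measure and uses made-up three-dimensional feature vectors. Real DINO features have far more dimensions, and the class names and numbers here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_class(query, class_features):
    """Return the name of the class whose features are closest to the query class."""
    others = {name: vec for name, vec in class_features.items() if name != query}
    return max(
        others,
        key=lambda name: cosine_similarity(class_features[query], others[name]),
    )


# Hypothetical per-class features, invented for illustration only.
class_features = {
    "husky": [0.9, 0.1, 0.0],
    "wolf": [0.8, 0.2, 0.1],
    "sports car": [0.0, 0.1, 0.9],
    "minivan": [0.1, 0.2, 0.8],
}

for name in class_features:
    print(name, "->", nearest_class(name, class_features))
```

With these toy vectors the two animals pair up with each other, as do the two vehicles. That is the kind of structure the researchers describe, although here it is built in by hand rather than learned.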
The researchers say DINO is well-suited for general image classification tasks and also excels at identifying image copies, even though it was never designed to do that. They say DINO even has the potential to become the industry standard for copy detection systems used to spot copyright infringement and identify misinformation.
Semi-supervised learning
Facebook’s other breakthrough today is a new method for semi-supervised learning that uses only a small number of labeled images to achieve “state of the art results” with a tenth of the training steps.
The researchers explain that many of their peers lack access to the large-scale computing resources needed to train high-performance computer vision models on lots of training data. PAWS, a new model training approach that can be used to create extremely accurate computer vision models, could well be the answer.
PAWS is said to build on self-supervised approaches such as DINO, but it combines a small amount of labeled data with lots of unlabeled data to speed up training.
“Similar to self-supervised approaches, the focus during pretraining is to train a neural network to map images to latent representations,” the researchers explained. “Given an unlabeled training image, we generate two or more views of the image using random data augmentations and transformations, and we train the neural network to make the representations of these views similar to one another.”
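The view-matching step the researchers describe can be sketched in a few lines of Python. This is only an illustration of the general idea, not the PAWS objective itself, which also draws on the labeled images. The tiny linear encode function, the noise-based augment function and the cosine-based loss are all stand-in assumptions chosen to keep the example self-contained.

```python
import math
import random


def augment(image, rng, noise=0.05):
    """Stand-in for a random augmentation: jitter every value slightly."""
    return [x + rng.uniform(-noise, noise) for x in image]


def encode(view, weights):
    """Hypothetical linear 'network' mapping a view to a latent representation."""
    return [sum(w * x for w, x in zip(row, view)) for row in weights]


def view_consistency_loss(z1, z2):
    """1 minus cosine similarity: near 0 when the two representations agree."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    return 1.0 - dot / (norm1 * norm2)


rng = random.Random(0)  # fixed seed so the run is repeatable
image = [0.2, 0.7, 0.5, 0.1]  # toy 'image' of four pixel values
weights = [[0.5, -0.2, 0.1, 0.3], [0.1, 0.4, -0.3, 0.2]]

# Two random views of the same unlabeled image.
view_a = augment(image, rng)
view_b = augment(image, rng)

loss = view_consistency_loss(encode(view_a, weights), encode(view_b, weights))
print(f"view consistency loss: {loss:.6f}")
```

Training would adjust the network's weights to push this loss toward zero across many images, so that different views of the same picture end up with similar representations.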
Facebook said that PAWS, when training a standard ResNet-50 model using just 1% of the labels in the ImageNet training data set and with a tenth of the pre-training steps, nonetheless achieved “state-of-the-art accuracy.”
The ultimate potential of both DINO and PAWS is that they can be used to build new computer vision systems that are much less dependent on labeled data and do not require massive amounts of compute resources. In other words, the DINO and PAWS methods could make computer vision AI far more accessible than before, and the resulting models can also be far more accurate.
“The need for human annotation is usually a bottleneck in the development of computer vision systems. By making our approaches more annotation-efficient, we allow models to be applied to a larger set of tasks and potentially scale the number of concepts they can recognize,” the researchers said.
Facebook said it’s making DINO and PAWS open-source, and the code for both training techniques is available now on GitHub.