The Anatomy of the Kinect Algorithms Explained
The Microsoft Kinect represents more than just a breakthrough in user interfaces for video gamers, as we’ve seen multiple times. It has a surprising number of applications throughout human-computer interaction. But how does it do what it does? A paper has now been published outlining the anatomy of the system underlying this extremely popular peripheral.
What the team did next was to train a type of classifier called a decision forest, i.e. a collection of decision trees. Each tree was trained on a set of features of depth images that were pre-labeled with the target body parts. That is, the decision trees were adjusted until they gave the correct classification for a particular body part across the training set of images. Training just three trees on 1 million images took about a day on a 1,000-core cluster.
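To make that concrete, here’s a minimal sketch (in Python with scikit-learn, and emphatically not the team’s actual pipeline) of the idea: compute depth-difference features around each pixel, with the offsets scaled by the pixel’s own depth so the feature doesn’t change as a person moves toward or away from the camera, then fit a small forest on pre-labeled pixels. The offset counts, image sizes, and random labels below are stand-ins for illustration.

```python
# Sketch only: depth-difference features plus a small decision forest,
# in the spirit of the paper's per-pixel body-part classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def depth_feature(depth, x, y, u, v):
    """Compare the depth at two offsets around pixel (x, y).

    The offsets u and v are divided by the pixel's own depth, so the
    probe pattern shrinks for people who are farther from the camera.
    """
    d = depth[y, x]
    h, w = depth.shape
    x1, y1 = int(x + u[0] / d), int(y + u[1] / d)
    x2, y2 = int(x + v[0] / d), int(y + v[1] / d)
    d1 = depth[y1, x1] if 0 <= x1 < w and 0 <= y1 < h else 1e6  # off-image counts as "very far"
    d2 = depth[y2, x2] if 0 <= x2 < w and 0 <= y2 < h else 1e6
    return d1 - d2

rng = np.random.default_rng(0)
# A bag of random offset pairs; each pair defines one feature per pixel.
offsets = [(rng.uniform(-100, 100, 2), rng.uniform(-100, 100, 2)) for _ in range(50)]

def pixel_features(depth, x, y):
    return [depth_feature(depth, x, y, u, v) for u, v in offsets]

# Fabricated stand-in data: a random "depth image" and random part labels.
# In reality X comes from rendered depth images and y from their labels.
depth_img = rng.uniform(1.0, 4.0, (240, 320))
pixels = [(rng.integers(10, 310), rng.integers(10, 230)) for _ in range(500)]
X = np.array([pixel_features(depth_img, px, py) for px, py in pixels])
y = rng.integers(0, 31, len(pixels))  # the paper labels 31 body parts

forest = RandomForestClassifier(n_estimators=3, max_depth=20).fit(X, y)
```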
The trained classifiers assign a probability of a pixel belonging to each body part, and the next stage of the algorithm simply picks out areas of maximum probability for each body-part type. So an area will be assigned to the category “leg” if the leg classifier has a probability maximum in that area. The final stage is to compute suggested joint positions relative to the areas identified as particular body parts. In the diagram below, the different body-part probability maxima are indicated as colored areas.
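As a rough stand-in for that stage (the paper actually finds the probability modes with a weighted mean-shift, which is more robust than this), the sketch below takes the per-pixel probabilities the forest spits out and reduces each body part to a single proposed joint position by computing a probability-weighted centroid of the confident pixels. The function name and threshold are assumptions, not anything from the paper.

```python
# Sketch only: collapse per-pixel body-part probabilities into one
# proposed joint position per part via a weighted centroid.
import numpy as np

def propose_joints(depth, prob, threshold=0.5):
    """depth: (H, W) depth image; prob: (H, W, num_parts) probabilities.

    Returns {part_id: (x, y, z)} for every part with confident pixels.
    """
    joints = {}
    h, w, num_parts = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]
    for part in range(num_parts):
        p = prob[:, :, part]
        weights = np.where(p > threshold, p, 0.0)  # ignore low-confidence pixels
        total = weights.sum()
        if total == 0:
            continue  # this part isn't visible in the frame
        cx = (xs * weights).sum() / total
        cy = (ys * weights).sum() / total
        cz = (depth * weights).sum() / total  # rough depth of the part
        joints[part] = (cx, cy, cz)
    return joints
```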
Decision trees and forests are a fairly common mechanism across computer science wherever a machine needs to predict behavior across a set of dimensions. For example, a decision forest might be employed to predict real-time changes in data such as the motion of a ship on water, the rise and fall of multiple stocks on the market, or even image stabilization for a camera being held in the hand.
Knowing what an object is, how it can move, and where it can go narrows things down to a discrete set of possible actions. In other words, the classifiers mentioned above allow an object to be labeled as, say, a “knee joint,” which has a particular range of movements it can make in relation to the “hip joint” and the “ankle joint.” In fact, should the knee change position between time t and time t + 1 second, it must fall within a very specific region, and the path between those positions can be predicted extremely easily. When the Kinect goes to detect bodies, it already has models of how bodies work pre-loaded. The “knee joint” will never suddenly be six feet away from the “ankle joint” and the “hip joint” (without something horrible happening to the person in the process). Knowing this, the Kinect can easily re-acquire the location of each joint by keeping track of at least two of them, even if it momentarily loses track of one from second to second.
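Here’s a toy illustration of that constraint: if we trust the hip position and know roughly how long the thigh is, a noisy (or briefly lost) knee estimate can be pulled back onto the sphere of plausible positions around the hip. The function and numbers are purely illustrative, not the Kinect’s actual tracking code.

```python
# Sketch only: snap a child joint back to a plausible distance from its parent.
import numpy as np

def constrain_joint(parent_pos, child_estimate, bone_length):
    """Project a child-joint estimate onto the sphere of radius bone_length
    centred on its parent joint."""
    direction = child_estimate - parent_pos
    dist = np.linalg.norm(direction)
    if dist == 0:
        return parent_pos + np.array([0.0, -bone_length, 0.0])  # arbitrary fallback
    return parent_pos + direction * (bone_length / dist)

hip = np.array([0.0, 1.0, 2.5])          # metres, camera coordinates
noisy_knee = np.array([0.1, 0.2, 2.6])   # this frame's raw estimate
knee = constrain_joint(hip, noisy_knee, bone_length=0.45)
print(knee)  # a knee at a believable distance from the hip
```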
Gesture and facial detection work in a very similar fashion. Picking points on the face that relate to other points on the face in a predictable model gives the Kinect’s best-guess engine a framework in which it only needs to see a certain percentage of the points at any one time, and it can make some pretty good guesses about where the rest must be in relation to those that are visible.
We can extrapolate the same for fingers.
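A toy version of that fill-in-the-blanks idea: given a reference layout of landmarks and only some of them visible this frame, estimate how the visible ones have shifted and apply the same shift to the occluded ones. The landmark names and the translation-only fit are assumptions for illustration; a real tracker fits a much richer model.

```python
# Sketch only: guess occluded landmarks from visible ones using a fixed
# reference layout and a simple translation fit.
import numpy as np

reference = {  # reference landmark layout (arbitrary units)
    "left_eye":  np.array([-1.0,  1.0]),
    "right_eye": np.array([ 1.0,  1.0]),
    "nose":      np.array([ 0.0,  0.0]),
    "mouth":     np.array([ 0.0, -1.0]),
}

def predict_hidden(visible, reference):
    """visible: {name: observed position} for the landmarks we can see.
    Returns predicted positions for every landmark in the model."""
    shift = np.mean([visible[n] - reference[n] for n in visible], axis=0)
    return {name: ref + shift for name, ref in reference.items()}

seen = {"left_eye": np.array([3.0, 5.0]), "nose": np.array([4.0, 4.0])}
print(predict_hidden(seen, reference))  # right_eye and mouth guessed from the model
```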
The paper is quite complex, but if you’re into equations, go and read it. You’ll find a very comprehensive explanation of how the algorithms work. In fact, motion-capture animators and their kindred souls will probably greatly enjoy the mechanisms behind the Kinect’s guessing algorithms and modeling.
For the rest of you, here’s a flashy video describing the process: