UPDATED 14:27 EDT / MARCH 30 2011

The Anatomy of the Kinect Algorithms Explained

The Microsoft Kinect represents more than just a breakthrough in user interfaces for video gamers, as we’ve seen multiple times. It has a surprising number of applications throughout human-computer interaction. But how does it do what it does? A paper has been written outlining the anatomy of the system underlying this extremely popular peripheral.

What the team did next was to train a type of classifier called a decision forest, i.e. a collection of decision trees. Each tree was trained on a set of features computed from depth images that were pre-labeled with the target body parts. That is, the decision trees were modified until they gave the correct classification for a particular body part across the set of labeled training images. Training just three trees using 1 million training images took about a day using a 1,000-core cluster.
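
To make the idea of a decision forest concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's implementation: the two-number feature vectors, the "hand" and "leg" labels, the Gini split score, and the helper names (`build_tree`, `forest_predict`) are all assumptions made for the example. Each tree picks a random feature at every split, and the forest averages the label distributions stored in the leaves.

```python
import random
from collections import Counter

def leaf(labels):
    # A leaf stores the fraction of training samples of each label that reached it.
    n = len(labels)
    return {"dist": {k: c / n for k, c in Counter(labels).items()}}

def gini(labels):
    # Gini impurity: 0.0 when all labels agree, larger when they are mixed.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth, rng):
    if depth == 0 or len(set(labels)) == 1:
        return leaf(labels)
    f = rng.randrange(len(rows[0]))  # random feature choice makes the trees differ
    best = None
    for t in sorted({r[f] for r in rows}):  # try every observed value as a threshold
        left = [i for i, r in enumerate(rows) if r[f] < t]
        right = [i for i, r in enumerate(rows) if r[f] >= t]
        if not left or not right:
            continue
        score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None:
        return leaf(labels)
    _, t, left, right = best
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth - 1, rng),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth - 1, rng),
    }

def tree_predict(node, row):
    # Walk from the root to a leaf and return that leaf's label distribution.
    while "dist" not in node:
        node = node["left"] if row[node["feature"]] < node["threshold"] else node["right"]
    return node["dist"]

def forest_predict(forest, row):
    # Average the per-tree distributions to get one probability per label.
    totals = Counter()
    for tree in forest:
        for label, p in tree_predict(tree, row).items():
            totals[label] += p / len(forest)
    return dict(totals)

# Hypothetical pre-labeled training data: (feature_0, feature_1) per pixel.
rows = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
labels = ["hand"] * 3 + ["leg"] * 3

rng = random.Random(0)
forest = [build_tree(rows, labels, depth=3, rng=rng) for _ in range(3)]
probs = forest_predict(forest, (0.12, 0.18))
print(probs)
```

The real system works on features derived from depth images and far deeper trees, but the shape is the same: many trees, each voting with a probability distribution rather than a single hard answer.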

The trained classifiers assign a probability of a pixel being in each body part, and the next stage of the algorithm simply picks out areas of maximum probability for each body part type. So an area will be assigned to the category “leg” if the leg classifier has a probability maximum in the area. The final stage is to compute suggested joint positions relative to the areas identified as particular body parts. In the diagram below the different body part probability maxima are indicated as colored areas:
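
As a rough sketch of that last step, the Python below turns per-pixel probability maps into one suggested position per body part by taking a probability-weighted centroid of the high-probability pixels. The 3x3 grids, the 0.5 threshold, the 2D pixel coordinates, and the function name `propose_joints` are assumptions chosen for illustration; the paper's own method for locating the maxima may differ.

```python
def propose_joints(prob_maps, threshold=0.5):
    """Return {part: (row, col)} from {part: 2D list of per-pixel probabilities}.

    Only pixels at or above the (assumed) threshold contribute, and each
    contributes in proportion to its probability.
    """
    joints = {}
    for part, grid in prob_maps.items():
        total = row_sum = col_sum = 0.0
        for r, line in enumerate(grid):
            for c, p in enumerate(line):
                if p >= threshold:
                    total += p
                    row_sum += r * p
                    col_sum += c * p
        if total > 0:  # skip parts with no confident pixels at all
            joints[part] = (row_sum / total, col_sum / total)
    return joints

# Hypothetical classifier output for a tiny 3x3 depth image.
prob_maps = {
    "head": [[0.9, 0.9, 0.0],
             [0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0]],
    "leg":  [[0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0],
             [0.0, 0.8, 0.1]],
}

joints = propose_joints(prob_maps)
print(joints)
```

Here the "head" proposal lands between the two confident pixels in the top row, while the weak 0.1 pixel in the "leg" map is ignored because it falls below the threshold.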

Decision trees and forests are a fairly common mechanism across computer science when a computer needs to predict activity across a set of dimensions. For example, a decision forest might be employed when attempting to predict real-time changes in data such as the motion of a ship on water, the rise and fall of multiple stocks on the market, or even image stabilization for a camera being held in the hand.

Knowing what an object is, how it can move, and where it can go allows a discrete set of possible actions. That is, the classifiers mentioned above allow an object to be labeled as, say, a “knee joint,” which has a limited range of movements it can make in relation to the “hip joint” and the “ankle joint.” In fact, should the knee change position between time + 0 seconds and time + 1 second, it must fall within a very specific region, and the change between those positions can be drawn as a line that can be guessed extremely easily. When the Kinect goes to detect bodies, it has models of how bodies work already pre-loaded. The “knee joint” will never suddenly be six feet away from the “ankle joint” and the “hip joint” (without something horrible happening to the person in the process). Knowing this, the Kinect can easily re-acquire the location of each of the joints by keeping track of at least two of them, even if it momentarily loses track of one of the joints from second to second.
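
The constraint described above can be sketched as a simple plausibility check. This is an illustration of the reasoning, not the Kinect's actual tracking code: the limits `max_limb_m` and `max_step_m`, the 3D coordinates in meters, and the function names are all hypothetical values picked for the example.

```python
import math

def plausible(candidate, neighbours, max_limb_m=0.6):
    # A joint cannot be farther from its neighbours than a limb is long.
    return all(math.dist(candidate, n) <= max_limb_m for n in neighbours)

def update_knee(prev_knee, candidate, hip, ankle, max_step_m=0.5):
    """Accept a newly detected knee position only if a body could produce it.

    Falls back to the previous position when the detection is missing,
    too far from the hip or ankle, or too far from where the knee was
    one frame earlier.
    """
    if candidate is None:
        return prev_knee
    if not plausible(candidate, [hip, ankle]):
        return prev_knee
    if math.dist(candidate, prev_knee) > max_step_m:
        return prev_knee
    return candidate

# Hypothetical tracked positions (x, y, z) in meters.
hip = (0.0, 1.0, 2.0)
ankle = (0.0, 0.1, 2.0)
prev_knee = (0.0, 0.55, 2.0)

accepted = update_knee(prev_knee, (0.0, 0.5, 2.05), hip, ankle)  # small, believable move
rejected = update_knee(prev_knee, (2.0, 0.5, 2.0), hip, ankle)   # two meters from the hip
lost = update_knee(prev_knee, None, hip, ankle)                  # detection dropped out
print(accepted, rejected, lost)
```

With the hip and ankle still tracked, a dropped or implausible knee detection simply keeps the last believable position until a new one passes the checks.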

Gesture and facial detection work in a very similar fashion. Picking points on the face that interact with other points on the face in a predictable way gives the Kinect’s best-guess engine a model in which it only needs to see a certain percentage of the points at a time and can make some pretty good guesses about where the rest must be in relation to those visible.

We can extrapolate the same for fingers.

The paper is quite complex, but if you’re into equations, go and read it. You’ll find a very comprehensive explanation of how the algorithms work. In fact, motion-capture animators and their kindred souls will probably greatly enjoy the mechanisms behind the Kinect’s guessing algorithms and modeling.

For the rest of you, here’s a flashy video describing the process:
