Hand Tracking Algorithm Benchmark

This page presents the performance evaluation of the MPS Hand Tracking algorithm for Aria Gen2 glasses.

Benchmark Datasets

We evaluate MPS Hand Tracking on two complementary internal datasets.

HOI Dataset

The HOI dataset is specifically designed for hand-object interaction scenarios, which are particularly valuable for robotics applications and present significant challenges for hand tracking due to object-induced occlusions.

| Property | Value |
|---|---|
| Total Duration | 1 hour |
| Frame Rate | 30 fps |
| Cameras Used | All 4 CV cameras on Aria Gen2 |
| Number of Participants | 7 |
| Number of Objects | 20 everyday objects |

Seven participants were recruited to ensure diversity across ethnicity, age, gender, hand size, and arm length.

Each participant interacts naturally with a pool of 20 everyday objects. The dataset captures:

  • Frequent hand occlusions from objects or the other hand during natural interaction
  • Out-of-field-of-view segments where the subjects' hands move outside the Aria Gen2 field of view

Bystander Dataset

The bystander dataset is designed to evaluate the algorithm's ability to track only the wearer's hands and reject the hands of nearby people. This ability is critical for both contextual AI use cases (where bystanders are common) and robotics use cases (where false positives degrade dataset quality).

| Property | Value |
|---|---|
| Total Frames | 69,000 |
| Cameras Used | All 4 CV cameras on Aria Gen2 |
| Bystander Distance Buckets | < 50 cm, 50 – 100 cm, ≥ 100 cm |

Each frame is bucketed by the distance from the wearer to the closest bystander, enabling per-distance analysis. The dataset includes a dedicated subset focused on close-range face-to-face conversation, historically a major pain point for hand tracking in social interaction settings.
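The bucketing rule can be sketched as follows. This is a minimal illustration, not the benchmark's actual tooling: the function names and the input format (one closest-bystander distance in centimeters per frame) are assumptions. Only the bucket boundaries come from the dataset description.

```python
from collections import Counter

def distance_bucket(closest_bystander_cm: float) -> str:
    """Map the closest-bystander distance of one frame to a bucket label.

    Boundaries follow the dataset description: < 50 cm, 50 - 100 cm, >= 100 cm.
    The function itself is a hypothetical helper for illustration.
    """
    if closest_bystander_cm < 50:
        return "< 50 cm"
    if closest_bystander_cm < 100:
        return "50 - 100 cm"
    return ">= 100 cm"

def bucket_counts(distances_cm):
    """Count frames per bucket, given one distance per frame."""
    return Counter(distance_bucket(d) for d in distances_cm)

# Made-up distances for four frames, purely to show the call pattern.
print(bucket_counts([32.0, 50.0, 99.9, 140.0]))
```

Metrics such as precision and recall can then be computed separately over the frames in each bucket.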

Ground Truth Annotations

Hand pose ground truth annotations follow the standard UmeTrack format and are generated using an internal marker-free reconstruction system with millimeter-level keypoint accuracy.

Thanks to the extensive field-of-view coverage of outside-in cameras in the reconstruction system, nearly all frames have valid hand pose ground truth annotations.

Evaluation Metrics

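We report two complementary sets of metrics, pose-tracking metrics (MKPE and LTR) and wearer-hand detection metrics (precision and recall at an IoU threshold of 0.25):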

| Metric | Description | Used On |
|---|---|---|
| MKPE (Mean Keypoint Error) ↓ | Average Euclidean distance between predicted and ground truth keypoints, measured in millimeters | HOI Dataset |
| LTR (Lose Track Ratio) ↓ | Percentage of frames where tracking is lost | HOI Dataset |
| Precision @ IoU = 0.25 ↑ | Fraction of detected hands that are valid wearer hands | Bystander Dataset, HOI Dataset |
| Recall @ IoU = 0.25 ↑ | Fraction of wearer hands that are correctly detected | Bystander Dataset, HOI Dataset |
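The metric definitions above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the benchmark's evaluation code. The page does not specify how hand boxes are derived, how detections are matched to ground truth, or which frames enter the MKPE average. Here we assume axis-aligned 2D boxes, greedy one-to-one matching per frame, and MKPE averaged over whatever frames are passed in. All function names are hypothetical.

```python
import math

def mkpe_mm(pred_frames, gt_frames):
    """Mean Euclidean distance (mm) over all keypoints of all given frames.

    Each frame is a list of (x, y, z) keypoints in millimeters.
    """
    dists = [math.dist(p, g)
             for pf, gf in zip(pred_frames, gt_frames)
             for p, g in zip(pf, gf)]
    return sum(dists) / len(dists)

def lose_track_ratio(tracked_flags):
    """Percentage of frames flagged as not tracked (assumed definition of 'lost')."""
    return 100.0 * sum(1 for t in tracked_flags if not t) / len(tracked_flags)

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(frames, iou_threshold=0.25):
    """Precision and recall over frames of (detected_boxes, wearer_gt_boxes).

    A detection is a true positive if it matches a not-yet-matched wearer
    ground-truth box with IoU >= iou_threshold (greedy matching, an assumption).
    """
    tp = n_det = n_gt = 0
    for detections, wearer_gt in frames:
        unmatched = list(wearer_gt)
        for det in detections:
            best = max(unmatched, key=lambda g: iou(det, g), default=None)
            if best is not None and iou(det, best) >= iou_threshold:
                unmatched.remove(best)
                tp += 1
        n_det += len(detections)
        n_gt += len(wearer_gt)
    precision = tp / n_det if n_det else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall
```

Under this sketch, a detection that lands on a bystander's hand has no matching wearer ground-truth box, so it lowers precision without affecting recall.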

Results

HOI Dataset

| Method | MKPE (mm) ↓ | LTR (%) ↓ | Precision @ IoU = 0.25 ↑ | Recall @ IoU = 0.25 ↑ |
|---|---|---|---|---|
| On-device HT | 45.0 | 12.7 | | |
| MPS HT 3.1.1 | 19.9 | 8.8 | 0.963 | 0.925 |
| MPS HT 3.2.0 (latest) | 19.7 | 9.3 | 0.969 | 0.915 |
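To put the HOI numbers in relative terms, the short sketch below computes the percentage reduction in MKPE and LTR of each MPS HT version against the on-device baseline. The input values are copied from the table above. The helper name and dictionary layout are assumptions made for illustration.

```python
def relative_reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`, rounded to 0.1."""
    return round(100.0 * (baseline - value) / baseline, 1)

# Values from the HOI results table.
on_device = {"mkpe_mm": 45.0, "ltr_pct": 12.7}
mps_ht = {
    "3.1.1": {"mkpe_mm": 19.9, "ltr_pct": 8.8},
    "3.2.0": {"mkpe_mm": 19.7, "ltr_pct": 9.3},
}

for version, metrics in mps_ht.items():
    mkpe_gain = relative_reduction_pct(on_device["mkpe_mm"], metrics["mkpe_mm"])
    ltr_gain = relative_reduction_pct(on_device["ltr_pct"], metrics["ltr_pct"])
    print(f"MPS HT {version}: MKPE -{mkpe_gain}%, LTR -{ltr_gain}%")
```

Both MPS versions cut MKPE by more than half relative to on-device tracking. Version 3.2.0 has a slightly lower MKPE than 3.1.1 but a slightly higher LTR.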

Bystander Dataset

Aggregated across all distance buckets:

| Method | Precision @ IoU = 0.25 ↑ | Recall @ IoU = 0.25 ↑ |
|---|---|---|
| MPS HT 3.1.1 | 0.893 | 0.785 |
| MPS HT 3.2.0 (latest) | 0.938 | 0.765 |

Close-range Face-to-face Conversation

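A particularly challenging bystander scenario is close-range face-to-face conversation. MPS HT 3.2.0 improves precision most when the bystander is within 50 cm (0.78 to 0.93), with little change at larger distances. Recall is roughly unchanged in the closest and farthest buckets and drops in the 50 – 100 cm bucket (0.85 to 0.80):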

| Bystander Distance | 3.1.1 Precision ↑ | 3.2.0 Precision ↑ | 3.1.1 Recall ↑ | 3.2.0 Recall ↑ |
|---|---|---|---|---|
| < 50 cm | 0.78 | 0.93 | 0.70 | 0.72 |
| 50 – 100 cm | 0.94 | 0.95 | 0.85 | 0.80 |
| ≥ 100 cm | 0.97 | 0.97 | 0.99 | 1.00 |
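The per-bucket change between versions can be read off with a small script. The values are copied from the table above. The data layout and function name are assumptions for illustration only.

```python
# (precision 3.1.1, precision 3.2.0, recall 3.1.1, recall 3.2.0) per bucket,
# taken from the close-range conversation table.
results = {
    "< 50 cm": (0.78, 0.93, 0.70, 0.72),
    "50 - 100 cm": (0.94, 0.95, 0.85, 0.80),
    ">= 100 cm": (0.97, 0.97, 0.99, 1.00),
}

def version_deltas(table):
    """Return {bucket: (precision change, recall change)} from 3.1.1 to 3.2.0."""
    return {
        bucket: (round(p_new - p_old, 2), round(r_new - r_old, 2))
        for bucket, (p_old, p_new, r_old, r_new) in table.items()
    }

for bucket, (dp, dr) in version_deltas(results).items():
    print(f"{bucket}: precision {dp:+.2f}, recall {dr:+.2f}")
```

The largest change is the precision gain in the closest bucket. The only regression is recall at 50 to 100 cm.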