
ADT Data Format

The Aria Digital Twin Dataset (ADT) provides real-world and synthetic raw Project Aria data, derived data generated by ADT's ground truth data processing services, and derived data generated by Project Aria's Machine Perception Services (MPS).

All ADT data is recorded with up to two users in a single scene; however, sometimes only one user is wearing an Aria device. For multi-person recordings, we break each Aria recording and all associated ground truth data into separate sequences. This allows you to filter for any single Aria device recording, and all tooling is designed to operate on a single sequence's data.

Subsequences have been removed

In versions V1.X of ADT, we grouped concurrent recordings into a single sequence with two sub-sequences. We removed the concept of subsequences in V2.0 (June 2024). If you have data from versions prior to V2.0, we recommend you re-download the sequences.

note

All tooling is still backwards compatible, so you can still use old data, but this is not recommended.


Sequence structure

Sequence1Name
├── video.vrs # Aria recording data
├── instances.json # metadata of all instances in a sequence; an instance can be an object or a skeleton
├── aria_trajectory.csv # 6DoF Aria trajectory
├── 2d_bounding_box.csv # 2D bounding box data for instances in three Aria sensors: RGB camera, left SLAM camera, right SLAM camera
├── 3d_bounding_box.csv # 3D AABB of each object
├── scene_objects.csv # 6DoF poses of objects
├── eyegaze.csv # eye gaze data
├── synthetic_video.vrs # synthetic rendering of video.vrs
├── depth_images.vrs # depth images of video.vrs
├── segmentations.vrs # instance segmentations of video.vrs
├── skeleton_aria_association.json [optional] # association between Aria devices and skeletons, if they exist. Omitted if a sequence does not have skeleton ground truth
├── Skeleton_*.json [optional] # body skeleton data, where * is the skeleton name. Omitted if a sequence does not have skeleton ground truth
├── 2d_bounding_box_with_skeleton.csv [optional] # 2D bounding box data with body mesh occlusions. Omitted if a sequence does not have skeleton ground truth
├── depth_images_with_skeleton.vrs [optional] # depth images with body mesh occlusions. Omitted if a sequence does not have skeleton ground truth
├── segmentations_with_skeleton.vrs [optional] # segmentations with body mesh occlusions. Omitted if a sequence does not have skeleton ground truth
├── metadata.json # stores important information about the sequence
└── MPS # Go to Data Formats/MPS Output for more information about the data in this directory
    ├── eye_gaze
    │   ├── general_eye_gaze.csv
    │   └── summary.json
    └── slam
        ├── alignment_results.json # alignment results between the MPS closed loop trajectory and the ADT GT trajectory
        ├── closed_loop_trajectory.csv
        ├── open_loop_trajectory.csv
        ├── online_calibration.csv
        ├── semidense_observations.csv.gz
        ├── semidense_points.csv.gz
        └── summary.json
SkeletonMetaData.json name change

Prior to v1.1 of the dataset, skeleton_aria_association.json was called SkeletonMetaData.json.

Timestamps Mapping Data

Project Aria glasses that record concurrently in the same location use SMPTE timecode to share a synchronized clock with sub-millisecond accuracy.

The mapping between device time and timecode clock for each sequence is stored in the VRS file as a Time Domain Mapping Class. Go to Timestamps in Aria VRS Files for more information about how Aria sensor data is timestamped.
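As an illustration, here is a minimal Python sketch of reading that mapping with projectaria_tools, assuming the timecode conversion helpers exposed by VrsDataProvider in recent releases; the sequence path is a placeholder:

```python
from projectaria_tools.core import data_provider

# Placeholder path: point this at a sequence's video.vrs.
vrs_provider = data_provider.create_vrs_data_provider("Sequence1Name/video.vrs")

# Convert a timestamp in the shared timecode domain (nanoseconds) into this
# device's local capture clock, using the Time Domain Mapping stored in the VRS.
timecode_ns = 1_000_000_000  # example timecode-domain timestamp
device_time_ns = vrs_provider.convert_from_timecode_to_device_time_ns(timecode_ns)
```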

Go to Multiperson Synchronization for how to get synchronized ground truth data in a multi-person sequence.

Ground Truth Data

You can use the AriaDigitalTwinDataPathsProvider to load a sequence. AriaDigitalTwinDataPathsProvider manages all the ground truth files in a sequence folder (but not the MPS files).
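A minimal loading sketch, assuming the V2.0 Python API of projectaria_tools and a placeholder sequence path:

```python
from projectaria_tools.projects.adt import (
    AriaDigitalTwinDataPathsProvider,
    AriaDigitalTwinDataProvider,
)

# "Sequence1Name" is a placeholder for a downloaded ADT sequence folder.
paths_provider = AriaDigitalTwinDataPathsProvider("Sequence1Name")
data_paths = paths_provider.get_datapaths(False)  # False: ground truth set without skeleton occlusions
gt_provider = AriaDigitalTwinDataProvider(data_paths)
```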

Aligning Ground Truth and MPS Data

The alignment_results.json file in the MPS/slam directory contains the alignment results between the MPS closed loop trajectory and the ADT ground truth trajectory. The alignment has already been applied to the closed loop trajectory and the semidense point cloud to convert them from the SLAM frame to the ADT frame, ensuring all ADT data is expressed in the same coordinate frame for every sequence.

Skeleton Data and Availability

Not all ADT sequences have skeleton tracking. For sequences with skeleton tracking enabled, we use the marker measurements from the bodysuit to generate a 3D mesh estimate of the wearer, which is then used in our ground truth generation pipeline to calculate 2D bounding boxes, segmentation images, and depth images.

In these cases, ADT provides two sets of ground truth data: one with skeleton occlusion, one without.

  • segmentations.vrs vs. segmentations_with_skeleton.vrs
  • depth_images.vrs vs. depth_images_with_skeleton.vrs
  • 2d_bounding_box.csv vs. 2d_bounding_box_with_skeleton.csv

You can use AriaDigitalTwinDataPathsProvider to switch between these two sets.
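Continuing the loading sketch from Ground Truth Data above, the skeleton flag passed to get_datapaths selects which set is loaded (again assuming the V2.0 Python API):

```python
# True: load the file set WITH body mesh occlusions
# (2d_bounding_box_with_skeleton.csv, depth_images_with_skeleton.vrs, ...).
data_paths_with_skeleton = paths_provider.get_datapaths(True)

# False: load the set WITHOUT skeleton occlusions.
data_paths_no_skeleton = paths_provider.get_datapaths(False)
```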

Ground Truth Data Format

Our data loader loads all this data into a single class with useful tools for accessing data. For more information on the data classes returned by the loader, go to the Data Loader page.

2d_bounding_box.csv or 2d_bounding_box_with_skeleton.csv

| Column | Type | Description |
| --- | --- | --- |
| stream_id | string | camera stream id associated with the bounding box image |
| object_uid | uint64_t | id of the instance (object or skeleton) |
| timestamp[ns] | int64_t | timestamp of the image in nanoseconds |
| x_min[pixel] | int | minimum dimension in the x axis |
| x_max[pixel] | int | maximum dimension in the x axis |
| y_min[pixel] | int | minimum dimension in the y axis |
| y_max[pixel] | int | maximum dimension in the y axis |
| visibility_ratio[%] | double | percentage of the object that is visible (0: not visible, 1: fully visible) |
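For example, a short pandas sketch for filtering these boxes, assuming the CSV headers match the column names above ("214-1" is the RGB camera stream id on Aria devices):

```python
import pandas as pd

df = pd.read_csv("Sequence1Name/2d_bounding_box.csv")

# Keep mostly-visible boxes from the RGB stream.
rgb_boxes = df[(df["stream_id"] == "214-1") & (df["visibility_ratio[%]"] > 0.9)]
print(rgb_boxes[["object_uid", "timestamp[ns]", "x_min[pixel]", "y_min[pixel]"]].head())
```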

3d_bounding_box.csv

| Column | Type | Description |
| --- | --- | --- |
| object_uid | uint64_t | id of the instance (object or skeleton) |
| timestamp[ns] | int64_t | timestamp of the image in nanoseconds. -1 means the instance is static |
| p_local_obj_xmin[m] | double | minimum dimension in the x axis (in meters) of the bounding box |
| p_local_obj_xmax[m] | double | maximum dimension in the x axis (in meters) of the bounding box |
| p_local_obj_ymin[m] | double | minimum dimension in the y axis (in meters) of the bounding box |
| p_local_obj_ymax[m] | double | maximum dimension in the y axis (in meters) of the bounding box |
| p_local_obj_zmin[m] | double | minimum dimension in the z axis (in meters) of the bounding box |
| p_local_obj_zmax[m] | double | maximum dimension in the z axis (in meters) of the bounding box |
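The AABB extents directly give each object's size in its local frame; a small sketch, assuming the headers above:

```python
import pandas as pd

df = pd.read_csv("Sequence1Name/3d_bounding_box.csv")

# Object dimensions in meters, computed from the local-frame AABB extents.
for axis in ("x", "y", "z"):
    df[f"size_{axis}[m]"] = (
        df[f"p_local_obj_{axis}max[m]"] - df[f"p_local_obj_{axis}min[m]"]
    )

# timestamp[ns] == -1 marks static instances whose AABB never changes.
static_boxes = df[df["timestamp[ns]"] == -1]
```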

aria_trajectory.csv

ADT uses the same trajectory format as the MPS closed loop trajectory.

While the data structure is the same, the file is generated by the ADT ground truth system, not by MPS.
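Because the schema matches, the MPS reader in projectaria_tools can parse it directly; a minimal sketch:

```python
from projectaria_tools.core import mps

# aria_trajectory.csv shares the MPS closed loop trajectory schema.
trajectory = mps.read_closed_loop_trajectory("Sequence1Name/aria_trajectory.csv")
pose = trajectory[0]
print(pose.tracking_timestamp, pose.transform_world_device)
```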

eyegaze.csv

ADT uses the same eye gaze format as MPS.

Unlike the MPS output, the ground truth eyegaze.csv contains a depth estimate computed by the ADT ground truth system.
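Likewise, the MPS eye gaze reader can load this file; a minimal sketch:

```python
from projectaria_tools.core import mps

gaze_records = mps.read_eyegaze("Sequence1Name/eyegaze.csv")
first = gaze_records[0]
# yaw/pitch are gaze angles; depth is populated by the ADT ground truth system.
print(first.tracking_timestamp, first.yaw, first.pitch, first.depth)
```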

scene_objects.csv

| Column | Type | Description |
| --- | --- | --- |
| object_uid | uint64_t | id of the instance (object or skeleton) |
| timestamp[ns] | int64_t | timestamp of the image in nanoseconds. -1 means the instance is static |
| t_wo_x[m] | double | x translation from object frame to world (scene) frame (in meters) |
| t_wo_y[m] | double | y translation from object frame to world (scene) frame (in meters) |
| t_wo_z[m] | double | z translation from object frame to world (scene) frame (in meters) |
| q_wo_w | double | w component of quaternion from object frame to world (scene) frame |
| q_wo_x | double | x component of quaternion from object frame to world (scene) frame |
| q_wo_y | double | y component of quaternion from object frame to world (scene) frame |
| q_wo_z | double | z component of quaternion from object frame to world (scene) frame |
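To apply a row as a rigid transform, note that the quaternion columns are ordered w, x, y, z while SciPy expects x, y, z, w; a sketch with hypothetical row values:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical row values from scene_objects.csv.
q_wo_w, q_wo_x, q_wo_y, q_wo_z = 1.0, 0.0, 0.0, 0.0
t_wo = np.array([1.2, 0.3, 0.8])  # meters

# SciPy uses (x, y, z, w) quaternion ordering.
R_wo = Rotation.from_quat([q_wo_x, q_wo_y, q_wo_z, q_wo_w])

# p_world = R_wo * p_object + t_wo
p_object = np.array([0.1, 0.0, 0.0])
p_world = R_wo.apply(p_object) + t_wo
```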

instances.json

{
  "IID1": {
    "instance_id": IID1,
    "instance_name": "XXXX",
    "prototype_name": "XXXX",
    "category": "XXXX",
    "category_uid": XXXX,
    "motion_type": "static/dynamic",
    "instance_type": "object/human",
    "rigidity": "rigid/deformable",
    "rotational_symmetry": {
      "is_annotated": true/false
    },
    "canonical_pose": {
      "up_vector": [x, y, z],
      "front_vector": [x, y, z]
    }
  },
  ...
}
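A small sketch for filtering instances by the fields above:

```python
import json

with open("Sequence1Name/instances.json") as f:
    instances = json.load(f)

# Collect the ids of all dynamic objects (excluding humans).
dynamic_object_ids = [
    info["instance_id"]
    for info in instances.values()
    if info["motion_type"] == "dynamic" and info["instance_type"] == "object"
]
```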

Skeleton_T.json or Skeleton_C.json

{
  "dt_optitrack_minus_device_ns": {
    "1WM103600M1292": XXXXX
  },
  "frames": [
    {
      "markers": [
        [mx1, my1, mz1],
        ...
      ],
      "joints": [
        [jx1, jy1, jz1],
        ...
      ],
      "timestamp_ns": tsns1
    },
    ...
  ]
}
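A sketch for reading a skeleton file; by the field's name, dt = t_optitrack - t_device, so subtracting it converts an OptiTrack-clock timestamp into the device clock:

```python
import json

with open("Sequence1Name/Skeleton_T.json") as f:
    skeleton = json.load(f)

# Offset between the OptiTrack clock and each Aria device clock,
# keyed by device serial: dt = t_optitrack - t_device.
offsets = skeleton["dt_optitrack_minus_device_ns"]

frame = skeleton["frames"][0]
markers = frame["markers"]  # list of [x, y, z] marker positions
joints = frame["joints"]    # list of [x, y, z] joint positions
print(frame["timestamp_ns"], len(markers), len(joints))
```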

skeleton_aria_association.json

This file lists the skeleton info for each human in the sequence, including the skeleton name, Id, and associated Aria device.

Because a person wearing a bodysuit may not be wearing an Aria device, a skeleton can have no associated AriaDeviceSerial.

It's also possible to have an Aria wearer with no bodysuit, in which case the skeleton Id and name associated with that Aria device are empty.

{
  "SkeletonMetadata": [
    {
      "AssociatedDeviceSerial": "AriaSerial1/NONE",
      "SkeletonId": ID1,
      "SkeletonName": "SkeletonName1/NONE"
    },
    ...
  ]
}
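A minimal sketch for reading the associations:

```python
import json

with open("Sequence1Name/skeleton_aria_association.json") as f:
    associations = json.load(f)["SkeletonMetadata"]

# Entries may be "NONE" when a bodysuit wearer has no Aria device,
# or empty when an Aria wearer has no bodysuit.
for entry in associations:
    print(entry["AssociatedDeviceSerial"], entry["SkeletonId"], entry["SkeletonName"])
```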

video.vrs

video.vrs contains the raw sensor recording from the Aria device.

depth_images.vrs

depth_images.vrs contains 3 streams of images corresponding to the exact streams in video.vrs.

  • Each depth image is the same size as its corresponding raw image, where the pixel contents are integers expressing the depth along the camera's Z-axis, in units of mm.
    • This should not be confused with ASE depth images, which describe the depth along each pixel ray.
  • Depth data is calculated using ADT's ground truth system.
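Continuing the loading sketch from Ground Truth Data, a hedged example of fetching a depth image and converting it to meters ("214-1" is the RGB stream id; the accessor names follow the ADT tutorials):

```python
import numpy as np
from projectaria_tools.core.stream_id import StreamId

stream_id = StreamId("214-1")  # RGB camera stream
timestamps = gt_provider.get_aria_device_capture_timestamps_ns(stream_id)

depth_with_dt = gt_provider.get_depth_image_by_timestamp_ns(timestamps[0], stream_id)
if depth_with_dt.is_valid():
    # Pixel values are Z-depth in millimeters; convert to meters.
    depth_m = depth_with_dt.data().to_numpy_array().astype(np.float32) / 1000.0
```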

segmentations.vrs

segmentations.vrs contains 3 streams of images corresponding to the exact streams in video.vrs.

  • Each segmentation image is the same size as its corresponding raw image, where the pixel contents are integers expressing the instance Id observed by that pixel
  • Segmentation data is calculated using ADT’s ground truth system
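And the corresponding sketch for segmentations, looking up the instance observed at a pixel (continuing from the depth sketch above):

```python
seg_with_dt = gt_provider.get_segmentation_image_by_timestamp_ns(timestamps[0], stream_id)
if seg_with_dt.is_valid():
    seg = seg_with_dt.data().to_numpy_array()
    instance_id = int(seg[240, 320])  # instance id at pixel (row, col)
    if instance_id != 0:              # assumption: 0 marks background pixels
        info = gt_provider.get_instance_info_by_id(instance_id)
        print(info.name, info.category)
```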