Aria Gen 2 Pilot Dataset Format
The dataset contains 12 sequences in total. Each Aria glasses recording is stored as its own sequence, with all data related to that recording self-contained within that sequence folder. An example sequence folder looks like this:
```
└── sequence_name
    ├── video.vrs
    ├── vrs_health_check_results.json
    ├── mps
    │   ├── slam
    │   │   ├── closed_loop_trajectory.csv
    │   │   ├── open_loop_trajectory.csv
    │   │   ├── semidense_observations.csv.gz
    │   │   ├── semidense_points.csv.gz
    │   │   ├── online_calibration.jsonl
    │   │   └── summary.json
    │   └── hand_tracking
    │       ├── hand_tracking_results.csv
    │       └── summary.json
    ├── diarization
    │   ├── diarization_results.csv
    │   └── summary.json
    ├── scene
    │   ├── 2d_bounding_box.csv
    │   ├── 3d_bounding_box.csv
    │   ├── instances.json
    │   ├── scene_objects.csv
    │   └── summary.json
    ├── heart_rate
    │   ├── heart_rate_results.csv
    │   └── summary.json
    ├── depth
    │   ├── depth
    │   │   ├── depth_00000001.png
    │   │   ├── …
    │   │   └── depth_{:08d}.png
    │   ├── rectified_images
    │   │   ├── image_00000000.png
    │   │   ├── …
    │   │   └── image_{:08d}.png
    │   ├── pinhole_camera_parameters.json
    │   └── summary.json
    └── hand_object_interaction
        ├── hand_object_interaction_results.json
        └── summary.json
```
File format
closed_loop_trajectory.csv, open_loop_trajectory.csv, semidense_observations.csv.gz, semidense_points.csv.gz, online_calibration.jsonl, and hand_tracking_results.csv follow the MPS data format.
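These files can be read with the MPS loaders in projectaria_tools. A minimal sketch, assuming a recent projectaria_tools release and the folder layout above (the sequence path is a placeholder):

```python
from projectaria_tools.core import mps

seq = "sequence_name"  # hypothetical sequence folder

# Device poses in the world frame, one per tracking timestamp.
trajectory = mps.read_closed_loop_trajectory(f"{seq}/mps/slam/closed_loop_trajectory.csv")
pose = trajectory[0]
print(pose.tracking_timestamp, pose.transform_world_device)

# Semidense point cloud; the loader handles the gzip compression.
points = mps.read_global_point_cloud(f"{seq}/mps/slam/semidense_points.csv.gz")
print(len(points), points[0].position_world)
```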
diarization_results.csv
| Column | Type | Description |
|---|---|---|
| start_timestamp_ns | int | Start of the speech segment, in nanoseconds, in the device time domain. |
| end_timestamp_ns | int | End of the speech segment, in nanoseconds, in the device time domain. |
| speaker | string | Unique identifier of the speaker. |
| content | string | The transcribed speech (ASR output) as text. |
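Since this is a plain CSV, it can be inspected with pandas. A minimal sketch, assuming the sequence layout above:

```python
import pandas as pd

# Hypothetical path following the sequence layout above.
df = pd.read_csv("sequence_name/diarization/diarization_results.csv")

# Print every speech segment with its speaker and duration in seconds.
for row in df.itertuples():
    duration_s = (row.end_timestamp_ns - row.start_timestamp_ns) / 1e9
    print(f"{row.speaker} ({duration_s:.1f}s): {row.content}")
```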
2d_bounding_box.csv
| Column | Type | Description |
|---|---|---|
| stream_id | string | camera stream id of the image the bounding box is defined on |
| object_uid | uint64_t | id of the object instance |
| timestamp[ns] | int64_t | timestamp of the image in nanoseconds |
| x_min[pixel] | int | minimum x coordinate of the bounding box |
| x_max[pixel] | int | maximum x coordinate of the bounding box |
| y_min[pixel] | int | minimum y coordinate of the bounding box |
| y_max[pixel] | int | maximum y coordinate of the bounding box |
| visibility_ratio[%] | double | fraction of the object that is visible (0: not visible, 1: fully visible). NOTE: for EVL this is estimated from semidense points, is NOT accurate, and is filled with -1. |
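Boxes can be grouped per frame and per stream. A minimal sketch with pandas; the stream id value is an assumption ("214-1" is the usual Aria RGB stream id), so check the ids actually present in the sequence:

```python
import pandas as pd

df = pd.read_csv("sequence_name/scene/2d_bounding_box.csv")

# "214-1" is an assumed RGB stream id; verify against the sequence's streams.
rgb = df[df["stream_id"] == "214-1"]

# Gather the boxes of the earliest annotated RGB frame.
first_ts = rgb["timestamp[ns]"].min()
frame = rgb[rgb["timestamp[ns]"] == first_ts]
boxes = frame[["object_uid", "x_min[pixel]", "x_max[pixel]",
               "y_min[pixel]", "y_max[pixel]"]].to_numpy()
print(f"{len(boxes)} boxes at t={first_ts}")
```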
3d_bounding_box.csv
| Column | Type | Description |
|---|---|---|
| object_uid | uint64_t | id of the object instance |
| timestamp[ns] | int64_t | timestamp of the image in nanoseconds. -1 means the instance is static |
| p_local_obj_xmin[m] | double | minimum dimension in the x axis (in meters) of the bounding box |
| p_local_obj_xmax[m] | double | maximum dimension in the x axis (in meters) of the bounding box |
| p_local_obj_ymin[m] | double | minimum dimension in the y axis (in meters) of the bounding box |
| p_local_obj_ymax[m] | double | maximum dimension in the y axis (in meters) of the bounding box |
| p_local_obj_zmin[m] | double | minimum dimension in the z axis (in meters) of the bounding box |
| p_local_obj_zmax[m] | double | maximum dimension in the z axis (in meters) of the bounding box |
instances.json

```
{
    "<object_uid>": {
        "canonical_pose": {"front_vector": <unit_vector>, "up_vector": <unit_vector>},
        "category": <category name>,
        "category_uid": <category uid>,
        "instance_id": <instance id>,
        "instance_name": <text description>,
        "instance_type": "object",
        "motion_type": "static",
        "prototype_name": <text description>,
        "rigidity": "rigid",
        "rotational_symmetry": {"is_annotated": false}
    },
    ...
}
```

Example entry:

```
"12": {
    "canonical_pose": {"front_vector": [0, 0, 1], "up_vector": [0, 1, 0]},
    "category": "Screen",
    "category_uid": 10,
    "instance_id": 12,
    "instance_name": "screen, television, laptop screen (not keyboard), tablet screen, computer monitor, display, not mobile phone screen",
    "instance_type": "object",
    "motion_type": "static",
    "prototype_name": "screen, television, laptop screen (not keyboard), tablet screen, computer monitor, display, not mobile phone screen",
    "rigidity": "rigid",
    "rotational_symmetry": {"is_annotated": false}
}
```
scene_objects.csv
| Column | Type | Description |
|---|---|---|
| object_uid | uint64_t | id of the object instance |
| timestamp[ns] | int64_t | timestamp of the image in nanoseconds. -1 means the instance is static |
| t_wo_x[m] | double | x translation from object frame to world (scene) frame (in meters) |
| t_wo_y[m] | double | y translation from object frame to world (scene) frame (in meters) |
| t_wo_z[m] | double | z translation from object frame to world (scene) frame (in meters) |
| q_wo_w | double | w component of quaternion from object frame to world (scene) frame |
| q_wo_x | double | x component of quaternion from object frame to world (scene) frame |
| q_wo_y | double | y component of quaternion from object frame to world (scene) frame |
| q_wo_z | double | z component of quaternion from object frame to world (scene) frame |
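Combining the object poses in scene_objects.csv with the local box extents from 3d_bounding_box.csv above yields world-frame boxes. A minimal sketch for a static object, assuming the sequence layout above and using scipy for the quaternion:

```python
import numpy as np
import pandas as pd
from scipy.spatial.transform import Rotation

seq = "sequence_name"  # hypothetical sequence folder
boxes = pd.read_csv(f"{seq}/scene/3d_bounding_box.csv")
poses = pd.read_csv(f"{seq}/scene/scene_objects.csv")

# Pick one object; for dynamic objects the join would also match timestamp[ns].
box = boxes.iloc[0]
pose = poses[poses["object_uid"] == box["object_uid"]].iloc[0]

# Eight corners of the axis-aligned box in the object's local frame.
xs = (box["p_local_obj_xmin[m]"], box["p_local_obj_xmax[m]"])
ys = (box["p_local_obj_ymin[m]"], box["p_local_obj_ymax[m]"])
zs = (box["p_local_obj_zmin[m]"], box["p_local_obj_zmax[m]"])
corners_obj = np.array([[x, y, z] for x in xs for y in ys for z in zs])

# Apply T_world_object: rotate by q_wo, then translate by t_wo.
# scipy expects quaternions in (x, y, z, w) order.
r_wo = Rotation.from_quat([pose["q_wo_x"], pose["q_wo_y"], pose["q_wo_z"], pose["q_wo_w"]])
t_wo = pose[["t_wo_x[m]", "t_wo_y[m]", "t_wo_z[m]"]].to_numpy(dtype=float)
corners_world = r_wo.apply(corners_obj) + t_wo
print(corners_world)
```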
heart_rate_results.csv
| Column | Type | Description |
|---|---|---|
| timestamp_ns | int | Timestamp, in nanoseconds, in device time domain. |
| heart_rate_bpm | int | The estimated heart rate (beats per minute) at that timestamp. |
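Because the timestamps share the device time domain, heart-rate samples can be related directly to MPS trajectory timestamps from the same sequence. A minimal sketch:

```python
import pandas as pd

hr = pd.read_csv("sequence_name/heart_rate/heart_rate_results.csv")

# Timestamps are in the device time domain, so they are directly comparable
# with the MPS trajectory timestamps of the same sequence.
duration_s = (hr["timestamp_ns"].max() - hr["timestamp_ns"].min()) / 1e9
print(f"{len(hr)} samples over {duration_s:.0f}s, "
      f"mean {hr['heart_rate_bpm'].mean():.1f} bpm")
```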
Depth Folder
Depth output is stored in the depth folder and consists of the following files:
- Rectified depth maps (`depth_{:08d}.png`) as 512 × 512, 16-bit grayscale PNG images. Pixel values are integers giving the depth along the pixel's ray direction, in millimeters. This is the same format used in ASE.
- Matching rectified front-left SLAM camera images (`image_{:08d}.png`) as 8-bit grayscale PNGs.
- A JSON file (`pinhole_camera_parameters.json`) containing the camera transform and intrinsics of the rectified pinhole camera for each frame.
Example JSON:
```
[
    {
        "T_world_camera": {
            "QuaternionXYZW": [
                -0.56967133283615112,
                0.35075613856315613,
                0.60195386409759521,
                0.4360002875328064
            ],
            "Translation": [
                -0.0005431128665804863,
                0.0053895660676062107,
                -0.0027622696943581104
            ]
        },
        "camera": {
            "ModelName": "Linear:fu,fv,u0,v0",
            "Parameters": [
                306.38043212890625,
                306.38043212890625,
                254.6942138671875,
                257.29779052734375
            ]
        },
        "index": 0,
        "frameTimestampNs": 34234234234
    }
]
```
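To recover 3D points, each pixel can be unprojected with the linear model parameters and its unit ray scaled by the stored distance. A minimal sketch; the sequence path and the assumption that the JSON list index matches the frame number in the filename are placeholders to verify against the "index" field:

```python
import json
import numpy as np
from PIL import Image

seq = "sequence_name"  # hypothetical sequence folder

# Camera parameters for one frame (index-to-filename match is an assumption).
with open(f"{seq}/depth/pinhole_camera_parameters.json") as f:
    frames = json.load(f)
fu, fv, u0, v0 = frames[1]["camera"]["Parameters"]

# 16-bit PNG; values are distances along each pixel's ray, in millimeters.
depth_mm = np.asarray(Image.open(f"{seq}/depth/depth/depth_00000001.png"),
                      dtype=np.float64)

# Unproject with the linear (pinhole) model. The stored value is the distance
# along the ray, NOT the z-depth, so the ray direction must be normalized.
v, u = np.mgrid[0:depth_mm.shape[0], 0:depth_mm.shape[1]]
rays = np.stack([(u - u0) / fu, (v - v0) / fv, np.ones_like(depth_mm)], axis=-1)
rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
points_cam = rays * depth_mm[..., None] / 1000.0  # camera-frame points, meters
```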
hand_object_interaction_results.json
Standard COCO JSON format, with category_id 1 (left hand), 2 (right hand), and 3 (object interacting with a hand). Example JSON file:
```
[
    {
        "segmentation": {
            "size": [
                1512,
                2016
            ],
            "counts": "TUk`13U_10L8L0K7N0UJOYlN9eS1G[lN9aS11okN=mS1CSlN=kS1GokNa0a1kNRl0d0]ROa0a1kNnk0]1aPOhNSN`1Q5dNYl0T1cPOhNSN`1Q5dNYl0T1cPOhNSN`1Q5dNQl0T2mnNT3V1QMko0KonNT3V1QMio0Z7jnNRIVQ1n6jnNRIVQ1`:00N200L400M3000000N200N200N2000000N200000002N00000N4N0N200N4N0000002N00000N4N0000002N004L004L0000002N002N00000000000020N02N002N0020N02N0020N03M002N002N00f0ZO004L002N00000N4N000000000000000000000000000000000002L2000000000000000002N000N20000000000000000000000000000002N00000000000000000000N2000N20N20000000N2000000000N202N00000000000000000000000000000000000000000000000N20000000000000000N202N00000000000N20000000N200000000000000N200N20000000000000N02000000000000000000000N202N000N202L202L202L200N20000N200N200N202I504hNT102iNU103K202H606XOb004mJYmNTMPS1l2PmNTMRS1f2VmNRMnR1n2RmNRMPS1c2amNaLSS1_3mlNaLdU1^Ok50?VO;06^O<0de]n0"
        },
        "bbox": [
            1050.1774193548385,
            738.1073369565217,
            325.16129032258095,
            535.0801630434783
        ],
        "score": 1.0,
        "image_id": 2620886,
        "category_id": 3
    },
    ...
]
```

The image timestamp is recovered as `timestamp_ns = image_id * 1e6`.
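Since the masks use COCO run-length encoding, they can be decoded with pycocotools (an assumption; any COCO-compatible RLE decoder works). A minimal sketch:

```python
import json
from pycocotools import mask as mask_utils

path = "sequence_name/hand_object_interaction/hand_object_interaction_results.json"
with open(path) as f:
    annotations = json.load(f)

for ann in annotations:
    # pycocotools expects compressed RLE counts as bytes.
    rle = {"size": ann["segmentation"]["size"],
           "counts": ann["segmentation"]["counts"].encode("utf-8")}
    binary_mask = mask_utils.decode(rle)  # H x W uint8 mask
    timestamp_ns = int(ann["image_id"] * 1e6)  # per the note above
    print(ann["category_id"], timestamp_ns, binary_mask.sum())
```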