NTU60-AUDIO: Audio-Augmented Benchmark for 3D Indoor Action Recognition


Real-World Videos Are Not Silent: Synthesized Audio Dataset Augmentation for Multimodal Action Recognition
Interspeech 2026 — Anonymous Submission
About the Dataset
NTU60-AUDIO is a large-scale synthesized audio extension of NTU RGB+D 60, providing quality-controlled audio tracks for all silent RGB videos across 60 indoor action classes. Audio is synthesized through a two-stage pipeline: action-aware textual descriptions are first generated from video frames using Qwen3-VL-8B-Instruct, then used alongside the video to condition MMAudio for temporally grounded waveform synthesis. Each generated track is automatically evaluated by three complementary quality metrics and regenerated up to three times if any metric falls below its P5 threshold. NTU60-AUDIO is the first large-scale audio augmentation of a fixed-viewpoint 3D indoor action recognition benchmark, designed to enable systematic audio-visual research in a domain previously limited to egocentric and in-the-wild settings.

Sample Clips

A002 · Daily Action
A2. Eat Meal
AA-PQ 6.03 PAM 0.970 Sync 0.940
A004 · Daily Action
A4. Brush Hair
AA-PQ 6.43 PAM 0.963 Sync 0.960
A006 · Daily Action
A6. Pick Up
AA-PQ 6.39 PAM 0.939 Sync 0.979
A010 · Daily Action
A10. Clapping
AA-PQ 6.35 PAM 0.941 Sync 0.969
A011 · Daily Action
A11. Reading
AA-PQ 6.22 PAM 0.920 Sync 0.919
A029 · Daily Action
A29. Play with Phone / Tablet
AA-PQ 5.82 PAM 0.924 Sync 0.971
A030 · Daily Action
A30. Type on a Keyboard
AA-PQ 6.15 PAM 0.950 Sync 0.987
A043 · Medical Condition
A43. Falling Down
AA-PQ 6.48 PAM 0.867 Sync 0.887
A044 · Medical Condition
A44. Headache
AA-PQ 6.20 PAM 0.868 Sync 0.936

Dataset Construction Pipeline

Every video in NTU RGB+D 60 is processed through a three-stage pipeline to produce a synchronized audio track.

01
Prompted Video-to-Audio Synthesis

Eight uniformly sampled frames are passed to Qwen3-VL-8B-Instruct to generate an action-aware textual description and plausible sound events (e.g., footsteps, impacts, coughing). The description, together with the video, conditions MMAudio to synthesize a semantically aligned mono waveform at 44.1 kHz, resampled to 32 kHz.
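The uniform sampling of eight frames can be sketched as evenly spaced index selection over the clip. This is a minimal illustration; the exact sampling code used by the pipeline is not specified in the source.

```python
def uniform_frame_indices(num_frames: int, k: int = 8) -> list[int]:
    """Pick k evenly spaced frame indices from a clip of num_frames frames,
    always including the first and last frame."""
    if num_frames <= k:
        return list(range(num_frames))
    return [round(i * (num_frames - 1) / (k - 1)) for i in range(k)]
```

For a 300-frame NTU clip this yields eight indices spanning the full action, which are then passed to the vision-language model for description generation.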

02
Multi-Metric Quality Assessment

Each generated audio track is scored by three complementary metrics. Thresholds are set at the 5th percentile (P5) computed on a stratified 10% subset.
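Setting a threshold at the 5th percentile of a calibration subset amounts to a standard percentile estimate. The sketch below uses linear interpolation between order statistics; the interpolation convention used in the actual pipeline is an assumption here.

```python
def p5_threshold(scores: list[float]) -> float:
    """5th-percentile threshold with linear interpolation between
    adjacent order statistics (assumed convention, matching numpy's
    default 'linear' method)."""
    s = sorted(scores)
    pos = 0.05 * (len(s) - 1)   # fractional rank of the 5th percentile
    lo = int(pos)
    frac = pos - lo
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])
    return s[lo]
```

Applied per metric on the stratified 10% subset, this yields one threshold each for AA-PQ, PAM, and the synchronization score.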

03
Quality-Aware Regeneration Loop

Tracks failing any threshold are regenerated up to three times. If all attempts fail, the candidate with the highest aggregated (normalized + averaged) score is selected. 85.9% of samples pass on the first attempt; 9.8% require one regeneration; 2.4% require two; 1.9% fall back to best-of-3 selection.
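The accept/regenerate logic above can be sketched as a loop over synthesis attempts with a normalized-score fallback. Here `synthesize` and the functions in `score_fns` are hypothetical stand-ins for MMAudio synthesis and the AA-PQ/PAM/Synchformer scorers, and min-max normalization across candidates is an assumed reading of "normalized + averaged".

```python
def generate_with_qc(synthesize, score_fns, thresholds, max_regens=3):
    """Quality-aware regeneration loop (sketch).

    Accepts the first candidate whose every metric clears its P5
    threshold; if all attempts fail, returns the candidate with the
    highest min-max-normalized, averaged score."""
    candidates = []  # (audio, {metric: value}) pairs
    for _ in range(1 + max_regens):  # initial attempt + regenerations
        audio = synthesize()
        scores = {m: fn(audio) for m, fn in score_fns.items()}
        candidates.append((audio, scores))
        if all(scores[m] >= thresholds[m] for m in thresholds):
            return audio, scores

    # Fallback: normalize each metric across candidates, then average.
    def normalized_mean(scores):
        total = 0.0
        for m in thresholds:
            vals = [s[m] for _, s in candidates]
            lo, hi = min(vals), max(vals)
            total += (scores[m] - lo) / (hi - lo) if hi > lo else 0.5
        return total / len(thresholds)

    return max(candidates, key=lambda c: normalized_mean(c[1]))
```

The early return on the first passing candidate explains the regeneration statistics: most tracks never enter the fallback branch.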

Audiobox-Aesthetics (AA-PQ) — Production quality: detects clipping, excessive noise, and low fidelity in generated audio.
PAM (Prompting Audio-Language Models) — Perceptual quality: a CLAP-based metric that prompts an audio-language model to score cleanness on a 0–1 scale; higher indicates cleaner audio.
Synchformer — Temporal alignment between audio and video; the probability of zero temporal offset is used as the synchronization score.
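Turning a per-offset classifier output into the synchronization score described above can be sketched as a softmax over candidate offsets, keeping the probability mass at offset zero. The logits-over-offsets interface is a hypothetical post-processing view; Synchformer's actual head and signature may differ.

```python
import math

def zero_offset_probability(offset_logits: list[float],
                            offsets: list[float]) -> float:
    """Softmax the per-offset logits and return the probability assigned
    to a temporal offset of zero (the sync score used for thresholding)."""
    m = max(offset_logits)                          # stabilize the softmax
    exps = [math.exp(x - m) for x in offset_logits]
    z = sum(exps)
    return exps[offsets.index(0)] / z
```

A track whose probability mass concentrates at offset zero scores near 1.0; mass spread over nonzero offsets produces the partially offset mode visible in the sync distribution.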

Quality Metric Distributions

Each of the 56,578 synthesized audio tracks is evaluated by three independent quality metrics. Tracks failing any threshold (set at the 5th percentile, P5, of a stratified 10% subset) are automatically regenerated up to three times. The distributions below reflect the final accepted dataset after all regeneration attempts.

The sample count of 56,578 reflects NTU RGB+D 60 after excluding samples with corrupted or missing skeleton data, following the preprocessing protocol of Hyperformer.

AA-PQ Distribution

Distribution is approximately normal and well-centered above the P5 threshold (4.59). The regeneration loop effectively reduces the left tail — only 2.16% of final tracks remain below threshold, typically from acoustically ambiguous action classes.

PAM Distribution

Strongly right-skewed with a median of 0.80 — most tracks score above 0.6. PAM prompts a CLAP-based audio-language model to assess perceptual cleanness; it is the least common failure mode (0.57%), as a single re-synthesis almost always produces a cleaner result.

Sync Zero Probability Distribution

U-shaped distribution reflecting bimodal alignment — tracks are either well-synced (near 1.0) or partially offset. Sync is the hardest metric to enforce, with 2.09% remaining below threshold even after regeneration.

Regeneration Count Distribution

85.9% of tracks pass on the first attempt. The sharp drop from 0× to 1× shows that a single re-synthesis resolves most failures. Only 1.9% exhaust all three attempts, and 4.75% of final tracks remain below at least one threshold.

Downloads

All resources are freely available for non-commercial research use under CC BY-NC 4.0.
NTU RGB+D 60 video frames are subject to their original license.

Acknowledgments

NTU60-AUDIO is built upon the NTU RGB+D dataset and uses Qwen3-VL and MMAudio for audio synthesis.
We appreciate the authors' excellent work!