NTU60-AUDIO: Audio-Augmented Benchmark for 3D Indoor Action Recognition


Real-World Videos Are Not Silent: Synthesized Audio Dataset Augmentation for Multimodal Action Recognition
Interspeech 2026 — Anonymous Submission
About the Dataset
NTU60-AUDIO is a large-scale synthesized audio extension of NTU RGB+D 60, providing quality-controlled audio tracks for all silent RGB videos across 60 indoor action classes. Audio is synthesized through a two-stage pipeline: action-aware textual descriptions are first generated from video frames using Qwen3-VL-8B-Instruct, then used alongside the video to condition MMAudio for temporally grounded waveform synthesis. Each generated track is automatically evaluated by three complementary quality metrics and regenerated up to three times if any metric falls below its P5 threshold. NTU60-AUDIO is the first large-scale audio augmentation of a fixed-viewpoint 3D indoor action recognition benchmark, designed to enable systematic audio-visual research in a domain previously limited to egocentric and in-the-wild settings.

Sample Clips

A002 · Daily Action
A2. Eat Meal
AA-PQ 6.03 PAM 0.970 Sync 0.940
A004 · Daily Action
A4. Brush Hair
AA-PQ 6.43 PAM 0.963 Sync 0.960
A006 · Daily Action
A6. Pick Up
AA-PQ 6.39 PAM 0.939 Sync 0.979
A010 · Daily Action
A10. Clapping
AA-PQ 6.35 PAM 0.941 Sync 0.969
A011 · Daily Action
A11. Reading
AA-PQ 6.22 PAM 0.920 Sync 0.919
A029 · Daily Action
A29. Play with Phone / Tablet
AA-PQ 5.82 PAM 0.924 Sync 0.971
A030 · Daily Action
A30. Type on a Keyboard
AA-PQ 6.15 PAM 0.950 Sync 0.987
A043 · Medical Condition
A43. Falling Down
AA-PQ 6.48 PAM 0.867 Sync 0.887
A044 · Medical Condition
A44. Headache
AA-PQ 6.20 PAM 0.868 Sync 0.936

Dataset Construction Pipeline

Every video in NTU RGB+D 60 is processed through a three-stage pipeline to produce a synchronized audio track.

01
Prompted Video-to-Audio Synthesis

Eight uniformly sampled frames are passed to Qwen3-VL-8B-Instruct to generate an action-aware textual description and plausible sound events (e.g., footsteps, impacts, coughing). The description, together with the video, conditions MMAudio to synthesize a semantically aligned mono waveform at 44.1 kHz, resampled to 32 kHz.
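The uniform sampling of eight frames can be sketched as evenly spaced index selection over the clip. This is a minimal illustration; the exact sampling code used by the pipeline is not specified in the source.

```python
def uniform_frame_indices(num_frames: int, k: int = 8) -> list[int]:
    """Pick k evenly spaced frame indices from a clip of num_frames frames,
    always including the first and last frame."""
    if num_frames <= k:
        return list(range(num_frames))
    return [round(i * (num_frames - 1) / (k - 1)) for i in range(k)]
```

For a 300-frame NTU clip this yields eight indices spanning the full action, which are then passed to the vision-language model for description generation.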

02
Multi-Metric Quality Assessment

Each generated audio track is scored by three complementary metrics. Thresholds are set at the 5th percentile (P5) computed on a stratified 10% subset.
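Setting a threshold at the 5th percentile of a calibration subset amounts to a standard percentile estimate. The sketch below uses linear interpolation between order statistics; the interpolation convention used in the actual pipeline is an assumption here.

```python
def p5_threshold(scores: list[float]) -> float:
    """5th-percentile threshold with linear interpolation between
    adjacent order statistics (assumed convention, matching numpy's
    default 'linear' method)."""
    s = sorted(scores)
    pos = 0.05 * (len(s) - 1)   # fractional rank of the 5th percentile
    lo = int(pos)
    frac = pos - lo
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])
    return s[lo]
```

Applied per metric on the stratified 10% subset, this yields one threshold each for AA-PQ, PAM, and the synchronization score.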

03
Quality-Aware Regeneration Loop

Tracks failing any threshold are regenerated up to three times. If all attempts fail, the candidate with the highest aggregated (normalized + averaged) score is selected. 85.9% of samples pass on the first attempt; 9.8% require one regeneration; 2.4% require two; 1.9% fall back to best-of-3 selection.
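The accept/regenerate logic above can be sketched as a loop over synthesis attempts with a normalized-score fallback. Here `synthesize` and the functions in `score_fns` are hypothetical stand-ins for MMAudio synthesis and the AA-PQ/PAM/Synchformer scorers, and min-max normalization across candidates is an assumed reading of "normalized + averaged".

```python
def generate_with_qc(synthesize, score_fns, thresholds, max_regens=3):
    """Quality-aware regeneration loop (sketch).

    Accepts the first candidate whose every metric clears its P5
    threshold; if all attempts fail, returns the candidate with the
    highest min-max-normalized, averaged score."""
    candidates = []  # (audio, {metric: value}) pairs
    for _ in range(1 + max_regens):  # initial attempt + regenerations
        audio = synthesize()
        scores = {m: fn(audio) for m, fn in score_fns.items()}
        candidates.append((audio, scores))
        if all(scores[m] >= thresholds[m] for m in thresholds):
            return audio, scores

    # Fallback: normalize each metric across candidates, then average.
    def normalized_mean(scores):
        total = 0.0
        for m in thresholds:
            vals = [s[m] for _, s in candidates]
            lo, hi = min(vals), max(vals)
            total += (scores[m] - lo) / (hi - lo) if hi > lo else 0.5
        return total / len(thresholds)

    return max(candidates, key=lambda c: normalized_mean(c[1]))
```

The early return on the first passing candidate explains the regeneration statistics: most tracks never enter the fallback branch.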

Audiobox-Aesthetics (AA-PQ) — Production quality: detects clipping, excessive noise, and low fidelity in generated audio.
PAM (Prompting Audio-Language Models) — Perceptual quality: a CLAP-based metric that prompts an audio-language model to score cleanness on a 0–1 scale; higher indicates cleaner audio.
Synchformer — Temporal alignment between audio and video; the probability of zero temporal offset is used as the synchronization score.
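Turning a per-offset classifier output into the synchronization score described above can be sketched as a softmax over candidate offsets, keeping the probability mass at offset zero. The logits-over-offsets interface is a hypothetical post-processing view; Synchformer's actual head and signature may differ.

```python
import math

def zero_offset_probability(offset_logits: list[float],
                            offsets: list[float]) -> float:
    """Softmax the per-offset logits and return the probability assigned
    to a temporal offset of zero (the sync score used for thresholding)."""
    m = max(offset_logits)                          # stabilize the softmax
    exps = [math.exp(x - m) for x in offset_logits]
    z = sum(exps)
    return exps[offsets.index(0)] / z
```

A track whose probability mass concentrates at offset zero scores near 1.0; mass spread over nonzero offsets produces the partially offset mode visible in the sync distribution.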

Quality Metric Distributions

Each of the 56,578 synthesized audio tracks is evaluated by three independent quality metrics. Tracks failing any threshold (set at the 5th percentile, P5, of a stratified 10% subset) are automatically regenerated up to three times. The distributions below reflect the final accepted dataset after all regeneration attempts.

The sample count of 56,578 reflects NTU RGB+D 60 after excluding samples with corrupted or missing skeleton data, following the preprocessing protocol of Hyperformer.

AA-PQ Distribution

Distribution is approximately normal and well-centered above the P5 threshold (4.59). The regeneration loop effectively reduces the left tail — only 2.16% of final tracks remain below threshold, typically from acoustically ambiguous action classes.

PAM Distribution

Strongly right-skewed with a median of 0.80 — most tracks score above 0.6. PAM prompts a CLAP-based audio-language model to assess perceptual cleanness; it is the least common failure mode (0.57%), as a single re-synthesis almost always produces a cleaner result.

Sync Zero Probability Distribution

U-shaped distribution reflecting bimodal alignment — tracks are either well-synced (near 1.0) or partially offset. Sync is the hardest metric to enforce, with 2.09% remaining below threshold even after regeneration.

Regeneration Count Distribution

85.9% of tracks pass on the first attempt. The sharp drop from 0× to 1× shows that a single re-synthesis resolves most failures. Only 1.9% exhaust all three attempts, and 4.75% of final tracks remain below at least one threshold.

Downloads

All resources are freely available for non-commercial research use under CC BY-NC 4.0.
NTU RGB+D 60 video frames are subject to their original license.

Acknowledgments

NTU60-AUDIO is built upon the NTU RGB+D dataset and uses Qwen3-VL and MMAudio for audio synthesis.
We appreciate the authors' excellent work!