This session covers the dataset setup, 3D pose estimation accuracy, movement recognition, genre classification, and final conclusions.
The authors evaluate their method on the UID (University of Illinois Dance) and AIST++ datasets, using metrics like MPJPE and F-score.
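As a quick reference for the first metric, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames. A minimal sketch (array shapes and units are my assumptions, not the authors' code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for poses shaped
    (num_frames, num_joints, 3), assumed to be in millimetres."""
    # Euclidean distance per joint, averaged over joints and frames.
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy usage: 100 frames, 17 joints, random poses.
pred = np.random.randn(100, 17, 3)
gt = np.random.randn(100, 17, 3)
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")
```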
UID dataset statistics:

| Attribute | Value |
|---|---|
| Genres | 9 |
| Total Clips | 1,143 |
| Total Frames | 2,788,157 |
| Total Duration | 108,089 s (≈30 h) |
| Min / Max Clip Length | 4 s / 824 s |
| Min / Max Clips per Genre | 30 / 304 |
The dataset includes both simple (tutorial) and complex (multi-dancer, noisy-background) dance videos.
3D pose estimation comparison (MPJPE in mm; lower is better):

| Method | Supervision | Extra Data | MPJPE ↓ |
|---|---|---|---|
| Martinez [ICCV'17] | Supervised | — | 110.0 |
| Wandt [CVPR'19] | Supervised | — | 323.7 |
| Pavllo [CVPR'19] | Supervised | — | 77.6 |
| Pavllo (semi-sup.) | Semi | — | 446.1 |
| Ours (semi-sup.) | Semi | — | 73.7 |
| Zhou [ICCV'17] | Weakly sup. | — | 93.1 |
| Kocabas [CVPR'19] | Self-sup. | Multiview | 87.4 |
| Ours (unsup.) | Unsupervised | — | 246.4 |
A second comparison, including additional baselines:

| Method | Supervision | Extra Data | MPJPE ↓ |
|---|---|---|---|
| Pavllo [CVPR'19] | Supervised | — | 46.8 |
| Ours (semi-sup.) | Semi | — | 47.3 |
| Martinez [ICCV'17] | Supervised | — | 87.3 |
| Zanfir [CVPR'18] | Supervised | — | 69.0 |
| Ours (unsup.) | Unsupervised | — | 82.1 |
The semi-supervised version performs on par with fully supervised models, without using any 3D ground truth.
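The exact training objective isn't spelled out in these notes; the sketch below only illustrates the general idea of supervising a 2D-to-3D lifting network with a reprojection-consistency loss instead of 3D labels. The network, camera model, and all names are hypothetical placeholders:

```python
import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Toy 2D-to-3D lifting network (hypothetical; not the paper's architecture)."""
    def __init__(self, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, 1024), nn.ReLU(),
            nn.Linear(1024, num_joints * 3),
        )

    def forward(self, pose2d):                   # (B, J, 2) detected keypoints
        out = self.net(pose2d.flatten(1))
        return out.view(-1, self.num_joints, 3)  # (B, J, 3) predicted 3D joints

def reprojection_loss(pose3d, pose2d, focal=1.0):
    """Project predicted 3D joints with a simple pinhole camera and compare
    them against the observed 2D joints, so no 3D ground truth is needed."""
    depth = (pose3d[..., 2:] + 10.0).clamp(min=1e-3)  # assumed camera offset; avoid divide-by-zero
    projected = focal * pose3d[..., :2] / depth       # perspective projection back to 2D
    return ((projected - pose2d) ** 2).mean()

lifter = Lifter()
pose2d = torch.randn(8, 17, 2)                        # a batch of 2D pose detections
loss = reprojection_loss(lifter(pose2d), pose2d)
loss.backward()                                       # gradients flow into the lifter
```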
Per-body-part movement recognition with 2D vs. 3D pose input:

| Body Part | 2D Pose | 3D Pose |
|---|---|---|
| Head | 0.93 | 0.97 |
| L Shoulder | 0.95 | 0.93 |
| R Arm | 0.89 | 0.94 |
| Torso | 0.91 | 0.93 |
| Hips | 0.81 | 1.00 |
| L Foot | 0.85 | 0.98 |
3D pose leads to improved movement recognition, especially for hips and lower body.
Genre classification results by input feature:

| Input | F-score |
|---|---|
| 2D Pose | 0.44 |
| 3D Pose | 0.47 |
| Movements (2D) | 0.50 |
| Movements (3D) | 0.55 |
| 2D + Movements (2D) | 0.73 |
| 3D + Movements (3D) | 0.86 |
The best genre-classification performance comes from fusing 3D pose and movement features.
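How the fusion is implemented isn't detailed here; one simple reading is late fusion by feature concatenation before the genre classifier, roughly like the sketch below (feature dimensions and layer sizes are placeholders I chose):

```python
import torch
import torch.nn as nn

class GenreClassifier(nn.Module):
    """Toy fusion classifier: concatenate a 3D-pose feature and a
    movement-recognition feature, then predict one of 9 dance genres."""
    def __init__(self, pose_dim=256, movement_dim=128, num_genres=9):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(pose_dim + movement_dim, 256), nn.ReLU(),
            nn.Linear(256, num_genres),
        )

    def forward(self, pose_feat, movement_feat):
        fused = torch.cat([pose_feat, movement_feat], dim=-1)  # feature-level fusion
        return self.head(fused)

clf = GenreClassifier()
logits = clf(torch.randn(4, 256), torch.randn(4, 128))  # (batch, 9) genre logits
```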
This section showcases how careful modular design can unlock performance even under limited supervision.
The authors make a compelling case for replacing raw appearance with pose-level abstraction, especially for dance-related tasks.
I'd like to test the movement fusion strategy on my own multi-person dance data and see how well it generalizes with lighter LSTMs or even Transformers.
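If I try that, a small Transformer encoder over 3D pose sequences would be my starting point, something like the sketch below (architecture and hyperparameters are my guesses, not from the paper):

```python
import torch
import torch.nn as nn

class PoseTransformer(nn.Module):
    """Small Transformer encoder over a sequence of 3D poses, as a
    lighter alternative to an LSTM-based recognizer (hypothetical)."""
    def __init__(self, num_joints=17, d_model=128, num_classes=9):
        super().__init__()
        self.embed = nn.Linear(num_joints * 3, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, poses):                # (B, T, J, 3) pose sequences
        x = self.embed(poses.flatten(2))     # per-frame joint embedding: (B, T, d_model)
        x = self.encoder(x).mean(dim=1)      # temporal average pooling
        return self.head(x)                  # (B, num_classes) genre logits

model = PoseTransformer()
logits = model(torch.randn(2, 120, 17, 3))   # 2 clips, 120 frames each
```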