📚 Paper Info
- Title: Unsupervised 3D Pose Estimation for Hierarchical Dance Video Recognition
- Authors: Xiaodan Hu, Narendra Ahuja
- Link: https://arxiv.org/abs/2109.09166
- Published: ICCV 2021 · University of Illinois Urbana-Champaign
🧠 Day 3 Review: 3D Pose Estimation & Genre Classification (Sections 2.2–2.4)
This session covers the core methodology behind 3D pose lifting, body part motion recognition, and final genre classification.
The paper presents a fully unsupervised approach, structured into three stages: (1) 3D Pose Estimation, (2) Body Part Motion Modeling, and (3) Genre Classification.
🔹 Section 2.2: 3D Pose Estimation via Multi-Seed Optimization
The 2D keypoints are lifted to 3D pose representations without any 3D ground truth.
To address the ambiguity of this inverse problem, the authors introduce a multi-seed optimization strategy:
- Generate multiple candidate 3D poses, one sequence per initialization seed $k$:
\[\{P_t^k, w_t^k\}\]
- Evaluate each candidate with a composite loss function
- Select the best pose seed using total error minimization:
\[k^* = \arg\min_k \sum_t e_t^k\]
\[\hat{P}_t = P_t^{k^*}, \quad w_t = w_t^{k^*}\]
Loss Terms:
- Reprojection consistency with the detected 2D keypoints:
\[L_{2D} = \| \hat{p}_t - p_t \|\]
- Temporal smoothness of the projected 2D pose:
\[L_{\text{smooth2D}} = \| \hat{p}_t - \hat{p}_{t-1} \|\]
- Temporal smoothness of the 3D pose:
\[L_{\text{smooth3D}} = \| \hat{P}_t - \hat{P}_{t-1} \|\]
- Consistency with the reference 3D pose $P_t^*$:
\[L_{3D} = \| \hat{P}_t - P_t^* \|\]
This strategy enables unsupervised, temporally coherent 3D pose reconstruction from 2D keypoints.
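To make the seed-selection step concrete, here is a minimal NumPy sketch of the $\arg\min$ over seeds. The array shapes, the weighting coefficients `lambdas`, and the omission of the $L_{3D}$ term (which requires the reference pose $P_t^*$) are my own assumptions, not details taken from the paper.

```python
import numpy as np

def select_best_seed(P_cands, p_proj_cands, p_obs, lambdas=(1.0, 0.1, 0.1)):
    """Pick k* = argmin_k sum_t e_t^k over candidate seeds.

    Assumed (hypothetical) shapes:
      P_cands:      (K, T, J, 3) candidate 3D pose sequences, one per seed
      p_proj_cands: (K, T, J, 2) their 2D reprojections \hat{p}
      p_obs:        (T, J, 2)    detected 2D keypoints p
    """
    l_2d, l_s2d, l_s3d = lambdas
    # L_2D: reprojection error per seed and frame -> (K, T)
    e_2d = np.linalg.norm(p_proj_cands - p_obs[None], axis=(-2, -1))
    # L_smooth2D / L_smooth3D: frame-to-frame differences -> (K, T-1)
    e_s2d = np.linalg.norm(np.diff(p_proj_cands, axis=1), axis=(-2, -1))
    e_s3d = np.linalg.norm(np.diff(P_cands, axis=1), axis=(-2, -1))
    # Total error per seed; L_3D is omitted in this sketch
    total = (l_2d * e_2d.sum(axis=1)
             + l_s2d * e_s2d.sum(axis=1)
             + l_s3d * e_s3d.sum(axis=1))
    k_star = int(np.argmin(total))
    return k_star, P_cands[k_star]
```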
🔹 Section 2.3: Body Part Movement Recognition
Each body part $e \in E$ is associated with its own LSTM model to recognize basic motion types over time.
- Input: the 3D trajectories of the part's joints $J_e$:
\[\left\{ \left\{ \hat{P}_t^j \right\}_{j \in J_e} \right\}_{t=0}^{T-1}\]
- Output: Multi-label motion vector per time step:
\[\left\{ \hat{y}_t^e \right\}_{t=0}^{T-1}\]
- Loss: binary cross-entropy for each body part:
\[L_{\text{BCE}}^e = \text{BCE}\left( \left\{ \hat{y}_t^e \right\}, \left\{ y_t^e \right\} \right)\]
This component models the localized movement patterns across different body regions in a time-aware manner.
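As a concrete reading of this design, here is a minimal PyTorch sketch of one per-part recognizer. The class name `PartMotionLSTM`, the hidden size, and the flattening of joint coordinates are illustrative assumptions; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn

class PartMotionLSTM(nn.Module):
    """One recognizer per body part e: 3D joint trajectories -> per-step
    multi-label motion predictions."""
    def __init__(self, n_joints, n_labels, hidden=128):
        super().__init__()
        # Input: flattened 3D coordinates of the part's joints J_e
        self.lstm = nn.LSTM(3 * n_joints, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_labels)

    def forward(self, x):                    # x: (B, T, |J_e|, 3)
        B, T = x.shape[:2]
        h, _ = self.lstm(x.reshape(B, T, -1))
        return self.head(h)                  # logits: (B, T, n_labels)

# L_BCE^e: multi-label binary cross-entropy against ground truth y_t^e
criterion = nn.BCEWithLogitsLoss()
model = PartMotionLSTM(n_joints=4, n_labels=8)   # illustrative sizes
x = torch.randn(2, 50, 4, 3)                     # (B, T, |J_e|, 3)
y = torch.randint(0, 2, (2, 50, 8)).float()
loss = criterion(model(x), y)
```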
🔹 Section 2.4: Dance Genre Recognition
Finally, the predicted motion vectors from all body parts are concatenated and fed into a separate LSTM for genre classification.
- Input: the per-part motion sequences from Section 2.3:
\[\left\{ \left\{ \hat{y}_t^e \right\}_{t=0}^{T-1} \right\}_{e \in E}\]
- Output: final genre prediction from the last LSTM hidden state $h_T$:
\[\hat{g} = \text{Softmax}(W h_T + b)\]
\[L_{\text{genre}} = -\sum_{k=1}^K g_k \log(\hat{g}_k)\]
This fusion stage captures the global movement semantics needed to infer the genre label from distributed joint dynamics.
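A minimal PyTorch sketch of this fusion stage, under my own assumptions about dimensions and naming (the concatenation order, hidden size, and number of genres are not specified in my notes):

```python
import torch
import torch.nn as nn

class GenreLSTM(nn.Module):
    """Fusion LSTM: concatenated part-level motion vectors -> genre logits."""
    def __init__(self, in_dim, n_genres, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_genres)  # W h_T + b

    def forward(self, part_motions):          # list of (B, T, n_labels_e)
        x = torch.cat(part_motions, dim=-1)   # (B, T, sum_e n_labels_e)
        _, (h_T, _) = self.lstm(x)            # last hidden state h_T
        return self.head(h_T[-1])             # genre logits

# Cross-entropy implements L_genre = -sum_k g_k log(g_hat_k)
criterion = nn.CrossEntropyLoss()
model = GenreLSTM(in_dim=16, n_genres=10)     # illustrative sizes
parts = [torch.sigmoid(torch.randn(2, 50, 8)) for _ in range(2)]
loss = criterion(model(parts), torch.tensor([3, 7]))
```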
✅ Key Takeaways
- The 3D pose lifting component is fully unsupervised and relies on temporal smoothness, reprojection, and pose consistency.
- Motion recognition is handled separately per body part using LSTM-based sequence models.
- Dance genre classification is conducted by aggregating part-level motions into a high-level spatiotemporal representation.
📚 Reflections
This section offers a technically elegant solution to an otherwise annotation-heavy problem.
By leveraging weak priors and smoothness constraints, the method produces coherent 3D pose sequences from 2D data.
I'm particularly impressed by the modularity: each phase (pose, motion, genre) is optimized independently yet tightly linked to the next.
I plan to examine the 3D lifting in more detail later and possibly attempt a lightweight re-implementation using custom dance clips.