📌 Paper Info


🧠 Day 3 Review — 3D Pose Estimation & Genre Classification (Sections 2.2 ~ 2.4)

This session covers the core methodology behind 3D pose lifting, body part motion recognition, and final genre classification.
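The paper structures its pipeline into three stages: (1) 3D Pose Estimation, (2) Body Part Motion Modeling, and (3) Genre Classification. Only the first stage is unsupervised, in the sense that it needs no 3D ground truth; the later two stages are trained against labels through BCE and cross-entropy losses.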


🔹 Section 2.2 — 3D Pose Estimation via Multi-Seed Optimization

The 2D keypoints are lifted to 3D pose representations without any 3D ground truth.
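To address the ambiguity of this inverse problem, the authors introduce a multi-seed optimization strategy: the optimization is run from several seeds $k$, each yielding a candidate set $\{P_t^k, w_t^k\}$ with per-frame error $e_t^k$, and the seed with the lowest error summed over all frames is kept:

\[\{P_t^k, w_t^k\}\]
\[k^* = \arg\min_k \sum_t e_t^k\]
\[\hat{P}_t = P_t^{k^*}, \quad w_t = w_t^{k^*}\]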
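The seed-selection rule can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function name `select_best_seed` and the array layouts are assumptions, and the per-frame errors $e_t^k$ are taken as already computed.

```python
import numpy as np

def select_best_seed(poses, weights, errors):
    """Pick the seed k* with the lowest total error over all frames.

    poses:   (K, T, J, 3) candidate 3D poses P_t^k   (assumed layout)
    weights: (K, T)       per-seed values w_t^k      (assumed layout)
    errors:  (K, T)       per-frame errors e_t^k
    """
    total_error = errors.sum(axis=1)        # sum_t e_t^k, shape (K,)
    k_star = int(np.argmin(total_error))    # k* = argmin_k sum_t e_t^k
    return k_star, poses[k_star], weights[k_star]

# Toy data: 3 seeds, 4 frames, 2 joints (values are made up).
rng = np.random.default_rng(0)
poses = rng.normal(size=(3, 4, 2, 3))
weights = rng.uniform(size=(3, 4))
errors = np.array([[0.5, 0.4, 0.6, 0.5],
                   [0.2, 0.1, 0.3, 0.2],
                   [0.9, 0.1, 0.1, 0.1]])

k_star, best_poses, best_weights = select_best_seed(poses, weights, errors)
```

Because the error is summed over $t$ before the argmin, one seed is chosen for the whole sequence rather than frame by frame. In the toy data, the third seed has the lowest error on two of the four frames but still loses on the total.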

Loss Terms:

\[L_{2D} = \| \hat{p}_t - p_t \|\]
\[L_{\text{smooth2D}} = \| \hat{p}_t - \hat{p}_{t-1} \|\]
\[L_{\text{smooth3D}} = \| \hat{P}_t - \hat{P}_{t-1} \|\]
\[L_{3D} = \| \hat{P}_t - P_t^* \|\]
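As a rough sketch, the four terms can be computed with NumPy as below. Several choices here are assumptions rather than details from the paper: the L2 norm per frame, averaging over frames, the `(T, J, D)` array layout, and the names `frame_norms`, `pose_losses`, and `P_ref` (standing in for $P_t^*$). How the terms are weighted and combined is not stated in these notes, so they are returned separately.

```python
import numpy as np

def frame_norms(a, b):
    """L2 norm of (a - b) for each frame; inputs have shape (T, J, D)."""
    diff = (a - b).reshape(a.shape[0], -1)
    return np.linalg.norm(diff, axis=1)

def pose_losses(p_hat, p_obs, P_hat, P_ref):
    """Return the four loss terms, each averaged over frames (an assumption).

    p_hat: (T, J, 2) 2D poses \hat{p}_t      p_obs: (T, J, 2) 2D keypoints p_t
    P_hat: (T, J, 3) 3D poses \hat{P}_t      P_ref: (T, J, 3) stand-in for P_t^*
    """
    return {
        "L_2D": frame_norms(p_hat, p_obs).mean(),
        "L_smooth2D": frame_norms(p_hat[1:], p_hat[:-1]).mean(),
        "L_smooth3D": frame_norms(P_hat[1:], P_hat[:-1]).mean(),
        "L_3D": frame_norms(P_hat, P_ref).mean(),
    }

# Toy data: 3 frames, 1 joint (values are made up).
p_hat = np.array([[[0.0, 0.0]], [[3.0, 4.0]], [[3.0, 4.0]]])
p_obs = np.zeros((3, 1, 2))
P_hat = np.zeros((3, 1, 3))
P_ref = np.ones((3, 1, 3))

losses = pose_losses(p_hat, p_obs, P_hat, P_ref)
```

The two smoothness terms compare each frame with its predecessor, so they are defined on `T - 1` frame pairs rather than `T` frames.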

This strategy enables unsupervised, temporally coherent 3D pose reconstruction from 2D keypoints.


🔹 Section 2.3 — Body Part Movement Recognition

Each body part $e \in E$ is associated with its own LSTM model to recognize basic motion types over time.

\[\left\{ \left\{ \hat{p}_t^j \right\}_{j \in J_e} \right\}_{t=0}^{T-1}\]
\[\left\{ \hat{y}_t^e \right\}_{t=0}^{T-1}\]
\[L_{\text{BCE}}^e = \text{BCE}\left( \left\{ \hat{y}_t^e \right\}, \left\{ y_t^e \right\} \right)\]

This component models the localized movement patterns across different body regions in a time-aware manner.
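The per-part loss $L_{\text{BCE}}^e$ can be illustrated with a small NumPy sketch. The LSTM itself is omitted; the predictions $\hat{y}_t^e$ are taken as given. The body part names, the `(T, C)` layout (frames by motion types), and averaging over all entries are assumptions made for the example.

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-7):
    """Mean binary cross-entropy between predictions in (0, 1) and 0/1 labels."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

# Hypothetical body parts, each with predictions and labels of shape (T, C).
labels = np.array([[1.0, 0.0]] * 4)
part_preds = {
    "left_arm": np.full((4, 2), 0.5),              # uninformative predictions
    "torso": np.array([[0.9, 0.1], [0.8, 0.2]] * 2),  # mostly correct
}
part_labels = {"left_arm": labels, "torso": labels}

# One loss per body part e, matching the one-model-per-part setup.
part_losses = {e: bce_loss(part_preds[e], part_labels[e]) for e in part_preds}
```

With all predictions at 0.5, the loss equals $\ln 2$, which is a useful baseline when checking whether a body part model has learned anything.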


🔹 Section 2.4 — Dance Genre Recognition

Finally, the predicted motion vectors from all body parts are concatenated and fed into a separate LSTM for genre classification.

\[\left\{ \left\{ \hat{y}_t^e \right\}_{t=0}^{T-1} \right\}_{e \in E}\]
\[\hat{g} = \text{Softmax}(W h_T + b)\]
\[L_{\text{genre}} = -\sum_{k=1}^K g_k \log(\hat{g}_k)\]

This fusion stage captures the global movement semantics needed to infer the genre label from distributed joint dynamics.
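The fusion and classification step can be sketched in NumPy as follows. The genre LSTM is not implemented here: `h_T` is a random stand-in for its final hidden state, and the body part names, dimensions, and zero-initialized `W` and `b` are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def genre_loss(g_hat, g, eps=1e-12):
    """Cross-entropy -sum_k g_k log(g_hat_k) for a one-hot label g."""
    return float(-np.sum(g * np.log(g_hat + eps)))

# Per-part motion predictions, shape (T, C_e); names and sizes are made up.
part_preds = {"arm": np.zeros((5, 4)), "leg": np.zeros((5, 3))}

# Concatenate along the feature axis to form the genre LSTM input, shape (T, 7).
lstm_input = np.concatenate(list(part_preds.values()), axis=1)

# Stand-in for the LSTM's final hidden state h_T (hidden size 8, 3 genres).
rng = np.random.default_rng(0)
h_T = rng.normal(size=8)
W = np.zeros((3, 8))
b = np.zeros(3)

g_hat = softmax(W @ h_T + b)      # uniform here, since W and b are zero
g = np.array([0.0, 1.0, 0.0])     # one-hot ground-truth genre
loss = genre_loss(g_hat, g)
```

With an untrained (zero) classifier the prediction is uniform over the $K$ genres, so the loss is $\ln K$; here that is $\ln 3$.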


✅ Key Takeaways


💭 Reflections

This section offers a technically elegant solution to an otherwise annotation-heavy problem.
By leveraging weak priors and smoothness constraints, the method produces coherent 3D pose sequences from 2D data.
I'm particularly impressed by the modularity: each phase (pose, motion, genre) is optimized independently, yet each feeds directly into the next.
I plan to examine the 3D lifting in more detail later and possibly attempt a lightweight re-implementation using custom dance clips.