✅ Day 5 – From Paper to Practice: My First Experiment
After reviewing the HDVR (Hierarchical Dance Video Recognition) paper in depth,
I wanted to start building a simplified version of the pipeline using real pose data and my own code.
The original paper relies on 3D lifting and motion segmentation, but I decided to begin with the 2D keypoint side: extracting features like joint distances, angles, and velocities from pose sequences and visualizing motion.
🛠️ What I Built – 2D_Pose_Feature_Builder.ipynb
🎯 Purpose
- Reproduce the lower half of the HDVR pipeline, focusing on body-part movement from pose sequences.
- Lay groundwork for temporal modeling (LSTM/TCN) using only pose-derived features.
📂 Functionality
- Input: a .json or .csv file of 2D pose keypoints (from MediaPipe or OpenPose)
- For each frame (see the feature sketch after this list):
- Compute joint-to-joint distances (e.g., wrist to elbow)
- Calculate joint angles (e.g., elbow angle from shoulder–elbow–wrist)
- Optionally compute velocity over time (joint displacement)
- Visualize (see the plotting sketch after this list):
- Skeleton overlay by frame
- Feature value sequences (time-series)
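
Here's a minimal sketch of the per-frame feature math, assuming the keypoints are already loaded into a (T, J, 2) NumPy array and that the joint indices follow MediaPipe Pose conventions (the file path and index choices are placeholders, not fixed parts of the pipeline):

```python
import numpy as np

# Assumed layout: keypoints is a (T, J, 2) array of (x, y) positions per frame,
# e.g. loaded from a MediaPipe or OpenPose export. Joint indices follow
# MediaPipe Pose (11 = left shoulder, 13 = left elbow, 15 = left wrist);
# adjust them if your export uses a different skeleton.
L_SHOULDER, L_ELBOW, L_WRIST = 11, 13, 15

def joint_distance(kp, a, b):
    """Per-frame Euclidean distance between joints a and b."""
    return np.linalg.norm(kp[:, a] - kp[:, b], axis=-1)

def joint_angle(kp, a, b, c):
    """Per-frame angle (radians) at joint b formed by segments b->a and b->c."""
    v1 = kp[:, a] - kp[:, b]
    v2 = kp[:, c] - kp[:, b]
    cos = np.sum(v1 * v2, axis=-1) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1) + 1e-8
    )
    return np.arccos(np.clip(cos, -1.0, 1.0))

def joint_velocity(kp, fps=30.0):
    """Per-frame joint displacement magnitude, scaled to units per second."""
    disp = np.diff(kp, axis=0)                   # (T-1, J, 2)
    speed = np.linalg.norm(disp, axis=-1) * fps  # (T-1, J)
    return np.vstack([speed[:1], speed])         # pad back to length T

# Example: build a small feature matrix for one sequence
keypoints = np.load("pose_sequence.npy")         # placeholder path
features = np.column_stack([
    joint_distance(keypoints, L_WRIST, L_ELBOW),
    joint_angle(keypoints, L_SHOULDER, L_ELBOW, L_WRIST),
    joint_velocity(keypoints)[:, L_WRIST],
])                                               # shape (T, 3)
```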
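And a quick plotting sketch for the two visualizations, continuing from the arrays above (the edge list here only covers one arm; a full MediaPipe/OpenPose skeleton would enumerate every connection):

```python
import matplotlib.pyplot as plt

# Hypothetical edge list for a single arm; extend with the full skeleton as needed.
EDGES = [(L_SHOULDER, L_ELBOW), (L_ELBOW, L_WRIST)]

def plot_skeleton(kp_frame, ax):
    """Overlay one frame's keypoints and bones on a matplotlib axis."""
    ax.scatter(kp_frame[:, 0], kp_frame[:, 1], s=10)
    for a, b in EDGES:
        ax.plot([kp_frame[a, 0], kp_frame[b, 0]],
                [kp_frame[a, 1], kp_frame[b, 1]], "r-")
    ax.invert_yaxis()          # image coordinates: y grows downward
    ax.set_aspect("equal")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_skeleton(keypoints[0], ax1)               # first frame of the sequence
ax1.set_title("Skeleton overlay (frame 0)")
ax2.plot(features[:, 1], label="elbow angle (rad)")
ax2.plot(features[:, 2], label="wrist speed")
ax2.set_xlabel("frame")
ax2.legend()
ax2.set_title("Feature time-series")
plt.tight_layout()
plt.show()
```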
💡 Why This Matters
This experiment helped me:
- Understand which joints are stable vs. noisy
- See that even simple handcrafted features (distances, angles) encode a lot of motion semantics
- Identify where future smoothing or filtering would help
- Confirm that pose-only pipelines are viable for lightweight modeling
It also helped confirm that a full 3D lifting module isn't strictly necessary for building a useful system, especially in constrained or real-time environments.
📊 My Model Setup (So Far)
No learning model yet – this was a feature engineering stage only.
But these outputs will soon be fed into a temporal model like:
- LSTM / GRU for frame-level genre classification (see the rough sketch after this list)
- TCN for learning local body part movement patterns
- Possibly a Transformer-based encoder with attention over joint importance
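
Nothing is trained yet, but to make the first option concrete, here's a rough PyTorch sketch of the kind of sequence classifier these features could feed into (the framework, layer sizes, and two-class setup are my own assumptions, not from the HDVR paper):

```python
import torch
import torch.nn as nn

class PoseSequenceClassifier(nn.Module):
    """LSTM over per-frame pose features -> sequence-level genre logits."""
    def __init__(self, n_features=3, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, frames, n_features)
        _, (h_n, _) = self.lstm(x)   # h_n: (1, batch, hidden)
        return self.head(h_n[-1])    # (batch, n_classes)

# Smoke test on a dummy batch shaped like the features built earlier
model = PoseSequenceClassifier(n_features=3, n_classes=2)
dummy = torch.randn(4, 120, 3)       # 4 clips, 120 frames, 3 features each
print(model(dummy).shape)            # torch.Size([4, 2])
```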
🔭 Next Steps (aka Day 6 Plan)
- Integrate pose segmentation: split long videos into movement segments (a naive windowing sketch follows this list)
- Build a sequence classifier for genre (e.g., hip-hop vs. waacking)
- Try movement encoding from multiple dancers simultaneously
- Optionally explore 3D lifting with mock or learned constraints
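
As a stop-gap before real motion segmentation, I'll probably start with naive fixed-size windows over the Day 5 feature matrix; something like this (window and stride values are arbitrary placeholders):

```python
import numpy as np

def sliding_windows(features, win=60, stride=30):
    """Split a long (T, F) feature sequence into overlapping segments.

    A stand-in for real motion segmentation: fixed-size windows until I add
    change-point or energy-based boundaries.
    """
    return np.stack([
        features[start:start + win]
        for start in range(0, len(features) - win + 1, stride)
    ])

segments = sliding_windows(features)   # (n_segments, 60, n_features)
```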
📝 Reflection
Implementing this part manually gave me a better grasp of why the HDVR paper’s hierarchical structure makes sense.
Rather than depending on raw pixel data or supervised 3D annotation, I can now construct explainable, modular pipelines based entirely on pose motion.
This also lays the foundation for a real-time dance feedback system, or downstream applications in fitness, rehab, or choreography assistance.