Daily Study Log [2025-09-23]
Submitted an assignment, explored a new healthcare idea, studied the Vision Transformer paper (Day 1), and advanced the dance-beat alignment pipeline.
Assignment Submission
- Submitted a business plan project.
- Fortunately, the assignment included an "idea" section, so one of the prepared ideas could be reused.
Idea
- Proposed Smoke-Free Journey.
- Concept: Enhance smoking cessation by embedding supportive interventions into daily travel routines.
Paper Study: Vision Transformer (ViT)
- Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021)
- Day 1: Abstract & Introduction (Summary)
  - CNNs have dominated computer vision for about a decade. Their built-in inductive biases (locality, translation equivariance) help on smaller datasets but become less of an advantage as data and models scale.
  - Transformers in NLP scale efficiently and achieve state-of-the-art results.
  - Core question: Can a standard Transformer be applied directly to images, without convolutions?
  - Core idea: Treat an image as a sequence of flattened patches, linearly embed them, and feed them into a standard Transformer with a [CLS] token and positional embeddings.
  - Contributions:
    - Introduced Vision Transformer (ViT), a pure Transformer applied to image patches for large-scale image recognition.
    - Showed ViT rivals or surpasses state-of-the-art CNNs when pre-trained on very large datasets (e.g., JFT-300M).
    - Showed that performance keeps improving with more data and larger models, suggesting the scaling behavior seen in NLP carries over to vision.
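As a rough illustration of the core idea noted above, the following dependency-free sketch splits a toy image into flattened patches and counts the resulting tokens. The helper names `patchify` and `num_tokens` are hypothetical and not taken from the paper's code; the learned linear projection and positional embeddings that ViT applies afterward are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns (H / P) * (W / P) vectors, each of length P * P * C,
    ordered row by row across the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's C channel values
            patches.append(patch)
    return patches


def num_tokens(height, width, patch_size):
    """Sequence length seen by the Transformer: one token per patch plus [CLS]."""
    return (height // patch_size) * (width // patch_size) + 1


# Toy 4x4 single-channel image whose pixel value equals its raster index.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), patches[0])   # 4 patches; the first is [0, 1, 4, 5]
print(num_tokens(224, 224, 16))   # 14 * 14 + 1 = 197
```

For a 224x224 input with 16x16 patches, this gives 196 patch tokens plus one [CLS] token, which is why the patch size directly controls the sequence length.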
Paper Study: Dance Beat Alignment
- Built and tested a skeleton + beat alignment pipeline.
- Methods implemented:
  - Extracted skeletons from both real and generated videos using MediaPipe Pose.
  - Used librosa to detect tempo and beat times, then overlaid them on the skeleton trajectories.
  - Implemented a BPM sweep (50-150 BPM): generated error curves and identified the optimal BPM (the minimum-error point).
  - Defined an alignment score based on the mean and standard deviation of the error.
  - Validated the metric by comparing original vs. tempo-shifted music (128 BPM).
- Progress Today:
  - Pipeline fully established (skeleton extraction + beat overlay + BPM sweep).
  - Error metric defined and partially validated.
- Next Steps:
  - Apply the BPM sweep to multiple generated dance videos.
  - Compare "on-beat" vs. "off-beat" generated videos directly.
  - Scale up to ~100 generated videos for benchmark construction.
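The BPM sweep and alignment score described above could look roughly like the sketch below. It rests on assumptions the log does not state: motion has already been reduced to a list of event times in seconds (for example, local minima of joint speed), the beat grid starts at t = 0, and error is the distance to the nearest beat in units of the beat period. `beat_errors`, `bpm_sweep`, and `alignment_score` are hypothetical names, and the score formula is only one possible way to combine mean and standard deviation.

```python
from statistics import mean, pstdev


def beat_errors(event_times, bpm, offset=0.0):
    """Distance from each motion event to the nearest beat of a regular grid,
    as a fraction of the beat period (0 = on the beat, 0.5 = halfway between beats)."""
    period = 60.0 / bpm
    errors = []
    for t in event_times:
        phase = (t - offset) / period
        errors.append(abs(phase - round(phase)))
    return errors


def bpm_sweep(event_times, bpm_min=50, bpm_max=150, step=1):
    """Return (best_bpm, curve), where curve maps each candidate BPM to its mean error."""
    curve = {
        bpm: mean(beat_errors(event_times, bpm))
        for bpm in range(bpm_min, bpm_max + 1, step)
    }
    best_bpm = min(curve, key=curve.get)  # minimum point of the error curve
    return best_bpm, curve


def alignment_score(errors):
    """Assumed score in [0, 1]: penalizes both the mean error and its spread."""
    return max(0.0, 1.0 - 2.0 * (mean(errors) + pstdev(errors)))


# Synthetic motion events placed exactly on a 100 BPM grid (one every 0.6 s).
events = [i * 0.6 for i in range(20)]
best_bpm, curve = bpm_sweep(events)
print(best_bpm)  # the sweep recovers 100

# Rough analogue of the tempo-shift check: the same motion scored against 128 BPM.
print(alignment_score(beat_errors(events, 100)))
print(alignment_score(beat_errors(events, 128)))  # noticeably lower
```

Measuring error as a fraction of the beat period, rather than in seconds, keeps faster tempos from looking better simply because a denser grid always has a beat nearby.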
TL;DR
- Submitted business plan assignment using a prepared idea
- New idea: Smoke-Free Journey for healthcare + daily life
- ViT study (Day 1): Patch-based Transformer rivaling CNNs
- Dance-beat alignment pipeline completed, ready for large-scale testing