Today, I completed and summarized my reproduction experiments for EfficientNet’s compound scaling strategy.
By testing base, depth-only, width-only, and compound-scaled models on CIFAR-10, I confirmed that compound scaling consistently delivers the highest validation accuracy and the lowest loss, even though the gap is small given the dataset's simplicity.
| Model | FLOPs (MMac) | Params (M) | Val Acc (Best) | Val Loss (Lowest) |
|---|---|---|---|---|
| Base (B0) | 408.93 | 4.02 | 93.82% | 0.2045 |
| Depth-only | 533.91 | 4.02 | 93.85% | 0.1951 |
| Width-only | 578.40 | 4.02 | 93.69% | 0.1936 |
| Compound | 838.07 | 4.02 | 93.98% | 0.1924 |
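For reference on the FLOPs column: MMac (millions of multiply-accumulates) is the unit printed by counters like `ptflops`. Here's a minimal sketch of that kind of measurement; both the tool and the 224×224 input resolution are assumptions, not necessarily what produced the numbers above.

```python
# Hedged sketch: counting FLOPs/params for the base model with ptflops.
# The actual tool and input size behind the table may differ.
import torchvision.models as models
from ptflops import get_model_complexity_info

model = models.efficientnet_b0(num_classes=10)  # CIFAR-10 classification head
macs, params = get_model_complexity_info(
    model, (3, 224, 224), as_strings=True, print_per_layer_stat=False
)
print(f"FLOPs: {macs}, Params: {params}")  # strings like "408.93 MMac", "4.02 M"
```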
📌 While the differences were small, compound scaling still showed the best overall performance.
I expect the gap to widen on more complex datasets like CIFAR-100 or TinyImageNet.
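As a quick refresher on what "compound" means here: the paper couples depth, width, and input resolution through a single coefficient φ, using grid-searched constants α=1.2, β=1.1, γ=1.15 chosen so that FLOPs roughly double per unit of φ. A minimal sketch with those published values (the exact multipliers behind my runs may differ):

```python
# Compound scaling per the EfficientNet paper: scale all three axes together.
alpha, beta, gamma = 1.2, 1.1, 1.15  # paper's grid-searched base constants
phi = 1                              # compound coefficient (phi=1 ~ B1)

depth_mult = alpha ** phi   # multiplier on layer count
width_mult = beta ** phi    # multiplier on channel count
res_mult = gamma ** phi     # multiplier on input resolution

# FLOPs grow roughly as (alpha * beta**2 * gamma**2) ** phi ≈ 2 ** phi
print(alpha * beta**2 * gamma**2)  # ≈ 1.92, i.e. close to 2x per unit phi
```

The depth-only and width-only variants above scale just one of these factors, which is why their FLOPs land between the base and compound models.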
Visualizations and training logs are available in my GitHub repo under `Paper-Review/vision/02_efficientnet/`.
Starting today, I’m moving on to a new topic:
Pose-based Action Recognition, especially for dance genre classification (e.g., hip-hop, waacking).
I believe this shift from model scaling to skeleton-based temporal modeling will give me practical insight into human-centric vision, especially for motion and genre classification.
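To make that concrete, here's a minimal, hypothetical sketch of the skeleton-based pipeline I have in mind: each clip becomes a sequence of 2D keypoints (e.g., 17 COCO joints per frame) that a recurrent model maps to a genre label. Every name and size below is a placeholder, not the paper's architecture.

```python
# Hypothetical baseline: 2D keypoint sequences -> LSTM -> dance-genre logits.
import torch
import torch.nn as nn

class PoseGenreClassifier(nn.Module):
    def __init__(self, num_joints=17, hidden=128, num_genres=4):
        super().__init__()
        self.lstm = nn.LSTM(num_joints * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_genres)

    def forward(self, keypoints):              # (batch, frames, joints, 2)
        b, t, j, c = keypoints.shape
        x = keypoints.reshape(b, t, j * c)     # flatten joints per frame
        _, (h, _) = self.lstm(x)               # last hidden state summarizes the clip
        return self.head(h[-1])                # genre logits

logits = PoseGenreClassifier()(torch.randn(2, 64, 17, 2))  # 2 clips, 64 frames each
print(logits.shape)  # torch.Size([2, 4])
```

Graph-based models like ST-GCN are the stronger skeleton baselines, but an LSTM like this is the simplest end-to-end shape check before committing to an architecture.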
📍 Wrapped up my EfficientNet scaling experiments (base vs depth vs width vs compound)
📍 Confirmed compound scaling performs best (even on CIFAR-10)
📍 Ready to explore new tasks: Pose-based Action Recognition using keypoints
📍 Next paper: Action Recognition using Pose Estimation (2019)
Stay tuned for pose modeling, temporal sequence classification, and experiments with dance video data!