🧠 Daily Study Log [2025-06-19]
Today was all about combining hands-on computer vision tasks with core CNN theory.
I trained a segmentation model on real urban scenes and revisited a sign language classification task to solidify my understanding of CNN fundamentals.
🏙️ Image Segmentation: Urban Scene Understanding
- Objective: Assign semantic labels (road, sky, building, etc.) to every pixel in an image, which is essential for self-driving perception.
- Dataset: Cityscapes (`_leftImg8bit.png` images, `_gtFine_labelIds.png` masks)
- Model: `Unet(resnet34)` from `segmentation_models_pytorch` (sketched in the snippet after this list)
- Input Size: Resized to 256x256
- Loss Function: `CrossEntropyLoss()` with `classes=34`
- Training: 5 epochs; loss steadily decreased
- Result: Predicted masks matched the ground truth closely; roads, sky, cars, and buildings were segmented accurately
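For reference, a minimal sketch of this setup; the optimizer, learning rate, and the `train_step` helper are assumptions for illustration, not the exact training code I ran:

```python
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp

# U-Net with a ResNet-34 encoder and 34 output channels (one per Cityscapes label ID)
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",   # assumption: ImageNet-pretrained encoder
    in_channels=3,
    classes=34,
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer/lr are assumptions

def train_step(images, masks):
    """images: (N, 3, 256, 256) float tensor; masks: (N, 256, 256) long tensor of label IDs."""
    optimizer.zero_grad()
    logits = model(images)            # (N, 34, 256, 256)
    loss = criterion(logits, masks)   # raw logits + integer class masks
    loss.backward()
    optimizer.step()
    return loss.item()
```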
✅ Takeaways
- I made a mistake in setting the number of output classes, which helped me understand how `CrossEntropyLoss()` behaves (see the small example after this list); a valuable learning moment.
- Semantic segmentation gives visual feedback, which made model performance much more intuitive to grasp.
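To make that concrete: `CrossEntropyLoss()` expects logits with one channel per class and integer targets in `[0, C-1]`, so a model built with too few output classes fails as soon as a higher label ID shows up. Toy tensors below, not the real data:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

mask = torch.full((1, 4, 4), 33, dtype=torch.long)   # every pixel labeled with class ID 33

logits_ok = torch.randn(1, 34, 4, 4)                 # 34 channels: IDs 0..33 are valid targets
print(criterion(logits_ok, mask))                    # computes a loss normally

logits_too_few = torch.randn(1, 20, 4, 4)            # only 20 channels: ID 33 is out of range
try:
    criterion(logits_too_few, mask)
except Exception as e:                               # raises an out-of-bounds target error on CPU
    print(type(e).__name__, e)
```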
🧪 CNN Basics Recap: Sign Language Classifier
- Task: Classify images of hand gestures into 10 sign language categories
- Model: Simple CNN (Conv → ReLU → MaxPool → Dense), sketched after this list
- Input: RGB images resized to 224x224
- Loss: Categorical CrossEntropy
- Augmentation: Horizontal flips and random rotations using `ImageDataGenerator`
- Performance Check: Printed classification report and plotted training accuracy
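A minimal sketch of this classifier, assuming the Keras `ImageDataGenerator`; the filter counts, rotation range, and data directory are placeholder assumptions:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Simple Conv -> ReLU -> MaxPool -> Dense stack; layer widths are assumptions
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),   # 10 sign language categories
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Horizontal flips + random rotations, as in the log (rotation range is an assumption)
datagen = ImageDataGenerator(rescale=1.0 / 255, horizontal_flip=True, rotation_range=15)
# train_gen = datagen.flow_from_directory("data/train", target_size=(224, 224),
#                                         class_mode="categorical", batch_size=32)
# model.fit(train_gen, epochs=10)
```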
✅ Reflections
- The simple CNN worked decently and was great for reviewing layer flow and preprocessing steps.
- Reinforced the importance of proper image labeling and class balancing.
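A quick way to check both points (the labels below are random placeholders so the snippet is self-contained; real `y_true`/`y_pred` would come from the validation set):

```python
from collections import Counter

import numpy as np
from sklearn.metrics import classification_report

# Placeholder labels for illustration only
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=200)
y_pred = rng.integers(0, 10, size=200)

print(Counter(y_true.tolist()))                          # class balance at a glance
print(classification_report(y_true, y_pred, digits=3))   # per-class precision/recall/F1
```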
🎯 Next Steps
- Try advanced architectures like SegFormer and DeepLab v3+
- Evaluate using mIoU, pixel accuracy, and visualization tools (a small mIoU sketch follows this list)
- Build a proper inference pipeline and prepare for deployment
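As prep for the evaluation step, a small NumPy sketch of pixel accuracy and mean IoU computed from a confusion matrix (random label maps here, purely for illustration):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """pred/target: integer label maps of the same shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Rows = ground-truth class, columns = predicted class
    np.add.at(conf, (target.ravel(), pred.ravel()), 1)
    intersection = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - intersection
    ious = intersection / np.maximum(union, 1)
    return float(ious[union > 0].mean())      # ignore classes absent from both maps

# Random 34-class label maps, just to exercise the function
rng = np.random.default_rng(0)
pred = rng.integers(0, 34, size=(256, 256))
target = rng.integers(0, 34, size=(256, 256))

pixel_acc = (pred == target).mean()           # pixel accuracy for the same pair
print(pixel_acc, mean_iou(pred, target, num_classes=34))
```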
✅ TL;DR
- U-Net: Trained on urban scene segmentation (Cityscapes)
- CNN: Refreshed fundamentals with the sign language dataset
- Next: Test SegFormer + evaluate with mIoU