Daily Study Log [2025-06-28]
Today was a blend of competition modeling, CV theory reinforcement, and long-term dataset preparation.
I mainly focused on feature engineering in a tabular campaign dataset, studied core cross-validation strategies, and continued data wrangling for a renewable energy forecasting competition.
SCU_Competition – Final Optimization Phase
Focus: Maximizing AUC score on a marketing campaign acceptance prediction task
Model: `LGBMClassifier`, tuned with Optuna and `RandomizedSearchCV`
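A minimal Optuna tuning sketch, assuming a binary acceptance target; the search space, trial count, and the `make_classification` placeholder data are illustrative rather than the actual competition setup.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the campaign features/target
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=42)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 63),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 60),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    model = LGBMClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # Mean stratified 5-fold AUC is the quantity Optuna maximizes
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_value, study.best_params)
```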
Feature Strategy:
- Carefully selected cluster-based features (from KMeans on income, spending, visits)
- Simplified to the top engineered features (purchase sum, wine ratio, web×campaign); a small feature sketch follows below
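A minimal sketch of the cluster-label feature, assuming hypothetical column names (`income`, `total_spending`, `web_visits`), a toy DataFrame in place of the real campaign data, and an illustrative cluster count.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the campaign DataFrame; real column names will differ
df = pd.DataFrame({
    "income": [35000, 72000, 54000, 91000, 28000, 63000],
    "total_spending": [220, 1450, 600, 1800, 150, 900],
    "web_visits": [7, 2, 5, 1, 9, 4],
})

# Scale the behavioural columns, then attach the KMeans label as a new feature
cluster_cols = ["income", "total_spending", "web_visits"]
scaled = StandardScaler().fit_transform(df[cluster_cols])
df["customer_cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)
print(df)
```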
Ensemble Attempt:
- Soft voting of top models (11, 9, 23)
- Weighted voting (7:3 ratio) outperformed the standard ensemble (voting sketch below)
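A sketch of the weighted soft vote, assuming placeholder estimators (LightGBM plus logistic regression) and toy data; only the soft-voting mechanics and the 7:3 weights come from the log, not the actual top models.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Soft voting averages predict_proba; weights=[7, 3] applies the 7:3 ratio
vote = VotingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier(random_state=42)),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
    weights=[7, 3],
)
vote.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, vote.predict_proba(X_te)[:, 1]))
```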
Takeaways
- Found that overly complex stacking and SHAP-based feature filtering degraded generalization
- Simpler, well-tuned models performed better, especially with clean features and cluster-based features
Cross-Validation (CV) Theory Review
- Reviewed `StratifiedKFold`, `GroupKFold`, and `TimeSeriesSplit`
- Ran multiple `cross_val_score()` tests with different seeds and splits
- Analyzed the gap between local CV AUC and the Kaggle leaderboard AUC (seed-comparison sketch below)
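A minimal version of the seed experiment, on placeholder data: the same model is scored with `StratifiedKFold` under several shuffle seeds so the fold-to-fold spread is visible, not just the mean.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=0)
model = LGBMClassifier(random_state=42)

# Same model, different shuffle seeds: the spread across seeds/folds matters
# more than any single mean AUC
for seed in (0, 1, 42):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"seed={seed}  mean={scores.mean():.4f}  std={scores.std():.4f}")
```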
Reflections
- Reinforced that the Kaggle public leaderboard set ≠ my CV folds
- CV is not about getting a high score; it's about stability and generalization
Data Collection – Renewable Energy Forecasting
Target: Predict energy generation per region and energy source
Data: five years (2019–2023) of monthly generation data from KEPCO and KPX
Work:
- Combined multiple Excel sources into unified tables
- Processed generation amount, capacity, and aging factors
- Structured the final format: (region, source, month) → generation (merge sketch below)
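A rough sketch of the merge step; the folder layout, file naming, and column names (e.g. `generation_mwh`) are assumptions standing in for the actual KEPCO/KPX exports.

```python
from pathlib import Path

import pandas as pd

# Assumed layout: one Excel export per source/year under data/raw/
frames = []
for path in Path("data/raw").glob("*.xlsx"):
    df = pd.read_excel(path)
    df.columns = [str(c).strip() for c in df.columns]  # headers differ slightly across files
    frames.append(df)

raw = pd.concat(frames, ignore_index=True)

# Aggregate to the target shape: (region, source, month) -> generation
monthly = raw.groupby(["region", "source", "month"], as_index=False)["generation_mwh"].sum()
monthly.to_csv("data/processed/monthly_generation.csv", index=False)
```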
Progress
- Cleaned and merged 30+ CSVs
- Grouped by region and source so the data is ready for time-series modeling
Next Steps
SCU_Competition
- Try meta-feature stacking from top model outputs
- Build a KMeans cluster → acceptance-rate encoding feature (encoding sketch below)
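One way this encoding could look, sketched out-of-fold to avoid target leakage; `customer_cluster` and `accepted` are assumed column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def oof_cluster_accept_rate(df, cluster_col="customer_cluster", target_col="accepted", n_splits=5):
    """Encode each row's cluster with the acceptance rate computed on the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for tr_idx, va_idx in skf.split(df, df[target_col]):
        rates = df.iloc[tr_idx].groupby(cluster_col)[target_col].mean()
        encoded.iloc[va_idx] = df.iloc[va_idx][cluster_col].map(rates).to_numpy()
    # Clusters unseen in a training fold fall back to the global acceptance rate
    return encoded.fillna(df[target_col].mean())

# Usage (df being the campaign DataFrame with the cluster label attached):
# df["cluster_accept_rate"] = oof_cluster_accept_rate(df)
```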
CV Practice
- Try nested CV and review model selection pitfalls (nested CV sketch after this list)
- Visualize score variance across folds
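A minimal nested CV sketch on placeholder data: an inner `GridSearchCV` picks hyperparameters while an outer loop scores the whole selection procedure; the grid itself is illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)   # hyperparameter selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)   # unbiased performance estimate

search = GridSearchCV(
    LGBMClassifier(random_state=42),
    param_grid={"num_leaves": [15, 31], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=inner,
)
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"nested CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```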
Forecasting Competition
- Add weather & calendar features
- Test a basic `XGBRegressor` time-series model (baseline sketch below)
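A possible shape for that baseline, using calendar and lag features with `TimeSeriesSplit`; the single toy series and the feature choices are assumptions, whereas the real data would hold one series per (region, source).

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

# Toy single series; the real data has one series per (region, source)
months = pd.date_range("2019-01-01", "2023-12-01", freq="MS")
ts = pd.DataFrame({"month": months, "generation": range(len(months))})

ts["month_num"] = ts["month"].dt.month        # calendar feature
ts["year"] = ts["month"].dt.year
ts["lag_1"] = ts["generation"].shift(1)       # previous month
ts["lag_12"] = ts["generation"].shift(12)     # same month last year
ts = ts.dropna()

X = ts[["month_num", "year", "lag_1", "lag_12"]]
y = ts["generation"]

model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
for tr_idx, va_idx in TimeSeriesSplit(n_splits=4).split(X):
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    print("validation R^2:", model.score(X.iloc[va_idx], y.iloc[va_idx]))
```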
TL;DR
- SCU: Cluster features + weighted voting boosted AUC
- CV: Refined understanding of validation strategy gaps
- Energy: Merged 5 years of data → ready for modeling