Daily Study Log [2025-06-28]
Today was a blend of competition modeling, CV theory reinforcement, and long-term dataset preparation.
I mainly focused on feature engineering in a tabular campaign dataset, studied core cross-validation strategies, and continued data wrangling for a renewable energy forecasting competition.
SCU_Competition – Final Optimization Phase
Focus: Maximizing AUC score on a marketing campaign acceptance prediction task
Model: `LGBMClassifier`, tuned with Optuna and `RandomizedSearchCV`
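A minimal Optuna tuning sketch, assuming a binary acceptance target; the search space, trial count, and the `make_classification` placeholder data are illustrative rather than the actual competition setup.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the campaign features/target
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=42)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 63),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 60),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    model = LGBMClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # Mean stratified 5-fold AUC is the quantity Optuna maximizes
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_value, study.best_params)
```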
Feature Strategy:
- Carefully selected cluster-based features (from KMeans on income, spending, visits)
- Simplified to the top engineered features (purchase sum, wine ratio, web×campaign); a small feature sketch follows below
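A minimal sketch of the cluster-label feature, assuming hypothetical column names (`income`, `total_spending`, `web_visits`), a toy DataFrame in place of the real campaign data, and an illustrative cluster count.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the campaign DataFrame; real column names will differ
df = pd.DataFrame({
    "income": [35000, 72000, 54000, 91000, 28000, 63000],
    "total_spending": [220, 1450, 600, 1800, 150, 900],
    "web_visits": [7, 2, 5, 1, 9, 4],
})

# Scale the behavioural columns, then attach the KMeans label as a new feature
cluster_cols = ["income", "total_spending", "web_visits"]
scaled = StandardScaler().fit_transform(df[cluster_cols])
df["customer_cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)
print(df)
```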
Ensemble Attempt:
- Soft voting of top models (11, 9, 23)
- Weighted voting (7:3 ratio) outperformed the standard ensemble (voting sketch below)
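A sketch of the weighted soft vote, assuming placeholder estimators (LightGBM plus logistic regression) and toy data; only the soft-voting mechanics and the 7:3 weights come from the log, not the actual top models.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Soft voting averages predict_proba; weights=[7, 3] applies the 7:3 ratio
vote = VotingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier(random_state=42)),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
    weights=[7, 3],
)
vote.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, vote.predict_proba(X_te)[:, 1]))
```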
Takeaways
- Found that overly complex stacking and SHAP-based feature filtering degraded generalization
- Simpler, well-tuned models performed better, especially with clean features and cluster-based features
Cross-Validation (CV) Theory Review
- Reviewed `StratifiedKFold`, `GroupKFold`, and `TimeSeriesSplit`
- Ran multiple `cross_val_score()` tests with different seeds and splits
- Analyzed the gap between local CV AUC and the Kaggle leaderboard AUC (seed-comparison sketch below)
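A minimal version of the seed experiment, on placeholder data: the same model is scored with `StratifiedKFold` under several shuffle seeds so the fold-to-fold spread is visible, not just the mean.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=0)
model = LGBMClassifier(random_state=42)

# Same model, different shuffle seeds: the spread across seeds/folds matters
# more than any single mean AUC
for seed in (0, 1, 42):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"seed={seed}  mean={scores.mean():.4f}  std={scores.std():.4f}")
```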
Reflections
- Reinforced that the Kaggle public leaderboard set ≠ my CV folds
- CV is not about getting a high score; it's about stability and generalization
Data Collection – Renewable Energy Forecasting
Target: Predict energy generation per region and energy source
Data: five years (2019–2023) of monthly generation data from KEPCO and KPX
Work:
- Combined multiple Excel sources into unified tables
- Processed generation amount, capacity, and aging factors
- Structured the final format: (region, source, month) → generation (merge sketch below)
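A rough sketch of the merge step; the folder layout, file naming, and column names (e.g. `generation_mwh`) are assumptions standing in for the actual KEPCO/KPX exports.

```python
from pathlib import Path

import pandas as pd

# Assumed layout: one Excel export per source/year under data/raw/
frames = []
for path in Path("data/raw").glob("*.xlsx"):
    df = pd.read_excel(path)
    df.columns = [str(c).strip() for c in df.columns]  # headers differ slightly across files
    frames.append(df)

raw = pd.concat(frames, ignore_index=True)

# Aggregate to the target shape: (region, source, month) -> generation
monthly = raw.groupby(["region", "source", "month"], as_index=False)["generation_mwh"].sum()
monthly.to_csv("data/processed/monthly_generation.csv", index=False)
```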
Progress
- Cleaned and merged 30+ CSVs
- Grouped by region and source so the data is ready for time-series modeling
Next Steps
SCU_Competition
- Try meta-feature stacking from top model outputs
- Build a KMeans cluster → acceptance-rate encoding feature (encoding sketch below)
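One way this encoding could look, sketched out-of-fold to avoid target leakage; `customer_cluster` and `accepted` are assumed column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def oof_cluster_accept_rate(df, cluster_col="customer_cluster", target_col="accepted", n_splits=5):
    """Encode each row's cluster with the acceptance rate computed on the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for tr_idx, va_idx in skf.split(df, df[target_col]):
        rates = df.iloc[tr_idx].groupby(cluster_col)[target_col].mean()
        encoded.iloc[va_idx] = df.iloc[va_idx][cluster_col].map(rates).to_numpy()
    # Clusters unseen in a training fold fall back to the global acceptance rate
    return encoded.fillna(df[target_col].mean())

# Usage (df being the campaign DataFrame with the cluster label attached):
# df["cluster_accept_rate"] = oof_cluster_accept_rate(df)
```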
CV Practice
- Try nested CV and review model selection pitfalls (nested CV sketch after this list)
- Visualize score variance across folds
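A minimal nested CV sketch on placeholder data: an inner `GridSearchCV` picks hyperparameters while an outer loop scores the whole selection procedure; the grid itself is illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)   # hyperparameter selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)   # unbiased performance estimate

search = GridSearchCV(
    LGBMClassifier(random_state=42),
    param_grid={"num_leaves": [15, 31], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=inner,
)
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"nested CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```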
Forecasting Competition
- Add weather & calendar features
- Test a basic `XGBRegressor` time-series model (baseline sketch below)
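A possible shape for that baseline, using calendar and lag features with `TimeSeriesSplit`; the single toy series and the feature choices are assumptions, whereas the real data would hold one series per (region, source).

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

# Toy single series; the real data has one series per (region, source)
months = pd.date_range("2019-01-01", "2023-12-01", freq="MS")
ts = pd.DataFrame({"month": months, "generation": range(len(months))})

ts["month_num"] = ts["month"].dt.month        # calendar feature
ts["year"] = ts["month"].dt.year
ts["lag_1"] = ts["generation"].shift(1)       # previous month
ts["lag_12"] = ts["generation"].shift(12)     # same month last year
ts = ts.dropna()

X = ts[["month_num", "year", "lag_1", "lag_12"]]
y = ts["generation"]

model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
for tr_idx, va_idx in TimeSeriesSplit(n_splits=4).split(X):
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    print("validation R^2:", model.score(X.iloc[va_idx], y.iloc[va_idx]))
```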
TL;DR
- SCU: Cluster features + weighted voting boosted AUC
- CV: Refined understanding of validation strategy gaps
- Energy: Merged 5 years of data → ready for modeling