Table 1. Recipes developed during the refinement stage.

Recipe | Shorthand | Description |
---|---|---|
Feature Engineering 3 | FE3 | Predicts damage grade with all predictors. Removes explicit missingness encoded as numerical values in the building age variable and imputes it using other highly correlated predictors. ‘Others’ rare levels that contribute less than 5% of the dataset. Dummy-encodes nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors. |
Principal Component Analysis | PCA | Creates interaction terms between all predictors and then reduces dimensionality via principal component analysis. Also includes the steps featured in FE3. |
Tuning step_other | Other | Tunes the threshold for ‘othering’ a level in step_other. Also includes the steps featured in FE3. |
Earthquake: Damage Grade Prediction Executive Summary
Final Project
Data Science 3 with R (STAT 301-3)
Objective and Data
The project seeks to predict the damage grade of buildings following the April 2015 Gorkha earthquake in Nepal using the structural characteristics of the buildings. This is a multi-class classification problem that categorizes damage into minimal, medium/considerable, or near-complete levels via the target variable damage_grade. The source data were downsampled to 18,000 observations, with 6,000 observations in each damage grade category. The dataset included 39 predictors and was complete, with no missingness.
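The balanced downsampling described above can be sketched in plain Python (the project itself was carried out in R; the class counts here are toy stand-ins for the real data):

```python
import random
from collections import defaultdict

def balanced_downsample(rows, label_key, n_per_class, seed=0):
    """Draw n_per_class rows from each class without replacement,
    producing an equal number of observations per damage grade."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sample = []
    for cls in sorted(by_class):
        sample.extend(rng.sample(by_class[cls], n_per_class))
    return sample

# Hypothetical toy data: three damage grades with unequal counts.
rows = ([{"damage_grade": 1}] * 500 + [{"damage_grade": 2}] * 400
        + [{"damage_grade": 3}] * 300)
balanced = balanced_downsample(rows, "damage_grade", n_per_class=100)
```

In the project, the same idea yields 6,000 observations per grade, 18,000 in total.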
Methods Overview
Model development was split into an initial exploratory phase followed by further refinement of the best-performing models. V-fold cross-validation was used during tuning, with a resample set of 10 folds and 3 repeats stratified by damage grade.
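The resampling scheme, 10 stratified folds repeated 3 times, can be sketched as follows (a plain-Python illustration of the idea behind repeated, stratified v-fold cross-validation; the project itself used R's rsample):

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, test_indices) tuples where each fold
    preserves the class proportions of `labels` (stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            # Round-robin assignment spreads each class evenly over folds.
            for j, i in enumerate(shuffled):
                folds[j % n_splits].append(i)
        for k, test in enumerate(folds):
            yield rep, k, sorted(test)

# Toy labels: 60 observations per damage grade.
labels = [1] * 60 + [2] * 60 + [3] * 60
resamples = list(repeated_stratified_kfold(labels, n_splits=10, n_repeats=3))
```

With 10 folds and 3 repeats this produces 30 resamples, each holding out an equal share of every class.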
The macro-averaged F-measure (MAFS) was used as the performance metric because it averages the F1 score across classes, weighting each class equally, which makes it well suited to multiclass classification problems.
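As a concrete illustration (in Python, though the project used R's yardstick), the macro-averaged F-measure computes per-class F1 from a confusion matrix and takes the unweighted mean, so each damage grade counts equally regardless of its size:

```python
def macro_f1(conf):
    """Macro-averaged F1 from a square confusion matrix
    (rows = true class, columns = predicted class)."""
    k = len(conf)
    f1s = []
    for c in range(k):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(k)) - tp
        fn = sum(conf[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    # Unweighted mean over classes: the "macro" average.
    return sum(f1s) / k

score = macro_f1([[4, 1], [2, 3]])  # ≈ 0.697 for this toy matrix
```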
Initial Model Exploration
Recipes Used:
- Kitchen Sink
- Variable Selection
- Feature engineering 1
- Feature engineering 2
Model Types Tuned:
Each of the following model types was tuned using the recipes above. The Naive Bayes baseline model used the kitchen sink recipe, while the other tuned models used a mix of the variable selection and feature engineering recipes.
- Naive Bayes baseline model
- Boosted tree
- Multivariate adaptive regression splines
- Neural network
- Elastic net
- Random forest
- Support vector machine with polynomial kernel function
- Support vector machine with radial basis kernel function
- Ensemble with boosted tree, random forest, elastic net, and support vector machine member models
Model Performance:
Boosted tree, random forest, and multinomial regression models performed best and were moved to the next stage, as depicted in Figure 1. An ensemble model using these model types as members also performed well.
Model Refinement Stage
Model Types:
- Boosted tree with xgboost, lightgbm, and catboost engines
- Random forest with ranger engine
- Ensemble with boosted tree, random forest, and elastic net member models
Improved Feature Engineering:
Table 1 summarizes the new recipes developed during this stage; each model type was fitted with each of these recipes.
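The 'othering' step that the Other recipe tunes pools rare factor levels below a frequency threshold, the idea behind recipes::step_other in R. A minimal Python sketch, using a hypothetical roof-type column:

```python
from collections import Counter

def other_rare_levels(values, threshold=0.05):
    """Replace levels whose relative frequency falls below `threshold`
    with the pooled level "other"."""
    freq = Counter(values)
    n = len(values)
    rare = {lvl for lvl, count in freq.items() if count / n < threshold}
    return ["other" if v in rare else v for v in values]

# Hypothetical column: "bamboo" makes up only 2% of rows,
# so at a 5% threshold it is pooled into "other".
roofs = ["rcc"] * 90 + ["timber"] * 8 + ["bamboo"] * 2
collapsed = other_rare_levels(roofs, threshold=0.05)
```

Tuning the threshold trades off losing rare-level signal against reducing sparse dummy columns.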
Model Performance:
Table 2 shows the MAFS and average run times of the best-performing random forest and boosted tree models.
wflow_id | .metric | mean | std_err | average runtime (s) |
---|---|---|---|---|
rf_fe3 | f_meas | 0.6276903 | 0.0023860 | 1708.1116 |
bt_catboost_other | f_meas | 0.6269855 | 0.0022532 | 167.2114 |
The ensembles performed best with the Other recipe. Table 3 shows the best estimated performance of each member model type. This ensemble weighted the boosted tree models most heavily.
Model Type | mean | std_err |
---|---|---|
rf | 0.6323375 | 0.0025989 |
bt | 0.6248109 | 0.0025596 |
en | 0.6106990 | 0.0025337 |
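As a simplified illustration of how an ensemble combines its members, a weighted average of member class probabilities (soft voting) gives more say to the heavily weighted boosted tree. The actual stacked ensemble learns its member weights with a penalized meta-learner, and the weights and probabilities below are hypothetical:

```python
def blend_probs(member_probs, weights):
    """Weighted average of member class-probability vectors (soft voting)."""
    total = sum(weights)
    k = len(member_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, member_probs)) / total
            for c in range(k)]

# Hypothetical probabilities for one building from three members
# (random forest, boosted tree, elastic net), with the boosted tree
# weighted most heavily as in the final ensemble.
probs = blend_probs(
    [[0.5, 0.3, 0.2],   # rf
     [0.2, 0.6, 0.2],   # bt
     [0.4, 0.4, 0.2]],  # en
    weights=[1.0, 2.0, 0.5],
)
```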
Final Model Analysis
Considering the highest estimated MAFS achieved by models during refinement, the tuned ensemble model using the Other recipe was selected as the winning model. Table 4 shows the performance metrics of the final ensemble model making predictions on the testing data, which qualitatively indicate fairly good performance according to roc_auc but weaker performance according to f_meas.
metric | performance | runtime (s) |
---|---|---|
f_meas | 0.6413545 | 112.915 |
roc_auc | 0.8255142 | 112.915 |
Figure 2 displays the confusion matrix of this final model.
Conclusions and Future Directions
We found that while most variables included in this dataset are important for prediction accuracy, not all levels within each variable are important or relevant, suggesting that streamlining architectural classification could be useful for earthquake damage prediction.
Immediate next steps to expand upon these findings are to:
- Repeat the analysis on the remainder of the data that was unseen due to downsampling, or acquire greater computational resources to better incorporate the data we were given into our modeling pipeline.
- Determine what the obfuscated variable levels represent, which could inform improved feature engineering.
References
- Jules. (2022). Earthquake Richter Prediction Competition. Kaggle. https://kaggle.com/competitions/earthquake-richter-prediction-competition