Earthquake: Damage Grade Prediction
Final Project
Data Science 3 with R (STAT 301-3)
Introduction
This project focuses on multi-class classification of the damage sustained by buildings following a magnitude 7.8 earthquake. We use data sourced from the Earthquake Richter Prediction Competition hosted on Kaggle in December 2022 (Jules, 2022), and use the structural characteristics of the buildings in the data to predict the level of damage sustained (low, medium, high). This project provides insight into a relevant current event, given Taiwan's recent earthquake, and into particular architectural concerns related to earthquake resistance.
About the Target Variable
The target variable is damage_grade, a factor variable made up of 3 levels:

- 1: minimal damage
- 2: medium/considerable damage
- 3: near complete destruction

This variable measures the level of damage to buildings. Further analysis of damage_grade is shown below.
The target variable has no missing observations, but it does exhibit class imbalance: level 2 has significantly more observations than the other levels and level 1 significantly fewer.
Data overview
The original dataset has 156,360 observations and 40 variables. Each instance in the data represents one building hit by the April 2015 Gorkha earthquake in Nepal, and the columns contain structural and geographic information about the building. As it was provided as part of a Kaggle competition, the data contains no explicit missingness and offers a good variety of predictor types.
We performed a brief exploratory data analysis on a randomly sampled subset of the training data. Each variable was plotted against the damage grade outcome. Box plots were used for numeric predictors, and mosaic plots were used for nominal predictors.
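A minimal sketch of this EDA, assuming the sampled training data is in a data frame called train and using column names (age, foundation_type) as illustrative assumptions; the ggmosaic package supplies the mosaic geom:

```r
# Sketch of the EDA described above; `train` is an assumed object name.
library(tidyverse)
library(ggmosaic)

set.seed(301)
eda_sample <- slice_sample(train, prop = 0.10)  # random subset of the training data

# Box plot: a numeric predictor (building age) against damage grade
ggplot(eda_sample, aes(x = damage_grade, y = age)) +
  geom_boxplot() +
  labs(x = "Damage grade", y = "Building age (years)")

# Mosaic plot: a nominal predictor (foundation type) against damage grade
ggplot(eda_sample) +
  geom_mosaic(aes(x = product(foundation_type), fill = damage_grade)) +
  labs(x = "Foundation type", fill = "Damage grade")
```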
Exploratory data analysis showed that only the age of the building had outliers or noticeable errors in data collection. Figure 1 shows the distribution of age across the three damage grades. Observations of buildings being 1000 years old could either represent cultural heritage buildings or errors in data collection.
Additionally, there is noticeable variation in damage grade across foundation type, having adobe or mud superstructure, having mud mortar - stone superstructure, and normalized price of the building footprint (Figure 2). This exploration helped us understand our predictors better and informed initial stages of feature engineering and our initial decision to try feature selection through random forest variable importance.
Methods overview
Assessment metrics
F-measure
We elected to use the macro-averaged F-measure as our assessment metric. This metric describes the accuracy of the model by taking the harmonic mean of the sensitivity2 and precision3 of model predictions. The macro-averaged F-measure is the unweighted average of the per-class F-measures (the sum of F-measures across classes divided by the number of classes), making it suitable for multi-class prediction. This metric is useful when there is an imbalance between classes. It ranges from 0 to 1, with 1 being the most desirable.
2 sensitivity = true positives/all actual positives
3 precision = true positives/all predicted positives
An f_meas below 0.5 indicates that the model's performance is poor, due either to low precision and sensitivity or to an imbalance between the two. An f_meas of about 0.5 means the model is no better than random guessing, while an f_meas above 0.5 indicates better performance, with higher precision and sensitivity.
Macro-Averaged F1 Score = \(\frac{1}{n}\sum_{i=1}^{n}{F1\ Score_{i}}\)
For brevity, we will refer to this as the MAFS metric for the remainder of this report.
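For reference, the macro-averaged F-measure can be computed with the yardstick package; this sketch assumes a tibble preds holding the truth column damage_grade and hard class predictions .pred_class:

```r
# Sketch: macro-averaged F-measure with yardstick; `preds` is an assumed object.
library(yardstick)

f_meas(preds, truth = damage_grade, estimate = .pred_class, estimator = "macro")
```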
ROC_AUC
We also opted to use the average roc_auc as a second metric when performing model refinement with the top three models from the initial stage and when building ensemble models. We chose it due to our familiarity with the roc_auc metric and the requirement of using class probabilities when building ensemble models. roc_auc is based on the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate; the area under the curve (AUC) represents how well the model separates the classes. An average roc_auc of 0.5 means the model is no better than random guessing, and a value below 0.5 means it performs worse than random guessing. Values between 0.6 and 0.7 indicate poor to fair performance, 0.7 to 0.8 fair to good, and 0.8 to 0.9 good to very good, as the model is able to effectively distinguish between classes. An average roc_auc close to 1 indicates excellent performance.
Data Splitting
Due to the size of the dataset and class imbalance, we subsampled the dataset down to 18,000 observations, with 6,000 observations in each damage grade. We split this subsampled data into a training set of 13,500 observations and a testing set of 4,500 observations (a 75/25 split). Model training was validated using V-fold cross-validation with 10 folds and 3 repeats, stratified by damage_grade.
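A minimal sketch of this splitting scheme, assuming the full dataset is stored in a data frame called earthquake:

```r
# Sketch of the splitting scheme described above; object names are illustrative.
library(tidymodels)

set.seed(301)

# Subsample to 6,000 observations per damage grade (18,000 total)
eq_sub <- earthquake |>
  group_by(damage_grade) |>
  slice_sample(n = 6000) |>
  ungroup()

# 75/25 train/test split, stratified by the outcome
eq_split <- initial_split(eq_sub, prop = 0.75, strata = damage_grade)
eq_train <- training(eq_split)
eq_test  <- testing(eq_split)

# 10-fold cross-validation with 3 repeats, stratified by the outcome
eq_folds <- vfold_cv(eq_train, v = 10, repeats = 3, strata = damage_grade)
```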
Model development
Model development was split into an initial development stage and a secondary model refinement stage. The process used in each is described in further detail in Section 5 and Section 6, but each stage featured the following models and feature engineering steps:
Model Development Summary
| | Initial Stage | Model Refinement |
---|---|---|
Models | Naive Bayes (baseline), Elastic Net, Random Forest, Boosted Tree, SVM (radial and polynomial), Neural Network, MARS, Ensemble | Random Forest, Boosted Tree, Ensemble |
Recipes | Kitchen Sink (KS), Variable Selection (VS), Feature Engineering 1 (FE), Feature Engineering 2 (FE2) | Feature Engineering 3 (FE3), PCA, Other |
Initial stage
Recipes and Feature Engineering
The initial stage of work on this project featured experimental feature engineering and exploration of several model types. First, we developed the following recipes:
Recipe | Shorthand | Description |
---|---|---|
Kitchen Sink | KS | Predicts damage grade with all predictor columns. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors |
Variable Selection | VS | Predicts damage grade using predictors with a variable importance score >60 as determined by a random forest selection process. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors. |
Feature Engineering 1 | FE | Predicts damage grade with all predictors, featuring mean and mode imputation for numeric and nominal predictors, respectively. Handles novel and missing factor levels by relegating them to new, specified categories. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors. |
Feature Engineering 2 | FE2 | Predicts damage grade with all predictors except some id variables with a large number of levels. Removes explicit missingness encoded as numerical values in the building age variable and imputes with other highly-correlated predictors. ‘Others’ rare levels contributing to <5% of the dataset. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors. |
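To illustrate, a minimal sketch of the Kitchen Sink (KS) recipe in tidymodels, assuming the eq_train object from the splitting sketch above:

```r
# Sketch of the Kitchen Sink recipe: all predictors, dummy encoding,
# near-zero-variance filtering, and normalization of numeric predictors.
library(tidymodels)

ks_recipe <- recipe(damage_grade ~ ., data = eq_train) |>
  step_dummy(all_nominal_predictors()) |>   # dummy-encode nominal predictors
  step_nzv(all_predictors()) |>             # remove near-zero-variance columns
  step_normalize(all_numeric_predictors())  # center and scale numeric predictors
```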
Baseline Model
Our baseline model is a naive Bayes model fit with the kitchen sink recipe.
wflow_id | mean | std_err | total runtime (s) |
---|---|---|---|
nbayes_ks | 0.4013995 | 0.0132339 | 255 |
An average F-measure of 0.40 indicates relatively poor predictive capability from the baseline model, suggesting that fitting more complex models in the pursuit of our prediction problem is warranted.
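A minimal sketch of this baseline workflow, assuming the ks_recipe and eq_folds objects from the earlier sketches; the discrim extension package supplies naive_Bayes(), and the engine choice here is an assumption:

```r
# Sketch of the baseline workflow: naive Bayes with the kitchen sink recipe.
library(tidymodels)
library(discrim)  # provides naive_Bayes()

nb_spec <- naive_Bayes() |>
  set_engine("klaR") |>
  set_mode("classification")

nb_wflow <- workflow() |>
  add_recipe(ks_recipe) |>
  add_model(nb_spec)

nb_res <- fit_resamples(
  nb_wflow,
  resamples = eq_folds,
  metrics   = metric_set(f_meas, roc_auc)
)
collect_metrics(nb_res)
```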
Model Performance
After tuning our more complex models with the recipes in Table 1, the following best-performing models were obtained when scored by the MAFS metric:
model | recipe | mean | std_err | total runtime (s) |
---|---|---|---|---|
Random Forest | FE | 0.6704303 | 0.0025183 | 10020.862 |
Boosted Tree | FE2 | 0.6506890 | 0.0020727 | 2642.424 |
Elastic Net | FE | 0.6433707 | 0.0019431 | 949.310 |
Neural Network | VS | 0.5915064 | 0.0027660 | 553.475 |
SVM radial | FE2 | 0.5809771 | 0.0026705 | 5913.928 |
Random Forest | VS | 0.5574154 | 0.0022169 | 3851.986 |
SVM polynomial | VS | 0.5062007 | 0.0018512 | 8343.116 |
MARS | FE | 0.2286743 | 0.0023838 | 264.788 |
MARS | VS | 0.2189345 | 0.0027280 | 27.932 |
Figure 4 visually depicts the performance of each model using f_meas:
Not shown above is an ensemble model we also ran, which is analyzed further below.
Given their success in past classwork and projects, we are unsurprised that tree-based models performed best at this stage. We are also pleasantly surprised that the elastic net model performs reasonably well. There appears to be a natural divide around a MAFS score of 0.60 separating our models into "better" and "worse" groups. The standard errors recorded for each model also suggest that the differences in MAFS across models are statistically meaningful. With mean MAFS scores of 0.21 to 0.22, we are surprised that the MARS models performed so poorly: a MAFS of about 0.5 is roughly equivalent to random guessing, and anything less is qualitatively equivalent to "intentionally" selecting the wrong classifications.
We also note that variable selection appears to have worsened model performance compared to our other feature engineering recipes, suggesting that all variables carry information relevant to the prediction problem, an insight we carried into the next stage of model development.
Model Tuning Parameters
Following tuning, we found the below hyperparameter values to perform best for each model.
Hyperparameter | RF (FE) | RF (VS) | Boosted Tree | SVM Radial | SVM Poly | Elastic Net | Neural Network | MARS 1 | MARS 2 |
---|---|---|---|---|---|---|---|---|---|
mtry | 20 | 3 | 5.000 | | | | | | |
min_n | 21 | 11 | 2.000 | | | | | | |
trees | | | 1455.000 | | | | | | |
learn_rate | | | 0.021 | | | | | | |
cost | | | | 0.863 | 0.024 | | | | |
rbf_sigma | | | | 0.054 | | | | | |
margin | | | | 0.046 | | | | | |
degree | | | | | 3.000 | | | | |
scale_factor | | | | | 0.052 | | | | |
penalty | | | | | | 0.077 | 0.004 | | |
mixture | | | | | | 0.778 | | | |
hidden_units | | | | | | | 9.000 | | |
num_terms | | | | | | | | 1 | 1 |
prod_degree | | | | | | | | 1 | 1 |
Given that our efforts were spread across exploring as many model types and recipes as possible, it is difficult to come to well-supported conclusions about optimal hyperparameter values at this time. However, some loose observations can be made:
- A high mtry and min_n seem to improve random forest performance, judging by the difference in MAFS score (shown in Table 3) between the two random forest models; however, the discrepancy may also be due to the variable selection recipe, which we noted previously as depressing model performance.
- Curiously, the boosted tree model seemed to favor low mtry and min_n values.
- The elastic net model achieved good results with a low penalty but a high mixture, which loosely supports the idea that variable selection is detrimental to model performance.
Initial Stage Ensembling
An ensemble model was also produced, with model members fitted with FE2. Table 5 displays the best MAFS score for each model type included in the ensemble.
Model Type | mean | std_err |
---|---|---|
Random Forest | 0.6588691 | 0.0017454 |
Boosted Tree | 0.6520562 | 0.0018971 |
Elastic Net | 0.6442161 | 0.0020036 |
SVM Polynomial | 0.5688290 | 0.0029013 |
The natural divide in performance across models found in Table 3 seems to reappear here. Given this, we elected to move forward with our random forest, boosted tree, and ensemble models (excluding SVM Poly as a member) for further refinement.
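For reference, a rough sketch of how such an ensemble can be assembled with the stacks package, assuming tuning results rf_res, bt_res, and en_res whose predictions were saved with control_stack_grid() so class probabilities are retained (object names are illustrative):

```r
# Sketch of blending an ensemble (model stack) from tuned candidates.
library(tidymodels)
library(stacks)

eq_stack <- stacks() |>
  add_candidates(rf_res) |>
  add_candidates(bt_res) |>
  add_candidates(en_res) |>
  blend_predictions() |>   # choose member weights via penalized regression
  fit_members()            # refit retained members on the full training set

autoplot(eq_stack, type = "weights")  # inspect member weights
```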
Model refinement
The second stage of model refinement featured another round of recipe development and model tuning on the models mentioned above. The primary changes we made to our feature engineering centered around geographic location identifiers included in the raw dataset, which were mostly ignored during the first stage due to the sheer number of levels included in them.
geo_level_1_id, geo_level_2_id, and geo_level_3_id are obfuscated variables that specify the geographic region each building is located in. The level of specificity increases with each variable: geo_level_1_id contains 31 levels, geo_level_2_id contains 1,428 levels, and geo_level_3_id contains 12,568 levels. To reduce tuning times, the latter two variables were removed as predictors in all recipes used during the initial stage. However, these variables likely contain information highly relevant to improving model accuracy, which is why they became the focus of the model refinement stage.
Recipes and Feature Engineering
Table 6 summarizes the new recipes developed during this stage. Given that FE2 produced the best-performing models during the initial modeling stage, it was used again during model refinement to set a baseline threshold of performance (renamed FE3 to distinguish it from its earlier iteration), and the remaining recipes also include the steps found in this recipe. Since the primary conclusion from the initial stage was that more information leads to better results, our PCA recipe experiments with capturing interactions between all predictors, and the "Other" recipe tunes the threshold of step_other, which becomes important when considering the additional information contained in the geo ID variables.
Recipe | Shorthand | Description |
---|---|---|
Feature Engineering 3 | FE3 | Predicts damage grade with all predictors. Geo ID levels were converted to factor type using step_mutate . Removes explicit missingness encoded as numerical values in the building age variable and imputes with other highly-correlated predictors. ‘Others’ rare levels contributing to <5% of the dataset. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors. |
Principal Component Analysis | PCA | Implements principal component analysis by creating interaction terms between all predictors and then reducing dimensionality by condensing terms into a certain number of components, which was tuned during model tuning. Utilizes all geo ID levels. |
Tuning step_other | Other | Tunes the threshold for 'othering' a level in step_other, as we anticipated that the frequency of the different levels found in the geo ID variables would correlate with importance in prediction. Utilizes all geo ID levels. |
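As an illustration, a rough sketch of how the Other recipe's tunable threshold might be expressed; the step order and selectors are assumptions, not our exact recipe:

```r
# Sketch of the "Other" recipe idea: geo IDs converted to factors, then rare
# levels collapsed with a tunable threshold.
library(tidymodels)

other_recipe <- recipe(damage_grade ~ ., data = eq_train) |>
  step_mutate(
    geo_level_1_id = factor(geo_level_1_id),
    geo_level_2_id = factor(geo_level_2_id),
    geo_level_3_id = factor(geo_level_3_id)
  ) |>
  step_other(all_nominal_predictors(), threshold = tune()) |>  # tuned collapsing threshold
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
```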
While creating the PCA recipe, we discovered that the notation used to expand all two-way interactions, .*., caused the outcome variable to be included in the interaction terms. This created models with very high MAFS and ROC AUC estimates (>0.9) that were inconsistent with what we expected from using PCA with tree-based models. Upon further exploration, we discovered that the outcome variable had been incorporated into the recipe through step_interact. To address this issue, we used all_predictors() * all_predictors() to expand all two-way interactions.
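A sketch of the corrected interaction step within a PCA-style recipe; the surrounding steps are illustrative, not our exact specification:

```r
# Sketch of the corrected interaction step inside a PCA-style recipe.
library(tidymodels)

pca_recipe <- recipe(damage_grade ~ ., data = eq_train) |>
  step_dummy(all_nominal_predictors()) |>
  # step_interact(terms = ~ .*.)  # problematic: "." also pulls in damage_grade
  step_interact(terms = ~ all_predictors() * all_predictors()) |>  # predictors only
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = tune())  # number of components tuned
```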
In addition, we also adjusted hyperparameter tuning ranges during model refinement. To reiterate, we ran a boosted tree model, random forest model, and an ensemble model with boosted tree, random forest, and elastic net members. Three models of each type were produced, each fitted with a different feature engineering recipe as described above.
Model Types and Tuning
Random Forest
Table 7 and Table 8 show the MAFS and roc_auc of the random forest models, along with the average run times for each model. The random forest using the FE3 recipe was the best-performing random forest model. A total of 25 candidate models (tuning combinations) were run for each random forest workflow.
wflow_id | .metric | mean | std_err | average runtime (s) |
---|---|---|---|---|
rf_fe3 | f_meas | 0.6276903 | 0.0023860 | 1708.112 |
rf_other | f_meas | 0.6265786 | 0.0024843 | 1708.120 |
rf_pca | f_meas | 0.5903976 | 0.0021067 | 421.377 |
wflow_id | .metric | mean | std_err | average runtime (s) |
---|---|---|---|---|
rf_fe3 | roc_auc | 0.8123370 | 0.0015207 | 1708.112 |
rf_other | roc_auc | 0.8115979 | 0.0015000 | 1708.120 |
rf_pca | roc_auc | 0.7787781 | 0.0016641 | 421.377 |
Boosted Tree
We ran boosted tree models with each of the three recipes, updating tuning parameters based on the autoplots of the boosted tree model from the initial stage of modeling. These three models were run using the xgboost engine. Given that the model fitted with the "Other" recipe performed best, we also tuned boosted tree models using the lightgbm and catboost engines. Table 9 and Table 10 show the MAFS and ROC AUC of these boosted tree models along with the average run times for each model. The boosted tree model using the catboost engine was the best-performing boosted tree model.
wflow_id | .metric | mean | std_err | average runtime (s) |
---|---|---|---|---|
bt_catboost_other | f_meas | 0.6269855 | 0.0022532 | 167.2114 |
bt_lightgbm_other | f_meas | 0.6259222 | 0.0020285 | 115.0461 |
bt_other | f_meas | 0.6248485 | 0.0025401 | 273.4695 |
bt_fe3 | f_meas | 0.6236902 | 0.0022898 | 399.4303 |
bt_pca | f_meas | 0.5954955 | 0.0024196 | 793.3656 |
wflow_id | .metric | mean | std_err | average runtime (s) |
---|---|---|---|---|
bt_catboost_other | roc_auc | 0.8141758 | 0.0015412 | 167.2114 |
bt_lightgbm_other | roc_auc | 0.8131996 | 0.0013899 | 115.0461 |
bt_other | roc_auc | 0.8118673 | 0.0014865 | 273.4695 |
bt_fe3 | roc_auc | 0.8112522 | 0.0014710 | 399.4303 |
bt_pca | roc_auc | 0.7891539 | 0.0016715 | 793.3656 |
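For reference, a sketch of specifying a boosted tree with an alternative engine; we assume the bonsai extension package, which supplies the lightgbm engine (the catboost engine similarly requires an extension package), and the other_recipe object from the sketch above:

```r
# Sketch of a boosted tree specification using the lightgbm engine via bonsai.
library(tidymodels)
library(bonsai)  # registers the "lightgbm" engine for boost_tree()

bt_spec <- boost_tree(
  trees      = 1000,
  mtry       = tune(),
  min_n      = tune(),
  learn_rate = tune()
) |>
  set_engine("lightgbm") |>
  set_mode("classification")

bt_wflow <- workflow() |>
  add_recipe(other_recipe) |>
  add_model(bt_spec)
```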
Ensemble
We created three ensemble models based on the three recipes used in model refinement. Each ensemble model was blended using boosted tree, random forest, and elastic net member models.
FE3
Model Type | mean | std_err |
---|---|---|
rf | 0.6276125 | 0.0023544 |
bt | 0.6223128 | 0.0021921 |
en | 0.6046279 | 0.0023360 |
PCA
Model Type | mean | std_err |
---|---|---|
en | 0.6132573 | 0.0022513 |
rf | 0.5906554 | 0.0022357 |
bt | 0.5779136 | 0.0025291 |
Other
Model Type | mean | std_err |
---|---|---|
rf | 0.6323375 | 0.0025989 |
bt | 0.6248109 | 0.0025596 |
en | 0.6106990 | 0.0025337 |
Observing Table 11, Table 12, and Table 13 leads to the conclusion that the ensemble models perform best with the Other recipe, in accordance with the boosted tree models but not the random forest models. This might be because most ensembles tended to weigh the boosted tree models more heavily than the other members, as demonstrated in Figure 5.
However, when comparing the ensemble built with the Other recipe against all models tuned during model refinement, it still performed best. Thus, we moved forward with it as our final model.
Final model analysis
We selected the tuned ensemble model using the Other recipe from model refinement. Table 14 shows the performance metrics of the final ensemble model when evaluated on the testing data. The final ensemble model performed better than the estimated performance of each of its member models, as seen in Table 13.
metric | performance | runtime (s) |
---|---|---|
f_meas | 0.6413545 | 112.915 |
roc_auc | 0.8255142 | 112.915 |
Figure 6 displays the confusion matrix of this final model. The model distinguished between damage grade 1 and 3 the best, while damage grade 2 predictions appear to have lower accuracy. The model seemed to distinguish damage grade 1 and 2 better than damage grade 2 and 3.
While the final model’s MAFS appears relatively low, its corresponding ROC AUC score of 0.826 reflects fairly good performance. On the other hand, it is interesting to note that while the final model performed significantly better than the baseline model (MAFS = 0.401), it did not outperform the estimated performance of the random forest model built during stage 1 fitted with minimal feature engineering (MAFS = 0.67), which did not discard outlier building age values and treated geo ID variables as continuous numeric variables.
In terms of the confusion matrix, the largest values were the counts of correct predictions for each class: 1,106 buildings with damage grade 1, 771 buildings with damage grade 2, and 1,005 buildings with damage grade 3 were classified correctly.
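A sketch of how the test-set metrics and confusion matrix above might be computed, assuming the fitted model stack eq_stack and test set eq_test from the earlier sketches; the probability column names assume factor levels 1, 2, and 3:

```r
# Sketch of the test-set evaluation of the final ensemble.
library(tidymodels)

test_preds <- eq_test |>
  bind_cols(predict(eq_stack, eq_test)) |>               # hard class predictions
  bind_cols(predict(eq_stack, eq_test, type = "prob"))   # class probabilities

f_meas(test_preds, truth = damage_grade, estimate = .pred_class, estimator = "macro")
roc_auc(test_preds, truth = damage_grade, .pred_1, .pred_2, .pred_3)
conf_mat(test_preds, truth = damage_grade, estimate = .pred_class)
```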
Regarding the age variable, perhaps the unusual data points we observed were not mistakes in data entry after all, which may explain the underperformance of models built during model refinement. Furthermore, if the geo ID variables were in fact encoded on an ordered scale, it may have been more effective to leave them as continuous numeric variables (as the initial random forest model did) rather than converting them to unordered factor variables as we did during model refinement. However, we are not given further details about the meaning of each level in the geo ID variables, so this claim remains unverified.
Regardless of the performance of the final model, the development of this final model offered a good exercise in data exploration and critical thinking in feature engineering, which were undoubtedly useful even given the limitations on what we know about the data.
Conclusion
To summarize, in this project we utilized information on buildings struck by the Gorkha earthquake to build a model predicting the level of damage each building sustained from its characteristics, encoded in the variable damage_grade. Using the Macro-Averaged F1 Score (MAFS) metric, we found that an ensemble model combining tree-based models (random forest and boosted tree) and multinomial regression performed best, achieving a final MAFS on the testing set of 0.641. We found that variable selection causes models to underperform, and that our recipe tuning the threshold level in step_other boosts model performance. This suggests that while most variables are important, not all levels within each variable are.
An immediate next step would be to repeat the analysis on the remainder of the data that went unused due to downsampling, or to acquire greater computational resources to better incorporate the full dataset into our modeling pipeline. We could also further adjust the tuning parameter ranges during model refinement; if the best model used a parameter value at the edge of a given range, we could widen the range in that direction. In addition, a further area of research is determining what some of the obfuscated variable levels represent, which could inform improved feature engineering. While the additional feature engineering we undertook during model refinement was not particularly helpful, the competition leaderboard shows that most competitors achieved micro-averaged F-scores of 0.7 or above, suggesting that further engineering can be conducted and would be useful.
References
- Jules. (2022). Earthquake Richter Prediction Competition. Kaggle. https://kaggle.com/competitions/earthquake-richter-prediction-competition