Earthquake: Damage Grade Prediction

Final Project
Data Science 3 with R (STAT 301-3)

Authors

Sydney Lee

Cassie Lee

Hannah Ma

Published

June 4, 2024

Github Repo and Data Source
  • 1 Jules. (2022). Earthquake Richter Prediction Competition. Kaggle. https://kaggle.com/competitions/earthquake-richter-prediction-competition

Introduction

This project focuses on multi-class classification of the damage sustained by buildings following the magnitude 7.8 earthquake that struck Nepal in 2015. We utilize data sourced from the Earthquake Richter Prediction Competition hosted on Kaggle in December 2022, and use the structural characteristics of the buildings in the data to predict the level of damage sustained (low, medium, high). This project also speaks to a relevant current event, given Taiwan’s recent earthquake, and to broader architectural concerns related to earthquake resistance.

    About the Target Variable

    The target variable is damage_grade which is a factor variable made up of 3 levels:

    • 1: minimal damage
    • 2: medium/considerable damage
    • 3: near complete destruction

    This variable measures the level of damage to buildings. Further analysis of damage_grade is shown below.

The target variable has no missing observations, but it does exhibit class imbalance: level 2 has significantly more observations than the other levels, and level 1 has significantly fewer.

    Data overview

    The original dataset has 156,360 observations and 40 variables. Each instance in the data represents one building hit by the April 2015 Gorkha earthquake in Nepal, and the columns contain structural and geographic information about the building. Being provided as part of a Kaggle competition, the data contains no explicit missingness and a good variety of predictor types.

    We performed a brief exploratory data analysis on a randomly sampled subset of the training data. Each variable was plotted against the damage grade outcome. Box plots were used for numeric predictors, and mosaic plots were used for nominal predictors.

Exploratory data analysis showed that only the age of the building had notable outliers or apparent errors in data collection. Figure 1 shows the distribution of age across the three damage grades. Buildings recorded as 1,000 years old could represent either cultural heritage structures or errors in data collection.

    Figure 1: This figure shows the distribution of age of the buildings relative to the damage grade after the earthquake.

Additionally, there is noticeable variation in damage grade across foundation type, having an adobe or mud superstructure, having a mud mortar - stone superstructure, and normalized building height (Figure 2). This exploration helped us understand our predictors better and informed the initial stages of feature engineering, as well as our decision to try feature selection through random forest variable importance.

    (a) Foundation type

    (b) Adobe or mud superstructure

    (c) Mud mortar - stone superstructure

    (d) Normalized building height

    Figure 2: Visualization of damage grade across a few select predictors.

    Methods overview

    Assessment metrics

    F-measure

We elected to use the macro-averaged F-measure as our assessment metric. The F-measure describes the accuracy of the model by taking the harmonic mean of the sensitivity² and precision³ of its predictions. The macro-averaged F-measure is the unweighted mean of the per-class F-measures, making it suitable for multi-class prediction and useful when classes are imbalanced. It ranges from 0 to 1, with 1 being the most desirable.

  • 2 sensitivity = true positives/all actual positives

• 3 precision = true positives/all predicted positives

An f_meas below 0.5 indicates that the model’s performance is poor, due either to low precision and sensitivity or to an imbalance between the two. An f_meas of 0.5 means the model is no better than random guessing. An f_meas above 0.5 indicates better performance, with higher precision and sensitivity.

Macro-Averaged F1 Score = \(\frac{1}{n}\sum_{i=1}^{n}{F1\ Score_{i}}\), where \(n\) is the number of classes (here, 3).

    For brevity, we will refer to this as the MAFS metric for the remainder of this report.
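As a concrete reference point, the sketch below shows how a macro-averaged F-measure can be computed with yardstick, the metrics package used by tidymodels. The tibble preds and its columns are hypothetical stand-ins for a set of resampled predictions.

library(yardstick)

# `preds` holds out-of-sample predictions: the true class in `damage_grade`
# and the predicted class in `.pred_class` (hypothetical column names).
# estimator = "macro" takes the unweighted mean of the per-class F1 scores,
# i.e. the MAFS metric defined above.
f_meas(preds, truth = damage_grade, estimate = .pred_class, estimator = "macro")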

    ROC_AUC

We also opted to use average roc_auc as a second metric when refining the top three models from the initial stage and when building ensemble models. We chose it because of our familiarity with the metric and because building ensemble models requires class probabilities. roc_auc is based on the receiver operating characteristic curve, which plots the true positive rate against the false positive rate; the area under the curve (AUC) measures how well the model separates the classes. An average roc_auc of 0.5 means the model is no better than random guessing, and a value below 0.5 means it performs worse than random guessing. Scores between 0.6 and 0.7 indicate poor to fair performance, 0.7 to 0.8 fair to good, and 0.8 to 0.9 good to very good, meaning the model can effectively distinguish between classes. An average roc_auc close to 1 indicates excellent model performance.

    Data Splitting

    Due to the size of the dataset and class imbalance, we subsampled the dataset down to 18,000 observations, with 6,000 observations in each damage grade. We split this subsampled data into a training set of 13,500 observations and testing set of 4,500 observations (75/25 split). Model training was validated using vfold cross-validation on 10 folds with 3 repeats, stratified by damage_grade.
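A minimal sketch of this splitting scheme in tidymodels is shown below; the data frame earthquake and the seed value are hypothetical, and stratifying the initial split by the outcome (in addition to the folds) is an assumption on our part.

library(tidymodels)
set.seed(301)  # illustrative seed

# Subsample to 6,000 buildings per damage grade (18,000 total)
eq_sub <- earthquake %>%
  group_by(damage_grade) %>%
  slice_sample(n = 6000) %>%
  ungroup()

# 75/25 train/test split (13,500 / 4,500 observations)
eq_split <- initial_split(eq_sub, prop = 0.75, strata = damage_grade)
eq_train <- training(eq_split)
eq_test  <- testing(eq_split)

# 10-fold cross-validation with 3 repeats, stratified by damage_grade
eq_folds <- vfold_cv(eq_train, v = 10, repeats = 3, strata = damage_grade)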

    Model development

Model development was split into an initial development stage and a secondary model refinement stage. The process used in each is described in further detail in Section 5 and Section 6, but each stage featured the following models and feature engineering steps:

Model Development Summary

Initial Stage

Models
• Random Forest
• Boosted Tree
• SVM Polynomial Function
• SVM Radial Basis Function
• Elastic Net (Multinomial Logistic Regression)
• Neural Network
• MARS
• Ensemble

Recipes
• Kitchen Sink
• Random Forest Variable Selection
• Parametric Feature Engineering
• Non-parametric Feature Engineering

Model Refinement

Models
• Random Forest
• Boosted Tree
• Ensemble

Recipes
• Feature Engineering (Parametric or Non-parametric)
• PCA Feature Engineering (Parametric or Non-parametric)
• Tuned step_other (Parametric or Non-parametric)

    Initial stage

    Recipes and Feature Engineering

The initial stage of work on this project featured experimental feature engineering and exploration of several model types. First, we developed the following recipes:

    Table 1: Initial Stage Recipes
    Recipe Shorthand Description
    Kitchen Sink KS Predicts damage grade with all predictor columns. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors
    Variable Selection VS Predicts damage grade using predictors with a variable importance score >60 as determined by a random forest selection process. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.
    Feature Engineering 1 FE Predicts damage grade with all predictors, featuring mean and mode imputation for numeric and nominal predictors, respectively. Handles novel and missing factor levels by relegating them to new, specified categories. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.
    Feature Engineering 2 FE2 Predicts damage grade with all predictors except some id variables with a large number of levels. Removes explicit missingness encoded as numerical values in the building age variable and imputes with other highly-correlated predictors. ‘Others’ rare levels contributing to <5% of the dataset. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.

    Figure 3: Histogram of Variable Importance generated using random forest variable selection. Variables with a variable importance score above 60 (red line) were selected for the VS recipe.
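As an illustration of the recipe steps listed in Table 1, a minimal sketch of the kitchen sink (KS) recipe is given below, assuming the hypothetical training tibble eq_train from the splitting sketch above.

# Kitchen sink (KS) recipe: all predictors, dummy-encoded nominal variables,
# near-zero-variance columns removed, numeric predictors normalized.
ks_recipe <- recipe(damage_grade ~ ., data = eq_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())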

    Baseline Model

Our baseline model is a naive Bayes model fit with the kitchen sink recipe.
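A sketch of how such a baseline could be specified and resampled is shown below; it uses the naive_Bayes() specification from the discrim extension package, and relies on the hypothetical ks_recipe and eq_folds objects from the earlier sketches.

library(discrim)  # provides naive_Bayes() for parsnip

nb_spec <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

nb_wflow <- workflow() %>%
  add_recipe(ks_recipe) %>%
  add_model(nb_spec)

# Resampled estimate of the baseline's macro-averaged F-measure
nb_results <- fit_resamples(nb_wflow, resamples = eq_folds,
                            metrics = metric_set(f_meas))
collect_metrics(nb_results)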

    Table 2: Baseline Model Performance (measured by MAFS)
    wflow_id mean std_err total runtime (s)
    nbayes_ks 0.4013995 0.0132339 255

    An average F-measure of 0.40 indicates relatively poor predictive capability from the baseline model, suggesting that fitting more complex models in the pursuit of our prediction problem is warranted.

    Model Performance

    After tuning our more complex models with the recipes in Table 1, the following best-performing models were obtained when scored by the MAFS metric:

    Table 3: Initial Stage Model Performance (measured by MAFS)
    model recipe mean std_err total runtime (s)
    Random Forest FE 0.6704303 0.0025183 10020.862
    Boosted Tree FE2 0.6506890 0.0020727 2642.424
    Elastic Net FE 0.6433707 0.0019431 949.310
    Neural Network VS 0.5915064 0.0027660 553.475
    SVM radial FE2 0.5809771 0.0026705 5913.928
    Random Forest VS 0.5574154 0.0022169 3851.986
    SVM polynomial VS 0.5062007 0.0018512 8343.116
    MARS FE 0.2286743 0.0023838 264.788
    MARS VS 0.2189345 0.0027280 27.932

    Figure 4 depicts the performance of each model using f_meas visually:

    Figure 4: Visual depiction of Initial Model Performance from Autoplot

Not shown above is an ensemble model that we also ran; it is analyzed further below.

Given their prior success in past classwork and projects, we are unsurprised that tree-based models performed best at this stage. We are also pleasantly surprised that the elastic net model performs reasonably well. There appears to be a natural divide around a MAFS score of 0.60 that separates our models into “better” and “worse” groups. The standard errors recorded for each model also suggest that the differences in MAFS observed across models are meaningful. With mean MAFS scores of 0.21-0.22, we are surprised that the MARS models performed so poorly: a MAFS score of about 0.5 is equivalent to random guessing, and anything lower is qualitatively equivalent to “intentionally” selecting the wrong classifications.

We also note that variable selection appears to have worsened model performance compared to our other feature engineering recipes, suggesting that all variables contain information relevant to the prediction problem; we carried this finding into the next stage of model development.
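For reference, the sketch below outlines how the recipes and model specifications could be crossed into a workflow set and tuned over the resamples; the object names (fe_recipe, fe2_recipe, and the two specifications) are hypothetical, and the grid size is illustrative rather than the one we actually used.

# Two illustrative model specifications; tune() marks parameters to be tuned
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

bt_spec <- boost_tree(mtry = tune(), min_n = tune(),
                      trees = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Cross each recipe with each model, then tune every workflow on the folds
all_workflows <- workflow_set(
  preproc = list(ks = ks_recipe, fe = fe_recipe, fe2 = fe2_recipe),
  models  = list(rf = rf_spec, bt = bt_spec)
)

tuned_results <- all_workflows %>%
  workflow_map("tune_grid",
               resamples = eq_folds,
               grid = 5,
               metrics = metric_set(f_meas, roc_auc),
               seed = 301)

rank_results(tuned_results, rank_metric = "f_meas")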

    Model Tuning Parameters

    Following tuning, we found the below hyperparameter values to perform best for each model.

Table 4: Best-Performing Hyperparameters
Model Best hyperparameter values
Random Forest (FE) mtry = 20, min_n = 21
Random Forest (VS) mtry = 3, min_n = 11
Boosted Tree mtry = 5, min_n = 2, trees = 1455, learn_rate = 0.021
SVM Radial cost = 0.863, rbf_sigma = 0.054, margin = 0.046
SVM Polynomial cost = 0.024, degree = 3, scale_factor = 0.052
Elastic Net penalty = 0.077, mixture = 0.778
Neural Network penalty = 0.004, hidden_units = 9
MARS 1 num_terms = 1, prod_degree = 1
MARS 2 num_terms = 1, prod_degree = 1

    Given that our efforts were spread across exploring as many model types and recipes as possible, it is difficult to come to well-supported conclusions about optimal hyperparameter values at this time. However, some loose observations can be made:

• A high mtry and min_n seem to improve random forest performance, judging from the difference in MAFS score between the two random forest models (shown in Table 3); however, the discrepancy may also be due to the variable selection recipe, which we noted previously appears to depress model performance.

• Curiously, the boosted tree model seemed to favor low mtry and min_n values.

    • The elastic net model was able to achieve good results with a low penalty, but high mixture, which loosely supports the idea that variable selection is detrimental to model performance.

    Initial Stage Ensembling

    An ensemble model was also produced, with model members fitted with FE2. Table 5 displays the best MAFS score for each model type included in the ensemble.

    Table 5: Initial Stage Ensemble Member Model Performance (measured by MAFS)
    Model Type mean std_err
    Random Forest 0.6588691 0.0017454
    Boosted Tree 0.6520562 0.0018971
    Elastic Net 0.6442161 0.0020036
    SVM Polynomial 0.5688290 0.0029013

    The natural divide in performance across models found in Table 3 seems to reappear here. Given this, we elected to move forward with our random forest, boosted tree, and ensemble models (excluding SVM Poly as a member) for further refinement.
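A minimal sketch of how such an ensemble can be assembled with the stacks package is shown below; rf_tuned, bt_tuned, and en_tuned are hypothetical tuning results whose predictions and workflows were saved (for example with control_stack_grid()).

library(stacks)

eq_stack <- stacks() %>%
  add_candidates(rf_tuned) %>%   # random forest candidates
  add_candidates(bt_tuned) %>%   # boosted tree candidates
  add_candidates(en_tuned) %>%   # elastic net candidates
  blend_predictions() %>%        # penalized regression picks member weights
  fit_members()                  # refit retained members on the training set

# Inspect which candidates were kept and their stacking coefficients
collect_parameters(eq_stack, "rf_tuned")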

    Model refinement

    The second stage of model refinement featured another round of recipe development and model tuning on the models mentioned above. The primary changes we made to our feature engineering centered around geographic location identifiers included in the raw dataset, which were mostly ignored during the first stage due to the sheer number of levels included in them.

    geo_level_1_id, geo_level_2_id, and geo_level_3_id are obfuscated variables which specify the geographic region each building exists in. The level of specificity increases with each variable, with geo_level_1_id containing 31 levels, geo_level_2_id containing 1428 levels, and geo_level_3_id containing 12568 levels. To reduce tuning times, the latter two variables were removed as predictors in all recipes used during the initial stage. But these variables are likely to contain information highly relevant to improving model accuracy, which is why they became the focus of the model refinement stage.

    Recipes and Feature Engineering

Table 6 summarizes the new recipes developed during this stage. Given that FE2 produced the best-performing models during the initial modeling stage, it was used again during model refinement to set a baseline threshold of performance (now renamed FE3 to distinguish it from its earlier iteration). The remaining recipes also include the steps found in this recipe. Since the primary conclusion from the initial stage was that more information leads to better results, our PCA recipe experiments with capturing interactions between all predictors, and the “Other” recipe tunes the threshold of step_other, which becomes important when considering the additional information contained in the geo ID variables.

    Table 6: Descriptions of Recipes Used in Model Refinement
    Recipe Shorthand Description
    Feature Engineering 3 FE3 Predicts damage grade with all predictors. Geo ID levels were converted to factor type using step_mutate. Removes explicit missingness encoded as numerical values in the building age variable and imputes with other highly-correlated predictors. ‘Others’ rare levels contributing to <5% of the dataset. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.
    Principal Component Analysis PCA Implements principal component analysis by creating interaction terms between all predictors and then reducing dimensionality by condensing terms into a certain number of components, which was tuned during model tuning. Utilizes all geo ID levels.
    Tuning step_other Other Tunes the threshold for ‘othering’ a level in step_other, as we anticipated that the frequency of the different levels found in the geo ID levels would correlate with importance in prediction. Utilizes all geo ID levels.
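A sketch of the “Other” recipe idea follows, with the step_other threshold exposed to the tuning process; the geo_level_*_id column names come from the dataset, while eq_train and the surrounding steps mirror the hypothetical FE3-style recipe rather than our exact code.

other_recipe <- recipe(damage_grade ~ ., data = eq_train) %>%
  # treat the geo ID variables as (unordered) factors
  step_mutate(geo_level_1_id = as.factor(geo_level_1_id),
              geo_level_2_id = as.factor(geo_level_2_id),
              geo_level_3_id = as.factor(geo_level_3_id)) %>%
  # threshold = tune() lets resampling decide how rare a level must be
  # before it is collapsed into an "other" category
  step_other(all_nominal_predictors(), threshold = tune()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())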

While creating the PCA recipe, we discovered that the notation used to expand all two-way interactions, .*., caused the outcome variable to be included in the interaction terms. This produced models with MAFS and ROC AUC estimates so high (>0.9) that they were inconsistent with what we expected from using PCA with tree-based models. Upon further exploration, we confirmed that the outcome variable had been incorporated into the recipe through step_interact. To address this issue, we used all_predictors() * all_predictors() to expand all two-way interactions.
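The corrected interaction step, together with the PCA step whose number of components is tuned, might look like the sketch below; this is an illustration of the fix described above, not our exact recipe.

# Restricting the interaction expansion to predictors keeps damage_grade out
# of the interaction terms; the earlier ~ .*. formula did not.
pca_recipe <- recipe(damage_grade ~ ., data = eq_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ all_predictors() * all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())  # num_comp tuned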

    In addition, we also adjusted hyperparameter tuning ranges during model refinement. To reiterate, we ran a boosted tree model, random forest model, and an ensemble model with boosted tree, random forest, and elastic net members. Three models of each type were produced, each fitted with a different feature engineering recipe as described above.

    Model Types and Tuning

    Random Forest

Table 7 and Table 8 show the MAFS and roc_auc of the random forest models, along with the average run times for each model. The random forest using the FE3 recipe was the best-performing random forest model. A total of 25 candidate models were run for each random forest workflow.

    Table 7: MAFS and average model run times of random forest models.
    wflow_id .metric mean std_err average runtime (s)
    rf_fe3 f_meas 0.6276903 0.0023860 1708.112
    rf_other f_meas 0.6265786 0.0024843 1708.120
    rf_pca f_meas 0.5903976 0.0021067 421.377
    Table 8: ROC AUC and average model run times of random forest models.
    wflow_id .metric mean std_err average runtime (s)
    rf_fe3 roc_auc 0.8123370 0.0015207 1708.112
    rf_other roc_auc 0.8115979 0.0015000 1708.120
    rf_pca roc_auc 0.7787781 0.0016641 421.377

    Boosted Tree

We ran boosted tree models with each of the three recipes, updating tuning parameter ranges based on the autoplots of the boosted tree model from the initial stage of modeling. These three models were run using the xgboost engine. Given that the model fitted with the “Other” recipe performed best, we also tuned boosted tree models using the lightgbm and catboost engines. Table 9 and Table 10 show the MAFS and ROC AUC of these boosted tree models along with the average run times for each model. The boosted tree model using the catboost engine was the best-performing boosted tree model.

    Table 9: MAFS and average model run times of boosted tree models.
    wflow_id .metric mean std_err average runtime (s)
    bt_catboost_other f_meas 0.6269855 0.0022532 167.2114
    bt_lightgbm_other f_meas 0.6259222 0.0020285 115.0461
    bt_other f_meas 0.6248485 0.0025401 273.4695
    bt_fe3 f_meas 0.6236902 0.0022898 399.4303
    bt_pca f_meas 0.5954955 0.0024196 793.3656
    Table 10: ROC AUC and average model run times of boosted tree models.
    wflow_id .metric mean std_err average runtime (s)
    bt_catboost_other roc_auc 0.8141758 0.0015412 167.2114
    bt_lightgbm_other roc_auc 0.8131996 0.0013899 115.0461
    bt_other roc_auc 0.8118673 0.0014865 273.4695
    bt_fe3 roc_auc 0.8112522 0.0014710 399.4303
    bt_pca roc_auc 0.7891539 0.0016715 793.3656
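As an illustration, the sketch below specifies a boosted tree with the lightgbm engine, which the bonsai extension package registers for boost_tree(); the catboost engine requires a separate extension and is not shown. The recipe, folds, and grid size are the hypothetical objects and values used in the earlier sketches.

library(bonsai)  # registers the "lightgbm" engine for boost_tree()

bt_lightgbm_spec <- boost_tree(mtry = tune(), min_n = tune(),
                               trees = tune(), learn_rate = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")

bt_lightgbm_wflow <- workflow() %>%
  add_recipe(other_recipe) %>%
  add_model(bt_lightgbm_spec)

bt_lightgbm_tuned <- tune_grid(bt_lightgbm_wflow,
                               resamples = eq_folds,
                               grid = 10,
                               metrics = metric_set(f_meas, roc_auc))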

    Ensemble

We created three ensemble models, one for each of the three recipes used in model refinement. Each ensemble model was blended from boosted tree, random forest, and elastic net member models.

    FE3

    Table 11: Best-Performing models of each Type in FE3 Ensemble
    Model Type mean std_err
    rf 0.6276125 0.0023544
    bt 0.6223128 0.0021921
    en 0.6046279 0.0023360

    PCA

    Table 12: Best-Performing models of each Type in PCA Ensemble
    Model Type mean std_err
    en 0.6132573 0.0022513
    rf 0.5906554 0.0022357
    bt 0.5779136 0.0025291

    Other

    Table 13: Best-Performing models of each Type in Other Ensemble
    Model Type mean std_err
    rf 0.6323375 0.0025989
    bt 0.6248109 0.0025596
    en 0.6106990 0.0025337

    Observing Table 11, Table 12, and Table 13 leads to the conclusion that the ensemble models perform best with the Other recipe, in accordance with the boosted tree models but not the random forest models. This might be because most ensembles tended to weigh the boosted tree models more heavily than the other members, as demonstrated in Figure 5.

However, when considered against all models tuned during model refinement, the Other-recipe ensemble still performed best. Thus, we will move forward with it as our final model.

    (a) FE3-tuned ensemble

    (b) PCA-tuned ensemble

    (c) Other-tuned ensemble

    Figure 5: Model weights of ensemble models.

    Final model analysis

    We selected the tuned ensemble model using the Other recipe during model refinement. Table 14 shows the performance metrics of the final ensemble model when fitted to the testing data. The final ensemble model performed better than the estimated performance of each member model as seen in Table 13.

    Table 14: Performance metrics and runtime of final ensemble model using Other recipe.
    metric performance runtime (s)
    f_meas 0.6413545 112.915
    roc_auc 0.8255142 112.915

    Figure 6 displays the confusion matrix of this final model. The model distinguished between damage grade 1 and 3 the best, while damage grade 2 predictions appear to have lower accuracy. The model seemed to distinguish damage grade 1 and 2 better than damage grade 2 and 3.

    Figure 6: Confusion matrix of final ensemble model
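A sketch of how the test-set results in Table 14 and the confusion matrix in Figure 6 can be produced is given below; final_stack stands in for the fitted ensemble and eq_test for the held-out test set, both hypothetical names.

# Attach hard-class and probability predictions from the fitted ensemble
test_preds <- eq_test %>%
  bind_cols(predict(final_stack, new_data = eq_test)) %>%
  bind_cols(predict(final_stack, new_data = eq_test, type = "prob"))

# Test-set MAFS and ROC AUC
f_meas(test_preds, truth = damage_grade, estimate = .pred_class,
       estimator = "macro")
roc_auc(test_preds, truth = damage_grade, .pred_1, .pred_2, .pred_3)

# Confusion matrix (Figure 6)
conf_mat(test_preds, truth = damage_grade, estimate = .pred_class) %>%
  autoplot(type = "heatmap")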

    While the final model’s MAFS appears relatively low, its corresponding ROC AUC score of 0.826 reflects fairly good performance. On the other hand, it is interesting to note that while the final model performed significantly better than the baseline model (MAFS = 0.401), it did not outperform the estimated performance of the random forest model built during stage 1 fitted with minimal feature engineering (MAFS = 0.67), which did not discard outlier building age values and treated geo ID variables as continuous numeric variables.

In terms of the confusion matrix, the largest values were 1,106, 771, and 1,005, which correspond to the number of buildings correctly classified as damage grades 1, 2, and 3, respectively.

Regarding the age variable, perhaps the unusual data points we observed were not mistakes in data entry after all, which may explain the underperformance of models built during model refinement. Furthermore, if the geo ID variables were in fact encoded on an ordered scale, it may have been more effective to leave them as continuous numeric variables (as the initial random forest model did) rather than converting them to unordered factor variables, as we did during model refinement. However, we are not given further details about the meaning of each level in the geo ID variables, so this claim remains unverified.

    Regardless of the performance of the final model, the development of this final model offered a good exercise in data exploration and critical thinking in feature engineering, which were undoubtedly useful even given the limitations on what we know about the data.

    Conclusion

To summarize, in this project we utilized information on buildings struck by the Gorkha earthquake to build a model that predicts the damage a building sustained, encoded in the variable damage_grade, from its structural characteristics. Using the Macro-Averaged F1 Score (MAFS) metric, we found that an ensemble model combining tree-based models (random forest and boosted tree) and multinomial regression performed best, achieving a final MAFS score of 0.641 on the testing set. We found that variable selection causes models to underperform, while our recipe that tunes the threshold level in step_other boosts model performance. This suggests that while most variables are important, not all levels within each variable are.

An immediate next step would be to repeat the analysis on the remainder of the data that went unused due to downsampling, or to acquire greater computational resources to better incorporate the full dataset into our modeling pipeline. We could also further adjust the tuning parameters during model refinement: if the best model used a parameter value at the edge of a given range, we could expand that range. In addition, a further area of research is to determine what some of the obfuscated variable levels represent, which could inform improved feature engineering. While the additional feature engineering we undertook during model refinement was not particularly helpful here, the competition leaderboard shows that most competitors achieved micro-averaged F-scores of 0.7 or above, suggesting that further engineering can be conducted and will be useful.

    References

    1. Jules. (2022). Earthquake Richter Prediction Competition. Kaggle. https://kaggle.com/competitions/earthquake-richter-prediction-competition