Earthquake: Damage Grade Prediction

Final Project
Data Science 3 with R (STAT 301-3)

Authors

Sydney Lee

Cassie Lee

Hannah Ma

Published

June 4, 2024

Github Repo and Data Source
  • 1 Jules. (2022). Earthquake Richter Prediction Competition. Kaggle. https://kaggle.com/competitions/earthquake-richter-prediction-competition

Introduction

This project focuses on multi-class classification of the damage sustained by buildings following the magnitude 7.8 earthquake that struck Nepal in 2015. We utilize data sourced from the Earthquake Richter Prediction Competition hosted on Kaggle in December 2022, and use the structural characteristics of the buildings in the data to predict the level of damage sustained (low, medium, high). This project also speaks to a relevant current event, given Taiwan’s recent earthquake, and to broader architectural concerns related to earthquake resistance.

    About the Target Variable

    The target variable is damage_grade which is a factor variable made up of 3 levels:

    • 1: minimal damage
    • 2: medium/considerable damage
    • 3: near complete destruction

    This variable measures the level of damage to buildings. Further analysis of damage_grade is shown below.

The target variable has no missing observations, but it does exhibit class imbalance: level 2 has significantly more observations than the other levels, and level 1 has significantly fewer.

    Data overview

    The original dataset has 156,360 observations and 40 variables. Each instance in the data represents one building hit by the April 2015 Gorkha earthquake in Nepal, and the columns contain structural and geographic information about the building. Being provided as part of a Kaggle competition, the data contains no explicit missingness and a good variety of predictor types.

    We performed a brief exploratory data analysis on a randomly sampled subset of the training data. Each variable was plotted against the damage grade outcome. Box plots were used for numeric predictors, and mosaic plots were used for nominal predictors.

Exploratory data analysis showed that only the age of the building had notable outliers or apparent errors in data collection. Figure 1 shows the distribution of age across the three damage grades. Buildings recorded as 1,000 years old could represent either cultural heritage structures or errors in data collection.

    Figure 1: This figure shows the distribution of age of the buildings relative to the damage grade after the earthquake.

Additionally, there is noticeable variation in damage grade across foundation type, having an adobe or mud superstructure, having a mud mortar - stone superstructure, and normalized building height (Figure 2). This exploration helped us understand our predictors better and informed the initial stages of feature engineering, as well as our decision to try feature selection through random forest variable importance.

    (a) Foundation type

    (b) Adobe or mud superstructure

    (c) Mud mortar - stone superstructure

    (d) Normalized building height

    Figure 2: Visualization of damage grade across a few select predictors.

    Methods overview

    Assessment metrics

    F-measure

We elected to use the macro-averaged F-measure as our assessment metric. The F-measure describes the accuracy of the model by taking the harmonic mean of the sensitivity² and precision³ of its predictions. The macro-averaged F-measure is the unweighted mean of the per-class F-measures, making it suitable for multi-class prediction and useful when classes are imbalanced. It ranges from 0 to 1, with 1 being the most desirable.

  • 2 sensitivity = true positives/all actual positives

• 3 precision = true positives/all predicted positives

An f_meas below 0.5 indicates that the model’s performance is poor, due either to low precision and sensitivity or to an imbalance between the two. An f_meas of 0.5 means the model is no better than random guessing. An f_meas above 0.5 indicates better performance, with higher precision and sensitivity.

Macro-Averaged F1 Score = \(\frac{1}{n}\sum_{i=1}^{n}{F1\ Score_{i}}\), where \(n\) is the number of classes (here, 3).

    For brevity, we will refer to this as the MAFS metric for the remainder of this report.
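As a concrete reference point, the sketch below shows how a macro-averaged F-measure can be computed with yardstick, the metrics package used by tidymodels. The tibble preds and its columns are hypothetical stand-ins for a set of resampled predictions.

library(yardstick)

# `preds` holds out-of-sample predictions: the true class in `damage_grade`
# and the predicted class in `.pred_class` (hypothetical column names).
# estimator = "macro" takes the unweighted mean of the per-class F1 scores,
# i.e. the MAFS metric defined above.
f_meas(preds, truth = damage_grade, estimate = .pred_class, estimator = "macro")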

    ROC_AUC

We also opted to use average roc_auc as a second metric when refining the top three models from the initial stage and when building ensemble models. We chose it because of our familiarity with the metric and because building ensemble models requires class probabilities. roc_auc is based on the receiver operating characteristic curve, which plots the true positive rate against the false positive rate; the area under the curve (AUC) measures how well the model separates the classes. An average roc_auc of 0.5 means the model is no better than random guessing, and a value below 0.5 means it performs worse than random guessing. Scores between 0.6 and 0.7 indicate poor to fair performance, 0.7 to 0.8 fair to good, and 0.8 to 0.9 good to very good, meaning the model can effectively distinguish between classes. An average roc_auc close to 1 indicates excellent model performance.

    Data Splitting

    Due to the size of the dataset and class imbalance, we subsampled the dataset down to 18,000 observations, with 6,000 observations in each damage grade. We split this subsampled data into a training set of 13,500 observations and testing set of 4,500 observations (75/25 split). Model training was validated using vfold cross-validation on 10 folds with 3 repeats, stratified by damage_grade.
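A minimal sketch of this splitting scheme in tidymodels is shown below; the data frame earthquake and the seed value are hypothetical, and stratifying the initial split by the outcome (in addition to the folds) is an assumption on our part.

library(tidymodels)
set.seed(301)  # illustrative seed

# Subsample to 6,000 buildings per damage grade (18,000 total)
eq_sub <- earthquake %>%
  group_by(damage_grade) %>%
  slice_sample(n = 6000) %>%
  ungroup()

# 75/25 train/test split (13,500 / 4,500 observations)
eq_split <- initial_split(eq_sub, prop = 0.75, strata = damage_grade)
eq_train <- training(eq_split)
eq_test  <- testing(eq_split)

# 10-fold cross-validation with 3 repeats, stratified by damage_grade
eq_folds <- vfold_cv(eq_train, v = 10, repeats = 3, strata = damage_grade)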

    Model development

Model development was split into an initial development stage and a secondary model refinement stage. The process used in each is described in further detail in Section 5 and Section 6, but each stage featured the following models and feature engineering steps:

Model Development Summary

Initial Stage

Models
• Random Forest
• Boosted Tree
• SVM Polynomial Function
• SVM Radial Basis Function
• Elastic Net (Multinomial Logistic Regression)
• Neural Network
• MARS
• Ensemble

Recipes
• Kitchen Sink
• Random Forest Variable Selection
• Parametric Feature Engineering
• Non-parametric Feature Engineering

Model Refinement

Models
• Random Forest
• Boosted Tree
• Ensemble

Recipes
• Feature Engineering (Parametric or Non-parametric)
• PCA Feature Engineering (Parametric or Non-parametric)
• Tuned step_other (Parametric or Non-parametric)

    Initial stage

    Recipes and Feature Engineering

The initial stage of work on this project featured experimental feature engineering and exploration of several model types. First, we developed the following recipes:

    Table 1: Initial Stage Recipes
    Recipe Shorthand Description
    Kitchen Sink KS Predicts damage grade with all predictor columns. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors
    Variable Selection VS Predicts damage grade using predictors with a variable importance score >60 as determined by a random forest selection process. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.
    Feature Engineering 1 FE Predicts damage grade with all predictors, featuring mean and mode imputation for numeric and nominal predictors, respectively. Handles novel and missing factor levels by relegating them to new, specified categories. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.
    Feature Engineering 2 FE2 Predicts damage grade with all predictors except some id variables with a large number of levels. Removes explicit missingness encoded as numerical values in the building age variable and imputes with other highly-correlated predictors. ‘Others’ rare levels contributing to <5% of the dataset. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.

    Figure 3: Histogram of Variable Importance generated using random forest variable selection. Variables with a variable importance score above 60 (red line) were selected for the VS recipe.
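As an illustration of the recipe steps listed in Table 1, a minimal sketch of the kitchen sink (KS) recipe is given below, assuming the hypothetical training tibble eq_train from the splitting sketch above.

# Kitchen sink (KS) recipe: all predictors, dummy-encoded nominal variables,
# near-zero-variance columns removed, numeric predictors normalized.
ks_recipe <- recipe(damage_grade ~ ., data = eq_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())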

    Baseline Model

Our baseline model is a naive Bayes model fit with the kitchen sink recipe.
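A sketch of how such a baseline could be specified and resampled is shown below; it uses the naive_Bayes() specification from the discrim extension package, and relies on the hypothetical ks_recipe and eq_folds objects from the earlier sketches.

library(discrim)  # provides naive_Bayes() for parsnip

nb_spec <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

nb_wflow <- workflow() %>%
  add_recipe(ks_recipe) %>%
  add_model(nb_spec)

# Resampled estimate of the baseline's macro-averaged F-measure
nb_results <- fit_resamples(nb_wflow, resamples = eq_folds,
                            metrics = metric_set(f_meas))
collect_metrics(nb_results)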

    Table 2: Baseline Model Performance (measured by MAFS)
    wflow_id mean std_err total runtime (s)
    nbayes_ks 0.4013995 0.0132339 255

    An average F-measure of 0.40 indicates relatively poor predictive capability from the baseline model, suggesting that fitting more complex models in the pursuit of our prediction problem is warranted.

    Model Performance

    After tuning our more complex models with the recipes in Table 1, the following best-performing models were obtained when scored by the MAFS metric:

    Table 3: Initial Stage Model Performance (measured by MAFS)
    model recipe mean std_err total runtime (s)
    Random Forest FE 0.6704303 0.0025183 10020.862
    Boosted Tree FE2 0.6506890 0.0020727 2642.424
    Elastic Net FE 0.6433707 0.0019431 949.310
    Neural Network VS 0.5915064 0.0027660 553.475
    SVM radial FE2 0.5809771 0.0026705 5913.928
    Random Forest VS 0.5574154 0.0022169 3851.986
    SVM polynomial VS 0.5062007 0.0018512 8343.116
    MARS FE 0.2286743 0.0023838 264.788
    MARS VS 0.2189345 0.0027280 27.932

    Figure 4 depicts the performance of each model using f_meas visually:

    Figure 4: Visual depiction of Initial Model Performance from Autoplot

Not shown above is an ensemble model that we also ran; it is analyzed further below.

Given their prior success in past classwork and projects, we are unsurprised that tree-based models performed best at this stage. We are also pleasantly surprised that the elastic net model performs reasonably well. There appears to be a natural divide around a MAFS score of 0.60 that separates our models into “better” and “worse” groups. The standard errors recorded for each model also suggest that the differences in MAFS observed across models are meaningful. With mean MAFS scores of 0.21-0.22, we are surprised that the MARS models performed so poorly: a MAFS score of about 0.5 is equivalent to random guessing, and anything lower is qualitatively equivalent to “intentionally” selecting the wrong classifications.

We also note that variable selection appears to have worsened model performance compared to our other feature engineering recipes, suggesting that all variables contain information relevant to the prediction problem; we carried this finding into the next stage of model development.
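For reference, the sketch below outlines how the recipes and model specifications could be crossed into a workflow set and tuned over the resamples; the object names (fe_recipe, fe2_recipe, and the two specifications) are hypothetical, and the grid size is illustrative rather than the one we actually used.

# Two illustrative model specifications; tune() marks parameters to be tuned
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

bt_spec <- boost_tree(mtry = tune(), min_n = tune(),
                      trees = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Cross each recipe with each model, then tune every workflow on the folds
all_workflows <- workflow_set(
  preproc = list(ks = ks_recipe, fe = fe_recipe, fe2 = fe2_recipe),
  models  = list(rf = rf_spec, bt = bt_spec)
)

tuned_results <- all_workflows %>%
  workflow_map("tune_grid",
               resamples = eq_folds,
               grid = 5,
               metrics = metric_set(f_meas, roc_auc),
               seed = 301)

rank_results(tuned_results, rank_metric = "f_meas")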

    Model Tuning Parameters

    Following tuning, we found the below hyperparameter values to perform best for each model.

Table 4: Best-Performing Hyperparameters
Model Best hyperparameter values
Random Forest (FE) mtry = 20, min_n = 21
Random Forest (VS) mtry = 3, min_n = 11
Boosted Tree mtry = 5, min_n = 2, trees = 1455, learn_rate = 0.021
SVM Radial cost = 0.863, rbf_sigma = 0.054, margin = 0.046
SVM Polynomial cost = 0.024, degree = 3, scale_factor = 0.052
Elastic Net penalty = 0.077, mixture = 0.778
Neural Network penalty = 0.004, hidden_units = 9
MARS 1 num_terms = 1, prod_degree = 1
MARS 2 num_terms = 1, prod_degree = 1

    Given that our efforts were spread across exploring as many model types and recipes as possible, it is difficult to come to well-supported conclusions about optimal hyperparameter values at this time. However, some loose observations can be made:

• A high mtry and min_n seem to improve random forest performance, judging from the difference in MAFS score between the two random forest models (shown in Table 3); however, the discrepancy may also be due to the variable selection recipe, which we noted previously appears to depress model performance.

• Curiously, the boosted tree model seemed to favor low mtry and min_n values.

    • The elastic net model was able to achieve good results with a low penalty, but high mixture, which loosely supports the idea that variable selection is detrimental to model performance.

    Initial Stage Ensembling

    An ensemble model was also produced, with model members fitted with FE2. Table 5 displays the best MAFS score for each model type included in the ensemble.

    Table 5: Initial Stage Ensemble Member Model Performance (measured by MAFS)
    Model Type mean std_err
    Random Forest 0.6588691 0.0017454
    Boosted Tree 0.6520562 0.0018971
    Elastic Net 0.6442161 0.0020036
    SVM Polynomial 0.5688290 0.0029013

    The natural divide in performance across models found in Table 3 seems to reappear here. Given this, we elected to move forward with our random forest, boosted tree, and ensemble models (excluding SVM Poly as a member) for further refinement.
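A minimal sketch of how such an ensemble can be assembled with the stacks package is shown below; rf_tuned, bt_tuned, and en_tuned are hypothetical tuning results whose predictions and workflows were saved (for example with control_stack_grid()).

library(stacks)

eq_stack <- stacks() %>%
  add_candidates(rf_tuned) %>%   # random forest candidates
  add_candidates(bt_tuned) %>%   # boosted tree candidates
  add_candidates(en_tuned) %>%   # elastic net candidates
  blend_predictions() %>%        # penalized regression picks member weights
  fit_members()                  # refit retained members on the training set

# Inspect which candidates were kept and their stacking coefficients
collect_parameters(eq_stack, "rf_tuned")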

    Model refinement

    The second stage of model refinement featured another round of recipe development and model tuning on the models mentioned above. The primary changes we made to our feature engineering centered around geographic location identifiers included in the raw dataset, which were mostly ignored during the first stage due to the sheer number of levels included in them.

    geo_level_1_id, geo_level_2_id, and geo_level_3_id are obfuscated variables which specify the geographic region each building exists in. The level of specificity increases with each variable, with geo_level_1_id containing 31 levels, geo_level_2_id containing 1428 levels, and geo_level_3_id containing 12568 levels. To reduce tuning times, the latter two variables were removed as predictors in all recipes used during the initial stage. But these variables are likely to contain information highly relevant to improving model accuracy, which is why they became the focus of the model refinement stage.

    Recipes and Feature Engineering

Table 6 summarizes the new recipes developed during this stage. Given that FE2 produced the best-performing models during the initial modeling stage, it was used again during model refinement to set a baseline threshold of performance (now renamed FE3 to distinguish it from its earlier iteration). The remaining recipes also include the steps found in this recipe. Since the primary conclusion from the initial stage was that more information leads to better results, our PCA recipe experiments with capturing interactions between all predictors, and the “Other” recipe tunes the threshold of step_other, which becomes important when considering the additional information contained in the geo ID variables.

    Table 6: Descriptions of Recipes Used in Model Refinement
    Recipe Shorthand Description
    Feature Engineering 3 FE3 Predicts damage grade with all predictors. Geo ID levels were converted to factor type using step_mutate. Removes explicit missingness encoded as numerical values in the building age variable and imputes with other highly-correlated predictors. ‘Others’ rare levels contributing to <5% of the dataset. Dummies nominal predictors, removes near-zero-variance columns, and normalizes all numeric predictors.
    Principal Component Analysis PCA Implements principal component analysis by creating interaction terms between all predictors and then reducing dimensionality by condensing terms into a certain number of components, which was tuned during model tuning. Utilizes all geo ID levels.
    Tuning step_other Other Tunes the threshold for ‘othering’ a level in step_other, as we anticipated that the frequency of the different levels found in the geo ID levels would correlate with importance in prediction. Utilizes all geo ID levels.
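A sketch of the “Other” recipe idea follows, with the step_other threshold exposed to the tuning process; the geo_level_*_id column names come from the dataset, while eq_train and the surrounding steps mirror the hypothetical FE3-style recipe rather than our exact code.

other_recipe <- recipe(damage_grade ~ ., data = eq_train) %>%
  # treat the geo ID variables as (unordered) factors
  step_mutate(geo_level_1_id = as.factor(geo_level_1_id),
              geo_level_2_id = as.factor(geo_level_2_id),
              geo_level_3_id = as.factor(geo_level_3_id)) %>%
  # threshold = tune() lets resampling decide how rare a level must be
  # before it is collapsed into an "other" category
  step_other(all_nominal_predictors(), threshold = tune()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())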

While creating the PCA recipe, we discovered that the notation used to expand all two-way interactions, .*., caused the outcome variable to be included in the interaction terms. This produced models with MAFS and ROC AUC estimates so high (>0.9) that they were inconsistent with what we expected from using PCA with tree-based models. Upon further exploration, we confirmed that the outcome variable had been incorporated into the recipe through step_interact. To address this issue, we used all_predictors() * all_predictors() to expand all two-way interactions.
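The corrected interaction step, together with the PCA step whose number of components is tuned, might look like the sketch below; this is an illustration of the fix described above, not our exact recipe.

# Restricting the interaction expansion to predictors keeps damage_grade out
# of the interaction terms; the earlier ~ .*. formula did not.
pca_recipe <- recipe(damage_grade ~ ., data = eq_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ all_predictors() * all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())  # num_comp tuned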

    In addition, we also adjusted hyperparameter tuning ranges during model refinement. To reiterate, we ran a boosted tree model, random forest model, and an ensemble model with boosted tree, random forest, and elastic net members. Three models of each type were produced, each fitted with a different feature engineering recipe as described above.

    Model Types and Tuning

    Random Forest

Table 7 and Table 8 show the MAFS and roc_auc of the random forest models, along with the average run times for each model. The random forest using the FE3 recipe was the best-performing random forest model. A total of 25 candidate models were run for each random forest workflow.

    Table 7: MAFS and average model run times of random forest models.
    wflow_id .metric mean std_err average runtime (s)
    rf_fe3 f_meas 0.6276903 0.0023860 1708.112
    rf_other f_meas 0.6265786 0.0024843 1708.120
    rf_pca f_meas 0.5903976 0.0021067 421.377
    Table 8: ROC AUC and average model run times of random forest models.
    wflow_id .metric mean std_err average runtime (s)
    rf_fe3 roc_auc 0.8123370 0.0015207 1708.112
    rf_other roc_auc 0.8115979 0.0015000 1708.120
    rf_pca roc_auc 0.7787781 0.0016641 421.377

    Boosted Tree

We ran boosted tree models with each of the three recipes, updating tuning parameter ranges based on the autoplots of the boosted tree model from the initial stage of modeling. These three models were run using the xgboost engine. Given that the model fitted with the “Other” recipe performed best, we also tuned boosted tree models using the lightgbm and catboost engines. Table 9 and Table 10 show the MAFS and ROC AUC of these boosted tree models along with the average run times for each model. The boosted tree model using the catboost engine was the best-performing boosted tree model.

    Table 9: MAFS and average model run times of boosted tree models.
    wflow_id .metric mean std_err average runtime (s)
    bt_catboost_other f_meas 0.6269855 0.0022532 167.2114
    bt_lightgbm_other f_meas 0.6259222 0.0020285 115.0461
    bt_other f_meas 0.6248485 0.0025401 273.4695
    bt_fe3 f_meas 0.6236902 0.0022898 399.4303
    bt_pca f_meas 0.5954955 0.0024196 793.3656
    Table 10: ROC AUC and average model run times of boosted tree models.
    wflow_id .metric mean std_err average runtime (s)
    bt_catboost_other roc_auc 0.8141758 0.0015412 167.2114
    bt_lightgbm_other roc_auc 0.8131996 0.0013899 115.0461
    bt_other roc_auc 0.8118673 0.0014865 273.4695
    bt_fe3 roc_auc 0.8112522 0.0014710 399.4303
    bt_pca roc_auc 0.7891539 0.0016715 793.3656
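As an illustration, the sketch below specifies a boosted tree with the lightgbm engine, which the bonsai extension package registers for boost_tree(); the catboost engine requires a separate extension and is not shown. The recipe, folds, and grid size are the hypothetical objects and values used in the earlier sketches.

library(bonsai)  # registers the "lightgbm" engine for boost_tree()

bt_lightgbm_spec <- boost_tree(mtry = tune(), min_n = tune(),
                               trees = tune(), learn_rate = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")

bt_lightgbm_wflow <- workflow() %>%
  add_recipe(other_recipe) %>%
  add_model(bt_lightgbm_spec)

bt_lightgbm_tuned <- tune_grid(bt_lightgbm_wflow,
                               resamples = eq_folds,
                               grid = 10,
                               metrics = metric_set(f_meas, roc_auc))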

    Ensemble

We created three ensemble models, one for each of the three recipes used in model refinement. Each ensemble model was blended from boosted tree, random forest, and elastic net member models.

    FE3

    Table 11: Best-Performing models of each Type in FE3 Ensemble
    Model Type mean std_err
    rf 0.6276125 0.0023544
    bt 0.6223128 0.0021921
    en 0.6046279 0.0023360

    PCA

    Table 12: Best-Performing models of each Type in PCA Ensemble
    Model Type mean std_err
    en 0.6132573 0.0022513
    rf 0.5906554 0.0022357
    bt 0.5779136 0.0025291

    Other

    Table 13: Best-Performing models of each Type in Other Ensemble
    Model Type mean std_err
    rf 0.6323375 0.0025989
    bt 0.6248109 0.0025596
    en 0.6106990 0.0025337

    Observing Table 11, Table 12, and Table 13 leads to the conclusion that the ensemble models perform best with the Other recipe, in accordance with the boosted tree models but not the random forest models. This might be because most ensembles tended to weigh the boosted tree models more heavily than the other members, as demonstrated in Figure 5.

However, when considered against all models tuned during model refinement, the Other-recipe ensemble still performed best. Thus, we will move forward with it as our final model.

    (a) FE3-tuned ensemble

    (b) PCA-tuned ensemble

    (c) Other-tuned ensemble

    Figure 5: Model weights of ensemble models.

    Final model analysis

    We selected the tuned ensemble model using the Other recipe during model refinement. Table 14 shows the performance metrics of the final ensemble model when fitted to the testing data. The final ensemble model performed better than the estimated performance of each member model as seen in Table 13.

    Table 14: Performance metrics and runtime of final ensemble model using Other recipe.
    metric performance runtime (s)
    f_meas 0.6413545 112.915
    roc_auc 0.8255142 112.915

    Figure 6 displays the confusion matrix of this final model. The model distinguished between damage grade 1 and 3 the best, while damage grade 2 predictions appear to have lower accuracy. The model seemed to distinguish damage grade 1 and 2 better than damage grade 2 and 3.

    Figure 6: Confusion matrix of final ensemble model
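A sketch of how the test-set results in Table 14 and the confusion matrix in Figure 6 can be produced is given below; final_stack stands in for the fitted ensemble and eq_test for the held-out test set, both hypothetical names.

# Attach hard-class and probability predictions from the fitted ensemble
test_preds <- eq_test %>%
  bind_cols(predict(final_stack, new_data = eq_test)) %>%
  bind_cols(predict(final_stack, new_data = eq_test, type = "prob"))

# Test-set MAFS and ROC AUC
f_meas(test_preds, truth = damage_grade, estimate = .pred_class,
       estimator = "macro")
roc_auc(test_preds, truth = damage_grade, .pred_1, .pred_2, .pred_3)

# Confusion matrix (Figure 6)
conf_mat(test_preds, truth = damage_grade, estimate = .pred_class) %>%
  autoplot(type = "heatmap")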

    While the final model’s MAFS appears relatively low, its corresponding ROC AUC score of 0.826 reflects fairly good performance. On the other hand, it is interesting to note that while the final model performed significantly better than the baseline model (MAFS = 0.401), it did not outperform the estimated performance of the random forest model built during stage 1 fitted with minimal feature engineering (MAFS = 0.67), which did not discard outlier building age values and treated geo ID variables as continuous numeric variables.

In terms of the confusion matrix, the largest values were 1,106, 771, and 1,005, which correspond to the number of buildings correctly classified as damage grades 1, 2, and 3, respectively.

Regarding the age variable, perhaps the unusual data points we observed were not mistakes in data entry after all, which may explain the underperformance of models built during model refinement. Furthermore, if the geo ID variables were in fact encoded on an ordered scale, it may have been more effective to leave them as continuous numeric variables (as the initial random forest model did) rather than converting them to unordered factor variables, as we did during model refinement. However, we are not given further details about the meaning of each level in the geo ID variables, so this claim remains unverified.

    Regardless of the performance of the final model, the development of this final model offered a good exercise in data exploration and critical thinking in feature engineering, which were undoubtedly useful even given the limitations on what we know about the data.

    Conclusion

To summarize, in this project we utilized information on buildings struck by the Gorkha earthquake to build a model that predicts the damage a building sustained, encoded in the variable damage_grade, from its structural characteristics. Using the Macro-Averaged F1 Score (MAFS) metric, we found that an ensemble model combining tree-based models (random forest and boosted tree) and multinomial regression performed best, achieving a final MAFS score of 0.641 on the testing set. We found that variable selection causes models to underperform, while our recipe that tunes the threshold level in step_other boosts model performance. This suggests that while most variables are important, not all levels within each variable are.

An immediate next step would be to repeat the analysis on the remainder of the data that went unused due to downsampling, or to acquire greater computational resources to better incorporate the full dataset into our modeling pipeline. We could also further adjust the tuning parameters during model refinement: if the best model used a parameter value at the edge of a given range, we could expand that range. In addition, a further area of research is to determine what some of the obfuscated variable levels represent, which could inform improved feature engineering. While the additional feature engineering we undertook during model refinement was not particularly helpful here, the competition leaderboard shows that most competitors achieved micro-averaged F-scores of 0.7 or above, suggesting that further engineering can be conducted and will be useful.

    References

    1. Jules. (2022). Earthquake Richter Prediction Competition. Kaggle. https://kaggle.com/competitions/earthquake-richter-prediction-competition