Birthweight Prediction
Data Science 2 with R (STAT 301-2)
Introduction
The developmental origins of health and disease (DOHaD) is a framework that seeks to link exposures to stressors during windows of susceptibility to the development of non-communicable diseases later in life (Lacagnina 2019). Birthweight has historically been used as an indicator of adverse early life exposures to stressors due to the relative ease of collecting birthweight data (Hanson 2015). Low birthweight is a risk factor for developing non-communicable diseases later in life such as cardiovascular diseases, metabolic diseases, and kidney diseases (Bianchi and Restrepo 2022).
Objective
The objective of this prediction problem is to create a model that predicts the birthweight of an infant given a set of characteristics. This is a regression problem. A prediction model for this problem can provide insights toward eventually developing an inferential question that identifies the most important factors associated with birthweight.
A classification problem to predict whether or not an infant would be born with low birthweight was not feasible with the dataset used because of high class imbalance. Only about 10% of observations in the dataset would have been classified as low birthweight, defined as a birthweight under 2,500 grams (Hughes, Black, and Katz 2017).
Data Source
I downloaded the Kaggle dataset, US births (2018) (Amol Deshmukh 2020). The creator of this dataset subsetted this data from the raw 2018 natality file available from the Vital Statistics Online Data Portal (National Center for Health Statistics 2023).
I chose this specific dataset because I searched Kaggle for a dataset relating to birthweight and found the Kaggle competition, Prediction interval competition I: Birthweight, by Carl McBride Ellis. Ellis cited the US births (2018) dataset as its source. I downloaded the US births (2018) dataset to have access to the full dataset and more control over the number of observations used for exploratory data analysis, the training data, and the testing data.
The prediction interval competition identified 39 predictors to use; however, due to computational limitations, I selected 20 variables from this list. I kept only variables describing conditions related to the pregnancy (for example, weight gain during pregnancy) and dropped variables describing the circumstances of the birth (for example, the type of facility in which the birth occurred).
Data Overview
The US births (2018) dataset has over 3 million observations. Due to computational limitations, I subsetted 30,000 observations, using 15,000 for an exploratory data analysis and 15,000 for model fitting and testing.
Because I had the ability to subset from over 3 million observations, I was able to filter out incomplete observations before randomly sampling a total of 30,000 observations. Therefore, I do not have any major missingness issues. Table 1 shows the results of the missingness analysis. This analysis includes the observations used in exploratory data analysis, model fitting and tuning, and testing. Variables are ordered from the greatest to the least number of missing observations, with only the first five variables shown.
skim_variable | n_missing | complete_rate |
---|---|---|
ilp_r | 368 | 0.9877333 |
illb_r | 367 | 0.9877667 |
precare | 308 | 0.9897333 |
feduc | 0 | 1.0000000 |
meduc | 0 | 1.0000000 |
Although there are not major missingness issues, Table 1 shows that the interval since last pregnancy (ilp_r), the interval since last birth (illb_r), and the month of pregnancy in which the mother began prenatal care (precare) have several missing values. I created this missingness because the original data collection encoded multiple types of information under the same variable.
For the intervals since last pregnancy and last birth, if the mother had a plural delivery, the interval since the last birth was encoded as the number of infants (0 to 3) the mother had at the time of birth. From this information, I created a logical variable to identify whether or not the birth was a plural delivery and recoded the interval since last birth and last pregnancy as missing. Similarly, if this was the mother's first birth or first pregnancy, the interval since last birth and last pregnancy was encoded as 888. From this information, I created two logical variables to identify whether or not this was the mother's first birth or first pregnancy and recoded the interval since last birth and last pregnancy as missing.
For the month of pregnancy in which the mother began prenatal care, 00 was recorded if the mother did not receive any prenatal care. From this information, I created a logical variable to identify whether or not the mother received prenatal care and recoded the month of pregnancy the mother began prenatal care as missing.
I could not filter out these observations because of the information they contained, even as "missing" values. After creating the new logical variables, there were a total of 24 predictors in the dataset used. All other variables are complete.
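As a rough illustration of this recoding, the sketch below shows one way to derive the logical flags and recode the sentinel values. The raw data frame name (births_raw) is a placeholder, and the sketch assumes the columns ilp_r, illb_r, and precare are numeric; the actual script may differ.

```r
library(dplyr)

# Minimal sketch of the recoding described above (assumes numeric columns)
births <- births_raw |>
  mutate(
    plural_delivery  = illb_r %in% 0:3,   # plural deliveries encoded as number of infants (0-3)
    first_birth      = illb_r == 888,     # 888 = no previous birth
    first_pregnancy  = ilp_r == 888,      # 888 = no previous pregnancy
    no_prenatal_care = precare == 0,      # 00 = mother received no prenatal care
    # Recode the sentinel values as true missingness
    illb_r  = if_else(plural_delivery | first_birth, NA_real_, illb_r),
    ilp_r   = if_else(plural_delivery | first_pregnancy, NA_real_, ilp_r),
    precare = if_else(no_prenatal_care, NA_real_, precare)
  )
```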
A univariate analysis of birthweight across the 30,000 randomly sampled observations shows a fairly normal distribution, so no transformation is needed to use birthweight in this prediction problem. Figure 1 shows the distribution of birthweights, highlighting outliers for babies that were unusually light or unusually heavy. This distribution is expected given the physiological constraints on both very small and very large infants.
I performed a short exploratory data analysis using the 15,000 observations set aside for this purpose. This analysis was used to inform both the base recipe and feature engineering in a reduced recipe.
I conducted bivariate analyses between each predictor and birthweight. Only a few of the predictors showed interesting relationships with birthweight; these relationships are shown in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, and Figure 7.
I also explored correlations between numeric predictors to identify collinearity, shown in Figure 8. A few predictors are highly correlated with each other, such as the age of the mother and the age of the father, as well as the number of prior births still alive and the intervals since the last birth and last pregnancy. I do not address this collinearity in the basic recipe because the predictors are not perfectly collinear. However, the reduced recipe does not include the age of the father or the number of prior births still alive, so collinearity is not an issue there.
Methods
Data Splitting
The 15,000 observations used for this prediction problem were split 75/25 between training and testing data, stratified by birthweight.
I used v-fold cross-validation because I have a fairly large dataset and do not need to sample with replacement as bootstrapping does. I resampled with 4 folds and 3 repeats, so each workflow was fit 12 times to produce a metric estimate and standard error. In each iteration of a workflow, about 11,250 observations were used to train the model and about 3,750 observations were used to estimate its performance. I used few folds and repeats because I was running into issues with computation time and computer battery usage.
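A minimal sketch of this splitting and resampling scheme is shown below; the data frame and outcome names (births_model, bwt) and the seed are placeholders, not the actual values used.

```r
library(tidymodels)
set.seed(301)  # placeholder seed

# 75/25 training/testing split, stratified by birthweight
births_split <- initial_split(births_model, prop = 0.75, strata = bwt)
births_train <- training(births_split)
births_test  <- testing(births_split)

# 4 folds x 3 repeats = 12 resamples per workflow
births_folds <- vfold_cv(births_train, v = 4, repeats = 3, strata = bwt)
```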
Recipes
I created two distinct recipes in addition to a null recipe: a basic recipe and a reduced recipe that focused only on the predictors identified in the exploratory data analysis as having an interesting relationship with birthweight.
Null Recipe
A null recipe was defined using all predictor variables and dummy coding all nominal predictors.
Basic Recipe
The basic recipe is a kitchen sink recipe. The variables interval since last pregnancy and interval since last birth have NA values for plural deliveries; I changed these NA values to 0. For mothers who did not receive any prenatal care, I changed the NA values for the month prenatal care began to 10, effectively placing them beyond the latest possible starting month, because imputing these values did not make sense. I removed variables that were exact linear combinations of other variables and variables with zero variance. For the recipe used with parametric models, I dummy coded all nominal predictors and centered and scaled all numeric predictors. For nonparametric models, I used one-hot encoding for the nominal predictors instead.
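A minimal sketch of the parametric version of this recipe is given below, assuming the outcome column is bwt and the training data is births_train; the exact step order in the actual script may differ.

```r
recipe_basic <- recipe(bwt ~ ., data = births_train) |>
  step_mutate(
    ilp_r   = if_else(is.na(ilp_r), 0, ilp_r),      # plural deliveries -> 0
    illb_r  = if_else(is.na(illb_r), 0, illb_r),
    precare = if_else(is.na(precare), 10, precare)  # no prenatal care -> month 10
  ) |>
  step_zv(all_predictors()) |>                      # drop zero-variance predictors
  step_dummy(all_nominal_predictors()) |>           # one_hot = TRUE for the tree-based version
  step_lincomb(all_numeric_predictors()) |>         # drop exact linear combinations
  step_normalize(all_numeric_predictors())          # center and scale (parametric models only)
```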
Reduced Recipe
For the second recipe, I selected only the variables that had interesting relationships with birthweight in the exploratory data analysis and variables that have been shown to be predictive of birthweight in the literature. These variables were whether or not the birth was a plural delivery, the number of cigarettes smoked daily before pregnancy, the mother's height, the mother's weight before pregnancy, the number of prenatal visits, the mother's weight gain during pregnancy, whether the birth was the mother's first birth or first pregnancy, whether the mother received any prenatal care, and the mother's age. Within the recipe, I also operationalized prenatal care as whether or not care began in the first trimester and created a new variable identifying mothers in a higher-risk age group (teenage or over 35) (Da Silva et al. 2003; Restrepo-Méndez et al. 2015). For parametric models, interaction terms were created between starting prenatal care early and the number of prenatal visits, between weight gain and pre-pregnancy weight, between plural delivery and weight gain, between cigarettes and weight gain, and between the higher-risk age group and the number of prenatal visits. A square root transformation was also applied to all numeric predictors due to slight skewness. As with the basic recipe, I removed variables with exact linear correlations and predictors lacking any variance, and made the appropriate adjustments between the parametric and nonparametric versions of the recipe.
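The feature engineering portion of this recipe could be sketched roughly as below. The data frame name (births_reduced), the outcome (bwt), and the column names (precare, mager, previs, wtgain, pwgt_r, cig_0, plural_delivery) are assumptions based on the natality file naming and may not match the actual script.

```r
recipe_reduced <- recipe(bwt ~ ., data = births_reduced) |>
  step_mutate(
    early_care = as.numeric(!is.na(precare) & precare <= 3),  # care began in first trimester
    risk_age   = as.numeric(mager < 20 | mager >= 35),        # teenage or over-35 mothers
    plural     = as.numeric(plural_delivery)                  # logical flag as 0/1
  ) |>
  step_sqrt(all_numeric_predictors()) |>    # address slight skewness
  step_dummy(all_nominal_predictors()) |>
  step_interact(~ early_care:previs + wtgain:pwgt_r + plural:wtgain +
                  cig_0:wtgain + risk_age:previs) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
```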
Model Types and Tuning Parameters
For all models that required tuning, I used Latin hypercube sampling to obtain a near-random sample across the tuning parameters. This approach samples the tuning space nearly at random while ensuring better overall coverage of the space than a purely random grid.
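The general pattern used for these tuned models, sketched below with placeholder names (wflow for a workflow and births_folds for the resamples), was to extract the tuning parameters, generate a Latin hypercube grid, and tune across the resamples.

```r
# Extract the tunable parameters from a workflow and build a space-filling grid
params <- extract_parameter_set_dials(wflow)
grid   <- grid_latin_hypercube(params, size = 50)  # size varied by model type

# Fit every candidate workflow across the 12 resamples
results <- tune_grid(
  wflow,
  resamples = births_folds,
  grid      = grid,
  metrics   = metric_set(rmse)
)
```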
Null: A null model was defined as a baseline, uninformative model using the null recipe and the parsnip engine. For a regression problem like predicting birthweight, it simply predicts the average of the outcome variable. The RMSE of this model was calculated to compare against the other models and determine whether creating more complex models is worthwhile.
Ordinary Linear Regression: A simple linear regression model was fit using the null recipe and the lm engine.
Elastic Net (en and en2): Penalized regression models were tuned using the parametric versions of the basic and reduced recipes and the glmnet engine. The hyperparameters I tuned were penalty and mixture. As the penalty increases, predictors that contribute little to the target variable are penalized by pushing their coefficients toward zero. When mixture is 1, the model is a lasso regression, and when mixture is 0, the model is a ridge regression. I tuned penalty between -10 and 0 on a log10 scale and mixture between 0 and 1. These are the default tuning ranges for these hyperparameters, but they had to be set manually due to an error in the default settings used in the function.
Since elastic net models fit fairly quickly, I fit 500 unique workflows for each recipe.
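A sketch of this specification, with the ranges set explicitly as described, is shown below; object names are placeholders.

```r
en_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

en_params <- extract_parameter_set_dials(en_spec) |>
  update(
    penalty = penalty(range = c(-10, 0)),  # log10 scale
    mixture = mixture(range = c(0, 1))
  )

en_grid <- grid_latin_hypercube(en_params, size = 500)  # 500 workflows per recipe
```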
K Nearest Neighbor (knn and knn2): K-nearest neighbor models were tuned using the parametric versions of the basic and reduced recipes and the kknn engine. The only hyperparameter I tuned was the number of neighbors, that is, the number of nearest training observations used to predict birthweight. I tuned the number of neighbors between 1 and 500.
For this model type only, I used a regular grid with 250 levels.
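A minimal sketch of this specification and grid, again with placeholder object names:

```r
knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Regular grid: 250 evenly spaced values between 1 and 500 neighbors
knn_grid <- grid_regular(neighbors(range = c(1, 500)), levels = 250)
```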
Random Forest (rf and rf2): Random forest models were tuned using the nonparametric versions of the basic and reduced recipes and the ranger engine. The hyperparameters I tuned were the number of sampled predictors, the number of trees, and the minimum number of data points required to split a node. For the basic recipe, I tuned the number of sampled predictors between 1 and 30, and for the reduced recipe, between 1 and 9; these upper bounds represent about 80% and about 60% of the total predictors, respectively. For both recipes, I tuned the number of trees between 1 and 3000 and the minimum number of data points to split between 1 and 50.
Since tree based models take longer to fit, I fit 50 unique workflows for each recipe.
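A sketch of the random forest specification for the basic recipe follows; the reduced recipe would use mtry between 1 and 9.

```r
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")

rf_params <- extract_parameter_set_dials(rf_spec) |>
  update(
    mtry  = mtry(range = c(1, 30)),    # number of sampled predictors
    trees = trees(range = c(1, 3000)),
    min_n = min_n(range = c(1, 50))    # minimum points required to split a node
  )

rf_grid <- grid_latin_hypercube(rf_params, size = 50)
```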
Boosted Tree (bt and bt2): Boosted tree models were tuned using the nonparametric versions of the basic and reduced recipes and the xgboost engine. The hyperparameters I tuned were the same as for the random forest, with the addition of the learning rate. I tuned the learning rate between -5 and -0.2 on a log10 scale because it seemed like a reasonable range given previous examples in class.
Since tree based models take longer to fit, I fit 50 unique workflows for each recipe.
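A sketch of the boosted tree specification for the basic recipe, under the same placeholder naming:

```r
bt_spec <- boost_tree(mtry = tune(), trees = tune(), min_n = tune(),
                      learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

bt_params <- extract_parameter_set_dials(bt_spec) |>
  update(
    mtry       = mtry(range = c(1, 30)),
    trees      = trees(range = c(1, 3000)),
    min_n      = min_n(range = c(1, 50)),
    learn_rate = learn_rate(range = c(-5, -0.2))  # log10 scale
  )

bt_grid <- grid_latin_hypercube(bt_params, size = 50)
```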
Neural Network (nnet and nnet2): Single-layer neural network models were tuned using the parametric versions of the basic and reduced recipes and the nnet engine. Although neural network models are nonparametric, the preprocessing required aligned with that of the parametric models. The hyperparameters I tuned were the number of units in the hidden layer, the amount of weight decay, and the number of training iterations (epochs). Since I am not too familiar with the specifics of these tuning parameters, I did not update their default tuning ranges.
Since the neural network model seemed to fit faster than tree based models, but slower than linear models, I fit 100 unique workflows for each recipe.
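A sketch of this specification using the default tuning ranges:

```r
nnet_spec <- mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) |>
  set_engine("nnet") |>
  set_mode("regression")

# Default ranges are kept, so the parameter set needs no update() call
nnet_grid <- grid_latin_hypercube(
  extract_parameter_set_dials(nnet_spec),
  size = 100  # 100 workflows per recipe
)
```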
Selection Metric
The metric I used to identify the best model is root mean square error (RMSE). I used RMSE over mean absolute error (MAE) partly because it is one of the default metrics used in regression prediction problems, but also because I would like to weigh larger errors more heavily in the performance estimate. This is because babies with a very low or very high birthweight can face birth complications, so highly inaccurate predictions would fail to identify risks of low or high birthweight, which can have significant health impacts.
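For reference, the two metrics differ only in how the residuals are aggregated; squaring the residuals is what makes RMSE weigh large errors more heavily than MAE:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\]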
Model Building and Selection
Table 2 displays the RMSE of both the null and ordinary linear regression models. The RMSE of the linear regression model is significantly lower than that of the null model, indicating that a fitted model should be more predictive of birthweight than simply using the average birthweight. Since the ordinary linear regression model is a baseline model, I should aim for a final model with an RMSE lower than 526 grams.
model | .metric | mean | n | std_err |
---|---|---|---|---|
null | rmse | 572.0364 | 12 | 1.196622 |
lm | rmse | 526.3188 | 12 | 1.583852 |
Best Performing Workflow Hyperparameters
Table 3 shows the values of the hyperparameters for the best performing workflow for each model type and recipe combination. These hyperparameters are the combination tested during the Latin hypercube search that resulted in the lowest RMSE. The hyperparameters themselves only describe the specific workflow used and do not contain information about how the workflow performed compared to other model types.
model | mtry | trees | min_n | learn_rate | penalty | mixture | neighbors | hidden_units | epochs |
---|---|---|---|---|---|---|---|---|---|
bt | 13 | 2458 | 25 | 0.0022986 |  |  |  |  |  |
bt2 | 3 | 2857 | 41 | 0.0025347 |  |  |  |  |  |
en |  |  |  |  | 0.0692981 | 0.9980920 |  |  |  |
en2 |  |  |  |  | 0.0628618 | 0.3803731 |  |  |  |
knn |  |  |  |  |  |  | 95 |  |  |
knn2 |  |  |  |  |  |  | 129 |  |  |
nnet |  |  |  |  | 0.0000976 |  |  | 5 | 503 |
nnet2 |  |  |  |  | 0.0000000 |  |  | 8 | 349 |
rf | 11 | 2490 | 34 |  |  |  |  |  |  |
rf2 | 3 | 2900 | 41 |  |  |  |  |  |  |
Best Performing Workflow Results
Table 4 shows the RMSE and standard error across folds and repeats for each model type and model recipe. For the boosted tree, elastic net, and random forest models, the reduced recipe performed significantly worse than the basic recipe. For the k nearest neighbor models, the reduced recipe performed significantly better than the basic recipe. For the neural network models, both recipes performed similarly.
wflow_id | .metric | mean | n | std_err |
---|---|---|---|---|
bt | rmse | 517.1398 | 12 | 1.499850 |
bt2 | rmse | 528.7085 | 12 | 1.517389 |
en | rmse | 527.6712 | 12 | 1.382197 |
en2 | rmse | 535.8793 | 12 | 1.336479 |
knn | rmse | 544.6537 | 12 | 1.476103 |
knn2 | rmse | 536.3700 | 12 | 1.390473 |
nnet | rmse | 547.9800 | 12 | 2.740920 |
nnet2 | rmse | 546.5518 | 12 | 1.671753 |
rf | rmse | 521.7891 | 12 | 1.429537 |
rf2 | rmse | 530.9761 | 12 | 1.347071 |
Figure 9 shows the RMSE of each model and recipe combination by workflow rank. Only the random forest and boosted tree models using the first recipe were significantly better than the baseline model. The elastic net model using the first recipe and the boosted tree model using the second recipe were similar to the baseline model. All other models performed worse than the baseline model.
From this analysis, I identified the boosted tree model using the basic recipe as the best model.
Further Tuning
New Boosted Tree Models
To further improve the model, I explored using a different engine to fit the boosted tree model and searching tuning parameters within a smaller range.
For the different engine, I tuned boosted tree models using the basic recipe and the lightgbm engine instead of the xgboost engine. I used the same tuning parameters and ranges as the model tuning used to tune the boosted tree models with the xgboost engine.
To narrow the hyperparameter tuning ranges, I started from the hyperparameters of the best boosted tree workflow and looked for important trends across the hyperparameters of each workflow. I used a random grid to tune 100 workflows within the narrowed hyperparameter ranges.
Table 5 shows the hyperparameters of the best boosted tree workflow, around which I defined a general range for each hyperparameter. I narrowed the number of sampled predictors to between 10 and 15, the number of trees to between 2000 and 3000, and the minimum number of data points to split a node to between 20 and 30.
mtry | trees | min_n | learn_rate |
---|---|---|---|
13 | 2458 | 25 | 0.0022986 |
Figure 10 shows the performance of each hyperparameter across the workflows. Although there is no clear pattern for the number of sampled predictors, the number of trees, or the number of data points to split, there is a clear pattern for the learning rate. From Figure 10, I narrowed the learning rate to between -2.8 and -2 on a log10 scale.
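A sketch of this narrowed search (bt4), under the same placeholder naming as the earlier boosted tree specification:

```r
bt4_params <- extract_parameter_set_dials(bt_spec) |>
  update(
    mtry       = mtry(range = c(10, 15)),
    trees      = trees(range = c(2000, 3000)),
    min_n      = min_n(range = c(20, 30)),
    learn_rate = learn_rate(range = c(-2.8, -2))  # log10 scale
  )

# Random grid of 100 candidate workflows within the narrowed ranges
bt4_grid <- grid_random(bt4_params, size = 100)
```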
Comparing Boosted Tree Models
After tuning the boosted tree model with the lightgbm engine and within the narrower tuning ranges, I compared the boosted tree models using the basic recipe (bt) and reduced recipe (bt2) with the xgboost engine, the boosted tree model using the lightgbm engine (bt3), and the boosted tree model with narrowed tuning ranges (bt4). Figure 11 shows that changing the engine and tuning within a narrower range did not significantly reduce the RMSE of the best model, but both still performed better than the best boosted tree model fitted using the reduced recipe. Nevertheless, tuning within a narrower range did marginally reduce the average RMSE across the folds and repeats.
Final Model Selection
The final model I selected was the boosted tree model with narrowed tuning ranges using the basic recipe and the xgboost engine (bt4 in Figure 11). The tuning parameters of this model are 10 for the number of sampled predictors, 2645 for the number of trees, 29 for the number of data points to split, and 0.00315 for the learning rate.
I was surprised that this model won, or rather, that the boosted tree model with the reduced recipe did not win. I expected the reduced recipe to perform better than the basic recipe because the exploratory data analysis did not show particularly interesting relationships between many predictors and birthweight. I initially thought that removing these predictors would remove some of the noise they could be introducing into the model, but it appears that the variables I removed are in fact important in building a better model.
Final Model Analysis
I finalized the best boosted tree model and trained it using the full training set and then used this model to create predictions using the testing set.
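A minimal sketch of this finalization step, assuming bt4_wflow and bt4_results hold the selected workflow and its tuning results and births_split is the initial split:

```r
best_params <- select_best(bt4_results, metric = "rmse")
final_wflow <- finalize_workflow(bt4_wflow, best_params)

# Fit on the full training set and evaluate once on the testing set
final_fit <- last_fit(final_wflow, split = births_split,
                      metrics = metric_set(rmse, rsq, mae))

collect_metrics(final_fit)
```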
Table 6 shows the RMSE, \(R^2\), and mean absolute error of the predictions. The RMSE on the testing set was more than one standard error lower than the average RMSE across the resamples of the best boosted tree model (Figure 11). This suggests that my resampling structure (12 resamples in total) may have been too small to produce a precise performance estimate. Achieving a lower RMSE on the testing set was likely by chance, and the model could just as easily have had an RMSE more than one standard error higher than the average RMSE across the resamples. The \(R^2\) of the model is very low, showing that the model does not explain much of the variation seen in birthweight. The mean absolute error is much lower than the RMSE, showing that, on average, the predictions are off from the actual birthweight by about 385 grams.
.metric | .estimate |
---|---|
rmse | 514.125270 |
rsq | 0.225763 |
mae | 385.190883 |
Figure 12 clearly demonstrates how the final model was unable to capture the full extent of the variation seen in the actual birthweights. While the actual birthweights range from less than 500 grams to over 5,000 grams, the predicted birthweights are confined to a range from just under 2,000 grams to around 4,000 grams. Despite my choice to use RMSE as the metric to more heavily penalize large deviations during the model competition stage, the final model predictions still showed significant error relative to the actual birthweights.
Although this model performed significantly better than both the null model and the baseline linear regression model, it is not a strong model for predicting birthweight. Only about 54% of the predicted observations were within 10% of the actual birthweight. Additionally, of the 290 low birthweight births in the testing set (< 2,500 grams), the final model predicted low birthweight for only 43. Thus, significant improvements can be made to better predict birthweight given the data available.
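The checks above could be computed from the test-set predictions roughly as follows, assuming collect_predictions() returns the predicted value in .pred and the actual birthweight in a column named bwt:

```r
final_preds <- collect_predictions(final_fit)

final_preds |>
  summarize(
    pct_within_10     = mean(abs(.pred - bwt) / bwt <= 0.10),  # share within 10% of actual
    actual_low        = sum(bwt < 2500),                       # actual low birthweight births
    low_predicted_low = sum(bwt < 2500 & .pred < 2500)         # of those, predicted as low
  )
```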
Discussion and Conclusion
This project sought to predict birthweight from a set of 24 predictors. A boosted tree model using a basic recipe performed the best during the model competition and was selected for evaluation on the testing set. Although it outperformed all other models, it still did not provide a particularly accurate prediction of birthweight and was unable to predict any outliers in the birthweight data.
Although the two best performing models were nonparametric tree-based models, the ordinary linear regression model did not perform much worse. This suggests that birthweight can be predicted more accurately with nonparametric tree-based models, but further feature engineering, and perhaps simply the inclusion of more predictors, would be required to improve performance. Including more predictors would likely improve the accuracy of the model: I removed seemingly non-significant predictors for the reduced recipe, and models fitted with that recipe performed significantly worse than those using the basic recipe. This indicates that although many of the predictors were not highly correlated with birthweight, the combination of those predictors can significantly improve the accuracy of the model. With more time and computational resources, a next step would be to build a model including all the predictors listed in the Kaggle competition.
This project, however, demonstrated the complexity of the factors affecting birthweight and the potential challenges in transitioning from a prediction problem to an inferential problem. Had the reduced recipe performed better than the basic recipe, the next step of conducting an inferential analysis to determine the strength and significance of each predictor's effect on birthweight would have been straightforward: simply analyze the predictors used in the reduced recipe.
As a future project, I would like to turn this into a classification problem. When I initially began the project, I wanted to predict low birthweight rather than birthweight itself because low birthweight as a category has more implications for the developmental origins of health and disease. When I randomly sampled 30,000 observations and realized there was a massive class imbalance, I switched to a regression problem. However, I think it would be possible to categorize each of the over 3 million observations as low birthweight or not and then use stratified sampling to reduce the class imbalance. While the final model was able to predict some observations with low birthweights, it was generally particularly bad at predicting outliers, including low birthweight births.