Introduction
Maternal health risk remains a significant global public health challenge, particularly in low- and middle-income countries. According to the World Health Organization (WHO, n.d.), approximately 287,000 women died in 2020 from complications related to pregnancy and childbirth, most of which were preventable. Maternal health risks arise from a combination of medical, socioeconomic, and environmental factors that influence both the mother's and the newborn's well-being. Conditions such as preeclampsia, hemorrhage, infection, and obstructed labor account for a large proportion of maternal deaths. Furthermore, limited access to quality antenatal care, skilled birth attendants, and timely emergency obstetric services increases the vulnerability of pregnant women. Understanding the interplay between biological factors, healthcare access, and social determinants is crucial to reducing maternal morbidity and mortality and achieving safer motherhood globally.
Methodology
The study used a systematic approach comprising four distinct stages: data extraction and description, data preparation, data mining, and evaluation.
Data Extraction and Description
The dataset chosen for this study is the ‘Maternal Health Risk’ dataset, extracted from the UC Irvine Machine Learning Repository. The data were collected from hospitals, community clinics, and maternal health care centers in rural Bangladesh through an IoT-based risk monitoring system.
Data source: https://archive.ics.uci.edu/dataset/863/maternal+health+risk
# Define the file path
file_path <- "/Users/thandarmoe/Library/Mobile Documents/com~apple~CloudDocs/teaching/Cloud/me/R-Language/Practice/Dataset/2025/maternal_health_risk.csv"
# Load the CSV file
mh_risk <- read.csv(file_path)
### check the structure
dim(mh_risk)
## [1] 1014 7
The dataset contains 1014 instances, each with one target variable and six clinical features.
str(mh_risk)
## 'data.frame': 1014 obs. of 7 variables:
## $ Age : int 25 35 29 30 35 23 23 35 32 42 ...
## $ SystolicBP : int 130 140 90 140 120 140 130 85 120 130 ...
## $ DiastolicBP: int 80 90 70 85 60 80 70 60 90 80 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 102 98 98 ...
## $ HeartRate : int 86 70 80 70 76 70 78 86 70 70 ...
## $ RiskLevel : chr "high risk" "high risk" "high risk" "high risk" ...
unique(mh_risk$RiskLevel)
## [1] "high risk" "low risk" "mid risk"
The original target column (RiskLevel) has three classes: high risk, low risk, and mid risk.
Age: Integer — Age of the individual in years.
SystolicBP: Integer — Systolic blood pressure (mmHg)
DiastolicBP: Integer — Diastolic blood pressure (mmHg)
BS: Numeric — Blood glucose level, expressed as a molar concentration (mmol/L).
BodyTemp: Numeric — Body temperature in degrees Fahrenheit (F).
HeartRate: Integer — Heart rate in beats per minute (bpm).
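As a quick sanity check on these units, the range of each feature can be inspected directly from the loaded data frame (a minimal sketch, not part of the later modelling pipeline):
### to verify the reported units by inspecting the range of each numeric feature
summary(mh_risk[, c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")])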
Research Questions
In this study, the Maternal Health Risk dataset from the UC Irvine Machine Learning Repository will be used to explore the following research questions:
Data Preparation
To assist with data manipulation, the ‘tidyverse’ R package was loaded.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Missing values were checked, and none were found.
### to check if there are missing values
sapply(mh_risk, function(x) sum(is.na(x)))
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate
## 0 0 0 0 0 0
## RiskLevel
## 0
The structure of the original dataset was checked to confirm that the data types were assigned correctly.
str(mh_risk)
## 'data.frame': 1014 obs. of 7 variables:
## $ Age : int 25 35 29 30 35 23 23 35 32 42 ...
## $ SystolicBP : int 130 140 90 140 120 140 130 85 120 130 ...
## $ DiastolicBP: int 80 90 70 85 60 80 70 60 90 80 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 102 98 98 ...
## $ HeartRate : int 86 70 80 70 76 70 78 86 70 70 ...
## $ RiskLevel : chr "high risk" "high risk" "high risk" "high risk" ...
A correlation plot was generated to identify highly correlated features for potential removal.
### correlation
library(corrplot)
## corrplot 0.95 loaded
# to remove non-numeric columns (e.g., RiskLevel)
mh_risk_num <- mh_risk[sapply(mh_risk, is.numeric)]
cor_mat <- cor(mh_risk_num, use = "pairwise.complete.obs")
corrplot(cor_mat,
         method = "color",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.6,
         diag = FALSE)
### to find maximum correlation
max(abs(cor_mat[upper.tri(cor_mat)]))
## [1] 0.7870065
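To confirm which pair of features produces this maximum, the correlation matrix can be searched directly (a short sketch using the cor_mat object computed above):
### to locate the feature pair with the highest absolute correlation
idx <- which(abs(cor_mat) == max(abs(cor_mat[upper.tri(cor_mat)])), arr.ind = TRUE)
rownames(cor_mat)[idx[1, 1]]
colnames(cor_mat)[idx[1, 2]]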
Systolic and diastolic blood pressure are highly correlated, which can be problematic when running a logistic regression. The two variables were therefore combined into a single clinical variable, mean arterial pressure (MAP), approximated as (SystolicBP + 2 × DiastolicBP) / 3.
mh_risk$MAP <- round((mh_risk$SystolicBP + 2 * mh_risk$DiastolicBP) / 3, 0)
# to create final dataset without SBP and DBP
mh_risk_final <- subset(mh_risk, select = -c(SystolicBP, DiastolicBP))
The target variable (RiskLevel), stored as a character type, was converted to a factor with an explicit level order (low, mid, high risk).
mh_risk_final$RiskLevel <- factor(mh_risk_final$RiskLevel,
                                  levels = c("low risk", "mid risk", "high risk"))
str(mh_risk_final)
## 'data.frame': 1014 obs. of 6 variables:
## $ Age : int 25 35 29 30 35 23 23 35 32 42 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 102 98 98 ...
## $ HeartRate: int 86 70 80 70 76 70 78 86 70 70 ...
## $ RiskLevel: Factor w/ 3 levels "low risk","mid risk",..: 3 3 3 3 1 3 2 3 2 3 ...
## $ MAP : num 97 107 77 103 80 100 90 68 100 97 ...
The target variable was then checked for class imbalance.
### to check the original classes in the target column
table(mh_risk_final$RiskLevel)
##
## low risk mid risk high risk
## 406 336 272
The dataset was then balanced across the three classes using simple random oversampling with replacement.
set.seed(2010)
# to determine target size (largest class)
target_size <- max(table(mh_risk_final$RiskLevel))
# to oversample each class to target size and combine
mh_risk_balanced <- mh_risk_final %>%
group_by(RiskLevel) %>%
slice_sample(n = target_size, replace = TRUE) %>%
ungroup()
# to check new counts
table(mh_risk_balanced$RiskLevel)
##
## low risk mid risk high risk
## 406 406 406
The balanced dataset was split into training (80%) and test (20%) sets.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
### to assign train and test data
set.seed(2010)
### stratified random 80–20 split, keeping the class balance consistent
train_indices <- createDataPartition(mh_risk_balanced$RiskLevel, p = 0.8, list = FALSE)
train_data <- mh_risk_balanced[train_indices, ]
test_data <- mh_risk_balanced[-train_indices, ]
head(train_data,3)
## # A tibble: 3 × 6
## Age BS BodyTemp HeartRate RiskLevel MAP
## <int> <dbl> <dbl> <int> <fct> <dbl>
## 1 40 6.9 98 80 low risk 100
## 2 42 7.7 98 70 low risk 93
## 3 18 6.8 98 76 low risk 80
head(test_data,3)
## # A tibble: 3 × 6
## Age BS BodyTemp HeartRate RiskLevel MAP
## <int> <dbl> <dbl> <int> <fct> <dbl>
## 1 42 7.5 98 70 low risk 93
## 2 17 7.5 102 76 low risk 93
## 3 25 7.5 98 76 low risk 93
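To confirm that the stratified split preserved the class balance, the class proportions in each partition can be compared (a brief check only):
### to verify that both partitions kept the three classes in equal proportion
prop.table(table(train_data$RiskLevel))
prop.table(table(test_data$RiskLevel))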
Data Mining
library(ranger)
Hyper-parameter Tuning
### Random forest key hyperparameters:
### mtry = number of variables randomly sampled at each split
### ntree = number of trees (fixed)
set.seed(2010)
ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(
mtry = c(2, 3, 4, 5),
splitrule = "gini",
min.node.size = 1
)
The results of the hyperparameter tuning can be seen as follows:
rf_tuned <- train(
RiskLevel ~ .,
data = train_data,
method = "ranger",
metric = "Accuracy",
trControl = ctrl,
tuneGrid = grid,
num.trees = 1000,
importance = "impurity"
)
### to view grid search results before final model
### to show all tried mtry values with mean Accuracy and SD across folds
rf_tuned$results
## mtry splitrule min.node.size Accuracy Kappa AccuracySD KappaSD
## 1 2 gini 1 0.8614454 0.7920940 0.04544692 0.06823838
## 2 3 gini 1 0.8614454 0.7921156 0.04341785 0.06517867
## 3 4 gini 1 0.8593835 0.7890040 0.04368394 0.06559453
## 4 5 gini 1 0.8604355 0.7906078 0.03849957 0.05777643
plot(rf_tuned)
rf_tuned$bestTune
## mtry splitrule min.node.size
## 1 2 gini 1
From the above analysis, the optimal mtry is 2. The final model was then trained with this value.
### tuned model
best_mtry <- rf_tuned$bestTune$mtry
final_rf <- ranger(
RiskLevel ~ .,
data = train_data,
mtry = best_mtry,
num.trees = 1000,
importance = "impurity"
)
Interpretation of the Results from Random Forest
The results of the final model are shown below as a compact summary.
### result
print(final_rf)
## Ranger result
##
## Call:
## ranger(RiskLevel ~ ., data = train_data, mtry = best_mtry, num.trees = 1000, importance = "impurity")
##
## Type: Classification
## Number of trees: 1000
## Sample size: 975
## Number of independent variables: 5
## Mtry: 2
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 13.44 %
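For classification forests, ranger also stores an out-of-bag confusion matrix on the fitted object, which gives a class-level view of the 13.44 % OOB error without touching the test set (a minimal sketch):
### overall OOB error and the out-of-bag confusion matrix stored by ranger
final_rf$prediction.error
final_rf$confusion.matrix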
Feature importance is displayed as a graph.
### feature importance from tuned model
importance <- varImp(rf_tuned, scale = TRUE)
plot(importance, top = 5)
print(importance)
## ranger variable importance
##
## Overall
## BS 100.00
## MAP 70.74
## Age 46.83
## HeartRate 20.43
## BodyTemp 0.00
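The same ranking can be cross-checked on the final ranger model, whose raw impurity (Gini) importances are stored on the fitted object; unlike varImp, these are not rescaled to a 0–100 scale (a short sketch):
### impurity-based importance from the final ranger model, unscaled
sort(final_rf$variable.importance, decreasing = TRUE)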
Evaluation
# to get predictions from ranger
pred <- predict(final_rf, data = test_data)$predictions
# to ensure factors match the classes
actual <- factor(test_data$RiskLevel, levels = c("low risk", "mid risk", "high risk"))
predicted <- factor(pred, levels = c("low risk", "mid risk", "high risk"))
# confusion matrix for multiclass
conf_mat_rf_n <- confusionMatrix(predicted, actual)
conf_mat_rf_n
## Confusion Matrix and Statistics
##
## Reference
## Prediction low risk mid risk high risk
## low risk 68 7 0
## mid risk 8 69 3
## high risk 5 5 78
##
## Overall Statistics
##
## Accuracy : 0.8848
## 95% CI : (0.8378, 0.9221)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8272
##
## Mcnemar's Test P-Value : 0.1347
##
## Statistics by Class:
##
## Class: low risk Class: mid risk Class: high risk
## Sensitivity 0.8395 0.8519 0.9630
## Specificity 0.9568 0.9321 0.9383
## Pos Pred Value 0.9067 0.8625 0.8864
## Neg Pred Value 0.9226 0.9264 0.9806
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.2798 0.2840 0.3210
## Detection Prevalence 0.3086 0.3292 0.3621
## Balanced Accuracy 0.8981 0.8920 0.9506
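Beyond overall accuracy, a per-class F1 score can be derived from the sensitivity (recall) and positive predictive value (precision) stored in the byClass slot of the caret confusion matrix (a minimal sketch):
### per-class F1 from recall (sensitivity) and precision (positive predictive value)
stats_n <- conf_mat_rf_n$byClass
f1_n <- 2 * stats_n[, "Sensitivity"] * stats_n[, "Pos Pred Value"] /
  (stats_n[, "Sensitivity"] + stats_n[, "Pos Pred Value"])
round(f1_n, 3)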
The numerical variables in the training data were converted to clinically relevant categorical groups as follows:
### 1. Age categories (5 groups)
train_data$Age <- cut(train_data$Age,
breaks = c(0, 25, 35, 45, 55, 100),
labels = c("0-25","26-35","36-45","46-55","56+"),
right = TRUE)
### 2. MAP categories (hypothetical ranges)
train_data$MAP <- cut(train_data$MAP,
breaks = c(0, 70, 90, 110, 200),
labels = c("Low","Normal","Elevated","High"),
right = TRUE)
### 3. Blood Sugar (BS in mmol/L)
train_data$BS <- cut(train_data$BS,
breaks = c(0, 5.5, 7, 11, 30),
labels = c("Normal","Pre-diabetes","Diabetes","Very High"),
right = TRUE)
### 4. Body Temperature
train_data$BodyTemp <- cut(train_data$BodyTemp,
breaks = c(0, 97, 99, 105),
labels = c("Low","Normal","Fever"),
right = TRUE)
### 5. Heart Rate
train_data$HeartRate <- cut(train_data$HeartRate,
breaks = c(0, 60, 80, 200),
labels = c("Bradycardia","Normal","Tachycardia"),
right = TRUE)
head(train_data,3)
## # A tibble: 3 × 6
## Age BS BodyTemp HeartRate RiskLevel MAP
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 36-45 Pre-diabetes Normal Normal low risk Elevated
## 2 36-45 Diabetes Normal Normal low risk Elevated
## 3 0-25 Pre-diabetes Normal Normal low risk Normal
The numerical variables in the test data were converted using the same categorical groups (a reusable helper that applies identical bins to both partitions is sketched after this code):
### 1. Age categories (5 groups)
test_data$Age <- cut(test_data$Age,
breaks = c(0, 25, 35, 45, 55, 100),
labels = c("0-25","26-35","36-45","46-55","56+"),
right = TRUE)
### 2. MAP categories (hypothetical ranges)
test_data$MAP <- cut(test_data$MAP,
breaks = c(0, 70, 90, 110, 200),
labels = c("Low","Normal","Elevated","High"),
right = TRUE)
### 3. Blood Sugar (BS in mmol/L)
test_data$BS <- cut(test_data$BS,
breaks = c(0, 5.5, 7, 11, 30),
labels = c("Normal","Pre-diabetes","Diabetes","Very High"),
right = TRUE)
### 4. Body Temperature
test_data$BodyTemp <- cut(test_data$BodyTemp,
breaks = c(0, 97, 99, 105),
labels = c("Low","Normal","Fever"),
right = TRUE)
### 5. Heart Rate
test_data$HeartRate <- cut(test_data$HeartRate,
breaks = c(0, 60, 80, 200),
labels = c("Bradycardia","Normal","Tachycardia"),
right = TRUE)
head(test_data,3)
## # A tibble: 3 × 6
## Age BS BodyTemp HeartRate RiskLevel MAP
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 36-45 Diabetes Normal Normal low risk Elevated
## 2 0-25 Diabetes Fever Normal low risk Elevated
## 3 0-25 Diabetes Normal Normal low risk Elevated
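Because the same bins must be applied to both partitions, the repeated cut() calls above can optionally be wrapped in one helper. This is a refactoring sketch only; it reproduces the categories defined above rather than introducing new ones.
### optional refactor: one helper applies the same clinical bins to any data frame,
### so the training and test sets cannot drift out of sync
categorize_features <- function(df) {
  df$Age <- cut(df$Age, breaks = c(0, 25, 35, 45, 55, 100),
                labels = c("0-25", "26-35", "36-45", "46-55", "56+"), right = TRUE)
  df$MAP <- cut(df$MAP, breaks = c(0, 70, 90, 110, 200),
                labels = c("Low", "Normal", "Elevated", "High"), right = TRUE)
  df$BS <- cut(df$BS, breaks = c(0, 5.5, 7, 11, 30),
               labels = c("Normal", "Pre-diabetes", "Diabetes", "Very High"), right = TRUE)
  df$BodyTemp <- cut(df$BodyTemp, breaks = c(0, 97, 99, 105),
                     labels = c("Low", "Normal", "Fever"), right = TRUE)
  df$HeartRate <- cut(df$HeartRate, breaks = c(0, 60, 80, 200),
                      labels = c("Bradycardia", "Normal", "Tachycardia"), right = TRUE)
  df
}
### usage (would replace the two blocks of cut() calls above):
# train_data <- categorize_features(train_data)
# test_data  <- categorize_features(test_data)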
The results of the hyperparameter tuning using the same grid as above can be seen as follows:
rf_tuned2 <- train(
RiskLevel ~ .,
data = train_data,
method = "ranger",
metric = "Accuracy",
trControl = ctrl,
tuneGrid = grid,
num.trees = 1000,
importance = "impurity"
)
### to view grid search results before final model
### to show all tried mtry values with mean Accuracy and SD across folds
rf_tuned2$results
## mtry splitrule min.node.size Accuracy Kappa AccuracySD KappaSD
## 1 2 gini 1 0.6576426 0.4864267 0.05954462 0.08927200
## 2 3 gini 1 0.6852163 0.5277591 0.04606131 0.06913748
## 3 4 gini 1 0.6810819 0.5216032 0.04189491 0.06274777
## 4 5 gini 1 0.6852280 0.5277901 0.04217588 0.06319396
plot(rf_tuned2)
rf_tuned2$bestTune
## mtry splitrule min.node.size
## 4 5 gini 1
Although the grid search selected mtry = 5, its cross-validated accuracy (0.6852) is essentially identical to that of mtry = 3 (0.6852), so the final model was trained with mtry = 3.
### tuned model
final_rf2 <- ranger(
RiskLevel ~ .,
data = train_data,
mtry = 3,
num.trees = 1000,
importance = "impurity"
)
Interpretation of the Model Results
The results of the final model are shown below as a compact summary.
### result
print(final_rf2)
## Ranger result
##
## Call:
## ranger(RiskLevel ~ ., data = train_data, mtry = 3, num.trees = 1000, importance = "impurity")
##
## Type: Classification
## Number of trees: 1000
## Sample size: 975
## Number of independent variables: 5
## Mtry: 3
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 29.13 %
Feature Importance
### feature importance from tuned model
importance2 <- varImp(rf_tuned2, scale = TRUE)
plot(importance2, top = 10)
print(importance2)
## ranger variable importance
##
## Overall
## BSVery High 100.000
## MAPHigh 93.502
## BSPre-diabetes 33.043
## BSDiabetes 23.967
## Age36-45 18.315
## BodyTempNormal 17.887
## Age26-35 17.167
## BodyTempFever 17.117
## MAPElevated 10.963
## HeartRateTachycardia 9.834
## Age46-55 9.488
## MAPNormal 8.879
## HeartRateNormal 5.358
## Age56+ 0.000
Evaluation
# to get predictions from ranger
pred2 <- predict(final_rf2, data = test_data)$predictions
# to ensure factors match the classes
predicted2 <- factor(pred2, levels = c("low risk", "mid risk", "high risk"))
# confusion matrix for multiclass
conf_mat_rf_cat <- confusionMatrix(predicted2, actual)
conf_mat_rf_cat
## Confusion Matrix and Statistics
##
## Reference
## Prediction low risk mid risk high risk
## low risk 52 21 2
## mid risk 21 52 16
## high risk 8 8 63
##
## Overall Statistics
##
## Accuracy : 0.6872
## 95% CI : (0.6249, 0.745)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5309
##
## Mcnemar's Test P-Value : 0.09933
##
## Statistics by Class:
##
## Class: low risk Class: mid risk Class: high risk
## Sensitivity 0.6420 0.6420 0.7778
## Specificity 0.8580 0.7716 0.9012
## Pos Pred Value 0.6933 0.5843 0.7975
## Neg Pred Value 0.8274 0.8117 0.8902
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.2140 0.2140 0.2593
## Detection Prevalence 0.3086 0.3663 0.3251
## Balanced Accuracy 0.7500 0.7068 0.8395
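The drop in performance after categorization can also be examined per class by placing the balanced accuracies of the two random forests side by side (a brief sketch):
### per-class balanced accuracy: numerical vs. categorized random forest
cbind(Numerical   = conf_mat_rf_n$byClass[, "Balanced Accuracy"],
      Categorical = conf_mat_rf_cat$byClass[, "Balanced Accuracy"])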
library(nnet)
The multinomial logistic regression model is built as follows:
# to fit multinomial logistic regression
multi_logit <- multinom(RiskLevel ~ Age + MAP + BS + BodyTemp + HeartRate,
data = train_data)
## # weights: 48 (30 variable)
## initial value 1071.146981
## iter 10 value 753.431893
## iter 20 value 678.247475
## iter 30 value 666.078703
## iter 40 value 665.308794
## final value 665.307635
## converged
# to summarize model
summary(multi_logit)
## Call:
## multinom(formula = RiskLevel ~ Age + MAP + BS + BodyTemp + HeartRate,
## data = train_data)
##
## Coefficients:
## (Intercept) Age26-35 Age36-45 Age46-55 Age56+ MAPNormal
## mid risk 3.211048 1.007202 1.137030 0.2958535 -1.317108 0.7734677
## high risk 2.907014 1.173384 2.855892 0.8843586 -2.188685 1.6347380
## MAPElevated MAPHigh BSPre-diabetes BSDiabetes BSVery High
## mid risk 0.3457094 0.0385041 -4.119198 -5.329920 12.66017
## high risk 0.9050855 20.5583150 -5.829989 -5.678467 14.41547
## BodyTempNormal BodyTempFever HeartRateNormal HeartRateTachycardia
## mid risk 0.5382789 2.672769 -0.2920636 0.08674184
## high risk -0.1422062 3.049220 -1.0029739 0.69355800
##
## Std. Errors:
## (Intercept) Age26-35 Age36-45 Age46-55 Age56+ MAPNormal
## mid risk 0.2114266 0.2249396 0.4086512 0.3385768 0.6078595 0.2784850
## high risk 0.2886584 0.3123522 0.4427882 0.4216133 0.8555429 0.4304935
## MAPElevated MAPHigh BSPre-diabetes BSDiabetes BSVery High
## mid risk 0.2650699 9.085807e-08 0.1498808 0.1411936 0.1249447
## high risk 0.4127502 4.189316e-07 0.2243508 0.1830430 0.1249444
## BodyTempNormal BodyTempFever HeartRateNormal HeartRateTachycardia
## mid risk 0.1612706 0.1865413 0.3844020 0.4536973
## high risk 0.2294854 0.2229980 0.5176514 0.5563245
##
## Residual Deviance: 1330.615
## AIC: 1382.615
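summary() for nnet::multinom reports coefficients and standard errors but no p-values; Wald z-statistics and two-sided p-values can be derived from the two matrices above, and exponentiating the coefficients gives relative risk ratios against the "low risk" baseline (a minimal sketch):
### Wald z-statistics and two-sided p-values for the multinomial model
coefs <- summary(multi_logit)$coefficients
ses   <- summary(multi_logit)$standard.errors
z_stats <- coefs / ses
p_vals  <- 2 * (1 - pnorm(abs(z_stats)))
round(p_vals, 4)
### relative risk ratios relative to the "low risk" reference level
round(exp(coefs), 3)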
Evaluation
# to predict class labels
pred_class <- predict(multi_logit, newdata = test_data)
predicted3 <- factor(pred_class,
levels = c("low risk", "mid risk", "high risk"))
# Confusion matrix
conf_mat_log <- confusionMatrix(predicted3, actual)
conf_mat_log
## Confusion Matrix and Statistics
##
## Reference
## Prediction low risk mid risk high risk
## low risk 56 36 12
## mid risk 25 37 18
## high risk 0 8 51
##
## Overall Statistics
##
## Accuracy : 0.5926
## 95% CI : (0.5279, 0.655)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3889
##
## Mcnemar's Test P-Value : 0.0004769
##
## Statistics by Class:
##
## Class: low risk Class: mid risk Class: high risk
## Sensitivity 0.6914 0.4568 0.6296
## Specificity 0.7037 0.7346 0.9506
## Pos Pred Value 0.5385 0.4625 0.8644
## Neg Pred Value 0.8201 0.7301 0.8370
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.2305 0.1523 0.2099
## Detection Prevalence 0.4280 0.3292 0.2428
## Balanced Accuracy 0.6975 0.5957 0.7901
Conclusion
# random forest (numerical)
conf_mat_rf_n$overall["Accuracy"] * 100
## Accuracy
## 88.47737
# random forest (categorical)
conf_mat_rf_cat$overall["Accuracy"] * 100
## Accuracy
## 68.72428
# logistic regression
conf_mat_log$overall["Accuracy"] * 100
## Accuracy
## 59.25926
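For a side-by-side view, the three test-set accuracies can be collected into one small comparison table (a convenience sketch using the objects created above):
### to gather the three test-set accuracies into a single comparison table
model_comparison <- tibble(
  Model = c("Random forest (numerical)",
            "Random forest (categorical)",
            "Multinomial logistic regression"),
  Accuracy_pct = c(conf_mat_rf_n$overall["Accuracy"],
                   conf_mat_rf_cat$overall["Accuracy"],
                   conf_mat_log$overall["Accuracy"]) * 100
)
model_comparison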
Overall, the Random Forest model trained on the numerical features used these physiological parameters most effectively, classifying participants into their respective risk levels with an accuracy of 88.5%, compared with 68.7% for the Random Forest trained on categorized features and 59.3% for the multinomial logistic regression model.
References
World Health Organization. (n.d.). Maternal health. Retrieved from https://www.who.int/health-topics/maternal-health#tab=tab_1