Introduction

Maternal health risk remains a significant global public health challenge, particularly in low- and middle-income countries. According to the World Health Organization (WHO, n.d.), approximately 287,000 women died in 2020 from complications related to pregnancy and childbirth, most of them preventable. Maternal health risks arise from a combination of medical, socioeconomic, and environmental factors that influence both the mother’s and the newborn’s well-being. Conditions such as preeclampsia, hemorrhage, infection, and obstructed labor contribute to a large proportion of maternal deaths. Furthermore, limited access to quality antenatal care, skilled birth attendants, and timely emergency obstetric services increases the vulnerability of pregnant women. Understanding the interplay between biological factors, healthcare access, and social determinants is crucial to reducing maternal morbidity and mortality and achieving safer motherhood globally.


Methodology

The study used a systematic approach comprising four distinct stages:

  1. data extraction and description,
  2. data preparation,
  3. data mining (classification by random forest and logistic regression), and
  4. evaluation.

Data Extraction and Description

The dataset chosen for this study is the ‘Maternal Health Risk’ dataset, extracted from the UC Irvine Machine Learning Repository. The data were collected from hospitals, community clinics, and maternal health care centers in rural areas of Bangladesh through an IoT-based risk monitoring system.

Data source: https://archive.ics.uci.edu/dataset/863/maternal+health+risk

# Define the file path
file_path <- "/Users/thandarmoe/Library/Mobile Documents/com~apple~CloudDocs/teaching/Cloud/me/R-Language/Practice/Dataset/2025/maternal_health_risk.csv"

# Load the CSV file
mh_risk <- read.csv(file_path)

### check the structure
dim(mh_risk)
## [1] 1014    7

The dataset contains instances from 1,014 participants, with one target variable and six clinical features.

str(mh_risk)
## 'data.frame':    1014 obs. of  7 variables:
##  $ Age        : int  25 35 29 30 35 23 23 35 32 42 ...
##  $ SystolicBP : int  130 140 90 140 120 140 130 85 120 130 ...
##  $ DiastolicBP: int  80 90 70 85 60 80 70 60 90 80 ...
##  $ BS         : num  15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
##  $ BodyTemp   : num  98 98 100 98 98 98 98 102 98 98 ...
##  $ HeartRate  : int  86 70 80 70 76 70 78 86 70 70 ...
##  $ RiskLevel  : chr  "high risk" "high risk" "high risk" "high risk" ...
unique(mh_risk$RiskLevel)
## [1] "high risk" "low risk"  "mid risk"

The original target column (RiskLevel) has three classes: high risk, low risk, and mid risk.

Age: Integer — Age of the individual in years.

SystolicBP: Integer — Systolic blood pressure (mmHg)

DiastolicBP: Integer — Diastolic blood pressure (mmHg)

BS: Numeric — Blood glucose level, expressed as a molar concentration (mmol/L).

BodyTemp: Numeric — Body temperature in degrees Fahrenheit (F).

HeartRate: Integer — Heart rate in beats per minute (bpm).


Research Questions

In this study, the maternal health risk dataset from the UC Irvine Machine Learning Repository is used to explore the following research questions:

  1. Can the selected health-related attributes provided in the dataset be used to predict the risk level of the participants using machine learning models?
  2. Which model performs better: random forest or logistic regression?
  3. Which health-related attributes contribute the most to predicting risk level?

Data Preparation

To assist with data manipulation, the ‘tidyverse’ R package was loaded.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The dataset was checked for missing values, and none were found.

### to check if there are missing values
sapply(mh_risk, function(x) sum(is.na(x)))
##         Age  SystolicBP DiastolicBP          BS    BodyTemp   HeartRate 
##           0           0           0           0           0           0 
##   RiskLevel 
##           0

The structure of the original dataset was checked, and the data types were found to be assigned correctly.

str(mh_risk)
## 'data.frame':    1014 obs. of  7 variables:
##  $ Age        : int  25 35 29 30 35 23 23 35 32 42 ...
##  $ SystolicBP : int  130 140 90 140 120 140 130 85 120 130 ...
##  $ DiastolicBP: int  80 90 70 85 60 80 70 60 90 80 ...
##  $ BS         : num  15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
##  $ BodyTemp   : num  98 98 100 98 98 98 98 102 98 98 ...
##  $ HeartRate  : int  86 70 80 70 76 70 78 86 70 70 ...
##  $ RiskLevel  : chr  "high risk" "high risk" "high risk" "high risk" ...

A correlation plot was visualized to identify highly correlated features for potential removal; only systolic and diastolic blood pressure showed a strong correlation.

### correlation
library(corrplot)
## corrplot 0.95 loaded
# to remove non-numeric columns (e.g., RiskLevel)
mh_risk_num <- mh_risk[sapply(mh_risk, is.numeric)]

cor_mat <- cor(mh_risk_num, use = "pairwise.complete.obs")
corrplot(cor_mat,
         method = "color",      
         type = "upper",        
         tl.col = "black",      
         tl.srt = 45,           
         addCoef.col = "black", 
         number.cex = 0.6,      
         diag = FALSE)          

### to find maximum correlation
max(abs(cor_mat[upper.tri(cor_mat)]))
## [1] 0.7870065

Systolic and diastolic blood pressure are highly correlated (r ≈ 0.79), which would be problematic when running a logistic regression. The two variables were therefore combined into one clinical variable: mean arterial pressure (MAP).

mh_risk$MAP <- round((mh_risk$SystolicBP + 2 * mh_risk$DiastolicBP) / 3, 0)

# to create final dataset without SBP and DBP
mh_risk_final <- subset(mh_risk, select = -c(SystolicBP, DiastolicBP))

The target variable RiskLevel (stored as a character type) was converted to a factor.

mh_risk_final$RiskLevel <- factor(mh_risk_final$RiskLevel,
                                  levels = c("low risk", "mid risk", "high risk"))

str(mh_risk_final)
## 'data.frame':    1014 obs. of  6 variables:
##  $ Age      : int  25 35 29 30 35 23 23 35 32 42 ...
##  $ BS       : num  15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
##  $ BodyTemp : num  98 98 100 98 98 98 98 102 98 98 ...
##  $ HeartRate: int  86 70 80 70 76 70 78 86 70 70 ...
##  $ RiskLevel: Factor w/ 3 levels "low risk","mid risk",..: 3 3 3 3 1 3 2 3 2 3 ...
##  $ MAP      : num  97 107 77 103 80 100 90 68 100 97 ...

The target variable was then checked for class imbalance.

### to check the original classes in target column
table(mh_risk_final$RiskLevel)
## 
##  low risk  mid risk high risk 
##       406       336       272

The dataset was then balanced across the three classes using simple random oversampling with replacement.

set.seed(2010)

# to determine target size (largest class)
target_size <- max(table(mh_risk_final$RiskLevel))

# to oversample each class to target size and combine
mh_risk_balanced <- mh_risk_final %>%
  group_by(RiskLevel) %>%
  slice_sample(n = target_size, replace = TRUE) %>%
  ungroup()

# to check new counts
table(mh_risk_balanced$RiskLevel)
## 
##  low risk  mid risk high risk 
##       406       406       406
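
One step worth noting: because oversampling with replacement happens before the split, duplicated rows can end up in both the training and the test partition. A leakage-free variant (a sketch only, not the procedure used above; `split_then_oversample` is a helper name introduced here, written in base R) splits first and oversamples only the training partition:

```r
# Sketch: stratified split first, then oversample with replacement only the
# training partition, so no duplicated row can leak into the test set.
split_then_oversample <- function(df, target, p = 0.8, seed = 2010) {
  set.seed(seed)
  # stratified split: sample a proportion p of each class for training
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[target]]),
                       function(i) sample(i, floor(p * length(i)))))
  train <- df[idx, , drop = FALSE]
  test  <- df[-idx, , drop = FALSE]
  # oversample each training class up to the largest class size
  n_max <- max(table(train[[target]]))
  rows  <- unlist(lapply(split(seq_len(nrow(train)), train[[target]]),
                         function(i) sample(i, n_max, replace = TRUE)))
  list(train = train[rows, , drop = FALSE], test = test)
}

# splits <- split_then_oversample(mh_risk_final, "RiskLevel")
```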

The balanced dataset was split into training (80%) and test (20%) sets.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
### to assign train and test data
set.seed(2010)    

### stratified random 80–20 split, keeping the class balance consistent
train_indices <- createDataPartition(mh_risk_balanced$RiskLevel, p = 0.8, list = FALSE)

train_data <- mh_risk_balanced[train_indices, ]
test_data  <- mh_risk_balanced[-train_indices, ]

head(train_data,3)
## # A tibble: 3 × 6
##     Age    BS BodyTemp HeartRate RiskLevel   MAP
##   <int> <dbl>    <dbl>     <int> <fct>     <dbl>
## 1    40   6.9       98        80 low risk    100
## 2    42   7.7       98        70 low risk     93
## 3    18   6.8       98        76 low risk     80
head(test_data,3)
## # A tibble: 3 × 6
##     Age    BS BodyTemp HeartRate RiskLevel   MAP
##   <int> <dbl>    <dbl>     <int> <fct>     <dbl>
## 1    42   7.5       98        70 low risk     93
## 2    17   7.5      102        76 low risk     93
## 3    25   7.5       98        76 low risk     93

Data Mining

  1. Random Forest (predictors as numeric)
library(ranger)

Hyper-parameter Tuning

### Random forest key hyperparameters:
### mtry = number of variables randomly sampled at each split
### ntree = number of trees (fixed)


set.seed(2010)

ctrl <- trainControl(method = "cv", number = 10)

grid <- expand.grid(
  mtry = c(2, 3, 4, 5),
  splitrule = "gini",
  min.node.size = 1
)

The results of the hyperparameter tuning can be seen as follows:

rf_tuned <- train(
  RiskLevel ~ ., 
  data = train_data,
  method = "ranger",
  metric = "Accuracy",
  trControl = ctrl,
  tuneGrid = grid,
  num.trees = 1000,
  importance = "impurity"
)

### to view grid search results before final model 
### to show all tried mtry values with mean Accuracy and SD across folds
rf_tuned$results
##   mtry splitrule min.node.size  Accuracy     Kappa AccuracySD    KappaSD
## 1    2      gini             1 0.8614454 0.7920940 0.04544692 0.06823838
## 2    3      gini             1 0.8614454 0.7921156 0.04341785 0.06517867
## 3    4      gini             1 0.8593835 0.7890040 0.04368394 0.06559453
## 4    5      gini             1 0.8604355 0.7906078 0.03849957 0.05777643
plot(rf_tuned)

rf_tuned$bestTune
##   mtry splitrule min.node.size
## 1    2      gini             1

From the above analysis, the optimal mtry is 2.

Then, the final model was trained with the chosen optimal mtry.

### tuned model
best_mtry <- rf_tuned$bestTune$mtry

final_rf <- ranger(
  RiskLevel ~ ., 
  data = train_data, 
  mtry = best_mtry, 
  num.trees = 1000,
  importance = "impurity"
)

Interpretation of the Results from Random Forest

The results of the final model are shown below as a compact summary.

### result
print(final_rf)
## Ranger result
## 
## Call:
##  ranger(RiskLevel ~ ., data = train_data, mtry = best_mtry, num.trees = 1000,      importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  1000 
## Sample size:                      975 
## Number of independent variables:  5 
## Mtry:                             2 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             13.44 %

Feature importance is displayed as a graph.

### feature importance from tuned model
importance <- varImp(rf_tuned, scale = TRUE)
plot(importance, top = 5)

print(importance)
## ranger variable importance
## 
##           Overall
## BS         100.00
## MAP         70.74
## Age         46.83
## HeartRate   20.43
## BodyTemp     0.00

Evaluation

# to get predictions from ranger
pred <- predict(final_rf, data = test_data)$predictions

# to ensure factors match the classes
actual <- factor(test_data$RiskLevel, levels = c("low risk", "mid risk", "high risk"))
predicted <- factor(pred, levels = c("low risk", "mid risk", "high risk"))

# confusion matrix for multiclass
conf_mat_rf_n <- confusionMatrix(predicted, actual)
conf_mat_rf_n
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  low risk mid risk high risk
##   low risk        68        7         0
##   mid risk         8       69         3
##   high risk        5        5        78
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8848          
##                  95% CI : (0.8378, 0.9221)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8272          
##                                           
##  Mcnemar's Test P-Value : 0.1347          
## 
## Statistics by Class:
## 
##                      Class: low risk Class: mid risk Class: high risk
## Sensitivity                   0.8395          0.8519           0.9630
## Specificity                   0.9568          0.9321           0.9383
## Pos Pred Value                0.9067          0.8625           0.8864
## Neg Pred Value                0.9226          0.9264           0.9806
## Prevalence                    0.3333          0.3333           0.3333
## Detection Rate                0.2798          0.2840           0.3210
## Detection Prevalence          0.3086          0.3292           0.3621
## Balanced Accuracy             0.8981          0.8920           0.9506

  2. Random Forest (predictors as categorical)

Numerical variables in the training data were converted to clinically relevant categorical groups as follows:

### 1. Age categories (5 groups)
train_data$Age <- cut(train_data$Age,
                             breaks = c(0, 25, 35, 45, 55, 100),
                             labels = c("0-25","26-35","36-45","46-55","56+"),
                             right = TRUE)

### 2. MAP categories (hypothetical ranges)
train_data$MAP <- cut(train_data$MAP,
                             breaks = c(0, 70, 90, 110, 200),
                             labels = c("Low","Normal","Elevated","High"),
                             right = TRUE)

### 3. Blood Sugar (BS in mmol/L)
train_data$BS <- cut(train_data$BS,
                            breaks = c(0, 5.5, 7, 11, 30),
                            labels = c("Normal","Pre-diabetes","Diabetes","Very High"),
                            right = TRUE)

### 4. Body Temperature
train_data$BodyTemp <- cut(train_data$BodyTemp,
                                  breaks = c(0, 97, 99, 105),
                                  labels = c("Low","Normal","Fever"),
                                  right = TRUE)

### 5. Heart Rate
train_data$HeartRate <- cut(train_data$HeartRate,
                                   breaks = c(0, 60, 80, 200),
                                   labels = c("Bradycardia","Normal","Tachycardia"),
                                   right = TRUE)

head(train_data,3)
## # A tibble: 3 × 6
##   Age   BS           BodyTemp HeartRate RiskLevel MAP     
##   <fct> <fct>        <fct>    <fct>     <fct>     <fct>   
## 1 36-45 Pre-diabetes Normal   Normal    low risk  Elevated
## 2 36-45 Diabetes     Normal   Normal    low risk  Elevated
## 3 0-25  Pre-diabetes Normal   Normal    low risk  Normal

Numerical variables in the test data were converted using the same categorical groupings:

### 1. Age categories (5 groups)
test_data$Age <- cut(test_data$Age,
                             breaks = c(0, 25, 35, 45, 55, 100),
                             labels = c("0-25","26-35","36-45","46-55","56+"),
                             right = TRUE)

### 2. MAP categories (hypothetical ranges)
test_data$MAP <- cut(test_data$MAP,
                             breaks = c(0, 70, 90, 110, 200),
                             labels = c("Low","Normal","Elevated","High"),
                             right = TRUE)

### 3. Blood Sugar (BS in mmol/L)
test_data$BS <- cut(test_data$BS,
                            breaks = c(0, 5.5, 7, 11, 30),
                            labels = c("Normal","Pre-diabetes","Diabetes","Very High"),
                            right = TRUE)

### 4. Body Temperature
test_data$BodyTemp <- cut(test_data$BodyTemp,
                                  breaks = c(0, 97, 99, 105),
                                  labels = c("Low","Normal","Fever"),
                                  right = TRUE)

### 5. Heart Rate
test_data$HeartRate <- cut(test_data$HeartRate,
                                   breaks = c(0, 60, 80, 200),
                                   labels = c("Bradycardia","Normal","Tachycardia"),
                                   right = TRUE)

head(test_data,3)
## # A tibble: 3 × 6
##   Age   BS       BodyTemp HeartRate RiskLevel MAP     
##   <fct> <fct>    <fct>    <fct>     <fct>     <fct>   
## 1 36-45 Diabetes Normal   Normal    low risk  Elevated
## 2 0-25  Diabetes Fever    Normal    low risk  Elevated
## 3 0-25  Diabetes Normal   Normal    low risk  Elevated
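
The ten near-identical `cut()` calls above could also be consolidated into one helper applied to both the training and test data, so the two recodings cannot drift apart. A sketch (`bin_features` is a name introduced here; the breaks and labels mirror those used above):

```r
# One helper applies identical bins to any data frame with these columns.
bin_features <- function(df) {
  df$Age       <- cut(df$Age, breaks = c(0, 25, 35, 45, 55, 100),
                      labels = c("0-25", "26-35", "36-45", "46-55", "56+"))
  df$MAP       <- cut(df$MAP, breaks = c(0, 70, 90, 110, 200),
                      labels = c("Low", "Normal", "Elevated", "High"))
  df$BS        <- cut(df$BS, breaks = c(0, 5.5, 7, 11, 30),
                      labels = c("Normal", "Pre-diabetes", "Diabetes", "Very High"))
  df$BodyTemp  <- cut(df$BodyTemp, breaks = c(0, 97, 99, 105),
                      labels = c("Low", "Normal", "Fever"))
  df$HeartRate <- cut(df$HeartRate, breaks = c(0, 60, 80, 200),
                      labels = c("Bradycardia", "Normal", "Tachycardia"))
  df
}

# train_data <- bin_features(train_data); test_data <- bin_features(test_data)
```

`cut()` defaults to `right = TRUE`, matching the intervals used above.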

The results of the hyperparameter tuning using the same grid as above can be seen as follows:

rf_tuned2 <- train(
  RiskLevel ~ ., 
  data = train_data,
  method = "ranger",
  metric = "Accuracy",
  trControl = ctrl,
  tuneGrid = grid,
  num.trees = 1000,
  importance = "impurity"
)

### to view grid search results before final model 
### to show all tried mtry values with mean Accuracy and SD across folds
rf_tuned2$results
##   mtry splitrule min.node.size  Accuracy     Kappa AccuracySD    KappaSD
## 1    2      gini             1 0.6576426 0.4864267 0.05954462 0.08927200
## 2    3      gini             1 0.6852163 0.5277591 0.04606131 0.06913748
## 3    4      gini             1 0.6810819 0.5216032 0.04189491 0.06274777
## 4    5      gini             1 0.6852280 0.5277901 0.04217588 0.06319396
plot(rf_tuned2)

rf_tuned2$bestTune
##   mtry splitrule min.node.size
## 4    5      gini             1

Although bestTune selects mtry = 5, the cross-validated accuracies for mtry = 3 and mtry = 5 are essentially identical, so the slightly simpler mtry = 3 was chosen for the final model.

Then, the final model was trained with the chosen mtry.

### tuned model

final_rf2 <- ranger(
  RiskLevel ~ ., 
  data = train_data, 
  mtry = 3, 
  num.trees = 1000,
  importance = "impurity"
)

Interpretation of the Model Results

The results of the final model are shown below as a compact summary.

### result
print(final_rf2)
## Ranger result
## 
## Call:
##  ranger(RiskLevel ~ ., data = train_data, mtry = 3, num.trees = 1000,      importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  1000 
## Sample size:                      975 
## Number of independent variables:  5 
## Mtry:                             3 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             29.13 %

Feature Importance

### feature importance from tuned model
importance2 <- varImp(rf_tuned2, scale = TRUE)
plot(importance2, top = 10)

print(importance2)
## ranger variable importance
## 
##                      Overall
## BSVery High          100.000
## MAPHigh               93.502
## BSPre-diabetes        33.043
## BSDiabetes            23.967
## Age36-45              18.315
## BodyTempNormal        17.887
## Age26-35              17.167
## BodyTempFever         17.117
## MAPElevated           10.963
## HeartRateTachycardia   9.834
## Age46-55               9.488
## MAPNormal              8.879
## HeartRateNormal        5.358
## Age56+                 0.000

Evaluation

# to get predictions from ranger
pred2 <- predict(final_rf2, data = test_data)$predictions

# to ensure factors match the classes
predicted2 <- factor(pred2, levels = c("low risk", "mid risk", "high risk"))

# confusion matrix for multiclass
conf_mat_rf_cat <- confusionMatrix(predicted2, actual)
conf_mat_rf_cat
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  low risk mid risk high risk
##   low risk        52       21         2
##   mid risk        21       52        16
##   high risk        8        8        63
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6872         
##                  95% CI : (0.6249, 0.745)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.5309         
##                                          
##  Mcnemar's Test P-Value : 0.09933        
## 
## Statistics by Class:
## 
##                      Class: low risk Class: mid risk Class: high risk
## Sensitivity                   0.6420          0.6420           0.7778
## Specificity                   0.8580          0.7716           0.9012
## Pos Pred Value                0.6933          0.5843           0.7975
## Neg Pred Value                0.8274          0.8117           0.8902
## Prevalence                    0.3333          0.3333           0.3333
## Detection Rate                0.2140          0.2140           0.2593
## Detection Prevalence          0.3086          0.3663           0.3251
## Balanced Accuracy             0.7500          0.7068           0.8395

  3. Logistic Regression
library(nnet)

The multinomial logistic regression model is built as follows:

# to fit multinomial logistic regression
multi_logit <- multinom(RiskLevel ~ Age + MAP + BS + BodyTemp + HeartRate, 
                        data = train_data)
## # weights:  48 (30 variable)
## initial  value 1071.146981 
## iter  10 value 753.431893
## iter  20 value 678.247475
## iter  30 value 666.078703
## iter  40 value 665.308794
## final  value 665.307635 
## converged
# to summarize model
summary(multi_logit)
## Call:
## multinom(formula = RiskLevel ~ Age + MAP + BS + BodyTemp + HeartRate, 
##     data = train_data)
## 
## Coefficients:
##           (Intercept) Age26-35 Age36-45  Age46-55    Age56+ MAPNormal
## mid risk     3.211048 1.007202 1.137030 0.2958535 -1.317108 0.7734677
## high risk    2.907014 1.173384 2.855892 0.8843586 -2.188685 1.6347380
##           MAPElevated    MAPHigh BSPre-diabetes BSDiabetes BSVery High
## mid risk    0.3457094  0.0385041      -4.119198  -5.329920    12.66017
## high risk   0.9050855 20.5583150      -5.829989  -5.678467    14.41547
##           BodyTempNormal BodyTempFever HeartRateNormal HeartRateTachycardia
## mid risk       0.5382789      2.672769      -0.2920636           0.08674184
## high risk     -0.1422062      3.049220      -1.0029739           0.69355800
## 
## Std. Errors:
##           (Intercept)  Age26-35  Age36-45  Age46-55    Age56+ MAPNormal
## mid risk    0.2114266 0.2249396 0.4086512 0.3385768 0.6078595 0.2784850
## high risk   0.2886584 0.3123522 0.4427882 0.4216133 0.8555429 0.4304935
##           MAPElevated      MAPHigh BSPre-diabetes BSDiabetes BSVery High
## mid risk    0.2650699 9.085807e-08      0.1498808  0.1411936   0.1249447
## high risk   0.4127502 4.189316e-07      0.2243508  0.1830430   0.1249444
##           BodyTempNormal BodyTempFever HeartRateNormal HeartRateTachycardia
## mid risk       0.1612706     0.1865413       0.3844020            0.4536973
## high risk      0.2294854     0.2229980       0.5176514            0.5563245
## 
## Residual Deviance: 1330.615 
## AIC: 1382.615
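
Note that `multinom()` reports coefficients and standard errors but no p-values. A common follow-up (a sketch, not part of the original analysis; `multinom_pvalues` is a helper name introduced here) is a Wald z-test on each coefficient:

```r
# Wald z-test p-values for a fitted multinom model:
# z = coefficient / standard error, p = 2 * P(Z > |z|)
multinom_pvalues <- function(fit) {
  s <- summary(fit)
  z <- s$coefficients / s$standard.errors
  2 * (1 - pnorm(abs(z)))
}

# round(multinom_pvalues(multi_logit), 3)
```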

Evaluation

# to predict class labels
pred_class <- predict(multi_logit, newdata = test_data)


predicted3 <- factor(pred_class, 
                    levels = c("low risk", "mid risk", "high risk"))

# Confusion matrix
conf_mat_log <- confusionMatrix(predicted3, actual)
conf_mat_log
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  low risk mid risk high risk
##   low risk        56       36        12
##   mid risk        25       37        18
##   high risk        0        8        51
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5926         
##                  95% CI : (0.5279, 0.655)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3889         
##                                          
##  Mcnemar's Test P-Value : 0.0004769      
## 
## Statistics by Class:
## 
##                      Class: low risk Class: mid risk Class: high risk
## Sensitivity                   0.6914          0.4568           0.6296
## Specificity                   0.7037          0.7346           0.9506
## Pos Pred Value                0.5385          0.4625           0.8644
## Neg Pred Value                0.8201          0.7301           0.8370
## Prevalence                    0.3333          0.3333           0.3333
## Detection Rate                0.2305          0.1523           0.2099
## Detection Prevalence          0.4280          0.3292           0.2428
## Balanced Accuracy             0.6975          0.5957           0.7901

Conclusion

# random forest (numerical)
conf_mat_rf_n$overall["Accuracy"] * 100
## Accuracy 
## 88.47737
# random forest (categorical)
conf_mat_rf_cat$overall["Accuracy"] * 100
## Accuracy 
## 68.72428
# logistic regression
conf_mat_log$overall["Accuracy"] * 100
## Accuracy 
## 59.25926
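
Accuracy alone can mask per-class behaviour. A complementary comparison (a sketch; `macro_f1` is a helper introduced here) is the macro-averaged F1 score, computed directly from a confusion-matrix table such as `conf_mat_rf_n$table`:

```r
# Macro-averaged F1 from a confusion matrix (rows = predicted, cols = actual):
# per-class F1 = 2 * precision * recall / (precision + recall), then averaged.
macro_f1 <- function(tab) {
  f1 <- sapply(seq_len(nrow(tab)), function(i) {
    prec <- tab[i, i] / sum(tab[i, ])
    rec  <- tab[i, i] / sum(tab[, i])
    2 * prec * rec / (prec + rec)
  })
  mean(f1, na.rm = TRUE)
}

# e.g. macro_f1(conf_mat_rf_n$table); macro_f1(conf_mat_log$table)
```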
  1. The selected health-related attributes — Blood Glucose (BS), Mean Arterial Pressure (MAP), Age, Heart Rate, and Body Temperature — were used to predict the risk level of participants using machine learning models. These variables represent important physiological indicators commonly linked to metabolic and cardiovascular conditions.
  2. Among the three models tested — Random Forest (numerical and categorical predictors) and Logistic Regression — the Random Forest model using numerical values demonstrated the highest accuracy, indicating a stronger ability to capture complex, non-linear relationships between predictors and risk levels.
  3. Based on the Random Forest model’s variable importance, the most influential attributes in predicting risk level were:
       1. Blood Glucose (BS) — the strongest predictor of risk level, reflecting direct metabolic imbalance.
       2. Mean Arterial Pressure (MAP) — indicating cardiovascular strain.
       3. Age — representing cumulative physiological risk.
       4. Heart Rate — associated with cardiovascular and metabolic activity.
       5. Body Temperature — reflecting potential infection or metabolic stress.

Overall, the Random Forest model effectively utilized these physiological parameters to classify participants into their respective risk levels with higher predictive performance compared to the Logistic Regression model.

References

World Health Organization. (n.d.). Maternal health. https://www.who.int/health-topics/maternal-health#tab=tab_1