Introduction

Diabetes is one of the major non-communicable diseases and a persistent public health burden across the world. According to the 2025 Diabetes Atlas report, nearly 600 million adults are living with diabetes worldwide, and this number is projected to reach nearly 850 million by 2050 (International Diabetes Federation, 2025). Diabetes is a chronic condition characterized by elevated blood glucose levels. It is a multi-factorial disease that results from a complex interaction between genetic, environmental, and lifestyle factors. It is therefore important to understand the interplay between lifestyle choices and diabetes.


Methodology

The study used a systematic approach comprising four distinct stages:

  1. data extraction and description,
  2. data preparation,
  3. data mining (classification by random forest), and
  4. evaluation.

Data Extraction and Description

The dataset chosen for this study is the ‘CDC Diabetes Health Indicators’ dataset, extracted from the UC Irvine Machine Learning Repository. The data were collected through the Behavioral Risk Factor Surveillance System (BRFSS), an annual telephone health survey conducted by the CDC in the US; this version covers the 2015 survey.

Sources: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset and https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

### load the data
dm_data <- read.csv('diabetes_012_health_indicators_BRFSS2015.csv')

### check the structure
dim(dm_data)
## [1] 253680     22

The dataset contains records from 253,680 participants, with one target variable and 21 features covering demographics and self-reported lifestyle and health-related information.

for (column in names(dm_data)) {
  cat("\nColumn:", column, "\n")
  print(unique(dm_data[[column]]))
}
## 
## Column: Diabetes_012 
## [1] 0 2 1
## 
## Column: HighBP 
## [1] 1 0
## 
## Column: HighChol 
## [1] 1 0
## 
## Column: CholCheck 
## [1] 1 0
## 
## Column: BMI 
##  [1] 40 25 28 27 24 30 34 26 33 21 23 22 38 32 37 31 29 20 35 45 39 19 47 18 36
## [26] 43 55 49 42 17 16 41 44 50 59 48 52 46 54 57 53 14 15 51 58 63 61 56 74 62
## [51] 64 66 73 85 60 67 65 70 82 79 92 68 72 88 96 13 81 71 75 12 77 69 76 87 89
## [76] 84 95 98 91 86 83 80 90 78
## 
## Column: Smoker 
## [1] 1 0
## 
## Column: Stroke 
## [1] 0 1
## 
## Column: HeartDiseaseorAttack 
## [1] 0 1
## 
## Column: PhysActivity 
## [1] 0 1
## 
## Column: Fruits 
## [1] 0 1
## 
## Column: Veggies 
## [1] 1 0
## 
## Column: HvyAlcoholConsump 
## [1] 0 1
## 
## Column: AnyHealthcare 
## [1] 1 0
## 
## Column: NoDocbcCost 
## [1] 0 1
## 
## Column: GenHlth 
## [1] 5 3 2 4 1
## 
## Column: MentHlth 
##  [1] 18  0 30  3  5 15 10  6 20  2 25  1  4  7  8 21 14 26 29 16 28 11 12 24 17
## [26] 13 27 19 22  9 23
## 
## Column: PhysHlth 
##  [1] 15  0 30  2 14 28  7 20  3 10  1  5 17  4 19  6 12 25 27 21 22  8 29 24  9
## [26] 16 18 23 13 26 11
## 
## Column: DiffWalk 
## [1] 1 0
## 
## Column: Sex 
## [1] 0 1
## 
## Column: Age 
##  [1]  9  7 11 10  8 13  4  6  2 12  5  1  3
## 
## Column: Education 
## [1] 4 6 3 5 2 1
## 
## Column: Income 
## [1] 3 1 8 6 4 7 2 5

The original target column (Diabetes_012) has three classes: 0 for no diabetes, 1 for prediabetes, and 2 for diabetes.

The descriptions of the 21 feature columns are as follows (a small labeling sketch follows the list):

  1. HighBP: Presence of high blood pressure (0 = no, 1 = yes).

  2. HighChol: Presence of high cholesterol (0 = no, 1 = yes).

  3. CholCheck: Indicates whether the individual had a cholesterol check in the past 5 years (0 = no, 1 = yes).

  4. BMI: Body Mass Index.

  5. Smoker: Smoking status, coded as whether the person has smoked at least 100 cigarettes in their entire life (0 = no, 1 = yes).

  6. Stroke: History of stroke (0 = no, 1 = yes).

  7. HeartDiseaseorAttack: History of coronary heart disease (CHD) or myocardial infarction (MI) (0 = no, 1 = yes).

  8. PhysActivity: Participation in physical activity in the past 30 days, excluding job (0 = no, 1 = yes).

  9. Fruits: Consumption of fruits one or more times per day (0 = no, 1 = yes).

  10. Veggies: Consumption of vegetables one or more times per day (0 = no, 1 = yes).

  11. HvyAlcoholConsump: Heavy alcohol consumption (0 = no, 1 = yes), defined as >14 drinks per week for men and >7 drinks per week for women.

  12. AnyHealthcare: Access to healthcare coverage, such as prepaid plans (0 = no, 1 = yes).

  13. NoDocbcCost: Whether cost prevented a doctor’s visit in the past 12 months (0 = no, 1 = yes).

  14. GenHlth: Self-reported general health status on a 5-point scale (1 = excellent, 3 = good, 5 = poor).

  15. MentHlth: Number of days in the past 30 days when mental health was not good (scale: 0–30 days).

  16. PhysHlth: Number of days in the past 30 days when physical health, including physical illness and injury, was not good (scale: 0–30 days).

  17. DiffWalk: Presence of serious difficulty walking or climbing stairs (0 = no, 1 = yes).

  18. Sex: Gender (0 = female, 1 = male).

  19. Age: Age group, coded into 13 ordered categories (e.g., 1 = 18–24 years, 9 = 60–64 years, 13 = 80 years or older).

  20. Education: Education level, coded 1–6 (1 = never attended school or only kindergarten, 4 = high school graduate, 6 = college graduate).

  21. Income: Income group, coded 1–8 (1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more).
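
For readability in later summaries, the coded categories could optionally be given descriptive labels. Below is a minimal sketch for two of the variables, assuming the standard BRFSS wording for the intermediate GenHlth categories (2 = very good, 4 = fair); the labeled copy is for illustration only and is not used in the analysis.

### optional: attach descriptive labels (illustration only; not used in the analysis)
dm_labeled <- dm_data
dm_labeled$Sex     <- factor(dm_labeled$Sex, levels = c(0, 1),
                             labels = c("Female", "Male"))
dm_labeled$GenHlth <- factor(dm_labeled$GenHlth, levels = 1:5,
                             labels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
table(dm_labeled$GenHlth)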


Research Questions

In this study, the diabetes health indicators dataset from the UC Irvine Machine Learning Repository will be used to explore the following research questions:

1. Can the lifestyle and health-related attributes provided in the survey be used to predict the presence of diabetes in the surveyed participants using a random forest model?

2. Which lifestyle and health-related attributes contribute the most to predicting diabetes?


Data Preparation

The dataset contained no missing values, so no imputation was required. A correlation plot was visualized to identify highly correlated features for potential removal before model construction; however, since no features exhibited high correlation, none were removed. Additionally, several categorical variables were originally encoded as numeric values in the dataset, and they were converted to factor types to ensure correct handling by the random forest algorithm. The original target variable had three classes, but the prediabetic class (coded as 1) represented only about 2% of the data and was removed. The diabetic class (coded as 2) was recoded as 1. This resulted in 35,346 observations of class 1 (diabetes) and 213,703 observations of class 0 (no diabetes). To balance the dataset, an equal number of class 0 observations was randomly sampled. The balanced dataset was then split into a training set (80%) and a test set (20%), keeping the class balance consistent.

To assist data manipulation, the ‘tidyverse’ R-package was loaded.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Missing values were checked, and none were found.

### to check if there are missing values
sapply(dm_data, function(x) sum(is.na(x)))
##         Diabetes_012               HighBP             HighChol 
##                    0                    0                    0 
##            CholCheck                  BMI               Smoker 
##                    0                    0                    0 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##                    0                    0                    0 
##               Fruits              Veggies    HvyAlcoholConsump 
##                    0                    0                    0 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##                    0                    0                    0 
##             MentHlth             PhysHlth             DiffWalk 
##                    0                    0                    0 
##                  Sex                  Age            Education 
##                    0                    0                    0 
##               Income 
##                    0

The structure of the original dataset was first checked, which showed that some categorical variables are coded as numeric.

### to check the structure of the original dataset
str(dm_data)
## 'data.frame':    253680 obs. of  22 variables:
##  $ Diabetes_012        : num  0 0 0 0 0 0 0 0 2 0 ...
##  $ HighBP              : num  1 0 1 1 1 1 1 1 1 0 ...
##  $ HighChol            : num  1 0 1 0 1 1 0 1 1 0 ...
##  $ CholCheck           : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ BMI                 : num  40 25 28 27 24 25 30 25 30 24 ...
##  $ Smoker              : num  1 1 0 0 0 1 1 1 1 0 ...
##  $ Stroke              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HeartDiseaseorAttack: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ PhysActivity        : num  0 1 0 1 1 1 0 1 0 0 ...
##  $ Fruits              : num  0 0 1 1 1 1 0 0 1 0 ...
##  $ Veggies             : num  1 0 0 1 1 1 0 1 1 1 ...
##  $ HvyAlcoholConsump   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AnyHealthcare       : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ NoDocbcCost         : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ GenHlth             : num  5 3 5 2 2 2 3 3 5 2 ...
##  $ MentHlth            : num  18 0 30 0 3 0 0 0 30 0 ...
##  $ PhysHlth            : num  15 0 30 0 0 2 14 0 30 0 ...
##  $ DiffWalk            : num  1 0 1 0 0 0 0 1 1 0 ...
##  $ Sex                 : num  0 0 0 0 0 1 0 0 0 1 ...
##  $ Age                 : num  9 7 9 11 11 10 9 11 9 8 ...
##  $ Education           : num  4 6 4 3 5 6 6 4 5 4 ...
##  $ Income              : num  3 1 8 6 4 8 7 4 1 3 ...

A correlation plot was visualized to identify highly correlated features for potential removal, but none exhibited high correlation.

### correlation
library(corrplot)
## corrplot 0.95 loaded
cor_mat <- cor(dm_data, use = "pairwise.complete.obs")
corrplot(cor_mat,
         method = "color",      
         type = "upper",        
         tl.col = "black",      
         tl.srt = 45,           
         addCoef.col = "black", 
         number.cex = 0.6,      
         diag = FALSE)          

### to find maximum correlation
max(abs(cor_mat[upper.tri(cor_mat)]))
## [1] 0.5243636
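
If needed, the variable pair attaining this maximum can be identified directly from the matrix; a small sketch using only the objects created above:

### to locate the most correlated pair (symmetric matrix, so two index rows are returned)
max_cor <- max(abs(cor_mat[upper.tri(cor_mat)]))
idx <- which(abs(cor_mat) == max_cor, arr.ind = TRUE)
rownames(cor_mat)[idx[1, ]]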

Categorical variables (stored as numeric data types) were converted to factor type.

# cols to convert to factor (all except BMI, PhysHlth, and MentHlth)
cols_to_factor <- setdiff(names(dm_data), c("BMI","PhysHlth","MentHlth"))

# convert selected columns to factor
dm_data[, cols_to_factor] <- lapply(dm_data[, cols_to_factor], as.factor)

### to check the recoded structure of the data
str(dm_data)
## 'data.frame':    253680 obs. of  22 variables:
##  $ Diabetes_012        : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 3 1 ...
##  $ HighBP              : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 1 ...
##  $ HighChol            : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 2 2 1 ...
##  $ CholCheck           : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
##  $ BMI                 : num  40 25 28 27 24 25 30 25 30 24 ...
##  $ Smoker              : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 2 1 ...
##  $ Stroke              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HeartDiseaseorAttack: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ PhysActivity        : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 1 1 ...
##  $ Fruits              : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 2 1 ...
##  $ Veggies             : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 2 ...
##  $ HvyAlcoholConsump   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AnyHealthcare       : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
##  $ NoDocbcCost         : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
##  $ GenHlth             : Factor w/ 5 levels "1","2","3","4",..: 5 3 5 2 2 2 3 3 5 2 ...
##  $ MentHlth            : num  18 0 30 0 3 0 0 0 30 0 ...
##  $ PhysHlth            : num  15 0 30 0 0 2 14 0 30 0 ...
##  $ DiffWalk            : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 2 1 ...
##  $ Sex                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 2 ...
##  $ Age                 : Factor w/ 13 levels "1","2","3","4",..: 9 7 9 11 11 10 9 11 9 8 ...
##  $ Education           : Factor w/ 6 levels "1","2","3","4",..: 4 6 4 3 5 6 6 4 5 4 ...
##  $ Income              : Factor w/ 8 levels "1","2","3","4",..: 3 1 8 6 4 8 7 4 1 3 ...

The target variable was reduced to two classes (diabetes vs. no diabetes) by removing the small prediabetic group (class 1) and recoding diabetes (class 2) as class 1.

### to check the original classes in target column
table(dm_data$Diabetes_012)
## 
##      0      1      2 
## 213703   4631  35346
### to remove class 1 and change class 2 → 1
dm_data_filtered <- dm_data
dm_data_filtered <- dm_data_filtered[dm_data_filtered$Diabetes_012 != 1, ]
dm_data_filtered$Diabetes_012[dm_data_filtered$Diabetes_012 == 2] <- 1

### to check the distribution of new classes
table(dm_data_filtered$Diabetes_012)
## 
##      0      1      2 
## 213703  35346      0

Because Diabetes_012 is still a three-level factor at this point, the now-empty level 2 appears in the table above with a zero count; it is dropped with droplevels() in the balancing step below. The resulting dataset was balanced between class 0 (no diabetes) and class 1 (diabetes) by randomly sampling an equal number of no-diabetes cases.

### setting seeds
set.seed(6738371)

### to compute minimum group size and sample
min_size <- dm_data_filtered %>%
  count(Diabetes_012) %>%
  pull(n) %>%
  min()

dm_balanced <- dm_data_filtered %>%
  group_by(Diabetes_012) %>%
  slice_sample(n = min_size, replace = FALSE) %>%
  ungroup()

# to drop unused levels just in case
dm_balanced$Diabetes_012 <- droplevels(dm_balanced$Diabetes_012)

# to check the balanced classes
table(dm_balanced$Diabetes_012)
## 
##     0     1 
## 35346 35346

The balanced dataset is split into training (80%) and test (20%) sets.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
### to assign train and test data
set.seed(6738371)    

### stratified random 80–20 split, keeping the class balance consistent
train_indices <- createDataPartition(dm_balanced$Diabetes_012, p = 0.8, list = FALSE)

train_data <- dm_balanced[train_indices, ]
test_data  <- dm_balanced[-train_indices, ]
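
As a quick sanity check, the class proportions in both splits can be verified; this uses only the objects created above.

### to verify the stratified split preserved the 50/50 class balance
prop.table(table(train_data$Diabetes_012))
prop.table(table(test_data$Diabetes_012))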

Data Mining (Classification by Random Forest)

A Random Forest classification model was developed to predict Diabetes_012 using the training dataset, employing a 10-fold cross-validation procedure to ensure robust performance estimation. The model was implemented with the ranger algorithm, with hyperparameter tuning performed via grid search over the number of variables randomly sampled at each split (mtry), while the number of trees (num.trees) was fixed at 500 to ensure sufficient ensemble stability. The Gini impurity criterion was used to guide the splitting of nodes, and the caret package was employed to systematically evaluate each hyperparameter combination through cross-validation. Model selection was based on cross-validated accuracy, and the optimal parameters identified by this procedure were subsequently used to train the final Random Forest model. Variable importance was assessed using the impurity-based metric to interpret feature contributions.

library(ranger)

Hyper-parameter Tuning

### Random forest key hyperparameters:
### mtry = number of variables randomly sampled at each split
### num.trees = number of trees (fixed at 500)


set.seed(6738371)

ctrl <- trainControl(method = "cv", number = 10)

grid <- expand.grid(
  mtry = c(2, 4, 6),
  splitrule = "gini",
  min.node.size = 1
)

The results of the hyperparameter tuning can be seen as follows:

rf_tuned <- train(
  Diabetes_012 ~ ., 
  data = train_data,
  method = "ranger",
  metric = "Accuracy",
  trControl = ctrl,
  tuneGrid = grid,
  num.trees = 500,
  importance = "impurity"
)

### to view grid search results before final model 
### to show all tried mtry values with mean Accuracy and SD across folds
rf_tuned$results
##   mtry splitrule min.node.size  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2      gini             1 0.7455704 0.4911407 0.004905459 0.009813955
## 2    4      gini             1 0.7496550 0.4993100 0.004122507 0.008246507
## 3    6      gini             1 0.7479573 0.4959149 0.005078902 0.010158857
plot(rf_tuned)

rf_tuned$bestTune
##   mtry splitrule min.node.size
## 2    4      gini             1

From the above analyses, the optimal mtry is 4.

The final model was then trained with the chosen optimal mtry.

### tuned model
best_mtry <- rf_tuned$bestTune$mtry

final_rf <- ranger(
  Diabetes_012 ~ ., 
  data = train_data, 
  mtry = best_mtry, 
  num.trees = 500,
  importance = "impurity"
)

Interpretation of the Model Results

The results of the final model are shown below as a compact summary.

### result
print(final_rf)
## Ranger result
## 
## Call:
##  ranger(Diabetes_012 ~ ., data = train_data, mtry = best_mtry,      num.trees = 500, importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  500 
## Sample size:                      56554 
## Number of independent variables:  21 
## Mtry:                             4 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             25.23 %

The out-of-bag (OOB) prediction error of 25.23% already suggests an accuracy of roughly 75%, consistent with the test-set evaluation reported below. Feature importance is displayed as a graph.

### feature importance from tuned model
importance <- varImp(rf_tuned, scale = TRUE)
plot(importance, top = 20)

print(importance)
## ranger variable importance
## 
##   only 20 most important variables shown (out of 45)
## 
##                       Overall
## BMI                   100.000
## HighBP1                78.040
## HighChol1              44.792
## PhysHlth               40.037
## DiffWalk1              29.820
## GenHlth4               26.417
## MentHlth               26.012
## GenHlth3               20.097
## HeartDiseaseorAttack1  17.571
## GenHlth2               17.219
## Income8                13.973
## PhysActivity1          11.238
## Sex1                   11.139
## Smoker1                10.114
## Fruits1                 9.967
## Education6              8.988
## Veggies1                8.215
## GenHlth5                7.848
## Age11                   6.563
## Education4              6.498

The feature importance plot displays the relative contribution of each predictor in constructing the random forest model to identify diabetes. The top five features are BMI, high blood pressure, high cholesterol, self-reported physical health, and difficulty walking. These reflect the current understanding of the association between metabolic risk factors, physical function, and diabetes. In contrast, factors such as sex, fruit consumption, and physical activity contribute only slightly to the model, and some commonly recognized risk factors, such as age, show surprisingly small per-level effects here.
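
Note that caret's formula interface dummy-encodes the factors, so importance is reported separately for each level (hence 45 variables, e.g., GenHlth4 or Age11). A rough sketch for summing the per-level scores back to their base variables is shown below; since the scores are scaled, the totals should be read as indicative only.

### rough sketch: aggregate per-level importances back to base variables
### (strips trailing level digits, e.g., "GenHlth4" -> "GenHlth")
imp_df <- varImp(rf_tuned, scale = TRUE)$importance
imp_df$base <- sub("[0-9]+$", "", rownames(imp_df))
agg <- aggregate(Overall ~ base, data = imp_df, FUN = sum)
agg[order(-agg$Overall), ]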

Limitation: since the model uses variables that rely on self-reporting, caution should be exercised when interpreting the results.


Evaluation

Firstly, the actual class variable was extracted from the test data. Secondly, the final tuned model was used to make predictions. Thirdly, a confusion matrix was constructed from these predictions. Lastly, the standard performance metrics and statistics were calculated using confusionMatrix().

### get predictions from ranger
pred <- predict(final_rf, data = test_data)$predictions

### ensure factors match the classes
actual <- factor(test_data$Diabetes_012, levels = c("0", "1"))
predicted <- factor(pred, levels = c("0", "1"))

### Confusion matrix
confusionMatrix(predicted, actual, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4965 1446
##          1 2104 5623
##                                          
##                Accuracy : 0.7489         
##                  95% CI : (0.7417, 0.756)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4978         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7954         
##             Specificity : 0.7024         
##          Pos Pred Value : 0.7277         
##          Neg Pred Value : 0.7745         
##              Prevalence : 0.5000         
##          Detection Rate : 0.3977         
##    Detection Prevalence : 0.5465         
##       Balanced Accuracy : 0.7489         
##                                          
##        'Positive' Class : 1              
## 

The overall accuracy of the model was 75%, indicating that it correctly predicts three out of four classifications. The accuracy was significantly higher than the no-information rate (NIR, p < 2.2e-16), suggesting that the model performs better than random guessing, and the kappa statistic of 0.4978 indicates a moderate level of agreement between actual and predicted classifications beyond chance. The sensitivity was 0.7954, meaning that the model accurately identifies roughly 8 out of 10 individuals with diabetes. The specificity of 0.7024 indicates that it accurately identifies 70% of participants without diabetes. Since sensitivity is higher than specificity, the model is more effective at identifying participants with diabetes than those without; in disease prediction, high sensitivity is desirable so as to detect as many true cases as possible. Additionally, the positive predictive value was 0.7277, indicating that when the model predicts that a person has diabetes, there is a 73% chance the prediction is correct; the negative predictive value was 0.7745, meaning that a no-diabetes prediction is slightly more likely to be correct. To conclude, the model performs better at detecting diabetes cases, and its level of accuracy and sensitivity could make it useful for preliminary screening; however, there is room for improvement in specificity and overall performance.
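
One possible extension is to examine threshold-independent discrimination and then choose a decision threshold that favors sensitivity. The sketch below assumes the pROC package is available and retrains the forest in probability mode; it is illustrative rather than part of the reported analysis.

### sketch: probability forest and ROC AUC (assumes the pROC package is installed)
library(pROC)

set.seed(6738371)
prob_rf <- ranger(
  Diabetes_012 ~ ., data = train_data,
  mtry = best_mtry, num.trees = 500,
  probability = TRUE            # return class probabilities instead of labels
)

### predicted probability of the positive class ("1" = diabetes) on the test set
prob_pred <- predict(prob_rf, data = test_data)$predictions[, "1"]

roc_obj <- roc(test_data$Diabetes_012, prob_pred, levels = c("0", "1"))
auc(roc_obj)

A lower classification threshold on prob_pred would trade specificity for higher sensitivity, which may be preferable for screening.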

References

International Diabetes Federation. (2025). Diabetes Global Report 2000–2050. Diabetes Atlas. Retrieved from https://diabetesatlas.org/data-by-location/global/