Spring 2024 - Applied Statistics II for STEM Professions - Predicting Heart Disease

Project Type

Predicting Heart Disease Using Statistical and Machine Learning Models

Date

Spring 2024

❤️ Predicting Heart Disease Using Statistical and Machine Learning Models
Course: Applied Statistics II for STEM Professions (MAT-303)
Tools Used: R, Logistic Regression, Random Forest, ROC Curves, Confusion Matrices

Project Overview
In this project, I explored multiple statistical models to analyze a heart disease dataset containing 303 patient records and 14 health-related variables (e.g., age, cholesterol, chest pain type, blood pressure, max heart rate). The goal was to identify patterns, risk factors, and relationships among these indicators to predict the likelihood of heart disease and provide insights that could support early diagnosis and prevention efforts.

The models I built and compared included:

Two logistic regression models using different predictor combinations

A random forest classification model to assess heart disease presence

A random forest regression model to predict maximum heart rate achieved

Objectives & Methodology
Clean and prepare the data for analysis

Test multiple predictors for significance in logistic regression models

Evaluate model performance using confusion matrices, ROC/AUC values, and significance testing

Use random forest models to explore non-linear relationships and improve prediction accuracy
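
As a rough illustration of the evaluation step, the sketch below computes a confusion matrix and an AUC value from predicted probabilities. The project itself was done in R; this is a plain Python stand-in, the function names are hypothetical, and the labels and scores are made-up toy values rather than the heart disease data.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) for binary labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs in which the positive
    case receives the higher score; ties count as half a win."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only (not from the project dataset).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

tp, fp, tn, fn = confusion_matrix(y_true, y_prob)
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("AUC:", auc(y_true, y_prob))
```

The confusion matrix depends on the chosen cutoff (0.5 here, an assumption), while AUC summarizes ranking quality across all cutoffs, which is why the two are usually reported together.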

Key Insights:
The second logistic regression model, which included chest pain type and max heart rate, was more statistically significant than the first and achieved an AUC of 0.8389, indicating a strong ability to distinguish patients with heart disease from those without.

The random forest classification model outperformed both logistic regressions, capturing more complex patterns and generalizing better when tested on new data.

The random forest regression model was used to predict maximum heart rate, a critical indicator in assessing cardiac strain, with its trees tuned for accuracy.

What I Learned:
How to perform logistic regression and interpret coefficients, p-values, and odds ratios

The value of model evaluation tools like the Hosmer-Lemeshow test, the Wald test, and ROC curves

How to build and fine-tune random forest models for classification and regression tasks

The importance of comparing multiple models to determine which is most reliable for decision-making
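
To make the coefficient interpretation concrete, here is a minimal sketch of how a logistic regression coefficient maps to an odds ratio and how a Wald test checks whether it differs from zero. It is written in Python rather than the R used in the project, the function names are hypothetical, and the coefficient and standard error are invented for illustration, not estimates from the models above.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor: exp(beta)."""
    return math.exp(beta)


def wald_test(beta, std_err):
    """Wald z statistic and two-sided p-value for H0: beta = 0,
    using the standard normal approximation."""
    z = beta / std_err
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented example values (not taken from the project output).
beta = math.log(2)   # a coefficient whose odds ratio is exactly 2
std_err = 0.25

z, p = wald_test(beta, std_err)
print("Odds ratio:", odds_ratio(beta))
print("Wald z:", round(z, 3), "p-value:", round(p, 4))
```

An odds ratio above 1 means the odds of the outcome rise as the predictor increases, and a small Wald p-value suggests the coefficient is unlikely to be zero.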

Reflection:
This project marked my transition into more advanced statistical thinking. It deepened my appreciation for the role of data in health risk prediction and real-world decision-making. Most importantly, it taught me how to evaluate not just the accuracy of a model, but its relevance, strength, and practical utility in solving complex problems — skills I carry into every data project I take on.
