Utilizing Machine Learning Techniques to Predict Cardiovascular Diseases and Comparing the Outcomes for Better Accuracy
Abstract
Advances in technology have made many features available for diagnosing heart disease. However, large datasets bring practical limitations, such as storage demands and long access and processing times. Heart disease is a severe illness whose incidence is rising in both developed and developing countries, and it is frequently fatal. It prevents the heart from supplying enough blood to other parts of the body, impairing their normal function. Diagnosing the condition early and accurately is crucial to prevent further harm and potentially save lives. Various forms of heart disease can be detected with medical tests, but predicting heart disease without such tests is very difficult. Many researchers have analyzed the risk factors of this disease and proposed machine learning models for the early detection of heart disease in patients. However, these models suffer from the high dimensionality of the data and need to be improved to obtain highly accurate results. Our proposal consists of two main processes: data preprocessing and prediction. In data preprocessing, the data are prepared for prediction, and three different feature selection methods (e.g., PCA) are applied to select the most relevant features. In the prediction process, ten different prediction techniques (for example, Random Forest (RF) and Support Vector Classifier (SVC)) were applied to the datasets. The proposal was tested on five standard datasets from the UCI repository. The techniques were evaluated using four metrics: accuracy, precision, recall, and F1-score. For this research, we collected the dataset from the UCI repository (Kaggle) and used the Random Forest classification algorithm to predict heart disease.
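The four evaluation metrics named above can be written out from the counts of a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code; the function name `classification_metrics` and the toy labels are illustrative assumptions and do not come from the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper for illustration; a ratio with an empty
    denominator is reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels (1 = disease present, 0 = absent), for illustration only.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_metrics = classification_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Reporting precision and recall alongside accuracy matters in a screening setting, because a missed patient (a false negative) and a false alarm (a false positive) carry different costs that accuracy alone does not separate.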
The predictive model achieved an accuracy of 89.4 percent using the Random Forest classifier’s default settings to predict heart disease. Furthermore, the research highlights the opportunity to train and test the model on a larger dataset and to tune its hyperparameters for further improvement. The results show that LASSO feature selection combined with RF as the prediction technique produced the best accuracy (100%). An accuracy of 99.57% was obtained for Decision Tree (DT), Gradient Boosting (GB), AdaBoost (AB), Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM), K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient Boosting Boosting Method (GBBM). SVC, Logistic Regression (LR), Naïve Bayes (NB), and Support Vector Classifier Bagging Method (SVCBM) achieved nearly identical accuracy (98.73%).
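The combination reported as best, LASSO-based feature selection followed by a Random Forest classifier, might be sketched with scikit-learn as below. This is an illustrative sketch under stated assumptions, not the authors' implementation. The data are synthetic, generated with `make_classification` as a stand-in for the UCI data. The `alpha` value, the 13-feature shape, and the 75/25 split are arbitrary choices. The forest uses scikit-learn's default settings apart from a fixed random seed. Scores from this sketch say nothing about the accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption): 13 features, binary target.
X, y = make_classification(n_samples=400, n_features=13, n_informative=5,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipeline = Pipeline([
    # Standardize so the L1 penalty treats all features on the same scale.
    ("scale", StandardScaler()),
    # Keep only the features whose LASSO coefficient is non-zero.
    ("select", SelectFromModel(Lasso(alpha=0.01))),
    # Random Forest with default hyperparameters, seeded for repeatability.
    ("forest", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
n_selected = int(pipeline.named_steps["select"].get_support().sum())
print(f"features kept by LASSO: {n_selected} of {X.shape[1]}")
print(scores)
```

Placing the scaler and selector inside the pipeline means both are fitted on the training split only, so the held-out test data do not influence which features are kept.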