Imbalanced Data NearMiss for Comparison of SVM and Naive Bayes Algorithms
Abstract
This study applies classification algorithms to support the diagnosis, management, and prevention of HIV/AIDS. The dataset consists of 707,379 records and 89 columns. Preprocessing removes irrelevant attributes, resolves inconsistencies, and balances the classes with the NearMiss undersampling method, yielding a balanced proportion of reactive and non-reactive HIV cases. The balanced data is then split into training and test sets at four ratios: 60:40, 70:30, 80:20, and 90:10. Two classification models, Naive Bayes and SVM, are evaluated on accuracy, precision, recall, and F1-score. SVM achieves its highest accuracy of 82.6% with a 90:10 split and 6-fold cross-validation, followed by 82.2% with a 60:40 split and 5-fold cross-validation. Naive Bayes, by contrast, achieves its highest accuracy of only 61.1%, with a 60:40 split.
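The balancing step and the four evaluation metrics named above can be illustrated with a minimal, dependency-free sketch. The abstract does not say which NearMiss variant was used, so the code assumes NearMiss-1, which keeps the majority-class samples whose mean distance to their k nearest minority-class samples is smallest. The function names `near_miss_1` and `binary_metrics`, the value of `k`, and the toy data are all hypothetical and are not taken from the study.

```python
import math


def near_miss_1(X, y, majority_label, minority_label, k=3):
    """Undersample the majority class down to the minority class size.

    Assumption: NearMiss-1 selection rule (smallest mean distance to the
    k nearest minority samples). The study may have used another variant.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [x for x, label in zip(X, y) if label == majority_label]

    def mean_dist_to_nearest_minority(x):
        # Euclidean distances to every minority sample, closest first.
        dists = sorted(math.dist(x, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    # Keep as many majority samples as there are minority samples.
    kept = sorted(majority, key=mean_dist_to_nearest_minority)[:len(minority)]
    X_bal = minority + kept
    y_bal = [minority_label] * len(minority) + [majority_label] * len(kept)
    return X_bal, y_bal


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (invented for illustration): 2 reactive vs. 6 non-reactive cases.
X = [(0.0, 0.0), (0.2, 0.1),
     (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (5.0, 5.0), (0.5, 0.5), (4.0, 4.0)]
y = ["reactive"] * 2 + ["non-reactive"] * 6

X_bal, y_bal = near_miss_1(X, y, "non-reactive", "reactive", k=2)
scores = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0], positive=1)
```

On a dataset of the size reported here, a brute-force distance computation like this one would be slow; a library implementation with an indexed nearest-neighbour search would be the more practical choice.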