Machine learning project using K-Nearest Neighbors algorithm for early breast cancer detection and classification
A machine learning project focused on early breast cancer detection using the K-Nearest Neighbors (KNN) algorithm. This project addresses the critical need for early cancer diagnosis, which significantly improves survival rates and treatment outcomes.
What is breast cancer? Breast cancer is one of the most common types of cancer diagnosed among women worldwide. It occurs when abnormal cells in the breast begin to grow and divide in an uncontrolled way and eventually form a growth (tumour).
Importance of Early Detection: When patients are diagnosed promptly and treated effectively, early detection of breast cancer improves survival, lowers morbidity, and reduces the cost of care.
The project uses the Wisconsin Breast Cancer Dataset, whose features describe characteristics of the cell nuclei present in breast biopsy samples. The target is the `diagnosis` column, coded `B` (benign) or `M` (malignant).
KNN is a non-parametric classification algorithm: it assigns a data point the majority class among its k nearest neighbors in the training data. Nearness is typically measured with the Euclidean distance.
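To make the voting rule concrete, here is a minimal from-scratch sketch in R. The helper `knn_predict_one` and the toy data are hypothetical and serve only as illustration; the project itself relies on `knn()` from the `class` package.

```r
# Minimal KNN classifier for a single query point.
# Hypothetical helper for illustration, not the project's code.
knn_predict_one <- function(train_x, train_y, query, k) {
  # Subtract the query from every training row, column by column
  diffs <- sweep(as.matrix(train_x), 2, query)
  # Euclidean distance from the query to each training point
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among the labels of those neighbors
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy data with made-up values: two features, two classes
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(7, 6), c(6, 7))
train_y <- c("Benign", "Benign", "Benign",
             "Malignant", "Malignant", "Malignant")

pred_a <- knn_predict_one(train_x, train_y, query = c(1.5, 1.5), k = 3)
pred_b <- knn_predict_one(train_x, train_y, query = c(6.5, 6.5), k = 3)
print(c(pred_a, pred_b))
```

Each query lands in the middle of one cluster, so all three of its nearest neighbors share a label and the vote is unanimous.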
The performance of KNN depends on the choice of k, the number of neighbors. For a two-class problem such as this one, an odd k avoids tied votes. We used cross-validation to determine the optimal value of k.
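One way such a search could look is sketched below. The synthetic data, the 5-fold scheme, and the candidate range are assumptions made for illustration, since the source does not say which cross-validation procedure was used.

```r
library(class)  # provides knn()

set.seed(42)

# Synthetic two-class data standing in for the real features (assumption)
n <- 100
x <- rbind(matrix(rnorm(n * 2, mean = 0), ncol = 2),
           matrix(rnorm(n * 2, mean = 3), ncol = 2))
y <- factor(rep(c("Benign", "Malignant"), each = n))

k_values <- seq(1, 21, by = 2)                   # odd candidates only
n_folds <- 5                                     # assumed fold count
folds <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# Mean held-out accuracy across folds for each candidate k
cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
})

# Smallest k that reaches the highest cross-validated accuracy
best_k <- k_values[which.max(cv_accuracy)]
print(data.frame(k = k_values, accuracy = round(cv_accuracy, 3)))
print(best_k)
```

`which.max` returns the first maximum, so ties are resolved in favor of the smaller k. That is a design choice in this sketch, not something the source specifies.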
The model was implemented in R, with data preprocessing followed by hyperparameter tuning over k. The excerpt omits data loading and the train/test split:
# Load required libraries
library(class)  # knn()
library(dplyr)  # select() and %>%
library(caret)
# Data preprocessing (assumes `data` has already been read from the dataset file)
data <- data %>% select(-X)  # Drop the unused column X
data <- data[, -1]           # Remove the ID column, which is now the first column
# Encode the target variable as a labeled factor
data$diagnosis <- factor(data$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
# Hyperparameter tuning: odd candidate values of k from 1 to 49
k_values <- seq(1, 50, by = 2)
best_k <- 7  # Determined through cross-validation
# Train the final KNN model (train/test objects come from a split not shown here)
knn_predictions <- knn(train = train_features, test = test_features,
                       cl = train_labels, k = best_k)
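The call to `knn()` relies on `train_features`, `test_features`, and `train_labels`, which the excerpt does not define. The sketch below shows one way they could be prepared. The 80/20 split, the min-max scaling, and the synthetic data frame with made-up column names are all assumptions, not the project's documented procedure.

```r
set.seed(123)

# Synthetic stand-in for the cleaned data frame (assumption): a diagnosis
# factor followed by two numeric feature columns with made-up names.
data <- data.frame(
  diagnosis = factor(rep(c("Benign", "Malignant"), times = c(60, 40))),
  feature_a = c(rnorm(60, 12, 2), rnorm(40, 18, 3)),
  feature_b = c(rnorm(60, 0.05, 0.01), rnorm(40, 0.09, 0.02))
)

# Hold out 20% of the rows for testing (the split ratio is an assumption)
test_idx <- sample(nrow(data), size = round(0.2 * nrow(data)))
train_raw <- data[-test_idx, -1]  # drop the diagnosis column
test_raw  <- data[test_idx, -1]

# Min-max scale with the training ranges only, so that no information
# from the test rows leaks into preprocessing
mins   <- sapply(train_raw, min)
ranges <- sapply(train_raw, max) - mins
train_features <- as.data.frame(scale(train_raw, center = mins, scale = ranges))
test_features  <- as.data.frame(scale(test_raw, center = mins, scale = ranges))

train_labels <- data$diagnosis[-test_idx]
test_labels  <- data$diagnosis[test_idx]
```

Scaling matters for KNN because Euclidean distance is otherwise dominated by whichever feature has the largest numeric range.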
| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 71 | 2 |
| Actual Malignant | 3 | 38 |
On the 114 test cases in the confusion matrix, the model classified 109 correctly (about 95.6% accuracy). It flagged 2 benign cases as malignant and missed 3 malignant cases.
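As a worked check, standard metrics can be computed directly from the four counts in the confusion matrix. The sketch treats Malignant as the positive class, which is an assumption; the source does not state which class it considers positive.

```r
# Confusion matrix counts from the table: rows = actual, columns = predicted
cm <- matrix(c(71,  2,
                3, 38),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Benign", "Malignant"),
                             predicted = c("Benign", "Malignant")))

# Malignant is treated as the positive class (assumption)
tp <- cm["Malignant", "Malignant"]  # malignant cases correctly flagged
tn <- cm["Benign", "Benign"]        # benign cases correctly cleared

accuracy    <- (tp + tn) / sum(cm)           # (38 + 71) / 114
sensitivity <- tp / sum(cm["Malignant", ])   # 38 / 41, share of cancers caught
specificity <- tn / sum(cm["Benign", ])      # 71 / 73, share of benign cleared
precision   <- tp / sum(cm[, "Malignant"])   # 38 / 40, share of alarms that were real

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

In a screening setting sensitivity is often the metric to watch most closely, since each of the 3 missed malignant cases is a delayed diagnosis.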