
Breast Cancer Detection with KNN

A machine learning project using the K-Nearest Neighbors algorithm for early breast cancer detection and classification

Timeline: 1 week
Status: Completed
[Figure: KNN algorithm performance visualization]
Tags: R · Machine Learning · KNN Algorithm · Data Mining · Healthcare Analytics · Classification

Project Overview

A machine learning project focused on early breast cancer detection using the K-Nearest Neighbors (KNN) algorithm. This project addresses the critical need for early cancer diagnosis, which significantly improves survival rates and treatment outcomes.

Medical Impact: Early detection dramatically improves outcomes: five-year survival rates for breast cancer caught at an early, localised stage approach 100%, compared with less than 30% at late, metastatic stages, making work like this potentially life-saving.

Introduction

What is breast cancer? Breast cancer is one of the most common types of cancer diagnosed among women worldwide. It occurs when abnormal cells in the breast begin to grow and divide in an uncontrolled way and eventually form a growth (tumour).

Importance of Early Detection: Detecting breast cancer early improves survival, lowers morbidity, and reduces the cost of care, provided patients are promptly diagnosed and effectively treated.

KNN Algorithm: The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful classification technique in machine learning. It classifies a data point according to the majority class among its nearest neighbors in feature space.

Dataset

The project uses the Wisconsin Breast Cancer Dataset, which contains various features that describe the characteristics of cell nuclei present in breast cancer biopsies.

Key Features:

  • mean_radius: The mean of distances from the center to points on the perimeter
  • mean_texture: The mean of the standard deviation of gray-scale values
  • mean_smoothness: The mean of local variation in radius lengths
  • Several other statistical features based on cell shape and texture
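
For orientation, here is a minimal sketch of loading and inspecting the data (the filename is hypothetical, as the project's source file is not specified):

# Sketch: loading and inspecting the dataset
data <- read.csv("breast_cancer.csv")  # hypothetical filename
str(data)              # feature names, types, and dimensions
table(data$diagnosis)  # class balance: B = benign, M = malignant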

Methodology

K-Nearest Neighbors Algorithm

KNN is a non-parametric classification algorithm. The basic idea is that it classifies a data point by majority voting among its k nearest neighbors. The distance metric typically used in KNN is the Euclidean distance.
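
For intuition, the decision rule can be written in a few lines of R; this is an illustrative sketch (the helpers euclidean and knn_classify are not from the project code):

# Illustrative sketch of the KNN decision rule
# Euclidean distance between two numeric feature vectors
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Classify one point x by majority vote among its k nearest neighbors
knn_classify <- function(train_x, train_y, x, k) {
  dists   <- apply(train_x, 1, euclidean, b = x)  # distance to each training row
  nearest <- train_y[order(dists)[1:k]]           # labels of the k closest points
  names(which.max(table(nearest)))                # majority class wins
}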

Choosing the K Value

The performance of KNN depends on the choice of k (the number of neighbors). An odd k is typically chosen for binary classification to avoid ties in the vote. We used cross-validation to determine the optimal value of k.
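
In practice this tuning step can be done with caret; a minimal sketch, assuming a train_data data frame like the one built in the implementation section below (object names are illustrative):

# Sketch: 10-fold cross-validation over odd k values
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
knn_fit <- train(diagnosis ~ ., data = train_data,
                 method = "knn",
                 trControl = ctrl,
                 tuneGrid = data.frame(k = seq(1, 49, by = 2)),
                 preProcess = c("center", "scale"))
knn_fit$bestTune  # the k with the best cross-validated accuracy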

Data Preprocessing

  • Removed empty columns and ID column
  • Encoded target variable as factor (Benign/Malignant)
  • Split data into training (80%) and testing (20%) sets
  • Normalized features using center and scale methods

Technical Implementation

The model was implemented in R using comprehensive data preprocessing and hyperparameter tuning:

# Loading necessary libraries
library(class)
library(dplyr)
library(caret)

# Data preprocessing
# (assumes `data` already holds the Wisconsin dataset, e.g. loaded via read.csv)
data <- data %>% select(-X)  # Remove the empty trailing column
data <- data[, -1]           # Remove the ID column; diagnosis is now column 1

# Encoding target variable
data$diagnosis <- factor(data$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

# Splitting data into training (80%) and testing (20%) sets
set.seed(42)  # seed value is illustrative
train_idx  <- createDataPartition(data$diagnosis, p = 0.8, list = FALSE)
train_data <- data[train_idx, ]
test_data  <- data[-train_idx, ]

# Normalizing features (center and scale, fitted on the training set only)
pre            <- preProcess(train_data[, -1], method = c("center", "scale"))
train_features <- predict(pre, train_data[, -1])
test_features  <- predict(pre, test_data[, -1])
train_labels   <- train_data$diagnosis

# Hyperparameter tuning to find the optimal k
k_values <- seq(1, 50, by = 2)  # odd values only, to avoid ties
best_k   <- 7                   # Determined through cross-validation

# Training the final KNN model
knn_predictions <- knn(train = train_features, test = test_features,
                       cl = train_labels, k = best_k)

Overfitting & Underfitting Analysis: We demonstrated that small k values lead to overfitting (predictions track noise in the training data), while large k values cause underfitting (the decision boundary becomes too smooth). Through hyperparameter tuning, we found k = 7 to be optimal.
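
This behaviour can be visualised by scoring every candidate k on the held-out test set; a brief sketch, reusing the objects defined above:

# Sketch: test-set accuracy across the candidate k values
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train_features, test = test_features,
              cl = train_labels, k = k)
  mean(pred == test_data$diagnosis)
})
plot(k_values, acc, type = "b",
     xlab = "k (number of neighbors)", ylab = "Test accuracy")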

Results & Performance

Confusion Matrix

                     Predicted Benign   Predicted Malignant
  Actual Benign             71                    2
  Actual Malignant           3                   38

  • Accuracy: 95.6%
  • Sensitivity: 97.3%
  • Specificity: 92.7%

The model achieved excellent performance, demonstrating high accuracy in distinguishing malignant breast tumours from benign ones.
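
For reference, these metrics can be reproduced with caret's confusionMatrix; a sketch assuming the objects from the implementation above (caret treats the first factor level, Benign, as the positive class, which matches the sensitivity and specificity reported here):

# Sketch: computing the confusion matrix and derived metrics
cm <- confusionMatrix(knn_predictions, test_data$diagnosis)
cm$table                                     # the 2x2 table shown above
cm$overall["Accuracy"]                       # ≈ 0.956
cm$byClass[c("Sensitivity", "Specificity")]  # ≈ 0.973 / 0.927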