TITLE:
Machine Learning Classification of Prostate Cancer Genomic Sequences Using K-Mer and Sequence-Derived Features
AUTHORS:
Kuldeep Rawat, Hirendra Nath Banerjee, Jamie Noble, Saa Naudia Deloatch, Satyendra Banerjee, Sachin Shetty, Soumya Banerjee
KEYWORDS:
Prostate Cancer, DNA Sequence Classification, K-Mer Analysis, Sequence-Derived Features, Random Forest Algorithm, SMOTE, Machine Learning, Health Disparities
JOURNAL NAME:
Computational Molecular Bioscience, Vol.16 No.2, June 24, 2026
ABSTRACT: Prostate cancer (PCa) disproportionately impacts African American men, who experience significantly higher mortality rates and earlier disease onset than other populations. Current diagnostic approaches, including prostate-specific antigen testing and biopsy, lack sufficient specificity and sensitivity, underscoring the need for accurate, molecular-level classification tools. This paper presents a machine learning framework for binary classification of genomic DNA sequences as cancerous or healthy. A dataset of 1684 FASTA-formatted sequences obtained from GenBank (National Library of Medicine) was analyzed, with 1662 sequences retained after quality control filtering. Feature engineering yielded 67 attributes: GC content, Shannon entropy, sequence length, and the frequencies of the 64 trinucleotide k-mers. To address class imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to the training data only. Seven classification algorithms were evaluated using stratified train–test splits, cross-validation, and hyperparameter optimization. Among these models, the optimized Random Forest classifier performed best, with a cross-validation accuracy of 97.2% (±0.006), a weighted F1-score of 0.95, a cancer-class recall of 0.96, and an ROC-AUC of 0.974. Feature importance analysis identified sequence length and Shannon entropy as the most discriminative predictors, followed by specific trinucleotide motifs (TTC, AAC, ACC, and GGG). These results demonstrate the potential of interpretable machine learning approaches for genomic sequence-based PCa classification, offering a promising pathway toward improved, equitable diagnostic tools for high-risk populations.
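As an illustration of the feature engineering step, the following minimal sketch shows one way a 67-attribute vector (sequence length, GC content, Shannon entropy, and 64 trinucleotide frequencies) could be computed from a single DNA sequence. It is not the authors' implementation. The function names are hypothetical, and two details are assumptions the abstract does not state: entropy is taken over the single-nucleotide composition of the sequence, and k-mer windows containing characters other than A, C, G, or T are skipped.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 64 trinucleotides in a fixed (lexicographic) order, so every
# sequence maps to the same feature positions.
TRINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=3)]


def gc_content(seq):
    """Fraction of G and C characters in the sequence (0.0 if empty)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq):
    """Shannon entropy in bits of the single-character composition.

    Assumption: the source does not say how entropy was defined.
    """
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def trinucleotide_frequencies(seq):
    """Relative frequency of each of the 64 trinucleotides.

    Overlapping windows are used; windows with ambiguous bases
    (e.g. N) are skipped, which is an assumption of this sketch.
    """
    valid = set(BASES)
    counts = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= valid
    )
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]


def extract_features(seq):
    """Return a 67-element list: length, GC content, entropy, 64 k-mer frequencies."""
    seq = seq.upper()
    return (
        [float(len(seq)), gc_content(seq), shannon_entropy(seq)]
        + trinucleotide_frequencies(seq)
    )
```

Calling `extract_features` on each retained sequence would give one row of a feature matrix suitable for the classifiers named in the abstract. Resampling with SMOTE would then be applied to the training rows only, as the abstract describes.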