Rebbeck, T.R. (2018) Prostate Cancer Disparities by Race and Ethnicity: From Nucleotide to Neighborhood. Cold Spring Harbor Perspectives in Medicine, 8, a030387. - References

Rebbeck, T.R. (2018) Prostate Cancer Disparities by Race and Ethnicity: From Nucleotide to Neighborhood. Cold Spring Harbor Perspectives in Medicine, 8, a030387.
https://doi.org/10.1101/cshperspect.a030387

has been cited by the following article:

TITLE: Machine Learning Classification of Prostate Cancer Genomic Sequences Using K-Mer and Sequence-Derived Features

AUTHORS: Kuldeep Rawat, Hirendra Nath Banerjee, Jamie Noble, Saa Naudia Deloatch, Satyendra Banerjee, Sachin Shetty, Soumya Banerjee

KEYWORDS: Prostate Cancer, DNA Sequence Classification, K-Mer Analysis, Sequence-Derived, Random Forest Algorithm, SMOTE, Machine Learning, Health Disparities

JOURNAL NAME: Computational Molecular Bioscience, Vol.16 No.2, June 24, 2026

ABSTRACT: Prostate cancer disproportionately impacts African American men, who experience significantly higher mortality rates and earlier disease onset than other populations. Current diagnostic approaches, including prostate-specific antigen testing and biopsy, lack sufficient specificity and sensitivity, underscoring the need for accurate, molecular-level classification tools. This paper presents a machine learning framework for binary classification of genomic DNA sequences as cancerous or healthy. A dataset of 1684 FASTA-formatted sequences obtained from GenBank (National Library of Medicine) was analyzed, with 1662 sequences retained after quality control filtering. Feature engineering yielded 67 attributes, including GC content, Shannon entropy, sequence length, and trinucleotide k-mer frequencies. To address class imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to the training data. Seven classification algorithms were evaluated using stratified train–test splits, cross-validation, and hyperparameter optimization. Among the models, the optimized Random Forest classifier performed best, with a cross-validation accuracy of 97.2% (±0.006), a weighted F1-score of 0.95, a cancer-class recall of 0.96, and an ROC-AUC of 0.974. Feature importance analysis identified sequence length and Shannon entropy as the most discriminative predictors, followed by specific trinucleotide motifs (TTC, AAC, ACC, and GGG). These results demonstrate the potential of interpretable machine learning approaches for genomic sequence-based prostate cancer classification, offering a promising pathway toward improved, equitable diagnostic tools for high-risk populations.
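
The feature engineering step named in the abstract (GC content, Shannon entropy, sequence length, and trinucleotide k-mer frequencies) can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the 67 attributes break down as 3 scalar features plus the 64 possible trinucleotides, that entropy is computed over single-nucleotide composition, and that k-mers containing ambiguous bases such as N are skipped; none of these details is stated in the abstract. All function and feature names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 64 possible trinucleotides, in a fixed order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=3)]


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq):
    """Shannon entropy in bits of the single-character composition.

    Assumption: the abstract does not say which distribution the
    entropy is taken over; per-base composition is used here.
    """
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def trinucleotide_frequencies(seq):
    """Relative frequency of each trinucleotide over overlapping windows.

    Assumption: windows containing anything other than A, C, G, T are skipped.
    """
    valid = set(BASES)
    counts = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= valid
    )
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in TRINUCLEOTIDES}


def extract_features(seq):
    """Build one feature row: 3 scalar features + 64 k-mer frequencies."""
    seq = seq.upper()
    features = {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "shannon_entropy": shannon_entropy(seq),
    }
    for kmer, freq in trinucleotide_frequencies(seq).items():
        features["kmer_" + kmer] = freq
    return features


if __name__ == "__main__":
    row = extract_features("ACGTACGT")
    print(len(row), row["gc_content"], row["shannon_entropy"])
```

Each sequence produces one dictionary with a fixed set of keys, so the rows can be stacked into a feature matrix for a classifier such as the Random Forest mentioned in the abstract. Oversampling with SMOTE would then be applied to the training portion only, as the abstract states.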
