A Dataset-Driven Comparison of Traditional and Advanced Machine Learning Techniques for Phishing Detection in Low-Variance Environments

Authors

  • Konaz Kawa Latif, Koya University
  • Saman Mirza Abdullah, Koya University

DOI:

https://doi.org/10.25195/ijci.v51i2.632

Keywords:

Phishing Detection, Machine Learning, Ensemble Learning, Low-Variance Datasets, Obfuscated URLs

Abstract

Phishing attacks continue to grow rapidly, often using obfuscated and deceptive URL patterns to mimic legitimate websites and evade detection. While traditional Machine Learning (ML) models perform well on benchmark datasets, they often struggle in real-world scenarios where phishing URLs are carefully crafted to resemble authentic domains. This study presents a dataset-driven comparison between traditional ML models—Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Random Forest—and advanced approaches such as Extreme Gradient Boosting (XGBoost), a tuned XGBoost variant, and a soft voting ensemble. Two datasets were used: (i) a high-variance global dataset from Mendeley Data, and (ii) a custom-built local dataset with region-specific phishing URLs designed with minimal alterations (e.g., character substitutions and deceptive subdomains) to simulate low-variance attacks. Preprocessing included feature engineering, variance analysis, and balancing with the Synthetic Minority Oversampling Technique (SMOTE). Experimental results show that Random Forest outperforms other traditional models but still struggles with low-variance phishing URLs. In contrast, advanced models—particularly tuned XGBoost—achieved significantly higher recall (0.99) and strong precision (0.81), while the voting ensemble further improved robustness by combining multiple classifiers. These findings emphasize the importance of realistic datasets and demonstrate that advanced ML strategies are more effective for detecting phishing attempts based on subtle obfuscation. This work contributes by (i) validating advanced ML models under realistic low-variance conditions, and (ii) highlighting precision and recall as more appropriate evaluation metrics than accuracy in cybersecurity.
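As a rough illustration of two ideas in the abstract, the sketch below shows why accuracy can mislead on imbalanced phishing data and how a soft voting ensemble combines per-model probabilities. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the toy labels, and the three sets of model probabilities are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def precision_recall_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, accuracy) for binary labels, 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when nothing is flagged or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


def soft_vote(
    prob_lists: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Average each URL's phishing probability across models, then threshold."""
    n_models = len(prob_lists)
    averaged = [sum(column) / n_models for column in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


# Hypothetical imbalanced sample: 5 phishing URLs among 100.
y_true = [1] * 5 + [0] * 95

# A classifier that never flags anything looks accurate but catches no phishing.
always_legit = [0] * 100
print(precision_recall_accuracy(y_true, always_legit))

# Hypothetical phishing probabilities from three models for four URLs.
model_a = [0.9, 0.4, 0.2, 0.6]
model_b = [0.8, 0.7, 0.1, 0.3]
model_c = [0.7, 0.6, 0.3, 0.3]
print(soft_vote([model_a, model_b, model_c]))
```

In the toy sample, the do-nothing classifier scores 0.95 accuracy with zero recall, which is the kind of gap that motivates reporting precision and recall. In the voting example, the second URL is flagged even though one model scored it below 0.5, because the averaged probability crosses the threshold.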

Author Biographies

Konaz Kawa Latif, Koya University

Department of Software Engineering

Saman Mirza Abdullah, Koya University

Department of Software Engineering

Published

2025-11-13