A Dataset-Driven Comparison of Traditional and Advanced Machine Learning Techniques for Phishing Detection in Low-Variance Environments

Konaz Kawa Latif; Saman Mirza Abdullah

doi:10.25195/ijci.v51i2.632

Authors

Konaz Kawa Latif, Koya University
Saman Mirza Abdullah, Koya University

DOI:

https://doi.org/10.25195/ijci.v51i2.632

Keywords:

Phishing Detection, Machine Learning, Ensemble Learning, Low-Variance Datasets, Obfuscated URLs

Abstract

Phishing attacks continue to grow rapidly, often using obfuscated and deceptive URL patterns to mimic legitimate websites and evade detection. While traditional Machine Learning (ML) models perform well on benchmark datasets, they often struggle in real-world scenarios where phishing URLs are carefully crafted to resemble authentic domains. This study presents a dataset-driven comparison between traditional ML models—Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Random Forest—and advanced approaches such as Extreme Gradient Boosting (XGBoost), a tuned XGBoost variant, and a soft voting ensemble. Two datasets were used: (i) a high-variance global dataset from Mendeley Data, and (ii) a custom-built local dataset with region-specific phishing URLs designed with minimal alterations (e.g., character substitutions and deceptive subdomains) to simulate low-variance attacks. Preprocessing included feature engineering, variance analysis, and balancing with the Synthetic Minority Oversampling Technique (SMOTE). Experimental results show that Random Forest outperforms other traditional models but still struggles with low-variance phishing URLs. In contrast, advanced models—particularly tuned XGBoost—achieved significantly higher recall (0.99) and strong precision (0.81), while the voting ensemble further improved robustness by combining multiple classifiers. These findings emphasize the importance of realistic datasets and demonstrate that advanced ML strategies are more effective for detecting phishing attempts based on subtle obfuscation. This work contributes by (i) validating advanced ML models under realistic low-variance conditions, and (ii) highlighting precision and recall as more appropriate evaluation metrics than accuracy in cybersecurity.
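The abstract's two closing points (soft voting and the preference for precision and recall over accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `soft_vote` and `precision_recall_accuracy`, the three sets of model probabilities, and the ten labels are all hypothetical values chosen for illustration. It assumes each classifier outputs a phishing probability per URL, and that soft voting means averaging those probabilities before thresholding.

```python
from typing import List, Sequence, Tuple


def soft_vote(probabilities: Sequence[Sequence[float]],
              threshold: float = 0.5) -> List[int]:
    """Average each model's phishing probability per URL, then threshold the mean."""
    n_models = len(probabilities)
    n_samples = len(probabilities[0])
    labels = []
    for i in range(n_samples):
        mean_p = sum(model[i] for model in probabilities) / n_models
        labels.append(1 if mean_p >= threshold else 0)
    return labels


def precision_recall_accuracy(y_true: Sequence[int],
                              y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and accuracy for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Hypothetical data: 2 phishing URLs (label 1) among 10, i.e. an imbalanced set.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical per-URL phishing probabilities from three classifiers.
model_a = [0.9, 0.4, 0.2, 0.1, 0.6, 0.1, 0.2, 0.1, 0.3, 0.2]
model_b = [0.8, 0.7, 0.1, 0.2, 0.5, 0.2, 0.1, 0.1, 0.2, 0.1]
model_c = [0.7, 0.6, 0.3, 0.1, 0.7, 0.1, 0.1, 0.2, 0.1, 0.1]

ensemble_pred = soft_vote([model_a, model_b, model_c])
ensemble_scores = precision_recall_accuracy(y_true, ensemble_pred)

# A trivial baseline that labels every URL as legitimate.
baseline_pred = [0] * len(y_true)
baseline_scores = precision_recall_accuracy(y_true, baseline_pred)

print("ensemble (precision, recall, accuracy):", ensemble_scores)
print("baseline (precision, recall, accuracy):", baseline_scores)
```

On this toy data the all-legitimate baseline still reaches 80% accuracy while catching no phishing URLs at all, which is the kind of gap that precision and recall expose and accuracy hides. Note also that the second URL is missed by `model_a` alone (0.4) but recovered once the three probabilities are averaged.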

Author Biographies

Konaz Kawa Latif, Koya University

Department of Software Engineering

Saman Mirza Abdullah, Koya University

Department of Software Engineering

Published

2025-11-13

Issue

Vol. 51 No. 2 (2025): Volume 51 Issue 2 Year 2025

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

IJCI applies the Creative Commons Attribution (CC BY) license to articles. The author of a paper submitted for publication in IJCI holds the CC BY license. Under this Open Access license, the author agrees that anyone may reuse the article in whole or in part for any purpose, including commercial purposes. Anyone may copy, distribute, or reuse the content as long as the author and source are properly cited. This facilitates reuse and ensures that journal content is available for research needs.
If the manuscript contains photos, images, figures, tables, audio files, videos, etc., that the author or the co-authors do not own, IJCI will require the author to provide proof that the owner of that content has given the author written permission to use it and has approved the application of the CC BY license to that content. IJCI provides a form that the author can use to request permission from the owner. If the author does not have the owner's permission, IJCI will ask the author to remove that content and/or replace it with other content that the author owns or has permission to use.
Many authors assume that if they previously published a paper through another publisher, they have the right to reuse that content in their IJCI paper, but that is not necessarily the case; it depends on the license that covers the other paper. The author must determine what rights a specific license grants, that is, whether it enables the author to use the content. The author must obtain written permission from the publisher to use the content in the IJCI paper. The author should not include any content in an IJCI paper without having the right to use it, and should always give proper attribution.
Any accompanying submitted data should state its licensing policies, which should not be more restrictive than CC BY.
IJCI has the right to remove photos, captures, images, figures, tables, illustrations, audio, and video files from a paper before or after publication if this content was included in the author's paper without permission from its owner.
