Automatic Detection of Phishing Attacks in Email using Large Language Models (LLM)
Keywords:
CRISP-DM, Large Language Models, Machine Learning, Natural Language Processing, Phishing
Abstract
This research trained a model based on a Large Language Model (LLM) to detect phishing attacks by analyzing email content. The DistilBERT Transformer model was used, following the CRISP-DM methodology to give the training process a structured life cycle. The procedure comprised preprocessing the ealvaradob/phishing-dataset, tokenizing it, and splitting it into training, validation, and test subsets. Work proceeded in two phases: fine-tuning on specialized data, followed by validation with standard metrics (accuracy, precision, recall, and F1-score). In the training phase, all metrics exceeded 95%. In the final validation on an independent dataset (zefang-liu/phishing-email-dataset), the metrics averaged above 98%, indicating high effectiveness and a small margin of error. The study concludes that the model meets the functional requirements for production deployment and provides evidence supporting the use of Natural Language Processing (NLP) in cybersecurity applications.
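As an illustration of the four metrics named in the abstract, the following minimal sketch computes accuracy, precision, recall, and F1-score for a binary phishing classifier from lists of true and predicted labels. It is not the authors' evaluation code: the function name `binary_metrics`, the `BinaryMetrics` container, the label convention (1 = phishing, 0 = legitimate), and the sample labels are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    """Hypothetical container for the four metrics named in the abstract."""
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> BinaryMetrics:
    """Compute accuracy, precision, recall, and F1 for one positive class.

    Assumption: labels are integers and `positive` marks the phishing class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Made-up labels for illustration only; not results from the study.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

With these sample labels there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75. In phishing detection, precision and recall are worth reporting alongside accuracy because the classes are often imbalanced and the two error types have different costs.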
License
Copyright (c) 2026 José Ernesto Rodríguez Del Toro, Antonio Hernández Domínguez

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.