Email became one of the most pervasive, powerful communication tools during the digital revolution. Attempts to defraud users by deceptively posing as legitimate people or institutions through email became so widespread that it got its own name: phishing.
Today, with the digital world deeply interwoven in our work and personal lives, phishing remains a major threat: it was one of the top three initial infection vectors for ransomware incidents in 2021, and it continues to grow in sophistication and scale. The stakes are only increasing as the losses from phishing rise.
Most phishing cybersecurity defenses combine rules-based email filters and human training to detect a fraudulent email. When filters fail, there is a risk that a human will too, despite training to enhance the detection of a suspicious email.
It takes just one human error to cost an enterprise millions of dollars in losses and time-to-resolution. To reduce breaches, it is critical to keep phishing emails from reaching any inbox in the first place.
Current rules-based systems are limited in their sight. They can only “see” issues that are known, and fraudsters are typically one step ahead of those systems. Filters to catch these issues can only improve after a breach and weakness have been identified, which is too late.
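To make that limitation concrete, here is a minimal sketch of a rules-based filter. The patterns and sample emails are hypothetical and are not taken from any real filter; the point is only that a message matching no known pattern passes through unflagged.

```python
import re

# Hypothetical blocklist: patterns added only after earlier incidents were analyzed.
KNOWN_BAD_PATTERNS = [
    re.compile(r"paypa1\.com", re.IGNORECASE),
    re.compile(r"verify your account immediately", re.IGNORECASE),
]


def rule_based_is_phishing(email_text: str) -> bool:
    """Flag an email only if it matches a pattern that is already known."""
    return any(pattern.search(email_text) for pattern in KNOWN_BAD_PATTERNS)


known_attack = "Please verify your account immediately at http://paypa1.com/login"
novel_attack = "Hi, it's the CFO. I need you to wire the vendor payment before noon."

print(rule_based_is_phishing(known_attack))  # True: matches a known pattern
print(rule_based_is_phishing(novel_attack))  # False: no rule exists for this wording yet
```

The second message contains no URL and no blocklisted phrase, so a filter like this cannot catch it until someone writes a new rule.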
To get ahead of the phishing problem, machines must be able to anticipate weaknesses rather than fall prey to them, and develop enhanced sentiment analysis to keep pace with, and pull ahead of, the fraudsters.
Phishing detection with NVIDIA Morpheus
With NVIDIA Morpheus, our cybersecurity team applied natural language processing (NLP), a popular AI technique, to create a phishing detection application that correctly classified phishing emails at a 99%+ accuracy rate.
Using the Morpheus pipeline for phishing detection, you can use your own models to improve the accuracy further. As you fine-tune the model with new phishing emails that your company receives, the model continues to improve.
Because Morpheus enables large-scale unsupervised learning, you don’t have to rely on rules-based methods that require a URL or suspicious email address to detect phishing. Instead, Morpheus learns from the emails received, making it a more comprehensive, sustainable approach to managing phishing detection.
The cybersecurity team followed the first three steps of a typical AI workflow to develop the phishing detection proof of concept (POC):
- Data prep
- AI modeling
- Simulation and testing
They were able to execute rapidly by using pretrained models. We walk through each step to see how the team approached development.
Data prep
An AI model must be trained with relevant preexisting data before it can be used. Normally, much of the development time centers on working with datasets to make them usable for a model in training.
In this case, the team sourced publicly available English-language phishing datasets that already existed and repurposed them to align with the POC needs, expediting the development process significantly.
The POC required a large dataset of both benign and fraudulent emails for the phishing model to train on. The team started with the SPAM_Assassin dataset, which has a preexisting mix of email data labeled phishing, hard ham, and easy ham. The ham classes are benign emails of various complexities. For the POC, the team simplified the classifications to benign and phishing, combining the hard ham and easy ham emails into a single benign category.
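The label simplification can be sketched in a few lines. This is an illustration only: the label spellings (`easy_ham`, `hard_ham`) and the sample records are assumptions, not the dataset's actual field values.

```python
# Assumed label spellings; the real dataset may store its labels differently.
LABEL_MAP = {
    "phishing": "phishing",
    "easy_ham": "benign",
    "hard_ham": "benign",
}


def simplify_label(original_label: str) -> str:
    """Collapse the three source labels into the two classes used for the POC."""
    key = original_label.strip().lower().replace(" ", "_")
    return LABEL_MAP[key]


# Placeholder records: (email text, original label).
records = [
    ("Your mailbox is full, click here to upgrade", "phishing"),
    ("Lunch on Friday?", "easy ham"),
    ("Limited-time offer from our newsletter", "hard ham"),
]

simplified = [(text, simplify_label(label)) for text, label in records]
print([label for _, label in simplified])  # ['phishing', 'benign', 'benign']
```

An unknown label raises a `KeyError` here, which is usually preferable to silently assigning a class.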
While the SPAM_Assassin dataset was a helpful starting point, the model required significantly more training data. The team incorporated the Enron Emails dataset as a benign data source and the phishing class of the Clair dataset as a phishing source. The model was trained and evaluated on various mixes of these datasets.
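One way to assemble such mixes is to sample a chosen number of emails from each source and shuffle the result. The sketch below assumes each source has already been reduced to `(text, label)` pairs; the function name `build_mix`, the sample sizes, and the placeholder records are all hypothetical.

```python
import random


def build_mix(sources, sizes, seed=0):
    """Sample sizes[name] records from each source and shuffle them together.

    sources: dict mapping a source name to a list of (text, label) pairs.
    sizes:   dict mapping a source name to the number of records to draw.
    """
    rng = random.Random(seed)  # fixed seed so a given mix is reproducible
    mix = []
    for name, count in sizes.items():
        pool = sources[name]
        mix.extend(rng.sample(pool, min(count, len(pool))))
    rng.shuffle(mix)
    return mix


# Placeholder data standing in for the three sources named in the text.
sources = {
    "spam_assassin": [(f"sa benign {i}", "benign") for i in range(5)]
    + [(f"sa phishing {i}", "phishing") for i in range(5)],
    "enron": [(f"enron {i}", "benign") for i in range(10)],
    "clair": [(f"clair {i}", "phishing") for i in range(10)],
}

mix = build_mix(sources, {"spam_assassin": 4, "enron": 6, "clair": 6})
print(len(mix))  # 16
```

Changing the `sizes` argument produces a different blend of sources, which is one simple way to compare how a model behaves across mixes.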
AI modeling
ML development centers on training and evaluating a model with data until the model learns to perform the requested function on its own.
Instead of creating a new AI model from scratch, the team sourced a pretrained BERT model to refine for the POC. BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning model for NLP. It is designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context.
The team fine-tuned this existing model for phishing detection by training and evaluating it with the earlier datasets.
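For readers who want a feel for what fine-tuning involves, here is a minimal sketch. It is not the team's training code and does not use Morpheus itself. It assumes the Hugging Face `transformers` package and PyTorch are installed, and the checkpoint name `bert-base-uncased`, the hyperparameters, and the function names are illustrative assumptions.

```python
from typing import Iterator, List, Sequence, Tuple

LABEL_TO_ID = {"benign": 0, "phishing": 1}


def iter_batches(
    texts: Sequence[str], labels: Sequence[int], batch_size: int
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield consecutive (texts, labels) mini-batches."""
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        yield list(texts[start:end]), list(labels[start:end])


def fine_tune(texts, labels, model_name="bert-base-uncased",
              epochs=1, batch_size=8, learning_rate=2e-5):
    """Fine-tune a pretrained BERT checkpoint for two-class email classification.

    The checkpoint name and hyperparameters are assumptions for illustration.
    """
    # Imported here so the batching helper works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABEL_TO_ID)
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    model.train()
    for _ in range(epochs):
        for batch_texts, batch_labels in iter_batches(texts, labels, batch_size):
            encoded = tokenizer(batch_texts, padding=True, truncation=True,
                                max_length=128, return_tensors="pt")
            # Passing labels makes the model return a cross-entropy loss.
            outputs = model(**encoded, labels=torch.tensor(batch_labels))
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return tokenizer, model
```

Calling `fine_tune(email_texts, [LABEL_TO_ID[label] for label in email_labels])` downloads the pretrained weights and updates them on the labeled emails. Starting from pretrained weights is what lets a comparatively small labeled corpus go a long way.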
Simulation and testing
This is the stage where the model is trained, tested, and evaluated for phishing detection.
The SPAM_Assassin, Clair, and Enron datasets were each split at random into training and validation sets. Then, the BERT model was trained to classify emails from various mixes of the sets as benign or phishing. When the refined BERT model was tested on a validation dataset that combined Enron, Clair, and SPAM_Assassin, it classified 99.68% of the emails correctly.
Our tests showed over 99% accuracy of the trained BERT model in detecting phishing or benign emails when used on the validation datasets.
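The random split and the accuracy figure can be illustrated with a short sketch. The 20% validation fraction, the seed, and the sample predictions are assumptions for illustration; the source does not state the split ratio the team used.

```python
import random


def train_validation_split(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (training, validation)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Placeholder records standing in for labeled emails.
records = [(f"email {i}", "phishing" if i % 2 else "benign") for i in range(10)]
training, validation = train_validation_split(records)
print(len(training), len(validation))  # 8 2

predicted = ["phishing", "benign", "benign", "phishing"]
actual = ["phishing", "benign", "phishing", "phishing"]
print(accuracy(predicted, actual))  # 0.75
```

Accuracy alone can hide an imbalance between classes, so in practice it is often reported alongside per-class precision and recall.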
AI can play a significant role in addressing the cybersecurity issues that organizations face every single day, but many are intimidated by the prospect of developing AI capabilities in-house.
NVIDIA is democratizing AI by making it simple and efficient to develop for any enterprise across any use case. This POC was an example of how resources available in NVIDIA Morpheus can shorten and simplify AI-application development for enterprise developers looking to enhance their cybersecurity arsenal.
To accelerate your enterprise’s cybersecurity even more, use the pretrained phishing model available with NVIDIA Morpheus today. The NVIDIA Morpheus AI cybersecurity framework not only demonstrates the transformative capabilities of applying AI to address cybersecurity threats but also makes it easy for an organization to incorporate AI with development cycles like the one described earlier. With more data to train the model, it becomes even stronger.
Morpheus is an open AI framework for developers to implement cybersecurity-specific inference pipelines. Morpheus provides a simple interface for security developers and data scientists to create and deploy end-to-end pipelines that address cybersecurity, information security, and general log-based use cases. This series is focused on highlighting the various use cases and implementations of Morpheus that can be relevant to any technical cybersecurity strategy.