Designing a New Net for Phishing Detection with NVIDIA Morpheus

May 16, 2022

By Gorkem Batmaz and Prachi Goel

Email became one of the most pervasive, powerful communication tools during the digital revolution. Attempts to defraud users by deceptively posing as legitimate people or institutions through email became so widespread that it got its own name: phishing.

Today, with the digital world deeply interwoven in our work and personal lives, phishing remains a major threat. It was one of the top three initial infection vectors for ransomware incidents in 2021, and it continues to grow in sophistication and scale. The stakes are only increasing as the losses from phishing rise.

Phishing today

Most phishing cybersecurity defenses combine rules-based email filters and human training to detect a fraudulent email. When filters fail, there is a risk that a human will too, despite training to enhance the detection of a suspicious email.

It takes just one human error to cost an enterprise millions of dollars in losses and time-to-resolution. To reduce breaches, it is critical to keep phishing emails from reaching any inbox in the first place.

Current rules-based systems are limited in their sight. They can only “see” issues that are known, and fraudsters are typically one step ahead of those systems. Filters to catch these issues can only improve after a breach and weakness have been identified, which is too late.

To get ahead of the phishing problem, machines must be able to anticipate weaknesses, rather than fall prey to them, and develop enhanced sentiment analysis to keep pace with and then pull ahead of the fraudsters.

Phishing detection with NVIDIA Morpheus

NVIDIA Morpheus, now generally available for download from NVIDIA NGC and the NVIDIA/Morpheus GitHub repo, is an open AI framework for implementing cybersecurity-specific inference pipelines.

With NVIDIA Morpheus, our cybersecurity team applied natural language processing (NLP), a popular AI technique, to create a phishing detection application that correctly classified phishing emails at a 99%+ accuracy rate.

Using the Morpheus pipeline for phishing detection, you can use your own models to improve the accuracy further. As you fine-tune the model with new phishing emails that your company receives, the model continues to improve.

Because Morpheus enables large-scale unsupervised learning, you don’t have to rely on rules-based methods that require a URL or suspicious email address to detect phishing. Instead, Morpheus learns from the emails received, making it a more comprehensive, sustainable approach to managing phishing detection.

Approach

The cybersecurity team followed the first three steps of a typical AI workflow to develop the phishing detection proof of concept (POC):

Data prep
ML modeling
Simulation and test

They were able to execute rapidly by using pretrained models. We walk through each step to see how the team approached development.

Data prep

An AI model must be trained with relevant, preexisting data. Normally, much of the development time goes into working with datasets to make them usable for a model in training.

In this case, the team sourced publicly available English-language phishing datasets that already existed and repurposed them to align with the POC needs, expediting the development process significantly.

The POC required a large dataset of both benign and fraudulent emails for the phishing model to train on. The team started with the SPAM_Assassin dataset, which has a preexisting mix of email data labeled phishing, hard ham, and easy ham. The ham classes are benign emails of various complexities. For our purposes, we simplified the classifications to benign and phishing, combining the hard ham and easy ham emails into a single benign category.
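The label simplification described above can be sketched in a few lines of Python. This is only an illustration, not the team's actual preprocessing code: the label strings (`easy_ham`, `hard_ham`, `phishing`), the helper name `to_binary_label`, and the sample emails are all assumptions made for the example.

```python
# Hypothetical label strings; the labels in the real dataset files may differ.
BENIGN_SOURCE_LABELS = {"easy_ham", "hard_ham"}
PHISHING_SOURCE_LABELS = {"phishing"}


def to_binary_label(source_label: str) -> str:
    """Map a three-way source label onto the two-class benign/phishing scheme."""
    normalized = source_label.strip().lower()
    if normalized in BENIGN_SOURCE_LABELS:
        return "benign"
    if normalized in PHISHING_SOURCE_LABELS:
        return "phishing"
    # Fail loudly on anything unexpected instead of guessing a class.
    raise ValueError(f"Unexpected label: {source_label!r}")


# Tiny made-up sample, for illustration only.
sample = [
    ("Meeting moved to 3pm", "easy_ham"),
    ("Re: quarterly numbers attached", "hard_ham"),
    ("Verify your account now", "phishing"),
]

relabeled = [(text, to_binary_label(label)) for text, label in sample]
for text, label in relabeled:
    print(f"{label:>8}: {text}")
```

Raising an error on unknown labels, rather than defaulting to one class, keeps mislabeled rows from silently skewing the training data.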

While the SPAM_Assassin dataset was a helpful starting point, the model required significantly more training data. The team incorporated the Enron Emails dataset as a benign data source and the phishing class of the Clair dataset as a phishing source. The model was trained and evaluated on various mixes of these datasets.

ML modeling

ML development centers on training and evaluating a model with data until it learns to perform the requested function on its own.

Instead of creating a new AI model from scratch, the team chose a pretrained BERT model to refine for the POC. BERT (Bidirectional Encoder Representations from Transformers) is an open-source language model for NLP. It is designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context.

The team fine-tuned this existing model for phishing detection by training and evaluating it with the earlier datasets.

Simulation and test

This is the stage where the model is trained, tested, and evaluated on its ability to detect phishing.

The SPAM_Assassin, Clair, and Enron datasets were each split at random into training and validation sets. The BERT model was then trained to classify emails from various mixes of these sets as benign or phishing. When the fine-tuned BERT model was tested on a validation dataset that combined Enron, Clair, and SPAM_Assassin, it classified 99.68% of the emails correctly.
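A random per-dataset split followed by a combined validation set might look like the following sketch. The 20% validation fraction, the fixed seed, the helper name `split_dataset`, and the placeholder records are assumptions for illustration; the post does not state the proportions or tooling the team used.

```python
import random


def split_dataset(records, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, validation) lists."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder (text, label) records standing in for the three sources.
sources = {
    "spam_assassin": [
        (f"sa-{i}", "benign" if i % 2 else "phishing") for i in range(10)
    ],
    "enron": [(f"enron-{i}", "benign") for i in range(10)],
    "clair": [(f"clair-{i}", "phishing") for i in range(10)],
}

train_set, validation_set = [], []
for name, records in sources.items():
    # Split each source on its own so every source appears in both sets.
    train_part, validation_part = split_dataset(
        records, validation_fraction=0.2, seed=42
    )
    train_set.extend(train_part)
    validation_set.extend(validation_part)

print(len(train_set), "training records,", len(validation_set), "validation records")
```

Splitting each source separately before combining means the validation set contains emails from every dataset, which matches the combined validation set described above.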

Our tests showed over 99% accuracy of the trained BERT model in detecting phishing or benign emails when used on the validation datasets.
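For readers who want to reproduce this kind of measurement, accuracy here is simply the fraction of validation emails whose predicted class matches the label. The sketch below is a minimal illustration with made-up labels and predictions, not the team's evaluation code. It also computes recall on the phishing class, an addition of ours rather than a metric reported in this post, because a single accuracy figure can hide missed phishing emails when benign emails dominate the data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive="phishing"):
    """Fraction of truly positive emails that the model flagged as positive."""
    positive_predictions = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not positive_predictions:
        raise ValueError(f"no examples with label {positive!r}")
    return sum(p == positive for p in positive_predictions) / len(positive_predictions)


# Made-up labels and predictions, for illustration only.
y_true = ["phishing", "benign", "benign", "phishing", "benign"]
y_pred = ["phishing", "benign", "benign", "benign", "benign"]

print("accuracy:", accuracy(y_true, y_pred))
print("phishing recall:", recall(y_true, y_pred))
```

In this toy example, four of five predictions are correct, yet only one of the two phishing emails is caught, which is why it can help to read accuracy alongside a per-class metric.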

Summary

AI can play a significant role in addressing the cybersecurity issues that organizations face every single day, but many are intimidated by the prospect of developing AI capability in-house.

NVIDIA is democratizing AI by making it simple and efficient to develop for any enterprise across any use case. This POC was an example of how resources available in NVIDIA Morpheus can shorten and simplify AI-application development for enterprise developers looking to enhance their cybersecurity arsenal.

To accelerate your enterprise’s cybersecurity even more, use the pretrained phishing model available with NVIDIA Morpheus today. The NVIDIA Morpheus AI cybersecurity framework not only demonstrates the transformative capabilities of applying AI to address cybersecurity threats but also makes it easy for an organization to incorporate AI with development cycles like the one described earlier. With more data to train the model, it becomes even stronger.

To start developing today, learn more about NVIDIA Morpheus or access it through the NGC NVIDIA Morpheus download or NVIDIA/Morpheus GitHub repo.

Morpheus is an open AI framework for developers to implement cybersecurity-specific inference pipelines. Morpheus provides a simple interface for security developers and data scientists to create and deploy end-to-end pipelines that address cybersecurity, information security, and general log-based pipelines. This series is focused on highlighting the various use cases and implementations of Morpheus that can be relevant to any technical cybersecurity strategy.

About the Authors

About Gorkem Batmaz
Gorkem Batmaz is a senior data scientist on the NVIDIA Morpheus team at NVIDIA. His focus is on applying GPU-accelerated high-performance analytics to solve cybersecurity challenges. He developed ML/NLP-based predictive maintenance, phishing-DGA-Malware detection, asset classification, periodicity detection, and Generative AI for cyber solutions. Before joining the Morpheus team, Gorkem worked in the cybersecurity team for NVIDIA's autonomous cars division. Before joining NVIDIA in 2011, he worked at Motorola and Alcatel-Lucent. He holds an M.S. in Engineering & Technology Management and a B.S. in Electrical Electronics Engineering from Bogazici University.

About Prachi Goel
Prachi Goel is a product marketing manager supporting accelerated data science at NVIDIA. Before joining NVIDIA, she was a product manager for the Smart Cities group at Cisco Systems. She holds her MBA from the Anderson School of Management at the University of California, Los Angeles.
