Data Science

cyBERT: Neural network, that’s the tech; To free your staff from, bad regex

Cybersecurity logs are generated across an organization and cover endpoints (e.g, computers, laptops, servers), network communications, and perimeter devices (e.g., VPN nodes, firewalls). Using a conservative estimate for a company of 1000 employee devices, a small organization can expect to generate over 100 GB/day in log traffic with a peak EPS (Event Per Second) of over 22,000. 

Today, organizations collect, store, and (attempt to) analyze more data than ever before. Logs are heterogeneous in source, format, and time. In order to analyze the data, it first needs to be parsed. Actionable fields must be extracted from raw logs. Today, logs are parsed using complex heuristics and regular expressions. These heuristics are inflexible and prone to failure if a log deviates at all from its specific format.

This blog shows how interpreting cybersecurity logs as a natural language has the potential to render traditional, regex-based parsing mechanisms obsolete and introduce flexibility and resilience at a new level to typical log parsing architectures.

Parsing logs efficiently and correctly is critical to any security operations center, and cyBERT allows users to accomplish this without the need to develop extensive regex libraries. Further, as we increase the speed of pre- and post-processing with cyBERT, the ability to replay archived logs through new parsers will be possible, allowing security analysts the ability to quickly extract new information from older logs as needed.

For more details, results and next steps, read the RAPIDS blog here.

Related resources

Discuss (0)