Magika, Google’s Ultra-Fast File Detector

AI keeps finding new uses, and the latest one is file type detection. Oh yes! Google recently open-sourced Magika, an AI-based file type identification system. It aims to help us (well, our tools) accurately detect both binary and textual file types.

For a long time, Unix-like systems such as Linux have shipped with libmagic and the file utility, which have served as the de facto standard for identifying file types for over 50 years!

Web browsers, code editors, and countless other programs rely on file type detection to decide how to properly display a file. For example, modern IDEs use file type detection to choose which syntax highlighting scheme to use when the developer begins typing in a new file.

Accurate detection of file types is a difficult problem, as each file format has a different structure, or no structure at all. It’s even harder for text formats and programming languages because they share very similar constructs. Until now, libmagic and most other file type identification software have relied on a handcrafted collection of heuristics and custom rules to detect each file format.
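To get a feel for what such hand-written rules look like, here is a minimal Python sketch of signature-based detection. The signature table and the detect_by_magic function are my own illustration, not libmagic's actual rule set, which is far larger and more elaborate.

```python
# Illustrative only: a tiny hand-written signature table.
# These are well-known file signatures, but this is NOT libmagic's rule set.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),  # also matches ZIP-based formats such as .docx
    (b"\x7fELF", "elf"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def detect_by_magic(data: bytes) -> str:
    """Return the label of the first signature matching the start of data."""
    for prefix, label in MAGIC_SIGNATURES:
        if data.startswith(prefix):
            return label
    # No signature matched: fall back to a crude text/binary guess.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown binary"
    return "text"


print(detect_by_magic(b"%PDF-1.7\n"))                # pdf
print(detect_by_magic(b"def main():\n    pass\n"))   # text
print(detect_by_magic(b"[section]\nkey = value\n"))  # text
```

Notice that the Python snippet and the INI-style config both come out as plain "text". Telling text formats apart takes many more rules, and that is exactly where this approach gets painful.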

This manual approach is both time-consuming and error-prone, which makes it far from ideal, especially for security applications. Reliable detection is particularly hard to build there, because attackers constantly try to evade it with specially crafted payloads.

To solve this problem and provide fast and accurate detection of file types, Google has developed Magika, a new AI-based file type detector.

Under the hood, Magika uses a customized, highly optimized deep learning model designed and trained using Keras that weighs only about 1 MB. Magika also uses ONNX as its inference engine to ensure that files are identified in milliseconds, almost as quickly as a non-AI-based tool, even on a CPU.

In terms of performance, Magika, thanks to its AI model and large training dataset, is able to outperform other existing tools by about 20% when evaluated on a benchmark of 1 million files encompassing more than 100 file types. Broken down by file type, the gains are even greater on text files, including code files and configuration files that other tools may struggle with.

Magika is used internally at Google on a large scale to help improve the security of users of its services, for example by routing files within Gmail, Drive, or Safe Browsing to the appropriate security scanners and content filters.

Looking at a weekly average of hundreds of billions of files, Google found that Magika improved the accuracy of identifying file types by 50% compared to their previous system based on simple rules. This increase in accuracy allowed them to scan 11% more files with their specialized AI malicious document scanners, and they were able to reduce the number of unidentified files to 3%.

Magika’s next integration will be in VirusTotal, where it will complement the platform’s existing Code Insight feature, which uses Google’s generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the efficiency and accuracy of the platform. Given the collaborative nature of VirusTotal, this integration benefits the whole cybersecurity ecosystem, and that’s pretty good news for everyone.

By making Magika’s sources available, Google aims to help other software companies improve their file identification accuracy and provide researchers with a reliable method to identify file types at a very large scale. Magika’s code and model are available for free on GitHub under the Apache 2.0 license.

If you’re interested, you can try Magika’s web demo.

Magika can also be quickly installed as a standalone command-line utility and Python library from PyPI using pip, by simply typing:

pip install magika

And no need for a GPU!
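Once installed, calling it from Python could look like the sketch below. Treat it as a starting point rather than the official way to use the library: the Magika class and its identify_bytes method come from the package, but the name of the attribute holding the detected label is an assumption on my part, since it has differed between releases (output.label in newer ones, output.ct_label in older ones). The label_of and identify helpers are my own names.

```python
def label_of(result) -> str:
    """Extract the detected content-type label from a Magika result.

    Assumption: the label lives in `output.label` (newer releases) or
    `output.ct_label` (older releases). Check the docs for your version.
    """
    output = result.output
    label = getattr(output, "label", None)
    if label is None:
        label = getattr(output, "ct_label", "unknown")
    return str(label)


def identify(data: bytes) -> str:
    """Run Magika on raw bytes and return the detected label as a string."""
    # Imported lazily so label_of() stays usable without the package.
    from magika import Magika

    return label_of(Magika().identify_bytes(data))


if __name__ == "__main__":
    try:
        print(identify(b"def main():\n    print('hello')\n"))
    except ImportError:
        print("magika is not installed; run: pip install magika")
```

In a real script you would create the Magika object once and reuse it for every file, since loading the model on each call is wasted work.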

To learn more about how to use it, I invite you to visit the Magika documentation.

Mohamed SAKHRI

I'm the creator and editor-in-chief of Tech To Geek. Through this little blog, I share with you my passion for technology. I specialize in various operating systems such as Windows, Linux, macOS, and Android, focusing on providing practical and valuable guides.
