Adversarial Malware Detection: Lessons Learned from PDF-based Attacks

Background and Problem Definition

PDF documents are so widely used that they have become a pervasive channel for malware distribution. Attackers embed malicious code within PDFs, which can contain static elements (e.g., images and text), dynamic elements (e.g., JavaScript and forms) and embedded signatures.

In this research, Maiorca et al. investigate the threat that PDF documents pose as a vector for malware infections. Unlike research that focuses solely on detection methods, it takes the perspective of a bad actor who understands those detection methods and exploits that knowledge when developing malware. In other words, it analyses how adversaries can turn their knowledge of the techniques that malware detectors deploy into advanced PDF-based malware that evades common detection methods (including machine learning-based systems).

Summary of Approach and Proposed Methodology

The researchers begin with a comprehensive review of the approaches used to develop PDF malware. They then review advanced techniques for malware detection, including machine learning-based systems. Building on those findings, they show how machine learning-based detection can be evaded with different adversarial attacks. Lastly, they propose possible solutions to mitigate such attacks.

I found this research approach very useful and novel in its claim that machine learning-based malware detection systems should be built to account for attacks tailored against them. It reinforces the principle of security by design, which requires malware detection systems to anticipate attackers' use of such evasion methods.

As an overview, PDFs share a general structure consisting of a header, a body made up of objects, a cross-reference (X-Ref) table and a trailer. Malicious code embedded in a PDF exploits Adobe Reader (the most common PDF reader) and its components and plugins when they parse the file. At least 27 major known CVE vulnerabilities were identified between 2008 and 2018. The vulnerability types include API and buffer overflows, memory corruption, malformed data and type confusion. These vulnerabilities are exploited through 3 primary exploitation channels: JavaScript, ActionScript and file embedding (TIFF and EXE).
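The structure described above can be illustrated with a short Python sketch. It is not taken from the paper: the sample bytes are hand-written for illustration (the X-Ref offsets are not accurate), and the function name `outline_pdf` is hypothetical. It only locates the header, the numbered objects, the X-Ref table and the trailer with regular expressions, which is far less than what a real parser does.

```python
import re

# Hand-written, PDF-like bytes for illustration only. The offsets in the
# X-Ref table are NOT accurate, so a real reader may reject or repair it.
SAMPLE = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (void 0;) >>\nendobj\n"
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000060 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    b"startxref\n120\n%%EOF\n"
)


def outline_pdf(data: bytes) -> dict:
    """Report which of the main structural parts are present (hypothetical helper)."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # Body objects are introduced as "<number> <generation> obj".
    objects = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)
    return {
        "version": header.group(1).decode() if header else None,
        "object_ids": [int(num) for num, _gen in objects],
        # "xref" at the start of a line (this does not match "startxref").
        "has_xref": re.search(rb"^xref\b", data, re.M) is not None,
        "has_trailer": b"trailer" in data,
        "has_eof": data.rstrip().endswith(b"%%EOF"),
    }


if __name__ == "__main__":
    print(outline_pdf(SAMPLE))
```

In this toy file, object 1 carries an /OpenAction entry pointing at object 2, which holds JavaScript. This mirrors the JavaScript exploitation channel mentioned above, although the script here is harmless.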

To counter PDF malware, the researchers note that machine learning-based systems are commonly used for detection. Such systems follow a 3-step approach of pre-processing, feature extraction and classification (including training) before labelling the file as legitimate or malicious. The 3rd step uses various machine learning techniques (Markov models, decision trees, Bayesian classifiers and Random Forests) to accomplish classification. However, the research also argues that no existing machine learning-based PDF malware detector can detect all embedded malicious code, because a complete analysis is impractical and computationally expensive. Some trade-offs and partial coverage are required.
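A minimal Python sketch of the 3-step pipeline follows. It is an illustration under stated assumptions, not a detector from the paper: the keyword list, the toy training files and all function names are hypothetical, and a simple nearest-centroid classifier is swapped in for the Markov, decision tree, Bayesian and Random Forest models named above so that the example needs only the standard library.

```python
import re
from collections import Counter

# Hypothetical keyword set; real detectors use far richer features.
KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/EmbeddedFile", b"/Page"]


def preprocess(data: bytes) -> bytes:
    """Step 1: normalise names by decoding #xx hex escapes (/J#61vaScript)."""
    # Simplification: applied to the whole file, including stream contents.
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def extract_features(data: bytes) -> list:
    """Step 2: count how often each keyword appears as a whole name."""
    return [len(re.findall(re.escape(k) + rb"\b", data)) for k in KEYWORDS]


def train_centroids(samples, labels):
    """Step 3a: average the feature vectors of each class."""
    counts = Counter(labels)
    sums = {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def classify(centroids, x):
    """Step 3b: pick the class whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


# Toy training set (hand-written fragments, not real PDF files).
TRAIN = [
    (b"%PDF-1.4 /Page /Page /Page", "legitimate"),
    (b"%PDF-1.4 /Page /Page /OpenAction", "legitimate"),
    (b"%PDF-1.4 /OpenAction /JavaScript /JS /Page", "malicious"),
    (b"%PDF-1.4 /J#61vaScript /JS /EmbeddedFile /Page", "malicious"),
]

features = [extract_features(preprocess(data)) for data, _ in TRAIN]
model = train_centroids(features, [label for _, label in TRAIN])

suspect = b"%PDF-1.4 /OpenAction /JS /J#61vaScript /Page"
verdict = classify(model, extract_features(preprocess(suspect)))
```

The pre-processing step matters here: without decoding the hex escape, the obfuscated name /J#61vaScript would not be counted as /JavaScript, which is a small example of the trade-offs between completeness and cost mentioned above.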

Next, the research analyzes 3 possible adversarial attacks against PDF malware detectors: evasion, poisoning and privacy attacks. Evasion attacks aim to violate the integrity of the detector (that is, to have malicious files classified as legitimate) by manipulating the structure of the input PDF file at test time. The paper suggests that malware writers may inject content into various sections of the PDF structure to achieve this. Poisoning attacks aim to reduce malware detection capabilities by injecting mislabeled samples into the classifier training set, while privacy attacks aim to steal information about that training set.
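To make the evasion idea concrete, here is a minimal Python sketch under strong assumptions: the detector is a hypothetical linear scorer over keyword counts whose weights the attacker knows, and the attacker may only add content, leaving the malicious payload untouched. The weights, threshold and function names are invented for illustration and do not come from the paper.

```python
# Hypothetical linear detector: positive weights push towards "malicious",
# negative weights towards "legitimate". All values are made up.
WEIGHTS = {"/JavaScript": 2.0, "/OpenAction": 1.5, "/Page": -0.5, "/Font": -0.25}
THRESHOLD = 0.0


def score(counts):
    """Weighted sum of keyword counts."""
    return sum(w * counts.get(name, 0) for name, w in WEIGHTS.items())


def is_malicious(counts):
    return score(counts) > THRESHOLD


def evade_by_addition(counts, max_added=50):
    """Greedily add the most benign-looking feature until the file passes.

    Nothing is removed, so the malicious features (and behaviour) remain.
    """
    crafted = dict(counts)
    filler = min(WEIGHTS, key=WEIGHTS.get)  # feature with the lowest weight
    added = 0
    while is_malicious(crafted) and added < max_added:
        crafted[filler] = crafted.get(filler, 0) + 1
        added += 1
    return crafted, added


original = {"/JavaScript": 1, "/OpenAction": 1, "/Page": 1}
crafted, added = evade_by_addition(original)
```

With these made-up weights, the original sample scores 3.0 and is flagged, while the crafted sample reaches the threshold after a handful of extra /Page entries and is no longer flagged, even though its /JavaScript and /OpenAction counts are unchanged.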

After identifying attacks that take advantage of machine learning concepts, the research proposes several countermeasures for more adaptable malware detection. These include robust detection algorithms (e.g., classifiers that can be retrained to correctly detect manipulated samples) that account for intentional modifications of the attack samples.
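One way to picture the retraining countermeasure is the Python sketch below. It is an assumption-laden illustration rather than the paper's method: a plain perceptron stands in for the classifier, the feature vectors are hypothetical counts of (/JavaScript, /OpenAction, /Page), and the manipulated sample is simply a malicious file padded with extra pages.

```python
def train_perceptron(samples, labels, max_epochs=1000):
    """Classic perceptron; labels are +1 (malicious) or -1 (legitimate)."""
    w, b = [0] * len(samples[0]), 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Misclassified (or on the boundary): move towards the sample.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical feature vectors: [/JavaScript, /OpenAction, /Page] counts.
X = [[0, 0, 3], [1, 1, 1], [0, 0, 8], [2, 1, 2], [0, 1, 5]]
Y = [-1, 1, -1, 1, -1]

model = train_perceptron(X, Y)

# A malicious file padded with benign-looking pages (feature addition).
evasive = [1, 1, 8]
before = predict(model, evasive)

# Retraining: add the manipulated sample with its true label and refit.
retrained = train_perceptron(X + [evasive], Y + [1])
after = predict(retrained, evasive)
```

In this toy setting the first model learns to treat a high page count as evidence of legitimacy, so the padded sample slips through; after retraining it is detected. Retraining on one manipulated sample does not by itself guarantee robustness against other manipulations, which is why the choice of which modifications to anticipate matters.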

Strengths and Weaknesses of the Methodology

In my review of this research article, I found the approach and methodology both comprehensive and sufficiently exhaustive. It investigates the issue of PDF malware by first providing an overview of the PDF structure and then reviewing how current detection methods are deployed to detect malware in PDFs. It further considers how adversaries, who are likely conversant in malware detection methods (including machine learning-based methods), would leverage that knowledge to neutralize such detection. The research then explores and proposes how developers of malware detectors may build countermeasures against such adversarial malware.

However, the research appears limited to theoretical considerations. The proposed countermeasures have not been developed and tested in a real-world setting, where they may prove impractical, e.g., because of expensive computational overheads and resource requirements during detection runs. Nevertheless, the concept and its emphasis on the principle of "security by design" are critical in the constantly evolving cat-and-mouse game of malware attack and detection.

MAIORCA, D., BIGGIO, B. and GIACINTO, G. (2019) ‘Towards Adversarial Malware Detection: Lessons Learned from PDF-based Attacks’, ACM Computing Surveys, 52(4), pp. 1–36. doi:10.1145/3332184.