Automatic Analysis of Malware Behavior using Machine Learning

Monday, December 28. 2009
In the last couple of years, several honeypot solutions to automatically "collect" malware samples were developed. With these tools, it is possible to obtain copies of malware samples without any human interaction. As a result, we are able to collect quite a few malware samples per day, which then also need to be analyzed. Thus, several sandbox solutions were developed that automate the analysis step by performing dynamic, behavior-based analysis. The result of the dynamic analysis is typically a report that summarizes the observed behavior. The next logical step is to use that information to perform malware classification and malware clustering: at the end of that process, we can then obtain information about which samples perform basically the same kind of activity. We can then automatically find variants of well-known threats, identify new malware families, and reduce the manual effort needed to analyze the large number of incoming malware samples.

In the last couple of months, we worked on malware classification and malware clustering. The results are summarized in a technical report. In the report, we introduce a learning-based framework for automatic analysis of malware behavior. To apply this framework in practice, it suffices to collect a large number of malware samples and monitor their behavior using a sandbox environment. By embedding the observed behavior in a vector space whose dimensions reflect behavioral patterns, we are able to apply learning algorithms, such as clustering and classification, to the analysis of malware behavior. Both techniques are important for the automated processing of malware samples, and we show in several experiments that our techniques significantly improve on previous work in this area. For example, the concept of prototypes allows for efficient clustering and classification, while also enabling a security researcher to focus manual analysis on prototypes instead of all malware samples. Moreover, we introduce a technique to perform behavior-based analysis in an incremental way that avoids the run-time and memory overhead inherent to previous approaches.
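To make the idea of embedding and prototypes more concrete, here is a minimal sketch in Python. It is not the implementation from the report: the q-gram embedding, the Euclidean distance, the greedy prototype selection, and all names (`embed`, `distance`, `extract_prototypes`, `assign`, the operation names in the toy reports) are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt

def embed(report, q=2):
    """Map a sequence of operation names to a normalized q-gram count vector."""
    grams = Counter(tuple(report[i:i + q]) for i in range(len(report) - q + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}

def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def extract_prototypes(vectors, max_dist):
    """Greedily pick prototypes until every vector lies within max_dist of one."""
    if not vectors:
        return []
    prototypes = [0]  # start with the first vector (an arbitrary choice)
    nearest = [distance(v, vectors[0]) for v in vectors]
    while max(nearest) > max_dist:
        far = nearest.index(max(nearest))  # vector farthest from all prototypes
        prototypes.append(far)
        nearest = [min(d, distance(v, vectors[far]))
                   for d, v in zip(nearest, vectors)]
    return prototypes

def assign(vector, vectors, prototypes):
    """Return the index of the prototype closest to the given vector."""
    return min(prototypes, key=lambda p: distance(vector, vectors[p]))

# Toy behavior reports (hypothetical operation names).
reports = [
    ["open_file", "write_file", "close_file"],
    ["open_file", "write_file", "close_file", "open_file"],
    ["connect", "send", "recv"],
    ["connect", "send", "recv", "send"],
]
vectors = [embed(r) for r in reports]
prototypes = extract_prototypes(vectors, max_dist=1.0)
unknown = embed(["connect", "send", "recv", "recv"])
closest = assign(unknown, vectors, prototypes)
```

In this sketch, an analyst would only need to inspect the reports listed in `prototypes`, and a new sample is compared against the prototypes instead of against every collected sample.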

Malicious software, so-called malware, poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts on the Internet are infected with malware in the form of computer viruses, Internet worms, and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at the file level, the dynamic analysis of malware binaries at run-time provides an instrument for characterizing and defending against the threat of malicious software.
In this article, we propose a framework for automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both clustering and classification, we propose an incremental approach for behavior-based analysis, capable of processing the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing accurate discovery and discrimination of novel malware variants.
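The incremental approach can be pictured with a small Python sketch. It only illustrates the general idea and is not the algorithm from the report: behavior is assumed to be already embedded as fixed-length numeric tuples, the clustering step is a simple greedy grouping, and the names `nearest`, `process_batch`, and `max_dist` are hypothetical.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def nearest(vec, prototypes):
    """Return (index, distance) of the closest prototype, or (None, inf)."""
    best, best_d = None, float("inf")
    for i, p in enumerate(prototypes):
        d = dist(vec, p)
        if d < best_d:
            best, best_d = i, d
    return best, best_d

def process_batch(batch, prototypes, max_dist):
    """Classify a batch against known prototypes, then cluster the rest.

    `prototypes` is extended in place with one prototype per new cluster.
    Returns the prototype index assigned to each vector in `batch`.
    """
    labels = [None] * len(batch)
    rejected = []
    # Phase 1 (classification): match against prototypes from earlier batches.
    for i, vec in enumerate(batch):
        idx, d = nearest(vec, prototypes)
        if idx is not None and d <= max_dist:
            labels[i] = idx
        else:
            rejected.append(i)
    # Phase 2 (clustering): group the rejected vectors among themselves;
    # each new group contributes one new prototype.
    first_new = len(prototypes)
    for i in rejected:
        idx, d = nearest(batch[i], prototypes[first_new:])
        if idx is not None and d <= max_dist:
            labels[i] = first_new + idx
        else:
            prototypes.append(batch[i])
            labels[i] = len(prototypes) - 1
    return labels

# Two toy "daily" batches of already embedded behavior vectors.
prototypes = []
labels_day1 = process_batch([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)], prototypes, 0.5)
labels_day2 = process_batch([(0.1, 0.9), (5.0, 5.0)], prototypes, 0.5)
```

Each batch is compared only against the stored prototypes, never against all previously seen samples, which is where the savings in run-time and memory would come from in a scheme like this.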

The full technical report is available online. It is joint work with Konrad Rieck, Philipp Trinius, and Carsten Willems.

AV Tracker

Thursday, October 22. 2009
A couple of days ago, the website "AV Tracker" went online, which publishes information about various automated analysis systems. The idea is that an attacker uploads a binary to an analysis system and waits for the sample to be executed; the binary then phones home with information about its environment to a server under the attacker's control. The collected information is then published at "AV Tracker", exposing details of the analysis systems. Besides some well-known AV companies, CWSandbox and Anubis were also affected.

We analyzed the binary and found that it sends a simple HTTP request in which all extracted information is encoded. An example of an analysis report generated by one of the samples is available online. This can be defeated by blocking access to the reporting server or by regularly changing the IP address of the analysis systems, but in the end this will be some kind of arms race again.
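Blocking access to the reporting server could look like the following Python sketch of an egress check in front of the sandbox. The host name `reporting.example.net` is a placeholder, not the real server, and `is_blocked` is a hypothetical helper, not part of CWSandbox or Anubis.

```python
from urllib.parse import urlsplit

# Placeholder blocklist; the real reporting host is not named here.
BLOCKED_HOSTS = {"reporting.example.net"}

def is_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL targets a blocked host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)
```

A check like this only helps until the reporting server moves to a new host name or address, which is exactly the arms race mentioned above.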

Some other interesting information is also embedded in the binary. When extracting the strings from the sample, the following text becomes visible (some information is hidden by dots):
This is Peter Kl....... fuck ...... fuck the world fuck you all!
I was once working with ...... and was a white hat, now I am the worst mean motherfucker black hat and I am selling the source code of ...... .. :D
I am with the SinowalWhistler developers, funny days, aren't ;) and fuck ..... they don't have no idea :D bitches

A related article was also published today under the title "A black hat loses control".

Thread Graphs for Visualizing Malware Behavior

Tuesday, August 25. 2009
The last blog post dealt with our recent research on visualizing malware behavior. Now a quick update on the thread graphs we generate for visualizing malware behavior: since tree maps display nothing about the sequence of operations, we use another presentation format to visualize the temporal behavior of the individual threads of a sample. A thread graph can be regarded as a behavioral fingerprint of the sample that represents the temporal order of executed system commands and the different threads spawned by a binary. The x-axis represents the time (sequence of performed actions), while the y-axis indicates the operation/section of the performed action. An analyst can then study this behavior graph to quickly learn more about the actions of each individual thread.

The following two pictures show examples of this kind of visualization:

In the left-hand picture, we can see that one thread is responsible for the majority of the sample's operations. This thread performs many registry operations and initially performs many network- and system-related operations (operations 90-140). Additionally, two more threads are spawned, but they perform only a limited number of operations during the analysis phase. The thread graph for the malware sample on the right-hand side is completely different, and an analyst can get a quick overview of what actions a given sample performs.
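The data behind a thread graph as described above can be derived from a behavior report with a few lines of Python. This is a sketch under assumptions: the report is taken to be a list of `(thread_id, section, operation)` tuples in execution order, and the function name and the example events are made up for illustration.

```python
from collections import defaultdict

def thread_graph(events):
    """Turn ordered (thread_id, section, operation) events into plot data.

    Returns (series, y_index): series[thread_id] is a list of (x, y) points,
    where x is the position in the overall sequence of actions and y is the
    row assigned to the (section, operation) pair.
    """
    y_index = {}
    series = defaultdict(list)
    for x, (thread_id, section, operation) in enumerate(events):
        # Assign rows in order of first appearance.
        y = y_index.setdefault((section, operation), len(y_index))
        series[thread_id].append((x, y))
    return dict(series), y_index

# Hypothetical excerpt of a behavior report.
events = [
    (1, "registry", "open_key"),
    (1, "registry", "query_value"),
    (2, "network", "connect"),
    (1, "registry", "open_key"),
    (2, "network", "send"),
]
series, y_index = thread_graph(events)
```

Plotting each entry of `series` as its own line then gives one curve per thread, with time on the x-axis and the operation/section on the y-axis.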

"Visual Analysis of Malware Behavior Using Treemaps and Thread Graphs"

Friday, August 21. 2009
I continue the series on recent and upcoming papers with a paper we will publish at VizSec'09 entitled "Visual Analysis of Malware Behavior Using Treemaps and Thread Graphs". In recent years, we have seen a lot of progress in the area of automated malware analysis. Nowadays, tools such as CWSandbox, Anubis, ThreatExpert, or Norman Sandbox are available. These tools analyze a given binary and generate a report that summarizes the behavior observed while executing the sample. Such reports are often quite long: for example, it is not uncommon for a CWSandbox report to be longer than 100 lines. An analyst thus has to read the whole report in order to understand what a given sample is doing. In this paper, we present an approach to visualize the behavior report with treemaps and behavior graphs (i.e., visualizing the behavior of the individual threads over time). This helps to get a quick overview of what a given sample does. In addition, samples from one malware family have similar-looking treemaps and behavior graphs.

As an example, consider the following three pictures, each showing the treemap generated for a distinct sample of the Bagle worm:

Each picture shows a treemap of the behavior: the x-axis depicts the type of action performed, e.g., whether the sample performed actions related to the filesystem, the registry, or the network. The y-axis divides the actions into operations, e.g., whether it was a read or write access to the registry. As you can see, the behavior of the Bagle worm is (more or less) consistent across different samples from the same family. Below you can find the visualization of two Swizzor samples and one Allaple sample.

Samples from the same family have a similar visualization, while samples from different families look different. This could help an analyst to quickly identify if the sample is interesting or just another small variant of a well-known family. This research will be integrated in the frontend of

Abstract: We study techniques to visualize the behavior of malicious software (malware). Our aim is to help human analysts to quickly assess and classify the nature of a new malware sample. Our techniques are based on a parametrized abstraction of detailed behavioral reports automatically generated by sandbox environments. We then explore two visualization techniques: treemaps and thread graphs. We argue that both techniques can effectively support a human analyst (a) in detecting maliciousness of software, and (b) in classifying malicious behavior.
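The treemap layout described in this post (type of action along the x-axis, operations within each type along the y-axis) can be sketched in Python as follows. This is a simple slice-and-dice layout written for illustration, not the code used in the paper; that rectangle sizes are proportional to the number of observed actions is an assumption, as are the function name and the sample data.

```python
from collections import Counter, defaultdict

def treemap_layout(actions):
    """Lay out (section, operation) actions as rectangles in the unit square.

    Sections split the x-axis, operations split the y-axis within a section.
    Returns a list of (section, operation, x, y, width, height) tuples.
    """
    counts = defaultdict(Counter)
    for section, operation in actions:
        counts[section][operation] += 1
    total = len(actions)
    rects = []
    x = 0.0
    for section in sorted(counts):
        section_total = sum(counts[section].values())
        width = section_total / total
        y = 0.0
        for operation in sorted(counts[section]):
            height = counts[section][operation] / section_total
            rects.append((section, operation, x, y, width, height))
            y += height
        x += width
    return rects

# Hypothetical action counts extracted from a behavior report.
actions = (
    [("filesystem", "read")] * 2
    + [("filesystem", "write")] * 2
    + [("registry", "read")] * 3
    + [("network", "connect")]
)
rects = treemap_layout(actions)
```

Two samples with similar behavior produce similar rectangle lists, which is why treemaps of samples from the same family tend to look alike.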

Malicious PDFs Analysis Continued

Monday, January 12. 2009
After my initial posting about the possibility of analyzing PDF files with CWSandbox, we received a few more such samples. In all cases, the PDF file exploits a vulnerability in Acrobat Reader once the file is opened. With the help of CWSandbox, it is possible to observe this exploit and also the actions of the malware after the compromise (e.g., downloading additional malware from another server). Please find below three additional examples of such reports:

If you happen to have more malicious PDFs, please submit them at :-)