Automatic Analysis of Malware Behavior using Machine Learning

Monday, December 28. 2009
CWSandbox
In the last couple of years, several honeypot solutions to automatically "collect" malware samples were developed. With these tools, it is possible to obtain copies of malware samples without any human interaction. As a result, we are able to collect quite a few malware samples per day, which then also need to be analyzed. Thus, several sandbox solutions were developed that automate the analysis step by performing dynamic, behavior-based analysis. The result of the dynamic analysis is typically a report that summarizes the observed behavior. The next logical step is to use that information to perform malware classification and malware clustering: at the end of that process, we can then obtain information about which samples perform basically the same kind of activity. We can then automatically find variants of well-known threats, identify new malware families, and reduce the manual effort needed to analyze the large number of incoming malware samples.

In the last couple of months, we worked on malware classification and malware clustering. The results are summarized in a technical report. In the article, we introduce a learning-based framework for automatic analysis of malware behavior. To apply this framework in practice, it suffices to collect a large number of malware samples and monitor their behavior using a sandbox environment. By embedding the observed behavior in a vector space whose dimensions reflect behavioral patterns, we are able to apply learning algorithms, such as clustering and classification, to the analysis of malware behavior. Both techniques are important for the automated processing of malware samples, and we show in several experiments that our techniques significantly improve on previous work in this area. For example, the concept of prototypes allows for efficient clustering and classification, while also enabling a security researcher to focus manual analysis on prototypes instead of all malware samples. Moreover, we introduce a technique to perform behavior-based analysis in an incremental way that avoids the run-time and memory overhead inherent to previous approaches.
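To make the idea of embedding and prototypes a bit more concrete, here is a minimal sketch in Python. It is only an illustration and not the implementation from the report: the q-gram features, the Euclidean distance, the greedy farthest-first selection, and all names (`embed`, `distance`, `select_prototypes`, `max_dist`) are assumptions chosen for this example.

```python
from collections import Counter
from math import sqrt


def embed(operations, q=2):
    """Map a sequence of observed operations to a normalized q-gram count vector."""
    grams = Counter(tuple(operations[i:i + q]) for i in range(len(operations) - q + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def select_prototypes(vectors, max_dist):
    """Greedy farthest-first selection (assumed heuristic): add prototypes
    until every vector lies within max_dist of some prototype."""
    prototypes = [0]
    dist = [distance(v, vectors[0]) for v in vectors]
    while max(dist) > max_dist:
        far = dist.index(max(dist))
        prototypes.append(far)
        dist = [min(d, distance(v, vectors[far])) for d, v in zip(dist, vectors)]
    return prototypes


# Hypothetical behavior reports, reduced to sequences of operation names.
reports = [
    ["open_file", "write_file", "create_process"],
    ["open_file", "write_file", "create_process"],
    ["connect", "send", "recv"],
]
vectors = [embed(r) for r in reports]
prototype_indices = select_prototypes(vectors, max_dist=0.5)
```

In this toy data set the first two reports are identical, so two prototypes are enough to represent all three samples, and an analyst would only need to look at those two.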

Abstract
Malicious software — so called malware — poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts in the Internet are infected with malware in form of computer viruses, Internet worms and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at file level, the dynamic analysis of malware binaries during run-time provides an instrument for characterizing and defending against the threat of malicious software.
In this article, we propose a framework for automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both, clustering and classification, we propose an incremental approach for behavior-based analysis, capable to process the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing an accurate discovery and discrimination of novel malware variants.
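The incremental idea from the abstract can be sketched roughly as follows. This is a strongly simplified illustration under our own assumptions, not the algorithm from the report: reports are reduced to sets of behavior features, the Jaccard distance is used for comparison, and every report that matches no known prototype simply becomes a new prototype instead of being clustered properly. The names `jaccard_distance`, `process_batch`, and `max_dist` are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two feature sets (0.0 means identical)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def process_batch(batch, prototypes, max_dist=0.5):
    """Classify each report against the known prototypes; reports that match
    none of them start a new class. `prototypes` is updated in place, so it
    can be carried over from one batch (e.g. one day) to the next."""
    assignments = {}
    for name, features in batch.items():
        best = min(
            prototypes,
            key=lambda p: jaccard_distance(features, prototypes[p]),
            default=None,
        )
        if best is not None and jaccard_distance(features, prototypes[best]) <= max_dist:
            assignments[name] = best      # known behavior: classification
        else:
            prototypes[name] = features   # novel behavior: new class
            assignments[name] = name
    return assignments


# Hypothetical feature sets for two consecutive batches.
prototypes = {}
day1 = {"s1": {"a", "b", "c"}, "s2": {"a", "b", "c", "d"}, "s3": {"x", "y"}}
day2 = {"s4": {"x", "y", "z"}}
result_day1 = process_batch(day1, prototypes)
result_day2 = process_batch(day2, prototypes)
```

The point of the sketch is that the second batch is only compared against the stored prototypes, not against every sample seen so far.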

The full technical report is available at http://honeyblog.org/junkyard/paper/malheur-TR-2009.pd. It is joint work with Konrad Rieck, Philipp Trinius, and Carsten Willems. The word cloud was generated using http://www.wordle.net/.

AV Tracker

Thursday, October 22. 2009
CWSandbox
A couple of days ago, the website "AV Tracker" went online, which publishes information about various automated analysis systems. The idea is that the attacker uploads a binary to an analysis system and waits for the sample to be executed; the binary then phones home some information to a server under the control of the attacker. The collected information is then published at "AV Tracker", exposing information about the analysis systems. Besides some well-known AV companies, CWSandbox and Anubis were also affected.

We analyzed the binary and found that it sends a simple HTTP request, in which all extracted information is encoded. An example of an analysis report generated by one of the samples is http://anubis.iseclab.org/?action=result&task_id=361b5a8ee7235954252b02d33b3a7d24. This can be defeated by blocking access to the reporting server or by regularly changing the IP address of the analysis systems, but in the end this will be some kind of arms race again.
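Blocking access to the reporting server could, for example, be done with a simple host check in front of the sandbox's outbound HTTP traffic. The following Python sketch only illustrates the idea; the host name `reporting.example` and the function `is_allowed` are hypothetical and not taken from any of the affected systems.

```python
from urllib.parse import urlsplit

# Hypothetical host name of the reporting server (placeholder, not the real one).
BLOCKED_HOSTS = {"reporting.example"}


def is_allowed(url, blocked_hosts=BLOCKED_HOSTS):
    """Return False if the URL points at a blocked host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return not any(host == b or host.endswith("." + b) for b in blocked_hosts)


print(is_allowed("http://reporting.example/collect?ip=192.0.2.1"))
print(is_allowed("http://example.org/"))
```

As noted above, such a list only helps as long as the attacker does not move the reporting server, so it is one more step in the arms race rather than a fix.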

Some other interesting information is also embedded in the binary. When extracting the strings from the sample, the following text becomes visible (some information is hidden by dots):
This is Peter Kl....... fuck ...... fuck the world fuck you all!
I was once working with ...... and was a white hat, now I am the worst mean motherfucker black hat and I am selling the source code of ...... .. :D
I am with the SinowalWhistler developers, funny days, aren't ;) and fuck ..... they don't have no idea :D bitches

A related article was also published today at http://www.viruslist.com/en/weblog under the title "A black hat loses control".

$645.00 ...

Thursday, September 10. 2009
... is the amount I am worth in the underground economy, at least according to Symantec's new website on which they advertise (in a somewhat entertaining way) Norton 2010 products. Here are the results when I take the risk assessment:
[...] In the underground economy, you're really worth about $645.00. And that's on a good day.
Your entire digital life could go on the auction block for as little as $10.96, whether you like it or not.

How they compute these numbers, and on what methodology or measurements they are based, remains completely unclear; after all, it is just some kind of marketing. But the movies are funny, and perhaps they can serve as some kind of security awareness campaign. The main drawback is that the website is almost completely built on top of Flash and JavaScript - how about not using all these techniques next time? In some recent measurements we found that the vast majority of web surfers still have an unpatched version of Flash installed, so better teach them to regularly update their system next time...

Thread Graphs for Visualizing Malware Behavior

Tuesday, August 25. 2009
CWSandbox
The last blog post dealt with our recent research on visualizing malware behavior. Now a quick update on the thread graphs we generate for visualizing malware behavior: since tree maps display nothing about the sequence of operations, we use another presentation format to visualize the temporal behavior of the individual threads of a sample. A thread graph can be regarded as a behavioral fingerprint of the sample that represents the temporal order of executed system commands and the different threads spawned by a binary. The x-axis represents the time (sequence of performed actions), while the y-axis indicates the operation/section of the performed action. An analyst can then study this behavior graph to quickly learn more about the actions of each individual thread.
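The data behind such a thread graph is easy to derive from a report. The following Python sketch is only an illustration under assumed inputs: the event format (a list of `(thread_id, category)` pairs in execution order), the category names, and the function `thread_graph` are hypothetical and not the CWSandbox report format.

```python
def thread_graph(events, categories):
    """Return {thread_id: [(x, y), ...]}, where x is the position of an
    operation in the overall sequence and y is the row of its category."""
    row = {c: i for i, c in enumerate(categories)}
    series = {}
    for x, (thread_id, category) in enumerate(events):
        series.setdefault(thread_id, []).append((x, row[category]))
    return series


# Hypothetical report: (thread id, operation category) in execution order.
events = [(1, "registry"), (1, "network"), (2, "file"), (1, "registry")]
categories = ["file", "registry", "network"]
graph = thread_graph(events, categories)
```

Each thread then becomes one series of points that can be handed to any plotting library, with one color per thread.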

The following two pictures show examples of this kind of visualization:


In the left-hand picture, we can see that one thread is responsible for the majority of operations of the sample. This thread performs many registry operations and initially performs many network- and system-related operations (operations 90-140). Additionally, two more threads are spawned, but they perform only a limited number of operations during the analysis phase. The thread graph for the malware sample on the right side is completely different, and an analyst can get a quick overview of what actions a given sample performs.

"Towards Proactive Spam Filtering"

Friday, July 31. 2009
A common technique employed by spammers is to send spam mails with the help of botnets. In a typical setting, the spammer uses so-called template-based spamming: the attacker sends the bots a spam template that describes the structure of the spam message to be sent. Furthermore, the attacker sends meta-data like a recipient list, a subject list, and a list of URLs that are used to fill in variables in the template. The bots then construct an email based on the template and the meta-data, and send this email to the targets. As a result, the actual work of handling the SMTP communication is moved from the control server to the bots. Nowadays this technique is used by most large spam botnets, like Waledac, Bobax, Rustock, and Cutwail, as Joe Stewart explained in detail.
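Conceptually, filling in such a template is nothing more than a mail merge. The following Python sketch shows only this string expansion step; the template syntax, the variable names, and the function `expand` are made up for illustration and do not correspond to the format of any particular botnet.

```python
from itertools import cycle
from string import Template


def expand(template, recipients, subjects, urls):
    """Fill the template once per recipient, cycling through subjects and URLs."""
    tpl = Template(template)
    subject, url = cycle(subjects), cycle(urls)
    return [
        tpl.substitute(recipient=r, subject=next(subject), url=next(url))
        for r in recipients
    ]


# Hypothetical template and meta-data.
messages = expand(
    "To: $recipient\nSubject: $subject\n\nVisit $url",
    ["a@example.org", "b@example.org", "c@example.org"],
    ["Hello", "Hi"],
    ["http://example.com/1"],
)
```

Because all messages of a spam run come from the same template, they share a fixed skeleton and differ only in the variable parts.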

Since spammers nowadays use such a tactic, we can also collect spam mails in a more efficient way: instead of waiting at end users' mailboxes or spamtraps for mail messages to arrive and then deciding whether or not they are spam, we directly interact with the servers that are used to send spam messages. The basic idea is that we execute spambots, i.e., malicious software dedicated to sending spam emails, in a controlled (honeypot) environment and collect all email messages sent by the bots. This enables us to directly interact with botnet control servers to collect current spam messages sent by a specific botnet.

We describe this idea in more detail in a short paper that was published at DIMVA'09. The paper is also available on this blog.

Abstract: With increasing security measures in network services, remote exploitation is getting harder. As a result, attackers concentrate on more reliable attack vectors like email: victims are infected using either malicious attachments or links leading to malicious websites. Therefore efficient filtering and blocking methods for spam messages are needed. Unfortunately, most spam filtering solutions proposed so far are reactive, they require a large amount of both ham and spam messages to efficiently generate rules to differentiate between both. In this paper, we introduce a more proactive approach that allows us to directly collect spam message by interacting with the spam botnet controllers. We are able to observe current spam runs and obtain a copy of latest spam messages in a fast and efficient way. Based on the collected information we are able to generate templates that represent a concise summary of a spam run. The collected data can then be used to improve current spam filtering techniques and develop new venues to efficiently filter mails.
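To give a rough impression of what a template derived from collected messages could look like, here is a deliberately naive Python sketch. It is not the method from the paper: it assumes that all messages of a spam run split into the same number of whitespace-separated tokens, and the function `infer_template` and the `<VAR>` placeholder are our own inventions for this example.

```python
def infer_template(messages, placeholder="<VAR>"):
    """Keep tokens that are identical in all messages and replace
    positions that vary with a placeholder."""
    token_lists = [m.split() for m in messages]
    if len({len(t) for t in token_lists}) != 1:
        raise ValueError("this sketch only handles messages with equal token counts")
    return " ".join(
        column[0] if len(set(column)) == 1 else placeholder
        for column in zip(*token_lists)
    )


# Hypothetical messages collected from one spam run.
collected = [
    "Buy pills at http://a.example now",
    "Buy watches at http://b.example now",
]
template = infer_template(collected)
```

A real spam run would need a proper alignment of messages of different lengths, but even this toy version shows how the fixed skeleton of a run can be separated from its variable parts.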