Data Set For Malware Clustering/Classification

Friday, January 29. 2010
About one month ago I blogged about our research on malware clustering and classification. We have now also released the full data set from our experiments, such that other people can reproduce the results and compare our approach to theirs. You can find all information at http://pi1.informatik.uni-mannheim.de/malheur/, together with a description of the different data.

Quick overview of the data:
Our reference data set is extracted from our large database of malware binaries maintained at CWSandbox. The malware binaries have been collected over a period of three years from a variety of sources. From the overall database, we select binaries which have been assigned to a known class of malware by the majority of six independent anti-virus products. We append the overall anti-virus label to the filename of each report. Although anti-virus labels suffer from inconsistency, we expect the selection using different scanners to be reasonable consistent and accurate. To compensate for the skewed distribution of classes, we discard classes with less than 20 samples and restrict the maximum contribution of each class to 300 binaries. The selected malware binaries are then executed and monitored using CWSandbox, resulting in a total of 3.133 behavior reports in MIST format.

The application data set consists of seven chunks of malware binaries obtained from the anti-malware vendor Sunbelt Software. The binaries correspond to malware collected during seven consecutive days in August 2009 and originate from a variety of sources. Sunbelt Software uses these very samples to create and update signatures for their VIPRE anti-malware product as well as for their security data feed ThreatTrack. The complete test data set consists of 33.698 behavior reports in MIST format.

The full technical report is available at http://honeyblog.org/junkyard/paper/malheur-TR-2009.pdf.

Update: I changed the terms within the description to use the correct description.