Data Set For Malware Clustering/Classification

About one month ago I blogged about our research on malware clustering and classification. We have now also released the full data set from our experiments, so that other people can reproduce the results and compare our approach to theirs. You can find all information at, together with a description of the different data.

Quick overview of the data:
Our reference data set is extracted from our large database of malware binaries maintained at CWSandbox. The malware binaries have been collected over a period of three years from a variety of sources. From the overall database, we select binaries which have been assigned to a known class of malware by the majority of six independent anti-virus products. We append the overall anti-virus label to the filename of each report. Although anti-virus labels suffer from inconsistency, we expect the selection using different scanners to be reasonably consistent and accurate. To compensate for the skewed distribution of classes, we discard classes with fewer than 20 samples and restrict the maximum contribution of each class to 300 binaries. The selected malware binaries are then executed and monitored using CWSandbox, resulting in a total of 3,133 behavior reports in MIST format.
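To make the selection procedure concrete, here is a minimal Python sketch of the three steps described above: majority labeling, dropping small classes, and capping large ones. This is not the code we used. The function names, the input layout (a dict mapping a binary's id to the list of labels returned by the scanners), the reading of "majority" as a strict majority, and the use of random sampling to enforce the cap are all assumptions made for illustration. It also assumes that labels from different vendors have already been mapped to common class names.

```python
import random
from collections import Counter, defaultdict


def majority_label(labels):
    """Return the label given by a strict majority of scanners, else None.

    `labels` holds one entry per scanner; None means "no detection".
    Treating "majority" as more than half of all scanners is an assumption.
    """
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count > len(labels) / 2 else None


def select_reference_set(scan_results, min_class_size=20,
                         max_class_size=300, seed=0):
    """Group binaries by majority label, then balance the class sizes.

    `scan_results` is a hypothetical input format: binary id -> list of
    scanner labels. Random sampling for the cap is an assumption.
    """
    by_class = defaultdict(list)
    for binary_id, labels in scan_results.items():
        label = majority_label(labels)
        if label is not None:
            by_class[label].append(binary_id)

    rng = random.Random(seed)
    selected = {}
    for label, binaries in by_class.items():
        if len(binaries) < min_class_size:
            continue  # discard classes with fewer than 20 samples
        if len(binaries) > max_class_size:
            binaries = rng.sample(binaries, max_class_size)  # cap at 300
        selected[label] = sorted(binaries)
    return selected
```

With six scanners, a binary needs at least four matching labels to be kept under this reading; a 3-3 split is dropped.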

The application data set consists of seven chunks of malware binaries obtained from the anti-malware vendor Sunbelt Software. The binaries correspond to malware collected during seven consecutive days in August 2009 and originate from a variety of sources. Sunbelt Software uses these very samples to create and update signatures for their VIPRE anti-malware product as well as for their security data feed ThreatTrack. The complete test data set consists of 33,698 behavior reports in MIST format.

The full technical report is available at

Update: I corrected the terms used in the description.



  1. George Terry says:

    CWSandBox is probably the only malware binary that uses a combination of automated static analysis and behavioral analysis techniques, so one does not have to set up a lab or sandnet to analyze suspicious binaries, thus eliminating the risk of infecting a network during the analysis.

  2. totalhash says:

    We thought we would release some code that implements the pehash algorithm for malware clustering. Wondering what other folks think of the clustering method itself as well as the code designed to match it?

