Contents

Problem

Current Methods for Determining the Genealogy of Binaries are Inconclusive

Versioning and fingerprinting algorithms (see "Related Work") provide methods for comparing binary files in order to identify specific data characteristics which may be indicative of similarities between the files. No work in either of these two broad areas of study purport to have developed a formal, global method for evaluating the similarity of binary files. Alex Aiken's Winnowing work provides a method for determining similarity within 33% of the local lower bound, given uniform identically distributed input. Sabre Security's BinDiff algorithm employs control flow graphs and heuristics to identify regions of code within (disassembled) binary files which are similar. By building upon these distinct ideas, and others, it may be possible to construct an algorithm which is not subject to the bias of local comparison or the assumption of uniform identically distributed input.

Determining the genealogy, or derivation, of binary files is of immediate interest to developers involved in the design and execution of intrusion detection/prevention, compatibility testing, and debugging systems. It is also be applicable to GPL (http://en.wikipedia.org/wiki/GNU_General_Public_License) violation, and other intellectual property disputes.

Solution

Determine Current Problem Constraints and Expose Implicit Assumptions

This work provides initial findings derived from a technical evaluation of Sabre's BinDiff Plugin for the IDA Disassembler. The Windows XP binary files were compared to the Windows Vista binary files. Windows Vista (http://en.wikipedia.org/wiki/Windows_Vista) is publicly known to have been derived from Windows XP. Limiting assumptions inherent in the design of existing tools were documented.

This project was completed with the support of Alex Sotirov and Determina (http://determina.com).

Technical Objectives

Comparison of Windows XP and Vista Binaries

Extract and Format

The Windows Vista and Windows XP binary files were extracted from their respective installation CDs. A script designed to unpack all compressed files within the Windows XP code base was utilized for Windows XP. Changes in compression formats preventing this script from being used to extract the compressed files within Windows Vista. Instead, VMWare's VMWare Workstation (http://www.vmware.com/) was used to virtualize the installation of Windows Vista and extract the requisite files. Once all files were extracted, the headers in each file were examined to detect the portable executable format (http://en.wikipedia.org/wiki/Portable_Executable). Files in the portable executable format were assumed to be binary files to be considered.

Exact file content matches were eliminated from within XP and Vista. Sabre Security's BinDiff software compares pairs of binary files. Files in XP and Vista were matched based upon their current filenames and, when no match was found, based upon their original filename, when this information was available within the file header.

Disassemble

Binary files were disassembled using the IDA Pro Disassembler (http://www.datarescue.com/idabase/index.htm). Before disassembly, if available from the Microsoft Symbol Server service (http://support.microsoft.com/kb/311503), the relevant Symbols were downloaded and associated with the file. Microsoft Symbol Server provides a markup for many Windows XP and Vista binary files (during this step, it was discovered that symbols for some files, including the files which provide Digital Rights Management (http://en.wikipedia.org/wiki/Digital_Rights_Management) features, were not made available by Microsoft), providing function name and location data which can be associated with disassembled versions of the files.

Compare Files

Pairs of binary files were compared using Sabre Security's BinDiff plugin (http://www.sabre-security.com/products/bindiff.html) for the IDA Pro Disassembler. BinDiff employs control flow graphs and heuristics to identify regions of code within (disassembled) binary files which are similar. BinDiff first attempts to determine the locations of the functions within each file (instead using the Microsoft Symbols, when available). BinDiff then attempts to match functions between the two files, by matching and comparing the "basic blocks" (consecutive lists of machine code which do not include branches or jumps) within the code. Branch inversion can, for example, be detected. Basic blocks are considered to be similar if their similarity exceeds a determined threshold, which can be manipulated by a parameter which is passed as an argument to the algorithm. Lists of functions which are considered to be "similar" and "different" and "unmatched" are then provided, along with logs detailing the basis for categorizations. Please refer to the "Related Work" section below to locate papers which provide detailed descriptions of the BinDiff algorithm.

Analyze Results

The results were analyzed. Please refer to the "Results" section below for details.

Technical and Result Interpretation Challenges

Data Extraction

As is detailed in the section "Extract and Format" above, custom file decompression methods were necessary in order to extract all versions of the code, as not all versions might be installed during any particular installation. Matching files between operating systems also required the creation of complex scripts to remove exact duplicates, and match files based upon their current and historic file names.

IDA and BinDiff Batch Modes

Neither IDA Pro nor BinDiff were designed or are well-supported when executed from command-line interfaces. Message boxes which interrupted batch scripts, undefined errors, and undefined bugs causing the application to crash or hang, were among the challenges. After numerous hacks (e.g. eliminating blank lines for the disassembled files), and bug fixes, graciously provided by Sabre Security at our request, over two hundred files could not be processed. Over 200 files which were unable to be processed by either IDA or BinDiff were eliminated, leaving 1518 matched pairs of files, appearing in both Windows XP and Windows Vista, to be evaluated.

The disassembly and pair comparison process was streamlined to preserve disk space. The final BinDiff logs, which were later analyzed to calculate results, occupied over 15 Gigabytes of storage space. The disassembled files, if they were all to have been disassembled at once, would likely have occupied over a factor of ten times as much space.

Interpreting Results

The results which are detailed in the proceeding section are dependent upon a large number of assumptions. Beyond the limitations imposed by the tools which were used, decisions had to be made regarding the interpretation of BinDiff's output.

For simplicity, the set of "similar" functions were compared to the the set of all functions. In order to perform this comparison, another script had to be created to determine the total number of functions within each file, and then associate these numbers with the numbers of similar functions, as these values were not provided as part of BinDiff's output.

When calculating the aggregate similarity between Windows XP and Windows Vista, the idea of weighing each pair of files by its size (e.g. number of functions or size in bytes) was considered, but in the end dismissed for the purposes of this preliminary evaluation. This dismissal was in part due to the recognition that the frequency with which functions are called, a calculation which is beyond the scope of this work, should also be considered on that level of analysis.

Results

Basic

Between Windows XP and Windows Vista, there was approximately a 25% increase in the number of binary files (5561/4453 Vista/XP).

Of the 1518 pairs of binary files matched between the Windows XP and Windows Vista operating systems, the following basic results were calculated:

1) 37% of Vista file names matched XP file names

2) On average, there was a 17% increase in the number of functions per file, and a 30% increase in the number of bytes per file


There was only one file which contained exactly the same binary data between operating systems: sqlwoa.dll


Matching

Matching Results

Comparing the number of functions, between pairs of files, which BinDiff calculated to be similar, to the maximum of the total numbers of functions in the files, it was determined that:

1) There was approximately a 63% average function match between Windows XP and Windows Vista

2) Windows XP machine code accounts for approximately 23% of Windows Vista machine code (37% * 63% => ~23%)

3) Windows Vista contains approximately 36% of the machine code in Windows XP (63% * 79% => ~36%)

Random Matching

It is important to note these approximately are subject to multiple factors which require further analysis. As detailed in "Technical and Result Interpretation Challenges", a significant number of matched pairs of files could not be processed by IDA Pro or BinDiff. Matching based upon the name of files is another potentially significant contributor to error. Pairs of files of similar sizes, which were determined to be unlikely to contain similar code, were subjected to the same disassembly and BinDiff analysis as the pairs of files utilized to calculate the preliminary results which documented above:


Set #1

comsvcs.dll (5549 total functions, 1251840 bytes)

danim.dll (7970 total functions, 1053696 bytes)

Results: 566 similar functions => 7% "similar"


Set #2

nslookup.exe (117 total functions, 76800 bytes)

notepad.exe (85 total functions, 69120 bytes)

Results: 22 similar functions => 18% "similar"


During the random matching tests, no similarity measure exceeded 18%.

Similarity Distribution

The following histogram compares the number of pairs of files (represented on the Y axis) with particular similarity measures (represented on the X axis):

similarfunctionvistamatch.JPG


Rather than a uniform distribution, this histogram suggests that some files differed considerably more than others. In order to determine whether the files which differed more/less were more or less integral to the operating systems, the similarity of explorer.exe and kernel32.dll was considered:


Explorer.exe http://www.liutilities.com/products/wintaskspro/processlibrary/explorer/

56% similar between Windows XP and Windows Vista. Windows Vista includes 84% of the machine code which was in this file in Windows XP.


Kernel32.dll http://www.neuber.com/taskmanager/process/kernel32.dll.html

55% similar between Windows XP and Windows Vista. Windows Vista includes 84% of the machine code which was in this file in Windows XP.

Related Work

Sabre Security's BinDiff

(please also refer to the description within the "Technical Objectives" section above)

Graph-Based Comparison of Executable Objects http://www.sabre-security.com/files/BinDiffSSTIC05.pdf

Structural Comparison of Executable Objects http://www.sabre-security.com/files/dimva_paper2.pdf

Version Management

Version Management for CVS http://crypto.riken.go.jp/archives/GNU/non-gnu/cvs/source/nightly-snapshots/feature/cederqvist-1.12.13.1.pdf

Minimum Description Length

Minimum Description Length, and related work within the field of Algorithmic Information Theory, provides a rigorous, theoretical basis for general (rather than local) assumptions regarding the probability of the occurrence of a string of symbols.

http://en.wikipedia.org/wiki/Minimum_description_length

"The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally."

http://homepages.cwi.nl/~pdg/ftp/mdlintro.pdf

Unfortunately, because Kolmogorov Complexity is uncomputable, practical implementations of these ideas will require further consideration.

Document Fingerprinting

Alex Aiken's Winnowing algorithm provides a method for determining similarity within 33% of the local lower bound, given uniform identically distributed input. Hashes of particular segments of files are selected to create "fingerprints". Parameters, for which little theoretical or empirical basis is provided, are required. These parameters are intended to be specific to the format of the data under comparison, balancing between hashing upon data segments which are too small or too large to provide a strong basis for comparison (e.g. if data is repetitious, or the ordering of data is modified, respectively). Unfortunately, this trade off is inherent in the Winnowing method. The Winnowing paper referenced below also mentions that cleaning the data, prior to evaluation, significantly improves the quality of results.

Moss: A System for Detecting Software Plagiarism http://theory.stanford.edu/~aiken/moss/

Winnowing: Local Algorithms for Document Fingerprinting http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

Similix: "Similix Corporation specializes in supporting lawyers, engineers, and businesses in evaluating possible infringement of software copyrights and other intellectual property." http://similix.com

DNA Fingerprinting

DNA Fingerprinting is an example of a similar, well-documented scientific method which is widely accepted as meeting the admissibility standards for scientific evidence in courts of law.

Genetic Fingerprinting on Wikipedia http://en.wikipedia.org/wiki/Genetic_fingerprinting

Statisticians and DNA Fingerprinting http://www.jstor.org/view/08834237/di984044/98p0038f/0

DNA Fingerprinting: A Review of the Controversy http://links.jstor.org/sici?sici=0883-4237(199405)9%3A2%3C255%3A%5BFAROT%3E2.0.CO%3B2-5

Other related opportunities include improving the accuracy of STD testing by comparing genetic data obtained from DNA Microarrays.

Conclusion

Implicit Assumptions

Local Comparison

Local comparisons ignore the likelihood of the occurrence of patterns in the general case. However, a global comparison must also not be limited to the assumption of uniform identically distributed input, as machine code is context-sensitive. Please also refer to the section "Minimum Description Length".

(Ir)relevance of Output

Neither BinDiff nor Moss consider the differences in the output produced, given constant inputs. It is possible that, by altering only one basic block, the data transformation performed by a function or file may be sufficiently altered to render its potential use cases entirely different. If common design patterns are used, the differences between independently developed solutions to distinct problems might inadvertently qualify as highly similar.

Similarity of Application Type

If two distinct sets of developer are both writing drivers on the same operating system for the same peripheral device, no matter how different their coding styles may be, current algorithms might consider their binaries to be highly similar.

(No) Intentional Copying

Human facial features are significant indicators of ancestry. Similarly, small, relatively rare patterns within data may be significant indicators of derivation. Only document fingerprinting techniques detect such patterns, and they do so at the expense of detecting reorderings.

Function Classification

The BinDiff algorithm applies a heuristic method to determine the location of functions within disassembled code, when function locations are not known. Similar to function and block independence, and file independence, this heuristic is required because a fingerprinting method which is valid in the general case is not available. Together, these assumptions hierarchically narrow the scope of consideration.

Function + Block Independence

Functions may move between files, and basic blocks may move between functions. Although it may be expected that such movements will be infrequent, the current similarity measure is unable to detect them.

File Independence

Binary files are dependent upon each other. Further, some files / functions within files, may, on average, be called more or less frequently by other files than others. These file dependencies are, themselves, an important factor in the determination of similarity between software systems.

Future Work

Address Implicit Assumptions

The assumptions which are implicit in the designs of current tools (see "Implicit Assumptions") must be examined and modified, if a similarity measure sufficient for the purpose of determining the genealogy of files is to be created.

Local Comparison of Global Patterns

Principles from Algorithmic Information Theory (see "Related Work") may be applied to provide a global, or general, basis for the calculation of a similarity measure based comparison between files.

Establish Admissibility as Scientific Evidence

In order for a similarity measure to be accepted as scientific evidence, a set of poorly defined criteria must be met. It is expected that the process of publishing theories and empirical results will result in this achievement. However, this end goal might be achieved more quickly if attention is given to the process through which DNA fingerprinting became acceptable within courts of law.

Wikipedia, Admissible Evidence http://en.wikipedia.org/wiki/Admissible_evidence

Wikipedia, Scientific Evidence http://en.wikipedia.org/wiki/Scientific_evidence

(also see "Related Work", DNA Fingerprinting)

Streamlining and Extending Existing Tools

The IDA Pro Disassembler, and the BinDiff plugin, were both specifically design to process files, or pairs of files, individually. In order to be able to disassemble multiple files at once, it is expected that the systems' software architectures would need to be significantly altered, as data will no longer fit in RAM, and methods of determining dependencies between files has yet to be explored in depth.

Visualization/Interpretation Aids

BinDiff, and source code versioning interfaces, provide special methods for visualizing the differenced between files. BinDiff provides visualizations of call graphs, and most CVS interfaces provide overlapping dual-window text comparison graphical user interfaces. These tools significantly aid human analysis. As the calculation of increasingly precise similarity measures becomes possible, it is expected that tools similar to BinDiff may be enhanced by novel visualization methods, perhaps which are more similar to those which are currently common within CVS interfaces.

Last modified September 19, 2007 5:48 pm / Skin by Kevin Hughes
MediaWiki