A comparison and evaluation of approximate matching algorithms for file detection in network traffic

Bachkarov, N. (2015). A comparison and evaluation of approximate matching algorithms for file detection in network traffic (MSc ASDF Dissertation). Edinburgh Napier University (Macfarlane, R., Penrose, P.).



Computer-related crime has been steadily increasing in the last few years (Wiener-Bronner, 2014). The main target of attackers is either the user’s data or a company’s sensitive files. Data Loss Prevention (DLP) systems play a key role in detecting such files leaving the network. Detecting contraband data depends on configuring the DLP system properly and installing it at the right point in the network. Although DLP systems have been around for some time, many issues with accurately detecting sensitive data remain. This allows knowledgeable attackers to bypass the restrictions of such systems and leak private and sensitive data.
This dissertation focuses on new methods which aim to provide more flexible detection of sensitive data. One of those methods is approximate matching, also known as fuzzy hashing. This relatively new technique is designed to find similarities between two digital artefacts. It differs from traditional hashing: instead of creating one hash for each file, approximate matching breaks the file into chunks and hashes each chunk. One of the main advantages of this method is that it can find similarities between files rather than only exact matches. The other benefit is that it uses Bloom filters to store the signatures of the files. Bloom filters are space-efficient probabilistic data structures, and the resulting filters are approximately 1.0% to 2.6% of the size of the input.
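The chunk-and-hash idea described above can be illustrated with a minimal Python sketch. This is not the algorithm used by sdhash or mrsh-v2: it assumes fixed-size chunks (so it is sensitive to byte alignment, which the real tools avoid by selecting chunks or features based on content), and the constants CHUNK_SIZE, FILTER_BITS and NUM_HASHES, as well as every function name, are hypothetical choices made only for illustration.

```python
import hashlib

# All three constants are illustrative assumptions, not values from any real tool.
CHUNK_SIZE = 64      # bytes per chunk
FILTER_BITS = 2048   # size of the Bloom filter in bits
NUM_HASHES = 5       # bit positions set per chunk


def chunks(data: bytes) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def chunk_positions(chunk: bytes) -> list[int]:
    """Derive NUM_HASHES bit positions from a single SHA-1 digest of the chunk."""
    digest = hashlib.sha1(chunk).digest()  # 20 bytes, enough for 5 x 4-byte slices
    return [
        int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % FILTER_BITS
        for i in range(NUM_HASHES)
    ]


def build_filter(data: bytes) -> int:
    """Return a Bloom filter, stored as an int bitmask, holding every chunk of data."""
    bits = 0
    for chunk in chunks(data):
        for pos in chunk_positions(chunk):
            bits |= 1 << pos
    return bits


def contains(bits: int, chunk: bytes) -> bool:
    """True if the chunk may be in the filter (false positives are possible)."""
    return all((bits >> pos) & 1 for pos in chunk_positions(chunk))


def similarity(data: bytes, bits: int) -> float:
    """Fraction of the chunks of data that are found in the Bloom filter."""
    parts = chunks(data)
    if not parts:
        return 0.0
    return sum(contains(bits, c) for c in parts) / len(parts)


if __name__ == "__main__":
    original = bytes(range(256)) * 4
    modified = original[:-CHUNK_SIZE] + b"\x00" * CHUNK_SIZE  # alter the last chunk
    signature = build_filter(original)
    print(similarity(original, signature))
    print(similarity(modified, signature))
```

A traditional whole-file hash would report the modified file as simply different, whereas the chunk-based score stays high because most chunks are unchanged. Bloom filters never give false negatives, so a file always scores 1.0 against its own signature, but they can give false positives, which is why the score is an estimate rather than an exact measure.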
An experimental system is designed and implemented in order to recreate experiments that have been conducted by other researchers in the field. Because sdhash and mrsh-v2 are currently the most widely used tools for approximate matching, owing to their good overall performance, the experiments focus on comparing them in terms of sampling speed, accuracy and comparison time for network file similarity. The experiments sample two sets of data and then compare those files against four network captures of different sizes. The results show that mrsh-v2 outperforms sdhash in terms of sampling speed, but sdhash is faster when comparing the data sets to the captured traffic. Another important finding is that sdhash is more likely to identify contraband files within larger network files. These results compare well to previous work (Breitinger & Baggili, 2014). This project extends that work by comparing a data set against a captured network traffic file which excludes the file type signature headers. It also extends the work of other researchers in this field by evaluating the performance of sdhash on a Windows operating system.

