Confero identifies files that have a lot of data in common within a large set of files: any sufficiently long stretches of bytes, regardless of where they appear within each file. It usually won't help in searching for similar images - there are more specialised utilities for that - but it is very effective for many types of document. It will identify archives that contain the same files, or different revisions of the same document. For example, I used it to examine a large collection of downloaded web comic archives contained within zip files, and it identified instances where the same comic series had been included twice under two different titles.
It does this using variable-length chunking followed by Jaccard similarity comparison of the resulting chunk hashes.
Compiled for Windows, source for Linux.
The compiled Windows executable has been removed due to a false positive in Windows Defender.
Confero finds similar files. It does this by combining three commonly used algorithms.
Step one: Each file is split into chunks using a variable-length (content-defined) chunking algorithm built around an FNV-64 hash function. This isn't the absolute fastest choice - a cyclic polynomial rolling hash would perform better - but avoiding that mathematics makes the algorithm much easier to follow and test. The performance matters little, as the limiting factor is usually reading from disk. Even used in this inefficient manner, which costs roughly a factor of eight in speed, FNV is still fast.
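To make the chunking step concrete, here is a minimal sketch in Python. It assumes the FNV-1a 64-bit variant, an 8-byte sliding window (consistent with, but not confirmed by, the "factor of eight" remark above), a 12-bit boundary mask and illustrative minimum/maximum chunk sizes; Confero's actual parameters may differ.

```python
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = 0xffffffffffffffff

def fnv64(data: bytes) -> int:
    """Plain FNV-1a 64-bit hash of a byte string."""
    h = FNV_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & MASK64
    return h

def chunk_file(data: bytes, window=8, mask=0xfff, min_len=256, max_len=65536):
    """Split data into variable-length, content-defined chunks.

    The hash of the trailing `window` bytes is recomputed at every
    position (the inefficiency mentioned above); a chunk boundary is
    declared when the hash's low bits are all zero.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        length = i + 1 - start
        if length < min_len:
            continue
        at_boundary = (i + 1 >= window and
                       fnv64(data[i + 1 - window:i + 1]) & mask == 0)
        if at_boundary or length >= max_len:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries depend only on the content inside the window, two files that share a long run of bytes will produce identical chunks for that run, even if the run sits at different offsets in each file.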
Step two: Each chunk is then reduced to a hash integer, which is stored in a simple (k=1) Bloom filter. The hash used for this is again FNV-64. There's no strict requirement to reuse the same hash, but the function is already present. The hash is taken modulo the number of buckets in the table, and the corresponding bit is set. In this manner each file is reduced to a fixed-length signature. The size of these signatures is a trade-off: too small, and large files would set so many bits that their signatures become almost all ones, leading to unacceptable numbers of false matches. Too large a signature wouldn't cause any correctness problems, but it would consume excessive memory.
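A sketch of the signature-building step, again assuming FNV-1a 64-bit and a hypothetical bucket count; the real signature size is one of the tunable parameters described below.

```python
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = 0xffffffffffffffff

def fnv64(data: bytes) -> int:
    """Plain FNV-1a 64-bit hash of a byte string."""
    h = FNV_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & MASK64
    return h

def signature(chunks, buckets=2**20):
    """Reduce a list of chunks to a fixed-length bit signature.

    Each chunk is hashed, taken modulo the number of buckets, and the
    corresponding bit is set - a Bloom filter with k=1. The signature
    is held here as a Python integer used as a bit set.
    """
    sig = 0
    for chunk in chunks:
        sig |= 1 << (fnv64(chunk) % buckets)
    return sig
```

With 2**20 buckets (an illustrative figure, not Confero's default) each signature occupies 128 KiB regardless of the size of the file it describes.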
Step three: Each pair of signatures is compared using the Jaccard similarity of their set bits as the metric. Any pair scoring above the threshold is output.
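The comparison itself is a few bit operations. This sketch assumes signatures are stored as integers used as bit sets, as in the sketch above.

```python
def jaccard(sig_a: int, sig_b: int) -> float:
    """Jaccard similarity of two bit signatures: |A AND B| / |A OR B|."""
    union = sig_a | sig_b
    if union == 0:
        return 0.0
    inter = sig_a & sig_b
    # bin(x).count("1") is a portable population count.
    return bin(inter).count("1") / bin(union).count("1")
```

Every pair of files whose signatures score at or above the threshold T is then reported as a likely match.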
There are two tunable parameters: buckets (B) and threshold (T). Together they set the memory/accuracy trade-off.
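Standard Bloom-filter arithmetic gives a feel for choosing B: with n chunks inserted into a k=1 filter of B bits, the expected fraction of bits set is roughly 1 - exp(-n/B). The chunk count and bucket sizes below are purely illustrative, not Confero's defaults.

```python
import math

def expected_fill(n_chunks: int, buckets: int) -> float:
    """Expected fraction of signature bits set after n_chunks insertions."""
    return 1.0 - math.exp(-n_chunks / buckets)

for buckets in (2**16, 2**20, 2**24):
    print(buckets, round(expected_fill(100_000, buckets), 3))
# 65536    -> ~0.782  (nearly saturated: many false matches)
# 1048576  -> ~0.091  (reasonable density)
# 16777216 -> ~0.006  (very sparse, but 2 MiB per signature)
```

Raising T makes matches stricter and reduces false positives; lowering it catches files that share less of their content.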