Dupdelete is a program that helps tidy collections of files.
Dupdelete has two fundamental modes of operation:
- Recursive MD5 hashing of a directory and its contents, writing the hashes to a file.
- Recursive MD5 hashing of a directory and its contents, logging and optionally deleting anything matching a blacklist of hashes.
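The two modes above can be sketched roughly as follows. This is a minimal Python illustration of the idea, not dupdelete's own code. The function names (`md5_of_file`, `hash_tree`, `find_blacklisted`) are made up for the example, and the blacklist is assumed to be a plain set of hex digests, since the format of dupdelete's hash files is not described here.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Yield (hex_digest, path) for every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                yield md5_of_file(path), path


def find_blacklisted(root, blacklist, delete=False):
    """Return paths under root whose MD5 is in blacklist; optionally delete them.

    `blacklist` is assumed to be a set of hex digest strings.
    """
    matches = []
    for digest, path in hash_tree(root):
        if digest in blacklist:
            matches.append(path)
            if delete:
                os.remove(path)
    return matches
```

The first mode corresponds to writing out whatever `hash_tree` yields; the second corresponds to `find_blacklisted`, with `delete=True` standing in for the optional deletion.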
It can also cache the hashes found on previous passes in a GDBM file, allowing very large sets of files to be processed on a regular basis without recalculating hashes for multiple terabytes of input. It is entirely practical to have it, for example, process the user areas of over a thousand students at a school and compare every newly found or modified file against a blacklist containing the MD5s of downloadable games, infringing music, offensive jokes and inappropriate images. This is the purpose for which I use it, and I find it very satisfactory (and satisfying) in that role.
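To show why a hash cache makes repeated passes cheap, here is a small Python sketch. It is an illustration under assumptions, not a description of dupdelete's actual cache: the helper name `cached_md5` is hypothetical, a plain dict stands in for the GDBM file, and the choice to treat a file as unchanged when its size and modification time match is my guess at a reasonable scheme.

```python
import hashlib
import os


def cached_md5(path, cache):
    """Return (hex_digest, was_cache_hit) for path.

    `cache` is any dict-like mapping from path to "size:mtime_ns:digest".
    This record layout is an assumption made for the example.
    """
    info = os.stat(path)
    stamp = f"{info.st_size}:{info.st_mtime_ns}"

    entry = cache.get(path)
    if entry is not None:
        # Split off the digest; everything before it is the size/mtime stamp.
        cached_stamp, _, digest = entry.rpartition(":")
        if cached_stamp == stamp:
            return digest, True  # unchanged since last pass: skip rehashing

    # New or modified file: hash it and remember the result.
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    digest = hasher.hexdigest()
    cache[path] = f"{stamp}:{digest}"
    return digest, False
```

On a second pass only files whose size or timestamp changed are read in full, which is what keeps a regular scan of a large file store practical. A persistent store such as Python's `dbm` module could probably replace the dict, with keys and values encoded as bytes.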
As an example, dupdelete can be used to hash a folder full of copyright-infringing music and then to search user folders for any undiscovered copies:
dupdelete path_to_contraband/ > contraband.md5
dupdelete path_to_user_folders/ -l contraband.md5 > gotya.txt
Or it can be used to validate copying of a large directory structure to ensure no files were corrupted in transit:
dupdelete path_to_copy > copied_files.md5
dupdelete path_to_original -l copied_files.md5 -v | grep -v "(Flagged"
Dupdelete can only use the MD5 hash, which is no longer cryptographically secure, but it can apply a salt to the hash, and it can also be set to skip files larger than a defined size.
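As a rough picture of what salting and a size limit mean here, consider the Python sketch below. How dupdelete actually mixes the salt into the hash is not stated in this description, so the example simply assumes the salt bytes are fed to MD5 before the file contents. The function name `salted_md5` and its parameters are hypothetical.

```python
import hashlib
import os


def salted_md5(path, salt=b"", max_size=None):
    """Return the hex MD5 of salt + file contents, or None if the file is skipped.

    Assumptions for illustration: the salt is prepended to the data, and
    files strictly larger than max_size bytes are skipped.
    """
    if max_size is not None and os.path.getsize(path) > max_size:
        return None  # too large: do not hash at all

    hasher = hashlib.md5()
    hasher.update(salt)  # with an empty salt this is a plain MD5
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

With a salt in place, the digests in a blacklist no longer match published MD5 lists, so the blacklist and the scan must use the same salt. Skipping large files trades coverage for speed.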
Dupdelete is released under the GPL, though on the understanding that this is 'good enough' software and not the best-written. A Windows executable and Linux-friendly source are released, but this is a simple, cross-platform program, so you should have no trouble compiling it on anything POSIX. It compiles for Windows using MinGW. The Visual Studio C compiler can also handle it, but the code needs some minor adaptations for that.