RISAHash

A common problem in media handling is the identification of files with similar but not-bit-identical content, such as resized versions of an image or a song in a variety of codecs and compression levels. This is typically required when searching a media collection for duplicates or screening submissions to a service for copyright infringement or resubmissions. The simplist solution is a brute-force search using some form of distance metric function, but in most circumstances this will require a prohibative amount of processing time.

One solution to this is the use of a perceptual hash - a function which takes media as input and generates a short representation of the file, such that a difference between hashes may then be calculated that approximates the distance between the original media. As these hashes are shorter, often of fixed length and designed to be easily comparable, evaluation of the distance between two perceptual hashes takes only a miniscule fraction of the processing time required to evaluate the distance between the source media. Typicially the approximation does introduce a chance of a false positive match, but the cost of verifying any near-matching hashes is far less than that saved by their use.

Risahash (The RIdiculously Simple Audio Hash) is a perceptual hash function for audio which functions in a manner very similar to the phash perceptual image hash - though not at all like the phash library's own audio hash function. Unlike the phash audio function, risahash's outputs are of a fixed 64-bit length and the distance between hashes is a simple calculation of hamming distance. This makes them very easy to handle when storing or searching - it's even possible to craft an SQL statement to search for all hashes within a specified distance of a target hash.

Many other perceptual audio hashes (or 'fingerprints') have been proposed, most of which utilise some form of frequency-domain feature identification. Risahash has several compelling advantages.

Speed: Calculating a hash from audio is so rapid, the processor time will usually be less than that to load the content.
Simplicity: The algorithm can be implimented in under twenty lines of C and requires no use of frequency domain analysis or more than the most basic arithmatic.
Compact hashes: Fixed-length, 64 bits.
Resistant to changes in volume, sample rate, most bandpass filtering, lossy compression at low bitrates, introduction of white or colored noise or changes in playback speed. However, highly succeptable to cutting or padding.
Accurate: Though testing has so far been limited, false matches are rare. It has even identified matches when two artists perform the same song - the similar musical structure is enough to allow a match.
Extremely high speed of searching: Similarity between hashes is calculated as hamming distance, which can be computed very quickly.

This approach does have a drawback: It has no resistance at all to clipping, and requires the content to syncronised. That makes it unsuited for unstructured pattern-matching - you cannot use it to, for example, identify a song based on a short extract. It does make it ideally suited to another use: Detecting duplicate songs in poorly-structured music library. It was designed for this, and performs very well in that capacity. The succeptability to clipping and shifting becomes a strength in this capacity, as it allows easy differentiation between different mixes of a song - extended, dance and club editions have very similiar power spectrums, but different structure at the large temporal scales used by risahash to identify commonalities.

The algorithm

The approach is, as the name states, ridiculously simple: The signal (having been converted to a single signed mono channel) is divided into 66 equal-length segments. Each of these, excluding the endmost, corresponds to a single bit of the hash to be output. This bit will be a one if that segment has a greater power (ie, mean squared sample value) than the previous segment, or zero otherwise. The contents of the final segment is ignored entirely, to reduce the effect of the fade-out effect commonly encountered in music. These 'slices' are quite short - approximately one second of slice duration for each minute of audio. The bits are in effect a representation of the volume either rising or falling over sixty-four defined transition points. This patten of when the sound is getting louder and when it is getting softer are an effective perceptual hash of the audio file.

A basic demonstration that reads a .wav file and outputs the hash.