Don't Rely on MD5 sums.
MD5 sums are not a reliable way to check for duplicates; they are only a way to check for differences.
Use MD5s to find candidate duplicates, and then for each pair sharing an MD5:
- Open both files.
- Read forward through both files until they differ or both reach the end.
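
A minimal Python sketch of that confirmation step, to make it concrete. The function name `same_contents` and the 64 KiB chunk size are my own choices for illustration, not taken from any particular tool:

```python
import os


def same_contents(path_a, path_b, chunk_size=64 * 1024):
    """Return True only if both files hold exactly the same bytes."""
    # Files of different sizes can never be identical, so skip the reads.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first difference found, stop early
            if not chunk_a:
                return True  # both files ended without any difference
```

Because it stops at the first differing chunk, two files that merely share a hash are usually rejected after reading only a small part of each.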
Since I'm getting downvoted by people using naïve approaches to detecting duplicate files: if you're going to rely entirely on a hash algorithm, for goodness' sake, use something tougher like SHA256 or SHA512. At least you'll reduce the probability of a false match to a reasonable degree by having more bits checked. MD5 is exceedingly weak against collisions.
I also advise reading the mailing list thread titled 'file check' here: http://london.pm.org/pipermail/london.pm/Week-of-Mon-20080714/thread.html
If you say "MD5 can uniquely identify all files" then you have a logic error.
Take files ranging from 40,000 bytes to 100,000,000,000 bytes in length. The number of distinct files possible in that range vastly exceeds the number of values an MD5 can represent, which weighs in at a mere 128 bits, i.e. 2^128 values.
A single 100,000,000,000-byte file is 800,000,000,000 bits long, so that length alone allows 2^800,000,000,000 combinations. Represent those with only 2^128 values? Many different files must share the same MD5.
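
The same counting argument can be watched at a small scale. This is only a sketch: it cuts MD5 down to its first 2 bytes (my own simplification, so there are just 65,536 possible values) and feeds in one more distinct input than that, which guarantees a repeat. The function name `find_truncated_collision` is hypothetical:

```python
import hashlib


def find_truncated_collision(digest_bytes=2):
    """Find two different inputs whose MD5 digests share their first bytes."""
    seen = {}
    # There are 256**digest_bytes possible truncated digests, so trying one
    # more distinct input than that must produce at least one repeat.
    for i in range(256 ** digest_bytes + 1):
        data = str(i).encode()
        short = hashlib.md5(data).digest()[:digest_bytes]
        if short in seen:
            return seen[short], data  # two different inputs, same short hash
        seen[short] = data
    raise AssertionError("unreachable: the loop above must find a repeat")
```

Full MD5 is the same situation with a bigger number: 2^128 slots, and far more possible files than slots.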
The Least Naïve way
The least naïve way, and the fastest way, to weed out duplicates is as follows.
- By size: Files with different sizes cannot be identical. This takes little time, as it does not even have to open the file.
- By hash: Files with different MD5/SHA values cannot be identical. This takes a little longer because it has to read every byte in the file and perform math on them, but it makes multiple comparisons quicker.
- If neither of the above finds a difference: Perform a byte-by-byte comparison of the files. This is a slow test to execute, which is why it is left until all the other eliminating factors have been considered.
fdupes does this, and you should use software that applies the same criteria.
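
The three stages above can be sketched in Python roughly as follows. This is an illustration under my own assumptions, not how fdupes is implemented: the names `file_digest`, `group_by` and `find_duplicates` are made up, SHA-256 is my choice for the hash stage, and the final check leans on the standard library's `filecmp.cmp` with `shallow=False` to compare file contents.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def group_by(paths, key):
    """Bucket paths by key(path); keep only buckets with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    duplicate_sets = []
    # Stage 1: size (cheap, no file is opened).
    for same_size in group_by(paths, os.path.getsize):
        # Stage 2: hash (reads each remaining file once).
        for same_hash in group_by(same_size, file_digest):
            # Stage 3: confirm by comparing the actual contents.
            remaining = list(same_hash)
            while remaining:
                first, rest = remaining[0], remaining[1:]
                matches = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(matches) > 1:
                    duplicate_sets.append(matches)
                remaining = [p for p in rest if p not in matches]
    return duplicate_sets
```

Each stage only looks at files that survived the previous one, so the expensive content comparison runs on a small fraction of the input.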