Find duplicate files by content, not name
Today in IRC suseROCKS needed to find all duplicate files in a directory by their content, not by their file name, so we whipped up this fancy little Bash one-liner to do the trick:
find . -type f -exec md5sum '{}' \; | sort | awk 'dup[$1]++{print $2}'
EDIT:
As Andreas suggested, using xargs instead of -exec is faster. Here is the updated command:
find . -type f -print0 | xargs -0 md5sum | sort | awk 'dup[$1]++{print $2}'
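One caveat with both commands: awk's $2 holds only the first whitespace-separated word of the path, so a file name containing spaces is printed truncated. Below is a minimal sketch of one way around that, assuming GNU md5sum and GNU xargs. The function name find_dupes is a hypothetical helper chosen for illustration, not something from the IRC session.

```bash
# Sketch only: find_dupes is a hypothetical helper name.
# Prints every file whose content matches an earlier file in sort order.
find_dupes() {
    local dir="${1:-.}"
    # -print0 / -0 keep paths with spaces in one piece.
    # -r (GNU xargs) skips md5sum entirely when find returns nothing.
    find "$dir" -type f -print0 | xargs -0 -r md5sum | sort |
        awk 'seen[$1]++ {
            # md5sum prints "<hash> <mode char><path>"; strip the hash
            # prefix and keep the whole path, spaces included.
            sub(/^[^ ]+ [ *]/, "")
            print
        }'
}
```

Calling find_dupes with a directory (or with no argument for the current directory) lists the second and later copies in each group of identical files. Paths containing newlines or backslashes are not handled, because md5sum escapes those in its output.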

Normally you want to use xargs instead of -exec. With the \; terminator, -exec starts a separate md5sum process for every file found. xargs passes as many file names as fit on one command line, so md5sum is typically started only once.
Tested with cold caches on ~200 files:
time find . -type f -exec md5sum '{}' \; | sort | awk 'dup[$1]++{print $2}'
…
real 0m6.567s
user 0m0.636s
sys 0m0.604s
—
time find . -type f -print0 | xargs -0 md5sum | sort | awk 'dup[$1]++{print $2}'
…
real 0m5.454s
user 0m0.620s
sys 0m0.272s
Comment by Andreas Schneider — July 1, 2008 @ 1:56 am
We use Duplicate Finder from Ashisoft to find and remove duplicate files.
You can find the free trial version at: http://www.ashisoft.com
Comment by John — July 1, 2008 @ 5:38 am
If I remember right, xargs is easier on memory usage as well.
Comment by Sam Merrell — July 1, 2008 @ 9:25 pm
A C program that achieves the same:
http://en.wikipedia.org/wiki/Fdupes
Comment by Ted Hanney — July 3, 2008 @ 10:26 am
Yeah fdupes works very well too!
Comment by IAnjo — July 3, 2008 @ 12:44 pm
Would be cool to have a GUI for that, or even integrated in the file browser.
Comment by Jakub Szypulka — July 4, 2008 @ 8:58 am
It has to be the coolest command ever!
Why didn't I think of it?
Two similar files will have the same md5, won't they?
Comment by T — August 4, 2008 @ 6:59 pm