sontek ( John M. Anderson )

June 30, 2008

Find duplicate files by content not name

Filed under: Bash — Tags: — sontek @ 11:36 pm

Today in IRC, suseROCKS needed to find all duplicate files in a directory by their content rather than by their file names, so we whipped up this fancy little bash one-liner to do the trick:

find . -type f -exec md5sum '{}' \; | sort | awk 'dup[$1]++{print $2}'

EDIT:

As Andreas suggested, using xargs instead of -exec is faster. Here is the updated command:

find . -type f -print0 | xargs -0 md5sum | sort | awk 'dup[$1]++{print $2}'
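
One caveat with both versions: awk splits fields on whitespace, so print $2 cuts off any file name that contains a space. Here is a minimal sketch of a variant that prints the whole path instead. It assumes GNU md5sum and GNU xargs (the -r flag skips the run when find returns nothing), and find_dupes is a hypothetical helper name, not something from the original one-liner. It relies on md5sum printing a 32-character hash followed by two separator characters, so the name starts at column 35. File names containing a backslash or a newline are escaped by md5sum and are not handled here.

```shell
# find_dupes is a hypothetical name used only for this sketch.
# Prints every file whose content matches a file listed earlier in sort order.
find_dupes() {
    dir="${1:-.}"
    find "$dir" -type f -print0 \
        | xargs -0 -r md5sum \
        | sort \
        | awk 'seen[$1]++ { print substr($0, 35) }'  # column 35 = start of the file name
}
```

Calling find_dupes with a directory (or with no argument for the current one) lists one path per line, leaving out the first file of each group of identical files.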

6 Comments »

  1. Normally you want to use xargs instead of -exec. When terminated with \;, the -exec option calls md5sum once for every file found. With xargs, md5sum is called once per batch of file names, which for a small directory means only once.

    Tested with cold caches on ~200 files:

    time find . -type f -exec md5sum '{}' \; | sort | awk 'dup[$1]++{print $2}'

    real 0m6.567s
    user 0m0.636s
    sys 0m0.604s

    time find . -type f -print0 | xargs -0 md5sum | sort | awk 'dup[$1]++{print $2}'

    real 0m5.454s
    user 0m0.620s
    sys 0m0.272s

    Comment by Andreas Schneider — July 1, 2008 @ 1:56 am

  2. We use Duplicate Finder from Ashisoft to find and remove duplicate files.

    You can find the free trial version at: http://www.ashisoft.com

    Comment by John — July 1, 2008 @ 5:38 am

  3. If I remember right, xargs is easier on memory usage as well.

    Comment by Sam Merrell — July 1, 2008 @ 9:25 pm

  4. A C program that achieves the same thing:
    http://en.wikipedia.org/wiki/Fdupes

    Comment by Ted Hanney — July 3, 2008 @ 10:26 am

  5. Yeah fdupes works very well too!

    Comment by IAnjo — July 3, 2008 @ 12:44 pm

  6. Would be cool to have a GUI for that, or even integrated in the file browser.

    Comment by Jakub Szypulka — July 4, 2008 @ 8:58 am
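
The batching point from the first comment can be checked directly. This is a rough sketch, assuming an xargs that supports -0 (GNU and BSD both do). count_invocations is a hypothetical name, and sh -c 'echo run' stands in for md5sum so that each invocation prints exactly one line:

```shell
# count_invocations is a hypothetical name used only for this sketch.
count_invocations() {
    dir="${1:-.}"
    # -exec ... \; starts the command once per file found.
    per_file=$(find "$dir" -type f -exec sh -c 'echo run' _ '{}' \; | wc -l)
    # xargs packs as many names as fit onto one command line per invocation.
    batched=$(find "$dir" -type f -print0 | xargs -0 sh -c 'echo run' _ | wc -l)
    # $(( )) strips the padding that some wc implementations add.
    echo "exec: $((per_file)) xargs: $((batched))"
}
```

In a directory with a few hundred files, the first number equals the file count, while the second is typically 1.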
