{"id":6457,"date":"2019-05-14T09:45:57","date_gmt":"2019-05-14T14:45:57","guid":{"rendered":"https:\/\/www.poweradmin.com\/blog\/?p=6457"},"modified":"2019-05-10T16:56:59","modified_gmt":"2019-05-10T21:56:59","slug":"identifying-duplicate-files-in-linux","status":"publish","type":"post","link":"https:\/\/www.poweradmin.com\/blog\/identifying-duplicate-files-in-linux\/","title":{"rendered":"Identifying Duplicate Files in Linux"},"content":{"rendered":"<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\"><span style=\"color: #000000;\"><strong><span style=\"font-family: 'Arial',sans-serif;\">By Des Nnochiri<\/span><\/strong><\/span><\/span><\/p>\n<p>\u00a0<\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Keeping redundant copies of essential files and programs can assist in recovery when system glitches or other incidents occur. However, duplicate files also hold the potential to confuse matters and introduce errors. It\u2019s possible to have too much of a good thing, so keeping track of these duplicates is always a good idea.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<h2><span style=\"font-family: 'Arial',sans-serif;\">Why Files Sometimes Multiply<\/span><\/h2>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">If you\u2019re a music, video, or graphics enthusiast, you\u2019ll understand how easy it can be for files with similar names or similar content to pile up on your storage drives. 
What might surprise you is that this kind of download or file-saving activity isn\u2019t the biggest cause of file replication on most systems.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Operating systems and application software are actually the biggest culprits. These programs often create duplicate files for entirely legitimate purposes.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Some programs are designed with built-in mechanisms to protect them from the malfunctioning of other, related software. One way of doing this is for an application to install its own local copy of a shared library or support files. This prevents the software from losing access to essential files or code if another program that uses the same library is uninstalled or becomes hopelessly corrupted.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Operating systems typically offer users the option of rolling back their installation to a previous version if the current set-up becomes damaged or corrupted by a virus attack or some other circumstance. 
This requires entire sets of OS files to be held in a designated portion of your hard drive, many of which will be exact duplicates of files on the currently active operating system.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Similarly, when some applications update themselves, they\u2019ll save copies of any files that have been changed in case the upgrade goes wrong and the previous installation needs to be restored.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<h2><span style=\"font-family: 'Arial',sans-serif;\">When More Means Less<\/span><\/h2>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Besides their potential to consume large quantities of valuable storage space, duplicate files can sometimes create problems affecting usability or operational matters.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">If, for example, the information making up a duplicate file set becomes corrupted, applications depending on this data for their functionality may exhibit erratic or even destructive behaviors. 
Duplicate files may slow down or complicate system-wide operations, such as file indexing or database sorts and searches.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<h2><span style=\"font-family: 'Arial',sans-serif;\">The Issue of Hard and Symbolic Links<\/span><\/h2>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">In a Linux environment, the issue of duplicate files may be even more pronounced. That\u2019s because two or more entities on a Linux drive can have different names yet still be recognized by the system as identical versions of the same file.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Files sharing the same disk space in a Linux installation will share the same inode, which is the data structure that stores all the information about a file except its name and its content. 
Two or more files with a common inode may have different names and file system locations, yet they\u2019ll still share the same content, ownership, permissions, and other characteristics.<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Files like these are known as hard links. They contrast with symbolic links, which point to other files by storing their path names. In her <\/span><a href=\"https:\/\/www.networkworld.com\/article\/3387961\/how-to-identify-duplicate-files-on-linux.html\" rel=\"nofollow\" target=\"_blank\"><span style=\"font-family: 'Arial',sans-serif;\">analysis of this subject for NetworkWorld<\/span><img class=\"extlink-icon\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/plugins\/external-links-nofollow-open-in-new-tab-favicon\/images\/extlink.png\"><\/a><span style=\"font-family: 'Arial',sans-serif; color: black;\">, Sandra Henry-Stocker points out that symbolic links are easy to identify in a file listing by the \u201cl\u201d in the first position and the -&gt; arrow that points to the file being referenced:<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">$ ls -l my*<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">-rw-r--r-- 4 shs shs\u00a0\u00a0 228 Apr 12 19:37 myfile<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 
'Arial',sans-serif; color: black;\">lrwxrwxrwx 1 shs shs\u00a0\u00a0\u00a0\u00a0 6 Apr 15 11:18 myref -&gt; myfile<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">-rw-r--r-- 4 shs shs\u00a0\u00a0 228 Apr 12 19:37 mytwin<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<h2><span style=\"font-family: 'Arial',sans-serif;\">Finding Duplicate Files in Linux<\/span><\/h2>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">To identify the hard links in a single directory, you can list the files using the ls -i command and sort them by inode number. The inode numbers appear in the first column of this type of output. 
<\/span><\/p>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6459\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-300x81.png\" alt=\"\" width=\"450\" height=\"122\" srcset=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-300x81.png 300w, https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux.png 623w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\"><\/a><\/p>\n<p>\u00a0<\/p>\n<p style=\"margin-bottom: 0.0001pt; line-height: 150%; text-align: center;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">(Image source: <\/span><a href=\"https:\/\/www.networkworld.com\/article\/3387961\/how-to-identify-duplicate-files-on-linux.html\" rel=\"nofollow\" target=\"_blank\"><span style=\"font-family: 'Arial',sans-serif;\">NetworkWorld<\/span><img class=\"extlink-icon\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/plugins\/external-links-nofollow-open-in-new-tab-favicon\/images\/extlink.png\"><\/a><span style=\"font-family: 'Arial',sans-serif; color: black;\">)<\/span><\/p>\n<p style=\"margin-bottom: 0.0001pt; line-height: 150%; text-align: center;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Rather than scanning reams of output for identical inode numbers, it\u2019s possible to check directly whether one particular file is hard-linked to another. To do so, use the -samefile option of the find command. The starting location that you provide to the find command dictates how much of the file system is scanned for matches. 
In the example below, the search takes place in the current directory and its subdirectories.<\/span><\/p>\n<p>\u00a0<\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\"> <a href=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6461\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-2-300x46.png\" alt=\"\" width=\"450\" height=\"69\" srcset=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-2-300x46.png 300w, https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-2.png 623w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\"><\/a><\/span><\/p>\n<p>\u00a0<\/p>\n<p style=\"margin-bottom: 0.0001pt; line-height: 150%; text-align: center;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">(Image source: <\/span><a href=\"https:\/\/www.networkworld.com\/article\/3387961\/how-to-identify-duplicate-files-on-linux.html\" rel=\"nofollow\" target=\"_blank\"><span style=\"font-family: 'Arial',sans-serif;\">NetworkWorld<\/span><img class=\"extlink-icon\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/plugins\/external-links-nofollow-open-in-new-tab-favicon\/images\/extlink.png\"><\/a><span style=\"font-family: 'Arial',sans-serif; color: black;\">)<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">To find duplicate files and drill down to more detail on their characteristics, you can use the find command\u2019s -ls 
option:<\/span><\/p>\n<p>\u00a0<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6463\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-3-300x46.png\" alt=\"\" width=\"450\" height=\"69\" srcset=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-3-300x46.png 300w, https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/find-duplicate-files-linux-3.png 620w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\"><\/a><\/p>\n<p>\u00a0<\/p>\n<p style=\"margin-bottom: 0.0001pt; line-height: 150%; text-align: center;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">(Image source: <\/span><a href=\"https:\/\/www.networkworld.com\/article\/3387961\/how-to-identify-duplicate-files-on-linux.html\" rel=\"nofollow\" target=\"_blank\"><span style=\"font-family: 'Arial',sans-serif;\">NetworkWorld<\/span><img class=\"extlink-icon\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/plugins\/external-links-nofollow-open-in-new-tab-favicon\/images\/extlink.png\"><\/a><span style=\"font-family: 'Arial',sans-serif; color: black;\">)<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">The first column of output displays the inode number. It then lists the file permissions, links, owner, file size, date information, and the names of any files that refer to the same disk content. 
<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">To locate all instances of hard links in a single directory, you can run a script like this:<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/locate-hard-links.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6466\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/locate-hard-links-300x204.png\" alt=\"\" width=\"450\" height=\"306\" srcset=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/locate-hard-links-300x204.png 300w, https:\/\/www.poweradmin.com\/blog\/wp-content\/uploads\/2019\/05\/locate-hard-links.png 616w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\"><\/a><\/p>\n<p>\u00a0<\/p>\n<p style=\"margin-bottom: 0.0001pt; line-height: 150%; text-align: center;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">(Image source: <\/span><a href=\"https:\/\/www.networkworld.com\/article\/3387961\/how-to-identify-duplicate-files-on-linux.html\" rel=\"nofollow\" target=\"_blank\"><span style=\"font-family: 'Arial',sans-serif;\">NetworkWorld<\/span><img class=\"extlink-icon\" src=\"https:\/\/www.poweradmin.com\/blog\/wp-content\/plugins\/external-links-nofollow-open-in-new-tab-favicon\/images\/extlink.png\"><\/a><span style=\"font-family: 'Arial',sans-serif; color: black;\">)<\/span><\/p>\n<p style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">\u00a0<\/span><\/p>\n<p 
style=\"margin-bottom: .0001pt; line-height: 150%; text-autospace: none;\"><span style=\"font-family: 'Arial',sans-serif; color: black;\">Note that these inode-based methods only reveal hard links. Scanning for duplicate files that contain the same content but don\u2019t share inodes (i.e., simple file copies) means comparing the file contents themselves, which takes considerably more time and effort.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Des Nnochiri \u00a0 Keeping redundant copies of essential files and programs can assist in recovery when system glitches or other incidents occur. However, duplicate files also hold the potential to confuse matters and introduce errors. It\u2019s possible to have too much of a good thing, so keeping track of these duplicates is always a [&hellip;]<\/p>\n","protected":false},"author":15,"featured_media":6469,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,447,9],"tags":[23,294,269,697,699,698,693,700,704,705,398,526,538,703,702,701],"class_list":["post-6457","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-general-it","category-linux","category-technical","tag-data","tag-data-sharing","tag-data-storage","tag-duplicate-files","tag-file-corruption","tag-file-replication","tag-file-sharing","tag-hard-link","tag-inode","tag-inode-number","tag-linux","tag-linux-command","tag-linux-directory","tag-linux-drive","tag-linux-environment","tag-symbolic-link"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/posts\/6457","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/comm
ents?post=6457"}],"version-history":[{"count":5,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/posts\/6457\/revisions"}],"predecessor-version":[{"id":6493,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/posts\/6457\/revisions\/6493"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/media\/6469"}],"wp:attachment":[{"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/media?parent=6457"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/categories?post=6457"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.poweradmin.com\/blog\/wp-json\/wp\/v2\/tags?post=6457"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}