diff --git a/help/conf.tmpl b/help/conf.tmpl index 7c66d5ef..f7d45f9b 100644 --- a/help/conf.tmpl +++ b/help/conf.tmpl @@ -158,10 +158,10 @@ html_theme = 'haiku' #html_additional_pages = {} # If false, no module index is generated. -html_domain_indices = False +# html_domain_indices = False # If false, no index is generated. -html_use_index = False +# html_use_index = False # If true, the index is split into individual pages for each letter. #html_split_index = False diff --git a/help/en/faq.rst b/help/en/faq.rst index a41ed1af..87d393f1 100644 --- a/help/en/faq.rst +++ b/help/en/faq.rst @@ -29,7 +29,7 @@ What makes it better than other duplicate scanners? --------------------------------------------------- The scanning engine is extremely flexible. You can tweak it to really get the kind of results you -want. You can read more about dupeGuru tweaking option at the :doc:`Preferences page `. +want. You can read more about dupeGuru tweaking option in :doc:`scan`. How safe is it to use dupeGuru? ------------------------------- diff --git a/help/en/folders.rst b/help/en/folders.rst index 20eb035a..a3afb018 100644 --- a/help/en/folders.rst +++ b/help/en/folders.rst @@ -1,53 +1,67 @@ Folder Selection ================ -The first window you see when you launch dupeGuru is the folder selection window. This windows contains the list of the folders that will be scanned when you click on **Scan**. +The first window you see when you launch dupeGuru is the folder selection window. This windows +contains the list of the folders that will be scanned when you click on **Scan**. -This window is quite straightforward to use. If you want to add a folder, click on the **+** button. If you added folder before, a popup menu with a list of recent folders you added will pop. You can click on one of them to add it directly to your list. If you click on the first item of the popup menu, **Add New Folder...**, you will be prompted for a folder to add. 
If you never added a folder, no menu will pop and you will directly be prompted for a new folder to add. +This window is quite straightforward to use. If you want to add a folder, click on the **+** button. +If you added folder before, a popup menu with a list of recent folders you added will pop. You can +click on one of them to add it directly to your list. If you click on the first item of the popup +menu, **Add New Folder...**, you will be prompted for a folder to add. If you never added a folder, +no menu will pop and you will directly be prompted for a new folder to add. An alternate way to add folders to the list is to drag them in the list. -To remove a folder, select the folder to remove and click on **-**. If a subfolder is selected when you click the button, the selected folder will be set to **excluded** state (see below) instead of being removed. +To remove a folder, select the folder to remove and click on **-**. If a subfolder is selected when +you click the button, the selected folder will be set to **excluded** state (see below) instead of +being removed. Folder states ------------- Every folder can be in one of these 3 states: -* **Normal:** Duplicates found in this folder can be deleted. -* **Reference:** Duplicates found in this folder **cannot** be deleted. Files from this folder can only end up in **reference** position in the dupe group. If more than one file from reference folders end up in the same dupe group, only one will be kept. The others will be removed from the group. -* **Excluded:** Files in this directory will not be included in the scan. +**Normal:** + Duplicates found in this folder can be deleted. +**Reference:** + Duplicates found in this folder **cannot** be deleted. Files from this folder can + only end up in **reference** position in the dupe group. If more than one file from reference + folders end up in the same dupe group, only one will be kept. The others will be removed from + the group. 
+**Excluded:** + Files in this directory will not be included in the scan. -The default state of a folder is, of course, **Normal**. You can use **Reference** state for a folder if you want to be sure that you won't delete any file from it. +The default state of a folder is, of course, **Normal**. You can use **Reference** state for a +folder if you want to be sure that you won't delete any file from it. -When you set the state of a directory, all subfolders of this folder automatically inherit this state unless you explicitly set a subfolder's state. +When you set the state of a directory, all subfolders of this folder automatically inherit this +state unless you explicitly set a subfolder's state. -.. only:: edition_pe +.. _iphoto: - iPhoto and Aperture libraries - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - dupeGuru PE supports iPhoto and Aperture, which means that it knows how to read these libraries - and how to communicate with iPhoto and Aperture to remove photos from them. To use this feature, - use the special "Add iPhoto Library" and "Add Aperture Library" buttons in the menu that pops - up when you click the "+" button. This will then add a special folder for those libraries. - - When duplicates are deleted from an iPhoto library, it's sent to iPhoto's trash. - - When duplicates are deleted from an Aperture library, it unfortunately can't send it directly - to trash, but it creates a special project called "dupeGuru Trash" in Aperture and send all - photos in there. You can then send this project to the trash manually. +iPhoto and Aperture libraries +----------------------------- -.. only:: edition_me +dupeGuru Picture Edition supports iPhoto and Aperture, which means that it knows how to read these +libraries and how to communicate with iPhoto and Aperture to remove photos from them. To use this +feature, use the special "Add iPhoto Library" and "Add Aperture Library" buttons in the menu that +pops up when you click the "+" button. 
This will then add a special folder for those libraries. - iTunes library - ^^^^^^^^^^^^^^ - - dupeGuru ME supports iTunes, which means that it knows how to read its libraries and how to - communicate with iTunes to remove songs from it. To use this feature, use the special - "Add iTunes Library" button in the menu that pops up when you click the "+" button. This will - then add a special folder for those libraries. - - When duplicates are deleted from an iTunes library, it's sent to the system trash, like a - normal file, but it's also removed from iTunes, thus avoiding ending up with missing entries - (entries with the "!" logo next to them). +When duplicates are deleted (sent to trash) from an iPhoto library, they're sent to iPhoto's +trash. + +When duplicates are deleted (sent to trash) from an Aperture library, dupeGuru unfortunately can't +send them directly to trash, but it creates a special project called "dupeGuru Trash" in Aperture +and sends all photos there. You can then send this project to the trash manually. + +iTunes library +-------------- + +dupeGuru Music Edition supports iTunes, which means that it knows how to read its libraries and how +to communicate with iTunes to remove songs from it. To use this feature, use the special +"Add iTunes Library" button in the menu that pops up when you click the "+" button. This will +then add a special folder for those libraries. + +When duplicates are deleted from an iTunes library, they're sent to the system trash, like +normal files, but they're also removed from iTunes, thus avoiding missing entries +(entries with the "!" logo next to them).
diff --git a/help/en/index.rst b/help/en/index.rst index b758be2a..746ed55c 100644 --- a/help/en/index.rst +++ b/help/en/index.rst @@ -51,9 +51,16 @@ Contents: quick_start folders preferences + scan results reprioritize faq developer/index changelog credits + +Indices and tables +================== + +* :ref:`genindex` +* :ref:`search` diff --git a/help/en/preferences.rst b/help/en/preferences.rst index 7fb6a41d..1ba55091 100644 --- a/help/en/preferences.rst +++ b/help/en/preferences.rst @@ -1,63 +1,87 @@ Preferences =========== -.. only:: edition_se - - **Scan Type:** This option determines what aspect of the files will be compared in the duplicate scan. If you select **Filename**, dupeGuru will compare every filenames word-by-word and, depending on the other settings below, it will determine if enough words are matching to consider 2 files duplicates. If you select **Content**, only files with the exact same content will match. - - The **Folders** scan type is a bit special. When you choose it, dupeGuru will scan for duplicate *folders* instead of duplicate files. To determine whether two folders are duplicates, all files contained in the folders will be scanned, and if the contents of **all** files in the folders match, the folders will be considered duplicates. - - **Filter Hardness:** If you chose the **Filename** scan type, this option determines how similar two filenames must be for dupeGuru to consider them duplicates. If the filter hardness is, for example 80, it means that 80% of the words of two filenames must match. To determine the matching percentage, dupeGuru first counts the total number of words in **both** filenames, then count the number of words matching (every word matching count as 2), and then divide the number of words matching by the total number of words. If the result is higher or equal to the filter hardness, we have a duplicate match. For example, "a b c d" and "c d e" have a matching percentage of 57 (4 words matching, 7 total words). 
+**Scan Type:** + Basic scan type to use. See :doc:`scan` for details. .. only:: edition_me - **Scan Type:** This option determines what aspect of the files will be compared in the duplicate scan. The nature of the duplicate scan varies greatly depending on what you select for this option. - - * **Filename:** Every song will have its filename split into words, and then every word will be compared to compute a matching percentage. If this percentage is higher or equal to the **Filter Hardness** (see below for more details), dupeGuru will consider the 2 songs duplicates. - * **Filename - Fields:** Like **Filename**, except that once filename have been split into words, these words are then grouped into fields. The field separator is " - ". The final matching percentage will be the lowest matching percentage among the fields. Thus, "An Artist - The Title" and "An Artist - Other Title" would have a matching percentage of 50 (With a **Filename** scan, it would be 75). - * **Filename - Fields (No Order):** Like **Filename - Fields**, except that field order doesn't matter. For example, "An Artist - The Title" and "The Title - An Artist" would have a matching percentage of 100 instead of 0. - * **Tags:** This method reads the tag (metadata) of every song and compare their fields. This method, like the **Filename - Fields**, considers the lowest matching field as its final matching percentage. - * **Content:** This scan method use the actual content of the songs to determine which are duplicates. For 2 songs to match with this method, they must have the **exact same content**. - * **Audio Content:** Same as content, but only the audio content is compared (without metadata). - - **Filter Hardness:** If you chose a filename or tag based scan type, this option determines how similar two filenames/tags must be for dupeGuru to consider them duplicates. If the filter hardness is, for example 80, it means that 80% of the words of two filenames must match. 
To determine the matching percentage, dupeGuru first counts the total number of words in **both** filenames, then count the number of words matching (every word matching count as 2), and then divide the number of words matching by the total number of words. If the result is higher or equal to the filter hardness, we have a duplicate match. For example, "a b c d" and "c d e" have a matching percentage of 57 (4 words matching, 7 total words). - - **Tags to scan:** When using the **Tags** scan type, you can select the tags that will be used for comparison. + **Tags to scan:** + When using the **Tags** scan type, you can select the tags that will be used for comparison. .. only:: edition_se or edition_me - **Word weighting:** If you chose the **Filename** scan type, this option slightly changes how matching percentage is calculated. With word weighting, instead of having a value of 1 in the duplicate count and total word count, every word have a value equal to the number of characters they have. With word weighting, "ab cde fghi" and "ab cde fghij" would have a matching percentage of 53% (19 total characters, 10 characters matching (4 for "ab" and 6 for "cde")). + **Word weighting:** + See :ref:`word-weighting`. - **Match similar words:** If you turn this option on, similar words will be counted as matches. For example "The White Stripes" and "The White Stripe" would have a match % of 100 instead of 66 with that option turned on. **Warning:** Use this option with caution. It is likely that you will get a lot of false positives in your results when turning it on. However, it will help you to find duplicates that you wouldn't have found otherwise. The scan process also is significantly slower with this option turned on. + **Match similar words:** + See :ref:`similarity-matching`. .. only:: edition_pe - **Scan Type:** This option determines the type of scan that will be made on your pictures. 
The **Contents** scan type compares the actual contents of the pictures in a fuzzy way (making it possible to find not only exact duplicates, but also similar ones). The **EXIF Timestamp** scan type looks at the EXIF metadata of the picture (if it exists) and matches pictures that have the same one. It's much faster than the Contents scan. **Warning:** Modified pictures often keep the same EXIF timestamp, so watch out for false positives when you use that scan type. - - **Filter Hardness:** *Contents scan type only.* The higher is this setting, the "harder" is the filter (In other words, the less results you get). Most pictures of the same quality match at 100% even if the format is different (PNG and JPG for example.). However, if you want to make a PNG match with a lower quality JPG, you will have to set the filer hardness to lower than 100. The default, 95, is a sweet spot. + **Match pictures of different dimensions:** + If you check this box, pictures of different dimensions will be allowed in the same + duplicate group. - **Match pictures of different dimensions:** If you check this box, pictures of different dimensions will be allowed in the same duplicate group. +.. _filter-hardness: -**Can mix file kind:** If you check this box, duplicate groups are allowed to have files with different extensions. If you don't check it, well, they aren't! +**Filter Hardness:** + The threshold needed for two files to be considered duplicates. A lower value means more + duplicates. The meaning of the threshold depends on the scanning type (see :doc:`scan`). + Only works for :ref:`worded <worded-scan>` and :ref:`picture blocks <picture-blocks-scan>` + scans. -**Ignore duplicates hardlinking to the same file:** If this option is enabled, dupeGuru will verify duplicates to see if they refer to the same `inode `_. If they do, they will not be considered duplicates. (Only for OS X and Linux) -**Use regular expressions when filtering:** If you check this box, the filtering feature will treat your filter query as a **regular expression**. Explaining them is beyond the scope of this document. A good place to start learning it is `regular-expressions.info `_. +**Can mix file kind:** + If you check this box, duplicate groups are allowed to have files with different extensions.
If + you don't check it, well, they aren't! -**Use regular expressions when filtering:** If you check this box, the filtering feature will treat your filter query as a **regular expression**. Explaining them is beyond the scope of this document. A good place to start learning it is `regular-expressions.info `_. +**Ignore duplicates hardlinking to the same file:** + If this option is enabled, dupeGuru will verify duplicates to see if they refer to the same + `inode`_. If they do, they will not be considered duplicates. (Only for OS X and Linux) -**Remove empty folders after delete or move:** When this option is enabled, folders are deleted after a file is deleted or moved and the folder is empty. +**Use regular expressions when filtering:** + If you check this box, the filtering feature will treat your filter query as a + **regular expression**. Explaining them is beyond the scope of this document. A good place to + start learning it is `regular-expressions.info`_. -**Copy and Move:** Determines how the Copy and Move operations (in the Action menu) will behave. +**Remove empty folders after delete or move:** + When this option is enabled, folders are deleted after a file is deleted or moved and the folder + is empty. -* **Right in destination:** All files will be sent directly in the selected destination, without trying to recreate the source path at all. -* **Recreate relative path:** The source file's path will be re-created in the destination folder up to the root selection in the Directories panel. For example, if you added ``/Users/foobar/SomeFolder`` to your Directories panel and you move ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the destination ``/Users/foobar/MyDestination``, the final destination for the file will be ``/Users/foobar/MyDestination/SubFolder`` (``SomeFolder`` has been trimmed from source's path in the final destination.). 
-* **Recreate absolute path:** The source file's path will be re-created in the destination folder in it's entirety. For example, if you move ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the destination ``/Users/foobar/MyDestination``, the final destination for the file will be ``/Users/foobar/MyDestination/Users/foobar/SomeFolder/SubFolder``. +**Copy and Move:** + Determines how the Copy and Move operations (in the Action menu) will behave. -In all cases, dupeGuru nicely handles naming conflicts by prepending a number to the destination filename if the filename already exists in the destination. +* **Right in destination:** All files will be sent directly in the selected destination, without + trying to recreate the source path at all. +* **Recreate relative path:** The source file's path will be re-created in the destination folder up + to the root selection in the Directories panel. For example, if you added + ``/Users/foobar/SomeFolder`` to your Directories panel and you move + ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the destination + ``/Users/foobar/MyDestination``, the final destination for the file will be + ``/Users/foobar/MyDestination/SubFolder`` (``SomeFolder`` has been trimmed from source's path in + the final destination.). +* **Recreate absolute path:** The source file's path will be re-created in the destination folder in + its entirety. For example, if you move ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the + destination ``/Users/foobar/MyDestination``, the final destination for the file will be + ``/Users/foobar/MyDestination/Users/foobar/SomeFolder/SubFolder``. -**Custom Command:** This preference determines the command that will be invoked by the "Invoke Custom Command" action. You can invoke any external application through this action. This can be useful if, for example, you have a nice diffing application installed. 
+In all cases, dupeGuru nicely handles naming conflicts by prepending a number to the destination +filename if the filename already exists in the destination. -The format of the command is the same as what you would write in the command line, except that there are 2 placeholders: **%d** and **%r**. These placeholders will be replaced by the path of the selected dupe (%d) and the path of the selected dupe's reference file (%r). +**Custom Command:** + This preference determines the command that will be invoked by the "Invoke Custom Command" + action. You can invoke any external application through this action. This can be useful if, + for example, you have a nice diffing application installed. + +The format of the command is the same as what you would write in the command line, except that there +are 2 placeholders: **%d** and **%r**. These placeholders will be replaced by the path of the +selected dupe (%d) and the path of the selected dupe's reference file (%r). -If the path to your executable contains space characters, you should enclose it in "" quotes. You should also enclose placeholders in quotes because it's very possible that paths to dupes and refs will contain spaces. Here's an example custom command:: +If the path to your executable contains space characters, you should enclose it in "" quotes. You +should also enclose placeholders in quotes because it's very possible that paths to dupes and refs +will contain spaces. Here's an example custom command:: "C:\Program Files\SuperDiffProg\SuperDiffProg.exe" "%d" "%r" + +.. _inode: http://en.wikipedia.org/wiki/Inode +.. _regular-expressions.info: http://www.regular-expressions.info \ No newline at end of file diff --git a/help/en/results.rst b/help/en/results.rst index 515b8a31..49c3eba6 100644 --- a/help/en/results.rst +++ b/help/en/results.rst @@ -1,6 +1,8 @@ Results ======= +.. contents:: + When dupeGuru is finished scanning for duplicates, it will show its results in the form of duplicate group list. 
About duplicate groups @@ -118,42 +120,54 @@ filtered duplicates. Action Menu ----------- -* **Clear Ignore List:** Remove all ignored matches you added. You have to start a new scan for the - newly cleared ignore list to be effective. -* **Export Results to XHTML:** Take the current results, and create an XHTML file out of it. The - columns that are visible when you click on this button will be the columns present in the XHTML - file. The file will automatically be opened in your default browser. -* **Send Marked to Trash:** Send all marked duplicates to trash, obviously. Before proceeding, - you'll be presented deletion options (see below). -* **Move Marked to...:** Prompt you for a destination, and then move all marked files to that - destination. Source file's path might be re-created in destination, depending on the - "Copy and Move" preference. -* **Copy Marked to...:** Prompt you for a destination, and then copy all marked files to that - destination. Source file's path might be re-created in destination, depending on the - "Copy and Move" preference. -* **Remove Marked from Results:** Remove all marked duplicates from results. The actual files will - not be touched and will stay where they are. -* **Remove Selected from Results:** Remove all selected duplicates from results. Note that all - selected reference files will be ignored, only duplicates can be removed with this action. -* **Make Selected into Reference:** Promote all selected duplicates to reference. If a duplicate is - a part of a group having a reference file coming from a reference folder (in blue color), no - action will be taken for this duplicate. If more than one duplicate among the same group are - selected, only the first of each group will be promoted. -* **Add Selected to Ignore List:** This first removes all selected duplicates from results, and - then add the match of that duplicate and the current reference in the ignore list. This match - will not come up again in further scan. 
The duplicate itself might come back, but it will be - matched with another reference file. You can clear the ignore list with the Clear Ignore List - command. -* **Open Selected with Default Application:** Open the file with the application associated with - selected file's type. -* **Reveal Selected in Finder:** Open the folder containing selected file. -* **Invoke Custom Command:** Invokes the external application you've set up in your preferences - using the current selection as arguments in the invocation. -* **Rename Selected:** Prompts you for a new name, and then rename the selected file. +**Clear Ignore List:** + Remove all ignored matches you added. You have to start a new scan for the + newly cleared ignore list to be effective. +**Export Results to XHTML:** + Take the current results, and create an XHTML file out of it. The + columns that are visible when you click on this button will be the columns present in the XHTML + file. The file will automatically be opened in your default browser. +**Send Marked to Trash:** + Send all marked duplicates to trash, obviously. Before proceeding, + you'll be presented deletion options (see below). +**Move Marked to...:** + Prompt you for a destination, and then move all marked files to that + destination. Source file's path might be re-created in destination, depending on the + "Copy and Move" preference. +**Copy Marked to...:** + Prompt you for a destination, and then copy all marked files to that + destination. Source file's path might be re-created in destination, depending on the + "Copy and Move" preference. +**Remove Marked from Results:** + Remove all marked duplicates from results. The actual files will + not be touched and will stay where they are. +**Remove Selected from Results:** + Remove all selected duplicates from results. Note that all + selected reference files will be ignored, only duplicates can be removed with this action. 
+**Make Selected into Reference:** + Promote all selected duplicates to reference. If a duplicate is + a part of a group having a reference file coming from a reference folder (in blue color), no + action will be taken for this duplicate. If more than one duplicate among the same group are + selected, only the first of each group will be promoted. +**Add Selected to Ignore List:** + This first removes all selected duplicates from results, and + then add the match of that duplicate and the current reference in the ignore list. This match + will not come up again in further scan. The duplicate itself might come back, but it will be + matched with another reference file. You can clear the ignore list with the Clear Ignore List + command. +**Open Selected with Default Application:** + Open the file with the application associated with selected file's type. +**Reveal Selected in Finder:** + Open the folder containing selected file. +**Invoke Custom Command:** + Invokes the external application you've set up in your preferences using the current selection + as arguments in the invocation. +**Rename Selected:** + Prompts you for a new name, and then rename the selected file. -**Warning about moving files in iPhoto/iTunes:** When using the "Move Marked" action on duplicates -that come from iPhoto or iTunes, files are copied, not moved. dupeGuru cannot use the Move action -on those files. +**Warning about moving files in iPhoto/iTunes/Aperture:** When using the "Move Marked" action on +duplicates that come from iPhoto, Aperture or iTunes, files are copied, not moved. dupeGuru cannot +use the Move action on those files. Deletion Options ---------------- @@ -161,21 +175,23 @@ Deletion Options These options affect how duplicate deletion takes place. Most of the time, you don't need to enable any of them. -* **Link deleted files:** The deleted files are replaced by a link to the reference file. You have - a choice of replacing it either with a `symlink`_ or a `hardlink`_. 
It's better to read the whole - wikipedia pages about them to make a informed choice, but in short, a symlink is a shortcut to - the file's path. If the original file is deleted or moved, the link is broken. A hardlink is a - link to the file *itself*. That link is as good as a "real" file. Only when *all* hardlinks to a - file are deleted is the file itself deleted. +**Link deleted files:** + The deleted files are replaced by a link to the reference file. You have a choice of replacing + it either with a `symlink`_ or a `hardlink`_. It's better to read the whole + wikipedia pages about them to make a informed choice, but in short, a symlink is a shortcut to + the file's path. If the original file is deleted or moved, the link is broken. A hardlink is a + link to the file *itself*. That link is as good as a "real" file. Only when *all* hardlinks to a + file are deleted is the file itself deleted. - On OSX and Linux, this feature is supported fully, but under Windows, it's a bit complicated. - Windows XP doesn't support it, but Vista and up support it. However, for the feature to work, - dupeGuru has to run with administrative privileges. + On OSX and Linux, this feature is supported fully, but under Windows, it's a bit complicated. + Windows XP doesn't support it, but Vista and up support it. However, for the feature to work, + dupeGuru has to run with administrative privileges. -* **Directly delete files:** Instead of sending files to trash, directly delete them. This is used - for troubleshooting and you normally don't need to enable this unless dupeGuru has problems - deleting files normally, something that can happens when you try to delete files on network - storage (NAS). +**Directly delete files:** + Instead of sending files to trash, directly delete them. 
This is used + for troubleshooting and you normally don't need to enable this unless dupeGuru has problems + deleting files normally, something that can happen when you try to delete files on network + storage (NAS). .. _regular-expressions.info: http://www.regular-expressions.info .. _hardlink: http://en.wikipedia.org/wiki/Hard_link diff --git a/help/en/scan.rst b/help/en/scan.rst new file mode 100644 index 00000000..689af049 --- /dev/null +++ b/help/en/scan.rst @@ -0,0 +1,186 @@ +The scanning process +==================== + +.. contents:: + +dupeGuru has 3 basic ways of scanning: :ref:`worded-scan`, :ref:`contents-scan` and +:ref:`picture blocks <picture-blocks-scan>`. The first two modes are for the Standard and Music +editions, the last is for the Picture edition. The scanning process is configured through the +:doc:`Preference pane <preferences>`. + +.. _worded-scan: + +Worded scans +------------ + +*Standard and Music Editions only*. + +Worded scans extract a string from each file and split it into words. The string can come from two +different sources: **Filename** or **Tags** (Music Edition only). + +When our source is music tags, we have to choose which tags to use. If, for example, we choose to +analyse *artist* and *title* tags, we'd end up with strings like +"The White Stripes - Seven Nation Army". + +Words are split by space characters, with all punctuation removed (some are replaced by spaces, some +by nothing) and all words lowercased. For example, the string "This guy's song(remix)" yields +*this*, *guys*, *song* and *remix*. + +Once this is done, the scanning dance begins. Finding duplicates is only a matter of finding how +many words in common two given strings have. If the :ref:`filter hardness <filter-hardness>` is, +for example, ``80``, it means that 80% of the words of two strings must match.
To determine the +matching percentage, dupeGuru first counts the total number of words in **both** strings, then counts +the number of words matching (every word matching counts as 2), and then divides the number of words +matching by the total number of words. If the result is higher than or equal to the filter hardness, +we have a duplicate match. For example, "a b c d" and "c d e" have a matching percentage of 57 +(4 words matching, 7 total words). + +Fields +^^^^^^ + +*Music Edition only*. + +Song filenames often come with multiple and distinct parts and this can cause problems. For example, +let's take these two songs: "Dolly Parton - I Will Always Love You" and +"Whitney Houston - I Will Always Love You". They are clearly not the same song (they come from +different artists), but they still have a matching score of 71%! This means that, with a naive +scanning method, we would get these songs as a false positive as soon as we try to dig a bit deeper +in our dupe hunt by lowering the threshold a bit. + +This is why we have the "Fields" concept. Fields are separated by dashes (``-``). When the +"Filename - Fields" scan type is chosen, each field is compared separately. Our final matching score +is the lowest score among all the fields. In our example, the title has a 100% match, but the +artist has a 0% match, making our final match score 0. + +Sometimes, our song filename policy isn't completely homogeneous, which means that we can end up with +"The White Stripes - Seven Nation Army" and "Seven Nation Army - The White Stripes". This is why +we have the "Filename - Fields (No Order)" scan type. With this scan type, all fields are compared +with each other, and the highest score is kept. Then, the final matching score is the lowest of them +all. In our case, the final matching score is 100. + +Note: Each field is used once.
Thus, "The White Stripes - The White Stripes" and +"The White Stripes - Seven Nation Army" have a match score of 0 because the second +"The White Stripes" can't be compared with the first field of the other name because it has already +been "used up" by the first field. Our final match score would be 0. + +*Tags* scanning method is always "fielded". When choosing this scan method, we also choose which +tags are going to be compared, each being a field. + +.. _word-weighting: + +Word weighting +^^^^^^^^^^^^^^ + +When enabled, this option slightly changes how matching percentage is calculated by making bigger +words worth more. With word weighting, instead of having a value of 1 in the duplicate count and +total word count, every word has a value equal to the number of characters it has. With word +weighting, "ab cde fghi" and "ab cde fghij" would have a matching percentage of 53% (19 total +characters, 10 characters matching (4 for "ab" and 6 for "cde")). + +.. _similarity-matching: + +Similarity matching +^^^^^^^^^^^^^^^^^^^ + +When enabled, similar words will be counted as matches. For example "The White Stripes" and +"The White Stripe" would have a match score of 100 instead of 66 with that option turned on. + +Two words are considered similar if they can be made equal with only a few edit operations (removing +a letter, adding one etc.). The process used is not unlike the +`Levenshtein distance`_. For the technically inclined, the actual function used is +Python's `get_close_matches`_ with a ``0.8`` cutoff. + +**Warning:** Use this option with caution. It is likely that you will get a lot of false positives +in your results when turning it on. However, it will help you to find duplicates that you wouldn't +have found otherwise. The scan process is also significantly slower with this option turned on. + +.. _contents-scan: + +Contents scans +-------------- + +Contents scans are much simpler than worded scans.
We read files and if the contents are exactly the +same, we consider the two files duplicates. + +This, of course, takes quite a bit longer than comparing filenames and, to avoid needlessly reading whole +file contents, we start by looking at file sizes. After having grouped our files by size, we discard +every file that is alone in its group. Then, we proceed to read the contents of our remaining files. + +MD5 hashes are used to compare contents. Yes, it is widely known that forging files having +the same MD5 hash is easy, but such a file has to be knowingly forged. The probability of two files +having the same MD5 hash *and* the same size by accident is still very, very small. + +The :ref:`filter hardness ` preference is ignored in this scan. + +Audio contents +^^^^^^^^^^^^^^ + +*Music Edition only*. + +This mode is very much like the normal contents scan. The only difference is that it ignores +metadata included in the file and only compares audio data. *It doesn't do audio data fuzzy +matching, only exact matching. It would be really cool to have that, but we aren't there yet.* + +Folders +^^^^^^^ + +*Standard Edition only*. + +This is a special Contents scan type. It works like a normal contents scan, but instead of trying to +find duplicate files, it tries to find duplicate folders. A folder is a duplicate of another if all +files it contains have the same contents as the other folder's files. + +This scan is, of course, recursive and subfolders are checked. dupeGuru keeps only the biggest +fishes. Therefore, if two folders that are considered as matching contain subfolders, these +subfolders will not be included in the final results. + +With this mode, we end up with folders as results instead of files. + +.. _picture-blocks-scan: + +Picture blocks +-------------- + +*Picture Edition only*. + +dupeGuru Picture Edition stands apart from its two friends. Its scan types are completely different.
+The first one is its "Contents" scan, which is a bit too generic, hence the name we use here, +"Picture blocks". + +We start by opening every picture in RGB bitmap mode, then we "blockify" the picture. We create a +15x15 grid and compute the average color of each grid tile. This is the "picture analysis" phase. +It's very time consuming and the result is cached in a database (the "picture cache"). + +Once we've done that, we can start comparing them. Each tile in the grid (an average color) is +compared to its corresponding tile on the other picture and a color diff is computed (it's simply +a sum of the difference of R, G and B on each side). All these sums are added up to a final "score". + +If that score is smaller than or equal to ``100 - threshold``, we have a match. + +A threshold of 100 adds an additional constraint that pictures have to be exactly the same (it's +possible, due to averaging, that the tile comparison yields ``0`` for pictures that aren't exactly +the same, but since "100%" suggests "exactly the same", we discard those occurrences). If you want +to get pictures that are very, very similar but still allow a bit of fuzzy differences, go for 99%. + +This second part of the scan is CPU intensive and can take quite a bit of time. This task has been +made to take advantage of multi-core CPUs and has been optimized to the best of my abilities, but +the fact of the matter is that, due to the fuzziness of the task, we still have to compare every picture +to every other, making the algorithm quadratic (if ``N`` is the number of pictures to compare, the +number of comparisons to perform is ``N*N``). + +This algorithm is very naive, but in the field, it works rather well. If you master a better +algorithm and want to improve dupeGuru, by all means, let me know! + +EXIF Timestamp +-------------- + +*Picture Edition only*. + +This one is easy. We read the EXIF information of every picture and extract the ``DateTimeOriginal`` +tag.
If the tag is the same for two pictures, they're considered duplicates. + +**Warning:** Modified pictures often keep the same EXIF timestamp, so watch out for false positives +when you use that scan type. + +.. _Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance +.. _get_close_matches: http://docs.python.org/3/library/difflib.html#difflib.get_close_matches
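The matching percentage described under "Worded scans" and "Word weighting" can be illustrated with a minimal Python sketch. This is not dupeGuru's actual code: the function names ``split_words`` and ``match_percentage`` are hypothetical, the punctuation rule (apostrophes dropped, other punctuation turned into spaces) is an assumption chosen only to reproduce the "This guy's song(remix)" example, and the result is returned as an unrounded float because the text does not state how the percentage is rounded.

```python
import re
from collections import Counter


def split_words(text):
    """Lowercase the string, strip punctuation and split it on whitespace.

    Assumption: apostrophes are removed outright and every other
    punctuation character becomes a space.
    """
    text = text.lower().replace("'", "")
    return re.sub(r"[^\w\s]", " ", text).split()


def match_percentage(first, second, weight_words=False):
    """Return the matching percentage of two strings as a float.

    Every matching word counts twice (once per string). With
    weight_words=True, a word is worth its number of characters
    instead of 1.
    """
    words1, words2 = split_words(first), split_words(second)
    weight = len if weight_words else (lambda word: 1)
    total = sum(weight(word) for word in words1 + words2)
    if total == 0:
        return 0.0
    # Multiset intersection: each word can only be matched once.
    common = Counter(words1) & Counter(words2)
    matching = sum(2 * weight(word) * count for word, count in common.items())
    return matching * 100 / total


print(split_words("This guy's song(remix)"))
print(match_percentage("a b c d", "c d e"))
print(match_percentage("ab cde fghi", "ab cde fghij", weight_words=True))
```

With the examples from the text, "a b c d" against "c d e" gives 4 matching out of 7 words, and the weighted comparison of "ab cde fghi" with "ab cde fghij" gives 10 matching out of 19 characters. A pair would then count as a duplicate when the value is at least the filter hardness.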