mirror of https://github.com/arsenetar/dupeguru.git synced 2025-09-11 17:58:17 +00:00

Greatly improved docs

Added a new scan.rst page, laying out in much more details than before
the inner workings of the scanning process.

Fixes #208, but does much more than that.
Virgil Dupras 2013-11-17 12:03:48 -05:00
parent 508e9a5d94
commit 398ac9b7c6
7 changed files with 369 additions and 122 deletions


@ -158,10 +158,10 @@ html_theme = 'haiku'
#html_additional_pages = {} #html_additional_pages = {}
# If false, no module index is generated. # If false, no module index is generated.
html_domain_indices = False # html_domain_indices = False
# If false, no index is generated. # If false, no index is generated.
html_use_index = False # html_use_index = False
# If true, the index is split into individual pages for each letter. # If true, the index is split into individual pages for each letter.
#html_split_index = False #html_split_index = False


@ -29,7 +29,7 @@ What makes it better than other duplicate scanners?
--------------------------------------------------- ---------------------------------------------------
The scanning engine is extremely flexible. You can tweak it to really get the kind of results you The scanning engine is extremely flexible. You can tweak it to really get the kind of results you
want. You can read more about dupeGuru tweaking option at the :doc:`Preferences page <preferences>`. want. You can read more about dupeGuru tweaking option in :doc:`scan`.
How safe is it to use dupeGuru? How safe is it to use dupeGuru?
------------------------------- -------------------------------


@ -1,53 +1,67 @@
Folder Selection Folder Selection
================ ================
The first window you see when you launch dupeGuru is the folder selection window. This windows contains the list of the folders that will be scanned when you click on **Scan**. The first window you see when you launch dupeGuru is the folder selection window. This windows
contains the list of the folders that will be scanned when you click on **Scan**.
This window is quite straightforward to use. If you want to add a folder, click on the **+** button. If you added folder before, a popup menu with a list of recent folders you added will pop. You can click on one of them to add it directly to your list. If you click on the first item of the popup menu, **Add New Folder...**, you will be prompted for a folder to add. If you never added a folder, no menu will pop and you will directly be prompted for a new folder to add. This window is quite straightforward to use. If you want to add a folder, click on the **+** button.
If you added folder before, a popup menu with a list of recent folders you added will pop. You can
click on one of them to add it directly to your list. If you click on the first item of the popup
menu, **Add New Folder...**, you will be prompted for a folder to add. If you never added a folder,
no menu will pop and you will directly be prompted for a new folder to add.
An alternate way to add folders to the list is to drag them in the list. An alternate way to add folders to the list is to drag them in the list.
To remove a folder, select the folder to remove and click on **-**. If a subfolder is selected when you click the button, the selected folder will be set to **excluded** state (see below) instead of being removed. To remove a folder, select the folder to remove and click on **-**. If a subfolder is selected when
you click the button, the selected folder will be set to **excluded** state (see below) instead of
being removed.
Folder states Folder states
------------- -------------
Every folder can be in one of these 3 states: Every folder can be in one of these 3 states:
* **Normal:** Duplicates found in this folder can be deleted. **Normal:**
* **Reference:** Duplicates found in this folder **cannot** be deleted. Files from this folder can only end up in **reference** position in the dupe group. If more than one file from reference folders end up in the same dupe group, only one will be kept. The others will be removed from the group. Duplicates found in this folder can be deleted.
* **Excluded:** Files in this directory will not be included in the scan. **Reference:**
Duplicates found in this folder **cannot** be deleted. Files from this folder can
only end up in **reference** position in the dupe group. If more than one file from reference
folders end up in the same dupe group, only one will be kept. The others will be removed from
the group.
**Excluded:**
Files in this directory will not be included in the scan.
The default state of a folder is, of course, **Normal**. You can use **Reference** state for a folder if you want to be sure that you won't delete any file from it. The default state of a folder is, of course, **Normal**. You can use **Reference** state for a
folder if you want to be sure that you won't delete any file from it.
When you set the state of a directory, all subfolders of this folder automatically inherit this state unless you explicitly set a subfolder's state. When you set the state of a directory, all subfolders of this folder automatically inherit this
state unless you explicitly set a subfolder's state.
.. only:: edition_pe .. _iphoto:
iPhoto and Aperture libraries iPhoto and Aperture libraries
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -----------------------------
dupeGuru PE supports iPhoto and Aperture, which means that it knows how to read these libraries dupeGuru Picture Edition supports iPhoto and Aperture, which means that it knows how to read these
and how to communicate with iPhoto and Aperture to remove photos from them. To use this feature, libraries and how to communicate with iPhoto and Aperture to remove photos from them. To use this
use the special "Add iPhoto Library" and "Add Aperture Library" buttons in the menu that pops feature, use the special "Add iPhoto Library" and "Add Aperture Library" buttons in the menu that
up when you click the "+" button. This will then add a special folder for those libraries. pops up when you click the "+" button. This will then add a special folder for those libraries.
When duplicates are deleted from an iPhoto library, it's sent to iPhoto's trash. When duplicates are deleted (sent to trash) from an iPhoto library, it's sent to iPhoto's
trash.
When duplicates are deleted from an Aperture library, it unfortunately can't send it directly When duplicates are deleted (sent to trash) from an Aperture library, it unfortunately can't
to trash, but it creates a special project called "dupeGuru Trash" in Aperture and send all send it directly to trash, but it creates a special project called "dupeGuru Trash" in Aperture
photos in there. You can then send this project to the trash manually. and send all photos in there. You can then send this project to the trash manually.
.. only:: edition_me iTunes library
--------------
iTunes library dupeGuru Music Edition supports iTunes, which means that it knows how to read its libraries and how
^^^^^^^^^^^^^^ to communicate with iTunes to remove songs from it. To use this feature, use the special
"Add iTunes Library" button in the menu that pops up when you click the "+" button. This will
then add a special folder for those libraries.
dupeGuru ME supports iTunes, which means that it knows how to read its libraries and how to When duplicates are deleted from an iTunes library, it's sent to the system trash, like a
communicate with iTunes to remove songs from it. To use this feature, use the special normal file, but it's also removed from iTunes, thus avoiding ending up with missing entries
"Add iTunes Library" button in the menu that pops up when you click the "+" button. This will (entries with the "!" logo next to them).
then add a special folder for those libraries.
When duplicates are deleted from an iTunes library, it's sent to the system trash, like a
normal file, but it's also removed from iTunes, thus avoiding ending up with missing entries
(entries with the "!" logo next to them).


@ -51,9 +51,16 @@ Contents:
quick_start quick_start
folders folders
preferences preferences
scan
results results
reprioritize reprioritize
faq faq
developer/index developer/index
changelog changelog
credits credits
Indices and tables
==================
* :ref:`genindex`
* :ref:`search`


@ -1,63 +1,87 @@
Preferences Preferences
=========== ===========
.. only:: edition_se **Scan Type:**
Basic scan type to use. See :doc:`scan` for details.
**Scan Type:** This option determines what aspect of the files will be compared in the duplicate scan. If you select **Filename**, dupeGuru will compare every filenames word-by-word and, depending on the other settings below, it will determine if enough words are matching to consider 2 files duplicates. If you select **Content**, only files with the exact same content will match.
The **Folders** scan type is a bit special. When you choose it, dupeGuru will scan for duplicate *folders* instead of duplicate files. To determine whether two folders are duplicates, all files contained in the folders will be scanned, and if the contents of **all** files in the folders match, the folders will be considered duplicates.
**Filter Hardness:** If you chose the **Filename** scan type, this option determines how similar two filenames must be for dupeGuru to consider them duplicates. If the filter hardness is, for example 80, it means that 80% of the words of two filenames must match. To determine the matching percentage, dupeGuru first counts the total number of words in **both** filenames, then count the number of words matching (every word matching count as 2), and then divide the number of words matching by the total number of words. If the result is higher or equal to the filter hardness, we have a duplicate match. For example, "a b c d" and "c d e" have a matching percentage of 57 (4 words matching, 7 total words).
.. only:: edition_me .. only:: edition_me
**Scan Type:** This option determines what aspect of the files will be compared in the duplicate scan. The nature of the duplicate scan varies greatly depending on what you select for this option. **Tags to scan:**
When using the **Tags** scan type, you can select the tags that will be used for comparison.
* **Filename:** Every song will have its filename split into words, and then every word will be compared to compute a matching percentage. If this percentage is higher or equal to the **Filter Hardness** (see below for more details), dupeGuru will consider the 2 songs duplicates.
* **Filename - Fields:** Like **Filename**, except that once filename have been split into words, these words are then grouped into fields. The field separator is " - ". The final matching percentage will be the lowest matching percentage among the fields. Thus, "An Artist - The Title" and "An Artist - Other Title" would have a matching percentage of 50 (With a **Filename** scan, it would be 75).
* **Filename - Fields (No Order):** Like **Filename - Fields**, except that field order doesn't matter. For example, "An Artist - The Title" and "The Title - An Artist" would have a matching percentage of 100 instead of 0.
* **Tags:** This method reads the tag (metadata) of every song and compare their fields. This method, like the **Filename - Fields**, considers the lowest matching field as its final matching percentage.
* **Content:** This scan method use the actual content of the songs to determine which are duplicates. For 2 songs to match with this method, they must have the **exact same content**.
* **Audio Content:** Same as content, but only the audio content is compared (without metadata).
**Filter Hardness:** If you chose a filename or tag based scan type, this option determines how similar two filenames/tags must be for dupeGuru to consider them duplicates. If the filter hardness is, for example 80, it means that 80% of the words of two filenames must match. To determine the matching percentage, dupeGuru first counts the total number of words in **both** filenames, then count the number of words matching (every word matching count as 2), and then divide the number of words matching by the total number of words. If the result is higher or equal to the filter hardness, we have a duplicate match. For example, "a b c d" and "c d e" have a matching percentage of 57 (4 words matching, 7 total words).
**Tags to scan:** When using the **Tags** scan type, you can select the tags that will be used for comparison.
.. only:: edition_se or edition_me .. only:: edition_se or edition_me
**Word weighting:** If you chose the **Filename** scan type, this option slightly changes how matching percentage is calculated. With word weighting, instead of having a value of 1 in the duplicate count and total word count, every word have a value equal to the number of characters they have. With word weighting, "ab cde fghi" and "ab cde fghij" would have a matching percentage of 53% (19 total characters, 10 characters matching (4 for "ab" and 6 for "cde")). **Word weighting:**
See :ref:`word-weighting`.
**Match similar words:** If you turn this option on, similar words will be counted as matches. For example "The White Stripes" and "The White Stripe" would have a match % of 100 instead of 66 with that option turned on. **Warning:** Use this option with caution. It is likely that you will get a lot of false positives in your results when turning it on. However, it will help you to find duplicates that you wouldn't have found otherwise. The scan process also is significantly slower with this option turned on. **Match similar words:**
See :ref:`similarity-matching`.
.. only:: edition_pe .. only:: edition_pe
**Scan Type:** This option determines the type of scan that will be made on your pictures. The **Contents** scan type compares the actual contents of the pictures in a fuzzy way (making it possible to find not only exact duplicates, but also similar ones). The **EXIF Timestamp** scan type looks at the EXIF metadata of the picture (if it exists) and matches pictures that have the same one. It's much faster than the Contents scan. **Warning:** Modified pictures often keep the same EXIF timestamp, so watch out for false positives when you use that scan type. **Match pictures of different dimensions:**
If you check this box, pictures of different dimensions will be allowed in the same
duplicate group.
**Filter Hardness:** *Contents scan type only.* The higher is this setting, the "harder" is the filter (In other words, the less results you get). Most pictures of the same quality match at 100% even if the format is different (PNG and JPG for example.). However, if you want to make a PNG match with a lower quality JPG, you will have to set the filer hardness to lower than 100. The default, 95, is a sweet spot. .. _filter-hardness:
**Match pictures of different dimensions:** If you check this box, pictures of different dimensions will be allowed in the same duplicate group. **Filter Hardness:**
The threshold needed for two files to be considered duplicates. A lower value means more
duplicates. The meaning of the threshold depends on the scanning type (see :doc:`scan`).
Only works for :ref:`worded <worded-scan>` and :ref:`picture blocks <picture-blocks-scan>`
scans.
**Can mix file kind:** If you check this box, duplicate groups are allowed to have files with different extensions. If you don't check it, well, they aren't! **Can mix file kind:**
If you check this box, duplicate groups are allowed to have files with different extensions. If
you don't check it, well, they aren't!
**Ignore duplicates hardlinking to the same file:** If this option is enabled, dupeGuru will verify duplicates to see if they refer to the same `inode <http://en.wikipedia.org/wiki/Inode>`_. If they do, they will not be considered duplicates. (Only for OS X and Linux) **Ignore duplicates hardlinking to the same file:**
If this option is enabled, dupeGuru will verify duplicates to see if they refer to the same
`inode`_. If they do, they will not be considered duplicates. (Only for OS X and Linux)
**Use regular expressions when filtering:** If you check this box, the filtering feature will treat your filter query as a **regular expression**. Explaining them is beyond the scope of this document. A good place to start learning it is `regular-expressions.info <http://www.regular-expressions.info>`_. **Use regular expressions when filtering:**
If you check this box, the filtering feature will treat your filter query as a
**regular expression**. Explaining them is beyond the scope of this document. A good place to
start learning it is `regular-expressions.info`_.
**Remove empty folders after delete or move:** When this option is enabled, folders are deleted after a file is deleted or moved and the folder is empty. **Remove empty folders after delete or move:**
When this option is enabled, folders are deleted after a file is deleted or moved and the folder
is empty.
**Copy and Move:** Determines how the Copy and Move operations (in the Action menu) will behave. **Copy and Move:**
Determines how the Copy and Move operations (in the Action menu) will behave.
* **Right in destination:** All files will be sent directly in the selected destination, without trying to recreate the source path at all. * **Right in destination:** All files will be sent directly in the selected destination, without
* **Recreate relative path:** The source file's path will be re-created in the destination folder up to the root selection in the Directories panel. For example, if you added ``/Users/foobar/SomeFolder`` to your Directories panel and you move ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the destination ``/Users/foobar/MyDestination``, the final destination for the file will be ``/Users/foobar/MyDestination/SubFolder`` (``SomeFolder`` has been trimmed from source's path in the final destination.). trying to recreate the source path at all.
* **Recreate absolute path:** The source file's path will be re-created in the destination folder in it's entirety. For example, if you move ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the destination ``/Users/foobar/MyDestination``, the final destination for the file will be ``/Users/foobar/MyDestination/Users/foobar/SomeFolder/SubFolder``. * **Recreate relative path:** The source file's path will be re-created in the destination folder up
to the root selection in the Directories panel. For example, if you added
``/Users/foobar/SomeFolder`` to your Directories panel and you move
``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the destination
``/Users/foobar/MyDestination``, the final destination for the file will be
``/Users/foobar/MyDestination/SubFolder`` (``SomeFolder`` has been trimmed from source's path in
the final destination.).
* **Recreate absolute path:** The source file's path will be re-created in the destination folder in
its entirety. For example, if you move ``/Users/foobar/SomeFolder/SubFolder/SomeFile.ext`` to the
destination ``/Users/foobar/MyDestination``, the final destination for the file will be
``/Users/foobar/MyDestination/Users/foobar/SomeFolder/SubFolder``.
In all cases, dupeGuru nicely handles naming conflicts by prepending a number to the destination filename if the filename already exists in the destination. In all cases, dupeGuru nicely handles naming conflicts by prepending a number to the destination
filename if the filename already exists in the destination.
**Custom Command:** This preference determines the command that will be invoked by the "Invoke Custom Command" action. You can invoke any external application through this action. This can be useful if, for example, you have a nice diffing application installed. **Custom Command:**
This preference determines the command that will be invoked by the "Invoke Custom Command"
action. You can invoke any external application through this action. This can be useful if,
for example, you have a nice diffing application installed.
The format of the command is the same as what you would write in the command line, except that there are 2 placeholders: **%d** and **%r**. These placeholders will be replaced by the path of the selected dupe (%d) and the path of the selected dupe's reference file (%r). The format of the command is the same as what you would write in the command line, except that there
are 2 placeholders: **%d** and **%r**. These placeholders will be replaced by the path of the
selected dupe (%d) and the path of the selected dupe's reference file (%r).
If the path to your executable contains space characters, you should enclose it in "" quotes. You should also enclose placeholders in quotes because it's very possible that paths to dupes and refs will contain spaces. Here's an example custom command:: If the path to your executable contains space characters, you should enclose it in "" quotes. You
should also enclose placeholders in quotes because it's very possible that paths to dupes and refs
will contain spaces. Here's an example custom command::
"C:\Program Files\SuperDiffProg\SuperDiffProg.exe" "%d" "%r" "C:\Program Files\SuperDiffProg\SuperDiffProg.exe" "%d" "%r"
.. _inode: http://en.wikipedia.org/wiki/Inode
.. _regular-expressions.info: http://www.regular-expressions.info


@ -1,6 +1,8 @@
Results Results
======= =======
.. contents::
When dupeGuru is finished scanning for duplicates, it will show its results in the form of duplicate group list. When dupeGuru is finished scanning for duplicates, it will show its results in the form of duplicate group list.
About duplicate groups About duplicate groups
@ -118,42 +120,54 @@ filtered duplicates.
Action Menu Action Menu
----------- -----------
* **Clear Ignore List:** Remove all ignored matches you added. You have to start a new scan for the **Clear Ignore List:**
newly cleared ignore list to be effective. Remove all ignored matches you added. You have to start a new scan for the
* **Export Results to XHTML:** Take the current results, and create an XHTML file out of it. The newly cleared ignore list to be effective.
columns that are visible when you click on this button will be the columns present in the XHTML **Export Results to XHTML:**
file. The file will automatically be opened in your default browser. Take the current results, and create an XHTML file out of it. The
* **Send Marked to Trash:** Send all marked duplicates to trash, obviously. Before proceeding, columns that are visible when you click on this button will be the columns present in the XHTML
you'll be presented deletion options (see below). file. The file will automatically be opened in your default browser.
* **Move Marked to...:** Prompt you for a destination, and then move all marked files to that **Send Marked to Trash:**
destination. Source file's path might be re-created in destination, depending on the Send all marked duplicates to trash, obviously. Before proceeding,
"Copy and Move" preference. you'll be presented deletion options (see below).
* **Copy Marked to...:** Prompt you for a destination, and then copy all marked files to that **Move Marked to...:**
destination. Source file's path might be re-created in destination, depending on the Prompt you for a destination, and then move all marked files to that
"Copy and Move" preference. destination. Source file's path might be re-created in destination, depending on the
* **Remove Marked from Results:** Remove all marked duplicates from results. The actual files will "Copy and Move" preference.
not be touched and will stay where they are. **Copy Marked to...:**
* **Remove Selected from Results:** Remove all selected duplicates from results. Note that all Prompt you for a destination, and then copy all marked files to that
selected reference files will be ignored, only duplicates can be removed with this action. destination. Source file's path might be re-created in destination, depending on the
* **Make Selected into Reference:** Promote all selected duplicates to reference. If a duplicate is "Copy and Move" preference.
a part of a group having a reference file coming from a reference folder (in blue color), no **Remove Marked from Results:**
action will be taken for this duplicate. If more than one duplicate among the same group are Remove all marked duplicates from results. The actual files will
selected, only the first of each group will be promoted. not be touched and will stay where they are.
* **Add Selected to Ignore List:** This first removes all selected duplicates from results, and **Remove Selected from Results:**
then add the match of that duplicate and the current reference in the ignore list. This match Remove all selected duplicates from results. Note that all
will not come up again in further scan. The duplicate itself might come back, but it will be selected reference files will be ignored, only duplicates can be removed with this action.
matched with another reference file. You can clear the ignore list with the Clear Ignore List **Make Selected into Reference:**
command. Promote all selected duplicates to reference. If a duplicate is
* **Open Selected with Default Application:** Open the file with the application associated with a part of a group having a reference file coming from a reference folder (in blue color), no
selected file's type. action will be taken for this duplicate. If more than one duplicate among the same group are
* **Reveal Selected in Finder:** Open the folder containing selected file. selected, only the first of each group will be promoted.
* **Invoke Custom Command:** Invokes the external application you've set up in your preferences **Add Selected to Ignore List:**
using the current selection as arguments in the invocation. This first removes all selected duplicates from results, and
* **Rename Selected:** Prompts you for a new name, and then rename the selected file. then add the match of that duplicate and the current reference in the ignore list. This match
will not come up again in further scan. The duplicate itself might come back, but it will be
matched with another reference file. You can clear the ignore list with the Clear Ignore List
command.
**Open Selected with Default Application:**
Open the file with the application associated with selected file's type.
**Reveal Selected in Finder:**
Open the folder containing selected file.
**Invoke Custom Command:**
Invokes the external application you've set up in your preferences using the current selection
as arguments in the invocation.
**Rename Selected:**
Prompts you for a new name, and then rename the selected file.
**Warning about moving files in iPhoto/iTunes:** When using the "Move Marked" action on duplicates **Warning about moving files in iPhoto/iTunes/Aperture:** When using the "Move Marked" action on
that come from iPhoto or iTunes, files are copied, not moved. dupeGuru cannot use the Move action duplicates that come from iPhoto, Aperture or iTunes, files are copied, not moved. dupeGuru cannot
on those files. use the Move action on those files.
Deletion Options Deletion Options
---------------- ----------------
@ -161,21 +175,23 @@ Deletion Options
These options affect how duplicate deletion takes place. Most of the time, you don't need to enable These options affect how duplicate deletion takes place. Most of the time, you don't need to enable
any of them. any of them.
* **Link deleted files:** The deleted files are replaced by a link to the reference file. You have **Link deleted files:**
a choice of replacing it either with a `symlink`_ or a `hardlink`_. It's better to read the whole The deleted files are replaced by a link to the reference file. You have a choice of replacing
wikipedia pages about them to make a informed choice, but in short, a symlink is a shortcut to it either with a `symlink`_ or a `hardlink`_. It's better to read the whole
the file's path. If the original file is deleted or moved, the link is broken. A hardlink is a wikipedia pages about them to make a informed choice, but in short, a symlink is a shortcut to
link to the file *itself*. That link is as good as a "real" file. Only when *all* hardlinks to a the file's path. If the original file is deleted or moved, the link is broken. A hardlink is a
file are deleted is the file itself deleted. link to the file *itself*. That link is as good as a "real" file. Only when *all* hardlinks to a
file are deleted is the file itself deleted.
On OSX and Linux, this feature is supported fully, but under Windows, it's a bit complicated. On OSX and Linux, this feature is supported fully, but under Windows, it's a bit complicated.
Windows XP doesn't support it, but Vista and up support it. However, for the feature to work, Windows XP doesn't support it, but Vista and up support it. However, for the feature to work,
dupeGuru has to run with administrative privileges. dupeGuru has to run with administrative privileges.
* **Directly delete files:** Instead of sending files to trash, directly delete them. This is used **Directly delete files:**
for troubleshooting and you normally don't need to enable this unless dupeGuru has problems Instead of sending files to trash, directly delete them. This is used
deleting files normally, something that can happens when you try to delete files on network for troubleshooting and you normally don't need to enable this unless dupeGuru has problems
storage (NAS). deleting files normally, something that can happens when you try to delete files on network
storage (NAS).
.. _regular-expressions.info: http://www.regular-expressions.info .. _regular-expressions.info: http://www.regular-expressions.info
.. _hardlink: http://en.wikipedia.org/wiki/Hard_link .. _hardlink: http://en.wikipedia.org/wiki/Hard_link

help/en/scan.rst Normal file

@ -0,0 +1,186 @@
The scanning process
====================
.. contents::
dupeGuru has 3 basic ways of scanning: :ref:`worded-scan`, :ref:`contents-scan` and
:ref:`picture blocks <picture-blocks-scan>`. The first two modes are for the Standard and Music
editions, the last is for the Picture edition. The scanning process is configured through the
:doc:`Preference pane <preferences>`.
.. _worded-scan:
Worded scans
------------
*Standard and Music Editions only*.
Worded scans extract a string from each file and split it into words. The string can come from two
different sources: **Filename** or **Tags** (Music Edition only).
When our source is music tags, we have to choose which tags to use. If, for example, we choose to
analyse *artist* and *title* tags, we'd end up with strings like
"The White Stripes - Seven Nation Army".
Words are split on space characters, after all punctuation has been removed (some characters are
replaced by spaces, others are simply dropped) and all words lowercased. For example, the string
"This guy's song(remix)" yields *this*, *guys*, *song* and *remix*.
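The splitting step can be sketched as follows. This is only an illustration, not dupeGuru's actual
code: the function name ``split_words`` is hypothetical, and the choice of which characters are
dropped (apostrophes and backticks) versus replaced by a space (all other punctuation) is an
assumption made to reproduce the example above.

```python
import string


def split_words(text):
    """Lowercase text, strip punctuation, and split the result on whitespace."""
    # Assumption: these characters vanish without leaving a gap ("guy's" -> "guys").
    dropped = "'`"
    cleaned = []
    for char in text.lower():
        if char in dropped:
            continue
        # Assumption: every other punctuation character acts as a word separator.
        cleaned.append(" " if char in string.punctuation else char)
    return "".join(cleaned).split()


print(split_words("This guy's song(remix)"))
```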
Once this is done, the scanning dance begins. Finding duplicates is only a matter of finding how
many words in common two given strings have. If the :ref:`filter hardness <filter-hardness>` is,
for example, ``80``, it means that 80% of the words of two strings must match. To determine the
matching percentage, dupeGuru first counts the total number of words in **both** strings, then counts
the number of matching words (every matching word counts as 2), and then divides the number of
matching words by the total number of words. If the result is higher than or equal to the filter hardness,
we have a duplicate match. For example, "a b c d" and "c d e" have a matching percentage of 57
(4 words matching, 7 total words).
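The computation described above can be sketched in a few lines. The names ``match_percentage`` and
``is_duplicate`` are hypothetical, and rounding to the nearest integer is an assumption chosen
because it reproduces the 57 of the example. Repeated words are handled with a multiset
intersection, which is also an assumption.

```python
from collections import Counter


def match_percentage(first, second):
    """Return the share of matching words in two strings, as a rounded percentage."""
    words_a = first.lower().split()
    words_b = second.lower().split()
    total = len(words_a) + len(words_b)
    if total == 0:
        return 0
    # Multiset intersection: a word present in both strings is one match.
    shared = Counter(words_a) & Counter(words_b)
    # Every matching word counts as 2 (once for each string).
    matching = 2 * sum(shared.values())
    return round(100 * matching / total)


def is_duplicate(first, second, hardness=80):
    """A pair matches when its percentage reaches the filter hardness."""
    return match_percentage(first, second) >= hardness


print(match_percentage("a b c d", "c d e"))
```

With a filter hardness of ``80``, ``is_duplicate("a b c d", "c d e")`` is false. Lowering the
hardness to ``57`` or less makes the pair match.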
Fields
^^^^^^
*Music Edition only*.
Song filenames often come with multiple and distinct parts and this can cause problems. For example,
let's take these two songs: "Dolly Parton - I Will Always Love You" and
"Whitney Houston - I Will Always Love You". They are clearly not the same song (they come from
different artists), but they still have a matching score of 71%! This means that, with a naive
scanning method, we would get these songs as a false positive as soon as we try to dig a bit deeper
in our dupe hunt by lowering the threshold a bit.
This is why we have the "Fields" concept. Fields are separated by dashes (``-``). When the
"Filename - Fields" scan type is chosen, each field is compared separately. Our final matching score
will only be the lowest of all the fields. In our example, the title has a 100% match, but the
artist has a 0% match, making our final match score 0.
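Here is a minimal sketch of the fielded comparison, under stated assumptions: the helper names are hypothetical, fields are split on every dash, and names with a different number of fields are given a score of 0 (the text does not say how that case is handled).

```python
from collections import Counter


def word_score(first, second):
    """Matching percentage of two lists of words (each match counts as 2)."""
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    shared = sum((Counter(first) & Counter(second)).values())
    return shared * 2 * 100 / total


def split_fields(name):
    """Split a name on dashes and turn each field into lowercase words."""
    return [field.lower().split() for field in name.split("-")]


def fields_score(first, second):
    """Compare fields pairwise, in order, and keep the lowest score."""
    first_fields = split_fields(first)
    second_fields = split_fields(second)
    if len(first_fields) != len(second_fields):
        return 0.0  # assumption, see the paragraph above
    return min(word_score(a, b) for a, b in zip(first_fields, second_fields))


dolly = "Dolly Parton - I Will Always Love You"
whitney = "Whitney Houston - I Will Always Love You"
print(fields_score(dolly, whitney))  # 0.0
```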
Sometimes, our song filename policy isn't completely homogeneous, which means that we can end up with
"The White Stripes - Seven Nation Army" and "Seven Nation Army - The White Stripes". This is why
we have the "Filename - Fields (No Order)" scan type. With this scan type, every field of one name
is compared with every field of the other, and the highest score is kept for each field. Then, the
final matching score is the lowest of these scores. In our case, the final matching score is 100.
Note: Each field is used once. Thus, "The White Stripes - The White Stripes" and
"The White Stripes - Seven Nation Army" have a match score of 0: the second
"The White Stripes" can't be compared with the first field of the other name because that field has
already been "used up" by the first field.
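The "No Order" variant, including the rule that each field is used once, might look like the sketch below. The greedy pairing strategy and the helper names are assumptions; the text only states that the highest score is kept for each field and that a field cannot be reused.

```python
from collections import Counter


def word_score(first, second):
    """Matching percentage of two lists of words (each match counts as 2)."""
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    shared = sum((Counter(first) & Counter(second)).values())
    return shared * 2 * 100 / total


def split_fields(name):
    """Split a name on dashes and turn each field into lowercase words."""
    return [field.lower().split() for field in name.split("-")]


def unordered_fields_score(first, second):
    """Pair each field with its best unused counterpart, keep the lowest score."""
    first_fields = split_fields(first)
    remaining = split_fields(second)
    if len(first_fields) != len(remaining):
        return 0.0  # assumption: the text does not cover this case
    scores = []
    for field in first_fields:
        best = max(remaining, key=lambda other: word_score(field, other))
        scores.append(word_score(field, best))
        remaining.remove(best)  # a field can only be "used up" once
    return min(scores)


print(unordered_fields_score("The White Stripes - Seven Nation Army",
                             "Seven Nation Army - The White Stripes"))  # 100.0
print(unordered_fields_score("The White Stripes - The White Stripes",
                             "The White Stripes - Seven Nation Army"))  # 0.0
```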
The *Tags* scanning method is always "fielded". When choosing this scan method, we also choose which
tags are going to be compared, each being a field.
.. _word-weighting:
Word weighting
^^^^^^^^^^^^^^
When enabled, this option slightly changes how the matching percentage is calculated by making
longer words worth more. With word weighting, instead of having a value of 1 in the duplicate count
and total word count, every word has a value equal to the number of characters it contains. With
word weighting, "ab cde fghi" and "ab cde fghij" would have a matching percentage of 53% (19 total
characters, 10 characters matching (4 for "ab" and 6 for "cde")).
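As a rough sketch of the weighted calculation (the function name is hypothetical, and rounding the result to the nearest integer is an assumption made to reproduce the 53% above):

```python
from collections import Counter


def weighted_match_percentage(first, second):
    """Matching percentage where each word weighs as much as its length."""
    total = sum(len(word) for word in first) + sum(len(word) for word in second)
    if total == 0:
        return 0.0
    shared = Counter(first) & Counter(second)
    # A matching word counts for both strings, hence the factor of 2.
    matching = sum(len(word) * count * 2 for word, count in shared.items())
    return matching * 100 / total


score = weighted_match_percentage("ab cde fghi".split(), "ab cde fghij".split())
print(round(score))  # 53
```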
.. _similarity-matching:
Similarity matching
^^^^^^^^^^^^^^^^^^^
When enabled, similar words will be counted as matches. For example "The White Stripes" and
"The White Stripe" would have a match score of 100 instead of 66 with that option turned on.
Two words are considered similar if they can be made equal with only a few edit operations (removing
a letter, adding one etc.). The process used is not unlike the
`Levenshtein distance`_. For the technically inclined, the actual function used is
Python's `get_close_matches`_ with a ``0.8`` cutoff.
**Warning:** Use this option with caution. It is likely that you will get a lot of false positives
in your results when turning it on. However, it will help you find duplicates that you wouldn't
have found otherwise. The scan process is also significantly slower with this option turned on.
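To give an idea of how similarity matching changes the score, here is a sketch built on ``difflib.get_close_matches`` with the ``0.8`` cutoff mentioned above. The surrounding function, and the way a matched word is removed from the pool of candidates, are assumptions made for illustration.

```python
from difflib import get_close_matches


def similar_match_percentage(first, second, cutoff=0.8):
    """Matching percentage where similar words also count as matches."""
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    remaining = list(second)
    matching = 0
    for word in first:
        # Best candidate among the words that haven't been matched yet.
        close = get_close_matches(word, remaining, n=1, cutoff=cutoff)
        if close:
            remaining.remove(close[0])
            matching += 2
    return matching * 100 / total


first = "the white stripes".split()
second = "the white stripe".split()
print(similar_match_percentage(first, second))  # 100.0
```

Passing ``cutoff=1.0`` only accepts identical words, which brings the score for this pair back down to 66.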
.. _contents-scan:
Contents scans
--------------
Contents scans are much simpler than worded scans. We read files and if the contents are exactly the
same, we consider the two files duplicates.
This is, of course, much slower than comparing filenames and, to avoid needlessly reading whole
file contents, we start by looking at file sizes. After having grouped our files by size, we discard
every file that is alone in its group. Then, we proceed to read the contents of our remaining files.
MD5 hashes are used to compare contents. Yes, it is widely known that forging files having
the same MD5 hash is easy, but such a file has to be knowingly forged. The probability of two files
having the same MD5 hash *and* the same size by accident is still very, very small.
The :ref:`filter hardness <filter-hardness>` preference is ignored in this scan.
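The size-then-hash strategy can be sketched like this. The function names and the chunk size are hypothetical; only the overall approach (group by size, drop lone files, compare MD5 hashes) comes from the description above.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that big files aren't loaded in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_content_duplicates(paths):
    """Return groups (lists) of paths that have identical contents."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # alone in its size group: no need to read the file
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        groups.extend(group for group in by_hash.values() if len(group) > 1)
    return groups
```

Calling ``find_content_duplicates`` with a list of file paths returns one list per set of identical files; files with a unique size are never opened.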
Audio contents
^^^^^^^^^^^^^^
*Music Edition only*.
This mode is very much like the normal contents scan. The only difference is that it ignores
metadata included in the file and only compares audio data. *It doesn't do audio data fuzzy
matching, only exact matching. It would be really cool to have that, but we aren't there yet.*
Folders
^^^^^^^
*Standard Edition only*.
This is a special Contents scan type. It works like a normal contents scan, but instead of trying to
find duplicate files, it tries to find duplicate folders. A folder is a duplicate of another if all
the files it contains have the same contents as the other folder's files.
This scan is, of course, recursive and subfolders are checked. dupeGuru keeps only the biggest
fish. Therefore, if two folders that are considered as matching contain subfolders, these
subfolders will not be included in the final results.
With this mode, we end up with folders as results instead of files.
.. _picture-blocks-scan:
Picture blocks
--------------
*Picture Edition only*.
dupeGuru Picture Edition stands apart from its two friends. Its scan types are completely different.
The first one is its "Contents" scan, a name which is a bit too generic, hence the name we use here,
"Picture blocks".
We start by opening every picture in RGB bitmap mode, then we "blockify" the picture. We create a
15x15 grid and compute the average color of each grid tile. This is the "picture analysis" phase.
It's very time consuming and the result is cached in a database (the "picture cache").
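The "blockify" step can be illustrated without any imaging library by treating a picture as rows of ``(r, g, b)`` tuples. This is a sketch under assumptions: the function name, the integer averaging and the way tile boundaries are computed are not taken from dupeGuru's code.

```python
def blockify(pixels, grid=15):
    """Return the average color of each tile of a grid x grid layout.

    ``pixels`` is a list of rows, each row being a list of (r, g, b) tuples.
    """
    height = len(pixels)
    width = len(pixels[0])
    blocks = []
    for row in range(grid):
        top = row * height // grid
        # Every tile covers at least one pixel, even on tiny pictures.
        bottom = max((row + 1) * height // grid, top + 1)
        for col in range(grid):
            left = col * width // grid
            right = max((col + 1) * width // grid, left + 1)
            tile = [pixels[y][x] for y in range(top, bottom) for x in range(left, right)]
            count = len(tile)
            blocks.append(tuple(sum(pixel[channel] for pixel in tile) // count
                                for channel in range(3)))
    return blocks


picture = [[(10, 20, 30)] * 30 for _ in range(30)]
blocks = blockify(picture)
print(len(blocks), blocks[0])  # 225 (10, 20, 30)
```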
Once we've done that, we can start comparing them. Each tile in the grid (an average color) is
compared to its corresponding tile on the other picture and a color diff is computed (it's simply
a sum of the difference of R, G and B on each side). All these sums are added up to a final "score".
If that score is smaller than or equal to ``100 - threshold``, we have a match.
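The comparison follows directly from the description above, as in this sketch. The names are hypothetical, and using the absolute difference for each color channel is an assumption about what "difference" means here.

```python
def block_diff(first, second):
    """Sum of the differences of R, G and B between two average colors."""
    return sum(abs(a - b) for a, b in zip(first, second))


def picture_score(first_blocks, second_blocks):
    """Add up the color diffs of all corresponding tiles."""
    return sum(block_diff(a, b) for a, b in zip(first_blocks, second_blocks))


def is_match(first_blocks, second_blocks, threshold):
    """Two pictures match when their score doesn't exceed 100 - threshold."""
    return picture_score(first_blocks, second_blocks) <= 100 - threshold


first = [(100, 100, 100)] * 225
# Same picture, except that 10 tiles are slightly redder.
second = [(101, 100, 100)] * 10 + [(100, 100, 100)] * 215
print(picture_score(first, second))  # 10
print(is_match(first, second, 90))   # True
print(is_match(first, second, 95))   # False
```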
A threshold of 100 adds an additional constraint that pictures have to be exactly the same (it's
possible, due to averaging, that the tile comparison yields ``0`` for pictures that aren't exactly
the same, but since "100%" suggests "exactly the same", we discard those occurrences). If you want
to get pictures that are very, very similar but still allow a bit of fuzzy differences, go for 99%.
This second part of the scan is CPU intensive and can take quite a bit of time. This task has been
made to take advantage of multi-core CPUs and has been optimized to the best of my abilities, but
the fact of the matter is that, due to the fuzziness of the task, we still have to compare every
picture to every other, making the algorithm quadratic (if ``N`` is the number of pictures to
compare, the number of comparisons to perform grows with ``N*N``).
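To get a feel for that growth: assuming that comparing A to B is the same as comparing B to A, the number of distinct pairs is ``N*(N-1)/2``, which is still quadratic. The helper below is only an illustration.

```python
from itertools import combinations


def comparison_count(pictures):
    """Number of distinct pairs that have to be compared."""
    return sum(1 for _ in combinations(pictures, 2))


for n in (10, 100, 1000):
    print(n, comparison_count(range(n)))
# 10 45
# 100 4950
# 1000 499500
```

Multiplying the number of pictures by 10 multiplies the number of comparisons by roughly 100.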
This algorithm is very naive, but in the field, it works rather well. If you master a better
algorithm and want to improve dupeGuru, by all means, let me know!
EXIF Timestamp
--------------
*Picture Edition only*.
This one is easy. We read the EXIF information of every picture and extract the ``DateTimeOriginal``
tag. If the tag is the same for two pictures, they're considered duplicates.
**Warning:** Modified pictures often keep the same EXIF timestamp, so watch out for false positives
when you use that scan type.
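The grouping itself is simple, as this sketch shows. It assumes that the ``DateTimeOriginal`` value of each picture has already been extracted (reading EXIF data is left out), and that pictures without the tag are skipped; both the function name and that last rule are assumptions.

```python
from collections import defaultdict


def group_by_timestamp(timestamps):
    """Group picture paths that share the same DateTimeOriginal value.

    ``timestamps`` maps each path to its tag value, or to None when the
    picture has no such tag.
    """
    groups = defaultdict(list)
    for path, stamp in timestamps.items():
        if stamp:  # assumption: pictures without a timestamp never match
            groups[stamp].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


pictures = {
    "beach.jpg": "2013:07:14 16:02:31",
    "beach_edited.jpg": "2013:07:14 16:02:31",
    "dog.jpg": "2013:08:01 09:15:00",
    "scan.png": None,
}
print(group_by_timestamp(pictures))  # [['beach.jpg', 'beach_edited.jpg']]
```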
.. _Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
.. _get_close_matches: http://docs.python.org/3/library/difflib.html#difflib.get_close_matches