core.engine
-
class core.engine.Match
Represents a match between two File objects.
Regardless of the matching method, when two files are determined to match, a Match pair is created. It holds the two matched files as well as their match “level”.
-
first
The first file of the pair.
-
second
The second file of the pair.
-
percentage
The match level according to the scan method that found the match: an int from 1 to 100. For exact scan methods, such as Contents scans, this is always 100.
-
-
class core.engine.Group
A group of File objects that match together.
This manages match pairs into groups and ensures that all files in the group match each other.
-
ref
The “reference” file: the file in the group that isn’t going to be deleted.
-
add_match(match)
Adds match to the internal match list and possibly adds duplicates to the group.
A duplicate can only be considered as such if it matches all other duplicates in the group. This method registers the pair (A, B) represented by match as possible candidates and, if A and/or B end up matching every other duplicate in the group, adds them to the group.
Parameters: match (tuple) – the pair of File objects to add.
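The "matches every other duplicate" rule can be illustrated with a minimal sketch. It uses plain strings in place of File objects, and `SketchGroup`, `members`, and `candidates` are hypothetical names chosen for illustration; this is not the library's Group implementation.

```python
from collections import defaultdict


class SketchGroup:
    """Hypothetical stand-in illustrating the 'matches every member' rule."""

    def __init__(self):
        self.members = []                   # files accepted into the group
        self.candidates = defaultdict(set)  # file -> files it is known to match

    def add_match(self, match):
        first, second = match
        # Record the pair in both directions.
        self.candidates[first].add(second)
        self.candidates[second].add(first)
        for item in (first, second):
            self._try_add(item)

    def _try_add(self, item):
        if item in self.members:
            return
        # Accept the item only if it matches all files already in the group.
        if all(member in self.candidates[item] for member in self.members):
            self.members.append(item)


group = SketchGroup()
group.add_match(("a", "b"))  # a and b form the group
group.add_match(("a", "c"))  # c matches a but not yet b: stays a candidate
group.add_match(("b", "c"))  # c now matches both members and is added
```

In this sketch, "c" only joins once pairs linking it to both "a" and "b" have been registered, which mirrors the behaviour described above.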
-
discard_matches()
Removes all recorded matches that didn’t result in a duplicate being added to the group.
You can call this after the duplicate scanning process to free a bit of memory.
-
-
core.engine.build_word_dict(objects, j=<hscommon.jobprogress.job.NullJob object>)
Returns a dict of objects mapped by their words.
Each object must have a words attribute that is a list of strings or a list of lists of strings (Fields). The result is a dict with words as keys and lists of objects as values.
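As a rough illustration of the mapping described above, the following sketch builds such a dict from objects carrying a `words` attribute. `Item` and `sketch_word_dict` are hypothetical names, and the job progress argument is left out; the real function may differ in its details.

```python
from collections import defaultdict


class Item:
    """Hypothetical stand-in for a File: a name plus its list of words."""

    def __init__(self, name, words):
        self.name = name
        self.words = words


def sketch_word_dict(objects):
    """Map each word to the list of objects whose `words` contain it."""
    result = defaultdict(list)
    for obj in objects:
        for entry in obj.words:
            # A field is itself a list of words; a plain word is a string.
            words = entry if isinstance(entry, list) else [entry]
            for word in words:
                if obj not in result[word]:  # list each object once per word
                    result[word].append(obj)
    return dict(result)


a = Item("a", ["foo", "bar"])
b = Item("b", [["foo"], ["baz", "baz"]])
word_dict = sketch_word_dict([a, b])
names = {word: [o.name for o in objs] for word, objs in word_dict.items()}
```

Here `names` maps "foo" to both objects, while "bar" and "baz" each map to a single one.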
-
core.engine.compare(first, second, flags=())
Returns the percentage of words that match between first and second.
The result is an int in the range 0..100. first and second can each be either a string or a list of words.
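One plausible way to compute such a percentage is sketched below. The formula (twice the number of matched words over the combined word count) is an assumption that happens to agree with the 90% and 50% figures in the Fields section; splitting strings on whitespace and ignoring flags are also simplifications. `sketch_compare` is a hypothetical name, not the library function.

```python
def sketch_compare(first, second):
    """Return the percentage (0..100) of words shared by two word lists."""
    # Assumption: a string is treated as whitespace-separated words.
    if isinstance(first, str):
        first = first.split()
    if isinstance(second, str):
        second = second.split()
    if not first or not second:
        return 0
    remaining = list(second)  # copy, because matched words are consumed
    matched = 0
    for word in first:
        if word in remaining:
            remaining.remove(word)  # each word in second can match only once
            matched += 1
    return round(matched * 2 / (len(first) + len(second)) * 100)


same = sketch_compare("a b c d", ["a", "b", "c", "d"])
half = sketch_compare("my artist", "my giraffe")
none = sketch_compare("a b", "c d")
```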
-
core.engine.compare_fields(first, second, flags=())
Returns the score of the lowest-matching Fields.
first and second must be lists of lists of strings. Each sub-list is then compared with compare().
-
core.engine.getmatches(objects, min_match_percentage=0, match_similar_words=False, weight_words=False, no_field_order=False, j=<hscommon.jobprogress.job.NullJob object>)
Returns a list of Match within objects after fuzzily matching their words.
Parameters:
- objects – List of File to match.
- min_match_percentage (int) – minimum % of words that have to match.
- match_similar_words (bool) – make similar words (see merge_similar_words()) match.
- weight_words (bool) – longer words are worth more in match % computations.
- no_field_order (bool) – match Fields regardless of their order.
- j – A job progress instance.
-
core.engine.getmatches_by_contents(files, j=<hscommon.jobprogress.job.NullJob object>)
Returns a list of Match for the files in files whose contents are the same.
Parameters: j – A job progress instance.
-
core.engine.get_groups(matches)
Returns a list of Group from matches.
Creates groups out of match pairs in the smartest way possible.
-
core.engine.merge_similar_words(word_dict)
Takes all keys in word_dict that are similar and merges them together.
word_dict has been built with build_word_dict(). Similarity is computed with Python’s difflib.get_close_matches(), which scores how closely two words resemble each other based on the character sequences they share.
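The merging step can be sketched with `difflib.get_close_matches()` as follows. The cutoff of 0.8, the sorted processing order, and the choice to keep the first key of each similar group are assumptions made for illustration; `sketch_merge_similar` is a hypothetical name and returns a new dict rather than modifying its argument.

```python
import difflib


def sketch_merge_similar(word_dict, cutoff=0.8):
    """Merge entries whose keys are close matches of an earlier key."""
    keys = sorted(word_dict)  # deterministic processing order
    merged = {}
    while keys:
        key = keys.pop(0)
        # Remaining keys that difflib considers close to `key`.
        # `n` must be positive, hence the fallback to 1 for an empty list.
        similar = difflib.get_close_matches(
            key, keys, n=len(keys) or 1, cutoff=cutoff
        )
        objects = list(word_dict[key])
        for other in similar:
            for obj in word_dict[other]:
                if obj not in objects:
                    objects.append(obj)
            keys.remove(other)
        merged[key] = objects
    return merged


word_dict = {"artist": ["f1"], "artists": ["f2"], "giraffe": ["f3"]}
merged = sketch_merge_similar(word_dict)
```

With this input, "artists" is folded into "artist" while "giraffe" stays on its own.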
-
core.engine.reduce_common_words(word_dict, threshold)
Removes all objects from word_dict values where the object count >= threshold.
word_dict has been built with build_word_dict().
The exception to this removal is objects whose words are all common: removing them would cause some duplicates to be missed.
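A minimal sketch of this reduction, including the exception for objects made up entirely of common words, might look like the following. `sketch_reduce_common` and its `words_of` argument are hypothetical; the sketch uses flat word lists, returns a new dict, and drops entries that end up empty, all of which are assumptions.

```python
def sketch_reduce_common(word_dict, threshold, words_of):
    """Drop objects from common-word entries unless all their words are common.

    `words_of` maps each object to its list of words (a hypothetical helper
    argument used here instead of a `words` attribute).
    """
    common = {w for w, objs in word_dict.items() if len(objs) >= threshold}
    reduced = {}
    for word, objs in word_dict.items():
        if word not in common:
            reduced[word] = list(objs)
            continue
        # Keep only objects made up entirely of common words.
        kept = [o for o in objs if all(w in common for w in words_of[o])]
        if kept:
            reduced[word] = kept
    return reduced


words_of = {
    "f1": ["the", "artist"],
    "f2": ["the", "giraffe"],
    "f3": ["the"],
}
word_dict = {"the": ["f1", "f2", "f3"], "artist": ["f1"], "giraffe": ["f2"]}
reduced = sketch_reduce_common(word_dict, threshold=3, words_of=words_of)
```

Only "f3", whose single word is common, stays under "the"; the other two files can still be found through their less common words.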
Fields
Fields are groups of words which each represent a significant part of the whole name. This concept is significant in music file names, where we often have names like “My Artist - a very long title with many many words”.
This title has 10 words. If you run a scan with a bit of tolerance, let’s say 90%, you’ll be able to find a dupe that has only one “many” in the song title. However, you would also get false duplicates from a title like “My Giraffe - a very long title with many many words”, which is of course a very different song, and it doesn’t make sense to match them.
When matching by fields, each field (separated by “-”) is considered as a separate string to match independently. After all fields are matched, the lowest result is kept. In the “Giraffe” example above, the result would be 50% instead of 90% in normal mode.
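The 90% and 50% figures can be reproduced with a short worked example. The scoring formula (twice the shared words over the combined word count) and the helpers `word_pct` and `split_fields` are assumptions made for illustration, not the library's own code.

```python
from collections import Counter


def word_pct(first, second):
    """Percentage of shared words, counting repeated words separately."""
    shared = sum((Counter(first) & Counter(second)).values())
    return round(shared * 2 / (len(first) + len(second)) * 100)


def split_fields(name):
    """Split a name into fields on '-' and each field into lowercase words."""
    return [field.lower().split() for field in name.split("-")]


artist = split_fields("My Artist - a very long title with many many words")
giraffe = split_fields("My Giraffe - a very long title with many many words")

# Normal mode: flatten the fields and compare all words at once.
flat_a = [w for field in artist for w in field]
flat_g = [w for field in giraffe for w in field]
normal = word_pct(flat_a, flat_g)  # 9 of 10 words shared

# Field mode: compare field by field and keep the lowest score.
by_field = min(word_pct(a, g) for a, g in zip(artist, giraffe))
```

In normal mode the single differing word barely lowers the score, whereas in field mode the two-word artist field scores 50% on its own and that lowest value is the one kept.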