core.engine¶

class core.engine.Match¶

Represents a match between two File objects.

Regardless of the matching method, when two files are determined to match, a Match pair is created. It holds the two matched files as well as their match “level”.

first¶: first file of the pair.

second¶: second file of the pair.

percentage¶: their match level according to the scan method which found the match. int from 1 to 100. For exact scan methods, such as Contents scans, this will always be 100.
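As a rough illustration, a match pair can be modelled as a small immutable record with the three attributes listed above. The `MatchSketch` name and the strings standing in for File objects are assumptions made for this example; this is not necessarily how the engine defines Match.

```python
from collections import namedtuple

# Hypothetical stand-in for core.engine.Match with the three documented attributes.
MatchSketch = namedtuple("MatchSketch", ["first", "second", "percentage"])

# Plain strings stand in for File objects in this sketch.
exact = MatchSketch("a.mp3", "copy of a.mp3", 100)  # an exact scan always reports 100
fuzzy = MatchSketch("a.mp3", "a (live).mp3", 67)    # a fuzzy scan may report less
```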

class core.engine.Group¶

A group of File objects that match each other.

This organizes match pairs into groups and ensures that all files in a group match each other.

ref¶: The “reference” file, which is the file among the group that isn’t going to be deleted.

ordered¶: Ordered list of duplicates in the group (including the ref).

unordered¶: Set of duplicates in the group (including the ref).

dupes¶: An ordered list of the group’s duplicates, without ref. Equivalent to ordered[1:].

percentage¶: Average match percentage of match pairs containing ref.

add_match(match)¶

Adds match to the internal match list and possibly adds duplicates to the group.

A duplicate can only be considered as such if it matches all other duplicates in the group. This method registers the pair (A, B) represented in match as possible candidates and, if A and/or B end up matching every other duplicate in the group, adds them to the group.

Parameters:	match (tuple) – pair of `File` to add
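The candidate rule described above can be sketched as follows. This is a simplified model under stated assumptions, not the engine's code: `GroupSketch` and its `candidates` attribute are hypothetical names, and plain strings stand in for File objects.

```python
from collections import defaultdict

class GroupSketch:
    """Hypothetical, simplified model of the add_match() candidate rule."""

    def __init__(self):
        self.ordered = []                   # members in insertion order
        self.unordered = set()              # same members, for fast lookups
        self.candidates = defaultdict(set)  # item -> items it is known to match

    def add_match(self, match):
        first, second = match[0], match[1]
        # Remember the pair in both directions, even if nobody is admitted yet.
        self.candidates[first].add(second)
        self.candidates[second].add(first)
        for item in (first, second):
            if item in self.unordered:
                continue
            # Admit the item only if it is known to match every current member.
            if self.unordered <= self.candidates[item]:
                self.ordered.append(item)
                self.unordered.add(item)

group = GroupSketch()
group.add_match(("A", "B"))   # empty group: both files are admitted
group.add_match(("A", "C"))   # C matches A but not yet B, so it stays a candidate
group.add_match(("B", "C"))   # now C matches every member and joins the group
```

Keeping the rejected pairs around is what lets a later match complete the picture, which is also why discarding them afterwards frees memory.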

discard_matches()¶

Remove all recorded matches that didn’t result in a duplicate being added to the group.

You can call this after the duplicate scanning process to free a bit of memory.

get_match_of(item)¶: Returns the match pair between item and ref.

prioritize(key_func, tie_breaker=None)¶

Reorders ordered according to key_func.

Parameters:

key_func – Key (f(x)) to be used for sorting.
tie_breaker – function to be used to select the reference position in case the top duplicates have the same key_func() result.
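A minimal stand-alone sketch of this kind of reordering is shown below. It works on a plain list rather than on a Group, and the `tie_breaker(ref, dupe) -> bool` signature is an assumption made for the example, as is the `prioritize_sketch` name.

```python
def prioritize_sketch(ordered, key_func, tie_breaker=None):
    """Return a reordered copy of `ordered`; hypothetical stand-alone version."""
    result = sorted(ordered, key=key_func)  # stable: equal keys keep their order
    if tie_breaker is None or len(result) < 2:
        return result
    ref = result[0]
    best_key = key_func(ref)
    # Only items sharing the best key compete for the reference position.
    for dupe in result[1:]:
        if key_func(dupe) != best_key:
            break
        if tie_breaker(ref, dupe):          # assumed: True means "dupe should be ref"
            ref = dupe
    result.remove(ref)
    result.insert(0, ref)
    return result

# (name, size) tuples stand in for File objects.
files = [("b.txt", 2), ("a.txt", 1), ("c.txt", 1)]
by_size = prioritize_sketch(files, key_func=lambda f: f[1])
by_size_then_name = prioritize_sketch(
    files,
    key_func=lambda f: f[1],
    tie_breaker=lambda ref, dupe: dupe[0] > ref[0],
)
```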

switch_ref(with_dupe)¶: Makes the group’s ref switch positions with with_dupe.

core.engine.build_word_dict(objects, j=<hscommon.jobprogress.job.NullJob object>)¶

Returns a dict of objects mapped by their words.

objects must have a words attribute being a list of strings or a list of lists of strings (Fields).

The result will be a dict with words as keys, lists of objects as values.
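The mapping can be pictured with the sketch below. It handles only flat word lists (not Fields) and ignores the job progress argument; `Named` is a hypothetical stand-in for any object with a words attribute.

```python
from collections import defaultdict

class Named:
    """Hypothetical stand-in for a File: all it needs is a `words` attribute."""
    def __init__(self, name):
        self.name = name
        self.words = name.lower().split()

def build_word_dict_sketch(objects):
    """Map each word to the list of objects whose `words` contain it."""
    word_dict = defaultdict(list)
    for obj in objects:
        for word in set(obj.words):  # list each object once per word
            word_dict[word].append(obj)
    return dict(word_dict)

a = Named("foo bar")
b = Named("bar baz")
word_dict = build_word_dict_sketch([a, b])
```

Here `word_dict["bar"]` holds both objects, while `"foo"` and `"baz"` each map to a single one.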

core.engine.compare(first, second, flags=())¶

Returns the percentage of words that match between first and second.

The result is an int in the range 0..100. first and second can each be either a string or a list of words.
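One plausible way to compute such a percentage is sketched below. The formula (each matched word counts once on each side) is an assumption chosen to be consistent with the figures in the Fields section; the flags argument is not modelled, and `compare_sketch` is a hypothetical name.

```python
def compare_sketch(first, second):
    """Percentage (0..100) of words shared by two names; flags are not modelled."""
    if isinstance(first, str):
        first = first.split()
    if isinstance(second, str):
        second = second.split()
    if not first or not second:
        return 0
    remaining = list(second)        # copy, so matched words can be consumed
    matched = 0
    for word in first:
        if word in remaining:
            remaining.remove(word)  # each word in `second` can match only once
            matched += 1
    # Assumed formula: every matched word counts once on each side.
    return round(matched * 2 / (len(first) + len(second)) * 100)

identical = compare_sketch("a b c d", "a b c d")      # 100
one_word_off = compare_sketch("a b c d", "a b c e")   # 75
nothing_shared = compare_sketch(["a", "b"], "c d")    # 0
```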

core.engine.compare_fields(first, second, flags=())¶

Returns the score of the lowest-scoring pair of Fields.

first and second must be lists of lists of strings. Each sub-list is then compared with compare().
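The "keep the lowest score" rule can be sketched as below. The word-level scorer is passed in as a parameter so the sketch stays independent of how compare() is implemented; returning 0 when the field counts differ is an assumption, and `compare_fields_sketch` and `exact_only` are hypothetical names.

```python
def compare_fields_sketch(first, second, compare_words):
    """Lowest per-field score; `compare_words` scores two word lists (0..100)."""
    if not first or len(first) != len(second):
        return 0  # assumption: a differing number of fields never matches
    return min(compare_words(a, b) for a, b in zip(first, second))

def exact_only(a, b):
    """Toy scorer for the demo: 100 for identical word lists, else 0."""
    return 100 if a == b else 0

same = compare_fields_sketch(
    [["my", "artist"], ["title"]], [["my", "artist"], ["title"]], exact_only
)
one_field_differs = compare_fields_sketch(
    [["my", "artist"], ["title"]], [["my", "artist"], ["other"]], exact_only
)
```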

core.engine.getmatches(objects, min_match_percentage=0, match_similar_words=False, weight_words=False, no_field_order=False, j=<hscommon.jobprogress.job.NullJob object>)¶

Returns a list of Match within objects after fuzzily matching their words.

Parameters:

objects – List of File to match.
min_match_percentage (int) – minimum % of words that have to match.
match_similar_words (bool) – make similar words (see merge_similar_words()) match.
weight_words (bool) – longer words are worth more in match % computations.
no_field_order (bool) – match Fields regardless of their order.
j – A job progress instance.

core.engine.getmatches_by_contents(files, j=<hscommon.jobprogress.job.NullJob object>)¶

Returns a list of Match within files whose contents are the same.

Parameters:	j – A job progress instance.

core.engine.get_groups(matches)¶

Returns a list of Group from matches.

Create groups out of match pairs in the smartest way possible.

core.engine.merge_similar_words(word_dict)¶

Take all keys in word_dict that are similar, and merge them together.

word_dict must have been built with build_word_dict(). Similarity is computed with Python’s difflib.get_close_matches(), which ranks candidates by a similarity ratio based on the character sequences that two words share.
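A minimal sketch of such a merge, using difflib.get_close_matches() from the standard library, might look like this. The 0.8 cutoff, the choice to keep the shortest spelling, and the `merge_similar_words_sketch` name are assumptions made for the example.

```python
import difflib

def merge_similar_words_sketch(word_dict, cutoff=0.8):
    """Merge entries whose keys are close matches, in place. `cutoff` is assumed."""
    keys = sorted(word_dict, key=len)  # shortest spelling survives a merge
    while keys:
        key = keys.pop(0)
        # n must be positive, even when no keys are left to compare against.
        similars = difflib.get_close_matches(key, keys, n=len(keys) or 1, cutoff=cutoff)
        for similar in similars:
            for obj in word_dict.pop(similar):
                if obj not in word_dict[key]:
                    word_dict[key].append(obj)
            keys.remove(similar)
    return word_dict

# Strings stand in for File objects.
words = {"colour": ["f1"], "color": ["f2"], "zebra": ["f3"]}
merge_similar_words_sketch(words)
```

After the call, the objects filed under "colour" have moved to the "color" entry, and "zebra" is left untouched.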

core.engine.reduce_common_words(word_dict, threshold)¶

Remove all objects from the word_dict values whose object count is >= threshold.

word_dict has been built with build_word_dict().

The exception to this removal is objects whose words are all common: if we removed them, we would miss some duplicates.
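The rule and its exception can be sketched as follows, assuming flat word lists. `Named` is a hypothetical stand-in for a File, and deleting an entry once it becomes empty is an assumption of this sketch.

```python
class Named:
    """Hypothetical stand-in for a File with a flat `words` list."""
    def __init__(self, name):
        self.name = name
        self.words = name.split()

def reduce_common_words_sketch(word_dict, threshold):
    """Thin out over-full entries in place, keeping all-common-word objects."""
    uncommon = {w for w, objs in word_dict.items() if len(objs) < threshold}
    for word, objs in list(word_dict.items()):
        if len(objs) < threshold:
            continue  # rare word: leave its entry alone
        # Keep only objects that no uncommon word can still lead us to.
        kept = [o for o in objs if not any(w in uncommon for w in o.words)]
        if kept:
            word_dict[word] = kept
        else:
            del word_dict[word]
    return word_dict

cat, dog = Named("the cat"), Named("the dog")
bare1, bare2 = Named("the"), Named("the")
common = {"the": [cat, dog, bare1, bare2], "cat": [cat], "dog": [dog]}
reduce_common_words_sketch(common, threshold=3)
```

In this example the two objects named only "the" stay under the common word, so they can still be found as duplicates of each other, while the other two remain reachable through "cat" and "dog".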

Fields¶

Fields are groups of words which each represent a significant part of the whole name. This concept is significant in music file names, where we often have names like “My Artist - a very long title with many many words”.

This title has 10 words. If you run a scan with a bit of tolerance, say 90%, you’ll be able to find a dupe that has only one “many” in the song title. However, you would also get false duplicates from a title like “My Giraffe - a very long title with many many words”, which is of course a very different song that should not match.

When matching by fields, each field (separated by “-”) is treated as a separate string and matched independently. After all fields are matched, the lowest result is kept. In the “Giraffe” example above, the result would be 50% instead of the 90% of normal mode, because only one of the two words in the first field matches.
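The two figures can be reproduced with a short worked example. The scoring formula (each matched word counts once on each side) and the helper names `word_score` and `split_fields` are assumptions made for this sketch; they are chosen so that the numbers agree with the text above.

```python
def word_score(first, second):
    """Assumed scoring: matched words count once per side, as a 0..100 percentage."""
    remaining = list(second)
    matched = 0
    for word in first:
        if word in remaining:
            remaining.remove(word)
            matched += 1
    return round(matched * 2 / (len(first) + len(second)) * 100)

def split_fields(name):
    """Split a name on "-" into fields, each a list of lowercase words."""
    return [part.lower().split() for part in name.split("-")]

artist = "My Artist - a very long title with many many words"
giraffe = "My Giraffe - a very long title with many many words"

# Normal mode: all words are compared as one flat list (9 of 10 match).
flat_score = word_score(
    artist.replace("-", " ").lower().split(),
    giraffe.replace("-", " ").lower().split(),
)

# Field mode: compare field by field and keep the lowest score.
field_scores = [
    word_score(a, b) for a, b in zip(split_fields(artist), split_fields(giraffe))
]
field_score = min(field_scores)
```

With these inputs, `flat_score` is 90, `field_scores` is `[50, 100]`, and `field_score` is 50.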