Compare commits

...

17 Commits

Author SHA1 Message Date
Andrew Senetar 51b18d4c84
Switch file hashing to xxhash instead of md5
- Improves performance significantly in some cases
- Add xxhash to requirements.txt and sort requirements
- Rename md5-based members to digest
- Update all tests to use new member names and hashing methods
- Update hash db code to upgrade schema

NOTE: May consider supporting multiple hashing algorithms in the future.
2022-03-25 23:13:12 -05:00
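For context, the swap this commit describes is small in code terms: hashlib.md5 is replaced by xxhash's streaming API while the chunked read loop stays the same. A minimal sketch, assuming only the xxhash package (xxhash.xxh128 is the call the diffs below actually use; the function name here is illustrative):

import xxhash

CHUNK_SIZE = 1024 * 1024  # 1 MiB, same chunk size as in the diffs below

def file_digest(path):
    # Stream the file through xxh128 so large files never sit fully in memory.
    file_hash = xxhash.xxh128()
    with open(path, "rb") as fp:
        filedata = fp.read(CHUNK_SIZE)
        while filedata:
            file_hash.update(filedata)
            filedata = fp.read(CHUNK_SIZE)
    return file_hash.digest()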
Andrew Senetar bbcdfbf698
Add vscode extension recommendation 2022-03-21 22:27:16 -05:00
Andrew Senetar 8cee1a9467
Fix internal links in CONTRIBUTING.md 2022-03-21 22:19:58 -05:00
Andrew Senetar 448d33dcb6
Add workflow yml validation settings
- Add yml validation to project for vscode
- Allow .vscode/settings.json
- Apply formatting to workflow files
2022-03-21 22:18:22 -05:00
Andrew Senetar 8d414cadac
Add initial partial CONTRIBUTING.md
- Adopt a CONTRIBUTING.md format similar to that used by atom/atom.
- Add label section as replacement to wiki
- Add style guide section
- Setup basic document structure

TODO:
- Migrate some existing wiki information here where applicable.
- Migrate some existing help information here.
- Finish up remaining sections.
2022-03-21 22:04:45 -05:00
Andrew Senetar f902ee889a
Add configuration for isort to pyproject.toml 2022-03-21 00:25:36 -05:00
Andrew Senetar bc89e71935
Update .gitignore
- Pull from github/gitignore to cover some things better
- Organize remaining items
- Remove a few no longer relevant items
2022-03-20 23:25:01 -05:00
Andrew Senetar 17b83c8001
Move polib to setup_requires instead of install_requires 2022-03-20 22:48:03 -05:00
Andrew Senetar 0f845ee67a
Update min python version in Makefile 2022-03-20 01:23:01 -05:00
Andrew Senetar d40e32a143
Update transifex config & pull latest updates
- Update transifex configuration to new format
- Pull translation updates
2022-03-19 20:21:14 -05:00
Andrew Senetar 1bc206e62d
Bump version to 4.2.1 2022-03-19 19:02:41 -05:00
Andrew Senetar 106a0feaba
Add sponsor information 2022-03-19 17:46:12 -05:00
Andrew Senetar 984e0c4094
Fix help path for local files and some help doc updates 2022-03-19 17:43:11 -05:00
Andrew Senetar 9321e811d7
Enforce minimum Windows version ref #983 2022-03-19 17:01:54 -05:00
Andrew Senetar a64fcbfb5c
Fix deprecation warning from sqlite 2022-03-19 17:01:53 -05:00
Andrew Senetar cff07a12d6
Black formatter changes 2022-03-19 17:01:53 -05:00
Alfonso Montero b9c7832c4a
Apply @arsenetar's proposed change to fix errors on window change event. Solves #937. (#980) 2022-03-15 20:47:48 -05:00
31 changed files with 607 additions and 355 deletions

13
.github/FUNDING.yml vendored Normal file
View File

@@ -0,0 +1,13 @@
# These are supported funding model platforms
github: arsenetar
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
otechie: # Replace with a single Otechie username
lfx_crowdfunding: # Replace with a single LFX Crowdfunding project-name e.g., cloud-foundry
custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']

View File

@@ -2,12 +2,12 @@ name: "CodeQL"
on:
push:
branches: [ master ]
branches: [master]
pull_request:
# The branches below must be a subset of the branches above
branches: [ master ]
branches: [master]
schedule:
- cron: '24 20 * * 2'
- cron: "24 20 * * 2"
jobs:
analyze:
@@ -21,30 +21,30 @@ jobs:
strategy:
fail-fast: false
matrix:
language: [ 'cpp', 'python' ]
language: ["cpp", "python"]
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: Checkout repository
uses: actions/checkout@v2
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v1
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.
# queries: ./path/to/local/query, your-org/your-repo/queries@main
- if: matrix.language == 'cpp'
name: Build Cpp
run: |
sudo apt-get update
sudo apt-get install python3-pyqt5
make modules
- if: matrix.language == 'python'
name: Autobuild
uses: github/codeql-action/autobuild@v1
# Analysis
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v1
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v1
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.
# queries: ./path/to/local/query, your-org/your-repo/queries@main
- if: matrix.language == 'cpp'
name: Build Cpp
run: |
sudo apt-get update
sudo apt-get install python3-pyqt5
make modules
- if: matrix.language == 'python'
name: Autobuild
uses: github/codeql-action/autobuild@v1
# Analysis
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v1

View File

@@ -4,48 +4,48 @@ name: Default CI/CD
on:
push:
branches: [ master ]
branches: [master]
pull_request:
branches: [ master ]
branches: [master]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: '3.10'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt -r requirements-extra.txt
- name: Lint with flake8
run: |
flake8 .
- uses: actions/checkout@v2
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: "3.10"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt -r requirements-extra.txt
- name: Lint with flake8
run: |
flake8 .
format:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: '3.10'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt -r requirements-extra.txt
- name: Check format with black
run: |
black .
- uses: actions/checkout@v2
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: "3.10"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt -r requirements-extra.txt
- name: Check format with black
run: |
black .
test:
needs: [lint, format]
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: [3.7, 3.8, 3.9, '3.10']
python-version: [3.7, 3.8, 3.9, "3.10"]
exclude:
- os: macos-latest
python-version: 3.7
@@ -61,24 +61,24 @@ jobs:
python-version: 3.9
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt -r requirements-extra.txt
- name: Build python modules
run: |
python build.py --modules
- name: Run tests
run: |
pytest core hscommon
- name: Upload Artifacts
if: matrix.os == 'ubuntu-latest'
uses: actions/upload-artifact@v3
with:
name: modules ${{ matrix.python-version }}
path: ${{ github.workspace }}/**/*.so
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt -r requirements-extra.txt
- name: Build python modules
run: |
python build.py --modules
- name: Run tests
run: |
pytest core hscommon
- name: Upload Artifacts
if: matrix.os == 'ubuntu-latest'
uses: actions/upload-artifact@v3
with:
name: modules ${{ matrix.python-version }}
path: ${{ github.workspace }}/**/*.so

125
.gitignore vendored
View File

@@ -1,30 +1,111 @@
.DS_Store
__pycache__
*.egg-info
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.waf*
.lock-waf*
.tox
/tags
*.eggs
#*.pot
build
dist
env*
/deps
cocoa/autogen
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
/run.py
/cocoa/*/Info.plist
/cocoa/*/build
# Environments
.env
.venv
env*/
venv/
ENV/
env.bak/
venv.bak/
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# macOS
.DS_Store
# Visual Studio Code
.vscode/*
!.vscode/settings.json
#!.vscode/tasks.json
#!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets
# Local History for Visual Studio Code
.history/
# Built Visual Studio Code Extensions
*.vsix
# dupeGuru Specific
/qt/*_rc.py
/help/*/conf.py
/help/*/changelog.rst
/transifex
cocoa/autogen
/cocoa/*/Info.plist
/cocoa/*/build
*.pyd
*.exe
*.spec
.vscode
*.waf*
.lock-waf*
/tags

View File

@@ -1,26 +1,27 @@
[main]
host = https://www.transifex.com
[dupeguru-1.core]
file_filter = locale/<lang>/LC_MESSAGES/core.po
source_file = locale/core.pot
source_lang = en
type = PO
[dupeguru-1.columns]
[o:voltaicideas:p:dupeguru-1:r:columns]
file_filter = locale/<lang>/LC_MESSAGES/columns.po
source_file = locale/columns.pot
source_lang = en
type = PO
type = PO
[dupeguru-1.ui]
file_filter = locale/<lang>/LC_MESSAGES/ui.po
source_file = locale/ui.pot
[o:voltaicideas:p:dupeguru-1:r:core]
file_filter = locale/<lang>/LC_MESSAGES/core.po
source_file = locale/core.pot
source_lang = en
type = PO
type = PO
[dupeguru-1.qtlib]
[o:voltaicideas:p:dupeguru-1:r:qtlib]
file_filter = qtlib/locale/<lang>/LC_MESSAGES/qtlib.po
source_file = qtlib/locale/qtlib.pot
source_lang = en
type = PO
type = PO
[o:voltaicideas:p:dupeguru-1:r:ui]
file_filter = locale/<lang>/LC_MESSAGES/ui.po
source_file = locale/ui.pot
source_lang = en
type = PO

10
.vscode/extensions.json vendored Normal file
View File

@@ -0,0 +1,10 @@
{
// List of extensions which should be recommended for users of this workspace.
"recommendations": [
"redhat.vscode-yaml",
"ms-python.vscode-pylance",
"ms-python.python"
],
// List of extensions recommended by VS Code that should not be recommended for users of this workspace.
"unwantedRecommendations": []
}

12
.vscode/settings.json vendored Normal file
View File

@@ -0,0 +1,12 @@
{
"python.formatting.provider": "black",
"cSpell.words": [
"Dupras",
"hscommon"
],
"python.languageServer": "Pylance",
"yaml.schemaStore.enable": true,
"yaml.schemas": {
"https://json.schemastore.org/github-workflow.json": ".github/workflows/*.yml"
}
}

88
CONTRIBUTING.md Normal file
View File

@@ -0,0 +1,88 @@
# Contributing to dupeGuru
The following is a set of guidelines and information for contributing to dupeGuru.
#### Table of Contents
[Things to Know Before Starting](#things-to-know-before-starting)
[Ways to Contribute](#ways-to-contribute)
* [Reporting Bugs](#reporting-bugs)
* [Suggesting Enhancements](#suggesting-enhancements)
* [Localization](#localization)
* [Code Contribution](#code-contribution)
* [Pull Requests](#pull-requests)
[Style Guides](#style-guides)
* [Git Commit Messages](#git-commit-messages)
* [Python Style Guide](#python-style-guide)
* [Documentation Style Guide](#documentation-style-guide)
[Additional Notes](#additional-notes)
* [Issue and Pull Request Labels](#issue-and-pull-request-labels)
## Things to Know Before Starting
**TODO**
## Ways to Contribute
### Reporting Bugs
**TODO**
### Suggesting Enhancements
**TODO**
### Localization
**TODO**
### Code Contribution
**TODO**
### Pull Requests
Please follow these steps to have your contribution considered by the maintainers:
1. Keep the Pull Request specific to one feature or bug.
2. Follow the [style guides](#style-guides).
3. After you submit your pull request, verify that all [status checks](https://help.github.com/articles/about-status-checks/) are passing <details><summary>What if the status checks are failing?</summary>If a status check is failing, and you believe that the failure is unrelated to your change, please leave a comment on the pull request explaining why you believe the failure is unrelated. A maintainer will re-run the status check for you. If we conclude that the failure was a false positive, then we will open an issue to track that problem with our status check suite.</details>
While the prerequisites above must be satisfied prior to having your pull request reviewed, the reviewer(s) may ask you to complete additional design work, tests, or other changes before your pull request can be ultimately accepted.
## Style Guides
### Git Commit Messages
- Use the present tense ("Add feature" not "Added feature")
- Use the imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit the first line to 72 characters or less
- Reference issues and pull requests liberally after the first line
### Python Style Guide
- All files are formatted with [Black](https://github.com/psf/black)
- Follow [PEP 8](https://peps.python.org/pep-0008/) as much as practical
- Pass [flake8](https://flake8.pycqa.org/en/latest/) linting
- Include [PEP 484](https://peps.python.org/pep-0484/) type hints (new code)
### Documentation Style Guide
**TODO**
## Additional Notes
### Issue and Pull Request Labels
This section lists and describes the various labels used with issues and pull requests. Each of the labels is listed with a search link as well.
#### Issue Type and Status
| Label name | Search | Description |
|------------|--------|-------------|
| `enhancement` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Aenhancement) | Feature requests and enhancements. |
| `bug` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Abug) | Bug reports. |
| `duplicate` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Aduplicate) | Issue is a duplicate of existing issue. |
| `needs-reproduction` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Aneeds-reproduction) | A bug that has not yet been reproduced. |
| `needs-information` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Aneeds-information) | More information needs to be collected about these problems or feature requests (e.g. steps to reproduce). |
| `blocked` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Ablocked) | Issue blocked by other issues. |
| `beginner` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Abeginner) | Less complex issues for users who want to start contributing. |
#### Category Labels
| Label name | Search | Description |
|------------|--------|-------------|
| `3rd party` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3A%223rd%20party%22) | Related to a 3rd party dependency. |
| `crash` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Acrash) | Related to crashes (complete, or unhandled). |
| `documentation` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Adocumentation) | Related to any documentation. |
| `linux` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Alinux) | Related to running on Linux. |
| `mac` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Amac) | Related to running on macOS. |
| `performance` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Aperformance) | Related to performance. |
| `ui` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Aui)| Related to the visual design. |
| `windows` | [search](https://github.com/arsenetar/dupeguru/issues?q=is%3Aopen+is%3Aissue+label%3Awindows) | Related to running on Windows. |
#### Pull Request Labels
None at this time; if the volume of pull requests increases, labels may be added to manage them.
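For illustration, a hypothetical commit message that follows the style-guide rules above (imperative mood, summary within 72 characters, references after the first line; the content and issue number are made up):

Add schema version tracking to the hash cache

- Create the schema_version table on first connect
- Drop and recreate the files table when versions differ

Ref #123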

View File

@@ -1,7 +1,7 @@
PYTHON ?= python3
PYTHON_VERSION_MINOR := $(shell ${PYTHON} -c "import sys; print(sys.version_info.minor)")
PYRCC5 ?= pyrcc5
REQ_MINOR_VERSION = 6
REQ_MINOR_VERSION = 7
PREFIX ?= /usr/local
# Windows compatibility via Msys2

View File

@@ -1,2 +1,2 @@
__version__ = "4.2.0"
__version__ = "4.2.1"
__appname__ = "dupeGuru"

View File

@@ -283,7 +283,7 @@ def getmatches_by_contents(files, bigsize=0, j=job.nulljob):
"""Returns a list of :class:`Match` within ``files`` if their contents is the same.
:param bigsize: The size in bytes over which we consider files big enough to
justify taking samples of md5. If 0, compute md5 as usual.
justify taking samples of the file for hashing. If 0, compute digest as usual.
:param j: A :ref:`job progress instance <jobs>`.
"""
size2files = defaultdict(set)
@@ -300,15 +300,15 @@ def getmatches_by_contents(files, bigsize=0, j=job.nulljob):
if first.is_ref and second.is_ref:
continue # Don't spend time comparing two ref pics together.
if first.size == 0 and second.size == 0:
# skip md5 for zero length files
# skip hashing for zero length files
result.append(Match(first, second, 100))
continue
if first.md5partial == second.md5partial:
if first.digest_partial == second.digest_partial:
if bigsize > 0 and first.size > bigsize:
if first.md5samples == second.md5samples:
if first.digest_samples == second.digest_samples:
result.append(Match(first, second, 100))
else:
if first.md5 == second.md5:
if first.digest == second.digest:
result.append(Match(first, second, 100))
group_count += 1
j.add_progress(desc=PROGRESS_MESSAGE % (len(result), group_count))
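Condensed, the hunk above is a three-tier comparison; a hypothetical standalone helper makes the control flow plain (the attribute names are the real ones from this diff, the function itself is illustrative):

def contents_match(first, second, bigsize):
    # Tier 1: cheap screen on a small fixed-offset sample of each file.
    if first.digest_partial != second.digest_partial:
        return False
    if bigsize > 0 and first.size > bigsize:
        # Tier 2: for big files, compare three sampled chunks only.
        return first.digest_samples == second.digest_samples
    # Tier 3: otherwise fall back to the full-file digest.
    return first.digest == second.digest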

View File

@@ -11,12 +11,13 @@
# resulting needless complexity and memory usage. It's been a while since I wanted to do that fork,
# and I'm doing it now.
import hashlib
import os
import xxhash
from math import floor
import logging
import sqlite3
from threading import Lock
from typing import Any
from typing import Any, AnyStr, Union
from hscommon.path import Path
from hscommon.util import nonone, get_file_ext
@@ -40,7 +41,7 @@ NOT_SET = object()
# CPU.
CHUNK_SIZE = 1024 * 1024 # 1 MiB
# Minimum size below which partial hashes don't need to be computed
# Minimum size below which partial hashing is not used
MIN_FILE_SIZE = 3 * CHUNK_SIZE # 3MiB, because we take 3 samples
@@ -83,9 +84,11 @@ class OperationError(FSError):
class FilesDB:
schema_version = 1
schema_version_description = "Changed from md5 to xxhash"
create_table_query = "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, entry_dt DATETIME, md5 BLOB, md5partial BLOB)"
drop_table_query = "DROP TABLE files;"
create_table_query = "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, entry_dt DATETIME, digest BLOB, digest_partial BLOB, digest_samples BLOB)"
drop_table_query = "DROP TABLE IF EXISTS files;"
select_query = "SELECT {key} FROM files WHERE path=:path AND size=:size and mtime_ns=:mtime_ns"
insert_query = """
INSERT INTO files (path, size, mtime_ns, entry_dt, {key}) VALUES (:path, :size, :mtime_ns, datetime('now'), :value)
@@ -97,24 +100,37 @@ class FilesDB:
self.cur = None
self.lock = None
def connect(self, path):
# type: (str, ) -> None
def connect(self, path: Union[AnyStr, os.PathLike]) -> None:
self.conn = sqlite3.connect(path, check_same_thread=False)
self.cur = self.conn.cursor()
self.cur.execute(self.create_table_query)
self.lock = Lock()
self._check_upgrade()
def clear(self):
# type: () -> None
def _check_upgrade(self) -> None:
with self.lock:
has_schema = self.cur.execute(
"SELECT NAME FROM sqlite_master WHERE type='table' AND name='schema_version'"
).fetchall()
version = None
if has_schema:
version = self.cur.execute("SELECT version FROM schema_version ORDER BY version DESC").fetchone()[0]
else:
self.cur.execute("CREATE TABLE schema_version (version int PRIMARY KEY, description TEXT)")
if version != self.schema_version:
self.cur.execute(self.drop_table_query)
self.cur.execute(
"INSERT OR REPLACE INTO schema_version VALUES (:version, :description)",
{"version": self.schema_version, "description": self.schema_version_description},
)
self.cur.execute(self.create_table_query)
self.conn.commit()
def clear(self) -> None:
with self.lock:
self.cur.execute(self.drop_table_query)
self.cur.execute(self.create_table_query)
def get(self, path, key):
# type: (Path, str) -> bytes
def get(self, path: Path, key: str) -> Union[bytes, None]:
stat = path.stat()
size = stat.st_size
mtime_ns = stat.st_mtime_ns
@@ -128,9 +144,7 @@ class FilesDB:
return None
def put(self, path, key, value):
# type: (Path, str, Any) -> None
def put(self, path: Path, key: str, value: Any) -> None:
stat = path.stat()
size = stat.st_size
mtime_ns = stat.st_mtime_ns
@@ -141,15 +155,11 @@ class FilesDB:
{"path": str(path), "size": size, "mtime_ns": mtime_ns, "value": value},
)
def commit(self):
# type: () -> None
def commit(self) -> None:
with self.lock:
self.conn.commit()
def close(self):
# type: () -> None
def close(self) -> None:
with self.lock:
self.cur.close()
self.conn.close()
@@ -161,7 +171,7 @@ filesdb = FilesDB() # Singleton
class File:
"""Represents a file and holds metadata to be used for scanning."""
INITIAL_INFO = {"size": 0, "mtime": 0, "md5": b"", "md5partial": b"", "md5samples": b""}
INITIAL_INFO = {"size": 0, "mtime": 0, "digest": b"", "digest_partial": b"", "digest_samples": b""}
# Slots for File make us save quite a bit of memory. In a memory test I've made with a lot of
# files, I saved 35% memory usage with "unread" files (no _read_info() call) and gains become
# even greater when we take into account read attributes (70%!). Yeah, it's worth it.
@@ -187,32 +197,51 @@ class File:
result = self.INITIAL_INFO[attrname]
return result
def _calc_md5(self):
def _calc_digest(self):
# type: () -> bytes
with self.path.open("rb") as fp:
md5 = hashlib.md5()
file_hash = xxhash.xxh128()
# The goal here is to not run out of memory on really big files. However, the chunk
# size has to be large enough so that the python loop isn't too costly in terms of
# CPU.
CHUNK_SIZE = 1024 * 1024 # 1 mb
filedata = fp.read(CHUNK_SIZE)
while filedata:
md5.update(filedata)
file_hash.update(filedata)
filedata = fp.read(CHUNK_SIZE)
return md5.digest()
return file_hash.digest()
def _calc_md5partial(self):
def _calc_digest_partial(self):
# type: () -> bytes
# This offset is where we should start reading the file to get a partial md5
# This offset is where we should start reading the file to get a partial hash
# For audio files, it should be where audio data starts
offset, size = (0x4000, 0x4000)
with self.path.open("rb") as fp:
fp.seek(offset)
partialdata = fp.read(size)
return hashlib.md5(partialdata).digest()
partial_data = fp.read(size)
return xxhash.xxh128_digest(partial_data)
def _calc_digest_samples(self) -> bytes:
size = self.size
with self.path.open("rb") as fp:
# Chunk at 25% of the file
fp.seek(floor(size * 25 / 100), 0)
file_data = fp.read(CHUNK_SIZE)
file_hash = xxhash.xxh128(file_data)
# Chunk at 60% of the file
fp.seek(floor(size * 60 / 100), 0)
file_data = fp.read(CHUNK_SIZE)
file_hash.update(file_data)
# Last chunk of the file
fp.seek(-CHUNK_SIZE, 2)
file_data = fp.read(CHUNK_SIZE)
file_hash.update(file_data)
return file_hash.digest()
def _read_info(self, field):
# print(f"_read_info({field}) for {self}")
@@ -220,48 +249,35 @@ class File:
stats = self.path.stat()
self.size = nonone(stats.st_size, 0)
self.mtime = nonone(stats.st_mtime, 0)
elif field == "md5partial":
elif field == "digest_partial":
try:
self.md5partial = filesdb.get(self.path, "md5partial")
if self.md5partial is None:
self.md5partial = self._calc_md5partial()
filesdb.put(self.path, "md5partial", self.md5partial)
self.digest_partial = filesdb.get(self.path, "digest_partial")
if self.digest_partial is None:
self.digest_partial = self._calc_digest_partial()
filesdb.put(self.path, "digest_partial", self.digest_partial)
except Exception as e:
logging.warning("Couldn't get md5partial for %s: %s", self.path, e)
elif field == "md5":
logging.warning("Couldn't get digest_partial for %s: %s", self.path, e)
elif field == "digest":
try:
self.md5 = filesdb.get(self.path, "md5")
if self.md5 is None:
self.md5 = self._calc_md5()
filesdb.put(self.path, "md5", self.md5)
self.digest = filesdb.get(self.path, "digest")
if self.digest is None:
self.digest = self._calc_digest()
filesdb.put(self.path, "digest", self.digest)
except Exception as e:
logging.warning("Couldn't get md5 for %s: %s", self.path, e)
elif field == "md5samples":
logging.warning("Couldn't get digest for %s: %s", self.path, e)
elif field == "digest_samples":
size = self.size
# Might as well hash such small files entirely.
if size <= MIN_FILE_SIZE:
setattr(self, field, self.digest)
return
try:
with self.path.open("rb") as fp:
size = self.size
# Might as well hash such small files entirely.
if size <= MIN_FILE_SIZE:
setattr(self, field, self.md5)
return
# Chunk at 25% of the file
fp.seek(floor(size * 25 / 100), 0)
filedata = fp.read(CHUNK_SIZE)
md5 = hashlib.md5(filedata)
# Chunk at 60% of the file
fp.seek(floor(size * 60 / 100), 0)
filedata = fp.read(CHUNK_SIZE)
md5.update(filedata)
# Last chunk of the file
fp.seek(-CHUNK_SIZE, 2)
filedata = fp.read(CHUNK_SIZE)
md5.update(filedata)
setattr(self, field, md5.digest())
self.digest_samples = filesdb.get(self.path, "digest_samples")
if self.digest_samples is None:
self.digest_samples = self._calc_digest_samples()
filesdb.put(self.path, "digest_samples", self.digest_samples)
except Exception as e:
logging.error(f"Error computing md5samples: {e}")
logging.warning(f"Couldn't get digest_samples for {self.path}: {e}")
def _read_all_info(self, attrnames=None):
"""Cache all possible info.
@@ -314,7 +330,7 @@ class File:
class Folder(File):
"""A wrapper around a folder path.
It has the size/md5 info of a File, but its value is the sum of its subitems.
It has the size/digest info of a File, but its value is the sum of its subitems.
"""
__slots__ = File.__slots__ + ("_subfolders",)
@@ -335,19 +351,18 @@ class Folder(File):
self.size = size
stats = self.path.stat()
self.mtime = nonone(stats.st_mtime, 0)
elif field in {"md5", "md5partial", "md5samples"}:
elif field in {"digest", "digest_partial", "digest_samples"}:
# What's sensitive here is that we must make sure that subfiles'
# md5 are always added up in the same order, but we also want a
# different md5 if a file gets moved in a different subdirectory.
# digest are always added up in the same order, but we also want a
# different digest if a file gets moved in a different subdirectory.
def get_dir_md5_concat():
def get_dir_digest_concat():
items = self._all_items()
items.sort(key=lambda f: f.path)
md5s = [getattr(f, field) for f in items]
return b"".join(md5s)
digests = [getattr(f, field) for f in items]
return b"".join(digests)
md5 = hashlib.md5(get_dir_md5_concat())
digest = md5.digest()
digest = xxhash.xxh128_digest(get_dir_digest_concat())
setattr(self, field, digest)
@property
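The digest_samples scheme added above hashes three fixed regions instead of the whole file. A self-contained sketch of that scheme under the same constants (the function name is illustrative; the real method is File._calc_digest_samples):

from math import floor
import xxhash

CHUNK_SIZE = 1024 * 1024  # 1 MiB

def sample_digest(path, size):
    with open(path, "rb") as fp:
        # Chunk at 25% of the file
        fp.seek(floor(size * 25 / 100), 0)
        file_hash = xxhash.xxh128(fp.read(CHUNK_SIZE))
        # Chunk at 60% of the file
        fp.seek(floor(size * 60 / 100), 0)
        file_hash.update(fp.read(CHUNK_SIZE))
        # Last chunk of the file
        fp.seek(-CHUNK_SIZE, 2)
        file_hash.update(fp.read(CHUNK_SIZE))
    return file_hash.digest()

As the _read_info hunk above shows, files at or below MIN_FILE_SIZE (3 MiB) skip this path and reuse the full digest.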

View File

@@ -97,11 +97,6 @@ class MusicFile(fs.File):
"dupe_count": format_dupe_count(dupe_count),
}
def _get_md5partial_offset_and_size(self):
# No longer calculating the offset and audio size, just whole file
size = self.path.stat().st_size
return (0, size)
def _read_info(self, field):
fs.File._read_info(self, field)
if field in TAG_FIELDS:

View File

@@ -238,7 +238,7 @@ def getmatches(pictures, cache_path, threshold, match_scaled=False, j=job.nulljo
for ref_id, other_id, percentage in myiter:
ref = id2picture[ref_id]
other = id2picture[other_id]
if percentage == 100 and ref.md5 != other.md5:
if percentage == 100 and ref.digest != other.digest:
percentage = 99
if percentage >= threshold:
ref.dimensions # pre-read dimensions for display in results

View File

@@ -86,9 +86,9 @@ class NamedObject:
folder = "basepath"
self._folder = Path(folder)
self.size = size
self.md5partial = name
self.md5 = name
self.md5samples = name
self.digest_partial = name
self.digest = name
self.digest_samples = name
if with_words:
self.words = getwords(name)
self.is_ref = False

View File

@@ -530,7 +530,7 @@ class TestCaseGetMatches:
class TestCaseGetMatchesByContents:
def test_big_file_partial_hashes(self):
def test_big_file_partial_hashing(self):
smallsize = 1
bigsize = 100 * 1024 * 1024 # 100MB
f = [
@@ -539,17 +539,17 @@ class TestCaseGetMatchesByContents:
no("smallfoo", size=smallsize),
no("smallbar", size=smallsize),
]
f[0].md5 = f[0].md5partial = f[0].md5samples = "foobar"
f[1].md5 = f[1].md5partial = f[1].md5samples = "foobar"
f[2].md5 = f[2].md5partial = "bleh"
f[3].md5 = f[3].md5partial = "bleh"
f[0].digest = f[0].digest_partial = f[0].digest_samples = "foobar"
f[1].digest = f[1].digest_partial = f[1].digest_samples = "foobar"
f[2].digest = f[2].digest_partial = "bleh"
f[3].digest = f[3].digest_partial = "bleh"
r = getmatches_by_contents(f, bigsize=bigsize)
eq_(len(r), 2)
# User disabled optimization for big files, compute hashes as usual
# User disabled optimization for big files, compute digests as usual
r = getmatches_by_contents(f, bigsize=0)
eq_(len(r), 2)
# Other file is now slightly different, md5partial is still the same
f[1].md5 = f[1].md5samples = "foobardiff"
# Other file is now slightly different, digest_partial is still the same
f[1].digest = f[1].digest_samples = "foobardiff"
r = getmatches_by_contents(f, bigsize=bigsize)
# Successfully filter it out
eq_(len(r), 1)

View File

@@ -6,7 +6,7 @@
# which should be included with this package. The terms are also available at
# http://www.gnu.org/licenses/gpl-3.0.html
import hashlib
import xxhash
from os import urandom
from hscommon.path import Path
@@ -52,54 +52,54 @@ def test_size_aggregates_subfiles(tmpdir):
eq_(b.size, 12)
def test_md5_aggregate_subfiles_sorted(tmpdir):
# dir.allfiles can return child in any order. Thus, bundle.md5 must aggregate
# all files' md5 it contains, but it must make sure that it does so in the
def test_digest_aggregate_subfiles_sorted(tmpdir):
# dir.allfiles can return children in any order. Thus, bundle.digest must aggregate
# all files' digests it contains, but it must make sure that it does so in the
# same order every time.
p = create_fake_fs_with_random_data(Path(str(tmpdir)))
b = fs.Folder(p)
md51 = fs.File(p["dir1"]["file1.test"]).md5
md52 = fs.File(p["dir2"]["file2.test"]).md5
md53 = fs.File(p["dir3"]["file3.test"]).md5
md54 = fs.File(p["file1.test"]).md5
md55 = fs.File(p["file2.test"]).md5
md56 = fs.File(p["file3.test"]).md5
# The expected md5 is the md5 of md5s for folders and the direct md5 for files
folder_md51 = hashlib.md5(md51).digest()
folder_md52 = hashlib.md5(md52).digest()
folder_md53 = hashlib.md5(md53).digest()
md5 = hashlib.md5(folder_md51 + folder_md52 + folder_md53 + md54 + md55 + md56)
eq_(b.md5, md5.digest())
digest1 = fs.File(p["dir1"]["file1.test"]).digest
digest2 = fs.File(p["dir2"]["file2.test"]).digest
digest3 = fs.File(p["dir3"]["file3.test"]).digest
digest4 = fs.File(p["file1.test"]).digest
digest5 = fs.File(p["file2.test"]).digest
digest6 = fs.File(p["file3.test"]).digest
# The expected digest is the hash of digests for folders and the direct digest for files
folder_digest1 = xxhash.xxh128_digest(digest1)
folder_digest2 = xxhash.xxh128_digest(digest2)
folder_digest3 = xxhash.xxh128_digest(digest3)
digest = xxhash.xxh128_digest(folder_digest1 + folder_digest2 + folder_digest3 + digest4 + digest5 + digest6)
eq_(b.digest, digest)
def test_partial_md5_aggregate_subfile_sorted(tmpdir):
def test_partial_digest_aggregate_subfile_sorted(tmpdir):
p = create_fake_fs_with_random_data(Path(str(tmpdir)))
b = fs.Folder(p)
md51 = fs.File(p["dir1"]["file1.test"]).md5partial
md52 = fs.File(p["dir2"]["file2.test"]).md5partial
md53 = fs.File(p["dir3"]["file3.test"]).md5partial
md54 = fs.File(p["file1.test"]).md5partial
md55 = fs.File(p["file2.test"]).md5partial
md56 = fs.File(p["file3.test"]).md5partial
# The expected md5 is the md5 of md5s for folders and the direct md5 for files
folder_md51 = hashlib.md5(md51).digest()
folder_md52 = hashlib.md5(md52).digest()
folder_md53 = hashlib.md5(md53).digest()
md5 = hashlib.md5(folder_md51 + folder_md52 + folder_md53 + md54 + md55 + md56)
eq_(b.md5partial, md5.digest())
digest1 = fs.File(p["dir1"]["file1.test"]).digest_partial
digest2 = fs.File(p["dir2"]["file2.test"]).digest_partial
digest3 = fs.File(p["dir3"]["file3.test"]).digest_partial
digest4 = fs.File(p["file1.test"]).digest_partial
digest5 = fs.File(p["file2.test"]).digest_partial
digest6 = fs.File(p["file3.test"]).digest_partial
# The expected digest is the hash of digests for folders and the direct digest for files
folder_digest1 = xxhash.xxh128_digest(digest1)
folder_digest2 = xxhash.xxh128_digest(digest2)
folder_digest3 = xxhash.xxh128_digest(digest3)
digest = xxhash.xxh128_digest(folder_digest1 + folder_digest2 + folder_digest3 + digest4 + digest5 + digest6)
eq_(b.digest_partial, digest)
md51 = fs.File(p["dir1"]["file1.test"]).md5samples
md52 = fs.File(p["dir2"]["file2.test"]).md5samples
md53 = fs.File(p["dir3"]["file3.test"]).md5samples
md54 = fs.File(p["file1.test"]).md5samples
md55 = fs.File(p["file2.test"]).md5samples
md56 = fs.File(p["file3.test"]).md5samples
# The expected md5 is the md5 of md5s for folders and the direct md5 for files
folder_md51 = hashlib.md5(md51).digest()
folder_md52 = hashlib.md5(md52).digest()
folder_md53 = hashlib.md5(md53).digest()
md5 = hashlib.md5(folder_md51 + folder_md52 + folder_md53 + md54 + md55 + md56)
eq_(b.md5samples, md5.digest())
digest1 = fs.File(p["dir1"]["file1.test"]).digest_samples
digest2 = fs.File(p["dir2"]["file2.test"]).digest_samples
digest3 = fs.File(p["dir3"]["file3.test"]).digest_samples
digest4 = fs.File(p["file1.test"]).digest_samples
digest5 = fs.File(p["file2.test"]).digest_samples
digest6 = fs.File(p["file3.test"]).digest_samples
# The expected digest is the digest of digests for folders and the direct digest for files
folder_digest1 = xxhash.xxh128_digest(digest1)
folder_digest2 = xxhash.xxh128_digest(digest2)
folder_digest3 = xxhash.xxh128_digest(digest3)
digest = xxhash.xxh128_digest(folder_digest1 + folder_digest2 + folder_digest3 + digest4 + digest5 + digest6)
eq_(b.digest_samples, digest)
def test_has_file_attrs(tmpdir):
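These tests pin down the folder-aggregation rule from the fs.py diff: children are sorted by path, each subfolder contributes the xxh128 of its own aggregate, and the parent digest is the xxh128 of the concatenation. A recursive sketch (children and is_folder are illustrative attribute names; the real code flattens the tree via _all_items()):

import xxhash

def folder_digest(folder):
    # Sort by path so the aggregate is stable across directory-listing order,
    # yet changes when a file moves to a different subdirectory.
    items = sorted(folder.children, key=lambda f: f.path)
    concat = b"".join(folder_digest(f) if f.is_folder else f.digest for f in items)
    return xxhash.xxh128_digest(concat)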

View File

@@ -123,19 +123,19 @@ def test_content_scan(fake_fileexists):
s = Scanner()
s.scan_type = ScanType.CONTENTS
f = [no("foo"), no("bar"), no("bleh")]
f[0].md5 = f[0].md5partial = f[0].md5samples = "foobar"
f[1].md5 = f[1].md5partial = f[1].md5samples = "foobar"
f[2].md5 = f[2].md5partial = f[1].md5samples = "bleh"
f[0].digest = f[0].digest_partial = f[0].digest_samples = "foobar"
f[1].digest = f[1].digest_partial = f[1].digest_samples = "foobar"
f[2].digest = f[2].digest_partial = f[1].digest_samples = "bleh"
r = s.get_dupe_groups(f)
eq_(len(r), 1)
eq_(len(r[0]), 2)
eq_(s.discarded_file_count, 0) # don't count the different md5 as discarded!
eq_(s.discarded_file_count, 0) # don't count the different digest as discarded!
def test_content_scan_compare_sizes_first(fake_fileexists):
class MyFile(no):
@property
def md5(self):
def digest(self):
raise AssertionError()
s = Scanner()
@@ -161,14 +161,14 @@ def test_ignore_file_size(fake_fileexists):
no("largeignore1", large_size + 1),
no("largeignore2", large_size + 1),
]
f[0].md5 = f[0].md5partial = f[0].md5samples = "smallignore"
f[1].md5 = f[1].md5partial = f[1].md5samples = "smallignore"
f[2].md5 = f[2].md5partial = f[2].md5samples = "small"
f[3].md5 = f[3].md5partial = f[3].md5samples = "small"
f[4].md5 = f[4].md5partial = f[4].md5samples = "large"
f[5].md5 = f[5].md5partial = f[5].md5samples = "large"
f[6].md5 = f[6].md5partial = f[6].md5samples = "largeignore"
f[7].md5 = f[7].md5partial = f[7].md5samples = "largeignore"
f[0].digest = f[0].digest_partial = f[0].digest_samples = "smallignore"
f[1].digest = f[1].digest_partial = f[1].digest_samples = "smallignore"
f[2].digest = f[2].digest_partial = f[2].digest_samples = "small"
f[3].digest = f[3].digest_partial = f[3].digest_samples = "small"
f[4].digest = f[4].digest_partial = f[4].digest_samples = "large"
f[5].digest = f[5].digest_partial = f[5].digest_samples = "large"
f[6].digest = f[6].digest_partial = f[6].digest_samples = "largeignore"
f[7].digest = f[7].digest_partial = f[7].digest_samples = "largeignore"
r = s.get_dupe_groups(f)
# No ignores
@@ -197,21 +197,21 @@ def test_big_file_partial_hashes(fake_fileexists):
s.big_file_size_threshold = bigsize
f = [no("bigfoo", bigsize), no("bigbar", bigsize), no("smallfoo", smallsize), no("smallbar", smallsize)]
f[0].md5 = f[0].md5partial = f[0].md5samples = "foobar"
f[1].md5 = f[1].md5partial = f[1].md5samples = "foobar"
f[2].md5 = f[2].md5partial = "bleh"
f[3].md5 = f[3].md5partial = "bleh"
f[0].digest = f[0].digest_partial = f[0].digest_samples = "foobar"
f[1].digest = f[1].digest_partial = f[1].digest_samples = "foobar"
f[2].digest = f[2].digest_partial = "bleh"
f[3].digest = f[3].digest_partial = "bleh"
r = s.get_dupe_groups(f)
eq_(len(r), 2)
# md5partial is still the same, but the file is actually different
f[1].md5 = f[1].md5samples = "difffoobar"
# here we compare the full md5s, as the user disabled the optimization
# digest_partial is still the same, but the file is actually different
f[1].digest = f[1].digest_samples = "difffoobar"
# here we compare the full digests, as the user disabled the optimization
s.big_file_size_threshold = 0
r = s.get_dupe_groups(f)
eq_(len(r), 1)
# here we should compare the md5samples, and see they are different
# here we should compare the digest_samples, and see they are different
s.big_file_size_threshold = bigsize
r = s.get_dupe_groups(f)
eq_(len(r), 1)
@@ -221,9 +221,9 @@ def test_min_match_perc_doesnt_matter_for_content_scan(fake_fileexists):
s = Scanner()
s.scan_type = ScanType.CONTENTS
f = [no("foo"), no("bar"), no("bleh")]
f[0].md5 = f[0].md5partial = f[0].md5samples = "foobar"
f[1].md5 = f[1].md5partial = f[1].md5samples = "foobar"
f[2].md5 = f[2].md5partial = f[2].md5samples = "bleh"
f[0].digest = f[0].digest_partial = f[0].digest_samples = "foobar"
f[1].digest = f[1].digest_partial = f[1].digest_samples = "foobar"
f[2].digest = f[2].digest_partial = f[2].digest_samples = "bleh"
s.min_match_percentage = 101
r = s.get_dupe_groups(f)
eq_(len(r), 1)
@@ -234,12 +234,16 @@ def test_min_match_perc_doesnt_matter_for_content_scan(fake_fileexists):
eq_(len(r[0]), 2)
def test_content_scan_doesnt_put_md5_in_words_at_the_end(fake_fileexists):
def test_content_scan_doesnt_put_digest_in_words_at_the_end(fake_fileexists):
s = Scanner()
s.scan_type = ScanType.CONTENTS
f = [no("foo"), no("bar")]
f[0].md5 = f[0].md5partial = f[0].md5samples = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
f[1].md5 = f[1].md5partial = f[1].md5samples = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
f[0].digest = f[0].digest_partial = f[
0
].digest_samples = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
f[1].digest = f[1].digest_partial = f[
1
].digest_samples = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
r = s.get_dupe_groups(f)
# FIXME looks like we are missing something here?
r[0]
@@ -587,21 +591,21 @@ def test_folder_scan_exclude_subfolder_matches(fake_fileexists):
s = Scanner()
s.scan_type = ScanType.FOLDERS
topf1 = no("top folder 1", size=42)
topf1.md5 = topf1.md5partial = topf1.md5samples = b"some_md5_1"
topf1.digest = topf1.digest_partial = topf1.digest_samples = b"some_digest__1"
topf1.path = Path("/topf1")
topf2 = no("top folder 2", size=42)
topf2.md5 = topf2.md5partial = topf2.md5samples = b"some_md5_1"
topf2.digest = topf2.digest_partial = topf2.digest_samples = b"some_digest__1"
topf2.path = Path("/topf2")
subf1 = no("sub folder 1", size=41)
subf1.md5 = subf1.md5partial = subf1.md5samples = b"some_md5_2"
subf1.digest = subf1.digest_partial = subf1.digest_samples = b"some_digest__2"
subf1.path = Path("/topf1/sub")
subf2 = no("sub folder 2", size=41)
subf2.md5 = subf2.md5partial = subf2.md5samples = b"some_md5_2"
subf2.digest = subf2.digest_partial = subf2.digest_samples = b"some_digest__2"
subf2.path = Path("/topf2/sub")
eq_(len(s.get_dupe_groups([topf1, topf2, subf1, subf2])), 1) # only top folders
# however, if another folder matches a subfolder, keep it in the matches
otherf = no("other folder", size=41)
otherf.md5 = otherf.md5partial = otherf.md5samples = b"some_md5_2"
otherf.digest = otherf.digest_partial = otherf.digest_samples = b"some_digest__2"
otherf.path = Path("/otherfolder")
eq_(len(s.get_dupe_groups([topf1, topf2, subf1, subf2, otherf])), 2)
@@ -624,9 +628,9 @@ def test_dont_count_ref_files_as_discarded(fake_fileexists):
o1 = no("foo", path="p1")
o2 = no("foo", path="p2")
o3 = no("foo", path="p3")
o1.md5 = o1.md5partial = o1.md5samples = "foobar"
o2.md5 = o2.md5partial = o2.md5samples = "foobar"
o3.md5 = o3.md5partial = o3.md5samples = "foobar"
o1.digest = o1.digest_partial = o1.digest_samples = "foobar"
o2.digest = o2.digest_partial = o2.digest_samples = "foobar"
o3.digest = o3.digest_partial = o3.digest_samples = "foobar"
o1.is_ref = True
o2.is_ref = True
eq_(len(s.get_dupe_groups([o1, o2, o3])), 1)

View File

@@ -12,7 +12,7 @@ a community around this project.
So, whatever your skills, if you're interested in contributing to dupeGuru, please do so. Normally,
this documentation should be enough to get you started, but if it isn't, then **please**,
`let me know`_ because it's a problem that I'm committed to fix. If there's any situation where you'd
open a discussion at https://github.com/arsenetar/dupeguru/discussions. If there's any situation where you'd
wish to contribute but some doubt you're having prevents you from going forward, please contact me.
I'd much prefer to spend the time figuring out with you whether (and how) you can contribute than
taking the chance of missing that opportunity.
@@ -82,10 +82,9 @@ agree on what should be added to the documentation.
dupeGuru. For more information about how to do that, you can refer to the `translator guide`_.
.. _been open source: https://www.hardcoded.net/articles/free-as-in-speech-fair-as-in-trade
.. _let me know: mailto:hsoft@hardcoded.net
.. _Source code repository: https://github.com/arsenetar/dupeguru
.. _Issue Tracker: https://github.com/hsoft/arsenetar/issues
.. _Issue labels meaning: https://github.com/hsoft/arsenetar/wiki/issue-labels
.. _Issue Tracker: https://github.com/arsenetar/issues
.. _Issue labels meaning: https://github.com/arsenetar/wiki/issue-labels
.. _Sphinx: http://sphinx-doc.org/
.. _reST: http://en.wikipedia.org/wiki/ReStructuredText
.. _translator guide: https://github.com/hsoft/arsenetar/wiki/Translator-Guide
.. _translator guide: https://github.com/arsenetar/wiki/Translator-Guide

View File

@@ -45,7 +45,7 @@ class _ActualThread(threading.Thread):
self._lock = threading.Lock()
self._run = True
self.lastrowid = -1
self.setDaemon(True)
self.daemon = True
self.start()
def _query(self, query):

View File

@@ -1,9 +1,9 @@
# Translators:
# Yaya - Nurul Azeera Hidayah @ Muhammad Nur Hidayat Yasuyoshi (MNH48) <admin@mnh48.moe>, 2021
# Yaya - Nurul Azeera Hidayah @ Muhammad Nur Hidayat Yasuyoshi (MNH48) <admin@mnh48.moe>, 2022
#
msgid ""
msgstr ""
"Last-Translator: Yaya - Nurul Azeera Hidayah @ Muhammad Nur Hidayat Yasuyoshi (MNH48) <admin@mnh48.moe>, 2021\n"
"Last-Translator: Yaya - Nurul Azeera Hidayah @ Muhammad Nur Hidayat Yasuyoshi (MNH48) <admin@mnh48.moe>, 2022\n"
"Language-Team: Malay (https://www.transifex.com/voltaicideas/teams/116153/ms/)\n"
"Language: ms\n"
"Content-Type: text/plain; charset=UTF-8\n"
@@ -987,4 +987,4 @@ msgstr "Cache dikosongkan."
#: qt\preferences_dialog.py:173
msgid "Use dark style"
msgstr ""
msgstr "Guna gaya gelap"

View File

@ -1,10 +1,10 @@
# Translators:
# Ahmet Haydar Işık <itsahmthydr@gmail.com>, 2021
# Emin Tufan Çetin <etcetin@gmail.com>, 2021
# Emin Tufan Çetin <etcetin@gmail.com>, 2022
#
msgid ""
msgstr ""
"Last-Translator: Emin Tufan Çetin <etcetin@gmail.com>, 2021\n"
"Last-Translator: Emin Tufan Çetin <etcetin@gmail.com>, 2022\n"
"Language-Team: Turkish (https://www.transifex.com/voltaicideas/teams/116153/tr/)\n"
"Language: tr\n"
"Content-Type: text/plain; charset=UTF-8\n"
@@ -983,4 +983,4 @@ msgstr "Önbellek temizlendi."
#: qt\preferences_dialog.py:173
msgid "Use dark style"
msgstr ""
msgstr "Karanlık biçem kullan"

View File

@@ -1,6 +1,9 @@
# Translators:
# 太子 VC <taiziccf@gmail.com>, 2021
#
msgid ""
msgstr ""
"Last-Translator: 太子 VC <taiziccf@gmail.com>, 2021\n"
"Language-Team: Chinese (Taiwan) (https://www.transifex.com/voltaicideas/teams/116153/zh_TW/)\n"
"Language: zh_TW\n"
"Content-Type: text/plain; charset=UTF-8\n"
@@ -9,53 +12,53 @@ msgstr ""
#: qt/app.py:81
msgid "Quit"
msgstr ""
msgstr "退出"
#: qt/app.py:82 qt/preferences_dialog.py:116
#: cocoa/en.lproj/Localizable.strings:0
msgid "Options"
msgstr ""
msgstr "选项"
#: qt/app.py:83 qt/ignore_list_dialog.py:32
#: cocoa/en.lproj/Localizable.strings:0
msgid "Ignore List"
msgstr ""
msgstr "忽略列表"
#: qt/app.py:84 qt/app.py:179 cocoa/en.lproj/Localizable.strings:0
msgid "Clear Picture Cache"
msgstr ""
msgstr "清空图片缓存"
#: qt/app.py:85 cocoa/en.lproj/Localizable.strings:0
msgid "dupeGuru Help"
msgstr ""
msgstr "dupeGuru 帮助"
#: qt/app.py:86 cocoa/en.lproj/Localizable.strings:0
msgid "About dupeGuru"
msgstr ""
msgstr "关于 dupeGuru"
#: qt/app.py:87
msgid "Open Debug Log"
msgstr ""
msgstr "打开调试记录"
#: qt/app.py:180 cocoa/en.lproj/Localizable.strings:0
msgid "Do you really want to remove all your cached picture analysis?"
msgstr ""
msgstr "确定要移除所有缓存分析图片?"
#: qt/app.py:184
msgid "Picture cache cleared."
msgstr ""
msgstr "图片缓存已清空。"
#: qt/app.py:251
msgid "{} file (*.{})"
msgstr ""
msgstr "{} 文件 (*.{})"
#: qt/deletion_options.py:30 cocoa/en.lproj/Localizable.strings:0
msgid "Deletion Options"
msgstr ""
msgstr "删除选项"
#: qt/deletion_options.py:35 cocoa/en.lproj/Localizable.strings:0
msgid "Link deleted files"
msgstr ""
msgstr "链接已删除的文件"
#: qt/deletion_options.py:37 cocoa/en.lproj/Localizable.strings:0
msgid ""
@@ -91,20 +94,20 @@ msgstr ""
#: qt/deletion_options.py:60 cocoa/en.lproj/Localizable.strings:0
msgid "Cancel"
msgstr ""
msgstr "取消"
#: qt/details_table.py:16 cocoa/en.lproj/Localizable.strings:0
msgid "Attribute"
msgstr ""
msgstr "属性"
#: qt/details_table.py:16 cocoa/en.lproj/Localizable.strings:0
msgid "Selected"
msgstr ""
msgstr "已选择"
#: qt/details_table.py:16 qt/directories_model.py:24
#: cocoa/en.lproj/Localizable.strings:0
msgid "Reference"
msgstr ""
msgstr "源文件"
#: qt/directories_dialog.py:64 cocoa/en.lproj/Localizable.strings:0
msgid "Load Results..."
@@ -908,3 +911,43 @@ msgstr ""
#: qt\preferences_dialog.py:286
msgid "Display"
msgstr ""
#: qt\se\preferences_dialog.py:70
msgid "Partially hash files bigger than"
msgstr ""
#: qt\se\preferences_dialog.py:80
msgid "MB"
msgstr ""
#: qt\preferences_dialog.py:163
msgid "Use native OS dialogs"
msgstr ""
#: qt\preferences_dialog.py:166
msgid ""
"For actions such as file/folder selection use the OS native dialogs.\n"
"Some native dialogs have limited functionality."
msgstr ""
#: qt\se\preferences_dialog.py:68
msgid "Ignore files larger than"
msgstr ""
#: qt\app.py:135 qt\app.py:293
msgid "Clear Cache"
msgstr ""
#: qt\app.py:294
msgid ""
"Do you really want to clear the cache? This will remove all cached file "
"hashes and picture analysis."
msgstr ""
#: qt\app.py:299
msgid "Cache cleared."
msgstr ""
#: qt\preferences_dialog.py:173
msgid "Use dark style"
msgstr ""

View File

@@ -3,3 +3,7 @@ requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[tool.black]
line-length = 120
[tool.isort]
# make it compatible with black
profile = "black"
skip_gitignore = true

View File

@@ -1041,44 +1041,26 @@ class ScrollAreaImageViewer(QScrollArea):
"""After scaling, no mouse position, default to center."""
# scrollBar.setMaximum(scrollBar.maximum() - scrollBar.minimum() + scrollBar.pageStep())
self._horizontalScrollBar.setValue(
int(
factor * self._horizontalScrollBar.value()
+ ((factor - 1) * self._horizontalScrollBar.pageStep() / 2)
)
int(factor * self._horizontalScrollBar.value() + ((factor - 1) * self._horizontalScrollBar.pageStep() / 2))
)
self._verticalScrollBar.setValue(
int(
factor * self._verticalScrollBar.value()
+ ((factor - 1) * self._verticalScrollBar.pageStep() / 2)
)
int(factor * self._verticalScrollBar.value() + ((factor - 1) * self._verticalScrollBar.pageStep() / 2))
)
def adjustScrollBarsScaled(self, delta):
"""After scaling with the mouse, update relative to mouse position."""
self._horizontalScrollBar.setValue(
int(self._horizontalScrollBar.value() + delta.x())
)
self._verticalScrollBar.setValue(
int(self._verticalScrollBar.value() + delta.y())
)
self._horizontalScrollBar.setValue(int(self._horizontalScrollBar.value() + delta.x()))
self._verticalScrollBar.setValue(int(self._verticalScrollBar.value() + delta.y()))
def adjustScrollBarsAuto(self):
"""After panning, update accordingly."""
self.horizontalScrollBar().setValue(
int(self.horizontalScrollBar().value() - self._mousePanningDelta.x())
)
self.verticalScrollBar().setValue(
int(self.verticalScrollBar().value() - self._mousePanningDelta.y())
)
self.horizontalScrollBar().setValue(int(self.horizontalScrollBar().value() - self._mousePanningDelta.x()))
self.verticalScrollBar().setValue(int(self.verticalScrollBar().value() - self._mousePanningDelta.y()))
def adjustScrollBarCentered(self):
"""Just center in the middle."""
self._horizontalScrollBar.setValue(
int(self._horizontalScrollBar.maximum() / 2)
)
self._verticalScrollBar.setValue(
int(self._verticalScrollBar.maximum() / 2)
)
self._horizontalScrollBar.setValue(int(self._horizontalScrollBar.maximum() / 2))
self._verticalScrollBar.setValue(int(self._verticalScrollBar.maximum() / 2))
def resetCenter(self):
"""Resets origin"""

View File

@@ -14,11 +14,11 @@ if op.exists(__file__):
else:
# Should be a frozen environment
if ISOSX:
BASE_PATH = op.abspath(op.join(op.dirname(__file__), '..', '..', 'Resources'))
BASE_PATH = op.abspath(op.join(op.dirname(__file__), "..", "..", "Resources"))
else:
# For others our base path is ''.
BASE_PATH = ""
HELP_PATH = op.join(BASE_PATH, "help")
HELP_PATH = op.join(BASE_PATH, "help", "en")
if ISWINDOWS:
INITIAL_FOLDER_IN_DIALOGS = "C:\\"

View File

@@ -221,7 +221,7 @@ class TabWindow(QMainWindow):
super().showEvent(event)
def changeEvent(self, event):
if event.type() == QEvent.Type.WindowStateChange and not self.isMaximized():
if event.type() == QEvent.WindowStateChange and not self.isMaximized():
move_to_screen_center(self)
super().changeEvent(event)

View File

@@ -1,11 +1,10 @@
# Translators:
# Ahmet Haydar Işık <itsahmthydr@gmail.com>, 2021
# Emin Tufan Çetin <etcetin@gmail.com>, 2021
# Andrew Senetar <arsenetar@gmail.com>, 2022
# Emin Tufan Çetin <etcetin@gmail.com>, 2022
#
msgid ""
msgstr ""
"Last-Translator: Andrew Senetar <arsenetar@gmail.com>, 2022\n"
"Last-Translator: Emin Tufan Çetin <etcetin@gmail.com>, 2022\n"
"Language-Team: Turkish (https://www.transifex.com/voltaicideas/teams/116153/tr/)\n"
"Language: tr\n"
"Content-Type: text/plain; charset=UTF-8\n"
@@ -100,7 +99,7 @@ msgstr "Korece"
#: qtlib\preferences.py:34
msgid "Malay"
msgstr "Malay dili"
msgstr "Malayca"
#: qtlib\preferences.py:35
msgid "Dutch"

View File

@@ -1,7 +1,7 @@
distro>=1.5.0
mutagen>=1.44.0
PyQt5 >=5.14.1,<6.0; sys_platform != 'linux'
pywin32>=228; sys_platform == 'win32'
Send2Trash>=1.3.0
sphinx>=3.0.0
polib>=1.1.0
mutagen>=1.44.0
distro>=1.5.0
PyQt5 >=5.14.1,<6.0; sys_platform != 'linux'
pywin32>=228; sys_platform == 'win32'
xxhash>=3.0.0,<4.0.0

View File

@@ -30,13 +30,13 @@ packages = find:
python_requires = >=3.7
install_requires =
Send2Trash>=1.3.0
polib>=1.1.0
mutagen>=1.45.1
distro>=1.5.0
PyQt5 >=5.14.1,<6.0; sys_platform != 'linux'
pywin32>=228; sys_platform == 'win32'
setup_requires =
sphinx>=3.0.0
polib>=1.1.0
tests_require =
pytest >=6,<7
include_package_data = true

View File

@@ -12,6 +12,8 @@ Unicode true
SetCompressor /SOLID lzma
; General Headers
!include "FileFunc.nsh"
!include "WinVer.nsh"
!include "LogicLib.nsh"
;==============================================================================
; Configuration Defines
@@ -279,6 +281,10 @@ SectionEnd
;==============================================================================
Function .onInit
${IfNot} ${AtLeastWin7}
MessageBox MB_OK "Windows 7 and above required"
Quit
${EndIf}
!if ${BITS} == "64"
SetRegView 64
!else