Mirror of https://github.com/arsenetar/dupeguru.git, synced 2026-01-22 14:41:39 +00:00
Fix stripping of (Japanese) Unicode characters
* Accents are stripped from Unicode characters to generate similar "words".
* Non-Latin characters that cannot be processed this way (e.g. Japanese, Greek, Russian, etc.) should not be filtered out at all; otherwise files are erroneously skipped, or detected as dupes when only some of their characters make it past the filter.
* Starting from an arbitrary Unicode codepoint (converted to decimal), above which we know any such processing is pointless, characters are left as-is.
* Fix #878.
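To see why the old filter was lossy, here is a minimal standalone sketch (illustrative only, not code from this repository) of NFD decomposition combined with the old ASCII-only filter:

    import string
    from unicodedata import normalize

    # NFD splits an accented letter into base letter + combining mark,
    # e.g. "é" becomes "e" followed by U+0301 (COMBINING ACUTE ACCENT).
    print([hex(ord(c)) for c in normalize("NFD", "é")])  # ['0x65', '0x301']

    # Japanese, Greek, Cyrillic, etc. have no decomposition to an ASCII base
    # letter, so the old ASCII-only filter dropped those characters entirely:
    allowed = string.ascii_letters + string.digits + string.whitespace
    s = normalize("NFD", "日本語 café".lower())
    print("".join(c for c in s if c in allowed))  # " cafe" -- 日本語 is gone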
@@ -26,8 +26,19 @@ def getwords(s):
     # We decompose the string so that ascii letters with accents can be part of the word.
     s = normalize("NFD", s)
     s = multi_replace(s, "-_&+():;\\[]{}.,<>/?~!@#$*", " ").lower()
+    # logging.debug(f"DEBUG chars for: {s}\n"
+    #               f"{[c for c in s if ord(c) != 32]}\n"
+    #               f"{[ord(c) for c in s if ord(c) != 32]}")
+    # HACK We shouldn't ignore non-ascii characters altogether. Any Unicode char
+    # above common european characters that cannot be "sanitized" (ie. stripped
+    # of their accents, etc.) are preserved as is. The arbitrary limit is
+    # obtained from this one: ord("\u037e") GREEK QUESTION MARK
     s = "".join(
-        c for c in s if c in string.ascii_letters + string.digits + string.whitespace
+        c for c in s
+        if (ord(c) < 894
+            and c in string.ascii_letters + string.digits + string.whitespace
+            )
+        or ord(c) > 894
     )
     return [_f for _f in s.split(" ") if _f]  # remove empty elements
 
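The patched behavior, reimplemented as a standalone sketch (the real getwords also runs multi_replace over punctuation first, which is omitted here):

    import string
    from unicodedata import normalize

    ALLOWED = string.ascii_letters + string.digits + string.whitespace
    LIMIT = ord("\u037e")  # 894, GREEK QUESTION MARK

    def keep(c):
        # Below the limit, keep only ASCII letters/digits/whitespace; the
        # combining marks produced by NFD fall below it and are dropped.
        # Above the limit, preserve the character as-is (Japanese, etc.).
        return (ord(c) < LIMIT and c in ALLOWED) or ord(c) > LIMIT

    def words(s):
        s = "".join(c for c in normalize("NFD", s.lower()) if keep(c))
        return [w for w in s.split(" ") if w]

    print(words("Café 日本語"))  # ['cafe', '日本語'] -- Japanese now survives

One edge worth noting: a character at exactly codepoint 894 satisfies neither branch of the predicate and is dropped; since U+037E is itself punctuation, that is consistent with the punctuation stripping above it.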