diff --git a/UnderstandTheSource.md b/UnderstandTheSource.md
new file mode 100644
index 0000000..99cf42a
--- /dev/null
+++ b/UnderstandTheSource.md
@@ -0,0 +1,35 @@
+# Understand the source
+
+When you look at a non-trivial codebase for the first time, it's very difficult to understand any of it until you get the "Big Picture". This page is meant to help you get dupeGuru's big picture.
+
+## Model/View/Controller... nope!
+
+dupeGuru's codebase has quite a few design flaws. The Model, View and Controller roles are filled by different classes, scattered around. Being aware of that might help you understand what the heck is going on.
+
+The central piece of dupeGuru is `dupeguru.app.DupeGuru` (in the `core` code). It's the GUI code's only interface to the Python code. A duplicate scan is started with `start_scanning()`, directories are added through `add_directory()`, and so on.
+
+A lot of the app's functionality is implemented in the platform-specific subclasses of `app.DupeGuru`, such as `app_cocoa.DupeGuru`, or the `base.app.DupeGuru` class in the PyQt codebase. For example, "Remove Selected From Results" is performed by `app_cocoa.DupeGuru.RemoveSelected()` on the Obj-C side and by `base.app.DupeGuru.remove_duplicates()` on the PyQt side. All of this is quite ugly, I know (see the "Refactoring" section below).
+
+## Jobs
+
+A lot of operations in dupeGuru take a significant amount of time. This is why there's a generalized threaded job mechanism built into `app.DupeGuru`. `app.DupeGuru` has a `progress` member, which is an instance of `jobprogress.job.ThreadedJobPerformer`. It lets the GUI code know the progress of the current threaded job. When `app.DupeGuru` needs to start a job, it calls `_start_job()` and the platform-specific subclass deals with the details of starting the job.
+
+## Core principles
+
+The core of the duplicate matching takes place (for SE and ME, not PE) in `dupeguru.engine`. There's `MatchFactory.getmatches()`, which takes a list of `fs.File` instances and returns a list of `(firstfile, secondfile, match_percentage)` matches. Then there's `get_groups()`, which takes a list of matches and returns a list of `Group` instances (a `Group` is basically a list of `fs.File` instances that match each other).
+
+When a scan is over, the final result (the list of groups from `get_groups()`) is placed into `app.DupeGuru.results`, which is a `results.Results` instance. The `Results` instance is where all the dupe marking, sorting, removing, power marking, etc. takes place.
+
+## Refactoring
+
+As I mentioned at the beginning of the page, quite a few design mistakes have been made during the development of dupeGuru. One could argue that the whole codebase should get one huge refactoring, all at once, and be done with it. The problem is that huge refactorings are error-prone, especially with weak unit test coverage. Also, dupeGuru's development is not as active as it used to be. Sure, there are still features to be implemented, but nothing major (except the recent dupeGuru PE Cython/multiprocessing improvement). The approach I want to take is "slowly but surely": when you're about to work on a piece of code that needs refactoring, *then* do the refactoring. Until you need to work on that piece of code, leave it alone. Here's a list of ongoing refactorings:
+
+**Obj-C's dgbase merge.** When I created the different dupeGuru editions, I made the awful mistake of copy/pasting the whole Obj-C code, then just modifying what needed it. I know that was stupid, but I did it anyway. Then, a while later, I created the `dgbase` project, which contains the Obj-C code common to all editions. Instead of moving it all at once, which would have been error-prone, I just slowly push code down to dgbase when appropriate. Therefore, whenever a piece of Obj-C code is about to be modified, if it's common to all editions, _it has to be moved down to dgbase first_. No exceptions. If you copy/paste your modification 3 times, it means you're doing something wrong.
+
+**PEP8.** There are still some `CamelCaseMethods` lying around. When working near one of them, just rename it to `lowercase_with_underscore()` (don't forget the project-wide search/replace).
+
+**Platform-independent code in platform-specific units.** Some behavior in dupeGuru is defined by code in the platform-specific units, but is in fact platform-independent behavior. This is actually pretty tricky to refactor, because we're not dealing with clear-cut code duplication here. Pushing that behavior down to platform-independent units usually involves building an override mechanism and stuff like that.
+
+**Placing specific and common code where they belong.** Some code is not in the right place. For example, `app_cocoa` is not supposed to be in `dupeguru`, which is platform-independent code. But there's also platform-independent-but-edition-specific code, like `dupeguru.picture`. Although this unit is platform-independent, it is being checked out with dupeGuru ME and dupeGuru SE. This normally shouldn't be so. However, this kind of refactoring is tricky to do, and I'm not exactly sure how the code should be arranged for everything to be in the correct place. This has to be thought out.
+
+**PyQt camelCase.** My first experience with PyQt was porting dupeGuru's .NET code to PyQt. At first, I used `underscore_method_names()`, but later I decided to switch to `camelCase()` for PyQt code to blend in more with Qt's style. The result is an ongoing refactoring that changes `underscore_method()` names to `camelCase()`.
\ No newline at end of file