---
title: "Progress on concurrent repositories"
date: 2024-06-18
---

In the last devlog I was working on a system for concurrent repositories.
After a lot of experimenting, I've found a design that should work pretty
well, even at larger scales. In the process, the overall complexity of the
system has actually decreased in several places as well! Let me explain.

## Concurrent repositories

I went through a lot of ideas before settling on the current implementation.
Initially, both the parsing of packages and the regeneration of the package
archives happened inside the request handler, without any form of
synchronisation. This had several unwanted effects. For one, a burst of
uploads could quickly overload the CPU, as all packages would be processed in
parallel. These parse jobs would then also try to generate the package
archives in parallel, causing concurrent writes to the same files, which was a
mess of its own. And because all work was performed inside the request
handlers, the time it took for the server to respond depended on how congested
the system was, which wasn't acceptable to me. Something definitely had to
change.

My first solution heavily utilized the Tokio async runtime that Rieter is built
on. Each uploaded package would spawn a new task that competes for a
semaphore, allowing me to control how many packages get parsed in parallel.
Important to note here is that the request handler no longer needs to wait
until a package has finished parsing. The parse task runs asynchronously,
allowing the server to respond immediately with a [`202
Accepted`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202). This
way, clients no longer have to wait unnecessarily long for work that can be
performed in the background on the server. Each parse task would then
regenerate the package archives if it successfully parsed its package.

Because each task regenerated the package archives, this approach performed a
lot of redundant work. The constant spawning of Tokio tasks also didn't sit
right with me, so I tried another design, which ended up being the current
version.

### Current design

I settled on a much more classic design: worker threads, or rather, Tokio
tasks. On startup, Rieter launches `n` worker tasks that listen for messages on
an [mpsc](https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html) channel. The
receiver is shared between the workers using a mutex, so each message only gets
picked up by one of them. Each request handler first writes the uploaded
package to a temporary file, then sends a `(repo, path)` tuple over the
channel, notifying one of the workers that a new package is ready to be parsed.
Whenever the queue for a repository runs empty, the package archives get
regenerated, effectively batching this operation. This technique is so much
simpler and works wonders.
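To illustrate the idea, here's a minimal sketch of such a worker pool. This is
not Rieter's actual code: the `Job` alias, the `Repo` placeholder type, the
channel capacity, and the worker count are all assumptions of mine.

```rust
use std::{path::PathBuf, sync::Arc};

use tokio::sync::{mpsc, Mutex};

/// Hypothetical stand-in for Rieter's repository handle.
#[derive(Debug)]
struct Repo(i32);

/// A queued package: the repository it belongs to and the temporary
/// file the request handler wrote it to.
type Job = (Repo, PathBuf);

async fn worker(id: usize, rx: Arc<Mutex<mpsc::Receiver<Job>>>) {
    loop {
        // Hold the mutex only while taking one message off the channel.
        // The guard is dropped at the end of this statement, so the
        // parsing below happens without the lock and the workers can
        // run in parallel.
        let job = rx.lock().await.recv().await;

        let Some((repo, path)) = job else {
            // `recv()` returned `None`: all senders are gone, shut down.
            break;
        };

        println!("worker {id}: parsing {} for {repo:?}", path.display());
        // ... parse the package, insert it into the database, and
        // regenerate the package archives once the queue runs empty ...
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<Job>(32);
    let rx = Arc::new(Mutex::new(rx));

    // Spawn `n` workers that all compete for messages on one channel.
    let n: usize = 2;
    let workers: Vec<_> = (0..n)
        .map(|id| tokio::spawn(worker(id, Arc::clone(&rx))))
        .collect();

    // A request handler would do something like this after writing the
    // uploaded package to a temporary file, then respond with a 202.
    tx.send((Repo(1), "/tmp/upload.pkg.tar.zst".into())).await.unwrap();

    drop(tx);
    for handle in workers {
        handle.await.unwrap();
    }
}
```

The `n` here is presumably what the `pkg_workers` setting shown further down
controls.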
### Package queueing

I did have some fun designing the internals of this system. My goal was to
have a repository seamlessly handle any number of packages being uploaded, even
different versions of the same package. To achieve this, I leveraged the
database.

Each parsed package's information gets added to the database with a unique,
monotonically increasing ID. Each repository can only have one version of a
package present for each of its architectures. For each package name, the
relevant package to add to the package archives is thus the one with the
largest ID. This led me to the following (in my opinion rather elegant) query:

```sql
SELECT * FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2
```

For each `(repo, arch, name)` tuple, we find the largest ID and select it, but
only if its state is not `2`, which means *pending deletion*. Finding the old
packages to remove is then a similar query, where we instead select all
packages that are marked as *pending deletion*, or whose ID is less than that
of the currently committed package.

This design not only seamlessly supports packages being added in any order; it
also allows me to update a repository atomically, which paves the way for
implementing repository mirroring down the line. I'll simply queue the new
packages and only regenerate the package archives once all of them have
successfully synced to the server.

## Simplifying things

While developing the repository system, I realized how complex I was making
some things. For example, repositories are grouped into distros, and this
structure was also visible in the codebase: each distro had its own "distro
manager" that managed the packages for its repositories. However, this was a
needless overcomplication, as distros are purely an aesthetic feature. Each
repository has a unique ID in the database anyway, so this extra level of
complexity was completely unnecessary.

Package organisation on disk is also still overly complex right now. Each
repository has its own folder with its own packages, but this is once again an
excessive layer, as packages have unique IDs anyway. The database tracks which
packages are part of which repositories, so I'll switch to storing all packages
next to each other instead. This might also pave the way for some cool features
down the line, such as staging repositories.

I've been needlessly holding on to how I did things in Vieter, while I can
make completely new choices in Rieter. Rieter's file layout doesn't need to
resemble Vieter's at all, nor does it have to follow any notion of how Arch
repositories usually look. If need be, I can add an export utility that
converts the directory structure into a more classic layout, but I shouldn't
bother keeping it in mind while developing Rieter.

## Configuration

I switched configuration from environment variables and CLI arguments to a
dedicated config file. The former would've been too simplistic for the
configuration options I'll be adding later on.

```toml
api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"

[fs]
type = "local"
data_dir = "./data"

[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"
```

This will allow me a lot more flexibility in the future.

## First release

There's still some polish to be done, but I'm definitely nearing an initial 0.1
release for this project. I'm looking forward to announcing it!

As usual, thanks for reading, and have a nice day.