---
title: "Progress on concurrent repositories"
date: 2024-06-18
---
In the last devlog I mentioned I was working on a system for concurrent repositories.
After a lot of trial and error, I've found a design that should work pretty well, even
at larger scales. In doing so, the overall complexity of the system has
actually decreased in several places as well! Let me explain.
## Concurrent repositories
I went through a lot of ideas before settling on the current implementation.
Initially, both the parsing of packages and the regeneration of the package
archives happened inside the request handler, without any form of
synchronisation. This had several unwanted effects. For one, multiple package
uploads could quickly overload the CPU, as they would all be processed in parallel.
These parses would then also try to generate the package archives in parallel, causing
concurrent writes to the same files, which was a mess of its own. Because all work was
performed inside the request handlers, the time it took for the server to
respond depended on how congested the system was, which wasn't acceptable
to me. Something definitely had to change.
My first solution heavily utilized the Tokio async runtime that Rieter is built
on. Each package that gets uploaded would spawn a new task that competes for a
semaphore, allowing me to control how many packages get parsed in parallel.
Important to note here is that the request handler no longer needs to wait
until a package is finished parsing. The parse task is handled asynchronously,
allowing the server to respond immediately with a [`202
Accepted`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202). This
way, clients no longer need to wait unnecessarily long for a task that can be
performed asynchronously on the server. Each parse task would then regenerate
the package archives if it was able to successfully parse a package.
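In a simplified sketch (with made-up helper names, not the actual Rieter code), that first approach looked roughly like this:
```rust
use std::{path::PathBuf, sync::Arc};

use tokio::sync::Semaphore;

// Stand-ins for the real parsing and archive generation logic
async fn parse_package(_repo: i32, _path: &PathBuf) -> Result<(), ()> {
    Ok(())
}
async fn generate_archives(_repo: i32) {}

// Hypothetical version of what the request handler did: spawn a task per
// upload, but gate the actual parsing behind a semaphore so only a limited
// number of packages get parsed at the same time
fn queue_parse(semaphore: Arc<Semaphore>, repo: i32, pkg_path: PathBuf) {
    tokio::spawn(async move {
        // Wait for a free parse slot; the number of permits caps the parallelism
        let _permit = semaphore.acquire_owned().await.unwrap();

        if parse_package(repo, &pkg_path).await.is_ok() {
            // Every successful parse regenerated the archives, which is the
            // redundant work described below
            generate_archives(repo).await;
        }
    });

    // The handler itself returns right away and replies with 202 Accepted
}
```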
Because each task regenerates the package archives, this approach performed a
lot of extra work. The constant spawning of Tokio tasks also didn't sit right
with me, so I tried another design, which ended up being the current version.
### Current design
I settled on a much more classic design: worker threads, or rather, Tokio
tasks. On startup, Rieter launches `n` worker tasks that listen for messages on
an [mpsc](https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html) channel. The
receiver is shared between the workers using a mutex, so each message only gets
picked up by one of the workers. Each request handler first writes the uploaded
package to a temporary file and sends a `(repo, path)` tuple to the channel,
notifying one of the workers that a new package needs to be parsed. Each time the queue
for a repository runs empty, the package archives get regenerated, effectively
batching this operation. This approach is so much simpler and works wonders.
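Stripped down to its core, the worker setup looks roughly like this (again a simplified sketch with placeholder names):
```rust
use std::{path::PathBuf, sync::Arc};

use tokio::sync::{mpsc, Mutex};

/// A queued parse job: the repository's ID and the path to the temporary file
type Job = (i32, PathBuf);

// Stand-in for the real parse + archive regeneration logic
async fn handle_job(_job: Job) {}

fn spawn_workers(n: usize) -> mpsc::UnboundedSender<Job> {
    let (tx, rx) = mpsc::unbounded_channel();
    // The single receiver is wrapped in a mutex so all workers can share it;
    // whichever worker grabs the lock first picks up the next message
    let rx = Arc::new(Mutex::new(rx));

    for _ in 0..n {
        let rx = Arc::clone(&rx);

        tokio::spawn(async move {
            loop {
                let job = rx.lock().await.recv().await;

                match job {
                    Some(job) => handle_job(job).await,
                    // All senders are gone, so the worker can shut down
                    None => break,
                }
            }
        });
    }

    tx
}
```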
### Package queueing
I did have some fun designing the internals of this system. My goal was to
have a repository seamlessly handle any number of packages being uploaded, even
different versions of the same package. To achieve this, I leveraged the
database.
Each parsed package's information gets added to the database with a unique
monotonically increasing ID. Each repository can only have one version of a
package present for each of its architectures. For each package name, the
relevant package to add to the package archives is thus the one with the
largest ID. This resulted in this (in my opinion rather elegant) query:
```sql
SELECT * FROM "package" AS "p1" INNER JOIN (
SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
FROM "package"
GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2
```
For each `(repo, arch, name)` tuple, we find the largest ID and select it, but
only if its state is not `2`, which means *pending deletion*. Knowing what old
packages to remove is then a similar query to this, where we instead select all
packages that are marked as *pending deletion* or whose ID is less than the
currently committed package.
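As a rough sketch (simplified, not necessarily the exact query Rieter ends up running), that could look like this:
```sql
-- Packages that are either marked for deletion or superseded by a newer ID
SELECT "p1".* FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."repo_id" = "p2"."repo_id"
    AND "p1"."arch" = "p2"."arch"
    AND "p1"."name" = "p2"."name"
WHERE "p1"."repo_id" = 1
    AND ("p1"."state" = 2 OR "p1"."id" < "p2"."max_id")
```
The idea is the same per-name subquery, except now we keep the rows that are superseded or explicitly marked for deletion instead of the newest ones.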
This design not only seamlessly supports packages being added in any order; it
also lets me update a repository atomically, which paves the way for implementing
repository mirroring down the line. For mirroring, I'll simply queue new packages
and only regenerate the package archives once all packages have successfully
synced to the server.
## Simplifying things
During my development of the repository system, I realized how complex I was
making some things. For example, repositories are grouped into distros, and
this structure was also visible in the codebase. Each distro had its own "distro
manager" that managed the packages for its repositories. However, this was a
needless overcomplication, as distros are solely an aesthetic feature. Each
repository has a unique ID in the database anyways, so this extra level of
complexity was completely unnecessary.
Package organisation on disk is also still overly complex right now. Each
repository has its own folder with its own packages, but this once again is an
excessive layer as packages have unique IDs anyways. The database tracks which
packages are part of which repositories, so I'll switch to storing all packages
next to each other instead. This might also pave the way for some cool features
down the line, such as staging repositories.
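To illustrate, with a flat layout a package's path can be derived from nothing but its ID (hypothetical helper, just to show the idea):
```rust
use std::path::{Path, PathBuf};

// Hypothetical helper: with a flat layout, a package's location on disk only
// depends on its unique database ID, not on the repository it belongs to
fn package_path(data_dir: &Path, pkg_id: i64) -> PathBuf {
    data_dir.join("pkgs").join(pkg_id.to_string())
}
```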
I've been needlessly holding on to how I've done things with Vieter, while I
can make completely new choices in Rieter. The file system for Rieter doesn't
need to resemble the Vieter file system at all, nor should it follow any notion
of how Arch repositories usually look. If need be, I can add an export utility
to convert the directory structure into a more classic layout, but I shouldn't
bother keeping it in mind while developing Rieter.
## Configuration
I switched configuration from environment variables and CLI arguments to a
dedicated config file. Environment variables and CLI flags would've been too
simplistic for the more complex configuration options I'll be adding later, so
a config file it is.
```toml
api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"
[fs]
type = "local"
data_dir = "./data"
[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"
```
This will allow me a lot more flexibility in the future.
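A config like this maps nicely onto tagged serde enums. Here's a simplified sketch of how it could be parsed using serde and the toml crate (the structs here are illustrative, not necessarily what the codebase uses):
```rust
use std::path::PathBuf;

use serde::Deserialize;

// Illustrative structs mirroring the example config above
#[derive(Deserialize)]
struct Config {
    api_key: String,
    pkg_workers: usize,
    log_level: String,
    fs: FsConfig,
    db: DbConfig,
}

// The "type" key selects the enum variant, so new storage backends
// can be added as extra variants later
#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum FsConfig {
    Local { data_dir: PathBuf },
}

#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum DbConfig {
    Postgres {
        host: String,
        db: String,
        user: String,
        password: String,
    },
}

fn load_config(contents: &str) -> Result<Config, toml::de::Error> {
    toml::from_str(contents)
}
```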
## First release
There's still some polish to be done, but I'm definitely nearing an initial 0.1
release for this project. I'm looking forward to announcing it!
As usual, thanks for reading, and have a nice day.