title	date
Progress on concurrent repositories	2024-06-18

In the last devlog I was working on a system for concurrent repositories. After a lot of experimenting, I've found a design that should work well, even at larger scales. Along the way, the overall complexity of the system has actually decreased in several places as well! Let me explain.

Concurrent repositories

I went through a lot of ideas before settling on the current implementation. Initially both the parsing of packages and the regeneration of the package archives happened inside the request handler, without any form of synchronisation. This had several unwanted effects. For one, several simultaneous uploads could quickly overload the CPU, as they would all be processed in parallel. They would then also try to generate the package archives in parallel, causing concurrent writes to the same files, which was a mess of its own. Because all work was performed inside the request handlers, the time it took for the server to respond depended on how congested the system was, which wasn't acceptable to me. Something definitely had to change.

My first solution relied heavily on the Tokio async runtime that Rieter is built on. Each uploaded package would spawn a new task that competes for a semaphore, allowing me to control how many packages get parsed in parallel. Important to note here is that the request handler no longer needs to wait until a package has finished parsing. The parse task runs asynchronously, allowing the server to respond immediately with a 202 Accepted, so clients no longer wait unnecessarily long for work the server can do in the background. Each parse task would then regenerate the package archives if it had successfully parsed its package.
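To make the semaphore idea concrete, here is a minimal sketch of it. It is not Rieter's code: it swaps Tokio tasks and an async semaphore for plain OS threads and a hand-rolled counting semaphore so that it only needs the standard library. The names `Semaphore` and `parse_all` are made up for illustration, and the sleep stands in for parsing a package.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (the standard library has none).
pub struct Semaphore {
    permits: Mutex<usize>,
    cond: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cond: Condvar::new() }
    }

    /// Blocks until a permit is available, then takes it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.cond.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting task.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cond.notify_one();
    }
}

/// Spawns one task per upload, lets at most `limit` of them parse at once,
/// and returns the highest parallelism that was observed.
/// `limit` must be at least 1, otherwise no task can ever start.
pub fn parse_all(uploads: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    // (tasks currently parsing, highest value seen so far)
    let stats = Arc::new(Mutex::new((0usize, 0usize)));

    let handles: Vec<_> = (0..uploads)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut s = stats.lock().unwrap();
                    s.0 += 1;
                    s.1 = s.1.max(s.0);
                }
                thread::sleep(Duration::from_millis(5)); // stand-in for parsing
                stats.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let peak = stats.lock().unwrap().1;
    peak
}
```

In a real request handler the spawned task would not be joined; the handler would return its 202 Accepted right after spawning. The join here only exists so the sketch can report the peak parallelism.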

Because each task regenerates the package archives, this approach performed a lot of extra work. The constant spawning of Tokio tasks also didn't sit right with me, so I tried another design, which ended up being the current version.

Current design

I settled on a much more classic design: worker threads, or rather, Tokio tasks. On startup, Rieter launches n worker tasks that listen for messages on an mpsc channel. The receiver is shared between the workers using a mutex, so each message only gets picked up by one of the workers. Each request first uploads its package to a temporary file, and then sends a tuple (repo, path) to the channel, notifying one of the workers that a new package is ready to be parsed. Whenever the queue for a repository becomes empty, the package archives get regenerated, effectively batching this operation. This technique is so much simpler and works wonders.
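The worker design can be sketched roughly as follows. This is an illustration under assumptions, not Rieter's implementation: it uses `std::sync::mpsc` and OS threads instead of Tokio's channel and tasks, tracks the per-repository queue with a simple in-memory counter, and only records which repositories would have their archives regenerated. The function name `process_uploads` and the counter are hypothetical.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Message sent by a request handler: the repository and the temporary file.
type Job = (String, PathBuf);

/// Queues every job, runs `n_workers` workers until the channel is drained,
/// and returns the repositories whose archives were "regenerated".
pub fn process_uploads(n_workers: usize, jobs: Vec<Job>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Job>();
    // The single receiver is shared by all workers behind a mutex.
    let rx = Arc::new(Mutex::new(rx));
    // Number of queued but unfinished packages per repository.
    let pending: Arc<Mutex<HashMap<String, usize>>> = Arc::new(Mutex::new(HashMap::new()));
    let regenerated: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));

    for job in jobs {
        *pending.lock().unwrap().entry(job.0.clone()).or_insert(0) += 1;
        tx.send(job).unwrap();
    }
    drop(tx); // closing the channel lets the workers exit once it is empty

    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            let pending = Arc::clone(&pending);
            let regenerated = Arc::clone(&regenerated);
            thread::spawn(move || loop {
                // The lock is held only while receiving, so each message
                // is picked up by exactly one worker.
                let job = rx.lock().unwrap().recv();
                let (repo, _path) = match job {
                    Ok(job) => job,
                    Err(_) => break, // channel closed and drained
                };
                // ... the package at `_path` would be parsed here ...
                let mut counts = pending.lock().unwrap();
                let left = counts.get_mut(&repo).unwrap();
                *left -= 1;
                if *left == 0 {
                    // The queue for this repository is empty: regenerate once.
                    regenerated.lock().unwrap().push(repo);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let done = regenerated.lock().unwrap().clone();
    done
}
```

Because all jobs are queued before the workers start in this sketch, each repository's counter reaches zero exactly once, so three uploads to the same repository trigger a single regeneration instead of three.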

Package queueing

I did have some fun designing the internals of this system. My goal was to have a repository seamlessly handle any number of packages being uploaded, even different versions of the same package. To achieve this I leveraged the database.

Each parsed package's information gets added to the database with a unique monotonically increasing ID. Each repository can only have one version of a package present for each of its architectures. For each package name, the relevant package to add to the package archives is thus the one with the largest ID. This resulted in this (in my opinion rather elegant) query:

SELECT * FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2

For each (repo, arch, name) tuple, we find the largest ID and select that package, but only if its state is not 2, which means pending deletion. Finding which old packages to remove is then a similar query: we instead select all packages that are marked as pending deletion or whose ID is less than that of the currently committed package.
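The same selection logic can be written out in memory, which may make the two queries easier to follow. This is only a sketch: the `Package` struct, the function names `committed` and `stale`, and every state value other than 2 are assumptions made for illustration; in Rieter this work is done by the database.

```rust
use std::collections::HashMap;

/// State value that marks a package as pending deletion (from the query above).
pub const PENDING_DELETION: i32 = 2;

#[derive(Debug, Clone, PartialEq)]
pub struct Package {
    pub id: i64,
    pub repo_id: i64,
    pub arch: String,
    pub name: String,
    pub state: i32,
}

/// Packages that belong in the archives: per (repo, arch, name) the row with
/// the largest ID, kept only if it is not pending deletion.
pub fn committed<'a>(pkgs: &'a [Package], repo_id: i64, archs: &[&str]) -> Vec<&'a Package> {
    // Equivalent of the inner GROUP BY ... MAX(id) subquery.
    let mut newest: HashMap<(i64, &str, &str), &Package> = HashMap::new();
    for p in pkgs {
        let key = (p.repo_id, p.arch.as_str(), p.name.as_str());
        let slot = newest.entry(key).or_insert(p);
        if p.id > slot.id {
            *slot = p;
        }
    }
    // Equivalent of the outer WHERE clause.
    let mut out: Vec<&Package> = newest
        .into_values()
        .filter(|p| {
            p.repo_id == repo_id
                && archs.contains(&p.arch.as_str())
                && p.state != PENDING_DELETION
        })
        .collect();
    out.sort_by_key(|p| p.id);
    out
}

/// Packages that can be removed: pending deletion, or older than the
/// currently committed package with the same (arch, name).
pub fn stale<'a>(pkgs: &'a [Package], repo_id: i64, archs: &[&str]) -> Vec<&'a Package> {
    let current: HashMap<(&str, &str), i64> = committed(pkgs, repo_id, archs)
        .into_iter()
        .map(|p| ((p.arch.as_str(), p.name.as_str()), p.id))
        .collect();
    pkgs.iter()
        .filter(|p| p.repo_id == repo_id)
        .filter(|p| {
            p.state == PENDING_DELETION
                || current
                    .get(&(p.arch.as_str(), p.name.as_str()))
                    .map_or(false, |&id| p.id < id)
        })
        .collect()
}
```

Uploading a newer version of a package simply adds a row with a larger ID; the older row then shows up in `stale` without any explicit bookkeeping.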

This design not only seamlessly supports packages being added in any order; it also lets me update a repository atomically, a feature I'll be using for the repository mirroring system down the line. I'll simply queue new packages and only regenerate the package archives once all packages have successfully synced to the server.

Simplifying things

During my development of the repository system, I realized how complex I was making some things. For example, repositories are grouped into distros, and this structure was also visible in the codebase. Each distro had its own "distro manager" that managed the packages for the repositories in that distro. However, this was a needless overcomplication, as distros are solely an aesthetic feature. Each repository has a unique ID in the database anyway, so this extra level of complexity was completely unnecessary.

Package organisation on disk is also still overly complex right now. Each repository has its own folder with its own packages, but this once again is an excessive layer, as packages have unique IDs anyway. The database tracks which packages are part of which repositories, so I'll switch to storing all packages next to each other instead. This might also pave the way for some cool features down the line, such as staging repositories.

I've been needlessly holding on to how I did things in Vieter, even though I can make completely new choices in Rieter. The file system layout for Rieter doesn't need to resemble Vieter's at all, nor does it need to follow any notion of how Arch repositories usually look. If need be, I can add an export utility to convert the directory structure into a more classic layout, but I shouldn't bother keeping it in mind while developing Rieter.

Configuration

I switched configuration from environment variables and CLI arguments to a dedicated config file. Environment variables and CLI arguments would've been too simplistic for the configuration options I'll be adding later.

api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"

[fs]
type = "local"
data_dir = "./data"

[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"

This will allow me a lot more flexibility in the future.

First release

There's still some polish to be done, but I'm definitely nearing an initial 0.1 release for this project. I'm looking forward to announcing it!

As usual, thanks for reading, and have a nice day.
