---
title: "Progress on concurrent repositories"
date: 2024-06-18
---
During the last devlog I was working on a system for concurrent repositories.
After a lot of trial and error, I've found a design that should work pretty
well, even at larger scales. In doing so, the overall complexity of the system
has actually decreased in several places as well! Let me explain.
## Concurrent repositories
I went through a lot of ideas before settling on the current implementation.
Initially both the parsing of packages and the regeneration of the package
archives happened inside the request handler, without any form of
synchronisation. This had several unwanted effects. For one, multiple
simultaneous uploads could quickly overload the CPU, as they would all be
processed in parallel. They would then also try to regenerate the package
archives in parallel, causing concurrent writes to the same files, which was a
mess of its own. Because all work was
performed inside the request handlers, the time it took for the server to
respond was dependent on how congested the system was, which wasn't acceptable
for me. Something definitely had to change.
My first solution heavily utilized the Tokio async runtime that Rieter is built
on. Each uploaded package would spawn a new task that competed for a
semaphore, allowing me to control how many packages got parsed in parallel.
Important to note here is that the request handler no longer needs to wait
until a package is finished parsing. The parse task is handled asynchronously,
allowing the server to respond immediately with a [`202
Accepted`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202). This
way, clients no longer need to wait unnecessarily long for a task that can be
performed asynchronously on the server. Each parse task would then regenerate
the package archives if it had successfully parsed its package.
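In rough strokes, that first design looked something like the sketch below. This is a simplified reconstruction rather than Rieter's actual code: the helper functions and the `(repo, pkg_path)` arguments are stand-ins for the real parsing and archive logic.

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

use tokio::sync::Semaphore;

// Stand-ins for the real parsing and archive generation logic.
async fn parse_package(_repo: u64, _path: &Path) -> Result<(), ()> {
    Ok(())
}
async fn regenerate_archives(_repo: u64) {}

// Every upload spawns its own task; the semaphore caps how many packages
// are parsed in parallel.
fn queue_parse(semaphore: Arc<Semaphore>, repo: u64, pkg_path: PathBuf) {
    tokio::spawn(async move {
        // Wait for a free slot before doing the CPU-heavy parse.
        let _permit = semaphore
            .acquire_owned()
            .await
            .expect("semaphore should not be closed");

        if parse_package(repo, &pkg_path).await.is_ok() {
            // Every successful parse also regenerated the package archives.
            regenerate_archives(repo).await;
        }
    });

    // The caller returns immediately, so the request handler can respond
    // with `202 Accepted` without waiting for the parse to finish.
}
```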
Because each task regenerated the package archives, this approach performed a
lot of extra work. The constant spawning of Tokio tasks also didn't sit right
with me, so I tried another design, which ended up being the current version.
### Current design
I settled on a much more classic design: worker threads, or rather, Tokio
tasks. On startup, Rieter launches `n` worker tasks that listen for messages on
an [mpsc](https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html) channel. The
receiver is shared between the workers using a mutex, so each message only gets
picked up by one of the workers. Each request handler first writes the
uploaded package to a temporary file, then sends a `(repo, path)` tuple over
the channel, notifying one of the workers that a new package is ready to be
parsed. Each time the queue
for a repository is empty, the package archives get regenerated, effectively
batching this operation. This technique is so much simpler and works wonders.
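As an illustration, a stripped-down version of this setup could look like the sketch below. It assumes an unbounded Tokio channel and placeholder handlers, and it leaves out the per-repository batching of the archive regeneration.

```rust
use std::path::PathBuf;
use std::sync::Arc;

use tokio::sync::{mpsc, Mutex};

// Stand-in for the real per-package work (parsing, database insert, ...).
async fn handle_package(_repo: u64, _path: PathBuf) {}

// Spawn `n` worker tasks that all pull from the same channel. Because the
// receiver sits behind a mutex, each queued package is picked up by exactly
// one worker.
fn spawn_workers(n: usize, rx: mpsc::UnboundedReceiver<(u64, PathBuf)>) {
    let rx = Arc::new(Mutex::new(rx));

    for _ in 0..n {
        let rx = Arc::clone(&rx);

        tokio::spawn(async move {
            loop {
                // The lock is only held while waiting for the next message;
                // it's released again before the package is processed.
                let msg = rx.lock().await.recv().await;

                match msg {
                    Some((repo, path)) => handle_package(repo, path).await,
                    // The channel was closed, so no more packages will arrive.
                    None => break,
                }
            }
        });
    }
}
```

A request handler then only needs the matching sender: after writing the upload to a temporary file, it sends the `(repo, path)` tuple and returns.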
### Package queueing
I did have some fun designing the internals of this system. My goal was to
have a repository seamlessly handle any number of packages being uploaded, even
different versions of the same package. To achieve this I leveraged the
database.
Each parsed package's information gets added to the database with a unique
monotonically increasing ID. Each repository can only have one version of a
package present for each of its architectures. For each package name, the
relevant package to add to the package archives is thus the one with the
largest ID. This resulted in this (in my opinion rather elegant) query:
```sql
SELECT * FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2
```
For each `(repo, arch, name)` tuple, we find the largest ID and select it, but
only if its state is not `2`, which means *pending deletion*. Determining
which old packages to remove then comes down to a similar query, where we
instead select all packages that are marked as *pending deletion* or whose ID
is less than that of the currently committed package.
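A sketch of what that could look like, following the same structure as the query above (not necessarily the exact query Rieter runs):

```sql
SELECT "p1".* FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    WHERE "package"."state" <> 2
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."repo_id" = "p2"."repo_id"
    AND "p1"."arch" = "p2"."arch"
    AND "p1"."name" = "p2"."name"
WHERE "p1"."repo_id" = 1
    AND ("p1"."state" = 2 OR "p1"."id" < "p2"."max_id")
```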
This design not only seamlessly supports packages being added in any order; it
also lets me update a repository atomically, which paves the way for
implementing repository mirroring down the line. I'll simply queue new
packages and only regenerate the
package archives once all packages have successfully synced to the server.
## Simplifying things
During my development of the repository system, I realized how complex I was
making some things. For example, repositories are grouped into distros, and
this structure was also visible in the codebase. Each distro had its own
"distro manager" that managed the packages for its repositories. However, this was a
needless overcomplication, as distros are solely an aesthetic feature. Each
repository has a unique ID in the database anyways, so this extra level of
complexity was completely unnecessary.
Package organisation on disk is also still overly complex right now. Each
repository has its own folder with its own packages, but this once again is an
excessive layer as packages have unique IDs anyways. The database tracks which
packages are part of which repositories, so I'll switch to storing all packages
next to each other instead. This might also pave the way for some cool features
down the line, such as staging repositories.
I've been needlessly holding on to how I've done things with Vieter, while I
can make completely new choices in Rieter. The file system for Rieter doesn't
need to resemble the Vieter file system at all, nor should it follow any notion
of how Arch repositories usually look. If need be, I can add an export utility
to convert the directory structure into a more classic layout, but I shouldn't
bother keeping it in mind while developing Rieter.
## Configuration
I switched configuration from environment variables and CLI arguments to a
dedicated config file, as the former would've been too limited for the
configuration options I'll be adding later on.
```toml
api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"
[fs]
type = "local"
data_dir = "./data"
[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"
```
This will allow me a lot more flexibility in the future.
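For illustration, such a file could map onto Rust structs roughly as sketched below. This assumes serde and the `toml` crate, with field names mirroring the example above; the actual config types in Rieter may well look different.

```rust
use serde::Deserialize;

// Possible shape of the config structs; `type` picks the enum variant.
#[derive(Deserialize)]
struct Config {
    api_key: String,
    pkg_workers: usize,
    log_level: String,
    fs: FsConfig,
    db: DbConfig,
}

#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum FsConfig {
    // Corresponds to `type = "local"` in the [fs] section.
    Local { data_dir: std::path::PathBuf },
}

#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum DbConfig {
    // Corresponds to `type = "postgres"` in the [db] section.
    Postgres {
        host: String,
        db: String,
        user: String,
        password: String,
    },
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The config path here is just a placeholder.
    let raw = std::fs::read_to_string("config.toml")?;
    let config: Config = toml::from_str(&raw)?;

    println!("spawning {} package workers", config.pkg_workers);
    Ok(())
}
```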
## First release
There's still some polish to be done, but I'm definitely nearing an initial 0.1
release for this project. I'm looking forward to announcing it!
As usual, thanks for reading, and have a nice day.