---
title: "Progress on concurrent repositories"
date: 2024-06-18
---

During the last devlog I was working on a system for concurrent repositories.
After a lot of trying, I've found a system that should work pretty well, even
on larger scales. In doing so, the overall complexity of the system has
actually decreased on several points as well! Let me explain.

## Concurrent repositories

I went through a lot of ideas before settling on the current implementation.
Initially, both the parsing of packages and the regeneration of the package
archives happened inside the request handler, without any form of
synchronisation. This had several unwanted effects. For one, multiple packages
could quickly overload the CPU, as they would all be processed in parallel.
These would then also try to generate the package archives in parallel, causing
writes to the same files, which was a mess of its own. Because all work was
performed inside the request handlers, the time it took for the server to
respond depended on how congested the system was, which wasn't acceptable to
me. Something definitely had to change.

My first solution heavily utilized the Tokio async runtime that Rieter is built
on. Each package that gets uploaded would spawn a new task that competes for a
semaphore, allowing me to control how many packages get parsed in parallel.
Important to note here is that the request handler no longer needs to wait
until a package is finished parsing. The parse task is handled asynchronously,
allowing the server to respond immediately with a [`202
Accepted`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202). This
way, clients no longer need to wait unnecessarily long for a task that can be
performed asynchronously on the server. Each parse task would then regenerate
the package archives if it was able to successfully parse a package.
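
For illustration, that first approach looked roughly like the sketch below. The
`parse_package` and `regenerate_archives` helpers are hypothetical stand-ins,
and the whole thing assumes it runs inside a Tokio runtime; it's not Rieter's
actual code.

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

use tokio::sync::Semaphore;

// Hypothetical stand-ins for the real parsing and archive generation logic.
async fn parse_package(_path: &Path) -> Result<(), ()> {
    Ok(())
}

async fn regenerate_archives() {}

// Every upload spawns its own task; the semaphore caps how many packages
// get parsed in parallel.
async fn handle_upload(semaphore: Arc<Semaphore>, path: PathBuf) {
    tokio::spawn(async move {
        // Wait for a permit before doing any CPU-heavy work.
        let _permit = semaphore.acquire_owned().await.unwrap();

        if parse_package(&path).await.is_ok() {
            // Every successful parse regenerated the archives, duplicating work.
            regenerate_archives().await;
        }
    });

    // The handler returns immediately, so the client can get a 202 Accepted
    // without waiting for the parse to finish.
}
```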

Because each task regenerates the package archives, this approach performed a
lot of extra work. The constant spawning of Tokio tasks also didn't sit right
with me, so I tried another design, which ended up being the current version.

### Current design

I settled on a much more classic design: worker threads, or rather, Tokio
tasks. On startup, Rieter launches `n` worker tasks that listen for messages on
an [mpsc](https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html) channel. The
receiver is shared between the workers using a mutex, so each message only gets
picked up by one of the workers. Each request first writes its uploaded package
to a temporary file and sends a tuple `(repo, path)` to the channel, notifying
one of the workers that a new package is ready to be parsed. Each time the
queue for a repository is empty, the package archives get regenerated,
effectively batching this operation. This technique is so much simpler and
works wonders.
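
In broad strokes, the worker setup could look something like the following
sketch. The unbounded channel, the `ParseMessage` alias, and the `i32` repo ID
are assumptions for the example, not Rieter's actual types.

```rust
use std::path::PathBuf;
use std::sync::Arc;

use tokio::sync::{mpsc, Mutex};

// Hypothetical message: the repository's database ID plus the temporary file
// the uploaded package was written to.
type ParseMessage = (i32, PathBuf);

// Spawn `n` worker tasks that all pull from the same channel. The receiver
// sits behind a mutex, so every message is picked up by exactly one worker.
fn spawn_workers(n: usize) -> mpsc::UnboundedSender<ParseMessage> {
    let (tx, rx) = mpsc::unbounded_channel();
    let rx = Arc::new(Mutex::new(rx));

    for _ in 0..n {
        let rx = Arc::clone(&rx);

        tokio::spawn(async move {
            loop {
                // Lock the receiver just long enough to take one message; the
                // guard is dropped at the end of this statement, so other
                // workers can grab the next message while this one works.
                let msg = rx.lock().await.recv().await;

                let Some((repo, path)) = msg else {
                    // Channel closed: no more packages will arrive.
                    break;
                };

                // Parse the package and record it in the database (omitted).
                // Once the queue for this repo runs empty, regenerate its
                // package archives so that work gets batched.
                let _ = (repo, path);
            }
        });
    }

    tx
}
```

The sender half would then live in the server state, so a request handler only
has to write the upload to a temporary file and push a `(repo, path)` message
onto the channel before responding.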

### Package queueing

I did have some fun designing the internals of this system. My goal was to
have a repository seamlessly handle any number of packages being uploaded, even
different versions of the same package. To achieve this I leveraged the
database.

Each parsed package's information gets added to the database with a unique,
monotonically increasing ID. Each repository can only have one version of a
package present for each of its architectures. For each package name, the
relevant package to add to the package archives is thus the one with the
largest ID. This resulted in this (in my opinion rather elegant) query:

```sql
SELECT * FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2
```

For each `(repo, arch, name)` tuple, we find the largest ID and select it, but
only if its state is not `2`, which means *pending deletion*. Knowing which old
packages to remove then comes down to a similar query, where we instead select
all packages that are marked as *pending deletion* or whose ID is less than
that of the currently committed package.
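
That cleanup query isn't shown here, but going by the description it would
presumably look something like the sketch below, treating the newest row that
isn't pending deletion as the currently committed package (the exact `state`
encoding beyond `2` is my assumption):

```sql
-- Sketch only: select the packages that can be removed, i.e. those pending
-- deletion or older than the newest non-deleted entry for the same
-- (repo, arch, name) tuple.
SELECT "p1".* FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("id") AS "max_id"
    FROM "package"
    WHERE "state" <> 2
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."repo_id" = "p2"."repo_id"
    AND "p1"."arch" = "p2"."arch"
    AND "p1"."name" = "p2"."name"
WHERE "p1"."repo_id" = 1
    AND ("p1"."state" = 2 OR "p1"."id" < "p2"."max_id")
```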

This design not only seamlessly supports packages being added in any order; it
also paves the way for implementing repository mirroring down the line. It
allows me to atomically update a repository, a feature that I'll be using for
the mirroring system: I'll simply queue new packages and only regenerate the
package archives once all packages have successfully synced to the server.

## Simplifying things

During my development of the repository system, I realized how complex I was
making some things. For example, repositories are grouped into distros, and
this structure was also visible in the codebase. Each distro had its own
"distro manager" that managed the packages for its repositories. However, this
was a needless overcomplication, as distros are solely an aesthetic feature.
Each repository has a unique ID in the database anyway, so this extra level of
complexity was completely unnecessary.

Package organisation on disk is also still overly complex right now. Each
repository has its own folder with its own packages, but this is once again an
excessive layer, as packages have unique IDs anyway. The database tracks which
packages are part of which repositories, so I'll switch to storing all packages
next to each other instead. This might also pave the way for some cool features
down the line, such as staging repositories.
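
As a rough illustration, a flat layout boils down to deriving a package's path
from nothing but its ID; the directory name and ID type below are made up for
the example.

```rust
use std::path::{Path, PathBuf};

// With a flat layout, a package's location depends only on its database ID;
// which repositories it belongs to lives purely in the database.
fn package_path(data_dir: &Path, package_id: i64) -> PathBuf {
    data_dir.join("pkgs").join(package_id.to_string())
}
```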

I've been needlessly holding on to how I've done things with Vieter, even
though I can make completely new choices in Rieter. Rieter's file layout
doesn't need to resemble Vieter's at all, nor should it follow any notion of
how Arch repositories usually look. If need be, I can add an export utility to
convert the directory structure into a more classic layout, but I shouldn't
bother keeping it in mind while developing Rieter.

## Configuration

I switched configuration from environment variables and CLI arguments to a
dedicated config file. The former would've been too simplistic for the
configuration options I'll be adding later on, so I opted for a config file
instead.

```toml
api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"

[fs]
type = "local"
data_dir = "./data"

[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"
```

This will allow me a lot more flexibility in the future.
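
For what it's worth, a file like this maps naturally onto a couple of plain
structs. The sketch below assumes serde and the `toml` crate, with made-up type
names; Rieter's actual config types may well look different.

```rust
use serde::Deserialize;

// Mirrors the example config above; field names follow the TOML keys.
#[derive(Deserialize)]
struct Config {
    api_key: String,
    pkg_workers: usize,
    log_level: String,
    fs: FsConfig,
    db: DbConfig,
}

#[derive(Deserialize)]
struct FsConfig {
    #[serde(rename = "type")]
    type_: String,
    data_dir: String,
}

#[derive(Deserialize)]
struct DbConfig {
    #[serde(rename = "type")]
    type_: String,
    host: String,
    db: String,
    user: String,
    password: String,
}

fn load_config(path: &std::path::Path) -> Result<Config, Box<dyn std::error::Error>> {
    let contents = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&contents)?)
}
```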

## First release

There's still some polishing to be done, but I'm definitely nearing an initial
0.1 release for this project. I'm looking forward to announcing it!

As usual, thanks for reading, and have a nice day.