---
title: "Progress on concurrent repositories"
date: 2024-06-18
---

During the last devlog I was working on a system for concurrent repositories.
After a lot of trying, I've found a design that should work pretty well, even
at larger scales. In the process, the overall complexity of the system has
actually decreased in several places as well! Let me explain.

## Concurrent repositories

I went through a lot of ideas before settling on the current implementation.
Initially, both the parsing of packages and the regeneration of the package
archives happened inside the request handler, without any form of
synchronisation. This had several unwanted effects. For one, multiple package
uploads could quickly overload the CPU, as they would all be processed in
parallel. These would then also try to generate the package archives in
parallel, causing concurrent writes to the same files, which was a mess of its
own. Because all work was performed inside the request handlers, the time it
took for the server to respond depended on how congested the system was, which
wasn't acceptable to me. Something definitely had to change.

My first solution heavily utilized the Tokio async runtime that Rieter is built
on. Each package that gets uploaded would spawn a new task that competes for a
semaphore, allowing me to control how many packages get parsed in parallel.
Importantly, the request handler no longer needs to wait until a package is
finished parsing. The parse task runs asynchronously, allowing the server to
respond immediately with a
[`202 Accepted`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202).
This way, clients no longer have to wait unnecessarily long for work that can
happen in the background on the server. Each parse task would then regenerate
the package archives if it was able to successfully parse a package.

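Roughly, that first approach looked something like the sketch below. The
function names and the exact handler shape are made up for illustration, not
Rieter's actual code.

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

use tokio::sync::Semaphore;

// Stand-ins for the real parsing and archive generation.
async fn parse_package(_path: &Path) -> Result<(), ()> {
    Ok(())
}

async fn regenerate_archives() {}

// Every upload spawns a parse task; the semaphore caps how many of them run
// at the same time, while the handler itself returns right away so the
// server can answer with 202 Accepted.
async fn handle_upload(semaphore: Arc<Semaphore>, pkg_path: PathBuf) {
    tokio::spawn(async move {
        // Wait for a free slot before doing any heavy work; the permit is
        // released again when it goes out of scope.
        let _permit = semaphore.acquire_owned().await.unwrap();

        if parse_package(&pkg_path).await.is_ok() {
            // In this first design, every successful parse regenerated the
            // package archives.
            regenerate_archives().await;
        }
    });
    // No awaiting the spawned task here: the response goes out immediately.
}
```
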
Because each task regenerates the package archives, this approach performed a
lot of extra work. The constant spawning of Tokio tasks also didn't sit right
with me, so I tried another design, which ended up being the current version.

### Current design

I settled on a much more classic design: worker threads, or rather, Tokio
tasks. On startup, Rieter launches `n` worker tasks that listen for messages on
an [mpsc](https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html) channel. The
receiver is shared between the workers using a mutex, so each message only gets
picked up by one of the workers. Each upload request first writes its package
to a temporary file, then sends a tuple `(repo, path)` to the channel,
notifying one of the workers that a new package is ready to be parsed. Each
time the queue for a repository is empty, the package archives get regenerated,
effectively batching this operation. This technique is so much simpler and
works wonders.

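In code, the worker side of this design looks roughly like the sketch below.
The helper functions are stand-ins and the per-repository bookkeeping is left
out for brevity; it only shows the shared-receiver and batching idea.

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

use tokio::sync::{mpsc, Mutex};

type RepoId = i32;

// Stand-ins for the real work.
async fn parse_package(_repo: RepoId, _path: &Path) {}
async fn regenerate_archives(_repo: RepoId) {}

// On startup, spawn `n` workers that all pull from the same channel.
fn spawn_workers(n: usize) -> mpsc::UnboundedSender<(RepoId, PathBuf)> {
    let (tx, rx) = mpsc::unbounded_channel();
    let rx = Arc::new(Mutex::new(rx));

    for _ in 0..n {
        tokio::spawn(worker(Arc::clone(&rx)));
    }

    tx
}

async fn worker(rx: Arc<Mutex<mpsc::UnboundedReceiver<(RepoId, PathBuf)>>>) {
    loop {
        // The mutex around the receiver ensures every message is picked up by
        // exactly one worker.
        let Some(mut next) = rx.lock().await.recv().await else { break };

        loop {
            let (repo, path) = next;
            parse_package(repo, &path).await;

            // Keep draining the queue; only once it's empty do we regenerate
            // the package archives, batching the expensive part.
            let queued = rx.lock().await.try_recv();
            match queued {
                Ok(msg) => next = msg,
                Err(_) => {
                    regenerate_archives(repo).await;
                    break;
                }
            }
        }
    }
}
```

The lock is only held while pulling a message off the queue, not while parsing,
so the actual parsing still happens in parallel across workers.
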
### Package queueing

I did have some fun designing the internals of this system. My goal was to
have a repository seamlessly handle any number of packages being uploaded, even
different versions of the same package. To achieve this, I leveraged the
database.

Each parsed package's information gets added to the database with a unique
monotonically increasing ID. Each repository can only have one version of a
package present for each of its architectures. For each package name, the
relevant package to add to the package archives is thus the one with the
largest ID. This resulted in this (in my opinion rather elegant) query:

```sql
SELECT * FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2
```

For each `(repo, arch, name)` tuple, we find the largest ID and select it, but
only if its state is not `2`, which means *pending deletion*. Knowing which old
packages to remove is then a similar query, where we instead select all
packages that are marked as *pending deletion* or whose ID is less than that of
the currently committed package.

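Expressed in plain Rust instead of SQL, the selection rule boils down to
something like this (a toy sketch of the logic, not the actual database code):

```rust
use std::collections::HashMap;

// Toy model of a package row in the database.
struct Pkg {
    id: i64,
    arch: String,
    name: String,
    pending_deletion: bool,
}

// For every (arch, name) pair, the committed package is the one with the
// largest ID that isn't pending deletion.
fn committed_ids(pkgs: &[Pkg]) -> HashMap<(&str, &str), i64> {
    let mut newest = HashMap::new();

    for p in pkgs.iter().filter(|p| !p.pending_deletion) {
        let entry = newest
            .entry((p.arch.as_str(), p.name.as_str()))
            .or_insert(p.id);

        if p.id > *entry {
            *entry = p.id;
        }
    }

    newest
}
```

Everything that doesn't survive this selection, that is, older IDs for the same
`(arch, name)` pair and anything pending deletion, is what the cleanup query
removes.
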
This design not only seamlessly supports packages being added in any order; it
also paves the way for implementing repository mirroring down the line, because
it lets me update a repository atomically. I'll simply queue new packages and
only regenerate the package archives once all of them have successfully synced
to the server.

## Simplifying things

During my development of the repository system, I realized how complex I was
making some things. For example, repositories are grouped into distros, and
this structure was also visible in the codebase. Each distro had its own
"distro manager" that managed packages for its repositories. However, this was
a needless overcomplication, as distros are solely an aesthetic feature. Each
repository has a unique ID in the database anyway, so this extra level of
complexity was completely unnecessary.

Package organisation on disk is also still overly complex right now. Each
repository has its own folder with its own packages, but this is once again an
excessive layer, as packages have unique IDs anyway. The database tracks which
packages are part of which repositories, so I'll switch to storing all packages
next to each other instead. This might also pave the way for some cool features
down the line, such as staging repositories.

I've been needlessly holding on to how I did things in Vieter, while I can make
completely new choices in Rieter. The on-disk layout for Rieter doesn't need to
resemble the Vieter layout at all, nor should it follow any notion of how Arch
repositories usually look. If need be, I can add an export utility to convert
the directory structure into a more classic layout, but I shouldn't bother
keeping it in mind while developing Rieter.

## Configuration

I switched configuration from environment variables and CLI arguments to a
dedicated config file. The former would've been too simplistic for the
configuration options I'll be adding later on.

```toml
api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"

[fs]
type = "local"
data_dir = "./data"

[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"
```

This will allow me a lot more flexibility in the future.

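A file like that maps nicely onto plain Rust structs. Here's a rough sketch of
how such a config could be deserialized with serde and the toml crate; the
struct names and field handling are my own illustration, not Rieter's actual
config code.

```rust
use std::path::{Path, PathBuf};

use serde::Deserialize;

// Illustrative config structs matching the file above.
#[derive(Debug, Deserialize)]
struct Config {
    api_key: String,
    pkg_workers: usize,
    log_level: String,
    fs: FsConfig,
    db: DbConfig,
}

#[derive(Debug, Deserialize)]
struct FsConfig {
    #[serde(rename = "type")]
    kind: String,
    data_dir: PathBuf,
}

#[derive(Debug, Deserialize)]
struct DbConfig {
    #[serde(rename = "type")]
    kind: String,
    host: String,
    db: String,
    user: String,
    password: String,
}

// Read and parse the config file from the given path.
fn load_config(path: &Path) -> Result<Config, Box<dyn std::error::Error>> {
    let contents = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&contents)?)
}
```
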
## First release

There's still some polish to be done, but I'm definitely nearing an initial 0.1
release for this project. I'm looking forward to announcing it!

As usual, thanks for reading, and have a nice day.