---
title: "Progress on concurrent repositories"
date: 2024-06-18
---

During the last devlog I was working on a system for concurrent repositories.
After a lot of trying, I've found a system that should work pretty well, even
on larger scales. In doing so, the overall complexity of the system has
actually decreased on several points as well! Let me explain.

## Concurrent repositories

I went through a lot of ideas before settling on the current implementation.
Initially, both the parsing of packages and the regeneration of the package
archives happened inside the request handler, without any form of
synchronisation. This had several unwanted effects. For one, multiple packages
could quickly overload the CPU as they would all be processed in parallel.
These would then also try to generate the package archives in parallel, causing
writes to the same files, which was a mess of its own. Because all work was
performed inside the request handlers, the time it took for the server to
respond depended on how congested the system was, which wasn't acceptable to
me. Something definitely had to change.

My first solution heavily utilized the Tokio async runtime that Rieter is built
on. Each package that gets uploaded would spawn a new task that competes for a
semaphore, allowing me to control how many packages get parsed in parallel.
It's important to note here that the request handler no longer needs to wait
until a package is finished parsing. The parse task is handled asynchronously,
allowing the server to respond immediately with a [`202
Accepted`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202). This
way, clients no longer need to wait unnecessarily long for a task that can be
performed asynchronously on the server. Each parse task would then regenerate
the package archives if it was able to successfully parse a package.

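As a rough sketch of that first approach (not Rieter's actual code; the handler
shape and the `parse_package`/`regenerate_archives` names are made up for
illustration):

```rust
use std::{path::PathBuf, sync::Arc};

use tokio::sync::Semaphore;

// Hypothetical upload handler body: spawn the parse work and return right
// away, so the client can get a 202 Accepted without waiting for it.
async fn handle_upload(semaphore: Arc<Semaphore>, pkg_path: PathBuf) {
    tokio::spawn(async move {
        // The semaphore caps how many packages get parsed in parallel.
        let _permit = semaphore.acquire_owned().await.unwrap();

        // parse_package(&pkg_path).await;
        // regenerate_archives().await;
        let _ = pkg_path;
    });
}
```

The permit is held for the duration of the spawned task, so at most as many
parse tasks run at once as the semaphore has permits.
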
Because each task regenerates the package archives, this approach performed a
lot of extra work. The constant spawning of Tokio tasks also didn't sit right
with me, so I tried another design, which ended up being the current version.

### Current design

I settled on a much more classic design: worker threads, or rather, Tokio
tasks. On startup, Rieter launches `n` worker tasks that listen for messages on
an [mpsc](https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html) channel. The
receiver is shared between the workers using a mutex, so each message only gets
picked up by one of the workers. Each request first writes its package to a
temporary file and sends a tuple `(repo, path)` to the channel, notifying one
of the workers that a new package is ready to be parsed. Each time the queue
for a repository is empty, the package archives get regenerated, effectively
batching this operation. This technique is so much simpler and works wonders.

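To make the pattern concrete, here's a minimal, self-contained sketch of the
idea; the `PkgMsg` type, worker count, and values are made up for the example
and don't mirror Rieter's actual types:

```rust
use std::{path::PathBuf, sync::Arc};

use tokio::sync::{mpsc, Mutex};

// Illustrative message type: the repository's ID and the temporary file the
// package was uploaded to.
type PkgMsg = (i32, PathBuf);

async fn worker(rx: Arc<Mutex<mpsc::Receiver<PkgMsg>>>) {
    loop {
        // The receiver sits behind a mutex, so each message gets picked up by
        // exactly one worker.
        let msg = rx.lock().await.recv().await;

        match msg {
            Some((_repo, _path)) => {
                // Parse the package, add it to the database, and regenerate
                // the package archives once the repo's queue is empty.
            }
            // The channel was closed; no more packages will arrive.
            None => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<PkgMsg>(32);
    let rx = Arc::new(Mutex::new(rx));

    // Launch a fixed number of worker tasks that all share the same receiver.
    let workers: Vec<_> = (0..2)
        .map(|_| tokio::spawn(worker(Arc::clone(&rx))))
        .collect();

    // A request handler writes its upload to a temporary file, then notifies
    // the workers over the channel.
    tx.send((1, PathBuf::from("/tmp/package.pkg.tar.zst")))
        .await
        .unwrap();

    // Dropping the sender closes the channel, letting the workers exit.
    drop(tx);
    for handle in workers {
        handle.await.unwrap();
    }
}
```

Only the `recv()` call happens under the lock; the guard is dropped before the
parsing starts, so the workers still process packages in parallel.
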
### Package queueing

I did have some fun designing the internals of this system. My goal was to
have a repository seamlessly handle any number of packages being uploaded, even
different versions of the same package. To achieve this, I leveraged the
database.

Each parsed package's information gets added to the database with a unique,
monotonically increasing ID. Each repository can only have one version of a
package present for each of its architectures. For each package name, the
relevant package to add to the package archives is thus the one with the
largest ID. This resulted in the following (in my opinion rather elegant) query:

```sql
SELECT * FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."id" = "p2"."max_id"
WHERE "p1"."repo_id" = 1 AND "p1"."arch" IN ('x86_64', 'any') AND "p1"."state" <> 2
```

For each `(repo, arch, name)` tuple, we find the largest ID and select it, but
only if its state is not `2`, which means *pending deletion*. Finding out which
old packages to remove is then done with a similar query, where we instead
select all packages that are marked as *pending deletion* or whose ID is less
than that of the currently committed package.

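Sketching what that cleanup query could look like (based on the description
above rather than the literal query Rieter runs, and reusing the hard-coded
repository ID from the earlier example):

```sql
-- Old rows: anything marked pending deletion (state = 2), or any row that is
-- no longer the newest entry for its (repo_id, arch, name) group.
SELECT "p1".* FROM "package" AS "p1" INNER JOIN (
    SELECT "repo_id", "arch", "name", MAX("package"."id") AS "max_id"
    FROM "package"
    GROUP BY "repo_id", "arch", "name"
) AS "p2" ON "p1"."repo_id" = "p2"."repo_id"
         AND "p1"."arch" = "p2"."arch"
         AND "p1"."name" = "p2"."name"
WHERE "p1"."repo_id" = 1
  AND ("p1"."state" = 2 OR "p1"."id" < "p2"."max_id")
```
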
This design not only seamlessly supports packages being added in any order; it
also paves the way for implementing repository mirroring down the line. It lets
me update a repository atomically, which the mirroring system will rely on:
I'll simply queue new packages and only regenerate the package archives once
all packages have successfully synced to the server.

## Simplifying things

While developing the repository system, I realized how complex I was making
some things. For example, repositories are grouped into distros, and this
structure was also visible in the codebase. Each distro had its own "distro
manager" that managed the packages for its repositories. However, this was a
needless overcomplication, as distros are solely an aesthetic feature. Each
repository has a unique ID in the database anyways, so this extra level of
complexity was completely unnecessary.

Package organisation on disk is also still overly complex right now. Each
repository has its own folder with its own packages, but this once again is an
excessive layer as packages have unique IDs anyways. The database tracks which
packages are part of which repositories, so I'll switch to storing all packages
next to each other instead. This might also pave the way for some cool features
down the line, such as staging repositories.

I've been needlessly holding on to how I've done things with Vieter, while I
can make completely new choices in Rieter. The file system for Rieter doesn't
need to resemble the Vieter file system at all, nor should it follow any notion
of how Arch repositories usually look. If need be, I can add an export utility
to convert the directory structure into a more classic layout, but I shouldn't
bother keeping it in mind while developing Rieter.

## Configuration

I switched configuration from environment variables and CLI arguments to a
dedicated config file. The former would've been too simplistic for the
configuration options I'll be adding later on.

```toml
api_key = "test"
pkg_workers = 2
log_level = "rieterd=debug"

[fs]
type = "local"
data_dir = "./data"

[db]
type = "postgres"
host = "localhost"
db = "rieter"
user = "rieter"
password = "rieter"
```

This will allow me a lot more flexibility in the future.

## First release

There's still some polish to be done, but I'm definitely nearing an initial 0.1
release for this project. I'm looking forward to announcing it!

As usual, thanks for reading, and have a nice day.