From 0897a275eecce750cec968a1ca10731cee75daad Mon Sep 17 00:00:00 2001 From: Chewing_Bever Date: Sat, 14 Oct 2023 21:40:45 +0200 Subject: [PATCH] lander: added post --- config.toml | 2 +- content/links/index.md | 14 +++- content/posts/lander/index.md | 138 ++++++++++++++++++++++++++++++++++ 3 files changed, 149 insertions(+), 5 deletions(-) create mode 100644 content/posts/lander/index.md diff --git a/config.toml b/config.toml index ae1ec38..43b4704 100644 --- a/config.toml +++ b/config.toml @@ -8,7 +8,7 @@ pygmentsUseClasses = true [params] description = "The Rusty Bever" - copyright = "Copyright © 2022 Jef Roosens" + copyright = "Copyright © 2023 Jef Roosens" dark = "auto" highlight = true diff --git a/content/links/index.md b/content/links/index.md index 486252c..4889d12 100644 --- a/content/links/index.md +++ b/content/links/index.md @@ -4,8 +4,8 @@ title: "Links" ### Vieter -Vieter is an implementation of an Arch repository server written in V, combined -with a build system. +An implementation of an Arch repository server combined with a build system, +written in V. * [Source](https://git.rustybever.be/vieter-v/vieter) * [Docs](/docs/vieter) @@ -13,8 +13,8 @@ with a build system. ### Alex -Alex is a Rust program that wraps a Minecraft server process and automates -creating incremental backups. +Minecraft server process wrapper that automates creating (incremental) backups, +written in Rust. * [Source](https://git.rustybever.be/Chewing_Bever/alex) @@ -25,3 +25,9 @@ that I've designed to update the hosted files using POST requests from my CI. * [Backend Source](https://git.rustybever.be/Chewing_Bever/site-backend) * [Blog Source](https://git.rustybever.be/Chewing_Bever/site) + +## Lander + +My home-grown URL shortener & pastebin, written from the ground up in C. 
+ +* [Source](https://git.rustybever.be/Chewing_Bever/lander) diff --git a/content/posts/lander/index.md b/content/posts/lander/index.md new file mode 100644 index 0000000..acd2631 --- /dev/null +++ b/content/posts/lander/index.md @@ -0,0 +1,138 @@ +--- +title: "Designing my own URL shortener" +date: 2023-10-14 +--- + +One of the projects I've always found to be a good choice for a side project is +a URL shortener. The core idea is simple and fairly easy to implement, yet it +allows for a lot of creativity in how you implement it. Once you're done with +the core idea, you can start expanding the project as you wish: expiring links, +password protection, or perhaps a management API. The possibilities are +endless! + +Naturally, this post talks about my own version of a URL shortener: +[Lander](https://git.rustybever.be/Chewing_Bever/lander). In order to add some +extra challenge to the project, I've chosen to write it from the ground up in C +by implementing my own event loop, and building an HTTP server on top to use as +the base for the URL shortener. + +## The event loop + +Lander consists of three layers: the event loop, the HTTP loop and finally the +Lander-specific code. Each of these layers utilizes the layer below it, with +the event loop being the bottom-most layer. This layer directly interacts with +the networking stack and ensures bytes are received from and written to the +client. The book [Build Your Own Redis](https://build-your-own.org/redis/) by +James Smith was an excellent starting point, and I highly recommend checking it +out! This book taught me everything I needed to know to start this project. + +Now for a slightly more technical dive into the inner workings of the event +loop. The event loop is the layer that listens on the listening TCP socket for +incoming connections and directly processes requests. In each iteration of the +event loop, the following steps are taken: + +1. For each of the open connections: + 1. Perform network I/O + 2. 
Execute data processing code, provided by the upper layers + 3. Close finished connections +2. Accept a new connection if needed + +The event loop runs on a single thread, and constantly goes through this cycle +to process requests. Here, the "data processing code" is a set of function +pointers passed to the event loop that get executed at specific times. This is +how the HTTP loop is able to inject its functionality into the event loop. + +In the event loop, a connection can be in one of three states: `request`, +`response`, or `end`. In `request` mode, the event loop tries to read incoming +data from the client into the read buffer. This read buffer is then used by the +data processing code's data handler. In `response` mode, the data processing +code's data writer is called, which populates the write buffer. This buffer is +then written to the connection socket. Finally, the `end` state simply tells +the event loop that the connection should be closed without any further +processing. A connection can switch between `request` and `response` mode as +many times as needed, allowing connections to be reused for multiple requests +from the same client. + +The event loop provides all the necessary building blocks needed to build a +client-server type application. These are then used to implement the next +layer: the HTTP loop. + +## The HTTP loop + +Before we can design a specific HTTP-based application, we need a base to build +on. This base is the HTTP loop. It handles both serializing and deserializing +of HTTP requests & responses, along with providing commonly used functionality, +such as bearer authentication and reading & writing files to & from disk. The +request parser is provided by the excellent +[picohttpparser](https://github.com/h2o/picohttpparser) library. The parsed +request is stored in the request's data struct, providing access to this data +for all necessary functions. 
+ +The HTTP loop defines a request handler function which is passed to the event +loop as the data handler. This function first tries to parse the request, +before routing it accordingly. For routing, literal string matches or +RegEx-based routing is available. + +Each route consists of one or more steps. Each of these steps is a function +that tries to advance the processing of the current request. The return value +of these steps tells the HTTP loop whether the step has finished its task or if +it's still waiting for I/O. The latter instructs the HTTP loop to skip this +request for now, delaying its processing until the next cycle of the HTTP loop. +In each cycle of the HTTP loop (or rather, the event loop), a request will try +to advance its processing by as much as possible by executing as many steps as +possible, in order. This means that very small requests can be completely +processed within a single cycle of the HTTP loop. Common functionality is +provided as predefined steps. One example is the `http_loop_step_body_to_buf` +step, which reads the request body into a buffer. + +The HTTP loop also provides the data writer functionality, which will stream an +HTTP response to the write buffer. The contents of the response are tracked in +the request's data struct, and these data structs are recycled between requests +using the same connection, preventing unnecessary allocations. + +## Lander + +Above the HTTP loop layer, we finally reach the code specific to Lander. It +might not surprise you that this layer is the smallest of the three, as the +abstractions below allow it to focus on the task at hand: serving and storing +HTTP redirects (and pastes). The way these are stored however is, in my +opinion, rather interesting. + +For our Algorithms & Datastructures 3 course, we had to design three different +trie implementations in C: a Patricia trie, a ternary trie and a "custom" trie, +where we were allowed to experiment with different ideas. 
For those unfamiliar, +a trie is a tree-like data structure used for storing strings. The keys used in +this tree are the strings themselves, with each character causing the tree to +branch off. Each string is stored at depth `m`, with `m` being the length of +the string. This also means that the search depth of a string is not bounded by +the size of the trie, but rather the size of the string! This allows for +extremely fast lookup times for short keys, even if we have a large number of +entries. + +My design ended up being a combination of both a Patricia and a ternary trie: a +ternary trie that supports skips the way a Patricia trie does. I ended up +taking this final design and modifying it for this project by optimising it (or +at least trying to) for shorter keys. This trie structure is stored completely in +memory, allowing for very low response times for redirects. Pastes are served +from disk, but their lookup is also performed using the same in-memory trie. + +## What's next? + +Hopefully the above explanation provides some insight into the inner workings +of Lander. For those interested, the source code is of course available +[here](https://git.rustybever.be/Chewing_Bever/lander). I'm not quite done with +this project though. + +My current vision is to have Lander be my personal URL shortener, pastebin & +file-sharing service. Considering a pastebin is basically a file-sharing +service for text files specifically, I'd like to combine these into a single +concept. The goal is to rework the storage system to support arbitrarily large +files, and to allow storing generic metadata for each entry. The initial +use case for this metadata would be storing the content type for uploaded files, +allowing this header to be correctly served when retrieving the files. This +combined with supporting large files turns Lander into a WeTransfer +alternative! Besides this, password protection and expiration of pastes are on +my to-do list as well. 
The data structure currently doesn't support removing +elements either, so this would need to be added in order to support expiration. + +Hopefully a follow-up post announcing these changes will come soon ;)