site/content/posts/lander/index.md

7.5 KiB

title date
Designing my own URL shortener 2023-10-14

One of the projects I've always found to be a good choice for a side project is a URL shortener. The core idea is simple and fairly easily to implement, yet it allows for a lot of creativity in how you implement it. Once you're done with the core idea, you can start expanding the project as you wish: expiring links, password protection, or perhaps a management API. The possibilities are endless!

Naturally, this post talks about my own version of a URL shortener: Lander. In order to add some extra challenge to the project, I've chosen to write it from the ground up in C by implementing my own event loop, and building an HTTP server on top to use as the base for the URL shortener.

The event loop

Lander consists of three layers: the event loop, the HTTP loop and finally the Lander-specific code. Each of these layers utilizes the layer below it, with the event loop being the bottom-most layer. This layer directly interacts with the networking stack and ensures bytes are received from and written to the client. The book Build Your Own Redis by James Smith was an excellent starting point, and I highly recommend checking it out! This book taught me everything I needed to know to start this project.

Now for a slightly more techical dive into the inner workings of the event loop. The event loop is the layer that listens on the listening TCP socket for incoming connections and directly processes requests. In each iteration of the event loop, the following steps are taken:

  1. For each of the open connections:
    1. Perform network I/O
    2. Execute data processing code, provided by the upper layers
    3. Close finished connections
  2. Accept a new connection if needed

The event loop runs on a single thread, and constantly goes through this cycle to process requests. Here, the "data processing code" is a set of function pointers passed to the event loop that get executed at specific times. This is how the HTTP loop is able to inject its functionality into the event loop.

In the event loop, a connection can be in one of three states: request, response, or end. In request mode, the event loop tries to read incoming data from the client into the read buffer. This read buffer is then used by the data processing code's data handler. In response mode, the data processing code's data writer is called, which populates the write buffer. This buffer is then written to the connection socket. Finally, the end state simply tells the event loop that the connection should be closed without any further processing. A connection can switch between request and response mode as many times as needed, allowing connections to be reused for multiple requests from the same client.

The event loop provides all the necessary building blocks needed to build a client-server type application. These are then used to implement the next layer: the HTTP loop.

The HTTP loop

Before we can design a specific HTTP-based application, we need a base to build on. This base is the HTTP loop. It handles both serializing and deserializing of HTTP requests & responses, along with providing commonly used functionality, such as bearer authentication and reading & writing files to & from disk. The request parser is provided by the excellent picohttpparser library. The parsed request is stored in the request's data struct, providing access to this data for all necessary functions.

The HTTP loop defines a request handler function which is passed to the event loop as the data handler. This function first tries to parse the request, before routing it accordingly. For routing, literal string matches or RegEx-based routing is available.

Each route consists of one or more steps. Each of these steps is a function that tries to advance the processing of the current request. The return value of these steps tells the HTTP loop whether the step has finished its task or if it's still waiting for I/O. The latter instructs the HTTP loop to skip this request for now, delaying its processing until the next cycle of the HTTP loop. In each cycle of the HTTP loop (or rather, the event loop), a request will try to advance its processing by as much as possible by executing as many steps as possible, in order. This means that very small requests can be completely processed within a single cycle of the HTTP loop. Common functionality is provided as predefined steps. One example is the http_loop_step_body_to_buf step, which reads the request body into a buffer.

The HTTP loop also provides the data writer functionality, which will stream an HTTP response to the write buffer. The contents of the response are tracked in the request's data struct, and these data structs are recycled between requests using the same connection, preventing unnecessary allocations.

Lander

Above the HTTP loop layer, we finally reach the code specific to Lander. It might not surprise you that this layer is the smallest of the three, as the abstractions below allow it to focus on the task at hand: serving and storing HTTP redirects (and pastes). The way these are stored however is, in my opinion, rather interesting.

For our Algorithms & Datastructures 3 course, we had to design three different trie implementations in C: a Patricia trie, a ternary trie and a "custom" trie, where we were allowed to experiment with different ideas. For those unfamiliar, a trie is a tree-like datastructure used for storing strings. The keys used in this tree are the strings themselves, with each character causing the tree to branch off. Each string is stored at depth m, with m being the length of the string. This also means that the search depth of a string is not bounded by the size of the trie, but rather the size of the string! This allows for extremely fast lookup times for short keys, even if we have a large number of entries.

My design ended up being a combination of both a Patricia and a ternary trie: a ternary trie that supports skips the way a Patricia trie does. I ended up taking this final design and modifying it for this project by optimising it (or at least try to) for shorter keys. This trie structure is stored completely in memory, allowing for very low response times for redirects. Pastes are served from disk, but their lookup is also performed using the same in-memory trie.

What's next?

Hopefully the above explanation provides some insight into the inner workings of Lander. For those interested, the source code is of course available here. I'm not quite done with this project though.

My current vision is to have Lander be my personal URL shortener, pastebin & file-sharing service. Considering a pastebin is basically a file-sharing service for text files specifically, I'd like to combine these into a single concept. The goal is to rework the storage system to support arbitrarily large files, and to allow storing generic metadata for each entry. The initial usecase for this metadata would be storing the content type for uploaded files, allowing this header to be correctly served when retrieving the files. This combined with supporting large files turns Lander into a WeTransfer alternative! Besides this, password protection and expiration of pastes is on my to-do list as well. The data structure currently doesn't support removing elements either, so this would need to be added in order to support expiration.

Hopefully a follow-up post announcing these changes will come soon ;)