site/content/posts/lander/index.md

139 lines
7.5 KiB
Markdown

---
title: "Designing my own URL shortener"
date: 2023-10-14
---
One of the projects I've always found to be a good choice for a side project is
a URL shortener. The core idea is simple and fairly easily to implement, yet it
allows for a lot of creativity in how you implement it. Once you're done with
the core idea, you can start expanding the project as you wish: expiring links,
password protection, or perhaps a management API. The possibilities are
endless!
Naturally, this post talks about my own version of a URL shortener:
[Lander](https://git.rustybever.be/Chewing_Bever/lander). In order to add some
extra challenge to the project, I've chosen to write it from the ground up in C
by implementing my own event loop, and building an HTTP server on top to use as
the base for the URL shortener.
## The event loop
Lander consists of three layers: the event loop, the HTTP loop and finally the
Lander-specific code. Each of these layers utilizes the layer below it, with
the event loop being the bottom-most layer. This layer directly interacts with
the networking stack and ensures bytes are received from and written to the
client. The book [Build Your Own Redis](https://build-your-own.org/redis/) by
James Smith was an excellent starting point, and I highly recommend checking it
out! This book taught me everything I needed to know to start this project.
Now for a slightly more techical dive into the inner workings of the event
loop. The event loop is the layer that listens on the listening TCP socket for
incoming connections and directly processes requests. In each iteration of the
event loop, the following steps are taken:
1. For each of the open connections:
1. Perform network I/O
2. Execute data processing code, provided by the upper layers
3. Close finished connections
2. Accept a new connection if needed
The event loop runs on a single thread, and constantly goes through this cycle
to process requests. Here, the "data processing code" is a set of function
pointers passed to the event loop that get executed at specific times. This is
how the HTTP loop is able to inject its functionality into the event loop.
In the event loop, a connection can be in one of three states: `request`,
`response`, or `end`. In `request` mode, the event loop tries to read incoming
data from the client into the read buffer. This read buffer is then used by the
data processing code's data handler. In `response` mode, the data processing
code's data writer is called, which populates the write buffer. This buffer is
then written to the connection socket. Finally, the `end` state simply tells
the event loop that the connection should be closed without any further
processing. A connection can switch between `request` and `response` mode as
many times as needed, allowing connections to be reused for multiple requests
from the same client.
The event loop provides all the necessary building blocks needed to build a
client-server type application. These are then used to implement the next
layer: the HTTP loop.
## The HTTP loop
Before we can design a specific HTTP-based application, we need a base to build
on. This base is the HTTP loop. It handles both serializing and deserializing
of HTTP requests & responses, along with providing commonly used functionality,
such as bearer authentication and reading & writing files to & from disk. The
request parser is provided by the excellent
[picohttpparser](https://github.com/h2o/picohttpparser) library. The parsed
request is stored in the request's data struct, providing access to this data
for all necessary functions.
The HTTP loop defines a request handler function which is passed to the event
loop as the data handler. This function first tries to parse the request,
before routing it accordingly. For routing, literal string matches or
RegEx-based routing is available.
Each route consists of one or more steps. Each of these steps is a function
that tries to advance the processing of the current request. The return value
of these steps tells the HTTP loop whether the step has finished its task or if
it's still waiting for I/O. The latter instructs the HTTP loop to skip this
request for now, delaying its processing until the next cycle of the HTTP loop.
In each cycle of the HTTP loop (or rather, the event loop), a request will try
to advance its processing by as much as possible by executing as many steps as
possible, in order. This means that very small requests can be completely
processed within a single cycle of the HTTP loop. Common functionality is
provided as predefined steps. One example is the `http_loop_step_body_to_buf`
step, which reads the request body into a buffer.
The HTTP loop also provides the data writer functionality, which will stream an
HTTP response to the write buffer. The contents of the response are tracked in
the request's data struct, and these data structs are recycled between requests
using the same connection, preventing unnecessary allocations.
## Lander
Above the HTTP loop layer, we finally reach the code specific to Lander. It
might not surprise you that this layer is the smallest of the three, as the
abstractions below allow it to focus on the task at hand: serving and storing
HTTP redirects (and pastes). The way these are stored however is, in my
opinion, rather interesting.
For our Algorithms & Datastructures 3 course, we had to design three different
trie implementations in C: a Patricia trie, a ternary trie and a "custom" trie,
where we were allowed to experiment with different ideas. For those unfamiliar,
a trie is a tree-like datastructure used for storing strings. The keys used in
this tree are the strings themselves, with each character causing the tree to
branch off. Each string is stored at depth `m`, with `m` being the length of
the string. This also means that the search depth of a string is not bounded by
the size of the trie, but rather the size of the string! This allows for
extremely fast lookup times for short keys, even if we have a large number of
entries.
My design ended up being a combination of both a Patricia and a ternary trie: a
ternary trie that supports skips the way a Patricia trie does. I ended up
taking this final design and modifying it for this project by optimising it (or
at least try to) for shorter keys. This trie structure is stored completely in
memory, allowing for very low response times for redirects. Pastes are served
from disk, but their lookup is also performed using the same in-memory trie.
## What's next?
Hopefully the above explanation provides some insight into the inner workings
of Lander. For those interested, the source code is of course available
[here](https://git.rustybever.be/Chewing_Bever/lander). I'm not quite done with
this project though.
My current vision is to have Lander be my personal URL shortener, pastebin &
file-sharing service. Considering a pastebin is basically a file-sharing
service for text files specifically, I'd like to combine these into a single
concept. The goal is to rework the storage system to support arbitrarily large
files, and to allow storing generic metadata for each entry. The initial
usecase for this metadata would be storing the content type for uploaded files,
allowing this header to be correctly served when retrieving the files. This
combined with supporting large files turns Lander into a WeTransfer
alternative! Besides this, password protection and expiration of pastes is on
my to-do list as well. The data structure currently doesn't support removing
elements either, so this would need to be added in order to support expiration.
Hopefully a follow-up post announcing these changes will come soon ;)