---
title: "Designing my own URL shortener"
date: 2023-10-14
---

One of the projects I've always found to be a good choice for a side project is
a URL shortener. The core idea is simple and fairly easy to implement, yet it
allows for a lot of creativity in how you implement it. Once you're done with
the core idea, you can start expanding the project as you wish: expiring links,
password protection, or perhaps a management API. The possibilities are
endless!

Naturally, this post talks about my own version of a URL shortener:
[Lander](https://git.rustybever.be/Chewing_Bever/lander). In order to add some
extra challenge to the project, I've chosen to write it from the ground up in C
by implementing my own event loop, and building an HTTP server on top to use as
the base for the URL shortener.

## The event loop

Lander consists of three layers: the event loop, the HTTP loop and finally the
Lander-specific code. Each of these layers utilizes the layer below it, with
the event loop being the bottom-most layer. This layer directly interacts with
the networking stack and ensures bytes are received from and written to the
client. The book [Build Your Own Redis](https://build-your-own.org/redis/) by
James Smith was an excellent starting point, and I highly recommend checking it
out! This book taught me everything I needed to know to start this project.

Now for a slightly more technical dive into the inner workings of the event
loop. The event loop is the layer that listens on the TCP socket for incoming
connections and directly processes requests. In each iteration of the event
loop, the following steps are taken:

1. For each of the open connections:
    1. Perform network I/O
    2. Execute data processing code, provided by the upper layers
    3. Close finished connections
2. Accept a new connection if needed

The event loop runs on a single thread, and constantly goes through this cycle
to process requests. Here, the "data processing code" is a set of function
pointers passed to the event loop that get executed at specific times. This is
how the HTTP loop is able to inject its functionality into the event loop.

In the event loop, a connection can be in one of three states: `request`,
`response`, or `end`. In `request` mode, the event loop tries to read incoming
data from the client into the read buffer. This read buffer is then used by the
data processing code's data handler. In `response` mode, the data processing
code's data writer is called, which populates the write buffer. This buffer is
then written to the connection socket. Finally, the `end` state simply tells
the event loop that the connection should be closed without any further
processing. A connection can switch between `request` and `response` mode as
many times as needed, allowing connections to be reused for multiple requests
from the same client.

The event loop provides all the necessary building blocks needed to build a
client-server type application. These are then used to implement the next
layer: the HTTP loop.

## The HTTP loop

Before we can design a specific HTTP-based application, we need a base to build
on. This base is the HTTP loop. It handles both serializing and deserializing
of HTTP requests & responses, along with providing commonly used functionality,
such as bearer authentication and reading & writing files to & from disk. The
request parser is provided by the excellent
[picohttpparser](https://github.com/h2o/picohttpparser) library. The parsed
request is stored in the request's data struct, providing access to this data
for all necessary functions.

The HTTP loop defines a request handler function which is passed to the event
loop as the data handler. This function first tries to parse the request,
before routing it accordingly. For routing, both literal string matching and
regex-based routing are available.

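The two kinds of routing could look roughly like the sketch below. It uses the
POSIX `<regex.h>` API for the regex routes, which is an assumption on my part
for the sake of the example; the struct and function names (`struct route`,
`route_compile`, `route_matches`, `route_find`) are hypothetical as well.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum route_type { ROUTE_LITERAL, ROUTE_REGEX };

/* Hypothetical route description. */
struct route {
    enum route_type type;
    const char *pattern;
    regex_t compiled; /* only valid for ROUTE_REGEX after route_compile() */
};

/* Compile the pattern of a regex route; literal routes need no preparation. */
bool route_compile(struct route *r) {
    if (r->type != ROUTE_REGEX) return true;
    return regcomp(&r->compiled, r->pattern, REG_EXTENDED | REG_NOSUB) == 0;
}

/* Check whether a single route matches the request path. */
bool route_matches(const struct route *r, const char *path) {
    if (r->type == ROUTE_LITERAL) return strcmp(r->pattern, path) == 0;
    return regexec(&r->compiled, path, 0, NULL, 0) == 0;
}

/* Return the index of the first matching route, or -1 if none matches. */
int route_find(const struct route *routes, size_t count, const char *path) {
    for (size_t i = 0; i < count; i++) {
        if (route_matches(&routes[i], path)) return (int)i;
    }
    return -1;
}
```

Routes are tried in order, so a literal route for `/` placed before a regex
route such as `^/[A-Za-z0-9]+$` takes precedence when both could match.
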
Each route consists of one or more steps. Each of these steps is a function
that tries to advance the processing of the current request. The return value
of these steps tells the HTTP loop whether the step has finished its task or if
it's still waiting for I/O. The latter instructs the HTTP loop to skip this
request for now, delaying its processing until the next cycle of the HTTP loop.
In each cycle of the HTTP loop (or rather, the event loop), a request will try
to advance its processing by as much as possible by executing as many steps as
possible, in order. This means that very small requests can be completely
processed within a single cycle of the HTTP loop. Common functionality is
provided as predefined steps. One example is the `http_loop_step_body_to_buf`
step, which reads the request body into a buffer.

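The step mechanism described above can be sketched as follows. Everything in
this block is hypothetical (`step_fn`, `struct route_steps`, `struct request`,
`request_advance` and the two example steps) and only meant to show the idea
of running steps in order until one reports that it is still waiting for I/O.

```c
#include <stdbool.h>
#include <stddef.h>

struct request;

/* A step returns true when it has finished, false when it waits for I/O. */
typedef bool (*step_fn)(struct request *req);

/* Hypothetical route: an ordered list of steps. */
struct route_steps {
    const step_fn *steps;
    size_t count;
};

/* Hypothetical per-request state. */
struct request {
    const struct route_steps *route;
    size_t current_step;
    size_t body_expected; /* bytes announced by the client */
    size_t body_received; /* bytes read so far */
    bool responded;
};

/* Run as many steps as possible; returns true once all steps are done. */
bool request_advance(struct request *req) {
    while (req->current_step < req->route->count) {
        if (!req->route->steps[req->current_step](req)) {
            return false; /* waiting for I/O: retry this step next cycle */
        }
        req->current_step++;
    }
    return true;
}

/* Example step: finished only once the full body has arrived. */
bool step_wait_for_body(struct request *req) {
    return req->body_received >= req->body_expected;
}

/* Example step: never needs to wait. */
bool step_respond(struct request *req) {
    req->responded = true;
    return true;
}
```

A request without a body passes both example steps in a single call to
`request_advance`, which mirrors how small requests can be handled within one
cycle.
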
The HTTP loop also provides the data writer functionality, which will stream an
HTTP response to the write buffer. The contents of the response are tracked in
the request's data struct, and these data structs are recycled between requests
using the same connection, preventing unnecessary allocations.

## Lander

Above the HTTP loop layer, we finally reach the code specific to Lander. It
might not surprise you that this layer is the smallest of the three, as the
abstractions below allow it to focus on the task at hand: serving and storing
HTTP redirects (and pastes). The way these are stored however is, in my
opinion, rather interesting.

For our Algorithms & Datastructures 3 course, we had to design three different
trie implementations in C: a Patricia trie, a ternary trie and a "custom" trie,
where we were allowed to experiment with different ideas. For those unfamiliar,
a trie is a tree-like data structure used for storing strings. The keys used in
this tree are the strings themselves, with each character causing the tree to
branch off. Each string is stored at depth `m`, with `m` being the length of
the string. This also means that the search depth of a string is bounded not by
the size of the trie, but by the length of the string! This allows for
extremely fast lookup times for short keys, even if we have a large number of
entries.

My design ended up being a combination of both a Patricia and a ternary trie: a
ternary trie that supports skips the way a Patricia trie does. I ended up
taking this final design and modifying it for this project by optimising it (or
at least trying to) for shorter keys. This trie structure is stored completely
in memory, allowing for very low response times for redirects. Pastes are
served from disk, but their lookup is also performed using the same in-memory
trie.

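For readers who have never seen one, below is a minimal sketch of a plain
ternary trie that maps short keys to URLs. It deliberately leaves out the
Patricia-style skips and the short-key optimisations mentioned above, so it is
not the structure Lander uses. The names `struct tnode`, `ttrie_insert`,
`ttrie_search` and `ttrie_free` are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Node of a plain ternary trie: one character plus three children. */
struct tnode {
    char ch;
    const char *value; /* non-NULL if a key ends at this node */
    struct tnode *lo;  /* keys whose current character is smaller than ch */
    struct tnode *eq;  /* keys that continue after matching ch */
    struct tnode *hi;  /* keys whose current character is larger than ch */
};

static struct tnode *tnode_new(char ch) {
    struct tnode *n = calloc(1, sizeof(*n));
    if (n != NULL) n->ch = ch;
    return n;
}

/* Insert a non-empty key; the value pointer is stored, not copied. */
bool ttrie_insert(struct tnode **root, const char *key, const char *value) {
    if (*key == '\0') return false;
    struct tnode **cur = root;
    for (;;) {
        if (*cur == NULL) {
            *cur = tnode_new(*key);
            if (*cur == NULL) return false; /* out of memory */
        }
        struct tnode *n = *cur;
        if (*key < n->ch) {
            cur = &n->lo;
        } else if (*key > n->ch) {
            cur = &n->hi;
        } else if (key[1] == '\0') {
            n->value = value; /* last character: the key ends here */
            return true;
        } else {
            cur = &n->eq; /* character matched: move on to the next one */
            key++;
        }
    }
}

/* Look up a key; returns the stored value or NULL if the key is absent. */
const char *ttrie_search(const struct tnode *n, const char *key) {
    if (*key == '\0') return NULL;
    while (n != NULL) {
        if (*key < n->ch) {
            n = n->lo;
        } else if (*key > n->ch) {
            n = n->hi;
        } else if (key[1] == '\0') {
            return n->value;
        } else {
            n = n->eq;
            key++;
        }
    }
    return NULL;
}

/* Free all nodes (stored values are owned by the caller). */
void ttrie_free(struct tnode *n) {
    if (n == NULL) return;
    ttrie_free(n->lo);
    ttrie_free(n->eq);
    ttrie_free(n->hi);
    free(n);
}
```

A lookup follows exactly one `eq` link per matched character, so the number of
those steps depends on the key length. The `lo` and `hi` links add some extra
steps depending on which other characters share a prefix with the key.
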
## What's next?

Hopefully the above explanation provides some insight into the inner workings
of Lander. For those interested, the source code is of course available
[here](https://git.rustybever.be/Chewing_Bever/lander). I'm not quite done with
this project though.

My current vision is to have Lander be my personal URL shortener, pastebin &
file-sharing service. Considering a pastebin is basically a file-sharing
service for text files specifically, I'd like to combine these into a single
concept. The goal is to rework the storage system to support arbitrarily large
files, and to allow storing generic metadata for each entry. The initial
use case for this metadata would be storing the content type for uploaded
files, allowing this header to be correctly served when retrieving the files.
This combined with supporting large files turns Lander into a WeTransfer
alternative! Besides this, password protection and expiration of pastes are on
my to-do list as well. The data structure currently doesn't support removing
elements either, so this would need to be added in order to support expiration.

Hopefully a follow-up post announcing these changes will come soon ;)