
Archivist: A post-mortem

Finn O'Leary

In the wake of 2017, I had several Firefox instances spread across several machines. Because of the instant search feature, I use browser history instead of bookmarks. Since this is history-related information, I wanted to both store the dates and websites I had accessed on my machines chronologically (even if they were duplicates), and quickly access and search through them later. Firefox, for efficiency reasons, has a maximum limit on the amount of website history that it keeps in its SQLite database. I felt that this wasn't enough.

At the same time, I wished that my Slack messages could be preserved: I had written some good arguments on some topics, and I figured that the data might be useful to a later version of myself. I like digging through old computer directories and IRC logs; it helps me remember who I was, creates a sense of where I am going, and serves as a supplement to my memory (which turns out to be pretty bad). I also figured it would be really neat if I could one day plot my word usage by date.

In general, I feel that too much useful information is lost. At the same time, I don't think that Google, Amazon, etc. are good parties to trust with that information. It would be useful if I could pull this data down, both in real time and in single bulk copies, from all sorts of different sources, and create One Authoritative Source, so I wouldn't have to trust Slack, Facebook, etc. to hold this data for me in what are usually not very useful or searchable formats.

Thus Archivist was born.

In early/mid 2018, I decided to build this beast. Because of the difficulty of accessing different sources, I figured it would be useful to have a central structure: a single program (what I'll call the Overseer, I guess) that accesses the database, plus several clients that process, understand, and map the data into useful information, and then send that information on to the Overseer. It was at around this point that I sketched out a rough (fluid) structure for it. At the time I wasn't sure whether it was better to have multiple processes under the Overseer name owning or writing to different 'database files', or whether it was better to have a single Overseer that owns *all* the database files.

Basically, the structure I had drawn up was something that looked like this:
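    firefox client     slack client     ...other clients
           \                 |                 /
            \                |                /
             +---------> Overseer <----------+
                       (owns the DB)
                             |
                   archive database file(s)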

Because I'm a fan of software that is written once, bug-free, and that needs minimal maintenance, I chose C/POSIX as the base for the software (yes, I'm aware of the potential irony here). However, after reading a lot about UNIX/POSIX IPC mechanisms, I realised that because it's all essentially plain text, I would probably need some form of serialization between the clients and the Overseer. At around the same time, I also discovered that hard disks and kernels are not reliable with regard to saving state to disk -- for example, the OS X kernel is known to lie about having written data to disk when it has not. SSDs are likewise known to lie about having written data to disk when it has only reached their relatively ephemeral cache. After toying around with some ideas (notably with ZeroMQ and nng), I decided to switch to Erlang, because it supports both message passing and UNIX IPC, which I felt would make the job easier.
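To give a feel for that shape, here's a minimal sketch (not the real Archivist code -- the module, message, and file names are invented for illustration) of the Overseer as an Erlang gen_server that owns a single log file, with clients firing entries at it as messages:

    %% Minimal illustrative sketch of an Overseer process -- not the real
    %% Archivist code. Clients call overseer:archive/3, and the Overseer is
    %% the only process that ever touches the log file it owns.
    -module(overseer).
    -behaviour(gen_server).

    -export([start_link/1, archive/3]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link(LogFile) ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, LogFile, []).

    %% Client-facing API: send one entry (source, unix timestamp, payload)
    %% to the Overseer as an asynchronous message.
    archive(Source, Timestamp, Payload) ->
        gen_server:cast(?MODULE, {archive, Source, Timestamp, Payload}).

    init(LogFile) ->
        {ok, Fd} = file:open(LogFile, [append]),
        {ok, Fd}.

    handle_cast({archive, Source, Timestamp, Payload}, Fd) ->
        %% Serialization is kept deliberately dumb here: tab-separated lines.
        Line = io_lib:format("~s\t~B\t~s~n", [Source, Timestamp, Payload]),
        ok = file:write(Fd, Line),
        %% Ask the OS to flush -- with all the caveats about lying disks above.
        ok = file:sync(Fd),
        {noreply, Fd};
    handle_cast(_Other, Fd) ->
        {noreply, Fd}.

    handle_call(_Request, _From, Fd) ->
        {reply, ok, Fd}.

A Firefox or Slack client then just calls something like overseer:archive("slack", 1514764800, Text) and lets the Overseer worry about the file.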

At around this time (mid to late 2018), I was also deciding what archive format to store the data in. At the beginning I thought plain text would be appropriate, but after considering the potential size of entries, I decided to look into compression formats. I wondered if there was a good form of archive that guaranteed protection against bit rot, etc. After doing a 'deep dive' into the academic work, it turns out that there really isn't "one good" archival format -- indeed, there doesn't seem to be _any_ archival-first format. The latest work I found reference to was from the late 90s, which encoded the resulting data using a turbo encoder for error protection. It turns out that the most advanced encodings available (that is, the closest we've got so far to the Shannon limit -- the maximum rate at which we can reliably push data through a 'noisy'/error-prone communication channel) are LDPC (Low Density Parity Check) codes. While these were discovered in the 1960s, they still haven't really reached any areas outside of data transmission (IIRC, WiFi and 5G both use LDPC encodings). It was at this point I realised that I was out of my depth mathematically, so after puzzling through a GNU Radio implementation of LDPC encoding/decoding, I decided to table it in favour of other projects -- at least until I had improved my linear algebra.
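To put a number on that parenthetical: for the simplest model, a binary symmetric channel that flips each bit independently with probability p, the Shannon limit works out to

    C = 1 - H_2(p), \qquad H_2(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

so, for instance, a medium that flips about 1% of its bits still has a capacity of roughly 0.92 bits per stored bit. LDPC codes are notable precisely because long, well-designed ones get very close to this bound.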

Additionally, something that would have been extremely helpful, and that I am surprised I don't remember finding any research on, is how bit rot actually manifests on various storage media. After all, if errors tend to arrive as large contiguous chunks rather than isolated flipped bits, then a burst-error-correcting code would have been more appropriate. One prior work worth noting was a format designed for redundancy with respect to hard disk failure -- as I understand it, on hard disks individual bit errors are less likely than entire sectors 'going bad'. So parcelling files into sector-sized pieces makes it less likely that the entire file, or a significant part of it, is rendered unreadable (say, because the magic number was overwritten, or something), and makes it easier to find the pieces and put them back together after accidental file deletion. Unfortunately I can't find the project's name at this point. I'll update it when I can.

In retrospect, there were several rabbit holes I didn't need to go down. For example, plain-text TSV would have been 'good enough', and if the resulting system was properly designed then it would be possible to swap out the db layer at a later date.
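As a sketch of what 'swap out the db layer' might look like in the Erlang version (the module and callback names here are invented for illustration), the Overseer would only ever talk to a storage behaviour, and a TSV backend, an SQLite backend, or anything fancier would simply implement it:

    %% Hypothetical storage behaviour -- any backend (plain TSV file,
    %% SQLite, something more exotic) implements these callbacks, so the
    %% Overseer never needs to know which one it is talking to.
    -module(archivist_store).

    -callback open(Opts :: term()) ->
        {ok, State :: term()} | {error, Reason :: term()}.
    -callback append(Entry :: {Source :: string(),
                               Timestamp :: integer(),
                               Payload :: binary()},
                     State :: term()) ->
        {ok, NewState :: term()} | {error, Reason :: term()}.
    -callback close(State :: term()) -> ok.

A TSV backend's append/2 is essentially the io_lib:format/file:write pair from the earlier sketch; a later SQLite (or anything else) backend could replace it without the clients ever noticing.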