Storage Wars: Web Edition – or, how we learned to store binary data effectively

Local storage, IndexedDB, web workers… Which storage option reigns supreme in the world of web applications? We embark on a quest to find out!

Aare · September 19, 2024 · 23 mins read

Prologue

All modern browsers offer ways to store data locally. Our world would look rather different if they didn’t. Since you’re reading this, I’m sure you already have some idea of what local storage in the browser is. Still, let’s quickly go over the basics.

I’m writing this in Google Docs. I figure it stores my data locally before syncing it to Google’s servers. This is a good thing, because it makes sure that any changes I make are immediately stored on my device, and that Docs doesn’t fully depend on an external server being available.

Cookies are the oldest way to store data locally – however, they’re ripe for abuse by third parties. They’re also intended for communication with servers anyway, so they are a story for another time.

Then we have session storage and local storage proper. They’re almost exactly the same, except for their scope and lifetime.

Local storage stores data per origin, meaning whenever and from whichever tab you visit the same origin, you have access to the same locally stored data.

Session storage, on the other hand, works per instance, meaning its lifecycle is bound to a single browser tab. It’s intended to allow separate instances of the same web app to run in different windows without interfering with each other, which is a nifty feature. However, it does not exactly fit our requirements.
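
Before we get to those requirements, here’s a quick look at just how similar the two APIs are – the keys and values in this snippet are placeholders, nothing more:

// Survives browser restarts and is shared by every tab of the same origin
localStorage.setItem("theme", "dark");
console.log(localStorage.getItem("theme")); // "dark"

// Lives only as long as this particular tab; another tab gets its own copy
sessionStorage.setItem("draft", "unsaved text");
console.log(sessionStorage.getItem("draft")); // "unsaved text"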

Wait, which requirements? What exactly are we looking for?

We’re here to research storage options for the Scanbot Web SDK, you see.

This SDK has a complex API that deals with various different types of data. When you scan a document with it, you don’t just get a bunch of strings and numbers. For one, the result data also includes an array with the coordinates of the document’s location on the scanned image. For another, you also get the image itself as binary data (more on this later). All of this needs to be properly stored and managed as conveniently for the user as possible.

Here’s an example of the data object we need to store (I’ve removed annotations and irrelevant types for the sake of brevity):

class CroppedDetectionResult {
    croppedImage: RawImage | null;
    originalImage: Image;
    status: DetectionStatus = "NOT_ACQUIRED";
    detectionScores: DetectionScores;
    points: Point[];
    horizontalLines: LineSegment[];
    verticalLines: LineSegment[];
    aspectRatio: number;
    averageBrightness: number = 0;
}

As you can see, it’s not overly complicated. Some numbers and strings, some classes as types.

There are only two parts that are actually problematic: RawImage, which contains Uint8Array data, and Image, which is a type that can refer to any of the interfaces ImageData, ArrayBuffer, Uint8Array and RawImage itself.
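
In TypeScript terms, you can picture the problematic part roughly like this (a simplified sketch – the fields on RawImage are placeholders, not the SDK’s actual definitions):

// Hypothetical, simplified shape of the image types
interface RawImage {
    width: number;
    height: number;
    data: Uint8Array; // the raw pixel buffer – this is the part that's hard to store
}

type Image = ImageData | ArrayBuffer | Uint8Array | RawImage;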

How can we store all this, then?

Local storage

Let’s start with the basics, the OG of storage: localStorage. It has been available since before I could call myself a software developer, all the way back in 2009. Ah, simpler times. localStorage is still used in all kinds of applications and the API has remained relatively stable throughout the ages, so I will jump right into the implementation.

Let’s create a separate storage class for our data object. Just the bare bones for storing an item and retrieving it without worrying about unique keys or anything.

JSON serialization is necessary because localStorage only accepts strings as values – another indication that it’s not meant for larger data sets. And here it is:

export default class SBStorage {

    static storeDocument(source: CroppedDetectionResult): void {
        localStorage.setItem("document", JSON.stringify(source));
    }

    static getDocument(): CroppedDetectionResult | null {
        const item = localStorage.getItem("document");
        return item ? JSON.parse(item) : null;
    }
}

Limitations

At this point, I just wanted to show you what a retrieved document looked like, but my React starter project immediately threw an exception:

sb-storage.ts:7 Uncaught (in promise) DOMException: Failed to execute 'setItem' on 'Storage': Setting the value of 'document' exceeded the quota.
    at SBStorage.storeDocument (

There is this nifty online tool that shows you how many characters can be held in localStorage: https://arty.name/localstorage.html. As of writing this, my limit seems to be 5,200,000. This number will vary based on the browser and version you’re using.

The site’s author even explains:

Chrome 6.0.472.36 beta let me save 2600-2700 thousands of characters, Firefox 3.6.8 – 5200-5300k, Explorer 8 – 4900-5000k, and Opera 10.70 build 9013 popped a dialog, that allowed me to give the script unlimited storage. Spec arbitrarily recommends 5 megabytes of storage, but doesn’t say a thing about actual characters, so in UTF-16 you get twice less.
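
If you’d rather not rely on a third-party page, you can probe the limit yourself. Here’s a rough sketch that keeps doubling a dummy value until setItem throws – the number it reports will of course vary by browser:

function estimateLocalStorageChars(): number {
    const key = "__quota_probe__";
    let chars = 1024;
    let lastGood = 0;
    for (;;) {
        try {
            // Double the payload until the quota is exceeded
            localStorage.setItem(key, "x".repeat(chars));
            lastGood = chars;
            chars *= 2;
        } catch {
            break;
        }
    }
    localStorage.removeItem(key);
    return lastGood;
}

console.log("Roughly", estimateLocalStorageChars(), "characters fit");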

So, let’s check how many characters’ worth of data we’re trying to store. A quick modification to the existing snippet …

const value = JSON.stringify(source);
console.log("Characters:", value.length);
localStorage.setItem("document", value);

… and the result we get is:

Characters: 68046772

As it turns out, localStorage can’t even store a tenth of that.

Even though we’ve already proven that it’s not a viable option here, I briefly want to touch on the fact that localStorage does not support binary data. That’s what I was going to get at in the first place, but instead we immediately ran into the quota issue.

I just want to quickly show you what serialization does to a binary data format. To begin with, our croppedImage data comes as a Uint8Array. It looks more or less like what you would expect (the full array holds 1,504,272 values, but here’s the start of it):

[169, 195, 195, 169, 195, 195, 168, 194, 194, 168, ...

After serialization, the same array looks like the following:

{"0":169,"1":195,"2":195,"3":169,"4":195,"5":195,"6":168,"7":194,"8":194,"9":168, ...

We’ll get into why this is in a later section, but what we need to know right now is that localStorage only accepts strings and therefore doesn’t support binary formats out of the box.

There are ways to get around this – you can probably already imagine an algorithm that turns this map of keys and values back into an array of integers. It’s always going to be a hack, though. You just shouldn’t have to manipulate binary data manually.
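
Purely for illustration, that hack could look something like this – walking the numeric keys of the parsed object and rebuilding the typed array by hand:

// Reverses what serialization did to the Uint8Array – a workaround, not a recommendation
function mapToUint8Array(map: Record<string, number>): Uint8Array {
    const length = Object.keys(map).length;
    const result = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        result[i] = map[i.toString()];
    }
    return result;
}

const restored = mapToUint8Array(JSON.parse('{"0":169,"1":195,"2":195,"3":169}'));
console.log(restored); // Uint8Array(4) [169, 195, 195, 169]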

Base64

The correct way to do this would be to convert the image data to a Base64 string first, which we can then store properly.

Base64 is a family of binary-to-text encoding schemes that represent binary data in an ASCII string format by transforming it into a radix-64 representation.

That may sound complicated, but it’s really a straightforward process. You simply create a reader object, pass your array as a blob, and read it in as a data URL using the native FileReader API. If necessary, also remove the descriptor part of the string. Here’s the convenience function we use:

public static async toBase64(data: Uint8Array): Promise<string> {
    // See https://stackoverflow.com/a/39452119
    const base64url = await new Promise<string>(resolve => {
        const reader = new FileReader()
        reader.onload = () => resolve(reader.result as string)
        reader.readAsDataURL(new Blob([data]))
    });
    // remove the `data:...;base64,` part from the start
    return base64url.slice(base64url.indexOf(',') + 1);
}

And then the following, if not too long, can be effectively held in local storage:

const base64 = await ImageUtils.toBase64(source.croppedImage.data);
// Base64: wsLQw8LQw8LQw8TRxMXSxMXTxcfUxcjVxMjVxcnXx8vaztHh1dbm19np2dvr2Nv...
console.log("Base64: ", base64);
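
Reading the data back out means reversing the process. A sketch of the counterpart function (using atob and a copy loop – not the SDK’s actual helper) could look like this:

public static fromBase64(base64: string): Uint8Array {
    // atob() decodes Base64 into a binary string; copy it byte by byte into a typed array
    const binary = atob(base64);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return bytes;
}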

However, any conversion at this scale is computationally heavy – whether it’s manually rebuilding the array from a key-value map or going through Base64 – and we’re aiming for optimal performance here. So all that just won’t cut it. We need to be able to work with binary data directly.

And this brings us to …

Binary data

In web development, we meet binary data mostly when dealing with files (creating, uploading, downloading). Another typical use case is image processing – which is what we’re doing here.

There are many classes for various implementations of binary data in JavaScript, such as ArrayBuffer, Uint8Array, DataView, Blob and File, but these different nuances of implementation aren’t important right now.

Anyway, the core object is ArrayBuffer – the root of everything, the raw binary data. And that is what we will be working with. Whichever of the aforementioned wrapper classes you use, it will contain a variation of an ArrayBuffer.
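
A quick illustration of that relationship – every typed array or DataView is just a window onto the same underlying buffer:

const buffer = new ArrayBuffer(4);     // 4 bytes of raw, untyped memory
const bytes = new Uint8Array(buffer);  // view the same memory as unsigned bytes
const view = new DataView(buffer);     // or read/write it with explicit offsets

bytes[0] = 255;
console.log(view.getUint8(0));         // 255 – same memory, different wrapper
console.log(bytes.buffer === buffer);  // true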

Everything is an object (almost)

Now, the explanation as to why stringifying an ArrayBuffer or Uint8Array turns into a bit of a nightmare is actually pretty interesting. You see, almost everything in JavaScript is an object – arrays very much included; primitives like strings and numbers are the exception. For arrays, that means you can do the following:

const a = [1, 2, 3];
a.foo = "bar";
console.log(a); // [ 1, 2, 3, foo: 'bar' ]

That’s because, under the hood, an array is still an object; it only stringifies into something that looks like a simple array because its toString() method has been overridden for the convenience of developers – simply because it’s so often needed.

And that is also exactly why stringifying an Uint8Array turns it into an object of key-value pairs (remember the {"0":169,"1":195,"2":195,"3":169 …} example from earlier) and why we need additional steps to properly serialize it.
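
You can see the contrast side by side: JSON.stringify has special handling for plain arrays, but treats a typed array like any other object with numeric keys:

const plain = [169, 195, 195];
const typed = new Uint8Array([169, 195, 195]);

console.log(JSON.stringify(plain)); // [169,195,195]
console.log(JSON.stringify(typed)); // {"0":169,"1":195,"2":195}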

Serialization

At this point, it’s also worth elaborating on a point I made previously: that binary data is not serializable. While that is still fundamentally true, JavaScript does have some native support for serialization these days, the most modern tools being TextEncoder and TextDecoder. Take this example:

const buffer = new TextEncoder().encode('banana');
const testObj = { encodedText: buffer.toString() };

const serializedTestObj = JSON.stringify(testObj);
console.log(serializedTestObj); // {"encodedText":"98,97,110,97,110,97"}

const deserializedTestObj = JSON.parse(serializedTestObj);
const deserializedBuffer = deserializedTestObj.encodedText.split(',');
const newBuffer = Uint8Array.from(deserializedBuffer);
const str = new TextDecoder().decode(newBuffer);
console.log(str); // banana

Thank you to Will Munslow from dev.to for this snippet!

While it is relatively straightforward, you’ll notice right away that this encoding method requires:

  • The data to start out as a string that’s encoded with TextEncoder, which is a relatively uncommon use case (in most cases, you start off with binary data).
  • String splitting, which is an expensive operation.
  • Buffer deserialization and text decoding again at the end.

Let’s now move on to the next web storage method.

IndexedDB

So, what is IndexedDB? I would’ve gladly given my own short overview of the concept, but MDN says it best:

IndexedDB is a low-level API for client-side storage of significant amounts of structured data, including files/blobs. This API uses indexes to enable high-performance searches of this data.

What that actually means for us is a convenient way to store the aforementioned binary data in its ArrayBuffer format. Yay, no more of this complex, nonsensical, and resource-hungry encoding and decoding logic!

To quickly reiterate, the main problems with localStorage are that:

  • It can only store strings.
  • It has a very small data cap that varies between browsers.

IndexedDB easily solves both of these issues. On top of that, it has some additional benefits, such as the fact that it can be used inside web workers (more on this later). It’s also asynchronous, meaning the render cycle is not blocked while saving/loading.

The only thing localStorage has going for it is the simplicity of its API. It’s just localStorage.setItem and localStorage.getItem. There are a couple of additional convenience functions, but that’s roughly it. Done!
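
For reference, those additional convenience functions amount to little more than this:

localStorage.setItem("key", "value");
localStorage.getItem("key");     // "value"
localStorage.removeItem("key");
localStorage.key(0);             // look up a key by its index
localStorage.length;             // number of stored entries
localStorage.clear();            // wipe everything for this origin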

Messy API & solutions

IndexedDB isn’t perfect, either: The API is famously a fabulous mess. It’s such a mess, in fact, that a whole ecosystem of libraries has been developed around it just so that your average developer doesn’t have to deal with its nonsense.

So, before we finally move on to our actual implementation, let’s take a look at what’s already out there to help you with this.

There’s use-indexeddb, which promises to be the most lightweight one, but that’s because it only wraps a few of the basics. It’s not all that convenient to use, and it also features some intrusive logging that cannot be disabled.

Next up is Dexie, a more fully-fledged storage framework that looks pretty promising. Its API is very straightforward. It’s super intuitive to create a database, write queries, and add entries:

const db = new Dexie('MyDatabase');

// Declare tables, IDs and indexes
db.version(1).stores({ friends: '++id, name, age' });

// Find some old friends
const oldFriends = await db.friends.where('age').above(75).toArray();

// or make a new one
await db.friends.add({
    name: 'Camilla',
    age: 25,
    street: 'East 13:th Street',
    picture: await getBlob('camilla.png')
});

At first glance, it seems to be inspired by C#’s LINQ statements, and it also supports live queries. That means you can combine it with, for example, Redux and completely handle state management through that. While that’s a nice feature, Dexie is actually a bit too elaborate for our use case.

Another one is idb, which claims to mostly mirror the IndexedDB API, but with some notable usability improvements. But really, looking at the implementation, it just seems to wrap the API in the Promise class and not much else. It’s definitely not worth the hassle of adding a third-party dependency for such a minor increase in convenience.

The final popular wrapper framework is called JsStore, which wraps the IndexedDB API into an SQL-like API.

var insertCount = await connection.insert({ into: 'Product', values: [value] });

var results = await connection.select({ from: 'Product', where: { id: 5 } });

Which, in my humble opinion, is not much of an improvement. SQL is still pretty complex and it’s easy to create sub-optimal transactions. It’s not raw SQL, so maybe it’s not too bad, but I’ve been burned by SQL too many times to trust anything like that again.

Plus, as with Dexie, JsStore seems to be overkill for our use case as well.

To be honest, I knew in advance that I wasn’t going to be impressed by any of these libraries. The plan was always to get our hands dirty, the Spartan way, and not to tie ourselves down by any more dependencies. I just wanted to give a quick overview of the options available. If our solution were more complex, I’d go with Dexie.

The Spartan approach

However, we’re building an SDK for clients, so we prefer not to rely on external libraries. It’s possible to use them – and we do – but we don’t add them lightly, because:

  • They always come with a license, which we must adhere to and disclose.
  • They add overhead to future development and maintenance (I’m sure you know the pain of upgrading dependencies all too well).
  • They add additional points of failure to our project.

If possible and within reason, we want to be in control of what happens inside our SDK.

So, for our storage logic, let’s roll our own!

In terms of use cases, ours is relatively straightforward. The data model itself is complex, but the database transactions are not, so I was not expecting too much of a hassle. The basic setup is simple enough:

const request = indexedDB.open("SBSDKBase", 1);

We create the request, think of a name for the database, and specify which version of the database we’re requesting. Having to specify the version number right away may seem unnecessary, but the documentation makes it pretty clear why: database models change all the time, and having an easy way to keep track of version history seems intuitive enough.

The request object has different callbacks: onerror, onupgradeneeded, and onsuccess. Right now, we can ignore errors, as we’re just trying to add some basic documents to the database and retrieve them. There are no model changes and no indexing; we only use readwrite mode and increment keys automatically (let’s hope this doesn’t come back to bite us).

All we’re interested in at the moment is the success callback.

request.onsuccess = (event) => {
    const db = (event.target as IDBOpenDBRequest).result;
    db.createObjectStore("documents", { keyPath: "id", autoIncrement: true });

    const transaction = db.transaction("documents", "readwrite");
    const store = transaction.objectStore("documents");
    store.add(source);

    console.log("Added item to store");

    const getRequest = store.get(1);
    getRequest.onsuccess = (event) => {
        const result = (event.target as IDBRequest).result;
        console.log("Got item from store", result);
    };
};

This is the most bare-bones code snippet I thought I’d need to add a document to storage. Open the database, create a document table, put an object in it. I’ve been doing this for a long time, I’ve worked on various databases and thought I had a handle on it. I thought this would work.

Nope. I got bitten right away. The error:

Uncaught DOMException: Failed to execute 'createObjectStore' on 'IDBDatabase': The database is not running a version change transaction.

Why would I be unable to create a table when opening the database? I’m barely getting started and I already get the feeling that writing raw SQL is more intuitive than this. There, you can at least execute queries once you open the database.

Let’s hope it gets better from here on! At least the error message itself is not cryptic. We’ll try again by implementing onupgradeneeded (don’t forget to upgrade your db version):

request.onupgradeneeded = (event) => {
    const db = (event.target as IDBOpenDBRequest).result;
    db.createObjectStore("documents", { keyPath: "id", autoIncrement: true });
    console.log("IndexedDB upgraded", event);
};

Now we can also remove the create block from our success callback and voilà, we’ve created a database, added our first item to it, and also retrieved it:

request.onsuccess = (event) => {
    const db = (event.target as IDBOpenDBRequest).result;
    const transaction = db.transaction("documents", "readwrite");
    const store = transaction.objectStore("documents");
    store.add(source);
    console.log("Added item to store");

    const getRequest = store.get(1);
    getRequest.onsuccess = (event) => {
        const result = (event.target as IDBRequest).result;
        console.log("Got item from store", result);
    };
};

And the retrieved object still contains our image data in the correct Uint8Array format. Juicy!

Got item from store {
    status: 'OK',
    detectionScores: {...},
    points: Array(4),
    horizontalLines: Array(5),
    verticalLines: Array(5),
    aspectRatio: 0.698217078300424,
    averageBrightness: 183,
    croppedImage: {..., format: 'BGR', data: Uint8Array(1859760), ...},
    ...
}

Asynchronous improvements

Time to make the entire process a bit more convenient. Let’s wrap the database opening into its own asynchronous block:

private static openDB(): Promise<IDBDatabase> {
    return new Promise((resolve, reject) => {

        const DB_NAME = "SBSDKBase";
        const DB_VERSION = 4;
        const request = indexedDB.open(DB_NAME, DB_VERSION);

        request.onsuccess = (event) => {
            resolve((event.target as IDBOpenDBRequest).result);
        };

        request.onerror = (event) => { reject(event); };
        request.onupgradeneeded = (event) => {
            // Update object stores when necessary
        };
    });
}
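
For completeness, the upgrade callback is where the createObjectStore call from earlier now lives – roughly like this, with a guard (my addition) so the store is only created if it doesn’t exist yet:

request.onupgradeneeded = (event) => {
    const db = (event.target as IDBOpenDBRequest).result;
    if (!db.objectStoreNames.contains("documents")) {
        db.createObjectStore("documents", { keyPath: "id", autoIncrement: true });
    }
};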

And another block to get the store itself, so we don’t have to duplicate that any further:

private static async getDocumentStore(): Promise<IDBObjectStore> {
    const db = await SBStorage.openDB();
    const transaction = db.transaction("documents", "readwrite");
    return transaction.objectStore("documents");
}

And now we have two super intuitive public functions for our storage service to store a document and retrieve all of them:

static async storeDocument(source: CroppedDetectionResult): Promise<void> {
    const store = await SBStorage.getDocumentStore();
    store.add(source);
}

static async getDocuments(): Promise<CroppedDetectionResult[]> {
    const store = await SBStorage.getDocumentStore();
    return new Promise((resolve, reject) => {
        store.getAll().onsuccess = (event) => {
            const result = (event.target as IDBRequest).result;
            resolve(result);
        };
    });
}

Eventually we’ll have to add proper error handling, but as you can see, with a couple of wrapper functions, it’s relatively easy to implement IndexedDB on your own, no third-party libraries required.

This asynchronous approach with Promises already relies on callbacks under the hood and does not block the main thread. JavaScript is pretty nice, you see. We just added some basic syntactic sugar to make the code look nicer – and for many use cases, this would already be the optimal approach.

Timing transactions

It’s possible to optimize this even further, and we’ll get to that in a second. But first, let’s check out how much time all these transactions take. You should always know exactly what you’re optimizing and why. Performance metrics are key.

The most straightforward way to do this is to go back to our initial implementation and measure how many milliseconds have passed after each meaningful operation. For example:

const currentMS = new Date().getTime();
const request = indexedDB.open("SBSDKBase", 4);
console.log("Opening indexedDB", new Date().getTime() - currentMS);

request.onsuccess = (event) => {
    const db = (event.target as IDBOpenDBRequest).result;
    const transaction = db.transaction("documents", "readwrite");
    const store = transaction.objectStore("documents");
    store.add(source);
    console.log("Added item to store", new Date().getTime() - currentMS);

    const getRequest = store.get(1);
    getRequest.onsuccess = (event) => {
        const result = (event.target as IDBRequest).result;
        console.log("Got item from store", new Date().getTime() - currentMS);
    };
};

Opening indexedDB 0
Added item to store 17
Got item from store 30

That’s not too bad, but it’s also not nothing. Considering that I’m currently working on a higher-end MacBook Pro with an M2 chip and an SSD, this operation could easily take 20–50 times as long on a lower-end Android or iOS device.

We have already implemented basic asynchronous IndexedDB usage with Promises, and in some use cases this would be good enough. But since we’re dealing with large chunks of data, it’s worth investing in true parallelism and moving everything off the main thread.

And this brings us to …

Web workers

Web workers are a simple means for web content to run scripts in background threads. The worker thread can perform tasks without interfering with the user interface. (MDN)

The way web workers differ from promises and callbacks is rather fundamental.

I won’t dive too deep into threading, but to put it briefly: a JavaScript environment is fundamentally single-threaded. In its simplest form, async/await is just syntactic sugar for promise chains, which let the user interface render in between other long-running operations.

It’s also important to note that two jobs can never run in parallel in this environment. One async operation might run in between two jobs of another one, so in that sense they can run “concurrently.” For true parallelism, e.g. simultaneously rendering the user interface and storing data, we need threads.
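
To make that concrete, here’s a purely illustrative snippet: even inside an async function, synchronous work still hogs the one and only thread.

async function storeSomethingHuge(): Promise<number> {
    // No await before this loop, so it runs synchronously on the main thread –
    // while it spins, no click handler fires and no frame is painted
    let checksum = 0;
    for (let i = 0; i < 1_000_000_000; i++) {
        checksum += i % 255;
    }
    return checksum;
}

storeSomethingHuge().then((result) => console.log("done", result));
console.log("Logged before 'done' – but only after the loop has already frozen the page");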

Luckily, all modern browsers give us worker threads, allowing for this kind of operation.

Basic implementation & problems

Let’s see how we can set up worker threads for our specific use case, following the same example code used previously.

Spawning a worker is a relatively trivial task. From MDN’s examples, you just need to create a worker JavaScript file:

onmessage = function(e) {
  console.log('Worker: Message received from main script');
  const result = e.data[0] * e.data[1];
  if (isNaN(result)) {
    postMessage('Please write two numbers');
  } else {
    const workerResult = 'Result: ' + result;
    console.log('Worker: Posting message back to main script');
    postMessage(workerResult);
  }
}

And then you’re able to import it in your main JavaScript file and pass data to it as follows:

if (window.Worker) {
  const myWorker = new Worker("worker.js");

  [first, second].forEach(input => {
    input.onchange = function() {
      myWorker.postMessage([first.value, second.value]);
      ...

As you can see, running a piece of JavaScript code on another thread is not as trivial as running it asynchronously. You need to spawn a dedicated worker from a separate file. While that seems relatively simple, once your build system becomes more complex, so does this.

Where do you put your worker file in a Vite environment? Where does it go when you’re using Webpack? Does your environment use bare-bones Webpack that is configurable, or is it wrapped behind another build system? What if you also have automated tests running – are they easily able to pick up your worker file?

Or, to give a more specific example: We use Webpack to transpile the SDK, and it has a separate worker-loader plugin that can be used to ensure workers are bundled into the end product. That’s pretty straightforward. However, our development environment uses the SDK’s source code directly, so it’s necessary to have another option to make sure this worker file is properly compiled into that environment as well.

Loading a file as data

So, the easiest solution turns out to be to create the file as you normally would, in your source code, and then use JavaScript itself to package it into a blob. That blob is then loaded into the worker, exactly as an external JavaScript file would be (thank you to Martins Olumide on dev.to for this hint).

It may sound convoluted, but it’s a pretty solid solution to start off with. I bet it’s good enough for production, but before shipping this, we will definitely delve into additional performance metrics to see whether using Webpack to pre-compile it into the bundle is better or not.

First, the basic worker file will look like this:

const workerFunction = function () {
   // Perform every operation we want in this function
   self.onmessage = (event: MessageEvent) => {
      postMessage('Message received!');
   };
};

// This stringifies the whole function
let codeToString = workerFunction.toString();
// This extracts the code between the outer curly braces as a string
let mainCode = codeToString.substring(codeToString.indexOf('{') + 1, codeToString.lastIndexOf('}'));
// Convert the code into raw data (a blob)
let blob = new Blob([mainCode], { type: 'application/javascript' });
// A URL is created from the blob object and we're good to go
let worker_script = URL.createObjectURL(blob);

export default worker_script;

The comments in the block above should make it pretty clear what we intend to do: We’re essentially emulating a separate worker file by stringifying the entire worker function and serving it as a blob.

Putting it all together

After that, we just need to refactor our previous code a bit to make it fit into that JavaScript function and we have our worker up and running. We move the previously defined wrapper functions inside that function (because, thankfully, in JavaScript you can have inner functions, else this would be a mess) and do exactly what we did previously, only now we need to post the message instead of returning the data:

const workerFunction = function () {

    function openDB(): Promise<IDBDatabase> {
        ...
    }

    async function getDocumentStore(): Promise<IDBObjectStore> {
        ...
    }

    self.onmessage = (event: MessageEvent) => {
        console.log("Received in worker thread:", event.data);
        if (event.data.type === 'storeDocument') {
            console.log("Storing document in worker thread");
            getDocumentStore().then((store) => {
                store.add(event.data.data);
                self.postMessage(true);
            });
        } else if (event.data.type === 'getDocuments') {
            console.log("Getting documents in worker thread");
            getDocumentStore().then((store) => {
                const request = store.getAll();
                request.onsuccess = (event) => {
                    self.postMessage((event.target as IDBRequest).result);
                };
            });
        }
    };
};

Now that we’ve moved all of the IndexedDB API implementation to the worker, our main storage service file ends up looking pretty bare:

import { CroppedDetectionResult } from "../../core/worker/ScanbotSDK.Core";
import worker_script from './worker/sb-storage.worker';

export default class SBStorage {

    private static readonly worker: Worker = new Worker(worker_script);

    static async storeDocument(source: CroppedDetectionResult): Promise<boolean> {
        this.worker.postMessage({ type: 'storeDocument', data: source });

        return new Promise<boolean>((resolve, reject) => {
            this.worker.onmessage = (event) => {
                resolve(true);
            }
        });
    }

    static async getDocuments(): Promise<CroppedDetectionResult[]> {
        this.worker.postMessage({ type: 'getDocuments' });
        return new Promise<CroppedDetectionResult[]>((resolve, reject) => {
            this.worker.onmessage = (event) => {
                resolve(event.data);
            }
        });
    }
}

All that’s left in this file is the code to import the blob, initialize the worker itself, post messages to the worker and resolve the promise once a message has been received.
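
From the application’s point of view, nothing about the worker is visible. A hypothetical call site just awaits the same two methods as before:

// Hypothetical usage – "result" is a CroppedDetectionResult produced by the SDK
await SBStorage.storeDocument(result);

const documents = await SBStorage.getDocuments();
console.log("Stored documents:", documents.length);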

And that’s it! Some additional error handling needs to be implemented, but that’s busywork at this point. The takeaway: In just a few steps, we’ve built ourselves an optimal storage solution for complex objects – using only IndexedDB and web workers, no third-party libraries needed.