I'm always inspired by SQLite. Overall I like it, but if you're not doing writes it's really overkill.
So I made a format that will never surpass SQLite, except that it's extremely lighter and faster and works on zstd compressed files. It has really small indexes and can contain binaries or text just like SQLite.
The wasm part that decompresses and reads and searches the databases is only 38kb (uncompressed (maybe 16kb gzipped)). Compare that to SQLite's 1.2mb of wasm and glue code it's 3% the size but searching and loading is much faster. My program isn't really column based and isn't suitable for managing spreadsheets, but it's great for dictionaries and file archives of images and audio.
I ported the jbig2 decoder as a 17kb wasm module, so I can load monochrome scans that are 8kb per page and still legible.
> Compare that to SQLite's 1.2mb of wasm and glue code
The current trunk is actually 1.7mb in its canonical unminified form (which includes very nearly as much docs as JS code), split almost evenly between the WASM and JS pieces :/. Edit: it is 1.2mb in minified form, though.
Many comments here to your creation, PeakSlab, but not yet a dedicated praise. I didn't know it but I have to say it is really cool and innovative! The performance of the dictionary is indeed superb and I will definitely bookmark this for future reuse. So, in a nutshell: thanks for sharing!
I think actually this competes with the old BerkeleyDB: https://en.wikipedia.org/wiki/Berkeley_DB - which I now see is no longer BSD-licensed, and in any case has been rendered almost extinct by SQLite. It was used for basic on-disk key-value store work.
It seems more like SSTables, which are widely used by open-source software like LevelDB, HBase, and Cassandra (and Google's BigTable) but AFAIK don't have a standard open-source reader (unless you want to pull the relevant source file out of Cassandra or LevelDB).
Yes this does seem almost exactly like what PeakSlab is doing. I'll have to investigate more how SSTables and memtables work to see if there's any improvements or anything to glean from them.
I'm storing null delimited sorted strings with an index that's an offset to where that entry begins, so if I wanted to do memtables for writing I would have to have two memtables and merge them on write? I've been considering this exact approach.
Even BerkeleyDB tries to be mutable. What I'm doing doesn't need the mutability so it's much more similar to dictionary formats (though probably simpler) than it is to a database. Though a lot of people do use full databases for immutable dictionary key-value stuff. I just couldn't get any database to work well enough for a pwa dictionary.
SQLite is simple in its own way and I like the design principle of their SQL dialect.
"Right joins are just left joins in the wrong direction, you don't need that crap"
Of course it always gets simpler or more specialised. I think many apps using databases would run with SQLite just as well. And some would probably run just as well with a textfile instead of any db like SQLite.
For the love of god, don't do blank textiles anymore. In the end you have a software that has 20 (or more) individual files for each programs section, which works fine until you want the files to be consistent. Boom. And then you add a lock to fix it and suddenly your whole program can only run sequentially. And then your customers ask why it's so slow in ingress. I won't name any names here, but this is a real commercial product.
We use a cheap invoicing program. It works fine except it gets very slow when dealing with large numbers if invoices. Turns out each invoice (or payment record, or customer record, or whatever) is a separate text file with form-urlencoded data. No indices.
I'll definitely have to look at this. I'm using binary search right now because it's fast enough and I understand it better, but I might replace it with hashes at some point.
Overkill in what way exactly? The LOC of the project shouldn't have any bearing on most people's usage of the project. SQLite is one of the well tested and mature projects in the world. What exactly would motivate someone to use PeakSlab instead? What problem are you solving?
I'm solving a simpler problem. Just making cross platform dictionary progressive web apps with indexes and full text search and HTML tags and uppercase letters inserted back into the text on render so they don't interfere with search.
SQLite is 1.2mb in combined wasm and JavaScript and not really designed for my use case, so I would have to add all the things i need anyway like compression and HTML tag insertion. For my use case which is just for pwas SQLite takes too long to load and the files are too big and the search isn't tailored. So I made something else in 38kb instead
Even if you aren't doing writes SQLite should be the default option as a file dependent format or even durable IPC. And it isn't going to fall away (i.e. BerkeleyDB, Tokyo Cabinet). I recently dug into WAL mode and it is pretty incredible for multiple readers https://github.com/InterNetNews/inn/pull/338.
I've had to deal with image scans and my own approach lately has been to quantize to 16-color (grayscale via pngquant) and oxipng with zopfli set to max. Mostly so that the output can be easily used in a browser UI.
Will definitely have to look into your solution with JBIG2 next time. Though I may keep with png just for compatibility sake.
Yeah, I had Claude generate me the jbig2 to 1-bit png wasm module from the jbig2 source so that there'd be no dependencies on zlib or libpng. Then I aggressively removed functions from 97kb down to 17kb. So if it's missing functions you need you may need to reset the files to the .bak and recompile.
I have a system that builds SQLite databases and uploads them to S3. Once they're in S3, they are never changed. The program that builds the databases only does writes, and the program that queries the databases only does reads. It uses a VFS to query the database in-place with HTTP range requests.
This is indeed not an optimal setup. A more careful design from first principles would not require seeking around the file as much as SQLite does, we'd do a better job on reading exactly the correct range of bytes for a given query since we know ahead of time what the access patterns are, and we could do reads in parallel. With SQLite we have to be very careful about the schema design to ensure it won't have to seek too many times to answer a query. But SQLite was expedient, and I'm confident I'll always be able to read the files. That's less certain for a custom file format.
For this use case we need the ability to do an indexed query and extract a small number of rows from a large database. It's a traditional row-oriented database workload. I'm sure other solutions would also work, but SQLite's design melds well with the data. The migration from partitioned SQL Server tables to a collection of SQLite database files was straightforward.
If it's that small, why not compile it directly into the application? The compiler might even be able to optimize a good chunk of data away, or optimize the code based on the data.
Nothing small about it. This is terabytes of data and new files are added daily with new data. That's why we had to move it into S3. It was costing too much to store in SQL Server.
I only reach for SQLite when it's big. Otherwise, I'd prefer to just use the existing SQL Server infrastructure, or .csv.gz files in S3. Internally, I use the term "bulk data" for data sets that are stored in S3 because they were too big for SQL Server.
> but if you're not doing writes it's really overkill.
SQLite is not just a B-Tree+ file format manager. It's also a powerful CLI with tons of built in data manipulation and extraction tools which can all be scripted.
Finally, if your data is _relational_, then a format that pereserves those relations, even if they aren't intended to be updated, is exceptionally powerfull.
> This is a custom format built to be very fast to load (cast to a c-struct and done) and very fast to search with 3 binary search indexes built in.
Little endian. x86 and wasm have the same endianness so it's very straightforward. If I wanted to enable support for reading and writing on big endian systems I could patch a few lines very easily, but it seems to be a consideration unnecessary for most uses.
Instead of doing multiple columns, there's some of that with secondary and tertiary indexes, but mostly it's just key value.
To make the data more relational I find it's more natural to just have additional "columns" in separate peak files and when the peak files are queried they can be combined by headword. This is kinda like what parquet does putting all of one column together.
I could make PeakSlab have columns and have it query based off specific columns fairly easily, but it was a design decision to make things more tag based and less column based. I like that it's more flexible and less rigid in structure.
SQLite keeps everything relational and their raw filesizes are 50% larger than a .peak file, not sure what's exactly to blame, but I'm optimizing for small filesizes.
Doesn’t even apply unless someone says that (1) there are too many “standards”, and (2) so we are making this standard (neither apply here). Someone made something.
We should really consider eventually retiring memes because they just end up as thought-terminating cliches.
This is of course referring to xkcd #927. How do I know that?
So I made a format that will never surpass SQLite, except that it's extremely lighter and faster and works on zstd compressed files. It has really small indexes and can contain binaries or text just like SQLite.
The wasm part that decompresses and reads and searches the databases is only 38kb (uncompressed (maybe 16kb gzipped)). Compare that to SQLite's 1.2mb of wasm and glue code it's 3% the size but searching and loading is much faster. My program isn't really column based and isn't suitable for managing spreadsheets, but it's great for dictionaries and file archives of images and audio.
I ported the jbig2 decoder as a 17kb wasm module, so I can load monochrome scans that are 8kb per page and still legible.
https://github.com/tnelsond/peakslab
SQLite is very well engineered, PeakSlab is very simple.