Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm always inspired by SQLite. Overall I like it, but if you're not doing writes it's really overkill.

So I made a format that will never surpass SQLite, except that it's extremely lighter and faster and works on zstd compressed files. It has really small indexes and can contain binaries or text just like SQLite.

The wasm part that decompresses and reads and searches the databases is only 38kb (uncompressed (maybe 16kb gzipped)). Compare that to SQLite's 1.2mb of wasm and glue code it's 3% the size but searching and loading is much faster. My program isn't really column based and isn't suitable for managing spreadsheets, but it's great for dictionaries and file archives of images and audio.

I ported the jbig2 decoder as a 17kb wasm module, so I can load monochrome scans that are 8kb per page and still legible.

https://github.com/tnelsond/peakslab

SQLite is very well engineered, PeakSlab is very simple.



> Compare that to SQLite's 1.2mb of wasm and glue code

The current trunk is actually 1.7mb in its canonical unminified form (which includes very nearly as much docs as JS code), split almost evenly between the WASM and JS pieces :/. Edit: it is 1.2mb in minified form, though.

Disclosure: i'm its maintainer.

Edit: current trunk, for the sake of trivia:

    sqlite3.wasm 896745
    sqlite3.mjs  816270 # unminified w/ docs
    sqlite3.mjs  431388 # unminified w/o docs
    sqlite3.mjs  310975 # minified


Many comments here to your creation, PeakSlab, but not yet a dedicated praise. I didn't know it but I have to say it is really cool and innovative! The performance of the dictionary is indeed superb and I will definitely bookmark this for future reuse. So, in a nutshell: thanks for sharing!


I think actually this competes with the old BerkeleyDB: https://en.wikipedia.org/wiki/Berkeley_DB - which I now see is no longer BSD-licensed, and in any case has been rendered almost extinct by SQLite. It was used for basic on-disk key-value store work.


It seems more like SSTables, which are widely used by open-source software like LevelDB, HBase, and Cassandra (and Google's BigTable) but AFAIK don't have a standard open-source reader (unless you want to pull the relevant source file out of Cassandra or LevelDB).

https://www.igvita.com/2012/02/06/sstable-and-log-structured...


Yes this does seem almost exactly like what PeakSlab is doing. I'll have to investigate more how SSTables and memtables work to see if there's any improvements or anything to glean from them.

I'm storing null delimited sorted strings with an index that's an offset to where that entry begins, so if I wanted to do memtables for writing I would have to have two memtables and merge them on write? I've been considering this exact approach.


Even BerkeleyDB tries to be mutable. What I'm doing doesn't need the mutability so it's much more similar to dictionary formats (though probably simpler) than it is to a database. Though a lot of people do use full databases for immutable dictionary key-value stuff. I just couldn't get any database to work well enough for a pwa dictionary.


I don't think it has had a BSD license this century, Sleepy Cat was selling licenses in the 90s before Oracle bought them.


SQLite is simple in its own way and I like the design principle of their SQL dialect.

"Right joins are just left joins in the wrong direction, you don't need that crap"

Of course it always gets simpler or more specialised. I think many apps using databases would run with SQLite just as well. And some would probably run just as well with a textfile instead of any db like SQLite.


> "Right joins are just left joins in the wrong direction, you don't need that crap"

SQLite has supported all types of joins since version 3.39 in 2022.


I must've messed something up, but I remember some joins (was it full outer join?) being unbelievably slow? Was I doing something wrong?


Too vague of a question to give you an answer you'll likely sound satisfactory :)

You probably just needed to create indexes over your data to speed things up.


Well, look at that, now it is downhill from here!


For the love of god, don't do blank textiles anymore. In the end you have a software that has 20 (or more) individual files for each programs section, which works fine until you want the files to be consistent. Boom. And then you add a lock to fix it and suddenly your whole program can only run sequentially. And then your customers ask why it's so slow in ingress. I won't name any names here, but this is a real commercial product.


We use a cheap invoicing program. It works fine except it gets very slow when dealing with large numbers if invoices. Turns out each invoice (or payment record, or customer record, or whatever) is a separate text file with form-urlencoded data. No indices.


A more standard solution would be cdb.[0] Although that doesn't support compressed data.

[0] https://cdb.cr.yp.to/ , https://en.wikipedia.org/wiki/Cdb_(software)


I'll definitely have to look at this. I'm using binary search right now because it's fast enough and I understand it better, but I might replace it with hashes at some point.


Overkill in what way exactly? The LOC of the project shouldn't have any bearing on most people's usage of the project. SQLite is one of the well tested and mature projects in the world. What exactly would motivate someone to use PeakSlab instead? What problem are you solving?


Read the comment. He's using it in WASM form and doesn't want users to have to download 1.2MB of SQLite every time they visit the page.


Client caches are a thing, so this is most relevant for cold-start customers. In that case PeakSlab’s download size is an advantage.

Fwiw LocalStorage is a SQLite db on most browsers, with a kv api. It’s be interesting to have the actual API available.


Even on warm start PeakSlab is twice as fast. It's not just download size, it's execution speed, zero copy, database decompression, etc.

That's why PeakSlab is written in c, because what's faster than casting the whole database to a struct? ;-P


I think web sqlite was originally an (experimental) thing


I'm solving a simpler problem. Just making cross platform dictionary progressive web apps with indexes and full text search and HTML tags and uppercase letters inserted back into the text on render so they don't interfere with search.

SQLite is 1.2mb in combined wasm and JavaScript and not really designed for my use case, so I would have to add all the things i need anyway like compression and HTML tag insertion. For my use case which is just for pwas SQLite takes too long to load and the files are too big and the search isn't tailored. So I made something else in 38kb instead


Got it, thanks.


Even if you aren't doing writes SQLite should be the default option as a file dependent format or even durable IPC. And it isn't going to fall away (i.e. BerkeleyDB, Tokyo Cabinet). I recently dug into WAL mode and it is pretty incredible for multiple readers https://github.com/InterNetNews/inn/pull/338.


I've had to deal with image scans and my own approach lately has been to quantize to 16-color (grayscale via pngquant) and oxipng with zopfli set to max. Mostly so that the output can be easily used in a browser UI.

Will definitely have to look into your solution with JBIG2 next time. Though I may keep with png just for compatibility sake.


Yeah, I had Claude generate me the jbig2 to 1-bit png wasm module from the jbig2 source so that there'd be no dependencies on zlib or libpng. Then I aggressively removed functions from 97kb down to 17kb. So if it's missing functions you need you may need to reset the files to the .bak and recompile.


Perhaps a dumb question, but how do you get data into it if you’re not doing writes


I have a system that builds SQLite databases and uploads them to S3. Once they're in S3, they are never changed. The program that builds the databases only does writes, and the program that queries the databases only does reads. It uses a VFS to query the database in-place with HTTP range requests.

This is indeed not an optimal setup. A more careful design from first principles would not require seeking around the file as much as SQLite does, we'd do a better job on reading exactly the correct range of bytes for a given query since we know ahead of time what the access patterns are, and we could do reads in parallel. With SQLite we have to be very careful about the schema design to ensure it won't have to seek too many times to answer a query. But SQLite was expedient, and I'm confident I'll always be able to read the files. That's less certain for a custom file format.


If it's going to be read-only, why not make it a Parquet file instead? It should result in a smaller file size due to columnar compression.

DuckDB has built-in capability to read Parquet files with HTTP range requests.


For this use case we need the ability to do an indexed query and extract a small number of rows from a large database. It's a traditional row-oriented database workload. I'm sure other solutions would also work, but SQLite's design melds well with the data. The migration from partitioned SQL Server tables to a collection of SQLite database files was straightforward.


If it's that small, why not compile it directly into the application? The compiler might even be able to optimize a good chunk of data away, or optimize the code based on the data.


Nothing small about it. This is terabytes of data and new files are added daily with new data. That's why we had to move it into S3. It was costing too much to store in SQL Server.

I only reach for SQLite when it's big. Otherwise, I'd prefer to just use the existing SQL Server infrastructure, or .csv.gz files in S3. Internally, I use the term "bulk data" for data sets that are stored in S3 because they were too big for SQL Server.


I think it's just immutable once you've generated it. No need to update indexes or check consistency on writes, no need for transactions, etc.


Generate it one time from a source tsv file or folder of media.


Think historical records of, say, share values for past years. You might have a single db for 1900-2000, for instance. Things like that.

Not everything needs to be real-time updated.


It’s an RODB. Ship the preindexed data blob.


> but if you're not doing writes it's really overkill.

SQLite is not just a B-Tree+ file format manager. It's also a powerful CLI with tons of built in data manipulation and extraction tools which can all be scripted.

Finally, if your data is _relational_, then a format that pereserves those relations, even if they aren't intended to be updated, is exceptionally powerfull.

> This is a custom format built to be very fast to load (cast to a c-struct and done) and very fast to search with 3 binary search indexes built in.

What is the endianess of your format?


Little endian. x86 and wasm have the same endianness so it's very straightforward. If I wanted to enable support for reading and writing on big endian systems I could patch a few lines very easily, but it seems to be a consideration unnecessary for most uses.

Instead of doing multiple columns, there's some of that with secondary and tertiary indexes, but mostly it's just key value.

To make the data more relational I find it's more natural to just have additional "columns" in separate peak files and when the peak files are queried they can be combined by headword. This is kinda like what parquet does putting all of one column together.

I could make PeakSlab have columns and have it query based off specific columns fairly easily, but it was a design decision to make things more tag based and less column based. I like that it's more flexible and less rigid in structure.

SQLite keeps everything relational and their raw filesizes are 50% larger than a .peak file, not sure what's exactly to blame, but I'm optimizing for small filesizes.


If you're not modifying data, whatever system is using the data doesn't need a database at all, it just needs a data export.


It is crashing Safari.


Works on my wife's old iphone. I don't have a mac to test things on.


something something XKCD competing standards something something


Creating something new for a different use case isn't pointless. It's like comparing inline skates to ice skates.


Believe me, I tried sticking to SQLite or aard2 or stardict, they just were fundamentally inadequate with no good pwa cross platform tooling.


Does this remain true now that SQLite has a WASM build?


Yes, because originally when I started PeakSlab it used the SQLite wasm build.


Doesn’t even apply unless someone says that (1) there are too many “standards”, and (2) so we are making this standard (neither apply here). Someone made something.

We should really consider eventually retiring memes because they just end up as thought-terminating cliches.

This is of course referring to xkcd #927. How do I know that?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: