LMDB Preliminary Tech Report
Summary
- General description of lmdb
- Differences with LevelDB
- lmdb API
- lmdb caveats
- lmdb++ caveats
- Benchmarks
- Conclusion
NB: This is the result of a preliminary study of the lmdb database library in the context of maintaining a persistent database of UTXOs in BCHN node software.
General description of lmdb
lmdb is a memory-mapped key-value btree-based database library.
- Memory-mapped: Data is stored on disk and loading data into memory is delegated to the OS via memory-mapping. This is said to be faster than manually managing memory.
- Key-value: self descriptive
- Btree-based: data is stored on a btree (by key). This seems to optimize for read latency.
Differences with LevelDB
BCHN currently uses LevelDB to store UTXOs.
- LevelDB is log-based (append only), so it's more friendly to HDD type memory.
- LevelDB requires periodic compaction of data, while lmdb requires no ongoing maintenance (the btree rebalances directly on writes).
- LevelDB has lots of configuration knobs (RocksDB even more), while lmdb has a handful; the defaults on lmdb are intentionally restricting, so that devs familiarize with them and add what they need.
- lmdb uses significantly more disk-space (Preliminary test showed a 4x).
lmdb API
The API is comfortably straightforward.
The initialization is a little exotic, but well documented. After that the db interaction is with straightforward get
, put
and
del
function calls.
There is also the ability to open a cursor
to streamline reads. The documentation also mentions that cursor
s can streamline writes if the keys to be written
are pre-ordered for bulk preloading of data. It is unclear if this works on batch-writing to an existing database (TODO).
lmdb
has ACID semantics, which we don't care about much, since we do either reads or writes.
For ease of use I used the C++ wrapper lmdb++. It is comfortable because it provides RAII semantics for lmdb objects. It has its own set of caveats.
lmdb
caveats
- It can be considerably slower than LevelDB on HDD memory (because of frequent random access).
- It can also be slower on low-end machines.
- Being a btree database, performance is stellar when database is small. Be sure to benchmark on real-life sizes for the db. Disregard benchmarks that are done on small datasets (eg. https://mozilla.github.io/firefox-browser-architecture/text/0015-rkv.html)
- I have noticed a distinctive drop in performance once the db size exceeds RAM. Commit times roughly double. I was not able to mitigate this with configuration flags (but needs more research - TODO).
- Database size needs to be set at initialization. This means that additional code needs to be maintained for when a limit is hit and the db needs to be resized. Dagur had implemented such code for BitcoinXT in 2018 here, which might make sense to port over.
- Write batches need to be small-ish, smaller than available RAM, else swap will step in. Disk flushes can be deferred with the
NOSYNC
flag to do smaller commits, but only one disk write at the end. Inmaster
the default batch size in IDB is currently some 900MB, which is big. - The
WRITEMAP
flag is documented to improve performance on databases smaller than RAM. I have not been able to confirm a speedup with this flag. - The
MAPASYNC
flag is documented to improve write performance, on databases smaller than RAM and ifWRITEMAP
is enabled. I did see a considerable improvement on writing speeds with this flag. - The semantics of a crash with the
MAPASYNC
flag are unclear: will it leave the db in a previous consistent state, or an eintirely inconsistent state?
lmdb++ caveats
- The RAII semantics are comfortable.
- the
lmdb::val
object is central to interacting with the wrapper. -
the
lmdb++
APIs are wonky in how they accept thelmdb::val
objects. The different overloads in the API might take over and break things, without a warning. For example the following will work:lmdb::val key(stream.data(), stream.size()), value; dbi.get(txn, key, value);
But this might choose a wrong overload:
dbi.get(txn, make_val_from(stream), value);
The advisory here is to not use the overloaded functions at all, and use the the
lmdb::dbi_*
methods (which all acceptMDB_val *
, simply converted to fromlmdb::val
s).
Benchmarks
A minimum viable changeset to benchmark lmdb performance was implemented in https://gitlab.com/matricz/bitcoin-cash-node/-/tree/lmdb. NB: The LevelDB code has not even been ripped out as it's just a prototype. In the future, I would prefer to port Dagur's work on BitcoinXT, instead of working on top of this.
Preliminary tests were done on my laptop, and Digital Ocean instances with standard and NVMe SSDs.
I have a Nitro5 laptop with a very fast NVMe SSD, running the lmdb branch on WSL1/Debian. This is suboptimal and only to be regarded as anecdotal, since WSL can scramble IO times. This being said, I could obtain write rates 40% better on databases sizes smaller than RAM versus LevelDB, which is a nice improvement.
I did a full IBD with default options on a Digital Ocean VM instance (Storage Optimized, 32GB, 4CPUs, 600GB NVMe, $250/mo) both with master and this lmdb branch.
The lmdb build completed mainnet IBD in 3 hours and 6 minutes, while the master build completed mainnet IBD in 3 hours and 11 minutes. The write times on master were consistently around 12 seconds, while on lmdb they went from 15 seconds, to a steep jump to 30 seconds (at a point I presume to be the RAM limit), to slowly building up to 40 seconds.
Nontheless, lmdb finished faster, which could lead to faster read times. Still, this is not statistically significant.
Future work
We spend only a small amount of time writing to the UTXO database and reading from it is singlethreaded. Hence, any performance gains in the UTXO database, no matter how significant, will bring only a modest global performance improvement to the node as a whole.
On the other hand, if/when we are able to bring better parallelism to block validation, then db performance improvements could become proportionally more significant. lmdb is known to improve read latency in multithreaded applications, which would suit our usecase nicely.
We cannot remove LevelDB backend entirely without ruling out people who run the node on HDDs. A mechanism for chosing the db backend (between current LevelDB and lmdb) should be provided (as implemented by Dagur for BitcoinXT).
Future work: since lmdb maps the data to (system managed) memory, and effectively using it as an in-RAM cache, we could possibly remove the dbcache
entirely, IF read times prove to be fast and
reliable enough.
Links:
- Repository link of this announcement: GitLab