Skip to content

Ideas for a query language #75

Open
@dominictarr

Description

@dominictarr

This is an idea for a level module, but am posting it here because it's too early in the morning/late at night to implement and this will notify exactly the right people.

level is pretty useful as a straight key value store, but sooner or later you need to query something else,
so you add an index. A lot of leveldb modules are about indexes. For example, level-seach which just indexes everything.

The trouble with all these indexes, is to manage them all - which index do you need for which query and when do you add a new index? there is a design trade off here.

It would be so much easier if you could just not think about indexes at all, just have filters,
just forEach (scan) over all the data and return only what you want. totally functional and elegant,
just not performant.

But what if you could use that interface, and it would transparently use the right indexes when possible? (or even better, generate them when necessary)

Lets say we have the npm registry, and we want to retrieve all the modules written by @substack,
even though substack has written many modules, it is still less than 1% of the registry, so a full scan
would be 99% wasted effort. Lets say that the average registry document is 5k, that is 95000*0.99*5000=470mb of unnecessary reads. 475mb total.

But lets say that we indexed all the maintainers? for simplicity lets say that the average length of both module names and user names is 8 characters. Say we create an index like {username}:{modulename} and index this for every module, 95000*(8+8+1) (1 for the separator) = 1.6mb of index data. but now we just need to read through 1% of the index (since they are ordered by username we can just jump straight to substack) 950*17 = 15200 that is only 15k of unnecessary data.
So, now we only read 15200 + 950*5000 = 4.7mb

Using this index we increase storage requirements by 1.6 mb but increase read perf by 100x.

But what about a more complex query? suppose we want to find modules written by substack that depend (directly) on async? now, a simple way to do this would be scan through the modules written by substack and check whether async is in the deps. Lets say there was a time when substack used async, in 9 modules. 1% of the time. the scan costs us 4.7 mb, like before, but we discarded 99% of it.

Okay, so lets create a filter for this. udm:{username}:{dep}:{module} where module depends on dep
this index is more expensive because it has needs another 9 characters, plus we need a tag to distinguish it from the other indexes. 3+1+3*(8+1) = 29 so this index would need 2.75mb.
how much data would it save us? now we could directly read out the 9 substack modules that use async, 29*9+ 9*5000 = 45k this is 1000 times less! but here is the problem: how do we know we want to perform this query? what if we want to know all the users that depend on optimist? we'd also need a dum:{dep}:{username}:{module} index - and so on, we want our database to be really really fast,
so lets just index every pair of properties twice!

Lets say there are 20 properties on average, how many combinations of 2 properties?
(20!/(2!*(20-2)!) which cancels down to (20*19)/2 = 275), times 2 because we need 2 indexes,
gives us: 550*29 = 15950, that is 3 times the size of the document. if there where 30 properties it would be 40*39*29 = 45240. doubling the size of the document increases the size of the index 3 times,
now 9 times the size of the original document.

Now, at some point, it's better to do a scan than have another index, because the indexes take up too much space, and optimize queries that you never use. Clearly this is dependant on what sort of queries you do in practice, possibly you could have a system that automagically creates indexes when you start to use a query more. maybe even could it know when you have an expensive query?

But is there a nice way that you can gracefully transition between those?
decouple the query interface from the indexes, so you can add and remove indexes
without changing the queries at all. And measuring the efficiency of the queries,
so that you know when new indexes are needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions