Monday, July 18, 2011

MongoDB and JOINs

MongoDB's Great!

MongoDB's a popular NoSQL database that i've heard of for the last 4 months or so but never had any time to consider doing something until last week. I guess the lack of enthusiasm has a lot to do with the fact that 'traditional' databases like Oracle can be quite a RPITA to work with but they've withstood the test of time - though i admit things are changing pretty fast these days. MongoDB is really a good database tool but it's got a catch!

MongoDB's Great BUT ...

This catch was a direct result of my fictitious assumption that MongoDB supports the concept of multiple table queries a.k.a SQL JOIN e.g. a SQL expression below illustrates

'SELECT CAR.plate, CAR.owner, Fines.owed from CAR, Fines where CAR.owner = Fines.doneBy'

which would retrieve all records done by a particular owner when he/she ran a traffic fine. Contrived really but i quickly realized it couldn't be done with MongoDB, at least not now anyway. I was kind of annoyed but forgiving for reasons that i've come to expect the same lot with Oracle Databases but realized MongoDB != OracleDB.

MongoDB is different. Accept that. MongoDB is simpler. Accept that too.

For someone whose played with MongoDB would realize that its operations evolved from the concept of collection & document; w.r.t SQL parlance its table and row respectively. You thought this is the only difference it has? You're wrong. There're others i've discovered which includes Stored Javascript (akin to Stored Procedures), Indexing etc which are freaking cool and the data structures that represent all forms of data you can store or retrieved from it is based in BSON. For web folks, you know what this means don't ya? ;) for everyone else, it simply means you can get your apps into market at a faster rate.


Back to what you were saying about JOIN and MongoDB ...

JOINs are pretty important in a lot of applications since everything we know is relational, in SQL parlance. SQL databases have taught us that all data is relational and we need to think in terms of tables and rows. If a relationship needs to established, then we need to select data from multiple tables and derive it.

Question is whether we can get around it? Spending 2 hrs mucking around MongoDB's documentation revealed two things (1) MongoDB's MapReduce (2) write one yourself.

Obviously i attempted the easier solution and a member on stackoverflow seem to think its possible. But i ran aground. To understand why, the MapReducer's reducer function has a primary purpose and it is to reduce the size of key value pairs the system has to handle at any point in time and hence it prohibits code like

// aggregating values to the array
function(k,v) {
  var ret = v;
  return v;
}
or
// a.k.a bypass
function(k, v) {
}

MongoDB will complain that it doesn't like the fact your reducer does nothing, and also aggregating values to an array since it does absolutely nothing to reduce the key value pairs. There might be other ways to do this but at the time of this writing, i'm not aware of any (Please drop me a mail if you do). Turns out somebody has filed a feature support ticket to sort out this JOIN. Yay!

Parser for MongoDB multi-collection query

I had no choice but to attempt option (2) which was fun! cos i get to built a simple SQL parser and translate the typical SQL syntax to a multi-collection/table query. The approach i took is something about the line that i like to use a SQL-like syntax that supports multi-table/collection (depending on your parlance) and allows me/user to write something like

select a.name, a.data, b.name, b.data, c.name, c.data from a, b, c where a.name = b.name and b.name = c.name

while underneath the machinery will conduct the query against MongoDB. The implementation approach is pretty simple where i parse the expression and isolate all expressions belonging to each table and keep it in a hashtable like data structure. What happens next is that each table query will be sent against Mongo and retrieved records are stored in RAM.

As Mongo can't do multi-joins yet, the approach is to conduct a O(N^N) where N's the number of tables operation where the JOIN query is run against each known bunch of records, in turn i.e. for a for b for c, and finally apply the SQL LIMIT expression to return the final collection of records.

There are obvious short comings of this, i don't claim to have the answers to them. One shortcoming is that it doesn't scale very well as its limited by the RAM/DRAM available and its a compute intensive job. And if you're interested please visit https://github.com/raygit/m2query All feed back is welcome!

0 comments: