Saturday, August 27, 2011

Mongodb Tip #1 : dumping bson into legit json objects

Well, it's a bit embarrassing to start off with a hack, but what the hey... it made my life instantly easier.
So here was my challenge... to dump to file 40,000 BSON (JSON-like) objects from my mongodb so that I could slurp them up with Groovy 1.8 and its new JsonSlurper.  Then I'd be able to use dot notation to get the values I want.

Here's the sequence of what I did. BTW, in case you didn't know, there's no native method within a mongo session to just say db.hats.dumpAll()... at least I haven't found it yet. So you need to get outside the normal mongodb command line session and use these utilities that come with the mongodb distribution.

/var/lib/mongodb-linux-i686-1.8.2/bin/mongodump --db mydb --port 27017 --collection hats --query '{ }' --out dumper 
/var/lib/mongodb-linux-i686-1.8.2/bin/bsondump dumper/mydb/leads.bson

First, mongodump dumps the whole collection I called "hats".  But it's not in human readable form.  You need bsondump for that.  bsondump, by default, converts every row of mongodump output to a BSON string.

So, I thought I could just write a few lines of Groovy to convert each JSON string into an object and use dot notation to dereference the values I wanted.  But I forgot I was still dealing with a BSON string.  My problem was this:

{ "_id" : ObjectId( "4e56d1c780acbde57e951402" ), "size" : "8", "color...

As you can see, the ObjectId string is not itself in quotes and therefore messes up JsonSlurper. So I submit the following solution that worked for me.

def f = new File(fname)
def lineCount = 0
f.eachLine { line ->
        def line2 = line.replaceFirst('ObjectId\\(','') .replaceFirst('\\),',',')
        try {
        def slurper = new JsonSlurper()
        def res = slurper.parseText(line2)
        println res.size + "|" + res.color
        lineCount++
        } catch (Exception e) {
                e.printStackTrace()
        }
}
println lineCount + " lines encountered."

I used replaceFirst instead of replaceAll() to avoid overkill and generally screwing up other innocent content that might contain the right matching parentheses with comma.  Unlikely, but safe(r).

By the way, the reason for the exception handling is that bsondump outputs some stats, not in BSON format,  every now and then.  Odd.  Obviously, I could have handled that more gracefully, but the exception handling did the job and the lines encountered gave me the target number I was looking for.

That's it.  Got the job done.  Enjoy.

David

Sunday, August 21, 2011

mongodb goodness, so far...

Lately, I've been working pretty extensively with mongodb.  I classify it as a "JIT DB", as in Just-In-Time Schema Database.  It's perfect for lazy moments when you're writing some code and it dawns on you that you need an additional field or even an additional table (called "collections" in Mongo).

"Lazy" is the wrong word.  mongoDB is in a class of technologies and strategies that foster inspired notions and reduce barriers (like time and patience) to assert your ideas. SQL doesn't do that for me.  The level of required schema pre-work and retrofitting has nipped some cool ideas in the bud... mongoDB encourages me to do it right now because I don't see any impedance!  Throw together shell scripting, Groovy and mongoDB and let's just do it!

Here's a quick example that will hopefully illustrate for you the low impedance of mongoDB (and lots of other unSQL databases)...

Let's sort a table called myData by a timestamp field.

db.mydata.find().sort({timeStamp:-1}).

This is equivalent to

select * from mydata order by timeStamp desc;

mongodb comes back and says something to the effect of "can't do a big sort like this without an index."  Well there you go... so you type

db.mydata.ensureIndex({timeStamp:1})

You try the sort again and it works. You've just experienced something like a conversation with your database!  "I can't do this... you know what to do..."

In full disclosure, I acutally use Groovy for all my Java-style development now.  I've completely lost interest in Java because 1) Groovy is way more satisfying and productive and 2) I, currently, have no
 need to use Java for squeezing max performance out of code.  I mostly use Groovy for batch-style work, updating Salesforce.com via the Web Services API and such.

With mongoDB there's a nasty little conceptual hurdle to jump over, especially switching back and forth between using native javascript commands and Java driver programming.
In Java, there are at least 2, 3 or more ways to construct a db operation

def doc = new BasicDBObject().append("lastName","Smith").append("firstName","Jack")
myData.query(doc)

...versus...

def person = [:] // Groovy syntax
person.lastName = "Smith"
person.firstName= "Jack"
def doc = new BasicDBObject(person)
myData.query(doc)

Straightforward enough. In the native mongo language, the query looks something like this...

db.myData.find({lastName:"Smith",firstName:"Jack"})

which is equivalent to

select * from myData where lastName = 'Smith' and firstName = 'Jack';

Now, because I'm a Linux guy and I love the power of intermingling shell scripts to glue Java/Groovy together as needed, here's one way way you might integrate that mongoDB (javascript) script language in a bash shell script using a here document.

function findUser {
        lastName=$1
        firstName=$2
        mongo <<EOF
        use employeeDB

        var criteria = {lastName:"${lastName}",firstName:"${firstName}"}
        var answer = db.leads.find(criteria)
        answer.count()

EOF
}


findUser Smith Joe


Enough for now.  For the next few weeks, I'll post some of the mongodb commands and concepts I found most useful.