Trending hashtags on twitter in 3 steps with MongoDB

by Nosh | Apr. 1, 2012, 10:51 PM

One of the upcoming features of MongoDB is the new aggregation framework.
Here is a quick example to get you started. In 3 simple steps, you can be calculating the trending hashtags on Twitter.

Step 1: If you haven’t already, download and install MongoDB v2.1
Here is what I do on my Mac:

Noshs-MacBook-Air:~ nosh$ curl http://downloads.mongodb.org/osx/mongodb-osx-x86_64-2.1.0.tgz > mongo.tgz
Noshs-MacBook-Air:~ nosh$ tar xzf mongo.tgz
Noshs-MacBook-Air:~ nosh$ sudo mkdir -p /data/db/
Noshs-MacBook-Air:~ nosh$ sudo chown `id -u` /data/db
Noshs-MacBook-Air:~ nosh$ ./mongodb-osx-x86_64-2.1.0/bin/mongod

Now your MongoDB 2.1.0 server is up and running.


Step 2: Get some Twitter data into MongoDB
This is surprisingly simple. Since Twitter’s streaming API outputs JSON, you can pipe it directly into MongoDB with mongoimport. Just run this command:

curl https://stream.twitter.com/1/statuses/sample.json -uUSERNAME:PASSWORD | ./mongodb-osx-x86_64-2.1.0/bin/mongoimport -d twitter1 -c tweets

(Substitute your Twitter username and password.)

This will start streaming a sample (about 1%) of Twitter status updates into a MongoDB collection called ‘tweets’ in a database called ‘twitter1’. There are a lot of fields in each document. Check out the Twitter API documentation to see what gets sent with each status update.
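
To make the query in the next step easier to follow, here is a rough sketch of the only part of a status document that the query touches. The nesting under entities.hashtags follows the Twitter API; the values are made up for illustration, the helper hashtagTexts is hypothetical, and a real document carries many more fields.

```javascript
// A heavily trimmed, made-up status document. Only entities.hashtags[].text
// is used later; every value here is illustrative, not real data.
const sampleTweet = {
  text: "Watching the show tonight #WrestleMania #teamfollowback",
  entities: {
    hashtags: [
      // Each hashtag is an object, not a bare string, which is why the
      // query refers to the nested "text" field.
      { text: "WrestleMania", indices: [26, 39] },
      { text: "teamfollowback", indices: [40, 55] }
    ]
  }
};

// Hypothetical helper: collects the values that the dotted path
// "entities.hashtags.text" reaches in one document.
function hashtagTexts(tweet) {
  const tags = (tweet.entities && tweet.entities.hashtags) || [];
  return tags.map(function (tag) { return tag.text; });
}

console.log(hashtagTexts(sampleTweet));
```

A tweet without hashtags has nothing to read at that path, which is why the query filters on it before counting.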


Step 3: Start up the MongoDB shell and run this query

Noshs-MacBook-Air:~ nosh$ ./mongodb-osx-x86_64-2.1.0/bin/mongo
MongoDB shell version: 2.1.0
connecting to: test
> use twitter1
> db.tweets.aggregate({$sort:{"_id":-1}}, {$match: {"entities.hashtags.text":{$exists:true}}}, {$limit:10000},{$unwind:"$entities.hashtags"}, {$project : {"entities.hashtags.text":1,"_id":0}}, {$group:{"_id":{$toLower:"$entities.hashtags.text"}, count : { $sum : 1 }}}, {$sort:{"count":-1}}, {$limit:5})


That looks a bit complicated. It's actually not. Here is what is happening:

db.tweets.aggregate(

//let's take the 10,000 most recent tweets with hashtags
{$sort:{"_id":-1}}, {$match: {"entities.hashtags.text":{$exists:true}}}, {$limit:10000},

//hashtags are stored in an array, so separate them out              
{$unwind:"$entities.hashtags"}, 

// use the text of the hashtag
{$project : {"entities.hashtags.text":1,"_id":0}}, 

//group on the hashtag and add 1 for every occurrence
{$group:{"_id":{$toLower:"$entities.hashtags.text"}, count : { $sum : 1 }}}, 

//finally sort the result and only show me the top 5
{$sort:{"count":-1}}, {$limit:5});
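
If the stages are still hard to picture, the sketch below walks the same steps over an in-memory array in plain JavaScript. It is not how MongoDB executes the pipeline, only an illustration of what each stage does to the data. The function name trendingHashtags and the sample tweets are invented for this example, the tweets are assumed to be in newest-first order already, and the code is meant for Node.js rather than the mongo shell.

```javascript
// Hypothetical stand-in for the pipeline, operating on a plain array.
function trendingHashtags(tweets, sampleSize, topN) {
  // $match + $limit: keep tweets with at least one hashtag,
  // then take the first sampleSize of them.
  const recent = tweets
    .filter(t => t.entities && t.entities.hashtags && t.entities.hashtags.length > 0)
    .slice(0, sampleSize);

  const counts = new Map();
  for (const tweet of recent) {
    // $unwind: visit each hashtag of the tweet on its own.
    for (const tag of tweet.entities.hashtags) {
      // $project + $group: use only the text, lower-cased, and add 1 per occurrence.
      const key = tag.text.toLowerCase();
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }

  // $sort on count (descending) + $limit.
  return Array.from(counts, ([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, topN);
}

// Made-up sample data, newest first.
const sampleTweets = [
  { entities: { hashtags: [{ text: "WrestleMania" }, { text: "TeamFollowBack" }] } },
  { entities: { hashtags: [] } },
  { entities: { hashtags: [{ text: "wrestlemania" }] } },
  { entities: { hashtags: [{ text: "teamfollowback" }] } },
  { entities: { hashtags: [{ text: "WRESTLEMANIA" }] } },
  { entities: { hashtags: [{ text: "acms" }] } }
];

console.log(trendingHashtags(sampleTweets, 10000, 2));
// [ { _id: 'wrestlemania', count: 3 }, { _id: 'teamfollowback', count: 2 } ]
```

Lower-casing before counting is what folds "WrestleMania" and "WRESTLEMANIA" into a single row, just as $toLower does in the $group stage.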


And here is what I get:

{
        "result" : [
                {
                        "_id" : "wrestlemania",
                        "count" : 682
                },
                {
                        "_id" : "acms",
                        "count" : 246
                },
                {
                        "_id" : "promocaocdbrasil",
                        "count" : 194
                },
                {
                        "_id" : "paniconaband",
                        "count" : 139
                },
                {
                        "_id" : "teamfollowback",
                        "count" : 122
                }
        ],
        "ok" : 1
}
> 
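
The shell hands back a single document whose result field holds the grouped rows, as in the output above. Here is a small sketch of turning that document into a readable list. formatTrending is a hypothetical helper, and the literal copies the first two rows of my output; the code is written for Node.js, so in the mongo shell you would use print instead of console.log.

```javascript
// Hypothetical helper: formats a document of the shape { result: [...], ok: 1 }
// (the shape shown in the output above) as numbered text lines.
function formatTrending(response) {
  if (response.ok !== 1) {
    throw new Error("aggregation did not report ok: 1");
  }
  return response.result.map(function (row, i) {
    return (i + 1) + ". #" + row._id + " (" + row.count + ")";
  });
}

// First two rows copied from the output above.
const sampleResponse = {
  result: [
    { _id: "wrestlemania", count: 682 },
    { _id: "acms", count: 246 }
  ],
  ok: 1
};

console.log(formatTrending(sampleResponse).join("\n"));
// 1. #wrestlemania (682)
// 2. #acms (246)
```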

I guess Wrestlemania is pretty popular right now!
Pretty simple, right? Let me know if you come up with some more examples.





I live in New York City. I work at 10gen on MongoDB. This blog is a collection of my thoughts on technology, open source, cloud computing, and random things I come across.
