Trending hashtags on Twitter in 3 steps with MongoDB
by Nosh | Apr. 1, 2012, 10:51 PM
One of the upcoming features of MongoDB is the new aggregation framework.
Here is a quick example to get you started. In 3 simple steps, you can be calculating the trending hashtags on Twitter.
Step 1: If you haven’t already, download and install MongoDB v2.1
Here is what I do on my Mac:
Noshs-MacBook-Air:~ nosh$ curl http://downloads.mongodb.org/osx/mongodb-osx-x86_64-2.1.0.tgz > mongo.tgz
Noshs-MacBook-Air:~ nosh$ tar xzf mongo.tgz
Noshs-MacBook-Air:~ nosh$ sudo mkdir -p /data/db/
Noshs-MacBook-Air:~ nosh$ sudo chown `id -u` /data/db
Noshs-MacBook-Air:~ nosh$ ./mongodb-osx-x86_64-2.1.0/bin/mongod
Now your MongoDB 2.1.0 server is up and running.
Step 2: Get some Twitter data into MongoDB
This is surprisingly simple. Since Twitter’s streaming API outputs JSON, you can pipe it directly into MongoDB with mongoimport.
Just run this command:
curl https://stream.twitter.com/1/statuses/sample.json -uUSERNAME:PASSWORD | ./mongodb-osx-x86_64-2.1.0/bin/mongoimport -d twitter1 -c tweets
(substitute your Twitter username and password)
This will start streaming a sample (about 1%) of Twitter status updates into a MongoDB collection called ‘tweets’ in a database called ‘twitter1’. Each document has a lot of fields; check out the Twitter API documentation to see what gets sent with each status update.
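The only part of each document that the query in Step 3 cares about is entities.hashtags, an array with one entry per hashtag. Here is a rough sketch of that shape in plain JavaScript. The status update below is invented and heavily trimmed, and hashtagTexts is a hypothetical helper I made up for illustration, not part of any Twitter or MongoDB API.

```javascript
// Hypothetical, heavily trimmed status update (assumption: real ones carry
// many more fields). The stream delivers one JSON document per line.
const line = '{"text":"Trying the new #MongoDB #Aggregation framework",' +
  '"entities":{"hashtags":[{"text":"MongoDB"},{"text":"Aggregation"}]}}';

// Illustrative helper: parse one line of the stream and return the hashtag
// texts, or an empty array when the document has no entities.hashtags field.
function hashtagTexts(jsonLine) {
  const doc = JSON.parse(jsonLine);
  const tags = (doc.entities && doc.entities.hashtags) || [];
  return tags.map(tag => tag.text);
}

console.log(hashtagTexts(line));
```

Documents without any hashtags come back as an empty list here, which is the same case the $match stage filters out in the query.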
Step 3: Start up the MongoDB shell and run this query
Noshs-MacBook-Air:~ nosh$ ./mongodb-osx-x86_64-2.1.0/bin/mongo
MongoDB shell version: 2.1.0
connecting to: test
> use twitter1
> db.tweets.aggregate({$sort:{"_id":-1}}, {$match: {"entities.hashtags.text":{$exists:true}}}, {$limit:10000},{$unwind:"$entities.hashtags"}, {$project : {"entities.hashtags.text":1,"_id":0}}, {$group:{"_id":{$toLower:"$entities.hashtags.text"}, count : { $sum : 1 }}}, {$sort:{"count":-1}}, {$limit:5})
That looks a bit complicated. It's actually not. Here is what is happening:
db.tweets.aggregate(
//let's take the 10,000 most recent tweets with hashtags
{$sort:{"_id":-1}}, {$match: {"entities.hashtags.text":{$exists:true}}}, {$limit:10000},
//hashtags are stored in an array, so separate them out
{$unwind:"$entities.hashtags"},
// use the text of the hashtag
{$project : {"entities.hashtags.text":1,"_id":0}},
//group on the hashtag and add 1 for every occurrence
{$group:{"_id":{$toLower:"$entities.hashtags.text"}, count : { $sum : 1 }}},
//finally sort the result and only show me the top 5
{$sort:{"count":-1}}, {$limit:5});
And here is what I get:
{
    "result" : [
        {
            "_id" : "wrestlemania",
            "count" : 682
        },
        {
            "_id" : "acms",
            "count" : 246
        },
        {
            "_id" : "promocaocdbrasil",
            "count" : 194
        },
        {
            "_id" : "paniconaband",
            "count" : 139
        },
        {
            "_id" : "teamfollowback",
            "count" : 122
        }
    ],
    "ok" : 1
}
>
I guess Wrestlemania is pretty popular right now!
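If the pipeline stages still feel abstract, here is a rough sketch of the same logic in plain JavaScript, run against an in-memory array instead of a collection. The function name trendingHashtags and its parameters are made up for illustration, and I am assuming each tweet looks like {_id, entities: {hashtags: [{text: "..."}]}}. This is not how MongoDB implements aggregation, just a way to see what each stage contributes.

```javascript
// Illustrative emulation of the pipeline stages (assumption: simplified
// tweet documents with a comparable _id and an entities.hashtags array).
function trendingHashtags(tweets, sampleSize, topN) {
  const recentWithTags = tweets
    .slice() // copy so the caller's array is not reordered
    // $sort: {_id: -1} puts the newest documents first
    .sort((a, b) => (a._id < b._id ? 1 : a._id > b._id ? -1 : 0))
    // $match: keep tweets that have at least one hashtag with a text field
    .filter(t => t.entities && Array.isArray(t.entities.hashtags) &&
                 t.entities.hashtags.some(h => h.text !== undefined))
    // $limit: keep only the most recent sampleSize tweets
    .slice(0, sampleSize);

  const counts = new Map();
  for (const tweet of recentWithTags) {
    // $unwind: handle each hashtag of the tweet as its own record
    for (const tag of tweet.entities.hashtags) {
      // $project + $toLower: only the lowercased text matters
      const key = String(tag.text).toLowerCase();
      // $group with $sum: 1 adds one per occurrence
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }

  return Array.from(counts, ([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count) // $sort: {count: -1}
    .slice(0, topN);                   // $limit
}
```

Calling trendingHashtags(tweets, 10000, 5) on an array of such documents returns objects shaped like the entries under "result" above, with the lowercased hashtag in _id and the number of occurrences in count.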
Pretty simple, right? Let me know if you come up with some more examples.
I live in New York City. I work at 10gen on MongoDB. This blog is a collection of my thoughts on technology, open source, cloud computing, and random things I come across.