My last day at 10gen

by Nosh | Jul. 3, 2012, 2:37 PM

Friday was my last day with 10gen working on MongoDB.

My first day at 10gen was December 1, 2009. The stable release of MongoDB was v1.0 (v1.2 would be released 9 days later). 10gen was Dwight, Eliot, Mike, Kristina, Mathias and Kyle huddled around a desk in a shared office on the 8th floor of 17 W 18th Street (and Aaron based out of the Bay Area). SourceForge had started talking about how they were migrating large portions of their site to MongoDB.

10gen and MongoDB have come some ways since then. More than any milestone, though, what sticks out for me are the thousands (quite literally!) of interactions I’ve had with MongoDB users over the past 2 years and 7 months. These users were taking a bet on a cutting-edge product. They were energized that databases can be fun to work with. And they were eager to be part of a group that is upending traditional notions of what enterprise software can be. The same energy and camaraderie exists today - among a developer community that now probably numbers more than a few hundred thousand. That is quite amazing, and I’m sure that there are many great days ahead for 10gen.

I’ve written about what it is like to work at 10gen, about community building, and a lot about open source software. Being part of 10gen over the past nearly 3 years has encompassed all of that - and it’s been a unique combination of product + timing + people + company culture + business model. I’m going to miss it. I’ve learned a tremendous amount, though, and I’m excited to apply that to new problems, challenges, and opportunities.

For those of you who only have my old 10gen contact info - I can be reached on email, linkedin, and twitter. If you are cooking up something in the cloud/data/mongodb areas, or just want to say hi, drop me a line!




What’s happening in the enterprise storage market?

by Nosh | Jun. 21, 2012, 4:01 PM

The enterprise storage market is huge. The top 5 players in the space (EMC, NetApp, IBM, HP, Hitachi) sell about $20 billion a year of storage systems. Although that revenue isn’t going anywhere quickly, it’s interesting to look at some recent trends that are sure to impact this segment over time.

Flash everywhere

For many applications, the days of spinning hard drives are numbered. The MacBook Air I’m running on doesn’t have a spinning drive. Neither does your phone. It’s all solid state storage. Server hardware is going in that direction as well with various permutations of flash/solid state memory. With falling prices per GB, increasing capacities, and higher reliability, solid-state storage is an increasingly easy choice for applications that demand high-throughput I/O or low-latency access to data. But it’s not just a raw GB cost vs performance argument. Along with the performance gains, solid state storage also uses much less power and space, which makes a huge difference at scale. Wired just published an article about this trend, which is worth a read.

In the component market, you have “commodity” SSDs, which are showing up increasingly. You then have “high end” solid state component players such as Fusion-io and Virident. Vendors such as Violin Memory are building out storage arrays built only on (and optimized for) flash memory. And of course, traditional storage vendors such as EMC are incorporating flash into their disk-based storage arrays, coming out with various flash-based products, and acquiring vendors in this area.

The one place where flash adoption has lagged is in large public clouds - most likely because of cost and reliability concerns. That’s changing, however: Rackspace, HP, and Microsoft Azure have all announced SSD-based storage options in the last few months. And it’s a good guess that we’ll see solid state options in most clouds over this year.

Bring processing closer to the data

Starting in the 90’s, the trend has been to centralize storage on appliances - and companies such as NetApp and EMC have benefitted greatly from this trend. From an administrative standpoint, this makes it easier to manage, back up, and provide high service levels around data. However, with data volumes growing faster than network capacity, and the need for more real-time data processing, this trend is undergoing a small (but growing) reversal. Systems such as Hadoop closely couple compute and storage as an explicit part of their design. In most cases, this means moving storage back into the server (where the compute lives), rather than trying to centralize it in an appliance over the network. Some flash component vendors, such as Fusion-io, are pushing this approach as well - with specialized APIs that allow applications to treat solid-state devices as if they were an extension of RAM.

Scale-out architectures & commodity hardware

Most databases and file systems designed in the past few years have moved to a model that eschews complex, high-end hardware and instead runs on (possibly virtualized) commodity servers. HDFS and GlusterFS are examples in the filesystem world, and MongoDB, Cassandra, and Riak are examples of this trend in the database market. Unlike systems like Oracle RAC, which require shared storage, most of these systems can work equally well (and sometimes better) on standard server hardware with direct-attached (solid-state or regular) disks. Instead of relying on a single high-end server and a single centralized piece of storage hardware, these systems put the ‘smarts’ for things like high-availability into the software layer so that the database or filesystem spread across a large number of servers appears as a single unit to an application. In theory this allows near-limitless scalability, as you can increase storage (and processing) capacity by adding servers to the cluster. This model is a very good fit for cloud-like environments where the emphasis is on standardized virtualized machines, rather than specialized hardware.
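
As a rough illustration of how a software layer can make many servers look like one unit, here is a minimal sketch in plain JavaScript. The hashKey and serverFor helpers and the server names are hypothetical, and the simple hash-modulo placement is an assumption chosen for brevity; it is not how MongoDB, Cassandra, or Riak actually place data.

```javascript
// Hypothetical sketch: route each key to one of N commodity servers.
// Hash-modulo placement is an assumption for illustration only.
function hashKey(key) {
  // Simple deterministic string hash, kept in the unsigned 32-bit range.
  let h = 5381;
  for (const ch of key) {
    h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

function serverFor(key, servers) {
  return servers[hashKey(key) % servers.length];
}

const servers = ["server-a", "server-b", "server-c"];
const keys = ["user:1", "user:2", "user:3", "user:4"];

// The application asks one question ("where does this key live?")
// and never needs to know how many machines sit behind the answer.
for (const key of keys) {
  console.log(key, "->", serverFor(key, servers));
}

// Adding capacity is just adding another entry to the list.
const biggerCluster = servers.concat(["server-d"]);
console.log("user:1 now ->", serverFor("user:1", biggerCluster));
```

With plain modulo placement, adding a server moves many existing keys around, which is one reason real systems tend to use more careful placement schemes.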

So what does this all mean? As solid-state storage becomes the default, I think we’ll see a lot more software optimized specifically for this mode of storage. With spinning disks, there is a huge difference between random and sequential I/O (for each random I/O the drive has to physically seek and spin to the right spot). However, with solid-state storage, this disparity is effectively eliminated. There are other differences as well; for example, erases take an order of magnitude longer than writes on most SSDs. So there is going to be a lot of innovation as software developers adapt to solid-state storage becoming the default. On the hardware side, incorporating more flash-based options into traditional storage appliances will definitely extend the capabilities of appliances. However, the trend towards coupling storage and compute on commodity hardware will pose more of a challenge to enterprise storage vendors. We are still a ways away from having distributed databases or filesystems that run on commodity hardware with all the bells, whistles, tools, and management capabilities that the current generation of storage appliances have - but this area is rapidly evolving.
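
A back-of-the-envelope calculation shows how large the random versus sequential gap can be on a spinning disk. The drive figures below (7,200 RPM, 9 ms average seek, 100 MB/s sequential transfer, 4 KB reads) are illustrative assumptions, not measurements of any particular drive.

```javascript
// Assumed, illustrative drive characteristics (not measurements).
const rpm = 7200;
const avgSeekMs = 9;
const sequentialMBps = 100;
const readSizeKB = 4;

// On average the platter must turn half a revolution to reach the data.
const msPerRevolution = 60000 / rpm; // 8.33 ms at 7,200 RPM
const avgRotationalLatencyMs = msPerRevolution / 2;

// Each random read pays seek time plus rotational latency
// (the tiny transfer time for 4 KB is ignored here).
const msPerRandomRead = avgSeekMs + avgRotationalLatencyMs;
const randomReadsPerSecond = 1000 / msPerRandomRead;
const randomMBps = (randomReadsPerSecond * readSizeKB) / 1024;

console.log("random reads per second:", randomReadsPerSecond.toFixed(0));
console.log("random throughput (MB/s):", randomMBps.toFixed(2));
console.log("sequential / random ratio:", (sequentialMBps / randomMBps).toFixed(0));
```

Under these assumptions the disk manages roughly 76 small random reads per second, which moves a few hundred times less data than it could stream sequentially. A solid-state device has no seek or rotation step, which is why the gap narrows so sharply.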

What is your take on these trends?




cloud = employee enablement

by Nosh | Apr. 21, 2012, 4:37 PM

Dear Amazon,

A few weeks ago, you published a whitepaper, “The Total Cost of (Non) Ownership of a NoSQL Database Cloud Service”. The paper compares the cost of running an “open source NoSQL database” in a datacenter/co-lo vs. running it on EC2 and EBS vs. using DynamoDB. No surprises here - the TCO of using DynamoDB was the lowest, followed by running on AWS, while running it in a private data center was the most expensive. I’m going to leave aside the fact that this is an apples-to-oranges comparison (you can’t run DynamoDB in your own data center, SSDs for DynamoDB while EBS for other software, etc). Instead, I’m going to go out on a limb and say that you shouldn’t be making the cloud = lower cost argument at all.

The real benefit of moving application development to ‘the cloud’ comes from increased agility and flexibility. Need to get a new application up and running quickly? You can just boot up some machines. Think there is some value in your log files? Instantiate a couple of hundred bucks worth of EMR nodes and take a shot at analyzing them. Building on cloud-like infrastructure allows developers to iterate and experiment, without the headaches of things like long hardware procurement cycles. Pair that with great open source software or managed services and you have lowered the barrier for developers to create something that is potentially great. That is something you can’t put a price on.

I know you know this, but here is what I think you should be telling companies:
- The way traditional IT organizations operate causes friction
- Developers don’t like friction
- You need to empower your developers to build, iterate and experiment
- Moving to AWS helps you do that
- Along with instant compute resources, you have great open source software you can use 
- Or, we build and manage our own software for you (DynamoDB, Elastic MapReduce, Simple Workflow service, etc)
- By not moving to the cloud you are slowing down your employees
- Embrace it or be left behind
(and btw, you may save a few bucks in the process)

You should be shouting this from the rooftops, rather than trying to make the age old argument of lower TCO. Will you piss off some CIOs or a few large enterprise software companies? Yes, maybe. But fuck it. You are Amazon. You defined e-commerce. And now you are defining what the cloud means to hundreds of thousands of software developers. 

Sincerely,

Nosh




That’s a Billion with a ‘B’ (and more to come)

by Nosh | Apr. 16, 2012, 9:11 AM

A couple of weeks ago Red Hat announced that it had crossed a billion dollars in revenue over its fiscal year that ended in March. This marked the first time a company solely focussed on selling open source software crossed a billion dollars in annual revenue.

A lot has been written over the last few weeks about Red Hat, but I thought that some older articles more accurately reflected what Red Hat (and many open source companies) are doing. This ComputerWorld article from June 2010 is instructive because it quotes Red Hat’s CEO, Jim Whitehurst, on how open source software is displacing revenues of traditional enterprise software vendors and thereby collapsing the size of existing markets. Another interesting one is this one from the New York Times where Whitehurst talks about the impact of open source on company culture.

It took Red Hat a long time to get to the billion dollar mark, but I’m fairly confident that there will be many more to follow. There are a few fundamental shifts that are underway.
- The shift to cloud-like (IaaS, PaaS, etc) architectures is giving software developers an opportunity to move more quickly with minimal interference and involvement from traditional IT (despite how blasphemous this may sound to a CIO)
- As a result of this increased agility, choices for software are being increasingly decided from the bottom up by developers themselves, rather than being pushed from the top down.
- When choosing the software to use, developers default to the choices that enable them to accomplish their task with minimal friction- which most often is open source software

This cycle is going to lead to a lot more companies being able to build strong businesses creating and selling open source software, most likely at the expense of traditional enterprise software vendors.




Trending hashtags on twitter in 3 steps with MongoDB

by Nosh | Apr. 1, 2012, 10:51 PM

One of the upcoming features of MongoDB is the new aggregation framework.
Here is a quick example to get you started. In 3 simple steps, you can be calculating the trending hashtags on twitter.

Step 1: If you haven’t already, download and install MongoDB v2.1
Here is what I do on my mac:

Noshs-MacBook-Air:~ nosh$ curl http://downloads.mongodb.org/osx/mongodb-osx-x86_64-2.1.0.tgz > mongo.tgz
Noshs-MacBook-Air:~ nosh$ tar xzf mongo.tgz
Noshs-MacBook-Air:~ nosh$ sudo mkdir -p /data/db/
Noshs-MacBook-Air:~ nosh$ sudo chown `id -u` /data/db
Noshs-MacBook-Air:~ nosh$ ./mongodb-osx-x86_64-2.1.0/bin/mongod

Now your MongoDB 2.1.0 server is up and running.


Step 2: Get some twitter data into MongoDB
This is surprisingly simple. Since twitter’s streaming API outputs JSON, you can pipe it directly into MongoDB with mongoimport. Just run this command:

curl https://stream.twitter.com/1/statuses/sample.json -uUSERNAME:PASSWORD | ./mongodb-osx-x86_64-2.1.0/bin/mongoimport -d twitter1 -c tweets

(substitute your twitter username and password)

This will start streaming a sample (about 1%) of twitter status updates into a MongoDB collection called ‘tweets’ in a database called ‘twitter1’. There are a lot of fields in each doc. Check out the Twitter API documentation to see what gets sent with each status update.
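
As I understand it, the pipe works because the stream delivers one JSON document per line, which is the input shape mongoimport reads by default. The sketch below mimics that line-by-line parsing in plain JavaScript. The parseLines helper and the two sample statuses are hypothetical and heavily trimmed, and this is not mongoimport’s actual implementation.

```javascript
// Hypothetical helper: turn newline-delimited JSON text into documents,
// skipping blank lines.
function parseLines(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Made-up, trimmed sample input: one status per line.
const sample =
  '{"text":"hello #mongodb","entities":{"hashtags":[{"text":"mongodb"}]}}\r\n' +
  "\r\n" +
  '{"text":"no tags here","entities":{"hashtags":[]}}\r\n';

const docs = parseLines(sample);
console.log(docs.length, "documents parsed");
console.log(docs[0].entities.hashtags[0].text);
```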


Step 3: Start up the MongoDB shell and run this query

Noshs-MacBook-Air:~ nosh$ ./mongodb-osx-x86_64-2.1.0/bin/mongo
MongoDB shell version: 2.1.0
connecting to: test
> use twitter1
> db.tweets.aggregate({$sort:{"_id":-1}}, {$match: {"entities.hashtags.text":{$exists:true}}}, {$limit:10000},{$unwind:"$entities.hashtags"}, {$project : {"entities.hashtags.text":1,"_id":0}}, {$group:{"_id":{$toLower:"$entities.hashtags.text"}, count : { $sum : 1 }}}, {$sort:{"count":-1}}, {$limit:5})


That looks a bit complicated. It’s actually not. Here is what is happening:

db.tweets.aggregate(

//let's take the 10,000 most recent tweets with hashtags
{$sort:{"_id":-1}}, {$match: {"entities.hashtags.text":{$exists:true}}},{$limit:10000}, 

//hashtags are stored in an array, so separate them out              
{$unwind:"$entities.hashtags"}, 

// use the text of the hashtag
{$project : {"entities.hashtags.text":1,"_id":0}}, 

//group on the hashtag and add 1 for every occurrence
{$group:{"_id":{$toLower:"$entities.hashtags.text"}, count : { $sum : 1 }}}, 

//finally sort the result and only show me the top 5
{$sort:{"count":-1}}, {$limit:5}); 


And here is what I get

{
        "result" : [
                {
                        "_id" : "wrestlemania",
                        "count" : 682
                },
                {
                        "_id" : "acms",
                        "count" : 246
                },
                {
                        "_id" : "promocaocdbrasil",
                        "count" : 194
                },
                {
                        "_id" : "paniconaband",
                        "count" : 139
                },
                {
                        "_id" : "teamfollowback",
                        "count" : 122
                }
        ],
        "ok" : 1
}
> 

I guess Wrestlemania is pretty popular right now!
Pretty simple, right? Let me know if you come up with some more examples.
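
If it helps to see what each stage does to the documents, here is a rough plain-JavaScript imitation of the pipeline that runs on an in-memory array. The sample tweets and the trendingHashtags function are made up for illustration, and this only mimics the behaviour of the stages; it is not how the aggregation framework is implemented.

```javascript
// Made-up sample data, trimmed to the fields the pipeline touches.
const tweets = [
  { _id: 1, entities: { hashtags: [{ text: "MongoDB" }, { text: "nosql" }] } },
  { _id: 2, entities: { hashtags: [] } },
  { _id: 3, entities: { hashtags: [{ text: "mongodb" }] } },
  { _id: 4, entities: { hashtags: [{ text: "NoSQL" }, { text: "mongodb" }] } },
  { _id: 5, entities: { hashtags: [{ text: "cloud" }] } },
];

function trendingHashtags(docs, recent, top) {
  const counts = new Map();
  docs
    .slice()
    .sort((a, b) => b._id - a._id)                  // $sort by _id, newest first
    .filter((d) => d.entities.hashtags.length > 0)  // $match: has at least one hashtag
    .slice(0, recent)                               // $limit to the most recent
    .flatMap((d) => d.entities.hashtags)            // $unwind the hashtags array
    .map((h) => h.text.toLowerCase())               // $project the text, $toLower
    .forEach((tag) => counts.set(tag, (counts.get(tag) || 0) + 1)); // $group with $sum
  return [...counts.entries()]
    .map(([tag, count]) => ({ _id: tag, count }))
    .sort((a, b) => b.count - a.count)              // $sort by count, highest first
    .slice(0, top);                                 // final $limit
}

console.log(trendingHashtags(tweets, 10000, 5));
```

On this toy data the top entry is mongodb with a count of 3, because the upper and lower case spellings are folded together before grouping.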




I live in New York City. I work at 10gen on MongoDB. This blog is a collection of my thoughts on technology, open source, cloud computing, and random things I come across.
