Java technology zone technical podcast series: Season 4
10gen's Steve Francia talks MongoDB
Episode date: 05-15-2012
GLOVER: I'm Andy Glover, and this is the Java technical
series of the developerWorks podcasts. My guest this time is
Steve Francia. He is the chief solutions architect at 10gen.
If you don't know, 10gen are the guys and gals behind
MongoDB. And Steve also recently published a book about
MongoDB for O'Reilly.
And I thought we'd kick off the conversation, Steve, by ...
On this podcast, we got to talk to Eliot [Horowitz] it seems
like two years ago. So a lot has probably changed for 10gen
and MongoDB, so I thought we'd just start off by saying:
What's new? What's going on in the wonderful world of Mongo
and 10gen?
FRANCIA: Thanks, Andrew. Just a bit of background. It's
been a really interesting last year and a half for MongoDB,
for 10gen, and for the developer world as a whole.
A lot of changes are happening in the landscape; lots of new
technologies are coming out that enable developers to do new
things. And MongoDB is at the forefront, really leading what
we'd like to look at as the data revolution that's taking place.
GLOVER: Wow, okay.
FRANCIA: Over the last year, 10gen has grown in line with
what we've seen as the adoption growth of MongoDB. At the
beginning of 2011, our company was still small -- about 25
people and one primary office in New York City. At the
beginning of 2012, we were over 120 people, offices on both
coasts -- one in Palo Alto, one in New York City -- and then
offices in Dublin and London, and expanding into Asia.
And so, it's rapid growth. We've seen the market -- the
entire alternative-to-SQL market, whether it's graph
databases, document databases, key/value stores -- all of
them have been growing, but MongoDB's growth has outpaced
the growth of the field in general.
So, we saw millions of downloads of MongoDB in 2011, and
we've had two major releases since we last talked. The most
recent one focused a lot on better performance and better
sharding capabilities.
And so over the last couple of years, it's been really our
objective to cover as many use cases as broadly as we can and
be able to handle as many critical needs of companies as we can.
We've also seen some large companies make some significant
adoptions of MongoDB. A few pretty notable ones: craigslist
stores the majority of its data in MongoDB, having switched
from MySQL. And we've partnered with Microsoft to make
MongoDB one of the core technologies on the Azure platform.
eBay has rolled out their new X.commerce platform, and
MongoDB is a core technology -- the only database available
as part of X.commerce. And we've seen within eBay the growth
has expanded internally. So they started X.commerce, and now
they're using it for a bunch of the internal needs.
GLOVER: So, let me ask a question. You mentioned
craigslist, and I'm sure pretty much all of our listeners
are familiar with what craigslist is. Could you elaborate
on why craigslist decided to move to Mongo? What is it that
Mongo is doing that -- I think you said they moved off of
MySQL -- what wasn't MySQL capable of doing that Mongo is?
FRANCIA: This is a great story. It's interesting, if you
think of Craigslist from a technology perspective, you think
of a couple of things. One is that they've been around for a
long, long, long time on the Internet. And two is they've
always been very conservative, technically speaking. For
instance, their home page looks about the same as it did 12
years ago.
FRANCIA: And so what they ended up doing was, they hired a
guy named Jeremy Zawodny, who, if you're not familiar, is
known throughout the world as really "the" MySQL expert. He
wrote the two books on scaling MySQL that are effectively
the bible of web scale.
And they hired him to address some of their large data needs
that they were having. And what Jeremy ended up doing was
looking at what they had. Because of the way craigslist
works, they have a very small percentage of their data
that's kind of their daily operating data -- the working
data that people interacting directly with the website
posting and responding to, et cetera, have.
And then you have everything that's ever happened
previously. They have the live data for about 60 to 90 days,
and then they archive it. And all of the archive data was
just really a SQL copy of the live data. And all of the data
is still accessible for the website; they just have it in a
different system.
And so what they were doing was with all this archive data,
which is well over 95 percent of their data: They had a
policy internally that they'd do ALTER TABLE statements --
they'd change their schema -- only four times a year. So
every three months they'd have a window where they'd be able
to do some ALTER TABLE statements.
And they got to the point where, because of how many records
they had, the last time they ran an ALTER TABLE statement,
it took over two months to run.
FRANCIA: And if you're familiar with MySQL, the ALTER TABLE
statement locks everything while it's running. And so that
was two months where the archive was unavailable. And it's
obvious to see what's going to happen is as they grow more
and more, that ALTER TABLE statement is going to take three
months, and then they're never going to be able to actually
access their data -- or they'd have to revise the policy. But
at that point, they're spending more time altering -- locking
their tables -- than doing anything else.
Jeremy knows MySQL as well as anybody does. And his solution
to solving this problem was to switch to MongoDB. What
MongoDB does with its flexible schema is that they were able
to never have to write an ALTER TABLE statement again but be
able to continue to change the database however the
developer saw fit, without ever having to worry about
changing all of the archive data. So it gave them that
flexibility that they really needed.
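Francia's schema-flexibility point can be sketched in plain Python, with no database involved (the documents and field names here are purely illustrative, not craigslist's actual schema): new documents can simply carry new fields while archived ones stay untouched, so nothing like an ALTER TABLE is ever needed.

```python
# Documents in one "collection" need not share fields. In a relational
# table, adding a "featured" column would require altering every row.
archive = [
    {"_id": 1, "title": "couch for sale", "posted": "2009-03-01"},
    {"_id": 2, "title": "bike wanted", "posted": "2010-07-14"},
]

# A new posting simply carries the new field; the old archive documents
# are left untouched -- no schema migration, no table lock.
archive.append({"_id": 3, "title": "apt for rent", "posted": "2012-05-01",
                "featured": True})

# Queries handle the field's absence with a default.
featured = [d for d in archive if d.get("featured", False)]
```

The query side of the tradeoff is that readers must tolerate missing fields, which is why `get` with a default is idiomatic here.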
GLOVER: Interesting. Very interesting. You also mentioned
-- as far as what's new in Mongo, you mentioned sharding.
I've had multiple conversations with different NoSQL and
relational folks about what's sharding, and I thought maybe
it would be helpful if you could expand on what does
sharding in the MongoDB world mean.
FRANCIA: Certainly. So, with MongoDB, the approach that
we've taken to horizontal scale -- meaning that as your
database grows, as your capacity needs increase -- you're
able to meet those needs by just adding more servers to a cluster.
And the way that we do this in MongoDB is a technique
called sharding. One of the neat things is that in
MongoDB it's done entirely in the database. Your
application doesn't have any awareness that it's
communicating to a sharded database or a regular database.
Now, that's a powerful thing for developers, because it
means that when you're developing an application, you're
able to focus on building the application and not worrying
so much about how it's going to scale, or performance,
because MongoDB gives you so much of that out of the box.
And then as you need to scale, you're able with MongoDB to
go from a single node or a replicated node to a sharded node
-- a sharded cluster -- with no downtime, without changing
your application.
And under the covers, what it really means is that MongoDB
will chunk up your data into ranges based on a key -- a
shard key that you give it. And then MongoDB will worry
about where all those are on every ... when you're querying,
it will worry about where to find the pieces of data you
need to, what servers they're actually located on. It'll
worry about balancing and distributing them and making sure
all of that takes place properly.
And we've got clients, one of our customers ... one of our
big customers is Disney. They have hundreds and hundreds of
nodes in big clusters. It works really well to scale to a
very high capacity.
GLOVER: Regarding sharding -- in particular when you say
it's helping with kind of the performance and scalability
and the horizontal standpoint -- it takes a large
collection, let's say? And I'd be interested if you could
tell us more -- maybe about Disney or something -- but it
takes a large collection or I guess a table, right? A table
is a collection in the Mongo world.
And breaks it into multiple copies? I mean, so
that now you have, let's say, instead of one giant, let's
say, customer's collection, you have now maybe three or four
customers' collections all kind of broken up by perhaps
region. And therefore searches are quicker, because you're
going against a smaller data set? Is that the principle behind it?
FRANCIA: That's the overall principle, but let me explain a
little bit more clearly. So let's use a good example. An
example is a good way to kind of illustrate this. So I'll
talk about it in two different ways, actually. One of the
ways is what we have today, and one of the ways is what
we're building in for the next upcoming version of MongoDB,
which is 2.2.
So the simpler case is that let's say you have a bunch of
customers, and you're just talking about the customer's
table or collection, and it's just growing, it's growing
very large. And you want to be able to divide the work
that's happening across many different servers.
And you could be doing it for a couple of different reasons.
One is the amount of data is just too big to fit on one
machine; or, the amount of writing that you're doing. You
want to increase the throughput, so you want to divide the
writes across multiple machines.
And so, as far as it goes, it's always going to be viewed as
one collection, but that collection itself is going to -- to
get a little technical, what it's really doing is it's going
to break up all of this data into 64-MB chunks, and these
are going to be based on a range of data.
So let's say your shard key was based on sign-up date. That
may or may not be a good shard key depending on your user
behavior, but let's just say for an example that it's based
on sign-up date. So what it'll mean is that anybody ...
it'll break up all of that data into these 64-MB chunks and
distribute them evenly across the servers.
Let's say you have four years of data and four servers. It
doesn't mean everybody from year one is going to be on
server one, everyone from year two is going to be on server
two, et cetera, because it's going to break them up more
granularly than that into 64-MB chunks.
And so, what it'll end up giving you is a very even
distribution. And what we've found is, let's say you were
doing it in this chronology -- you're basing the range on
this sign-up date -- you're going to want it more evenly
distributed, because the way that user behavior is, often
your oldest users are the ones that are more stale, right?
And the newest users are the ones that are more active. And
so you're going to want to more evenly distribute those
across them. And so, that's what MongoDB does.
As far as the system treats it, as far as the application
treats it, it's one collection; it's always queried against
one collection. You never need to really be aware that it's
chunked in all these different ways. But the server itself --
the cluster itself -- is distributed across all these
different nodes.
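The chunking behavior Francia describes can be sketched as a toy in Python (the function names, the document-count chunk size, and the round-robin placement are all simplifications of what MongoDB actually does with its ~64-MB chunks and balancer): documents are split into shard-key ranges, and the chunks, not whole key ranges, are spread across servers, so one "hot" year of sign-ups does not land entirely on one server.

```python
def make_chunks(docs, key, chunk_size):
    """Split documents, sorted by shard key, into fixed-size chunks.
    (Real MongoDB splits by ~64 MB of data; we use a document count.)"""
    docs = sorted(docs, key=lambda d: d[key])
    return [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]

def distribute(chunks, n_servers):
    """Round-robin chunks across servers for an even spread."""
    servers = [[] for _ in range(n_servers)]
    for i, chunk in enumerate(chunks):
        servers[i % n_servers].append(chunk)
    return servers

# 16 users spread over four sign-up years (the shard key).
users = [{"user": u, "signup": 2008 + (u % 4)} for u in range(16)]
chunks = make_chunks(users, key="signup", chunk_size=2)
servers = distribute(chunks, n_servers=4)
# Each server ends up holding chunks from several signup years,
# not just one -- the even distribution Francia describes.
```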
GLOVER: I got you. And something that -- I think I seem to
remember talking about this in a previous podcast or perhaps
even in an article -- is that sharding in the relational
world is difficult when you need to do, let's say, joins,
but in Mongo there is no such thing as a join; however,
sometimes people can still kind of put related data -- let's
say, right? -- in different collections. How does sharding
affect that, or again, does it even matter?
FRANCIA: Yes, so, the whole reason Mongo doesn't have joins
is that doing distributed joins is really hard to do while
guaranteeing reasonable performance. And so from the
beginning we said, well, there's two things that will really
prevent horizontal scale in relational databases or in
traditional databases.
One was the ability to do joins, and the second was
distributed transactions. And those are the two major
components of SQL or relational databases that we
intentionally did not include in MongoDB.
Now, fundamentally, we took a different approach to how the
data is structured. One way I like to visualize it is, if
you think of a key/value store like memcached, it's a very
one-dimensional thing. It's almost like an array for
programmers, in that you've got a key, which is the location
of it, and then a single piece of data.
If you think of a relational database, it looks more like an
Excel spreadsheet, right, that there's two dimensions.
There's all the fields, and then there's all the rows. And
so it expands the capabilities of the one-dimensional model
you get in the key/value store -- it expands those
capabilities so now I can say, "I only want this piece of
data." Or, "I want to find something where this piece of
data matches something."
And then we came up with this idea of joining these
two-dimensional pieces together to give us a sense of being
able to give us richer data. But it's somewhat artificial in
that even something as simple as a blog -- you have blog
posts and comments -- it becomes somewhat artificial. It
gets even more artificial when you do something where you
have a many-to-many relationship, right?
Because now we need three tables, and we need this special
kind of table that doesn't really do anything but combine
the other two tables -- and it feels kind of weird. With
MongoDB, what we've done is more analogous to an associative
array or a hash with the ability to do nested associative
arrays or hashes.
And it's similar in a way to objects in that you've got a
lot of richness to it. And so, instead of having this
two-dimensional datastore, you've really got a
multidimensional one. And so you're able to include things --
nest them together in logical ways.
In doing so, it makes querying very simple. Take the example
of a blog post. One of the big users of MongoDB is Business
Insider. And what they've been able to do is, they shrunk
down their schema from what would be in a relational schema
dozens and dozens of tables to really only a couple of collections.
And when they go to retrieve ... if you think about a blog
article, you're going to have categories, you're going to
have metadata, you're going to have images or video assets
as part of it. You're going to have tags. You might have
related articles. And then you're going to have the content,
the body. And in a relational world, you're going to have to
create a lot of tables to do all that. That's at least 10 tables.
Because all that data is going to be accessed at the same
time based on the blog post -- based on the URL that's in
the request -- it makes a lot of sense to put them all in a
single document. And this is a structured document. It's not
like a denormalized blob; it's very structured in that it's
organized. It's structured in a similar way to the objects
that you want to manipulate in your application.
And so with Business Insider, they're able to store
everything that's part of that page load -- everything
that's part of that render -- in a single document. And so,
it's a very straightforward query for them to access. The
query is literally as simple as: look at the posts
collection and find any article that has the URL token or
the slug that matches the one in the request. That's the
entire query.
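The embedded-document idea can be sketched in plain Python (the field names and the `find_one` helper are illustrative, not Business Insider's actual schema or the real driver API): everything needed to render one post lives in a single document, so retrieval is one lookup by slug instead of roughly ten relational joins.

```python
# One document holds the post body plus tags, assets, and comments --
# the pieces that a relational schema would scatter across many tables.
posts = [
    {
        "slug": "mongodb-at-scale",
        "title": "MongoDB at Scale",
        "body": "...",
        "tags": ["mongodb", "nosql"],
        "assets": [{"type": "image", "url": "/img/cluster.png"}],
        "comments": [{"user": "ann", "text": "Nice overview"}],
    },
]

def find_one(collection, example):
    """Query by example: return the first document whose fields match."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in example.items()):
            return doc
    return None

# The "entire query" Francia mentions: match the slug from the request URL.
post = find_one(posts, {"slug": "mongodb-at-scale"})
```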
GLOVER: Right. And speaking of ... you mentioned querying,
and something that often comes up when we have conversations
about NoSQL is: how does one interact with Mongo? Obviously
SQL, S-Q-L, whatever you want to call it, for the relational
world has become like the lingua franca of the business
analyst, right? Everyone knows SQL.
But when you move to the NoSQL world, a particular
implementation may or may not support the SQL we've all come
to know -- I'll leave off "and love," right? So what is the
query language for Mongo? What's the level of effort? What
do I have to do to, like you said, query for a slug, right?
FRANCIA: Yes, so, every NoSQL solution takes a slightly
different approach, and I'll speak mostly about MongoDB. But
it's interesting if you think about what developers have had
to do for the last -- let's say, Web developers -- for the
last 20 years, 15 years.
They've had to program using languages that are really built
around objects. And then they also had to learn this other
language called SQL, and they had to figure out how to
translate this object data into these rows, tables, and
columns, which wasn't necessarily a natural translation
between the two.
And realizing this wasn't an easy thing to do or it wasn't a
very pleasant thing for developers to do, we've developed
this idea of: well, we'll write this library called an ORM
-- object-relational mapper -- that'll do all that work for
us, right?
Because we only really want to interact with objects,
because we write in an object-oriented programming language.
We don't have to worry about translating our objects into
rows, tables, and columns.
And so, SQL itself, it is that. It's that interface with
these rows, tables, and columns that ... and we interface in
programming languages through strings. It's very odd,
actually. It's not an API or anything. It's through these
strings, which feels very different from everything else we
do in our programming languages.
Now, granted, this is how we've all done it for our entire
careers, so we've gotten very accustomed to it. But if you
sit back and look at actually all that we're doing, it's
very strange. It's very strange to be passing strings
into this function that will return data in a very different
format than we know how to handle.
And so with MongoDB, we've put a huge amount of attention
into improving the developer experience and making it feel
like you're actually working with the language that you're
in. And so, with MongoDB, we don't pass strings. You don't
use a query language. You interface directly with objects.
And so for the collection, there's a collection object.
There's a method off of that called find. There's another
method called update. There's another method called insert.
And these are methods that are off of that collection
object, into which you're going to pass ... for the update
method, you can pass in an object. Or for the insert method,
you can pass in an object, and it will just persist that
object. Or you can pass in an array or a hash, and it'll
persist that.
In MongoDB the query is very similar to that. It's query by
example. And so, what you do is pass in -- we'll use the term
object, or array -- whatever you want to match, right? So
for matching the slug, you literally give it a key, slug --
the word "slug" -- and then pass in the value you're
expecting it to match.
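The interface shape Francia describes -- a collection object with `find`, `insert`, and `update` methods, queried by example rather than through SQL strings -- can be sketched as a minimal in-memory class (this is an illustration of the idea, not the real MongoDB driver):

```python
class Collection:
    """A toy in-memory collection with a MongoDB-like method surface."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        """Persist a plain dict as-is -- no object-to-row translation."""
        self.docs.append(doc)

    def find(self, example):
        """Query by example: return documents whose fields match `example`."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in example.items())]

    def update(self, example, changes):
        """Apply `changes` to every document matching `example`."""
        for d in self.find(example):
            d.update(changes)

users = Collection()
users.insert({"name": "ada", "lang": "python"})
users.insert({"name": "sam", "lang": "java"})
users.update({"name": "sam"}, {"lang": "kotlin"})
matches = users.find({"lang": "kotlin"})
```

The point of the sketch is the absence of any query string: every operation takes and returns plain objects, which is what lets the drivers feel native in each host language.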
Our goal with MongoDB is that you forget you're using a
database, right? And just let you focus on building your
application and not worry about the database. In fact, we've
found that when people get into MongoDB, within a couple
weeks they start saying, "Where's the rest? This is too easy."
And often people comment that they forgot they were even
using a database, because it just feels so natural --
because they're just communicating directly with these APIs,
and there's no translation layer like there is in SQL. It's
just very natural, very performant. You don't have the big
overhead of the ORM.
And it's very consistent in every implementation regardless
if you're a Java programmer, a PHP programmer, Ruby, Python,
C. The implementation feels very natural to your language
and also consistent.
GLOVER: Interesting, interesting.
Earlier, before we started recording this, you and I were
talking about what's new, what's going on. And so, I'm
hoping you can share with us what's upcoming. If we're a
Mongo person, if we're using Mongo, what can we expect down
the pipe, and if we're looking at Mongo or we're interested
in learning more, what's out... You mentioned that you guys
had some performance improvements and some stuff around
sharding. What's on the roadmap?
FRANCIA: Certainly. I'll give you three big things to look
for. The first one -- a feature of MongoDB that we haven't
talked about in this talk yet -- is that MongoDB utilizes
memory in a very interesting way. It effectively bakes in
the notion of memcached into the database and does even more
than that.
So what it does is, everything that you read from the
database is read through memory. And everything that you
write to the database is written to memory first and then
persisted to the disk.
What that translates to is that it approaches the speed of
memcached for all reads and is significantly faster than
anything you've experienced before for writes, because it's
writing right to memory.
And it does that completely transparently, just as you
interact with the database. It's doing that behind the
scenes. There's no configuration; that's just how the
database works.
One of the big features that we've been able to add into
MongoDB for the upcoming version is a set of improvements to
how this memory handling works, to improve concurrency -- to
enable us to handle even higher write loads than we
currently do, at virtually any scale. So that's one big
thing to look for, is better concurrency, which will give
you even better performance, particularly if you're in a
lower memory situation.
The second thing to look for in MongoDB -- I touched on it
earlier but didn't talk about it too much -- is geo-aware
sharding, and that's kind of what you were talking about. So
if I've got three datacenters -- let's say one in Europe,
one in Asia, one in North America -- I want high
availability across all of them, but I also want
performance, in that my users in Europe are going to
frequently be interacting with my data in Europe.
So I'm going to want to write to Europe locally and then
replicate that data to the other datacenters for high
availability. This is a hard thing to do. It's not easy,
really, in any database. But what we have, we're building
support for that into the next version of MongoDB.
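The geo-aware placement Francia outlines can be sketched as a toy in Python (the datacenter names, the `home_for` routing rule, and the synchronous "replication" loop are all assumptions for illustration; real MongoDB replication is asynchronous and far more involved): a document is written to its home datacenter first, then copied to the others for availability.

```python
DATACENTERS = ["eu", "asia", "na"]

def home_for(region):
    """Pick the primary datacenter for a document by its region tag."""
    return region if region in DATACENTERS else "na"

def write(doc, stores):
    """Write to the home datacenter first, then replicate everywhere else."""
    home = home_for(doc["region"])
    stores[home].append(doc)  # local, low-latency write
    for dc in DATACENTERS:
        if dc != home:
            # In a real deployment this copy would happen asynchronously.
            stores[dc].append(dict(doc, replica=True))

stores = {dc: [] for dc in DATACENTERS}
write({"user": "margit", "region": "eu"}, stores)
# The EU user's write lands locally; asia and na hold replicas.
```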
So look for that. It should be like everything MongoDB: the
configuration is just so much easier than anything you've
experienced before. And you should look for that feature to
be coming in the next version of MongoDB, which is our 2.2 release.
And then the last feature I wanted to talk about is our new
aggregation framework.
GLOVER: Tell me more.
FRANCIA: With MongoDB ... I'll start off by saying it's an
operational database. It's not strictly a data processor. It
enables you to do things -- you can query your data, you can
handle it -- but historically we haven't had a great
solution for doing things like groupings, averages,
max/mins, et cetera. People have largely depended on either
using our MapReduce functionality, which is capable, but
it's a very general-purpose framework for doing these
things. Or, you've been doing a lot of it in the
application. And SQL's had great support for things like
this. The GROUP BY command works very well in SQL.
And so what we've built is this new aggregation framework
that's built on the concept of pipes -- kind of like UNIX
pipes -- that you can string together a pipeline, where the
output of one becomes the input of the next.
And in doing so, we've enabled you to be able to do these
aggregate functions with significantly more power than the
SQL GROUP BY without adding any complexity to that query
language. But the interface is very familiar. When you start
using it, it feels very familiar. It feels kind of like a
mashup of UNIX pipes plus SQL GROUP BY.
And so it feels very familiar. It's optimized for these
aggregate things, so you're going to get significantly
better performance than a general-purpose tool like
MapReduce. When used in a sharded, clustered environment,
you're going to get parallel processing similar to what you
get from MapReduce.
So you're able to increase the throughput of these
aggregation commands by distributing the aggregate
calculations across all of the nodes. And so it gives you a
lot of the kind of benefits of MapReduce but with much less
complexity and an optimized path for these aggregate operations.
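The pipeline idea behind the aggregation framework can be sketched as a toy in Python (the stage names loosely mirror MongoDB's `$match`/`$group`/`$sort` operators, but this is plain Python, not the real implementation): each stage is a function whose output feeds the next, like UNIX pipes.

```python
from collections import defaultdict
from functools import reduce

def match(pred):
    """Keep only documents satisfying the predicate (like $match)."""
    return lambda docs: [d for d in docs if pred(d)]

def group_by(key, field):
    """Sum `field` per distinct value of `key` (like a $group with $sum)."""
    def stage(docs):
        sums = defaultdict(int)
        for d in docs:
            sums[d[key]] += d[field]
        return [{key: k, "total": v} for k, v in sums.items()]
    return stage

def sort_by(field):
    """Sort descending by `field` (like $sort)."""
    return lambda docs: sorted(docs, key=lambda d: d[field], reverse=True)

def pipeline(docs, *stages):
    """String stages together: the output of one is the input of the next."""
    return reduce(lambda data, stage: stage(data), stages, docs)

orders = [
    {"region": "EU", "amount": 50, "year": 2012},
    {"region": "US", "amount": 80, "year": 2012},
    {"region": "EU", "amount": 30, "year": 2011},
]
totals = pipeline(
    orders,
    match(lambda d: d["year"] == 2012),
    group_by("region", "amount"),
    sort_by("total"),
)
```

Because each stage only sees its input documents, stages like the grouping can, in a sharded cluster, run in parallel on each shard before the results are merged -- the parallelism Francia compares to MapReduce.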
GLOVER: Now, interesting, you're comparing and contrasting
this with MapReduce. So in Mongo land, you wouldn't
necessarily leverage MapReduce in a real-time basis, right?
It's kind of like a batch-oriented thing where you'd run
MapReduce and then query the output of that.
Is the aggregation framework much the same, in the sense of
you run a query and it puts its results in something else
that you then query? Or does it kind of inline it? Do you
get the results back right away?
FRANCIA: It can be used in a real-time situation. It's much
more analogous to SQL's GROUP BY than it is to MapReduce.
And it's interesting, as we go out in the industry, we're
talking to a bunch of people. We've found that most people
using MapReduce were using it to do ... what they were
trying to do was this real-time aggregation stuff.
Customer and user feedback is where we got the idea that,
well, then, we should optimize a path really just to do
that, because MapReduce is kind of a big hammer where
everything's a nail. And it's an effective hammer, right? I
mean, it definitely does the job. But there can be
more-effective ones.
And so we've done a couple things. One is, we've built this
aggregation framework. Two is we do have our own MapReduce
capability built into MongoDB that works well. And the third
thing that we've done is we've built connectors to work with
Hadoop MapReduce. We're building a connector right now to
work with Storm to be able to leverage those technologies
and infrastructures.
In a lot of bigger environments, they already have a Hadoop
cluster running. And so to be able to connect directly to
that and let them use the tools they're already using is a
really powerful feature. Of course, Hadoop is a batch
processor. That's what it's built to do. And so, for
real-time stuff you're not going to use Hadoop; you're going
to use MongoDB's aggregation.
If you wanted to do batch processing, Hadoop's a good fit
for that, and our connector lets you pull data from MongoDB
into Hadoop, do some processing, and then you can stream it
anywhere you want from there or output it anywhere you want.
You can output it back into a MongoDB collection, or you
could push it into HBase or any of the other compatible
Hadoop tools.
Or you can go the other direction. You can take it from
anywhere in the Hadoop ecosystem and pipe it into MongoDB.
We're also working with Storm, which is more of a real-time
data processor -- kind of a stream processor: as data comes
in, you attach Storm to it and translate or morph that data
into something else. So, people really have three different
needs, and we're in all three places. So we're really trying
to address the use cases and needs that our users have.
GLOVER: And so, all of this is going to be in, you said ...
FRANCIA: No, 2.2.
GLOVER: 2.2. And when is 2.2 due out?
FRANCIA: 2.2 is due out soon. Our approach to releases is
that we have two or three -- I like to call them tentpole
features -- in each release, and when they're done and
baked is when we ship.
And then any of the other features that are done and ready
in time and well-tested will also be in that release. So,
today most of the work is done; we're very close to a
release. It will come in the next few weeks. And for anyone
who wants to start playing with this new functionality, you
can do it today.
You can go and download what we call the developer's
release, which has all this functionality baked in, but we're
still adding some polish to it. The API is not going to
change. You can start building applications against it. But
it's not recommended for use in production yet.
GLOVER: Got it. Got it. Where can we find out more
information on Mongo and 10gen?
FRANCIA: The best place to find out information about
MongoDB is We've got loads of presentations
there. We've got lots of use cases, examples, great
documentation. And that's really the best place to go, and
the MongoDB blog will always keep you up to date with the
latest happenings in the MongoDB world.
And then 10gen is the commercial backing behind MongoDB. We
have, and we're doing a lot with 10gen. Another
big thing we've seen in the last year and a half, since we
last talked, is that our community has just exploded. And
so, in 2012 we have over 30 conferences. We try to make them
very local, to make them accessible for developers to
attend. They're in every region.
We just had Mongo Sydney last month; we have Mongo New York
coming up in April. We've got MongoSF coming up in May.
We've got Mongo London coming in June. There's going to be a
Mongo in Japan. And we've got all these conferences
throughout the world, and we price them very affordably,
because we really want people to come and be a part of the
great things that we're doing -- really feel included.
GLOVER: And then, Steve, what about you? How can people
find you on the Internet?
FRANCIA: I'm very available on the Internet. The best places
to find me are my blog and Twitter. Both are spf13: my
blog is; on Twitter, @spf13. Or just Google SPF13
and you'll find me anywhere from GitHub, where one of my
popular projects is a Vim distribution, to Coderwall to
LinkedIn. Just search for SPF13 and you'll find me.
GLOVER: Awesome. Well, Steve, this has been very
informative. I know I speak for all our listeners when I say
thanks for spending time out of your busy schedule to keep
us abreast of all the good, cool things happening at Mongo.
FRANCIA: Excellent. Thank you for spending the time with me
and giving me this opportunity.
GLOVER: My guest, again, has been Steve Francia, and I'm
Andy Glover, and this is the Java technical series of the
developerWorks podcasts. Thanks for listening.