Notes from the MongoBerlin conference
Kuba Suder 🇵🇱🇺🇦
October 10, 2010
This week I went to Berlin for 2 days to attend the MongoBerlin conference organized by 10gen, the creators of MongoDB. I took a lot of notes from the presentations, and I figured this may be useful for someone if they missed a presentation (or just couldn’t go to the conference).
Feel free to correct me if I got something wrong :)
“BRAINREPUBLIC: A high-scale web-application using MongoDB and RabbitMQ”
MongoDB vs CouchDB:
CouchDB - slower (because it uses a HTTP API), easier replication
MongoDB - faster
they needed high performance, so they chose MongoDB
Mongo has a rich query API that lets you do more than Couch’s map/reduce
they use RabbitMQ, Solr, Varnish proxies and Heartbeat
MongoDB is a swiss knife of NoSQL, very fast, reliable and very easy to use
NoSQL doesn’t mean you can be lazy about proper application and database design
current problems with MongoDB:
quite slow replication speed (someone from the audience said that replica sets are fast)
indexes can be very slow if they don’t fit into memory
simple authentication - sometimes too simple
map/reduce is single-threaded (locks the machine)
“MongoDB Internals: How It Works”
what does an insert do:
constructs Object ID (includes timestamp, machine/process ID, counter); two clients at the same time can’t generate the same ID, and IDs always increase with time, so order by id = order by time
converts data to BSON
sends a TCP message (using MongoDB Wire Protocol)
query:
returned as a cursor: first n results + cursor id, and then you can get next results using the cursor (n depends on the size of documents)
contrary to relational databases, cursors are not associated with sockets, and can be shared between connections; server cleans up cursors periodically if you don’t use them (this gives more flexibility in connection management on the client side)
authentication - should rarely be required, it’s recommended to protect the database at the network level (using firewalls)
query optimizer:
it’s empirical, i.e. at first it tries all possible ways to get the results, and then remembers which one works best (it runs all algorithms in parallel and finishes as soon as one of them finishes), then reuses that knowledge in future requests
if the selected algorithm becomes very slow, it tries all possible ways again
so first time a query is called, it might be quite slow
on the other hand, if something changes later, e.g. an index becomes slow, Mongo will work around that
commands:
return short values as results, not actual data (there’s a limit on how much data you can return from a command)
internally, they’re performed as finds on $cmd collection
all commands can be invoked via runCommand with the name as one of parameters
safe mode:
normally, commands return as soon as the request is written to the socket
if you want to wait until it arrives on the other side and is executed, you use safe mode
it uses getlasterror command, which waits until the operation is executed, and returns an error if there was an error, and nothing if the operation succeeded
getpreviouserror - returns last error that actually happened (not necessarily from last request)
database files:
data and indexes are stored in foo.0, foo.1 etc.
metadata about namespaces (collections) is stored in foo.ns
database files are preallocated (so don’t go crazy if the files get big - they may be mostly empty)
one set of files per database, all in one directory (there’s an option to put them in separate directories)
$freelist namespace - list of areas that can be reused (because data was deleted)
document record contains:
header (includes: size, links to previous/next, info about how fast the record grows to estimate how much padding is needed)
BSON data
padding
it’s good to design schema so that the document size doesn’t change very often, because then the server has to move the data often
capped collection:
collection with a limited size
if it hits the limit, it deletes the oldest documents
useful for logs
“Indexing and Query Optimizer”
it’s often difficult to guess up front what indexes are needed, you need to know your actual, real query patterns
a collection may have 64 indexes, but a query can only use one at a time
indexes aren’t free - they add linear cost to all inserts and updates
for compound indexes, order of keys is important
the indexer is smart enough to handle cases where a field is a number in some documents, and string in others, and optimizer will use that (for arrays, it adds one entry for each element to the index)
if you query by regexp, it can use the index on that parameter, but only if it’s a left-anchored regexp (/…/)
geospatial search is a special case, it can only use indexes and won’t run without them
in what queries indexes can’t be used:
negations
$mod
most regexps
JavaScript calls ($where, map/reduce)
in general, if you can imagine how an index is used, it will be, and if you can’t, it won’t be
db.collection.find(…).explain - explain what exactly the query does:
cursor:
BtreeCursor - means it uses an index
BasicCursor - table scan, doesn’t use an index
nscanned - number of documents scanned
if something is slow, and you think it has an index, check this first
make sure this isn’t the initial query run, where it tries all possible ways to run a query
query profiler - logs all queries slower than a given limit to the log
how to pick a limit: get some statistics of query run times, if you see two maximums on the graph and a minimum between them, pick that minimum
if you really want to force Mongo to use an index:
find(…).hint({ x: 1 }) to use an index on x
find(…).hint({ $natural: 1 }) to disable indexes
this should almost never be necessary…
“Sharding Internals”
when to use sharding - when you:
have too much data to fit on one box
have too big indexes to fit into memory on one box
want to divide write load on many boxes to make writes faster
want to divide reads on many boxes and keep consistency (if you don’t need full consistency, just add more slaves)
auto balancing:
the shards will automatically rebalance themselves in the background with no downtime when you add more nodes
you can also add sharding seamlessly without downtime to non-sharded master/slave or replica set; everything is consistent as if it was a single server
this won’t work if you want to change the sharding key on existing shards - too much data to move
it’s up to you to decide how to partition the data, because it all depends on the kind of data you have
examples:
user profiles - shard by some kind of unique id (so lookup by id will go to single shard; lookup by other things, e.g. location, will have to check all shards)
user activity, timeline - shard by user id, so user’s timeline will load from a single shard
photos, like flickr - shard by photo id, because very often you’ll know the photo id
logging:
you could shard by date, but then all currently written logs will go to a single server and will put all load on that single server
you can divide by source machine, but then searching for all recent logs will have to access all shards
you could also divide by log type (e.g. mysql logs, apache logs)
in general, for writes it’s good if they’re divided across shards, and for reads it’s good if they read from one shard
always think what exactly is the bottleneck (data, speed), why do you want to shard, and how the system is going to be used
shard config servers:
they store all metadata about shards
they’re very important - if all of them go down, it’s very difficult to restore the database (not impossible, but very time consuming)
they use two-phase commits
you should have at least 3 of them
if one goes down, metadata becomes temporarily read only, so you can’t add/modify shards during that time (but you can still add data)
they require very little resources, so can be installed on existing machines
each shard is a master/slave pair or a replica set (or just a master, but that’s a bad idea)
mongos - sharding router:
acts as a frontend of the databases for the application
clients can’t tell the difference between a regular mongo database and a sharding router, they access it in exactly the same way
redirects all requests to the shards (using the metadata from a config server)
you can have as many as you want
one solution is to have one on every application server - then there’s no extra network load, because every application server accesses its own mongos on the same machine
chunk migrating:
moving chunks (parts of a collection) to a different shard
during the copy, the origin keeps a log of all operations, because clients can still do writes in the meantime
migration is finished when the data is copied and all new operations are applied (when the target catches up with the changes on the source)
balancing: server automatically detects when a shard has too much data and moves some chunks to other shards
usually only one or a few biggest collections are sharded, and the non-sharded data is kept on one shard which is automatically selected by the system as primary
Foursquare example: only checkins collection is sharded, users and locations aren’t
number of physical machines for sharding:
usual number of machines initially used for sharding is 6 or 9
but you can start with 2 or 3 (one box can hold master for one shard and slave for another)
“Replication Internals”
oplog:
it’s a capped collection that stores recent operations on master
it should be large enough to hold all operations for some period of time - how much exactly depends on the speed of adding new data
on master/slave setups oplog is kept only on master, and on replica sets - on all nodes
syncing: if slave is empty, it first copies all data from master, then applies all the changes from oplog that happened in the meantime
some operations are converted when added to oplog, to guarantee that they are idempotent:
incs are translated to updates with $set
deletes are divided to one per record (because it won’t be possible to match the same set of records again)
no-op operations are added to oplog in regular intervals to ensure that it isn’t paged out to disk
–fastsync
assumes that data on the slave is identical to the master data
you can use that if you copy all the data manually, e.g. by sending whole drives to another data center
FedEx delivers replicated data at about 11 Mb/s ;)
with capped collections, it’s possible to do a ‘blocking read’ that will wait until new data is available (like tail -f)
replica pair: something old that was replaced by replica sets, don’t use that
replica sets:
instead of master and slave, we have primary and secondary, because they’re dynamic
when primary is down, for a short time writes are blocked, and remaining nodes have an election process to find out which one has the newest data, and that node becomes a new primary
some really new data, that wasn’t written to any secondary nodes, may be lost
when old primary reconnects, it marks all changes that didn’t make it into the new primary as invalid and logs them somewhere
there may be some problems with determining if the primary is dead or just inaccessible - may be solved by setting vote strengths, adding extra nodes, or adding arbiters (nodes that don’t store any data and only take part in the election process)
“Scaling with MongoDB”
often you can change performance by orders of magnitude only by rethinking the schema
embedding:
it’s great for read performance if you need to read the main object and all embedded objects together (there’s one request to load all objects)
but writes can be slow if you’re adding embedded objects all the time, because very often the document will have to be moved somewhere else on the disk
if the updates to embedded objects are e.g. 1-2 per second then it will copy the data all the time
indexing: if you have an index on (A, B) then you don’t need an index on A
sometimes you need to think hard if having an index will bring better performance than not having it - because huge indexes that swap out constantly are very bad for performance
right balanced indexes:
index is right balanced if all the new elements go to the right side (when they’re sorted by date or id)
in such index, usually only the right side of the index needs to be kept in memory, even if the whole index is huge
it might sometimes make sense to add a new key just to have a right balanced index
use this for data that is mostly accessed only for time, e.g. recent photos, messages, etc.
working set: you need to figure out how long a user stays on the site and how much data they’re going to need during that time (and how many users come simultaneously)
horizontal scaling:
usually read scaling is most important, because most people just read the data, and only some of them modify it
simplest way: master + one or more slaves; slaves' data may be inconsistent at times
for data size and write scaling: add shards
typical setup for sharding: 9 servers (3 masters, 6 slaves)
for sharding, choose such keys that are more or less evenly distributed in the collection
Lightning talks:
“Survival Guide: How to select the right database”
http://nosql-database.org - about all nosql databases
http://www.slideshare.net/jboner/scalability-availability-stability-patterns
“Map/reduce”
map/reduce is useful e.g. for aggregating data from entire database in a batch task
it can be thought of as a distributed grep command whose results are piped to another command that processes (reduces) the data
“Indexing Plugins”
if built-in Mongo indexing doesn’t satisfy your needs, write your own indexer
you can add new commands that will be called with runCommand
right now you can’t use data outside of mongodb for indexing
“MongoDB Roadmap”
features planned in 1.8:
single server durability
improved replication
full text search
Discussion in the ATmosphere