Working at MongoDB

Last Friday marked my last day as an intern working at MongoDB. Inc. During my time here, I’ve met countless amazing people—full timers and interns alike—and had one of my most memorable experiences at a workplace, ever.

The Company

pongoDB

MongoDB, Inc. started out as 10gen, way back in 2007. Something neat I learned this summer is that the “10” in “10gen” actually refers to the number “2” in binary form, and the “gen” in “10gen” is short for “generation”, leading to a clever word-play on the phrase “second generation”. After all, their idea revolved around building a “second generation” database, back when NoSQL wasn’t super popular yet.

The location of the office can be both a blessing and a curse. Having an office in Times Square gets a bit hectic, and the feeling that people walk either too slowly front of you or too quickly behind you always managed to creep up on me every day, immediately after exiting the subway station. While Times Square tends to be very busy during the day, it tends to have a much calmer atmosphere at night, which I personally liked. I took the photo below after MongoDB took the interns out to see a Broadway show.

times_square

Having its headquarters in New York City is a nice perk, as there is a seemingly endless number of activities to be done within Manhattan itself, not to mention the other boroughs of New York. This is all without sacrificing the the office perks stereotypical to tech companies in the Bay Area, as one could find all of the same office perks (e.g. catered lunch, free snacks, nap rooms) at MongoDB.

The Work

Over the course of the summer, I worked as a software engineer in the distributed systems group, on the sharding team. Needless to say, my contributions involved improving the sharding mechanisms of the MongoDB database. I felt particularly lucky, as I was able to work on a variety of different projects throughout the summer.

Improving Automatic Chunk-Splitting

The main project I was responsible for at MongoDB involved redesigning some of the logic behind auto-split, which partitions chunks of data within sharded collections into smaller pieces when those chunks become too large. Because auto-split is not a cheap operation, I worked on improving its performance by reducing the number of network calls made by and eliminating duplicate work done by mongos router nodes while executing auto-split. I also worked on the asynchronization of the operation so that auto-split does not block writes.

Before this summer, tracking the approximate sizes of particular chunks that were written to was done within mongos router nodes. This required that the mongos routers make network calls to corresponding mongod shards every time a particular chunk got too large, as shown below.

network calls

By building a new method of tracking chunk sizes on the mongod side, MongoDB is now able to execude the auto-split operation without invoking an unnecessary number of network calls between mongos router nodes and mongod shards, as shown below.

improved

Another benefit of this is that previously when multiple mongos router nodes were running simultaneously, each mongos router node would call the auto-split operation, because each would track its own copy of the chunk sizes. By moving the tracking logic over to mongod, an enormous amount of duplicate work is eliminated by multiple mongos routers all trying to perform the same task.

Refactoring Metadata Commands

The second project I worked on involved moving some of MongoDB’s metadata commands from router nodes to the config server. Metadata commands include commands that modify the state of the sharded cluster, such as movePrimary, removeShard, and dropCollection. Because users are able to run metadata commands concurrently, a router node must take one or more distributed locks from the config server before running a metadata command, so that the router node running the metadata command does not interfere with another router node running a metadata command that affects the same aspect of the cluster.

Having these metadata commands run on the router nodes is suboptimal in the sense that there is a large dependency on brittle distributed locks. By moving the metadata commands onto the config server, these dependencies can be eliminated, and a more correct and coherent way to manipulate sharding metadata can be developed, and used as the groundwork for future improvements.

Nondeterminstic, FSM-Based Concurrency Testing

The last project I worked on was a bit more stand-alone and revolved around constructing a better way of testing the correctness of MongoDB’s metadata commands when multiple are run together concurrently. For example, the removeShard command can only remove a shard from a sharded cluster if the shard to be removed is not the primary shard. Hence, it makes sense to test the behavior of the cluster if removeShard is called on some shard A while movePrimary is simultaneously called to move the primary from some other shard B to shard A. That is, would it be possible that the cluster beings removing A before A becomes the primary, but A becomes the primary before it is removed? The aforementioned scenario would involve a sharded cluster with no primary, which is certainly not desired.

In order to test the behavior of such scenarios, I used a finite-state-machine (FSM) based testing framework that simulated multiple threads being run concurrently by spawning multiple mongo instances and running metadata commands in each of those instances. Each mongo instance represented an FSM, with pre-specified states and probabilities connecting those states. Each FSM would then transition through its respective states in parallel with other FSMs, until the number of iterations specified is reached. For example, a concurrency test that tests removeShard and movePriamry simultaneously could be illustrated via the diagram below.

fsm

Each thread within the test would begin in the init phase, and then transition between running the removeShard and movePrimary metadata commands until a set number of iterations has been reached. For example, a possible setup for this implementation could be written as the following.

// Define the different states of the FSM and what should
// be done in each state.
const states = {
    init        : function init()        { /* ... */ },
    movePrimary : function movePrimary() { /* ... */ },
    removeShard : function removeShard() { /* ... */ }
};

// Each FSM starts in the init state, and transitions to
// next states based on specified probabilities.
const transitions = {
    init        : { movePrimary: 0.5, removeShard: 0.5 },
    movePrimary : { movePrimary: 0.2, removeShard: 0.8 },
    removeShard : { movePrimary: 0.8, removeShard: 0.2 }
};

const threads = 2;      // number of FSMs
const iterations = 10;  // iterations for each FSM

return runFSM({
    states: states, 
    transitions: transitions, 
    threads: threads, 
    iterations: iterations
});

By writing FSM-based concurrency tests for all of the metadata commands, the correctness of each unique combination of metadata commands run concurrently can be tested without explicitly writing the code to test each particular case. This approach is much more concise than writing a single test for each scenario, is easier to document and understand, and hence saves a significant amount of time for developers working on writing concurrency tests.

Final Remarks

I definitely feel that I got to learn a bit about how sharding and replication are done at MongoDB, along with some exposure to how distributed systems are implemented in the industry. I gained experience with writing production-level C++11 code and became more comfortable with working in JavaScript, learning about many of the nice features of ES6 and encountering the gotchas and pitfalls of the language itself.

A few of the things that I wish I could have gotten some exposure to were working with Go, since that has been a language on my bucket list to explore for some time now, and learning more about MongoDB Stitch, a new backend-as-a-service offered by MongoDB.

In retrospect, MongoDB greatly exceeded my initial expectations of how much I would learn within a single summer. Beginning with my initial push to master during my first week and ending with some final bug fixes during my final week, I genuinely felt that I made an impact (as cliché as it may sound) on the millions of MongoDB users worldwide. I felt sad that my internship had come to an end, and very strongly recommend working at MongoDB to anyone interested in working in a fast-paced environment and growing as a software engineer.