How do you break up a Rails monolith?

22 Dec 2013

This was originally posted on Quora.

Part 1: Why a monolith is hard

Number of contributors

Your application probably made a lot of sense when fewer than a dozen developers contributed to it. As your headcount scales so does the desire of individual engineers to work outside the globally-shared assumptions of a single codebase.

Lines of code

Every additional file shrinks the fraction of the project that any individual engineer understands. We don’t like working on things we don’t understand, so unless we built the whole thing ourselves, a large project just feels wrong.

Data growth

If your application assumes that every piece of data can be relationally joined to any other piece of data, then you’re limited to storing only as much data as fits on a single hard drive. At the time of writing, the very largest, fastest drive you can get is a 6TB drive from Fusion-io. It is not cheap. Once you outgrow it you have to split your data into parts, and it may be easier to break up your application than to teach it to connect to multiple datastores.

Speed

Any large application will have some helpful code that is used commonly enough to become part of the framework. Maybe every web request looks up the user’s physical location via GeoIP, or decrypts encrypted cookies, etc. When you have a feature that has to be super-fast, you’ll wish it lived on its own so that none of the main app’s assumptions applied to it.

Upgrading dependencies

In a huge, well-aged app it’s a giant pain to upgrade the dependencies (like moving to a more modern Rails release). Smaller systems can be more easily upgraded as there are fewer lines of code to modify and fewer changes will be made by other engineers concurrently. This is a bit of a lie, however, because a well-designed monolith has dependencies abstracted away such that the upgrading difficulty is the same.

The underlying problem

Behind the above five things is the real problem: tech debt scales poorly. You can get away with crappy code when your application is 99% the Rails framework and 1% your code. Once it’s 15-25% your code, you own the bugs and the poor design. And, unless you have no engineers contributing to the application and no users using it, there’s inertia behind the current poor design.

Part 2: How to break up your monolith

Extract the most important thing

At Airbnb, Yammer, and Square, the first parts extracted from the monolith were the high-performance or business-critical parts. Whatever behavior your app has to perform fast, correctly, and constantly, and that is directly tied to your competitive advantage as a business, is the first thing you extract. This lets you rebuild the feature properly, and it lightens the stress on the monolith. Once you have, say, credit card payments no longer running straight through your monolith, you can breathe a little easier.

Chances are, though, that after this first extraction there is more code in your monolith because now you’re coordinating the data between the two applications. There are still features in the monolith that depend on data in the extracted service (or vice versa) so you have to bridge them somehow. This has not solved your problem.

Extract more things

Especially if data growth is your problem you can probably buy yourself some runway by finding any data that looks or smells like a log of activity and move it out into a proper logging or analytics service. This still doesn’t solve your problem, however.

Build new features in new services

You’ve reached a tipping point when you have the platform support to create a new service for new features rather than having to build the features first in the monolith. This’ll slow the growth of the monolith by giving you an outlet for new development. It’ll also mean that you have a small team dedicated full-time to coordinating services and building synchronous and batch interfaces between them. This has actually made your problem slightly worse.

If you follow this path long enough and with enough engineers you no longer have a monolith. You have several. There’s the old one (now slightly tamed) and probably at least one new one that, if you were to publish the source, would cause people to gawk at its size and sprawl.

Part 2 for real: How to actually break up your monolith

You need to design your systems.

When your startup is tiny and your biggest threat is that you’ll shut down before anybody hears of you then you’re just building whatever works. You’re probably writing good code but you are likely not writing good systems. This is okay in the beginning but once you’re surviving and trying to hire many more engineers you’ll need to rethink how you organize everything.

Start with data

The most important thing in your company is your data, not your code. So build a system that caters to the size and shape of your data. If you do financial work does the schema of your database use domain concepts that a CPA would recognize? What secondary datastore will you need to analyze your data (to avoid ruining the schema and scalability of the db where it was created)? Is there a natural way to shard your customers? What parts of your data are derivable from others? What can be eventually consistent and what needs to be consistent at all times?
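On the sharding question, one natural answer is a stable hash of a customer’s unique token. Here’s a minimal sketch in plain Ruby; the shard count, token format, and function name are illustrative assumptions, not anything prescribed by this post:

```ruby
require 'digest'

SHARD_COUNT = 8 # hypothetical number of database shards

# Map a customer's stable unique token to a shard number.
# Hashing the token (rather than an auto-incrementing id) keeps
# the mapping stable no matter which service created the record.
def shard_for(customer_token)
  Digest::SHA256.hexdigest(customer_token).to_i(16) % SHARD_COUNT
end

shard_for("cust_8f3a91") # the same customer always maps to the same shard
```

The point of hashing rather than, say, sharding by signup date is that load spreads evenly and no service needs a central lookup table to find a customer’s shard.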

One mistake I’ve often made is trying to centralize all data about a single entity in a single place. In fact, as long as every entity (e.g. a user) has a unique token generated for it, each service and application can and should keep its own private data about the entity and communicate with the others using that token. You can always generate a complete synthesis of the data by pulling information from multiple services. You don’t need to centralize most data any more than you need to centralize most code.
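A sketch of that pattern in plain Ruby, with in-memory hashes standing in for each service’s private datastore (the service names and fields are hypothetical examples, not from the post):

```ruby
require 'securerandom'

# The one globally shared fact about a user: a unique token,
# generated once when the user is created.
user_token = SecureRandom.uuid

# Each service keeps its own private store keyed by that token.
billing   = { user_token => { plan: "pro", card_last4: "4242" } }
profiles  = { user_token => { display_name: "Ada" } }
analytics = { user_token => { signup_source: "referral" } }

# A "complete" view of the user is synthesized on demand by asking
# each service, never by joining their tables in SQL.
def synthesize(token, *stores)
  stores.reduce({}) { |acc, store| acc.merge(store.fetch(token, {})) }
end

synthesize(user_token, billing, profiles, analytics)
# => { plan: "pro", card_last4: "4242", display_name: "Ada", signup_source: "referral" }
```

Because no service ever reaches into another’s tables, any one of these stores can later move to its own database, its own shard, or its own technology without the others noticing.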

Think about the usage profile of your services

Once you extract your first service you might be stunned to notice that, now that you have users, some parts of your system have orders of magnitude more usage than others. People might be favoriting/sharing things 10,000 times per second but you’re only seeing signups happen 5 times per second. If this is the case then you can optimize the signup code to be friendly to your designers and ensure a development experience there that’s optimized for rapid iteration and A/B testing, at the low cost of a few extra servers. And maybe you need to rewrite your favoriting/sharing service in Clojure from the ground up just to eke out every bit of performance.

There will be parts of your system that require speed, others that require ultra-high availability, and others that are mere test features where the important quality is that you can ship them and scrap them fast. Each of these requires a different application design and, frankly, probably a different programming language.

Get your interfaces right

When all your code is in one app you know that you’re using other code correctly because you’ll get an ArgumentError if you improperly invoke a method or a NoMethodError if you try to call a method with a typo’d name. This doesn’t work across service boundaries.

You have two choices for inter-service communication:

1. Write documentation for humans to read, and build a curl-able API into every service so that clients can verify they’re doing things properly. Pray the service and its clients never miss an API update.
2. Use a machine-parseable contract of the API between client and server. Your options here are Thrift, JSON Schema, Google Protocol Buffers, MessagePack, Cap’n Proto, Blink, and probably a few others. These let you ship your API definition to any client and validate the schema of any inter-service message before it is sent.
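To make the second option concrete, here is a deliberately toy contract check in plain Ruby. The real systems named above do vastly more (versioning, nested types, code generation); this only illustrates the validate-before-send idea, and the contract name and fields are made up:

```ruby
# A toy message contract: field name => required Ruby type.
# Thrift, JSON Schema, protobufs, etc. are far richer than this;
# the point is only that the client checks the message against a
# shared schema before it ever crosses the service boundary.
SIGNUP_CONTRACT = { "email" => String, "age" => Integer }

def valid_message?(contract, message)
  contract.all? { |field, type| message[field].is_a?(type) } &&
    message.keys.all? { |k| contract.key?(k) } # reject unknown fields
end

valid_message?(SIGNUP_CONTRACT, { "email" => "a@b.com", "age" => 30 })   # => true
valid_message?(SIGNUP_CONTRACT, { "email" => "a@b.com", "age" => "30" }) # => false
```

This gives you back, across the network, a crude version of the ArgumentError you’d have gotten for free inside a single process.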

Delete everything you can

If you’ve made it this far then you probably have 20+ services and there are diminishing returns to adding any new service – the system isn’t getting any simpler. So the next step in splitting out your monolith into services is bold: delete old features and code. At some point every company needs to carefully review all the features they’ve written to see if they can be deleted or replaced with something simpler. For example, my team recently deleted 16 large, complex files that operated a flaky batch processing job because the service we were sending batch files to had built a better API in the last two years.

Really, anything to lower the amount of code and/or data going through your system (in any service) will mitigate the pain you felt when you started breaking up this monolith.

Part 3: Doing this with a Rails application

First the good news: Ruby on Rails apps have a standardized organizational structure and years of community best practices that help you keep things in place while you grow. This means that a Monorail (Rails monolith) can grow immensely larger than an application in most languages and still continue to function. As long as you have good tests your 200-controller Rails/ActiveRecord app will be far more pleasant to work with than a 200-controller Java/Hibernate app.

Now, the bad news: Ruby has very poor tooling for in-process dependencies. You are allowed to require anything in the load path at any time, and circular dependencies are totally normal as long as you’re willing to do some kind of delayed initialization. This means that your Monorail’s class-level dependencies, if displayed in a graph, would not form a tree; they’d form a tangled graph full of cycles. This is also known as the “big ball of mud” software pattern.

So unless you’re SUPER lucky and your developers have read Sandi Metz’s book (search for ‘POODR’), they won’t have added explicit ‘require’ statements at the top of their source files and will instead just assume that everything gets loaded into memory all the time.

So your biggest challenge dissecting a Monorail is straightening out the dependencies such that you know what’s referring to what. Once you know that feature X can live in a separate database (or service) and nothing will try to join its SQL queries across feature X’s tables then the extraction is relatively straightforward. But getting there is hard because there’s currently no tool that I know of to analyze the kinds of queries you make in production and develop a dependency graph of models. Please build this and I’ll totally use it.
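A rough sketch of what such a tool could look like, in plain Ruby: scrape table names out of production query logs and count which tables appear together in the same query, yielding the edges of a model dependency graph. The naive regex and function names are my own illustration; a real tool would need a proper SQL parser:

```ruby
# Extract table names from a single SQL statement.
# Naive on purpose: only handles simple FROM/JOIN clauses.
def tables_in(sql)
  sql.scan(/(?:FROM|JOIN)\s+`?(\w+)`?/i).flatten.uniq
end

# Count co-occurrences: tables queried together are candidate
# join dependencies, i.e. features that can't be split apart yet.
def cooccurrence_edges(log_lines)
  edges = Hash.new(0)
  log_lines.each do |sql|
    tables_in(sql).combination(2) { |a, b| edges[[a, b].sort] += 1 }
  end
  edges
end

logs = [
  "SELECT * FROM users JOIN favorites ON favorites.user_id = users.id",
  "SELECT * FROM favorites WHERE item_id = 7"
]
cooccurrence_edges(logs) # => { ["favorites", "users"] => 1 }
```

Tables that never share an edge with the rest of the graph are exactly the ones you can carve out into a separate database or service first.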


Please reach out if you found this post helpful or have questions.