Sunday, August 27, 2006

Continuous integration kills large projects

Continuous Integration is a widely accepted approach in software development. Martin Fowler has written a great article about it.

However, for a large project it does not work. I will explain why not. Suppose we have a big project with 100 software developers. Each of them is an excellent worker, and only 1% of their deliveries causes the daily build to fail. With 200 working days per year, a developer causes only 2 daily builds per year to fail. Can you beat that?

Now although this is an excellent achievement, with 100 developers it adds up to about 200 failed daily builds per year, which means... every day the build fails!
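The arithmetic above can be checked with a back-of-the-envelope model. Note my added assumptions: every developer delivers once per working day, and failures are independent (the post does not state either explicitly):

```python
# Illustrative model of the post's numbers; the daily-delivery and
# independence assumptions are mine, not the post's.
developers = 100
failure_rate = 0.01       # 1% of deliveries break the build
working_days = 200

failures_per_dev_per_year = failure_rate * working_days        # 2.0
failures_per_year = developers * failures_per_dev_per_year     # 200.0
expected_failures_per_day = failures_per_year / working_days   # 1.0

# Chance that at least one of today's 100 deliveries breaks the build
p_build_fails_today = 1 - (1 - failure_rate) ** developers     # ~0.63

print(expected_failures_per_day, round(p_build_fails_today, 2))
```

So on average one delivery per day breaks the build, and under these assumptions roughly two out of three daily builds fail.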

So every day, the build manager has to figure out what is wrong. And since each of the developers is an excellent worker, finding the true cause of the build failure is hardly ever a trivial challenge. So with the help of the developers who have delivered, it might well take the whole day before the problem is fixed. Then the build has to be distributed across all development teams and all development sites, and incorporated into the developers' workspaces. There will be little time left for integrating the new build with the changes in each developer's private workspace. So... continuous integration leads to continuous build failures!

In addition to the continuous build failures, a developer who must stay in sync with the latest build has to integrate it with his own changes every day. If this takes 1 hour, he will spend 4 hours per week on synchronization, which is at least 10% of his time, day in, day out. Developers will probably get frustrated about going through their old changes every day, just because "build management provides a new build". So... continuous integration leads to frustration!

But when developers get frustrated, they will get sloppy in their work. And when they are sloppy, they are more likely to deliver changes that cause build failures. Let's assume the per-developer failure rate increases from once per half year to once per quarter. That is 400 build failures per year, or 2 build failures per day. So build managers are confronted with more problems causing the build to fail, which probably requires more work to uncover and fix. One day will not be enough. So... continuous integration stops being continuous!
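Under the same illustrative assumptions as before (100 developers, 200 working days per year), the sloppiness scenario works out as:

```python
# Same made-up model as before; only the per-developer failure rate changes.
developers = 100
working_days = 200

# Failure rate doubles: from once per half year (2 failures/year)
# to once per quarter (4 failures/year) per developer.
failures_per_dev_per_year = 4
failures_per_year = developers * failures_per_dev_per_year   # 400
failures_per_day = failures_per_year / working_days          # 2.0

print(failures_per_year, failures_per_day)
```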

What can we do to overcome this problem?

I think we can go in two directions:

  1. Reduce the scope of the continuous integration to smaller teams
  2. Replace continuous integration with iterative integration

You can reduce the scope of the integration by dividing the large project into smaller teams, for example sub-system teams. Each team then performs continuous integration in a local scope, independent of the builds of the other teams. Only when the local build succeeds are the changes promoted to a higher level of integration, for example system integration. When the system integration fails, you have a choice: either get the "guilty" developers involved in fixing the system integration, temporarily extracting them from their sub-system team, or force the "guilty" sub-system team to incorporate the changes of the (failing) system integration build into their sub-system build.
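A minimal sketch of the promotion idea. The team names, the helper function, and the pass/fail results are made up for illustration; the only point is that changes reach the system-integration stage only after the local sub-system build is green:

```python
# Hypothetical promotion step in a layered integration strategy.
def promote_green_builds(subsystem_results):
    """Return the sub-systems whose local CI build passed and which are
    therefore promoted to the system-integration stage."""
    return [name for name, passed in subsystem_results.items() if passed]

# Made-up local build outcomes for three sub-system teams.
local_builds = {"billing": True, "reporting": False, "ui": True}

staging = promote_green_builds(local_builds)
print(staging)  # "reporting" is held back until its local build is green
```

A failing system-integration build then only involves the teams whose changes were actually promoted, instead of all 100 developers at once.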

The other approach is to move to a weekly integration, to which all developers deliver. It is a "slow" variant of the continuous integration approach. With less frequent integration, the number of integration problems per cycle will increase, but instead of some hours, you have some days to fix them. And developers lose less time on synchronization with the build, since they only have to do it once per week. I have seen this work, but only for a relatively quick build. For a build that takes several hours to run, the cycles for fixing the build are too long, and even weekly integration will get stuck.

So my conclusion is that for large projects, it is better to divide the project into smaller teams, each with their own continuous build. It requires a more thorough (layered) integration strategy, and propagation of changes will be slower than with continuous integration. Continuous integration does not work for large projects.

Updated 14/4/2007:
Added labels.

4 comments:

Brad Appleton said...

Hi Frank!

I think that in your scenario, long before you ran into the problem of daily build failures, you would first see progress grind to a screeching halt due to commit contention and build-cycle time. If the project is large, then let's assume the build takes at least 30 minutes (which is pretty modest for a "clean" build, yes?).

There is no way 200 developers are going to be able to commit their changes even as often as every other day if it takes 30 minutes to build+test and no one else can commit changes during that time.

So what do you do? You do the classic technique of recursive decomposition. You absolutely *have* to decompose the system into separately built+integrated components.

Each component can then do continuous integration (if not, you recursively decompose again). And all the components need to be integrated together and tested at a somewhat higher frequency (perhaps a couple times a week).

The paper discusses this in a little more depth. Basically, each component should periodically "push" its latest good build into a staging area where all the other components "dock" their latest good builds.

They may not be able to do this as frequently as each component builds. They will probably have to use what Don Reinertsen calls
Nested Synchronization and Harmonic Cadences. I blogged a bit about that (following the previous hyperlink).

So I would say it's not accurate that "Continuous Integration Kills Big Projects", because I think that continuous integration wasn't intended to scale "linearly", but recursively, at multiple levels of integration. Agile is not only about having frequent feedback, but also about having feedback at all levels of scale (just as the XP practices are themselves arranged at the levels of individual, team, customer).

Cheers!

Unknown said...

Thanks for your comment, Brad!

I think you are absolutely right. That's why I argue that it is better to reduce the integration scope (single components, or single teams) and use a multi-layer integration.

In the organizations I am usually involved with, a central build has worked for many years, while gradually degrading the project's performance.

They argue that every extra integration layer adds a build cycle, so integration with the changes of other components is postponed until the next layer, and integration problems will show up later. "How will performance improve if we add delays?", they ask.

It's hard to convince them that "grind to a screeching halt" is where they are heading, especially since I have no reference where it actually happened. Well, actually I do, but I know nobody who wants to admit that a bad integration strategy was to blame for it.

Maybe the title is a bit harsh, but it does draw attention, doesn't it?

Anonymous said...

Whatever you have explained is exactly what is happening in my case. The build time is 2 hours, and using Hudson, if the build fails it takes another 2+ hours to build and test. The daily build is a pipe dream.