Let’s pretend that we’re in the house-building industry (c’tors Inc.). One day, while you’re getting fresh air and working hard, the building inspector comes along, climbs to the top of our not-yet-complete structure, yells “There’s something wrong with the left side of the building!” and then goes away. As a pretend construction professional – what do you think are the chances that someone would fix the problem?
If the scenario above sounds crazy to you – that’s OK. Unfortunately, I see it unfold daily.
Most companies these days have some kind of automatic build process (and I use the term loosely): files get checked in/submitted/pushed (all of the above?) to source control, and a server tries its best to build the new source (and maybe run some tests) at some point, according to a preconfigured trigger – anything from immediately to the next day. The problem starts when that process fails. At that point there are two possible outcomes: someone fixes the problem and gets the build passing again (a.k.a. green), or everybody ignores that server for a long time, which could become forever.
I’ve noticed that software developers who ignore a broken build usually do not do so out of malice or laziness.
Unfortunately, a broken build means that although someone (perhaps you) took the time to automate parts (or all) of the build/test process, all of that hard work is wasted because no one would fix the damn build.
I’ve noticed that when a build is left broken for a long time, it happens due to one of the following reasons:
- No/little build visibility
- Lack of knowledge
- No definition of individual responsibility
Ideally, every relevant member of the team should know when a build fails. Better yet, the whole company should have easy access to the current build state.
Consider the following:
1. All of the team has access to the build server by URL
2. Email is sent to the relevant person(s) when a build fails
3. A 60-inch screen in the middle of the dev room shows the current build status
4. When a build fails, a big red light mounted in the dev room/hallway blinks
5. When a build fails, a picture of the person who broke the build shows on every screen in every conference room
I think #5 is going too far but you get the point.
If you think that installing a build server and making the URL available to the whole company is good enough – I’ve got news for you. People are way too busy to go to that URL and try to understand what the build server is showing them. Adding email in case of failure is also a good idea but not sufficient – after a few of those, some (read: most) developers will learn to ignore them. If you add email notifications on successful builds, you’ll only make this process (of ignoring builds) happen faster.
A failing build should be visible and impossible to ignore
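To make the email point concrete, here is a minimal sketch of a notification rule that stays quiet on routine successes. The names `Status` and `should_notify` are hypothetical and do not come from any particular build server; notifying once when the build goes back to green is my own assumption, not something every team will want.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    RED = "red"


def should_notify(previous: Status, current: Status) -> bool:
    """Decide whether a build result deserves a notification.

    Assumed policy: always announce a red build, announce the
    red -> green recovery once, and stay silent on green -> green
    so that people do not learn to ignore the messages.
    """
    if current is Status.RED:
        return True
    # current is GREEN: only worth mentioning if we just recovered
    return previous is Status.RED
```

With this rule a week of passing builds produces no mail at all, so a message from the build server always means something changed.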
At one company I worked with, some developers didn’t even know what the build URL was, and had no idea how to find out why the build had just failed…
Another important factor is how easy/hard it is to discover why the build has failed. Not all build servers were created equal – some do a better job of showing the root cause of the failure and some require reading 10 pages of logs. My point is – fixing a broken build happens when you need to do something else (developing software) and as such should be as simple and painless as possible.
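If your build server only gives you pages of logs, even a tiny helper can shorten the hunt. The sketch below returns the first log line that looks like a failure. The function name `first_failure` and the marker words are assumptions made for illustration; real logs differ between tools, so the markers would need tuning.

```python
from typing import Optional, Sequence, Tuple


def first_failure(
    log: str, markers: Sequence[str] = ("error", "failed")
) -> Optional[Tuple[int, str]]:
    """Return (line number, text) of the first line containing a marker.

    Matching is case-insensitive and deliberately naive: it is a
    starting point for reading the log, not a root-cause analysis.
    """
    for number, line in enumerate(log.splitlines(), start=1):
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            return number, line.strip()
    return None  # nothing that looks like a failure was found
```

Putting that one line at the top of the failure email or status page saves most people from opening the full log at all.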
This is usually a problem if the build script does too many things. Let’s go back to our imaginary scenario where the building inspector shouts about a problem in one of the building’s components – and I’m not familiar with that component, or I don’t have the right expertise to fix that particular problem. In that case I’m going to continue working as if nothing happened – or go and grab a cup of coffee until the problem resolves itself.
The problem with a big build script that does a lot of things is that it’s hard to tell why a specific step (or 100 tests) has just failed, and then everyone on the team gets a bad case of “it’s somebody else’s problem”.
After fixing the visibility problem we know the build has failed, and with some investigation we can tell why – and yet it does not matter if the problem domain is so complex that no one knows how to make sense of the failure.
The right solution is to try and split the build into several individual builds where each team (and each team-member) knows exactly where their responsibility (development wise) starts and ends.
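As a rough sketch of what the split buys you: once every build has a single owning team, a failure points at exactly one group of people instead of at everyone. The `Build` type, the build names and the team names below are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Build:
    name: str     # e.g. one component or one test suite
    owner: str    # the team responsible for keeping it green
    passed: bool


def teams_to_alert(builds: Iterable[Build]) -> List[str]:
    """Return the owners of failing builds, sorted and without duplicates."""
    return sorted({build.owner for build in builds if not build.passed})


if __name__ == "__main__":
    results = [
        Build("core-library", "platform", True),
        Build("web-ui", "frontend", False),
        Build("ui-tests", "frontend", False),
        Build("installer", "release", True),
    ]
    print(teams_to_alert(results))
```

In this made-up run only the frontend team is alerted; the other teams keep working on a build that is still green for them.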
At the heart of a healthy process lie personal responsibility and integrity.
When a build fails, the last person to commit code is responsible for making sure that the build passes again as quickly as possible. Anyone affected by the failure is responsible for not making the problem worse by blindly committing more code, and for helping if asked. As simple as that. This kind of personal integrity can only be achieved if build failures are visible and easy to investigate. Some teams need a manager to tell them so and some need a simple reminder from time to time. It usually helps if there is someone who is passionate about the build, although this is a team effort, not “Joe’s” problem.
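The rule above is simple enough to write down as code, which can help when you want a bot or a commit hook to remind people of it. This is only a sketch of the team agreement: `responsible_person` and `may_commit` are hypothetical names, and how you learn who committed since the last green build depends on your source control.

```python
from typing import Optional, Sequence


def responsible_person(authors_since_last_green: Sequence[str]) -> Optional[str]:
    """The last person to commit owns the fix (authors are oldest first)."""
    if not authors_since_last_green:
        return None  # nothing was committed, so nobody to point at
    return authors_since_last_green[-1]


def may_commit(build_is_green: bool, commit_is_fix: bool) -> bool:
    """While the build is red, only commits meant to fix it should go in."""
    return build_is_green or commit_is_fix
```

Whether `commit_is_fix` comes from a commit message tag or from a quick chat with the person fixing the build is up to the team; the point is that piling unrelated changes on a red build is off the table.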
I would avoid shaming (e.g. showing the build breaker’s name on all conference screens) and instead try to understand why people don’t care that the build is broken. Usually it has something to do with one of the previous points and not with a lack of commitment.
A broken build is not a pretty sight and should be fixed as quickly as you can. The good news is that it’s easily solved with the proper tools, education and plain old nagging, as long as you take the time to understand why other talented developers seem content to leave it broken.
Try it out – you might be surprised to find out that you’re not the only one who cares.
Until then – happy coding…
Labels: CI, Thoughts