I've been in the software industry for a decade and a half and have worked on dozens of projects. Many of the systems that I have worked on were considered legacy systems. As with any system, but even more so with legacy systems, developers will get frustrated with the system's inflexibility. Inevitably this leads to the developers decreeing that if they could only re-write the system, all the problems would be solved. Unfortunately, most product owners will eventually give in to these cries and commission a re-write.
I'm here to tell you today (as both a developer and a manager) that giving in to this urge IS NOT going to solve your problems. What it is going to do is grind your production to a halt and make your customers unhappy. This will have downstream effects on the team as the pressure to produce builds and builds and builds.
So why is a re-write not a viable solution?
Re-writes are usually based on a few commonly held (but false) beliefs in the software industry.
- If we start the project over from scratch we won't carry the problems from the old system into the new.
- If we start the project over from scratch we can use the latest and greatest technologies that are incompatible with our current technology stack.
- If we start the project over from scratch we can move faster and produce results quicker.
Why are these fallacies? If we dig a little deeper we will see that a ground-up re-write means you are more likely to introduce problems in the new system than you are to solve problems in the old system. What is typically glossed over is the fact that the current architecture is doing a lot of things correctly. How do I know this? Because it's the architecture that is in production right now running your business.
Let's take each of these fallacies one by one.
If we start the project over from scratch we won't carry the problems from the old system into the new.
This statement can really be broken down into two parts. The first part says that there are problems in the architecture that prevent you from extending the code and because you're now aware of those problems you can re-architect the software so that those problems no longer exist. The second part says that you won't carry over existing bugs into the new system. The second part of this statement is really related to the second fallacy, so we'll cover it when we cover that fallacy.
Because it is true that re-writing a system with knowledge of the current architectural problems can help you avoid current pain points, most people are quick to accept this statement without challenge. But problems arise at many different times in the life-cycle of a product. Some arise as bugs when writing the software. These can typically be rooted out with some sort of unit testing. The next class of problems crops up when integrating the pieces of the system together. You can create integration tests to help reduce the number of integration bugs, but often there are integration bugs that don't show up in pre-production environments. These tend to be caused by the dynamic nature of content. Because the new system is a re-write of the old system, it will be more difficult to use real inputs and outputs from the old system to test the integration of the new system. Because of this, you're likely to introduce problems in the new system that don't exist in the old system. And because the new system won't be in production till it's done, these new architectural problems are not likely to be found till it is in production.
If we start the project over from scratch we can use the latest and greatest technologies that are incompatible with our current technology stack.
On the surface this statement is likely true. What this statement hides is similar to what's hidden in the previous statement. New technologies mean new bugs and new problems. Again it is likely that many of these problems won't surface till the new system is in production because, as anyone who has worked in the industry for at least a few years knows, production traffic is always different from simulated traffic. You run into different race conditions and bugs simply because of the random nature of production traffic.
If we start the project over from scratch we can move faster and produce results quicker.
The final fallacy is usually the one that most companies hang their hat on, even if they acknowledge that a re-write from the ground up will introduce new bugs and problems and re-introduce existing ones. The reason is that they believe their knowledge of the existing system will let them solve only the problems that need to be solved, which leads to the system being built much faster.
The fallacy in this statement is more subtle but much more severe than the others. The reason is that until your new system performs all functions of your old system, the old system is superior from a business value perspective. In fact, it isn't until the new system has 100% feature parity with the old system that it starts to provide the same business value as the legacy system, let alone more business value. Some will try to gain business value from the new system earlier by switching over to it before there is 100% feature parity with the old system. But by doing this you're offering your customers less value for the same amount of money, time, and/or investment.
This visual does a good job of illustrating the feature parity problem.
What is the solution then?
Are you saying I'm stuck with my current architecture and technology stack? NO! The best way to upgrade your technology stack is to do an in-place re-write. By doing this you help mitigate the problems presented in a ground up re-write. What does an in-place re-write look like?
By segregating and replacing parts of your architecture you're reducing the surface area of change. This allows you to have a well defined contract for both the input and output of the system as well as the workflow.
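To make the idea of a well-defined contract concrete, here is a minimal sketch. Everything in it is hypothetical and not taken from any particular system: the `InvoiceRenderer` contract, both implementations, and the `check_contract` helper are made-up names for illustration. The point is only that once one piece is carved out behind a fixed input and output, the old and new code can be checked against the same expectations.

```python
from typing import Protocol


class InvoiceRenderer(Protocol):
    """Hypothetical contract for one segregated piece of the system.

    Old and new code must accept the same input and produce the same output.
    """

    def render(self, order_id: int, amount_cents: int) -> str: ...


class LegacyInvoiceRenderer:
    """Stand-in for the existing code path that is in production today."""

    def render(self, order_id: int, amount_cents: int) -> str:
        return "Invoice #%d: $%.2f" % (order_id, amount_cents / 100.0)


class NewInvoiceRenderer:
    """Stand-in for the replacement, written against the same contract."""

    def render(self, order_id: int, amount_cents: int) -> str:
        dollars, cents = divmod(amount_cents, 100)
        return f"Invoice #{order_id}: ${dollars}.{cents:02d}"


def check_contract(renderer: InvoiceRenderer) -> None:
    """Run the same expectations against any implementation of the contract."""
    assert renderer.render(42, 1999) == "Invoice #42: $19.99"
    assert renderer.render(7, 500) == "Invoice #7: $5.00"


if __name__ == "__main__":
    # Both implementations must satisfy the same checks before a swap.
    check_contract(LegacyInvoiceRenderer())
    check_contract(NewInvoiceRenderer())
```

Because callers depend only on the contract, swapping `LegacyInvoiceRenderer` for `NewInvoiceRenderer` changes one small piece rather than the whole system.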
An in-place re-write has another huge benefit over a ground-up re-write. It allows you to validate your new system in production as you would any new feature of the system. This allows you to find bugs sooner as well as validate the workflow and feature parity.
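One way to validate a replacement against production traffic is to run it alongside the legacy code and compare results while still serving the legacy answer. The sketch below is an assumption about how that could look, not a prescribed method: `shadowed`, `legacy_total`, and `new_total` are hypothetical names, and a real system would also need to consider side effects and the cost of doing the work twice.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("rewrite.shadow")

Handler = Callable[[Any], Any]


def shadowed(legacy: Handler, candidate: Handler,
             mismatches: List[Tuple[Any, Any, Any]]) -> Handler:
    """Wrap two implementations; always return the legacy result.

    The candidate runs on the same request, and any difference or
    exception is recorded so it can be fixed before customers see it.
    """

    def handler(request: Any) -> Any:
        expected = legacy(request)
        try:
            actual = candidate(request)
        except Exception as exc:  # the candidate must never break production
            mismatches.append((request, expected, exc))
            logger.exception("candidate failed for %r", request)
            return expected
        if actual != expected:
            mismatches.append((request, expected, actual))
            logger.warning("mismatch for %r: %r != %r", request, expected, actual)
        return expected

    return handler


# Hypothetical handlers: the new one has a bug for empty carts.
def legacy_total(cart: list) -> int:
    return sum(cart)


def new_total(cart: list) -> int:
    return max(cart) if len(cart) == 1 else sum(cart[:-1]) + cart[-1]


if __name__ == "__main__":
    found: List[Tuple[Any, Any, Any]] = []
    total = shadowed(legacy_total, new_total, found)
    print(total([1, 2, 3]), total([]), len(found))
```

Customers keep getting the legacy answer throughout, while the recorded mismatches show exactly which real inputs the new code does not yet handle.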
Another benefit of an in-place re-write is that you can decommission parts of the legacy system as you go without ever having to do a big (and scary) "flip of the switch" from the old system to the new system.
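Avoiding a single "flip of the switch" can be sketched as a router that moves a growing share of traffic for one piece of the system onto the new code. This is only an illustration under assumptions: the `Cutover` class, its `percent_on_new` setting, and the idea of bucketing by customer ID are hypothetical choices, not something the approach requires.

```python
import zlib
from typing import Any, Callable

Handler = Callable[[Any], Any]


class Cutover:
    """Hypothetical router for one piece of the system.

    percent_on_new goes from 0 to 100 over time. At 100, nothing calls
    the legacy handler any more and it can be decommissioned.
    """

    def __init__(self, legacy: Handler, replacement: Handler,
                 percent_on_new: int = 0) -> None:
        if not 0 <= percent_on_new <= 100:
            raise ValueError("percent_on_new must be between 0 and 100")
        self.legacy = legacy
        self.replacement = replacement
        self.percent_on_new = percent_on_new

    def uses_new(self, customer_id: str) -> bool:
        # Stable bucket in 0..99 so a customer always gets the same path.
        bucket = zlib.crc32(customer_id.encode("utf-8")) % 100
        return bucket < self.percent_on_new

    def __call__(self, customer_id: str, request: Any) -> Any:
        if self.uses_new(customer_id):
            return self.replacement(request)
        return self.legacy(request)


if __name__ == "__main__":
    route = Cutover(lambda r: ("legacy", r), lambda r: ("new", r),
                    percent_on_new=25)
    on_new = sum(route.uses_new(f"customer-{i}") for i in range(1000))
    print(f"{on_new} of 1000 customers are on the new code path")
```

Raising the percentage is a small, reversible change, and setting it back to 0 is the rollback plan if the new code misbehaves.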
Most importantly, your customers do not suffer when you do an in-place re-write as you are not ever taking away features from your customers. Even better, you can prioritize giving your customers new features earlier by implementing them on the new system even before you've finished porting the entire old system over.