Recently I’ve been working on two legacy projects, both of which contain a substantial amount of stinking code. In fact, the reason I was assigned these projects may well be that the code bases had become unmaintainable for the previous developers (who were seemingly selected for their low cost and, consequently, low skill). Still, there is never a project you can’t learn something from, despite the otherwise poor coding practices in use.
On the second project I have been wondering about the overall error management philosophy in use. Although the code is full of duplication, long methods and security vulnerabilities, the error management seems to have been written by someone determined to prevent errors from being communicated to the user until the last possible moment. Although under the hood there may be (and are) multiple exceptions raised, for example from missing server connectivity or erroneous SQL, the system uses a multi-layered try-catch structure to prevent any errors from being shown directly to the user (though for the most serious bugs even that is not enough). It makes me wonder about the organizational culture of a software company that pushes its low-skilled coders to extremes of error management rather than maintaining quality in the first place…
Anyway, since the approach to error management in this project is so different (I’ve always happily thrown errors at the users), I thought there might be something to learn from it. From the user experience (UX) point of view, things actually seem to go rather smoothly, since the system almost never appears to have any problems, at least on the surface. As you learn the system more deeply, you start to notice that the results produced are not quite right, but it takes months and deep domain-level expertise before the problems become visible. I began to think this might generally be a better approach to error management than just throwing everything at the user, who might not know what to do about it (unless they happen to be a programmer or a system administrator). In fact, the error messages I’ve traditionally written are more related to debugging than to giving the user useful information on how to recover from the problem.
In the end, there are many non-debugging-related problems that may need user involvement. Typically these kinds of exceptions are related to the external connections and environment of the software rather than to its internal operations (where you need debugging messages). For example, your Internet connection or database might be down, which cannot be fixed by the developers or the software alone. What, then, would be a better approach to error messaging than the two extremes of throwing errors directly at the user, or suppressing them entirely and giving no information about the problems?
Jidoka: 23 Steps of Autonomation
Shigeo Shingo has described 23 stages of autonomation, describing how a system can manage errors. At the first level, the system neither detects nor reacts to any errors, but needs a human operator to constantly monitor it for irregularities. At the highest level of automation, the system can both detect and fix errors by itself, continuing operations and minimizing the need for human involvement.
Quote from Wikipedia: “Jeffrey Liker and David Meier indicate that Jidoka or ‘the decision to stop and fix problems as they occur rather than pushing them down the line to be resolved later’ is a large part of the difference between the effectiveness of Toyota and other companies who have tried to adopt Lean Manufacturing. Autonomation, therefore can be said to be a key element in successful Lean Manufacturing implementations.”
Thus it seems that a better way of managing errors would be Jidoka-style error recovery. For a computer program, detecting problems is usually quite easy, using try-catch statements. The difference lies in what to do next. The traditional options are to pass the error forward to the next level (the user), to suppress it, or to log it for developers and sysadmins to fix later.
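The three traditional options can be sketched as follows (a minimal illustration with hypothetical names; `fetch_report` stands in for any server call that may fail):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")

def fetch_report(connection):
    """Hypothetical server call that may fail."""
    raise ConnectionError("server unreachable")

def handle_traditionally(connection, strategy="log"):
    try:
        return fetch_report(connection)
    except ConnectionError as exc:
        if strategy == "rethrow":      # 1. pass the error up to the user
            raise
        elif strategy == "suppress":   # 2. swallow it silently
            return None
        else:                          # 3. log it for devs/sysadmins to fix later
            log.error("fetch_report failed: %s", exc)
            return None
```

None of these actually fixes anything; the interesting fourth option, recovery, is what Jidoka points toward.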
Shingo suggests that, when feasible and cost-effective, the system should try to repair itself and recover from the detected error. One bug I’ve recently been fixing relates to exactly this. It occurs only in the rare situations when the client software loses its connection to the server. Try-catch statements detect the situation, and the recovery process includes passing the input back to the user. The problem is that although on quick inspection the result “looks” correct, it is missing vital added-value information provided by the server. In addition, the recovery process introduces a duplication bug of its own, so the recovery is both erroneous and recovering in the wrong way. Initially I thought it would be enough to fix just the bug I was assigned. However, while writing automated test cases, I noticed that since the recovery process itself was malfunctioning, a different approach was needed: the code was “seemingly recovering” without actually recovering from the missing server connection.
How could the missing server connection be remedied? I considered a few approaches. First, the client-server connection wouldn’t need to be synchronous: an asynchronous queue and messaging system could handle the recovery better. A monitoring system should be set up to notify the system administrators of missing database or server connectivity, or other environmental problems. The system could queue (rather than block) the messages until the environment has been restored. The particular situation where the issue arises is actually during development while commuting without an Internet connection; such development environment issues could also be remedied by using mock services that simulate an operational server.
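The queue-until-restored idea can be sketched like this (all names are hypothetical; `send_func` is whatever actually transmits a message, `is_online_func` is whatever environment check you have):

```python
from collections import deque

class QueueingClient:
    """Queues outgoing messages while the server is unreachable,
    flushing them once connectivity is restored."""

    def __init__(self, send_func, is_online_func):
        self._send = send_func            # actually transmits one message
        self._is_online = is_online_func  # environment/connectivity check
        self._pending = deque()

    def send(self, message):
        self._pending.append(message)     # never block the user
        self.flush()

    def flush(self):
        # Drain the queue only while the connection is up.
        while self._pending and self._is_online():
            self._send(self._pending.popleft())
        return len(self._pending)         # messages still waiting
```

In development, the same `send_func` slot could be filled by a mock service simulating an operational server, and a monitoring hook could alert sysadmins whenever `flush()` keeps returning a non-zero backlog.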
Facebook has actually built an automatic remediation system for its infrastructure called FBAR.
The original idea for automatic recovery came from refactoring legacy code with automated tests, so I recommend the approach for all projects – automated testing surfaces issues that are otherwise easily overlooked. I also find the idea of automatic recovery important from both the user experience and error-proofing (Six Sigma) points of view. When you catch an exception, do not pass it forward or suppress it, but initiate a recovery process that tries to fix the situation. The remedy can, for example, involve an asynchronous messaging system, monitoring, or mocks. Users and admins should be notified only when the recovery process also fails.
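The "recover first, notify only on failure" policy can be summed up in one small sketch (hypothetical names; `recovery_steps` might be reconnect attempts, a switch to a mock, or a queue drain):

```python
def with_recovery(operation, recovery_steps, notify):
    """Try the operation; on failure, attempt each recovery step in
    order and retry; notify users/admins only if every attempt fails."""
    try:
        return operation()
    except Exception as first_error:
        for step in recovery_steps:
            try:
                step()              # e.g. reconnect, fall back to a mock
                return operation()  # retry after the remedy
            except Exception:
                continue            # this remedy failed; try the next one
        notify(first_error)         # recovery itself failed: now tell someone
        raise first_error
```

The key design point is the ordering: the user-visible error path is the last resort, not the default, which is exactly the inverse of the "happily throwing errors to the users" habit described above.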