Software Won't Fix a Bad Safety Culture - Part 2
In my previous article last week, I described what I view as the systemic issues that played into Boeing's deployment of a flawed 737 Max. As you may recall, that article began as an extended commentary on a piece published by EE Times -- "Software Won't Fix a 'Faulty' Airframe". This article continues on that path, with some discussion of the "cultural laziness" described in that piece, along with other elements of safety-culture within regulated industries such as aerospace. [The original version of this article was published on LinkedIn.]
In my view, "safety-culture" connotes an ambient peer-pressure to do the right thing instead of the expedient thing. That pressure often seems to be lacking when decision-making goes bad. This article is about the people-factors that can lead to safety issues. The issues with the 737 Max were not unforeseeable; the evidence suggests that people did not follow the processes they were required to follow. Since I can't know which people-issues may have been at play while that aircraft was being developed at Boeing, I will draw on examples from my own experience of more than 20 years with clients who make safety-critical equipment and applications.
Team makeup can vary greatly from one organization to the next. It is fairly typical for a "team" assigned to a given subset of functionality to be made up of 10 or fewer people. Sometimes just a few teams handle the engineering and quality assurance for a single product, or even a product family. In these smaller groups, I've seen fewer issues. Where systems are more complex, the engineering organizations tend to be larger. Some groups I've worked with have been as large as 150 people, "fun-sized" into teams, each responsible for a small piece of the overall system. Coordinating responsibilities among these many teams introduces a great deal of process complexity, politics, and obfuscated communication.
Politics - Engineering organizations are like any other -- cliques and hierarchies form. Some people will irritate or annoy, some will be technically inept yet skate by on their likeability, and everybody wants to get ahead. Most people will be above the politics most of the time, but now and then politics get in the way of the best decision. You'll see this when a person is given a high-profile task they are ill-suited to, either to boost their profile or to reduce the profile of a better candidate. Often this is self-correcting; occasionally it is not. I suspect this facet of the culture will never be eliminated. People are people.
Inter-team Communication & Coordination - In larger organizations, gaps in communication and coordination among teams can create issues. It doesn't require much imagination to see how this might impact safety.
Process Complexity - This is an article about people, but safety-culture is heavily influenced by process. Defined processes are a necessary part of any safety-critical development effort. As groups and systems get larger, and coordination issues become more complex, there is a perceived need to harmonize the work being done by the various teams. This is typically done by creating bigger and more elaborate processes. A certain degree of this is helpful, but it is often taken too far (*1). One common manifestation of the complexity problem is the coding standard for software development. Put simply, a coding standard is a set of rules and constraints for writing software. In an effort to make code perfectly consistent within and across projects, coding standards can rapidly become overwhelming, and many rules tend to be the arbitrary formatting preferences of those writing the standard. When a coding standard includes 135 pages of rules, what are the odds that any developer will remember them all, or that peer review will catch every violation? Yes, I've seen coding standards get this big. In every case where I've seen this kind of bloat, I've also seen that enforcement of the standard is lax (*2). Since the real point of a coding standard is to help ensure that the code is sound, reviewable, and maintainable, arbitrary rules accompanied by lax enforcement introduce risk. In my opinion, it is far better to have 5 - 10 pages of important rules that are easily remembered by all, and consistently enforced.
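To make that distinction concrete, here is a minimal sketch in C. The rule examples and function names are invented for illustration and are not drawn from any client's actual standard; one rule is purely about formatting, the other actually affects safety.

```c
#include <stdio.h>

/* Illustrative stand-ins for a real driver call and fault handler. */
#define VALVE_OK 0
static int vent_valve(void) { return VALVE_OK; }
static void enter_safe_state(void) { printf("entering safe state\n"); }

int main(void)
{
    int pressure = 120;
    int limit = 100;

    /* An "arbitrary" rule: where the opening brace goes. Either placement
       compiles and behaves identically; the rule buys consistency, not safety. */
    if (pressure > limit)
    {
        /* A "substantive" rule: check the return value of every safety-relevant
           call, so a silent failure cannot be masked. */
        if (vent_valve() != VALVE_OK)
        {
            enter_safe_state();
        }
    }

    return 0;
}
```

A short standard that concentrates on rules of the second kind is easier to remember, and far easier to enforce consistently.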
Poor Training - All organizations want their people to be well-trained on the technologies they are using, but many aren't willing to absorb the cost of that training. I worked with one organization that had adopted the SEI Personal Software Process (PSP) methodology. This process is considered a best practice for improving both quality and productivity in software development. The organization had gone to the effort of developing a tool-set for entering and maintaining the data required for the method. With over 100 software engineers on staff, they took a little short-cut in the training. Instead of enrolling the entire group in the two separate 1-week courses needed to master the topic, they sent 2 people who were not part of the engineering group, who then distilled the material into a 1-day overview course given to the larger group. At least, that was my understanding; those of us who joined the group after the overview had been presented were just given a 10-minute briefing on putting data into the tool-set. My view of PSP was that it was the most ridiculous approach I had ever used (*3), because the people who explained it to me had only a couple of major takeaways from their training: 1) you keep data so you can improve your development, and 2) don't spend much time reviewing your own code, because it's better for the verification teams to find more defects. Because of the latter, at least some engineers were being intentionally lazy about finding defects in their own code, and they actually believed it to be a good thing. As it turns out, their understanding of PSP was almost entirely incorrect due to bad training. I don't know what percentage of the engineering group shared these misperceptions about PSP, nor how many put those misguided ideas into practice. Having experienced how the method was applied, I am pretty confident that mandating PSP was a net loss, perhaps to the tune of millions of dollars on this one program alone. Since verification will never find 100% of the defects that these engineers left behind, the safety-risk introduced into the system is incalculable.
Budget, Schedule & Engineering Judgement - Often, though not always, managers come from the ranks of engineers who were less technical and more personable. This seems to account for about half of the technical managers I've worked with through the years. It is not necessarily a recipe for failure. When these managers recognize their limitations and trust the judgement of their engineering teams, it can go quite well. On the other hand, when these people believe they became managers due to their own expertise, it can cause them to override sound engineering decisions. Good engineers tend not to be lazy; instead they tend toward perfectionism and over-engineering. This sets them up for conflict with budget and schedule. When a manager is less technical than his/her reports, how can they determine whether an enhancement suggested by an engineer is a requirement for safety, or just over-engineering? Sometimes it is pretty clear, other times not. This can be a source of risk.
One of the most egregious departures from safety-culture that I have witnessed was justified by schedule & budget. While working with a critical subsystem for a commercial airliner, I was involved in a safety analysis of the software process interactions. Because of the nature of the analysis, I was reviewing huge swaths of code in a short period of time. I discovered a large number of defects, most of which were innocuous, but some of which were not (*4). As the process required, I logged all defects into the issue tracking system and tagged each with my assessment of its severity. By the end of my analysis I had logged dozens of defects, a few of which were very serious. A couple of weeks later, I got a tip that the Change-Control Board (CCB) was intentionally disregarding all of the defects I had reported. It seems that a senior manager on the board had determined that I was reporting too many issues, and believed that dispositioning them was impacting CCB productivity and potentially the project schedule -- so the board would simply defer any issue I had reported until a later release, without evaluation (*5). I took the information to my team lead. He reviewed all of the issues I had flagged, and agreed with me about their severities. There were about five that were serious enough for concern. He asked me to re-write each of those, changing the wording to make them appear original, and he would re-submit them in his own name. We did that, and all of these issues were immediately prioritized for resolution. Had I not been tipped off about the CCB; had I been discouraged and just let it ride; had I cared more about receiving credit for the finds than about their resolution; or had my team lead not been a hero in his own right -- this might easily have resulted in a catastrophic loss of aircraft.
Shortcuts and Laziness - I worked with one organization that was building a complex safety-critical aerospace system. During code reviews, I was told several times that bugs I had reported were not valid. When I would point out the clause of the coding standard that had been violated, I was told to "ignore that one". When I asked my manager about this practice, he said that the published coding standard was basically a guideline, and that the organization allowed the team leads to establish standards for their own teams -- but these weren't written standards, they were just whatever the team lead liked. Effectively, there were no coding standards in regular use. While I did not see any unsafe code result from this lack of discipline, I don't know how one could demonstrate that a valid review was ever done. An audit should always be able to show which standard and version a given module of code was reviewed against; an audit of this code would have found very little conformance to any written standard. I was also shocked to learn that nobody within the organization seemed to know which FAA-mandated standards and regulations applied. The site manager told me that those were probably managed out of another office. While there seemed to be solid discipline in following the processes, it was as if they were just going through the motions. The safety-culture seemed non-existent. I left the organization at my earliest opportunity.
Ego & Ignorance - I once created an automated test that reliably and repeatedly caused a medical device to improperly reset. When I took my results to the engineer who was responsible for the failing component, he reviewed his code and declared that my result was not possible. He insisted that the design was sound and that his code conformed to the design; therefore the fault had to be that my test was detecting a problem where none existed. After ruling out that interpretation, I reviewed the design, which I found to be sound. I then reviewed the code and found a problem -- where the design called for a specific condition to be not equal to false, he had coded a test for the condition to be equal to true. While that sounds perfectly reasonable, the subtle problem is that in the C programming language, in which this code was written, true and false equate to the integer values 1 and 0, respectively. Since an integer can take values other than 1 and 0, "equal to true" (exactly 1) is not the same as "not equal to false" (anything other than 0). When I showed the engineer this error, he seemed very offended that I had reviewed his code, and continued to insist that it was correct. Eventually, when I kept pushing the issue, he suggested we take it to his supervisor. The supervisor listened to both sides and concurred with my assessment. The issue was resolved, but only because of my willingness to keep pushing. This engineer was one of the best on his team, and I certainly respected his abilities. However, he seemed unable to accept and evaluate a critique from someone he perceived as less experienced than himself. He was sure that I would be reprimanded for making him involve his superior, and he told me so. Fortunately for the customers of this device, his supervisor was willing to listen and act.
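A minimal sketch of the mismatch is below. The names and macro definitions are mine, for illustration only -- the actual device code looked nothing like this -- but it shows the behavior: with a flag value of 2, the test the design intended fires, while the test as coded does not.

```c
#include <stdio.h>

/* A common convention in older embedded C code; names here are illustrative. */
#define FALSE 0
#define TRUE  1

int main(void)
{
    /* Suppose a status read yields 2: logically "set", but not literally 1. */
    int condition = 2;

    /* What the design called for: any non-zero value satisfies the condition. */
    if (condition != FALSE) {
        printf("design intent: condition holds\n");
    }

    /* What was coded: only the exact value 1 satisfies the condition. */
    if (condition == TRUE) {
        printf("as coded: condition holds\n");   /* never prints when condition == 2 */
    }

    return 0;
}
```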
Organizational Hubris - This is Ego taken to the next level. I mentioned it in the previous article; it seems incredible, but I've seen it in practice. There are organizations that believe their long experience makes them infallible, and therefore unaccountable. I was working on a complex aerospace system built on top of a multi-tasking real-time operating system. The operating system was developed by an external organization with close ties to ours. The system we were creating was to be installed in a new commercial airliner, and the aircraft manufacturer required specific types of analyses and reports for all software components, including the operating system. The creators of the operating system were taking a stand: they had the needed documents, but would not release them to us or to our customer. It seemed to be a point of pride for them. Their argument was that we should trust them, because they had been doing this for a long time. Admittedly, their reputation was sterling. All of the members of that team were well known to us, and nobody doubted that they had done exactly what was required of them. That did not change the fact that we needed the documentation to show that the proper analyses had been done. I was asked to help facilitate their cooperation, and though it took a few weeks and multiple teleconferences, our team was finally able to present them with an argument they found compelling. They delivered the documents, and as expected, their analysis was flawless. A good safety-culture requires us to provide objective evidence that we have done the right things, and done them correctly. It doesn't allow for exceptions just because we think we are that good -- even if we really are.
Some of the enemies of safety-culture are inherent in human nature. Others represent a conscious departure from, or de-prioritization of, safety. Safety-critical applications involve hard engineering problems, and even harder people problems. Those of us working in this space need to call out problems when we see them. Failure to do so may have far-reaching, disastrous results.
===
*1 - A more famous example was [DOD-STD-2167A](http://continuum.org/~brentb/DOD2167A.html), a software process standard mandated for use by all defense contractors in the late 1980s. I never had the honor of working to this standard, but it's been said that it required roughly 10 lines of formal documentation for every line of software written. I understand that many vendors were able to justify waivers just due to the cost of adhering to it. When processes grow this complex, people naturally look for shortcuts and loopholes.
*2 - There are tools that can automate checking for some kinds of rules -- this is the preferred solution when the rules are complex. Often, however, a tool will find violations, and a case-by-case decision will be made to disregard them. This too may lead to shortcuts, such as filtering out and disregarding anything below a given severity.
*3 - PSP is actually a very good methodology, but not as it was introduced to me, nor in the way it was applied by that organization. In fact, a 2006 study concluded that PSP was one of only two published methodologies that consistently improved both software quality and developer productivity, the other being Correctness-by-Construction (CbyC). Reading that study prompted me to purchase the official [PSP book](https://www.amazon.com/PSP-Self-Improvement-Process-Software-Engineers/dp/0321305493), and I've been a fan since that time.
*4 - A software defect can be something as simple as a comment incorrectly reflecting what the software is doing at that point. This often happens when a software engineer changes his/her mind about the best approach. It is innocuous in that it does not impact current functionality, but it can create issues in future updates.
*5 - The Change-Control Board is a body that decides which changes will be permitted as the software nears completion, because there is a need to stabilize or "lock down" the software so that testing and certification can be completed. The person who tipped me off was a new manager and a CCB member, but was too intimidated to take on the manager who had made the decree. I was never told specifically which senior manager had made the determination. Given the nature of the decision, the manager who made that call should have been reported and relieved of all safety-related management duties.