Ariane Launcher Case Study

This case study describes the accident that occurred on the initial launch of the Ariane 5 rocket, a launcher developed by the European Space Agency. The rocket exploded shortly after take-off and the subsequent enquiry showed that this was due to a fault in the software in the inertial navigation system.

In June 1996, the then-new Ariane 5 rocket was launched on its maiden flight, carrying a payload of scientific satellites. Ariane 5 was commercially very significant for the European Space Agency because it could carry a much heavier payload than the Ariane 4 series of launchers. Thirty-seven seconds into the flight, the inertial navigation system, whose software had been reused from Ariane 4, shut down, causing incorrect signals to be sent to the engines. These swivelled in such a way that uncontrollable stresses were placed on the rocket and it started to break up. The self-destruct system was triggered and the rocket and its payload were destroyed.

A subsequent enquiry showed that the cause of the failure was that the software in the inertial reference system had shut itself down because of an unhandled numeric exception (an integer overflow in a data conversion). There was a backup software system, but because it ran identical software rather than a diverse implementation, it failed in the same way.

Use of this case study in teaching

This case study illustrates issues with requirements specification, multi-organisational working, critical systems validation and some of the problems of software reuse. The example illustrates that good software engineering practice (reuse, don't introduce changes unless necessary) can have problems, and highlights the need for diversity as well as redundancy. It also shows the organisational complexity of systems development and how organisational issues can lead to systems failure. I have used it in conjunction with lectures on critical systems validation.

Supporting documents

My video explaining the causes of the Ariane 5 launch explosion

A video of the take-off and explosion after 37 seconds

The Ariane 5 failure. A PowerPoint overview of the system failure

Ariane 5 – Who Dunnit? A short article by a distinguished professor of software engineering discussing the complex causes of the failure

Ariane 5: Report of the post-accident enquiry (External link)

Ariane 5: A programming problem? (External link). An extended discussion of the Ariane 5 failure

Design by Contract: The Lessons of Ariane

by Jean-Marc Jézéquel, IRISA
and Bertrand Meyer, ISE

This article appeared in a slightly different form in Computer (IEEE), as part of the Object-Oriented department, in January of 1997 (vol. 30, no. 1, pages 129-130).


Keywords: Contracts, Ariane, Eiffel, reliable software, correctness, specification, reuse, reusability, Java, CORBA, IDL.

How not to test your software

Several earlier columns in IEEE Computer have emphasized the importance of Design by Contract for constructing reliable software. A $500-million software error provides a sobering reminder that this principle is not just a pleasant academic ideal.

On June 4, 1996, the maiden flight of the European Ariane 5 launcher crashed about 40 seconds after takeoff. Media reports indicated that the amount lost was half a billion dollars -- uninsured.

The CNES (French National Center for Space Studies) and the European Space Agency immediately appointed an international inquiry board, made up of respected experts from major European countries, who produced their report in hardly more than a month. These agencies are to be commended for the speed and openness with which they handled the disaster. The committee's report is available on the Web.

It is a remarkably short, simple, clear and forceful document.

Its conclusion: the explosion was the result of a software error -- possibly the costliest in history (at least in dollar terms, since earlier cases have caused loss of life).

Particularly vexing is the realization that the error came from a piece of the software that was not needed during the flight. It has to do with the Inertial Reference System, for which we will keep the acronym SRI used in the report, if only to avoid the unpleasant connotation that the reverse acronym could evoke for US readers. Before lift-off, certain computations are performed to align the SRI. Normally they should be stopped at -9 seconds, but in the unlikely event of a hold in the countdown, resetting the SRI could, at least in earlier versions of Ariane, take several hours; so the computation continues for 50 seconds after the start of flight mode -- well into the flight period. After takeoff, of course, this computation is useless; but in the Ariane 5 flight it caused an exception, which was not caught and -- boom.

The exception was due to a floating-point error: a conversion from a 64-bit floating-point number to a 16-bit signed integer, which should only have been applied to a number less than 2^15, was erroneously applied to a greater number, representing the "horizontal bias" of the flight. There was no explicit exception handler to catch the exception, so it followed the usual fate of uncaught exceptions and crashed the entire software, hence the on-board computers, hence the mission.
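
To make the failure mode concrete, here is a minimal sketch in Eiffel (the language of the example later in this article; the actual SRI code was in Ada), using the classic syntax of that example. It assumes EiffelBase's truncated_to_integer and to_integer_16 features; the class and feature names are hypothetical, invented for illustration:

    class OVERFLOW_DEMO

    feature

      unguarded_convert (horizontal_bias: REAL_64): INTEGER_16 is
          -- Narrow a 64-bit value to 16 bits with no range check.
          -- For a value of 2^15 or more this cannot succeed: with
          -- contract checking on, `to_integer_16' should report a
          -- precondition violation; with checking off, the result
          -- of the unchecked narrowing is simply wrong.
        do
          Result := horizontal_bias.truncated_to_integer.to_integer_16
        end

    end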

This is the kind of trivial error that we are all familiar with (raise your hand if you have never done anything of the sort), although fortunately the consequences are usually less expensive. How in the world can it have remained undetected, and produced such a horrendous outcome?

Is this incompetence?

No. Everything indicates that the software process was carefully organized and planned. The ESA's software people knew what they were doing and applied widely accepted industry practices.

Is it an outrageous software management problem?

No. Obviously something went wrong in the validation and verification process (otherwise there would be no story to write), and although the Inquiry Board makes a number of recommendations to improve the process, it is clear from its report that systematic documentation, validation and management procedures were in place.

The contention often made in the software engineering literature that most software problems are primarily management problems is not borne out here. The problem is technical. (Of course one can always argue that good management will spot technical problems early enough.)

Is it the programming language's fault?

Although one may criticize the Ada exception mechanism, it could have been used here to catch the exception. In fact, quoting the report:

Not all the conversions were protected because a maximum workload target of 80% had been set for the SRI computer. To determine the vulnerability of unprotected code, an analysis was performed on every operation which could give rise to an ... operand error. This led to protection being added to four of [seven] variables... in the Ada code. However, three of the variables were left unprotected.

In other words the potential problem of failed arithmetic conversions was recognized. Unfortunately, the fatal exception was among the three that were not monitored, not the four that were.
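
In the same sketch style as above, an explicitly protected conversion, in the spirit of the checks added for four of the seven variables, might look like this (names are again illustrative, not from the actual SRI code):

    guarded_convert (horizontal_bias: REAL_64): INTEGER_16 is
        -- Narrow to 16 bits only after an explicit range check.
      local
        t: INTEGER
      do
        t := horizontal_bias.truncated_to_integer
        if t > 32767 then
          Result := 32767
            -- Saturate at the bound; a real system might instead
            -- signal an alarm or switch to a degraded mode.
        elseif t < -32768 then
          Result := -32768
        else
          Result := t.to_integer_16
        end
      end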

Is it a design error?

Why was the exception not monitored? The analysis revealed that overflow (a horizontal bias not fitting in a 16-bit integer) could not occur. Was the analysis wrong? No! It was right -- for the Ariane 4 trajectory. For Ariane 5, with other trajectory parameters, it does not hold any more.

Is it an implementation error?

Although one may criticize the removal of a protection to achieve more performance (the 80% workload target), it was justified by the theoretical analysis. To engineer is to make compromises. If you have proved that a condition cannot happen, you are entitled not to check for it. If every program checked for all possible and impossible events, no useful instruction would ever get executed!

Is it a testing error?

Not really. Not surprisingly, the Inquiry Board's report recommends better testing procedures, and testing the whole system rather than parts of it (in the Ariane 5 case the SRI and the flight software were tested separately). But if one can test more, one cannot test all. Testing, we all know, can show the presence of errors, not their absence. And the only fully "realistic" test is to launch; this is what happened, although the launch was not really intended as a $500-million test of the software.

So what is it?

It is a reuse error. The SRI horizontal bias module was reused from 10-year-old software: the software of Ariane 4.

But this is not the full story:

It is a reuse specification error

The truly unacceptable part is the absence of any kind of precise specification associated with a reusable module.

The requirement that the horizontal bias should fit on 16 bits was in fact stated in an obscure part of a document. But in the code itself it was nowhere to be found!

From the principle of Design by Contract expounded by earlier columns, we know that any software element that has such a fundamental constraint should state it explicitly, as part of a mechanism present in the programming language, as in the Eiffel construct

    convert (horizontal_bias: INTEGER): INTEGER is
        require
            horizontal_bias <= Maximum_bias
        do
            ...
        ensure
            ...
        end

where the precondition states clearly and precisely what the input must satisfy to be acceptable.
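
To show how such a contract changes the caller's obligations, here is a hedged sketch of a call site; the feature names current_bias, aligned_bias and handle_out_of_range are hypothetical, introduced only for illustration:

    compute_alignment is
        -- Use `convert' only after establishing its precondition,
        -- so the 16-bit assumption is visible at the call site.
      do
        if current_bias <= Maximum_bias then
          aligned_bias := convert (current_bias)
        else
          handle_out_of_range (current_bias)
            -- Hypothetical recovery: saturate, raise an alarm,
            -- or switch to a degraded mode.
        end
      end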

Does this mean that the crash would automatically have been avoided had the mission used a language and method supporting built-in assertions and Design by Contract? Although it is always risky to draw such after-the-fact conclusions, the answer is probably yes:

  • Assertions (preconditions and postconditions in particular) can be automatically turned on during testing, through a simple compiler option. The error might have been caught then.
  • Assertions can remain turned on during execution, triggering an exception if violated. Given the performance constraints on such a mission, however, this would probably not have been the case. (A sketch of this run-time handling appears after this list.)
  • But most importantly the assertions are a prime component of the software and its documentation ("short form", produced automatically by tools). In an environment such as that of Ariane where there is so much emphasis on quality control and thorough validation of everything, they would be the QA team's primary focus of attention. Any team worth its salt would have checked systematically that every call satisfies the precondition. That would have immediately revealed that the Ariane 5 calling software did not meet the expectation of the Ariane 4 routines that it called.
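
To illustrate the second point above: with assertions left on in production, a violated precondition raises an exception that the caller can handle in a rescue clause. A minimal sketch, again in classic Eiffel syntax and with the same hypothetical feature names (enter_degraded_mode is likewise invented for illustration):

    alignment_step is
        -- Attempt the conversion; if an assertion is violated at
        -- run time, fall back instead of letting the exception
        -- propagate and crash the mission.
      local
        failed: BOOLEAN
      do
        if not failed then
          aligned_bias := convert (current_bias)
        else
          enter_degraded_mode
            -- Hypothetical fallback feature.
        end
      rescue
        if not failed then
          failed := True
          retry
        end
      end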

The lesson for every software developer

The Inquiry Board makes a number of recommendations with respect to improving the software process of the European Space Agency. Many are justified; some may be overkill; some may be very expensive to put in place. There is a simpler lesson to be learned from this unfortunate event:

From CORBA to C++ to Visual Basic to ActiveX to Java, the hype is on software components. The Ariane 5 blunder shows clearly that naïve hopes are doomed to produce results far worse than a traditional, reuse-less software process. To attempt to reuse software without Eiffel-like assertions is to invite failures of potentially disastrous consequences. The next time around, will it only be an empty payload, however expensive, or will it be human lives?

It is regrettable that this lesson has not been heeded by such recent designs as Java (which added insult to injury by removing the modest assert instruction of C!), IDL (the Interface Definition Language of CORBA, which is intended to foster large-scale reuse across networks, but fails to provide any semantic specification mechanism), Ada 95 and ActiveX.

For reuse to be effective, Design by Contract is a requirement. Without a precise specification attached to each reusable component -- precondition, postcondition, invariant -- no one can trust a supposedly reusable component.
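
To make that requirement concrete, here is a minimal sketch, in the classic Eiffel syntax used above, of a reusable component carrying all three kinds of specification just listed; the class and its features are hypothetical, invented for illustration:

    class BOUNDED_CONVERTER

    creation
      make

    feature -- Initialization

      make (bound: INTEGER) is
          -- Set up a converter accepting values up to `bound'.
        require
          bound_fits: bound <= 32767
        do
          maximum_bias := bound
        ensure
          bound_set: maximum_bias = bound
        end

    feature -- Access

      maximum_bias: INTEGER
          -- Largest value that `convert' accepts.

    feature -- Conversion

      convert (value: INTEGER): INTEGER is
          -- `value', guaranteed by contract to fit in 16 bits.
        require
          in_range: value <= maximum_bias
        do
          Result := value
        ensure
          unchanged: Result = value
        end

    invariant
      bound_fits_16_bits: maximum_bias <= 32767

    end

With a contract like this, an Ariane 4-style assumption (values fit in 16 bits) is part of the component's published interface rather than of an obscure part of a document.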
