The course has been iteratively refined over many years. In prehistoric times, courses may, from the students' viewpoint, have revolved around getting programs to run, with a consequent prevalence of horrible code and ad hoc changes. However, at least on this course, that has long been submerged by the aim of teaching structured program design. Problem solving gets as much time in lectures as programming language features; marks from very early in the course onwards are given as much for good modular design and documentation of high-level design (so-called "level 1 plans") as for getting the program to run. This is now so successful that it is several years since I have been asked to look at a program that is hard to understand because it is horribly structured or laid out. Current improvements in teaching are aimed at making testing and documentation as important as design and implementation. (These improvements are still being worked on: they are reflected in the marking scheme, but only partially in student practice, as they do not yet have their full share of lecture time and paper materials to support the students by precept and example.)
However, my experience suggests that these impressive teaching achievements have a downside: students currently show surprising deficiencies in the area of debugging. In earlier times, students acquired considerable skill at debugging because success largely depended on reacting to problems by modifying the program until it worked, i.e. on debugging. Although it was not taught directly, student success depended on this skill and they acquired it through practice. Now, it seems, they can be quite deficient at it. They are successfully taught top-down structured design, with the result (just as in the claims for this method) that many problems do not arise in the first place. Furthermore, because Pascal is a strongly typed language, most low-level errors are caught by the environment and the error messages point to them fairly clearly. When the occasional bug occurs that does not fall into this category, the students often seem helpless, unaware of even basic debugging ideas such as inserting tracing code to show how far the program gets before failing or what the values of variables are during a run, and with no idea how to go about discovering what an obscure error message really means.
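As a concrete illustration (my own sketch, not part of the course materials), tracing code of the kind meant here is no more than a few writeln calls reporting how far the program has got and what the key variables currently hold:

program TraceDemo;
var
  i, total: integer;
begin
  { Sketch only: the TRACE lines would be deleted once the bug is found. }
  total := 0;
  writeln('TRACE: starting main loop');
  for i := 1 to 10 do
    begin
      total := total + i * i;
      writeln('TRACE: i = ', i : 1, ', total = ', total : 1)
    end;
  writeln('TRACE: loop finished');
  writeln('Sum of squares 1..10 is ', total : 1)
end.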
In other words, "programming" in fact requires a diverse set of skills, reflecting the many component tasks of the overall activity of "doing programming". The environment modifies what the tasks and skills are, for instance by relieving the programmer of some jobs. In a teaching context, this can mean that students fail to learn some skills (possibly the most transferable and important ones). Well-designed teaching, too, has a considerable impact both on the skills that are learned and on the skills that are never exercised, because good practice removes much of the need for them. I suggest that we need to identify explicitly all the component skills a programmer should have, and to design teaching for each of these skills instead of merely expecting them to be acquired implicitly through practice of the overall activity of programming.
1. Lexical issues. These students have never seen an un-indented program (the environment indents automatically), or one with completely uninformative identifiers (they are not forced to work with other people's code, let alone with perverse naming systems). The only exercise I have provided is a printout of a working program with the formatting removed and the identifiers randomly renamed. This is probably not effective, as it involves little real activity (beyond having the students feed it to the THINK system and watch the indenting being recreated).
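To give a flavour of the exercise material (this fragment is invented for illustration, not the program actually handed out), the printout looks something like this:

program q; var a,b:integer; begin a:=1; b:=0;
while a<=100 do begin b:=b+a; a:=a+1 end;
writeln(b) end.

Feeding it to THINK recreates the indenting, but the identifiers stay meaningless; the real exercise is reconstructing what q, a and b are for.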
2. Learning what problems typically lie behind error messages. A major part of the expertise a tutor can bring to a student with a problem is knowledge of the kinds of bug associated with each error message. THINK's error messages are not particularly good from a programmer's viewpoint. For the first class of messages, the content can be safely ignored: the programmer just looks at the line indicated and tries to spot any obvious syntax violation. The second class of messages do describe the problem usefully, e.g. "x is not declared", and require no training. However, there is a third class where the message is in fact quite diagnostic to an expert, but not to a novice, and learning is required. For instance, "Insufficient stack space to invoke procedure or function" is caused by infinite recursion; "Your application zone is damaged" is caused by having procedures return huge values (e.g. large arrays: using a var parameter would avoid this); "Bus error" usually means an error in the use of pointers; and having the whole THINK application crash usually means that there was an array subscript error with bounds checking turned off.
The exercise tells students that they should, as part of learning each new Pascal topic, deliberately provoke errors in order to learn the error messages, and supports this with some example error-provoking code.
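For example (my own sketch of the kind of error-provoking code meant, not necessarily what the handout uses), the first message in the list above can be provoked quite reliably with a recursion that has no base case:

program ProvokeStackError;

  function Depth (n: integer): integer;
  begin
    { Deliberately no base case: the recursion never stops, so the stack
      eventually overflows and the run-time "Insufficient stack space to
      invoke procedure or function" message appears. }
    Depth := Depth(n + 1)
  end;

begin
  writeln(Depth(1))
end.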
3. Finding bugs by reading. Although these students are sometimes required to read code, they do not seem to take this seriously as a debugging method. In the days of slow batch compiling, reading printouts of code was an important activity; nowadays we probably have to teach it. The exercise requires them to find some bugs by reading only. What we have to realise, and then teach the students, is that some bugs are easy to find by reading and it is inefficient to look for them in other ways, while other bugs are hard to find this way and other techniques (e.g. testing and using debugging tools) are better. There are in fact two skills here: a) spotting errors that do not require understanding of the program, e.g. missing declarations, redeclarations of the same identifier, code that can never be executed; b) predicting what the code must do, and realising that this is not (cannot be) what is wanted, e.g. looking at a set of conditions (say, in a multi-branch if statement) and deciding whether cases overlap or are missed.
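A small invented example of the second skill: reading the conditions below shows that a mark of exactly 50 satisfies none of them, so for that input grade is never assigned (and the upper bound in the second test is redundant under the else), none of which any compiler message will point out.

program GradeGap;
var
  mark: integer;
  grade: char;
begin
  readln(mark);
  if mark >= 70 then
    grade := 'A'
  else if (mark > 50) and (mark < 70) then
    grade := 'B'
  else if mark < 50 then
    grade := 'C';
  writeln('Grade: ', grade)
end.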
4. Debugging tools. We seem to need to train students in using the debugger. They do not teach themselves, even though they need it at times, and even though I had no trouble teaching myself to use it without documentation. Probably what they lack is an idea of what would be useful if it were provided, and of what a modern debugger is likely to provide: e.g. inspecting the value of any data location at the point where the program stopped, examining the calling sequence, or telling the program to halt at some specific point in the code. The exercise gives them a tiny basic manual for operating the debugger, and some buggy programs whose bugs they must find.
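The buggy programs are of roughly this kind (again an invented example, not one of the actual exercise programs): the printed average is wrong, and a breakpoint on the final writeln plus inspection of sum and count shows at once that count has ended up as 6 rather than 5.

program AverageBug;
var
  data: array[1..5] of integer;
  i, count, sum: integer;
begin
  for i := 1 to 5 do
    data[i] := i * 10;            { 10, 20, 30, 40, 50 }
  sum := 0;
  count := 0;
  i := 1;
  while i <= 5 do
    begin
      sum := sum + data[i];
      i := i + 1;
      count := i                  { bug: copies the already-incremented i }
    end;
  writeln('average = ', sum div count : 1)
end.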
5. Spotting symptoms in input/output samples. As a prerequisite for doing testing, students have to be able to notice that a program's output is not in fact correct. Many students do not notice: they behave as if anything that does not provoke an error message must be running correctly. The exercise is reading again: reading not code but printouts of input/output samples. For instance, the first exercise says "The program reads in a list of names, sorts them, and prints them out with some processing", gives a sample run, and asks the students to spot that one of the input data lines has been omitted from the output (a typical bug in a sorting routine).
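The kind of sorting bug I have in mind looks like this (an invented sketch, assuming a Pascal dialect with a built-in string type, as in THINK Pascal): the value being inserted is not saved before the shift, so the first shift overwrites it; one input name vanishes from the output and a neighbouring one appears twice. The students see only the input and output printouts, not this source.

program SortNames;
const
  n = 5;
var
  names: array[1..n] of string;
  i, j: integer;
begin
  for i := 1 to n do
    readln(names[i]);
  for i := 2 to n do
    begin
      j := i;
      while (j > 1) and (names[j - 1] > names[i]) do
        begin
          names[j] := names[j - 1];   { the first shift destroys names[i] }
          j := j - 1
        end;
      names[j] := names[i]
    end;
  for i := 1 to n do
    writeln(names[i])
end.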
6. Diagnostic testing: the skill of designing input data to explore and diagnose a bug given a preliminary symptom. Our students are now required to perform tests and submit documentation of this, but it is clear that they do not expect the testing to discover any bugs. Consequently their tests look plausible but lack real diagnostic power. The exercise gives them executable programs (no source: true black box), a brief statement of the program's function, a vague description of a problem and the original test data giving rise to suspicion, and the task of designing test data to refute or sharpen up the suspicion.
7. Black box tests. Given an executable program file, and a brief description of its function (built into the program and displayed on each run), the student must design a set of tests to discover what problems if any it has.
Berry & Broadbent were mainly concerned with what strategies people used, and how they could be trained in the optimum (binary split) strategy. In fact, people are in many cases very resistant to using the optimum strategy, even when given direct training. This failure to use the best strategy seems to be due to the layout of the table given to subjects, which in their experiments consisted of a list of factories and, against each factory name, a list of pollutants.
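As a rough sketch of what a binary-split strategy amounts to in a task of this shape (my own reconstruction, not taken from Berry & Broadbent's materials): suppose one factory is responsible, each water test reveals whether a given pollutant is present, and the table records which factory produces which pollutant. The best next test is then the pollutant produced by as close to half of the remaining candidate factories as possible, since either answer roughly halves the candidates.

program BinarySplitSketch;
const
  nFactories = 8;
  nPollutants = 6;
var
  produces: array[1..nFactories, 1..nPollutants] of boolean;
  candidate: array[1..nFactories] of boolean;
  f, p, count, remaining, best, bestScore, score: integer;
begin
  { Arbitrary stand-in data; in the experiments this comes from the table. }
  for f := 1 to nFactories do
    begin
      candidate[f] := true;
      for p := 1 to nPollutants do
        produces[f, p] := odd(f div p)
    end;
  remaining := nFactories;
  best := 1;
  bestScore := nFactories + 1;
  for p := 1 to nPollutants do
    begin
      count := 0;
      for f := 1 to nFactories do
        if candidate[f] and produces[f, p] then
          count := count + 1;
      score := abs(2 * count - remaining);   { 0 would be an exact half split }
      if score < bestScore then
        begin
          bestScore := score;
          best := p
        end
    end;
  writeln('Most informative next test: pollutant ', best : 1)
end.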
Gilmore (1991) ran variations on these experiments. His purpose was to analyse an apparent cognitive dimension of "visibility" into three dimensions, which he named accessibility, salience, and congruence. He compared four table layouts by varying a) whether the tables gave factories first then pollutants against factories, or vice versa; b) whether the secondary properties (e.g. pollutants) were given as a list or in a grid so that a reader could easily scan for all the primary instances (e.g. factories) that shared a given property (e.g. pollutant). Gilmore showed that:
a) Different table formats vary the difficulty of carrying out any given method; and conversely the usefulness of a format depends on the method used.
b) Different table formats vary the difficulty of the task (i.e. of the best method for the task, given the format).
c) The method chosen by the user depends on the task but also on the user.
d) The method chosen by the user, and whether they choose the optimum procedure, is affected by another property of the format ("salience"), largely independent of features determining difficulty. I.e. what procedure seems obvious to users is also, but independently, influenced by table format, and this is often independent of any explicit training given to subjects.
These tables are in effect a visual notation for supporting a task. The format of these tables, then, can be varied in a number of ways: by which of the two entities (factories or pollutants) is primary; by whether lists or a 2D grid layout is used (i.e. whether columns are meaningful); by whether each of the dimensions is in random order, alphabetic order, or some other ordering. Berry & Broadbent fixed on one format and studied how users could choose a method for the task given the format. Gilmore compared formats, showing effects on the choice of method and on the effectiveness of a chosen method, and hence on task performance. However, it is interesting to consider an alternative task: not how to choose each test for pollutants in turn, nor how to choose a method for that task, but how to make reformatting choices for the table in order to make the task easier, i.e. the corresponding task of selecting a visual notation.
In the talk, I will illustrate some of these alternative formats, and also demonstrate (by asking the audience to suggest modifications to the current format) that we are actually poor at choosing a better or optimal format for the task.
In brief: the most important discoveries from testing are the unexpected issues that are obvious to a human observer, and that correspond to missing requirements. However, not only are these not deliberately sought, but they threaten software engineering as a rational enterprise: how could anybody plan rationally to discover the unexpected?
In fact this is not special to software engineering. Petroski's books argue that civil engineering, for instance, progresses in part by learning from disasters which mainly reflect, not negligence, but learning the hard way about new requirements that were always implicit and automatically satisfied, until old parameter ranges were exceeded and they emerged into significance. A simple example: if you build bridges out of stone, you need never worry about side winds, because by the time you have satisfied the requirement to carry the vertical load, the structure is too massive to be affected by wind. With modern steel bridges this is not the case, as was discovered when the Tacoma Narrows bridge disintegrated. Since all designs depend on an infinite number of requirements, these can never all be written down and checked (the bridge, or the software, must work at all phases of the moon, when the operator drinks tea, if someone speaks Chinese nearby, if the wind blows at 47 m.p.h., if the wind blows at 41 m.p.h. which just happens to excite the resonant frequency of the artifact, ...). All rational design can do is write down the requirements that previous experience suggests may not be automatically satisfied; but how do you guard against the unexpected, against a new requirement becoming important for the first time in this case? You build the artifact and you try it out, i.e. test it. If nothing undesirable happens, then it is probably OK. But you cannot be sure (perhaps the side wind did not blow on the day of the test), and you cannot design these tests by considering the explicit requirements and specs, because what you need to detect is precisely the issues missing from those lists.
Examples in programming might be noticing that the software runs too slowly, that when an error message appears it obscures the display it refers to, or that the most common user error in selecting an input file is that the same file is still open in another editor, and that special support for this should be provided. Any problem, once identified, can become part of a standard set of requirements to be applied in future to most or all projects; and in principle this should happen. However, firstly, there must be a first time the issue occurs; and secondly, in practice such requirements often are not written out, but are rediscovered by programmers during testing. This rediscovery is NOT because programmers explicitly foresee the possible error and then test for it. Rather, they "just notice" what the problem is when they run the program.
Can we think about testing rationally? Programmers are usually taught about black box and white box testing. Actually these concepts are undermined by similar issues. Black box and white box testing are really the same in that they both depend on (possibly unjustified) assumptions about the device in order to let a few tests stand in for the huge number that would really be needed to be exhaustive. Black box tests typically assume that if inputs and outputs are, in mathematics, continuous ranges of values then the implementation will be smooth in some way (so only a few values need be tested). This of course may be wrong, e.g. if lookup tables are used.
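A small invented illustration of the lookup-table point: the routine below is supposed to print n squared for 1 <= n <= 10, but is implemented as a table with one wrong entry. Black box tests at a few "representative" values (say 1, 5 and 10) all pass, because the smoothness the tester is relying on simply is not there.

program LookupTableTrap;
var
  squares: array[1..10] of integer;
  n: integer;

  procedure FillTable;
  var
    i: integer;
  begin
    for i := 1 to 10 do
      squares[i] := i * i;
    squares[7] := 47                  { the one wrong entry }
  end;

begin
  FillTable;
  for n := 1 to 10 do
    writeln(n : 2, ' squared = ', squares[n] : 4)
end.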
Basically most testing is for foreseen errors. This requires two theories: one of how the device should work, the other of how errors are generated. Together they predict how to detect errors. White box theories explicitly use a theory of how a device works, and black box theories implicitly use such a theory in assumptions about the nature of the input and output functions. What kinds of theory do we have about how errors / bugs are generated? Testing checks on uncertainty in a product's properties. There are three broad sources of uncertainty:
1) Unreliable elements in the products, e.g. metal fatigue, operator error in human-machine systems. In principle this can be treated statistically.
2) Errors in executing the design and engineering process: where knowledge is adequate, but (human) execution of design and production is faulty, e.g. coding slips, a program that does not meet its requirements. Possibly, but questionably, this could be modelled statistically, depending on how good a statistical model of the generation of human errors in intellectual tasks we can build.
3) Uncertainties and inadequacies in the design method itself. In particular, the absence of any method to guarantee that all relevant requirements are identified and written down.
How do we test for the unforeseen, for missing requirements? The best approach may simply be to see the thing working. Hence in practice, as opposed to theory, the first-time test may have a special importance, which would explain why engineers, including programmers, always want to test their creations out, usually informally, "just to see if it works".
Seeing a version of your program working for the first time is a heartening sight. More and more, one sees justifications of development methods that allow this to happen early on, e.g. by writing stubs for all parts. There is a sort of reason for this, in that as soon as it runs, however stubby, you can use the program's output and behaviour as an additional source of information. But behind that is a reason something like this: if the program runs at all, then the logical AND of a very large number of required properties must be (or is very likely to be) true, so a huge leap in certainty is made in one test. (In contrast, later more systematic tests are mainly addressed at showing that those assertions hold across their whole scope: that the program goes on working as various parameters are pushed to the limits of their ranges.) But a further reason lurks there: not only must the AND of the explicit requirements be true, but so must the AND of all the infinitely many implicit requirements. In other words, this first success is also a test of the most uncertain aspect of the whole design process.
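By "stubs for all parts" I mean something like the following sketch (invented here, not drawn from any particular method): every module exists from the start as a stub that just announces itself, so the whole program runs end to end on the first day and each stub is filled in later.

program StubSkeleton;

  procedure ReadData;
  begin
    writeln('ReadData: stub, nothing read yet')
  end;

  procedure ProcessData;
  begin
    writeln('ProcessData: stub, nothing processed yet')
  end;

  procedure ReportResults;
  begin
    writeln('ReportResults: stub, nothing to report yet')
  end;

begin
  { Even at this stage the whole structure runs and can be observed. }
  ReadData;
  ProcessData;
  ReportResults
end.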
Thus the most informative test is probably the first informal one -- the one done intuitively by all programmers, but seldom mentioned in "methods".
The problem with the Hubble space telescope would of course have been detected if only they had done this test: any test of the whole assembled telescope. It was not a random problem, one of uncertainties in materials. It was a problem (of type 2 in the list above) within a module (shaping the main mirror) that affected implementation and testing equally. It would have been detected by a whole-system test, because that would have shown what that module looked like to other modules.
I have constructed an argument that the first informal test of an artifact, for instance a program, has a special importance. Is this true? It suggests a research project, which I have not done: keep a record of when a given programmer discovers each bug/problem, and so discover a) the proportion that were discovered during informal rather than formal tests, and b) the proportion that could not have been discovered in formal tests because they were bugs in the requirements or specs, not in the implementation.
Berry, D.C. & Broadbent, D.E. (1990) "The role of instruction and verbalization in improving performance on complex search tasks", Behaviour and Information Technology, vol. 9, pp. 175-190.
Gilmore, D.J. (1991) "Visibility: a dimensional analysis", in People and Computers VI: Usability Now! (Proceedings of HCI'91), eds. D. Diaper & N. Hammond, pp. 317-329 (Cambridge University Press: Cambridge).
Petroski, H. (1982) To Engineer is Human: The Role of Failure in Successful Design (Macmillan: London).