Weird Software Problems
by Dennis L. Feucht
Innovatia Laboratories
In real-time embedded programming, one sometimes encounters really bizarre software faults. Knowing the possible causes of inexplicable embedded-system behavior can shave weeks off product development time. What makes a fault subtle is that when everything is checked in detail, there is no apparent cause for the misbehavior. This article describes causes from the author's experience that are bound to give you (the firmware programmer) some trouble, sooner or later.
Variable Range
A common but sometimes frustrating problem is caused by insufficient attention to variable values. In designing a six-step, sensorless motor-drive using the Atmel AT90S2313 µC, I failed to initialize the PHASE variable that represented the drive step. At turn-on, sometimes the motor would spin up nicely. At other times it would repeatedly send (out the serial port) the initialization character at the beginning of the MAIN routine, indicating multiple restarts. What was happening? I used the PHASE value as an index into a table, to jump to the routine that turns on the high- and low-side switches for the corresponding step. Because the PHASE value was not initialized, it was sometimes not within range (0 to 5, inclusive), and the program dethreaded ("hung up"). As it went out of control (with the watchdog timer disabled, so that couldn't be the cause) it would eventually run the start-up code, proceed normally, and dethread again. For a while, at first, I thought that Atmel might have goofed in designing the restart circuit. (Blame the hardware!) A clr PHASE instruction in the init routine solved that problem.
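A range check in front of the table lookup makes this kind of fault show up immediately instead of as a runaway program. The following is a minimal C sketch of the idea, not the original AVR code; the names drive_phase, drive_table, last_step and the stepN routines are hypothetical stand-ins for the six switch-driving routines.

```c
#include <stdint.h>

#define NUM_STEPS 6u   /* six-step drive: valid PHASE values are 0..5 */

typedef void (*step_fn)(void);

/* Hypothetical stand-ins for the six routines that set the switches.
   Each one just records which step was driven. */
uint8_t last_step = 0xFF;   /* 0xFF means "no step driven yet" */

static void step0(void) { last_step = 0; }
static void step1(void) { last_step = 1; }
static void step2(void) { last_step = 2; }
static void step3(void) { last_step = 3; }
static void step4(void) { last_step = 4; }
static void step5(void) { last_step = 5; }

static const step_fn drive_table[NUM_STEPS] = {
    step0, step1, step2, step3, step4, step5
};

/* Dispatch through the table only if phase is in range.
   Returns 1 if a step routine ran, 0 if phase was rejected. */
int drive_phase(uint8_t phase)
{
    if (phase >= NUM_STEPS) {
        return 0;   /* refuse to jump through a bad table index */
    }
    drive_table[phase]();
    return 1;
}
```

On a part as small as the AT90S2313 the check costs a compare and a branch. Whether to reject, clamp, or reset on a bad value is a design choice that depends on what is safe for the power stage.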
A related problem appeared that was somewhat more subtle. Sometimes it is necessary to run a routine that will properly initialize a value. I needed to set a value that represented which motor induced-voltage zero-crossings to look for, in order to advance to the next drive step. This variable, (SENSE), if not initialized, could sometimes take one of the 2 of 8 possible values (3 bits) that are not used for the six steps. These values are invalid and can be regarded as out of range for (SENSE). They would cause the motor-drive to play dead, looking for a combination that does not occur and consequently not advancing the drive to successive phases. In running the (SENSE) initialization routine, I had to make sure it was in the right place in the initialization sequence, so that (SENSE) was valid before any routine that used it was called. Initialization can be sequence-dependent.
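The ordering constraint can be made explicit in code. The C sketch below assumes, for illustration only, that the sense pattern is derived from the drive step by table lookup. The bit patterns in sense_for_phase are invented (the article does not list the real ones), and init_phase, init_sense and sense_is_valid are hypothetical names.

```c
#include <stdint.h>

#define NUM_STEPS 6u

/* ASSUMPTION: illustrative 3-bit patterns in bits 5-7, one per step.
   The two unused codes here are 0x00 and 0xE0. */
static const uint8_t sense_for_phase[NUM_STEPS] = {
    0x20, 0x60, 0x40, 0xC0, 0x80, 0xA0
};

uint8_t phase;   /* drive step, valid range 0..5 */
uint8_t sense;   /* zero-crossing pattern to look for next */

/* Returns 1 if s is one of the six patterns in use, otherwise 0. */
int sense_is_valid(uint8_t s)
{
    for (unsigned i = 0; i < NUM_STEPS; i++) {
        if (s == sense_for_phase[i]) {
            return 1;
        }
    }
    return 0;
}

void init_phase(void)
{
    phase = 0;
}

/* Must run after init_phase(), because it reads phase.
   Returns 1 on success, 0 if phase is out of range. */
int init_sense(void)
{
    if (phase >= NUM_STEPS) {
        return 0;
    }
    sense = sense_for_phase[phase];
    return 1;
}
```

Calling init_phase() and then init_sense() leaves sense holding a valid pattern. Calling init_sense() while phase holds garbage is reported rather than silently producing a pattern the motor never generates.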
Calls and Jumps
One subtle difficulty arises from apparently working code that is assumed fault-free because it works. I had code that spun a motor. Everything seemed to work. Then I exchanged the order of two subroutines in the source code, and the motor quit working. Bizarre! I looked at the disassembled code and there were no compiler faults. (Don't assume your compiler is bug-free until you've had extensive experience with it.) The two exchanged routines - call them B and C - were both called in sequence in a loop in routine A. In the source code, the sequence of the routines was: B, C, A. (In the listings that follow, B is advance-drive, C is advance-sense, and A is phase-control.) The calling routine, A, is simplified below, where register N is one of 32 general registers accessible by most instructions, and is used as a working register. (Backslashes begin comments, and ' returns the address of the named subroutine.)
CODE phase-control
in N , PINB \ Input phase sense bits: CB BA AC x x x x x
andi N , $E0 \ mask sense bits 5-7
cp N , (SENSE)
0= IF \ advance drive step if motor zero-crossing sensed
rcall ' advance-phase \ advance PHASE step
rcall ' advance-drive \ drive this PHASE step
rcall ' advance-sense \ update (SENSE)
THEN
ret
C;
This problem killed at least a week. How could merely exchanging the order of subroutines advance-drive and advance-sense, preceding their calling routine, phase-control, in the source code, affect anything? Then I looked more carefully at the second subroutine in the loop. It was the drive-step advance routine, simplified as this:
CODE advance-drive
\ Turn off drive before advancing step
{ some code here to do this }
\ write drive bits using PHASE to index into DRIVE table
\ calculate SRAM table address => Y reg
ldi YL , DRIVE #LO
ldi YH , DRIVE #HI
mov N , PHASE
lsl N \ byte to 16-bit index using left shift (x 2)
add YL , N
\ load contents of Y reg address into Z reg
ld ZL , Y+ \ post-increment Y
ld ZH , Y
ijmp \ assert drive bits for PHASE
C;
The address of one of the six driver routines, indexed by PHASE, is calculated (no indirect indexed addressing modes on these newer µC architectures!) and loaded into the Z register.
The final instruction in the advance-drive routine is the indirect jump (through Z) to one of six possible drive routines, each ending in a subroutine return instruction. The code is streamlined by making the final icall in advance-drive an ijmp instead, so that the jumped-to routine does the return for advance-drive. The listing above is the corrected routine. Before it was fixed, the ijmp was an indirect subroutine call (icall) with no ret instruction after it. Upon return to advance-drive, there was no intended code after the icall to execute, so execution fell through into whatever code came next in memory. With the original source order, that was advance-sense, just as it would have been had advance-drive returned properly to phase-control. The return in advance-sense then sent execution back to phase-control, at the rcall to advance-sense, so advance-sense ran a second time. This repeat execution does not of itself cause any fault symptoms. Consequently, all appeared to run well, but only because the sequence of called subroutines in phase-control was also the source-code sequence! Once the two routines were exchanged, advance-drive fell through into different code, and the motor quit.
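The fall-through can be reproduced on a toy machine. The C sketch below is a model of the simplified listings only, not of the real firmware: it lays out boot code, one driver routine, and the three routines in a chosen order, with advance-drive (DRIVE) deliberately ending in a call and no return. All names (run_layout, emit, put, the opcodes) are invented for the illustration, and advance-phase is left out for brevity.

```c
#include <stddef.h>

enum op { MARK, CALL, RET, HALT };
enum routine { DRIVER, DRIVE, SENSE, CONTROL, NUM_ROUTINES };

/* arg is a trace letter for MARK, or a routine id for CALL. */
struct insn { enum op op; int arg; };

#define MEM_SIZE    16
#define STACK_DEPTH 8
#define MAX_STEPS   200

static void put(struct insn *mem, int *n, enum op op, int arg)
{
    mem[*n].op = op;
    mem[*n].arg = arg;
    (*n)++;
}

/* Lay down one routine at address *n. DRIVE models the faulty
   advance-drive: it calls a driver and has no RET after the call. */
static void emit(struct insn *mem, int *n, enum routine r)
{
    switch (r) {
    case DRIVER:  put(mem, n, MARK, 'd'); put(mem, n, RET, 0); break;
    case DRIVE:   put(mem, n, MARK, 'D'); put(mem, n, CALL, DRIVER); break;
    case SENSE:   put(mem, n, MARK, 'S'); put(mem, n, RET, 0); break;
    case CONTROL: put(mem, n, MARK, 'C'); put(mem, n, CALL, DRIVE);
                  put(mem, n, CALL, SENSE); put(mem, n, RET, 0); break;
    default:      break;
    }
}

/* Place DRIVE, SENSE and CONTROL in the given source order and run
   one pass of CONTROL. Returns 0 on a clean halt, -1 on a runaway
   (stack overflow, stack underflow, or running off the code).
   trace (cap >= 1) receives one letter per routine body executed. */
int run_layout(const enum routine order[3], char *trace, size_t cap)
{
    struct insn mem[MEM_SIZE];
    int entry[NUM_ROUTINES] = { 0 };
    int stack[STACK_DEPTH];
    int n = 0, sp = 0, pc = 0;
    size_t t = 0;

    trace[0] = '\0';
    put(mem, &n, CALL, CONTROL);          /* boot code at address 0 */
    put(mem, &n, HALT, 0);
    entry[DRIVER] = n;
    emit(mem, &n, DRIVER);
    for (int i = 0; i < 3; i++) {
        entry[order[i]] = n;
        emit(mem, &n, order[i]);
    }

    for (int steps = 0; steps < MAX_STEPS; steps++) {
        if (pc >= n) return -1;           /* ran off the end of the code */
        struct insn in = mem[pc++];
        switch (in.op) {
        case MARK:
            if (t + 1 < cap) { trace[t++] = (char)in.arg; trace[t] = '\0'; }
            break;
        case CALL:
            if (sp == STACK_DEPTH) return -1;   /* stack overflow */
            stack[sp++] = pc;
            pc = entry[in.arg];
            break;
        case RET:
            if (sp == 0) return -1;             /* stack underflow */
            pc = stack[--sp];
            break;
        case HALT:
            return 0;
        }
    }
    return -1;
}
```

With the order DRIVE, SENSE, CONTROL the model halts cleanly and the trace shows SENSE executed twice, which is the harmless repeat described above. With SENSE, DRIVE, CONTROL the faulty routine falls through into CONTROL itself, which calls DRIVE again, and the return stack grows until it overflows. In this model that is one way the exchange could stop the motor; the actual firmware contained more code than the simplified listings show.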
Hardware versus Software
One of the difficulties of embedded-systems programming is in determining whether a fault lies in the hardware or software. Not uncommonly, if the fault is subtle enough, hardware designers conclude it must be in the software, and vice versa. Perhaps this is why the job title of hardware/software engineer arose a couple of decades ago. Somebody needs to understand the entire system well enough to find the causes for subtle problems which are the result of hidden interactions at the borders of technology disciplines. The most demanding but often necessary way to solve these problems is to understand the entire system in detail. Somebody must; if it's you, your contribution to the project is significantly enhanced, and your overlords might favor such initiative.
I was once brought into a surgical laser company to help find an endemic problem. The system used three microcomputers to run different subsystems: one for the front-panel user interface, one for the laser system, and a third, the master, which communicated with the serial port and commanded the other two. The system as a whole was unreliable at power-on, and sometimes did not start up. The problem was not hard to find: the three embedded computers needed to be sequenced in their start-up, and they were not. The individual computers did not have adequate handshaking at start-up to synchronize message passing with each other. Instead of developing multiprocessor, multitasking operating-system tools (such as semaphores), we solved the problem more easily by using a common hardware reset and adding suitable software delays to ensure successful sequencing.
On a related theme, power-off behavior can also cause problems. Both electronics and software engineers should add power-on and power-off behavior to their list of design considerations - especially if non-volatile memory is used. The memory's reset threshold must be at a higher voltage than that of the rest of the computer, to avoid spurious writes at power-on and power-off.
Have multiple pieces of hardware available for checking whether the problem is in the hardware. Sometimes the software will break the hardware (as can happen with µC-based motor-drives or power converters). Unless the failure event is observable (such as bits of MOSFET packaging dispersed about the room), the software engineer does not know whether the fault is in the software or the hardware. Much time can be wasted assuming that a subtle software fault exists when the problem is broken hardware. Hardware swapping should be considered a diagnostic method - unless the hardware is overwhelming, such as a chemical plant or an aircraft.
For really complicated systems, complexity can be handled using the same methods software engineers already use: Modularize the hardware in order to isolate problems to within a subsystem and to the interface to other subsystems. Additionally, if the system allows hierarchical layering of modules, all the better for managing complexity.
Closure
These are the kinds of faults which try programmers' souls. It is usually hard enough to get the code right for the application, but to be fooled by programming errors compounds embedded-systems development woes. However, once you overcome subtle faults, you tend not to forget them. And with advance warning about a few faults you might not have experienced yet, your programming efforts will hopefully be more productive and joyful.