
Towards Assuring Non-recurrence of Faults Leading to Transaction Outages – an Experiment with Stable Business Applications

Anushri Agrawal

Ravindra Naik

Tata Research Development and Design Centre 54 B Hadapsar Industrial Estate Hadapsar, Pune -13 +91-20-6608-6375

Tata Research Development and Design Centre 54 B Hadapsar Industrial Estate Hadapsar, Pune -13 +91-20-6608-6336

[email protected]

[email protected]

ABSTRACT
Reducing the cost and effort of maintaining a legacy business application is a major challenge faced by the industry. Capturing faults at an early stage of the software life cycle is known to prevent certain kinds of defects in production – we exploit this aspect to detect a class of software faults introduced while making changes to stable business applications. We discuss the results of an experiment with stable, back-office COBOL applications of a core banking system. With changes, the stable system exhibits frequent transaction outages (among other defects). We observed that the most critical outages were due to a few common causes, which were extremely difficult to identify by testing. We present our analysis and the results of a successful attempt to detect the causes automatically, using structural analysis and control flow analysis techniques.

Categories and Subject Descriptors D.3.3 [Programming Languages]: COBOL

General Terms Experimentation, Languages, Reliability

Keywords Fault Detection, Program analysis, Maintenance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISEC’11, February 23–27, 2011, Thiruvananthapuram, Kerala, India. Copyright © 2011 ACM 978-1-4503-0559-4/11/02…$10.00

1. INTRODUCTION
Generally, a software system that has been used in production for more than a decade is considered stable. However, if business requirements force it to undergo several changes, with a number of further changes pending implementation, its stability comes into question. One such system, a core banking system that processes over 1 million transactions every day and has been in use for several years, gave rise to some business concerns. A few of them were:
• Repeated failure of business transactions, called transaction outages
• Certain defects reported from production were not easily reproducible
• Longer turn-around time to make changes, consequently limiting the number of changes delivered per release

We conducted a series of interactions with the director who manages the system, and with a few senior sub-system owners, to narrow down on a sub-system that was considered notorious. A detailed analysis of the transaction outages revealed that the root causes of the symptoms were certain programming anomalies that are specific to COBOL, which is the programming platform used by the system under study. Some others were the result of the development habit of copy-paste. These observations were confirmed by the owners and developers of the modules. A detailed causal analysis of the transaction outages is outlined in Section 3.

Based on the root-cause analysis of the defects and their code fixes, and with the help of the development team, we initiated an action plan to automatically detect the code anomalies, that is, the faults. As the development team started using the automatic detection of the first set of faults, automation for detecting new faults would follow, thus ensuring the practical usefulness of the automation.

Section 2 discusses related work. An analysis of the defect data is presented in Section 3. We describe the faults and the automated detection tool in Section 4. Results of the automatic detection of faults are presented in Section 5, which is followed by the concluding section. The terms Failure, Defect, and Fault are adopted as per the definitions in [1], wherein a Fault is the defect in the system or code which causes Failure(s) upon execution.

2. RELATED WORK
Static analysis is used for defect detection by many tools. These tools are mostly for the Java, C and C++ languages, some of the prominent ones being those discussed in [2], [3]. For COBOL, a few tools [4], [5] are available, which are standards checking or coding guideline checking tools. Our work is aimed at detecting faults which are hard to spot by testing and can potentially result in transaction outages. COBOL minefield detection [6] investigates the behavior of PERFORM statements and focuses on restructuring the program to remove error-prone code. Our tool detects all out-of-range GOTO statements and also reports the potential fall-through chains. Our work helps to detect other faults like parameter mismatches, and work is on-going for un-initialized variable usage.

3. DEFECT DATA – A QUALITATIVE ANALYSIS
Typical failures observed in the application under study are:

Scenario 1 - Failed transaction
• Transaction: Converting the amount from foreign currency to Indian currency
• Defect: Frequent transaction outage
• Analysis for re-producing the defect: Transaction outage would occur only if the transaction is fired after 6pm
• Root cause / Fault: Improper GOTO statement due to copy-paste under a condition that checks for 6pm

Scenario 2 - Incorrect data
• Transaction: Closure of Recurring Deposit Account
• Defect: Wrong entries in the Bank Ledger account
• Analysis: If different branches attempt to close different recurring deposit accounts at the same time instance, values of fields from one transaction are used in another transaction
• Root cause / Fault: Un-initialized variable

The observed panorama of the transaction failures and the defects reported in production is as below:
• The average number of transaction outages per month is 5-8, but for frequently occurring transactions the number of outages can range from 500 to 800.
• The main challenge is in replicating the failures: situations that occur in the production environment typically do not occur in the test or development environments.
• Another challenge is the analysis of the data, values and transactions around the defect to identify the root cause.
• Expert and knowledgeable persons spend around 1 to 15 work-days in re-producing and analyzing one defect.

An analysis of the root causes leading to transaction outages points to various reasons, as illustrated in Figure 1. Clearly, the out-of-range GOTO and the call parameter mismatch are the prime reasons and are the basic faults that need to be detected.

[Pie chart: Out-of-range GOTOs 50%; Call parameter mismatch 20%; Call program name not unique 10%; Array subscript out-of-range 10%; Others 10%]

Figure 1: Reasons for transaction outages

4. FAULTS and AUTOMATED DETECTION
4.1 Faults
Based on the causal analysis of defects, the following are the commonly seen faults that lead to transaction outages.

4.1.1 Fall-through and Out-of-range GOTOs
In COBOL, the term "fall-through logic" implies that upon reaching the end of a paragraph, the execution falls through to the next paragraph and continues, potentially till the end of the program. Fall-through logic can occur in multiple ways:
• Using a PERFORM THRU statement (fall-through is limited to the range of paragraphs performed)
• Using a GOTO statement (fall-through can be to the end of the program)
• Execution of the code in the PROCEDURE DIVISION (or immediately after an ENTRY statement) falls through to the next paragraph, unless it is terminated by a STOP RUN or GOBACK statement.

Usage of GOTO statements requires extreme caution [7], as they can lead to inadvertent execution resulting in incorrect calculations, or in infinite cycles in the execution leading to transaction outages. The use of GOTO is considered harmful [8] and makes program understanding difficult: at the end of a paragraph, it depends on the context whether execution will return to the PERFORM THRU statement or continue with the next paragraph. Introducing a new piece of code for a feature change also becomes expensive for the developer. However, the use of GOTO statements that transfer control within the limited range of paragraphs (specified by PERFORM THRU) does not result in fall-through, and is an acceptable practice, especially for exception handling.

This analysis led us to define the concept of out-of-range GOTO statements, each of which is an instance of a fault that can lead to potential defects. The out-of-range GOTO statements are defined as:
• A GOTO statement that is not contained in a PERFORM-THRU range
• A GOTO statement that is contained in a PERFORM-THRU range and executes a paragraph that is not in the same range.

The reasoning made for paragraphs is also applicable to the sections of COBOL.
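The two-part definition of an out-of-range GOTO can be sketched as a check over paragraph positions. The following Python fragment is only an illustration under stated assumptions, not the tool's algorithm (which the paper does not describe): the names `Goto` and `out_of_range_gotos` are hypothetical, paragraphs are identified by name in source order, and when a GOTO lies inside several PERFORM THRU ranges it is flagged if its target leaves any of them. The input encodes the paragraph layout of the sample program in Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goto:
    source: str  # paragraph that contains the GO TO (hypothetical model)
    target: str  # paragraph named in the GO TO


def out_of_range_gotos(paragraphs, perform_ranges, gotos):
    """Return (goto, reason) pairs for GOTOs matching the two-part definition."""
    index = {name: i for i, name in enumerate(paragraphs)}
    spans = [(index[first], index[last]) for first, last in perform_ranges]
    faults = []
    for g in gotos:
        src, tgt = index[g.source], index[g.target]
        enclosing = [(lo, hi) for lo, hi in spans if lo <= src <= hi]
        if not enclosing:
            # Case 1: the GOTO is not contained in any PERFORM THRU range.
            faults.append((g, "not inside any PERFORM THRU range"))
        elif any(not (lo <= tgt <= hi) for lo, hi in enclosing):
            # Case 2: the GOTO leaves a range that contains it (assumed strict rule).
            faults.append((g, "target outside the enclosing PERFORM THRU range"))
    return faults


# Paragraph order, PERFORM THRU ranges and GOTOs of the Figure 2 sample program.
PARAGRAPHS = ["MAIN-PARA", "MAIN-EXIT", "A01", "A09", "B01", "B09",
              "C01", "C09", "D-PARA", "D-EXIT"]
RANGES = [("A01", "A09"), ("C01", "C09")]
GOTOS = [Goto("C01", "B01"), Goto("C01", "C09")]

faults = out_of_range_gotos(PARAGRAPHS, RANGES, GOTOS)
for g, reason in faults:
    print(f"GO TO {g.target} in {g.source}: {reason}")
```

Under these assumptions only the GO TO B01 inside C01 is reported, since B01 lies outside the C01 THRU C09 range, while GO TO C09 stays within it.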

4.1.2 Dead Paragraph/Section
COBOL compilers are not known to be great optimizing compilers, and do not usually detect unreachable code in the binary. As applications evolve, old functionality is replaced by new, and some of the old features are dropped. It is observed that developers usually leave the old code (paragraphs and sections) in the program, though it is never executed, at least intentionally. However, in the presence of fall-through, the dead paragraphs and sections are executed, and result in improper initializations, wrong calculations or infinite cycles. The presence of dead paragraphs/sections also makes program understanding difficult, as the dead code is not easily identifiable. An added advantage of removing the dead paragraphs from the code is the reduction in the size of the load module and its consequently faster loading into memory. Figures 2 and 3 show an illustration of fall-through, out-of-range GOTOs and dead paragraphs.
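One way to picture dead-paragraph detection is as reachability over a paragraph-level graph that includes fall-through edges. The Python sketch below is a simplified illustration, not the authors' analysis: the function name `dead_paragraphs` and its inputs are hypothetical, PERFORM is modelled as an edge to the first paragraph of its range, and the last paragraph of a PERFORM THRU range is assumed to return rather than fall through. The data again encodes the sample program of Figure 2.

```python
from collections import deque


def dead_paragraphs(paragraphs, transfers, no_fall_through, entry):
    """Return paragraphs not reachable from `entry`.

    paragraphs      -- paragraph names in source order
    transfers       -- paragraph -> explicit targets (GO TO, or first paragraph
                       of a PERFORM THRU range)
    no_fall_through -- paragraphs whose end does not continue into the next one
                       (STOP RUN, unconditional GO TO, or assumed range end)
    """
    successors = {p: list(transfers.get(p, [])) for p in paragraphs}
    for current, following in zip(paragraphs, paragraphs[1:]):
        if current not in no_fall_through:
            successors[current].append(following)  # fall-through edge

    seen = {entry}
    queue = deque([entry])
    while queue:  # breadth-first reachability
        for nxt in successors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [p for p in paragraphs if p not in seen]


PARAGRAPHS = ["MAIN-PARA", "MAIN-EXIT", "A01", "A09", "B01", "B09",
              "C01", "C09", "D-PARA", "D-EXIT"]
TRANSFERS = {"MAIN-PARA": ["A01"], "A01": ["C01"], "C01": ["B01", "C09"]}
# MAIN-EXIT ends in STOP RUN; C01 always leaves via GO TO; A09 and C09 end a
# PERFORM THRU range and are assumed to return to the PERFORM.
NO_FALL_THROUGH = {"MAIN-EXIT", "A09", "C01", "C09"}

dead = dead_paragraphs(PARAGRAPHS, TRANSFERS, NO_FALL_THROUGH, "MAIN-PARA")
print(dead)
```

With this model B01 and B09 count as reachable only because of the out-of-range GOTO and the fall-through that follows it, while D-PARA and D-EXIT are reported as dead, in line with the unreachable code marked in Figure 3.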

 1. PROCEDURE DIVISION.
 2. MAIN-PARA.
 3.     PERFORM A01 THRU A09.
 4.     DISPLAY "Done PERFORM".
 5. MAIN-EXIT.
 6.     STOP RUN.
 7. A01.
 8.     PERFORM C01 THRU C09.
 9. A09.
10.     EXIT.
11. B01.
12.     DISPLAY "FALL THRU".
13. B09.
14.     EXIT.
15. C01.
16.     IF FLAG
17.         GO TO B01              <-- Out-of-range GOTO
18.     ELSE
19.         GO TO C09.
20. C09.
21.     EXIT.
22. D-PARA.                        <-- Dead Paragraphs (lines 22-25)
23.     DISPLAY "DEAD PARA".
24. D-EXIT.
25.     EXIT.

Figure 2: Sample COBOL Program

4.1.3 Mismatched Actual and Formal Parameters
When a call is made, it is expected that the actual parameters match the formal parameters according to the following criteria:
• The number of formal and actual parameters must be equal
• The sizes of the respective actual and formal parameters must be the same.

While the absence of type matching provides programming flexibility, it can result in unintended misuse. Moreover, the COBOL compiler does not report any mismatch against the above criteria. During execution, such a mismatch may trigger an exception, which eventually results in a transaction outage or an incorrect value in calculations.
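The two matching criteria for call parameters can be illustrated with a small check. This Python sketch is an assumption-laden illustration rather than the tool's implementation: the `Param` type, the function `parameter_mismatches`, and all parameter names and sizes in the example are hypothetical. The mismatch labels "Total Parameters" and "Size" follow the mismatch types used in the paper's detailed report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    name: str  # data item name (hypothetical)
    size: int  # storage size in bytes (hypothetical)


def parameter_mismatches(actuals, formals):
    """Compare the parameters of one CALL with the callee's formal parameters."""
    if len(actuals) != len(formals):
        # Criterion 1: the number of actual and formal parameters must be equal.
        return [("Total Parameters",
                 f"passed {len(actuals)}, expected {len(formals)}")]
    # Criterion 2: sizes of corresponding parameters must be the same.
    return [
        ("Size", f"{a.name}: {a.size} bytes passed, {f.name} expects {f.size}")
        for a, f in zip(actuals, formals)
        if a.size != f.size
    ]


# Hypothetical callee expecting one 200-byte record.
formals = [Param("LK-ACCOUNT-REC", 200)]

no_args = parameter_mismatches([], formals)
wrong_size = parameter_mismatches([Param("WS-ACCOUNT-REC", 180)], formals)
matching = parameter_mismatches([Param("WS-ACCOUNT-REC", 200)], formals)

print(no_args)
print(wrong_size)
print(matching)
```

Types are deliberately not compared, mirroring the text's point that COBOL does not require type matching between actual and formal parameters.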

4.2 Automated Detection
Each of the faults described in the previous section is detected using program analysis techniques, in particular structural analysis and control flow graphs. The algorithms to detect undesired fall-through, out-of-range GOTO statements, dead paragraphs or sections, and mismatches of actual and formal parameters make use of the information generated by the TCS proprietary program analysis workbench [9]. While describing the algorithms is out of scope for this paper, they are designed to make effective use of the language-independent static program analysis and the language-independent internal representation, thus paving the way for their possible use in detecting errors in programs written in other programming languages too.

The tool consists of the COBOL parser (which translates the input program into a language-independent internal representation), the program analyzers (which build the control flow graph and compute data flow information), and the fault detector, which makes use of the analysis information and traverses the internal representation to identify specific faults. Fault detection is treated as a pattern matching activity that makes use of the control flow and structural associations of the Abstract Syntax Tree (AST). One interesting observation is that the automated detection of the faults of interest results in very few false positives. The execution of the automated tool takes, on average, 30 seconds per 10000 lines of COBOL code, which is a practically acceptable timeframe.

[Control-flow diagram whose nodes are labelled with the line numbers of Figure 2 (3, 4, 6, 8, 10, 12, 14, 16, 17, 19, 21, 23, 25). Line 3 is marked PT1 and line 8 is marked PT2; the diagram also marks "End of PT1", "End of PT2", the "Fallthru" edges, the fall-through chain, and lines 23 and 25 as unreachable code.]

Figure 3: Flow of sample COBOL program

5. RESULTS
The fault detection was carried out initially on a set of programs chosen by the development team. The team wanted to verify the results using programs that they were familiar with. The results for a few programs are shown below. As seen in Table 1, large programs have a large number of out-of-range GOTOs triggering undesired fall-throughs.

Table 1: Fault report for a few large programs

Program Name   #Paragraphs   Out-of-range GOTOs   Fallthrough chains   Dead Paragraphs
PGM000         1038          23                   16                   2
AN0000         1226          76                   49                   0
BA0000         823           39                   25                   15
CR0000         658           20                   13                   0
DT0000         68            0                    1                    43
EI0000         758           183                  48                   4

Tables 2.1 and 2.2 illustrate call parameter mismatch results for a few programs of the PGM module. A summary for the entire module and detailed mismatches for each program are shown:

Table 2.1: Summary for Parameter Mismatch for PGM000

Called Programs        PGM023   PGM024   PGM022   PGM088   PGM046
Number of mismatches   8        7        3        2        1

Table 2.2: Detailed Report for PGM022

Calling Prog.           PGM000             PGM000             PGM000
Call at line no.        8424               8530               11494
Mismatch Type           Total Parameters   Total Parameters   Size
Parameters Passed       0                  0                  1
Mismatched Parameters                                         STARTAREA
Parameter Size                                                4461
Parameter Type                                                Record

To study the impact of change requests on faults, the tool was executed on multiple versions of some programs, which were released over a period of 12 months. Results for one such program, PGM000, are presented in Table 3. In the May-10 version, the development team corrected the out-of-range GOTOs, while more out-of-range GOTOs were introduced in Jul-10, potentially because new code was added.

Inferences
• As paragraphs were added over time, an increase in out-of-range GOTOs is seen, suggesting that code changes introduce undesired fall-throughs. The May-10 version is an exception to this observation, since the out-of-range GOTOs were corrected in this version.
• In the presence of fall-throughs, there are fewer dead paragraphs. However, as the fall-throughs were removed in May-10, the number of dead paragraphs increased, indicating that in the previous versions a few unintended paragraphs were potentially executed due to fall-throughs.
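The version-to-version changes behind these inferences can be tabulated directly from the values reported in Table 3. The short Python sketch below only restates that arithmetic; the dictionary layout and the function name `version_deltas` are illustrative choices, not part of the tool.

```python
# Values as reported in Table 3 for PGM000.
TABLE_3 = {
    "May-09": {"paragraphs": 984, "out_of_range_gotos": 16, "fallthrough": 17, "dead": 7},
    "Nov-09": {"paragraphs": 1038, "out_of_range_gotos": 23, "fallthrough": 16, "dead": 2},
    "May-10": {"paragraphs": 1044, "out_of_range_gotos": 2, "fallthrough": 3, "dead": 38},
    "Jul-10": {"paragraphs": 1074, "out_of_range_gotos": 5, "fallthrough": 11, "dead": 30},
}


def version_deltas(table):
    """Change of every metric relative to the preceding version."""
    versions = list(table)  # dicts preserve insertion (release) order
    return {
        cur: {metric: table[cur][metric] - table[prev][metric] for metric in table[cur]}
        for prev, cur in zip(versions, versions[1:])
    }


deltas = version_deltas(TABLE_3)
for version, change in deltas.items():
    print(version, ", ".join(f"{metric} {value:+d}" for metric, value in change.items()))
```

The May-10 row shows the pattern discussed above: out-of-range GOTOs drop by 21 and fall-throughs by 13, while dead paragraphs rise by 36; in Jul-10, 30 added paragraphs come with 3 more out-of-range GOTOs.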

6. CONCLUSION / FUTURE WORK
In this paper, we have presented an analysis of the defects observed in production, and shown how certain programming malpractices and language anomalies can introduce faults in the code, resulting in failures that are not easily reproducible. We presented data to corroborate our inferences. We showed how many of these faults can be detected automatically during development, before testing. The automated techniques can additionally detect all occurrences of the faults, which could potentially lead to multiple failures during production.

While we have automated the detection of the faults that can be found with practically zero false positives, we continue to study other kinds of faults that would need heuristics to reduce the false positives. We also see the need for a more elaborate defect data analysis across multiple business applications, and across different domains, to infer other defects and faults that are prominent and can be detected with static program analysis techniques. In the long term, with much work needed towards the scalability of the automation techniques, we would like to study what guarantees of the absence of certain kinds of defects can be given for business applications.

We argue that it is important to remove the faults reported by the proposed automated techniques on a regular basis; how to correct them is at the discretion of the programmer. At the least, the development team can drastically bring down the number of outages by correcting the reported faults.

7. ACKNOWLEDGMENT
We acknowledge the deep trust and co-operation of the development team of the business applications. They have contributed immensely to the analysis of defects and to validating the detected faults. We would like to thank Ajitraj Sancheti and Ashutosh Gulanikar in particular for their contributions.

Table 3: Consolidated result for PGM000 over one year

PGM000 version   Total Paragraphs   Out-of-range GOTOs   Fallthrough   Dead Paragraph
May-09           984                16                   17            7
Nov-09           1038               23                   16            2
May-10           1044               2                    3             38
Jul-10           1074               5                    11            30

The development team is making the tools part of their development process, while also encouraging other business units to improve their quality and reliability by using the fault detection tools.

8. REFERENCES
[1] J. D. Musa, Software Reliability Engineering, McGraw Hill, 1999. (http://johnmusa.com/glossary.htm)
[2] N. Ayewah, W. Pugh, J. D. Morgenthaler, J. Penix, Y. Zhou, Evaluating Static Analysis Defect Warnings On Production Software, 7th ACM Workshop on Program Analysis for Software Tools and Engineering, June 2007.
[3] http://www.coverity.com/products/static-analysis.html
[4] http://www.raincode.com/cobolchecker.html
[5] http://www.semanticdesigns.com/Products/StyleChecker/index.html?Home=Main
[6] Niels Veerman and Ernst-Jan Verhoeven, Cobol Minefield Detection, SP&E, Volume 36, Issue 14, November 2006.
[7] Donald E. Knuth, Structured Programming with GOTO statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974.
[8] E. Dijkstra, Go to statement considered harmful, CACM, 11, pp. 147–148, 1968.
[9] "PRISM - Static data and control flow analysis workbench", Technical report, TRDDC, Pune, 2008.