Copyright 1997 IEEE. Published in the International Workshop on Program Comprehension 1997 (IWPC'97) May 28-30, 1997 in Dearborn, Michigan.

Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.

Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Enriching Program Comprehension for Software Reuse

Elizabeth Burd, Malcolm Munro
The Centre for Software Maintenance
University of Durham
South Road
Durham, DH1 3LE, UK

Abstract

This paper describes the process of code scavenging for reuse. In particular, we consider enriching program comprehension for the identification and integration of reuse components through information abstraction and the use of graphical representations. The requirements of good reuse candidates are described, followed by a description of the process of identifying them and preparing for their reengineering into reuse units. In particular, we describe two main activities: the identification of units, and the definition of each unit's user interface. The identification of reusable units initially applies some of the methods of RE2, extended with graph simplification procedures; the identification process is based on the calling structure of the code. Secondly, data analysis is performed on the identified reuse candidates. The data analysis process provides an indication of the potential use of a component and of the effort required to make the candidate reusable. The paper provides examples and results from a number of case studies which have been used to evaluate this work. Our work relies heavily on communication between technical and non-technical staff; we achieve this through the use of graphical representations, and results are therefore displayed graphically where applicable.

1. Introduction

Reuse has long been thought of as a way of improving productivity and maintaining, or even improving, quality standards [Biggerstaff89, Freeman87, Tracz88]. In addition, it is believed that reuse-based software development can reduce the costs of maintenance, because such software is more modular and software updates are therefore more localised. However, there is insufficient specifically designed reusable software for many domains, so developers must rely on updating existing software if they are to benefit from reuse. The process of finding and using software which has not been designed for reuse is often termed software scavenging. When scavenging is adopted, modifications (and therefore maintenance) frequently have to be performed on the software to make it reusable. However, if such maintenance issues can be overcome, the advantage of this approach is the availability of large quantities of potentially reusable software and thus a large overall potential cost saving.

For software to be reusable it must exhibit a number of features that directly encourage its use in similar situations. The following findings by Wood [Wood89] list the important qualities of reusable software. Reusable software should offer:

• Environmental independence - components should be reusable irrespective of the environment in which they were originally created
• High cohesion - components should implement a single operation or a set of related operations
• Loose coupling - components should have minimal links to other components
• Adaptability - components should be able to be customised so that they will work in a range of similar situations
• Understandability - components should be easily understandable so that users can interpret their functionality

It is believed that, ideally, the developer should be able to find large fragments of source code quickly which can then be reused without significant modification. However, all code proposed for reuse first needs to be assessed to evaluate the modifications necessary to make it reusable. Furthermore, an evaluation needs to be made of whether the software reuse candidates comply with Wood's quality criteria.

In practice, the overall effectiveness of the scavenging process is severely restricted by its informality, for the following reasons:

• a programmer can only reuse code fragments he or she has knowledge of
• there is no systematic way to share fragments
• identification and integration are inefficient and require thorough understanding of the code

The first two issues are concerned with the setting up of a reuse environment in which a repository of information is available to support the reuse process. With such support, information is available for the systematic sharing of components and there is increased knowledge of the existence of reusable code. Work in this area has included reuse library classification mechanisms [PrietoDiaz85] and the identification and retrieval of appropriate software components [Albrechtsen92]. This paper discusses the issues raised by the third reason for the lack of reuse successes: the understanding of code for the identification and integration of reusable components.

This paper describes the results of work we have recently completed on assessing the feasibility of obtaining reusable components from legacy COBOL code. While many of our techniques are applicable to most software development languages, we investigate COBOL since it represents the largest set of legacy systems. The scenario for this work results from a realisation that many business applications perform the same operations repeatedly across a number of applications. Therefore, if this repeated functionality can be extracted from the source code and reengineered into a new reuse module, the cost of future developments can be reduced and benefits will be gained from reduced maintenance.

This paper describes our analysis process using actual business applications. The examples we give are from a set of 12 case studies ranging from 3,000 to 40,000 lines of code excluding copy code. Each of the programs is from a business application and is still in use. Throughout this paper we describe the process by which reuse candidates are identified and how their suitability as reuse candidates is evaluated. We describe analysis of the calling structure of the code and the identification of data interactions as the basis of designing an interface for the reuse modules. At each stage of the analysis process we describe how it enhances understanding to assist the reengineering of the interfaces. This understanding process is supported by graphical representations, which aid communication between the users and maintainers of the code under analysis. Examples of the graphical representations that we use are provided throughout this paper.

The paper considers three aspects of the reuse process. Section 2 describes the identification of potential reuse candidates from existing legacy code. The integration of the components, and in particular our approach to investigating the simplification of reuse candidate interfaces, is described in section 3. Finally, section 4 concludes with the results of our work and suggests areas of further work which would offer improvements to our existing approach.

2. Identifying Reusable Components

In section 1 we indicated the lack of components designed specifically for reuse. For this reason it is necessary to find components within existing software applications. To locate such components we use a number of approaches from the RE2 paradigm [Canfora95, Cimitile95, DeLucia95, Tortorella94]. The RE2 paradigm provides many techniques for reengineering existing systems. Specifically, we use RE2 to investigate the calling structure of the code (COBOL PERFORMs) to generate a 'PERFORM graph', and the dominance relation of the calling structure between procedures (COBOL SECTIONs) to generate a 'dominance tree'. Using the largest and most complex of the examples we have analysed, we describe below the results of this analysis for the identification of reusable components, together with the benefits of each technique in aiding understanding of the reuse candidate.

A PERFORM graph is an example of a call directed graph. It is constructed by analysing the PERFORMs of each SECTION of the COBOL program. Within each of the sample programs it was the convention to structure the code with SECTIONs, each containing two PARAGRAPHs, one acting as an EXIT function; this convention ensures that there is no 'fall through' between SECTIONs. For each PERFORM in a SECTION, a link is made between the calling and the called SECTION. In COBOL, such an assessment of the calling structure can be complicated by statements such as GOTOs and by implicit fall through; however, the code used within the case study was well structured and did not exhibit such features.

The results of the PERFORM analysis can be seen in figure 1, which gives an indication of the complexity of the software being analysed. Within the 40,000 lines of code there are 230 SECTIONs, and between the SECTIONs there are 553 distinct calling relationships, or 932 in total when the number of times a SECTION calls the same SECTION is taken into account.

Figure 1: PERFORM graph

A number of graph simplification procedures can be used to improve the understandability of the PERFORM graph, including, for example, removing SECTIONs which are concerned only with error checking (frequently such SECTIONs mask the true functionality of the program). The results of removing three types of node can be seen in table 1, which also indicates the number of links removed along with each type of node.

Type of SECTION (node) removed    Number of nodes    Number of links
Error handling                                  5                 80
Database checks                                 2                 26
Message composition                             2                 10
Total                                           9                116

Table 1: The results of the graph simplification process
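The construction and simplification just described are mechanical. The following sketch is our own illustrative Python, not the tooling used in the study; its regular expressions assume the well-structured SECTION/EXIT convention noted above and ignore complications such as inline PERFORM loops and PERFORM ... THRU:

import re
from collections import defaultdict

SECTION_RE = re.compile(r'^\s*([A-Z0-9][A-Z0-9-]*)\s+SECTION\s*\.', re.I)
PERFORM_RE = re.compile(r'\bPERFORM\s+([A-Z0-9][A-Z0-9-]*)', re.I)

def build_perform_graph(source):
    """Map each SECTION to the list of SECTIONs it PERFORMs
    (one entry per call, so repeated calls are counted)."""
    graph, current = defaultdict(list), None
    for line in source.splitlines():
        m = SECTION_RE.match(line)
        if m:
            current = m.group(1).upper()
            graph.setdefault(current, [])
        elif current:
            graph[current] += [c.upper() for c in PERFORM_RE.findall(line)]
    return graph

def simplify(graph, noise):
    """Remove 'noise' SECTIONs (error handling, database checks,
    message composition) and count the links that disappear with them."""
    removed, simplified = 0, {}
    for node, callees in graph.items():
        if node in noise:
            removed += len(callees)
            continue
        kept = [c for c in callees if c not in noise]
        removed += len(callees) - len(kept)
        simplified[node] = kept
    return simplified, removed

On the case-study program this kind of count reproduces table 1: the nine removed SECTIONs take 116 links with them.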

The simplification process in this case has resulted in the removal of very few nodes, only 9 of the 230 SECTIONs (4% of the total). However, it has produced a 21% reduction in the number of links between the remaining SECTIONs.

In summary, the PERFORM graph provides an indication of the complexity of the code being analysed. In addition, 'problem SECTIONs', those with a high degree of connectivity to other SECTIONs, can be pinpointed for investigation since, frequently, such nodes represent the SECTIONs which provide error checking and mask the system's functionality.

From the PERFORM graph we are able to generate a dominance tree. The PERFORM graph is first turned into a call directed acyclic graph (CDAG), obtained by collapsing every strongly connected subgraph into one node; thus all nodes involved in a recursive cycle are grouped to form a single node of the graph. From this the dominance tree is constructed using a method devised by Hecht [Hecht77]. According to Hecht, in a CDAG a node px dominates a node py if and only if every path from the initial node x of the graph to py spans px. In order to find potential reuse candidates, a tree of strong direct dominance between nodes is considered: in a CDAG there is a relation of strong direct dominance between nodes px and py if and only if px directly dominates py and is the only node that calls py. The results of this analysis are shown in figure 2, in which the nodes which are not strongly dominant are darkly shaded.
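The dominance computation itself can be sketched as follows. This is our own illustrative Python, not the RE2 implementation, under the assumptions that cycles have already been collapsed into single CDAG nodes and that every SECTION is reachable from the mainline:

def dominators(cdag, root):
    """Iteratively compute dominator sets on a CDAG
    (cdag maps node -> set of callees)."""
    nodes = set(cdag) | {c for cs in cdag.values() for c in cs}
    preds = {n: set() for n in nodes}
    for n, cs in cdag.items():
        for c in cs:
            preds[c].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            new = {n}
            if preds[n]:
                new |= set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def strong_direct_dominance(cdag, root):
    """px strongly directly dominates py when px is py's immediate
    dominator and the only node that calls py."""
    dom = dominators(cdag, root)
    preds = {n: set() for n in dom}
    for n, cs in cdag.items():
        for c in cs:
            preds[c].add(n)
    edges = []
    for py in dom:
        if py == root:
            continue
        strict = dom[py] - {py}
        idom = max(strict, key=lambda d: len(dom[d]))  # deepest strict dominator
        if preds[py] == {idom}:
            edges.append((idom, py))
    return edges

The edges returned form the strong direct dominance tree from which candidate subtrees are read off.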

Figure 2: The dominance tree

The subgraphs within the tree which do not include the darkly shaded nodes represent the potential reuse candidates. From the code analysed above, 11 potential reuse candidates can be generated, ranging from a few hundred to several thousand lines of code each. In other examples we have found up to 20 candidates in a code module of around this size. Our findings have led us to conclude that it is not only the size of the initial code module which determines the number and size of the reuse candidates obtained from this analysis, but also the complexity of the interaction between SECTIONs. For this reason, removing 'non-functional' SECTIONs (i.e. error checking) before generating the dominance tree does help to identify larger components. However, the implications of removing SECTIONs for any purpose other than simply aiding the understanding process need to be considered carefully.

The PERFORM graph and dominance tree provide a communication mechanism through their graphical representation. From the resulting discussions we can begin to establish which candidates should be considered further, that is, which function or sets of functionality provided by the identified candidates are likely to be most useful for later reuse. This process is concept assignment; further details on how to perform it can be found in Biggerstaff [Biggerstaff94]. Predictions of potential future reuse depend upon the future business strategy, but factors such as the effort required to reengineer a component into reusable form also need to be considered. The graphical representations we use are therefore an important aid to discussion between management or systems procurement staff and technical staff. The graphs provide a fairly high-level view of potential components, which enables the early rejection of candidates whose functionality is deemed unsuitable. However, the work we have described so far does not provide any details of the reengineering effort required to make the candidates reusable; for this we must turn to data analysis. Since the cognitive load of such a process is large, we deliberately delay this detailed analysis until only the most promising components, selected using the dominance tree, remain.
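Reading the candidates off the tree is then mechanical. The following rough sketch is again our own illustration, where tree maps each node to its children in the strong direct dominance tree and shaded is the set of nodes that are not strongly dominant:

def candidate_subtrees(tree, shaded, root):
    """Enumerate the maximal subtrees containing no shaded node;
    each is returned as (head SECTION, member SECTIONs)."""
    def members_if_clean(node):
        # All members of node's subtree, or None if any member is shaded.
        if node in shaded:
            return None
        members = [node]
        for child in tree.get(node, []):
            sub = members_if_clean(child)
            if sub is None:
                return None
            members.extend(sub)
        return members

    candidates = []
    def walk(node):
        sub = members_if_clean(node)
        if sub is not None and node != root:
            candidates.append((node, sub))  # maximal: do not descend further
        else:
            for child in tree.get(node, []):
                walk(child)
    walk(root)
    return candidates

On the case study, this enumeration corresponds to the 11 candidates discussed above.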

3. Integrating Reusable Components

The ease of integration of components relates to the qualities of high cohesion and loose coupling. It is desirable to achieve high cohesion within the reuse module so that the module implements only one feature, or a group of similar features; this point was considered in the previous section. Tied to high cohesion, however, is loose coupling, which represents the interaction between the reuse candidates and the application under development. To investigate these interactions we consider the data dependencies of each of the reusable components. The data dependencies between reuse components form the component's user interface, so whenever possible we seek to remove unnecessary interactions and thereby simplify that interface.

We perform the analysis of the data items at two levels: initially, data analysis is performed within the reuse candidate itself; secondly, the analysis investigates the boundary between the prospective reuse candidate and the remaining code from which the candidate will be separated.

The initial analysis within the reuse candidate performs the following activities:

1. C.R.U.D. - data items are categorised using SSADM (Structured Systems Analysis and Design Method) data groupings, i.e. according to whether they are Created, Read, Updated and/or Deleted within the reuse candidate.
2. Data inter-relationships between typing - where subtyping is used and reference is made, within data items of the same type, to both super- and sub-types.
3. Data inter-relationships by value sharing - where enumerated types are defined and value overlapping is recorded.
4. The use of the REDEFINES relationship - where one set of types or data items is used to redefine another set of types or data items.

The secondary analysis between the reuse candidate and the remainder of the code involves the following activities:

1. C.R.U.D. - data items which are Created, Read, Updated or Deleted within the reuse candidate but used within the remainder of the code, or vice versa, are recorded for further analysis.
2. Data inter-relationships between typing - a record is made where supertypes are used within the reuse candidate but their sub-types are used within the remainder of the code, or vice versa.
3. Data inter-relationships by value sharing - a record is made where data items involving value overlapping are used within the reuse candidate and also within the remainder of the code.
4. The use of the REDEFINES relationship - where REDEFINES statements force relationships between the reuse candidate and the remainder of the code.
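A minimal sketch of the C.R.U.D. activities, assuming the data-item references have already been extracted from the source as (SECTION, item, access) triples (this representation is our own illustration, not part of the original tooling):

from collections import Counter

def crud_analysis(references, candidate_sections):
    """Tabulate Create/Read/Update/Delete counts per candidate SECTION
    and flag the data items whose accesses cross the candidate boundary."""
    per_section = {s: Counter() for s in candidate_sections}
    inside, outside = set(), set()
    for section, item, access in references:  # access in {'C','R','U','D'}
        if section in candidate_sections:
            per_section[section][access] += 1
            inside.add(item)
        else:
            outside.add(item)
    shared = inside & outside  # these items widen the candidate's interface
    return per_section, shared

The per-SECTION counters give tabulations such as table 3 below, while the shared set records the boundary-crossing items of the secondary analysis.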

To demonstrate how we make use of the above activities, and how they reduce the interactions of the reuse candidates and thereby assist the integration process, we give examples from one of the medium-sized software modules analysed. We select a reuse candidate of around 1,000 lines of code which is composed of 5 SECTIONs: H900, H905, H910, H930 and H935. The dominance tree for the subgraph to be analysed is shown in figure 3.

Figure 3: A strong dominance subgraph reuse candidate

All data items of the reuse candidate are then analysed and grouped according to the categories proposed in activity 1 above; table 3 shows the results of this analysis. Within this paper we are particularly concerned with the data items which are updated and those which are only read, since this information indicates how candidate reuse modules interact with each other. In table 3 we have not included the IDMS database statements because, for the purposes of this paper, we restrict our analysis to COBOL source code; we therefore see little use, as one might expect, of creation and deletion of data items. However, we have included these results for completeness.

          H900   H905   H910   H930   H935   Total
Created      1      0      0      0      0       1
Read        56     55     10     10      2     133
Updated     14     20      6      6      1      47
Deleted      0      0      0      0      0       0

Table 3: H900 subgraph data usage

Unfortunately, quite a high proportion of the data items of the reuse candidate are also used throughout the remainder of the code; without reengineering, the reuse candidate would therefore have a complex interface. We investigate ways of reducing the complexity of the interface throughout the remainder of this section. By categorising data items using SSADM's Created, Read, Updated and Deleted groupings, unnecessary interactions between the data accesses of reuse units can be identified. For instance, SECTIONs where data items are defined but not used should be identified, and such definitions moved to the SECTION where the data is used. The long-term benefit of making such changes can be assessed from the reduction in the number of data interactions between SECTIONs, and thus the decrease in the complexity of the reuse candidate's interface; this helps to gain an indication of the costs and benefits of the approach. In this reuse candidate, one candidate data item for relocation was identified.

Activity 2 identifies the need to address relationships between data items through typing. For instance, in COBOL we find the construct shown in the code sample of table 2.

01 INPUT-DATA.
   05 INPUT-DATE                       PIC 9(8).
   05 INPUT-DATE-R1 REDEFINES INPUT-DATE.
      10 FILLER                        PIC 9(2).
      10 INPUT-YEAR                    PIC 9(2).
      10 INPUT-MONTH                   PIC 9(2).
      10 INPUT-DAY                     PIC 9(2).

Table 2: Sample COBOL code

From this we can derive the consists-of data relationships shown in figure 4.

Figure 4: An example of the consists relationship from COBOL data typing


From figure 4 we can identify that Input-data consists of Input-date and Input-date-R1. In some cases not all of the type hierarchy will be present within the reuse candidate code; the absence or presence of part of the hierarchy can be represented graphically by shading. It is important to examine how the typing is used within the candidate in order to examine fully the relationships between it and the remainder of the code. The results of this analysis can be seen in figure 5, where the shaded boxes represent the data items which are referenced within the reuse candidate.
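Deriving these consists-of edges from a COBOL data division reduces to tracking level numbers. A small sketch, assuming the (level, name) pairs have already been parsed out, and ignoring real-world complications such as qualification of duplicate names like FILLER:

def consists_of(entries):
    """Derive consists-of edges from (level, name) pairs in source order,
    e.g. [(1, 'INPUT-DATA'), (5, 'INPUT-DATE'), (5, 'INPUT-DATE-R1'),
          (10, 'FILLER'), (10, 'INPUT-YEAR'), ...]."""
    edges, stack = [], []          # stack of open (level, name) groups
    for level, name in entries:
        while stack and stack[-1][0] >= level:
            stack.pop()            # close groups at this level or deeper
        if stack:
            edges.append((stack[-1][1], name))  # parent consists-of child
        stack.append((level, name))
    return edges

Shading is then simply the intersection of the resulting node set with the candidate's referenced data items.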

Figure 5: The consists relationship for the reuse candidate

From figure 5 it can be seen that there are no internal inter-relationships between the data items of the reuse candidate; such inter-relationships would be identified by the existence of two shaded boxes within one path of the consists trees. We therefore need only consider the relationships of the data items between the reuse candidate and the remainder of the code. When analysing this interaction we must record two relationships: how the remainder of the code affects the reuse candidate, and how the reuse candidate affects the remainder of the code. From the analysis of figure 5 we can conclude that the reuse candidate is potentially affected by write operations on sub-types whose supertype is read by the candidate, and also through supertypes whose subtypes are updated by the remainder of the code. From the analysis of the code we have identified only two interacting data items which must be recorded: the grandchildren of the node marked P, namely the nodes marked R and S. No write operations are performed on the supertypes. However, all write operations within the reuse candidate will affect the remainder of the code.

Overlapping values (activity 3) are evident where use is made of enumerated typing, as in the code sample shown in table 4.

01 DATA-INPUT.
   05 DAY-OF-WEEK          PIC 9.
      88 MONDAY            VALUE 1.
      88 TUESDAY           VALUE 2.
      ....
      88 SATURDAY          VALUE 6.
      88 SUNDAY            VALUE 7.
      88 WEEKEND           VALUES ARE 6 7.

Table 4: Code sample with enumerated typing

In this example the last three data types overlap. We are able to represent such overlapping values graphically as an extension of the consists graph (figure 6); here the Weekend data type groups its two overlapping values. The notation can be further extended, by colour coding the boxes, for cases where the reuse candidate does not comprise all the possible values.

Figure 6: An example of overlapping values within the consists graph

In our case study, 4 sets of overlapping values have been identified, but only 1 set is involved within the subgraph. The data analysis found that some of the data values within the reuse candidate's consists graph are also used in the remainder of the code; the shaded data items of figure 7 are those used outside the reuse candidate.

Figure 7: Values used within the remainder of the code
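Detecting such overlaps is a pairwise comparison of value sets. A minimal sketch, assuming the 88-level condition names and their values have been collected into a mapping (our own illustrative representation):

def overlapping_values(conditions):
    """Report pairs of 88-level condition names whose value sets overlap.
    conditions maps name -> set of values, e.g.
    {'SATURDAY': {6}, 'SUNDAY': {7}, 'WEEKEND': {6, 7}}."""
    names = sorted(conditions)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if conditions[a] & conditions[b]]

For table 4 this reports that WEEKEND overlaps both SATURDAY and SUNDAY, which is exactly the grouping drawn in figure 6.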

This analysis therefore provides the justification for recording the Weekend and Sunday data items and also the Day-of-week data item, as each is potentially affected by the remainder of the code. It also indicates a further area where simplification of the user interface could be achieved by reengineering.

Further data inter-relationships exist through the REDEFINES relationship (activity 4). An example of the usage of such a construct can be found in the typing sample of activity 2 (table 2), i.e.

   05 INPUT-DATE-R1 REDEFINES INPUT-DATE.

The effect this has on the consists graph of figure 4 is shown in figure 8. In addition, in some cases it is important to consider the reverse relationship, 'is-redefined-by'; in this example we can identify that Input-date is-redefined-by Input-date-R1. We will now consider the results of the analysis for these two relationships.

Figure 8: An example of the redefines relationship

Within the code from which we identified the reuse candidate, we found 18 REDEFINES relationships. Within the reuse candidate, no data items are redefined, either internally or outside the subgraph. Furthermore, the same result was found for the is-redefined-by relation: no data items within the remainder of the code are redefined by data items within the subgraph.

Throughout this section we have described how detailed analysis is performed to identify the data interactions between the reuse candidate and the remainder of the code. Each of the four activities enhances the understanding process by indicating the cohesion and coupling within the candidate. The activities gradually provide more and more information about the reuse candidates, so that initially their usefulness, and finally the effort required to reengineer them, can be assessed. An effort is made to pinpoint the areas on which reengineering should concentrate in order to reduce the complexity of the component's interface and so enhance its reusability. Furthermore, this is an important first step towards making the reuse candidate environmentally independent and adaptable for reuse in many different circumstances. Its reusability also depends upon the re-user understanding the functionality of the component; graphical representations of the calling structure and data interactions are therefore used whenever possible to allow visual identification of the unit's functionality.
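Activity 4 itself is easily mechanised. A rough sketch (our own illustration, using a deliberately simple regular expression) scans data description entries for REDEFINES clauses and records both directions of the relationship:

import re

REDEF_RE = re.compile(
    r'^\s*\d+\s+([A-Z0-9-]+)\s+REDEFINES\s+([A-Z0-9-]+)', re.I)

def redefines_relations(data_division_lines):
    """Collect (redefiner, redefined) pairs and the inverse
    'is-redefined-by' mapping from COBOL data description entries."""
    redefines, redefined_by = [], {}
    for line in data_division_lines:
        m = REDEF_RE.match(line)
        if m:
            new, old = m.group(1).upper(), m.group(2).upper()
            redefines.append((new, old))
            redefined_by.setdefault(old, []).append(new)
    return redefines, redefined_by

Pairs whose two members fall on opposite sides of the candidate boundary are the ones that force entries into the candidate's interface.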

4. Conclusions and Further Work

This paper has described a process by which reuse candidates can be identified and evaluated. The identification process concentrates on the high-level structure of the code, i.e. the calling structure, and identifies a set of potential candidates which can then be evaluated by reuse support staff to select the sets of functionality that will be most reusable for future application development. Our case studies have shown that using software managers to perform the task of concept assignment is a good starting point for evaluating the candidates' suitability for reuse, and we have found that our graphical representations strongly support this process by providing an abstraction of the source code. In most cases the managers were able to gain sufficient information from the SECTION names to provide a brief description of the functionality of a candidate without having to refer to the source code. These descriptions of the candidates' functionality could then initiate the discussion of the potential reusability of each candidate.

The selection of appropriate reuse candidates goes beyond simply investigating the functionality provided. If a significant degree of reengineering work must be performed on a candidate to make it reusable, it is important to weigh the cost of performing such work against the potential gains of having the reuse module available for future application development. The most significant cost of the scavenging process is the encapsulation of the reuse candidate behind a simple user interface which ensures that it is usable in new applications. The information for such decisions is provided by the data analysis described in section 3. The details obtained by performing the described activities enhance the knowledge gained from the SECTION analysis and from the discussions with management that assign functional descriptions to the candidates. Each of the four activities provides important details of how the data of the reuse candidate interacts with its environment. The data interactions between the reuse candidate and the original code can be overcome by defining the data items within the user area; however, where possible, data interactions should be avoided, as this helps to keep the user interface free from unnecessary complexity. By performing the SECTION and data analysis it is possible to gain a full and detailed understanding of the extent of the work involved in providing an interface for the reuse candidate. This can then be used to support the process whereby reuse candidates are evaluated for their potential reusability. Thus, throughout this paper we have described the process whereby candidates can be selected and assessed.

Two of Wood's [Wood89] reuse qualities remain to be considered: adaptability and environmental independence. These issues relate to the adaptation of reuse candidates to make them potentially reusable in more applications by improving their generality. The details gained from performing the activities described within this paper provide the backbone of the information required to perform such generalisation tasks. However, to investigate these issues the work needs to be coupled with reuse case scenarios; this is an area of work which is currently ongoing.

At present this method is supported by pre-prototype tools. The result of using these tools has been to demonstrate that tool support can automate many aspects of the method, provided it is complemented by human understanding, for instance in the concept assignment process. Lengthy tasks, such as the SECTION analysis and the production of abstractions of the reuse candidate, can then be completed automatically to aid the human understanding process. While this approach has worked well, commercial-strength tools must be developed to support the analysis of very large applications.

Acknowledgements

The authors wish to acknowledge, and thank, British Telecommunications for their assistance and financial support for this project.

References

[Albrechtsen92] Albrechtsen H., 'Software Information Systems: information retrieval techniques', in Software Reuse and Reverse Engineering in Practice, P.A.V. Hall (ed.), Chapter 6, UNICOM Applied Information Technology 12, Chapman & Hall, 1992.
[Biggerstaff89] Biggerstaff T.J., Perlis A.J. (eds), 'Software Reusability', Vols 1 and 2, ACM Press, Addison-Wesley, 1989.
[Biggerstaff94] Biggerstaff T.J., Mitbander B.G., Webster D.E., 'Program Understanding and the Concept Assignment Problem', Communications of the ACM, May 1994.
[Canfora95] Canfora G., Cimitile A., Visaggio G., 'Assessing Modularization and Code Scavenging Techniques', Journal of Software Maintenance, Vol 7, No 5, October 1995.
[Cimitile95] Cimitile A., Visaggio G., 'Software Salvaging and the Call Dominance Tree', Journal of Systems and Software, Vol 28, No 2, February 1995.
[DeLucia95] De Lucia A., 'Identifying Reusable Functions in Code Using Specification Driven Techniques', M.Sc. Thesis, University of Durham, 1995.
[Freeman87] Freeman P. (ed), 'Tutorial on Software Reusability', IEEE Computer Society Press, New York, 1987.
[Hecht77] Hecht M.S., 'Flow Analysis of Computer Programs', Elsevier North-Holland, 1977.
[PrietoDiaz85] Prieto-Diaz R., 'A Classification Scheme', PhD Thesis, Department of Information and Computer Science, University of California at Irvine, 1985.
[Tortorella94] Tortorella M.E., 'Identification of Abstract Data Types in Code', M.Sc. Thesis, University of Durham, 1994.
[Tracz88] Tracz W., 'Tutorial on Software Reuse: Emerging Technologies', IEEE Computer Society Press, New York, 1988.
[Wood89] Wood M., 'Software Function Frames: an approach to the classification of reusable software through conceptual dependency', PhD Thesis, University of Strathclyde, January 1989.