gathering unstructured workflow data into relational ... - CiteSeerX

GATHERING UNSTRUCTURED WORKFLOW DATA INTO RELATIONAL DATABASE MODEL USING PROCESS DEFINITION LANGUAGE

Amir Pourabdollah
School of Computer Science and IT, The University of Nottingham, Jubilee Campus, Nottingham NG8 1BB, United Kingdom
[email protected]

Michael Hartley
School of Computer Science and IT, The University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor D. E., Malaysia
[email protected]

ABSTRACT
This paper introduces a new method for gathering unstructured workflow definition data in order to model it in relational databases. In addition to introducing an open database model, a solution has been developed to make it easier for process designers to define the workflow data. This solution consists of the Process Definition Language (PDL), a pre-processing step that converts unstructured workflow definitions to PDL, and an interpreter that converts PDL lines to SQL statements. The solution is able to fill the basic database tables needed to drive a general workflow management system (WFMS). PDL is close enough to natural language to be used by non-IT people, yet structured enough to be interpreted.

KEYWORDS
workflow models, relational databases, process description language

1. Introduction
A workflow is the sequence of actions or steps used in a process, which usually involves more than one party, uses many different resources, and usually arises from a set of operation rules [1]. In complex workflows, process engineers usually convert these rules or policies to processes, and an automated workflow management system can then handle these processes. Note that in this paper, workflow automation does not mean leaving computers to do the tasks of workflows (even if this is possible), but using computer systems to help human resources know who must do what, and when. Workflow automation has two main areas of application:
• Business/industrial process automation, such as office automation and e-commerce. Some of the benefits of using automated workflows in this field are improvements in transparency, efficiency, management and flexibility to process changes [1].
• Knowledge-based systems, such as representation and solution processes. In the field of knowledge systems, workflows are closely tied to concepts of knowledge. Some examples are the test steps in identifying an unknown chemical material and identifying the fault of electronic chips [2].

A Workflow Management System (WFMS) is a total computer support for the design and execution of processes [1]. It is a base for both defining and executing workflows [3]. The Workflow Management Coalition (WFMC) [4] has a standard WFMS reference model, as shown in figure 1.

Figure 1: WFMS reference model by WFMC [4]

Flexible workflow models are those with a clear abstraction of the workflow definition from other parts of the model (Interface 1 in figure 1). The other parts can be considered as engines driven by the workflow definition. Typical features of a WFMS may include workflow definition tools, task lists, rule-based decision-making tools, and process monitoring and management tools. Some advanced features are simulation and prototyping of workflows, computer-aided workflow definition tools, graphical representations of workflows, integration with other WFMSs or organizational systems, and embedded access to the web, email, etc.

2. Motivation
Workflow definitions, which process experts usually express as unstructured texts or graphs, are the raw material for filling the underlying layers of a WFMS, and some repository system must do the required data manipulation for this part of the system. This paper uses a relational database model for workflow definition and execution. As will be shown in section 5, this choice leads towards an open and centralized architecture for WFMS. Depending on the database design, certain algorithms are necessary to convert a process description to storable data. Workflow definition tools and languages, as listed in section 4, can mediate this conversion. These languages mostly look like programming languages and might be too complex to be used by non-IT people, especially when dealing with complex workflows. This paper focuses on an intermediate language which is designed to be as close as possible to natural English, but structured enough to be interpreted by computers. Converting the unstructured text to this language is easier, and the result is still readable and ready to be modified by normal users.

3. Terminology
The working definitions of the basic terms in this paper are as follows:
Subject: A work case which is being processed in a workflow, like a project, a file, or a problem.
Node: Any processing step where the subject waits to be processed, including a user task (a step to be performed as defined in [5]) or making a decision.
Reference: A logical relation between two consecutive nodes, which makes it possible to pass a subject between them. It represents the user's choice when passing a decision node, or what the user has done during a task.
Workflow Definition: A repository for what users can do within a workflow.
Workflow Execution: A repository for what users have done or must do within a workflow.
Involved Group: A group of people with a defined role for processing a specified subject.
User: An instance of the group class.

4. Related Works
Nyberg [6] gives a complete list and description of the languages that have been introduced for workflow definition. The more important ones are based on XML. Examples are FlowDL (FlowMark Definition Language), structural blocks of code introduced by IBM; WDL2 (Workflow Definition Language 2), introduced by the University of Helsinki [6]; and WfXML 2.0 [5], which is accepted as a standard language by the Workflow Management Coalition [4].

Aalst in [5] introduces XRL as an XML/Petri-net-based language for inter-organizational workflows. A new subclass of Petri-nets, called WF-net, is also introduced for describing workflows. The main application of this language is when there is no centralized management for controlling the workflow validity between clients, or when there is uncertainty about users. XRL (Exchangeable Routing Language) supports cross-organizational processes and uses XML for the representation of process definitions and Petri-nets for its semantics [7].

Aalst in [8, 9] classified all possible patterns (or basic elements) that can exist in workflow definitions into the 21 patterns of [9], and White in [10] proposed graphical notations for them. These patterns are collected to cover all possible behaviors of the rules applied in workflows. Kiepuszewski in [11] uses the level of coverage of the workflow patterns [9] to evaluate workflow systems.

5. The Model Structure
Although the related works mentioned above contain several methods for workflow modeling, the openness of the model has not been their main focus. The model provided in this paper is more open, in terms of:
• Abstraction of workflow data from workflow applications and, as a result, sharing of workflow definition data among multiple applications.
• Abstraction of workflow definition data from workflow execution, as required by the WFMC model [4].

These requirements lead to the use of an open and centralized relational database. Thus the logical location of the workflow data can move from inside the WFMS to a shared layer that is openly accessible to simpler WFMSs (figure 2).

Figure 2: Two approaches to modeling workflows

Other approaches may use languages for interchanging workflow definition data between clients (like XML-based languages), but the main information still has to be stored in application-specific repositories. A disadvantage of such approaches is that there are two versions of the workflow definitions: one stored in the WFMSs, and another in the interchanged texts. The knowledge gap between process experts and IT engineers has also not been fully considered in those related works. The complexity of the mentioned workflow definition languages may prevent non-IT users from being actively involved in WFMS development. As shown in figure 2, a process expert has a plain process description. In other approaches, this text has to be converted manually, and most probably with some difficulty, to a non-natural workflow definition language, which a WFMS then uses as a driver. In this paper's approach, the description is converted with less effort to a more natural text (PDL), and the PDL lines are then used by an interpreter to fill the required tables.

6. The Database Design
In their most structured form, workflow descriptions are usually expected to be a graph, or a text which is convertible to a graph. Figure 3 shows a sample workflow graph. This section describes this paper's approach to storing the information of such graphs.

Figure 3: A sample workflow graph

The required tables consist of two sets: the Definition Set, which explains the workflow, and the Execution Set, which records the subjects and how they move. Except for periodic changes to the workflow, the first set is static, while the second is dynamic but dependent on the first. Figure 4 shows the proposed design of the required sets of tables and their relationships.

Nodes and arrows are the most important elements of a workflow graph to be modeled. Generally there are 5 types of nodes: Start Nodes (no input), Action Nodes (single output), Distribution Nodes (multiple outputs with the same result), Decision Nodes (multiple outputs with multiple results) and Stop Nodes (no output). Instead of dealing with these various types, a single multi-purpose node can be defined which has 0 or more inputs and/or outputs. Each node of this single type is a record in the Nodes table. Each arrow is a record in the References table, which specifies the originating and the terminating nodes of that arrow together with a reference description, i.e. what happens when the user passes a subject from the originating node to the terminating node. Although this description is most meaningful for decision nodes (as the decision result), it can be generalized to non-decision nodes as the result of doing a task. In that case, it is the same for all output references from a distribution node.

Comparing the information in these two tables differentiates the various node types. Start nodes are those that never appear in the References table as a terminating node, and stop nodes are those that never appear as an originating node. Several originating nodes recorded for a single terminating node mean that the terminating node has multiple inputs; several terminating nodes recorded for a single originating node mean that the originating node is either a decision node or a distribution node. It is a decision node if those references carry different descriptions, and a distribution node if they share a single description.

In the Execution Set, the most important table is Logging, which contains the real records of what happens when users deal with subjects. It contains not only the past records of what has happened, but also what is currently waiting to be done. Each record contains information about the current subject, the user, the position of this record inside the whole workflow definition, the date and time of sending, and note(s) to the receiver(s).

This model contains other tables and fields for other issues which are not described here. These are issues like managing multiple workflows in a single system, adapting dynamically to applied changes while keeping the old process rules, and passing data forms between users. Note also that, according to the normal forms of relational database theory [12], the model can be shown to be fully normalized. Other details about this model can be found in [13].

This model supports the workflow patterns of Aalst in [9]. This does not mean that a separate element exists to satisfy each pattern, but that the potential exists to cover them in a future WFMS. For example, two separate groups of patterns in [9] cover decision and distribution, but they can easily be merged in this model. Some of these patterns are not supported by this model because of its different approach to workflows. For example, there are two groups of patterns for manual and automatic choice to handle decisions, while this approach assumes that users make all decisions. However, it is easy to imagine a WFMS that might automate some decisions.
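The node-type rules of this section can be sketched with a small in-memory database. This is only an illustration under stated assumptions: the table and column names (wf_groups, wf_nodes, wf_references, orig_node, term_node) and the function node_type are hypothetical and are not taken from the design in figure 4, and Python with SQLite stands in for whatever RDBMS is actually used. The sample rows are those of the job application example in section 7.4.

```python
import sqlite3

# Hypothetical, minimal version of the Definition Set (names are assumptions).
SCHEMA = """
CREATE TABLE wf_groups (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE wf_nodes (
    id INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES wf_groups(id),
    description TEXT NOT NULL
);
CREATE TABLE wf_references (
    orig_node INTEGER NOT NULL REFERENCES wf_nodes(id),
    term_node INTEGER NOT NULL REFERENCES wf_nodes(id),
    description TEXT NOT NULL
);
"""

def node_type(conn, node_id):
    """Derive the node type purely from the rows of the references table."""
    n_in = conn.execute(
        "SELECT COUNT(*) FROM wf_references WHERE term_node = ?", (node_id,)
    ).fetchone()[0]
    n_out, n_desc = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT description) "
        "FROM wf_references WHERE orig_node = ?", (node_id,)
    ).fetchone()
    if n_in == 0:
        return "start"          # never a terminating node
    if n_out == 0:
        return "stop"           # never an originating node
    if n_out == 1:
        return "action"         # single output
    # Multiple outputs: different results mean a decision,
    # one shared result means a distribution.
    return "decision" if n_desc > 1 else "distribution"

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO wf_groups VALUES (?, ?)",
                 [(1, "The applicant"), (2, "The employer")])
conn.executemany("INSERT INTO wf_nodes VALUES (?, ?, ?)", [
    (10, 1, "fills and submits the application"),
    (20, 2, "processes the application"),
    (30, 2, "interviews the applicant"),
    (40, 2, "accepts the application"),
    (50, 2, "rejects the application"),
])
conn.executemany("INSERT INTO wf_references VALUES (?, ?, ?)", [
    (10, 20, "The applicant fills and submits the application"),
    (20, 30, "If the application is short-listed"),
    (20, 50, "If the application is weak"),
    (30, 40, "If the interview is successful"),
    (30, 50, "If the interview is not successful"),
])
node_ids = [row[0] for row in conn.execute("SELECT id FROM wf_nodes ORDER BY id").fetchall()]
types = {n: node_type(conn, n) for n in node_ids}
print(types)
```

Because the type is derived rather than stored, adding or removing a reference row changes a node's type without any change to the node record itself, which is the point of the single multi-purpose node.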

7. PDL: Process Definition Language
This section describes how an unstructured text can be converted to a simple structured language named PDL (Process Definition Language), and then shows how PDL is used to fill the contents of the described data model. The text pre-processing is not computerized; however, natural language processing could be added to this solution, which is beyond the scope of this paper.

7.1. PDL Syntax
Simply put, each PDL line can be a natural English sentence that starts with the doer, followed by what is to be done; the doer must have been introduced beforehand in the header section.

Figure 4: The relational database design for workflow modeling

PDL lines should be stored in a plain text file, which has two main sections: Header and Description. The header section contains the name of the workflow and its involved groups of users. The syntax of the header is as follows:

    Workflow Name: <workflow name>
    Involved Groups: <group>, <group>, …
    Description:
    …

The word Description marks the end of the header section and the start of the main workflow description. The syntax of each line in the description section can be one of the following 3 types:
• [<label>:] [Following <labels>,] <doer> <text> [Continue On <labels>]
• [<label>:] If | Else if <text> [Continue On <labels>]
• End If

<text> means any text that the user wants to write to complete his/her sentence. Keywords, group names and labels should be written exactly as they were defined before, but they are not case sensitive. Continue On at the end of a line means that the workflow continues (in parallel for two labels or more) in the other line(s) specified by the labels. Following means that the current line is not the direct continuation of the previous line, but that the workflow reaches this line after completing each of the other lines listed in <labels>. Two lines are assumed to be done one after the other if there is no Continue On in the first line and no Following in the second. Other details of the PDL execution rules can be found in [13].

7.2. Pre-processing of Unstructured Texts
The following guidelines are used in pre-processing plain texts so that they can be re-written in PDL:
• First define the workflow name and the involved groups of users. Also define the workflow subject.
• Omit unnecessary sentences, and reduce the other sentences until each sentence contains a task about the defined subject, starting with a doer.
• The doer of each sentence must be one of the introduced groups of users. Also convert he, she, it, etc. to the equivalent user name, only at the beginning of each line.
• Separate the sentences so that each task or condition is on its own line, optionally followed by labels. Sometimes a sentence must be re-written as several lines, or multiple sentences must be re-written as a single line, depending on the nature of the sentence. For example, “all are informed” can be converted to more than one line, one for each division, and “prepare and send the documents” can be converted to two lines.
• If the resulting sentences are not logically consecutive, label them and link them using the Continue On and Following keywords. These keywords can also be used to avoid repeating the same processes.
• For decision points, use nested If/ElseIf/EndIf in the natural way, as it is commonly used in programming languages.
• Indentation is optional, but highly recommended, especially for more complicated workflows.
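As a rough illustration of the line syntax in section 7.1, the sketch below splits one description line into its parts. It is not the paper's interpreter: the function name parse_line, the returned dictionary keys, and the way the end of a Following list is detected (at the comma that precedes a known group name) are all assumptions made for this example, and the full execution rules are in [13].

```python
import re

def parse_line(line, groups):
    """Split one PDL description line into its parts (simplified sketch)."""
    text = line.strip().rstrip(".")
    if text.lower() == "end if":
        return {"kind": "end_if"}

    # Trailing "[Continue On <labels>]", optionally preceded by a comma.
    continue_on = []
    m = re.search(r",?\s*\bcontinue\s+on\s+(.+)$", text, re.IGNORECASE)
    if m:
        continue_on = [s.strip().lower() for s in m.group(1).split(",")]
        text = text[:m.start()].rstrip()

    # Leading "[<label>:]"; assumes the sentence itself contains no colon.
    label = None
    if ":" in text:
        label, text = (part.strip() for part in text.split(":", 1))
        label = label.lower()

    # Second line type: If | Else if <text>
    m = re.match(r"(else\s+if|if)\s+(.*)$", text, re.IGNORECASE)
    if m:
        kind = "if" if m.group(1).lower() == "if" else "else_if"
        return {"kind": kind, "label": label, "text": m.group(2),
                "continue_on": continue_on}

    # Leading "[Following <labels>,]": the list is assumed to end at the
    # first comma that is directly followed by a known group name.
    following = []
    m = re.match(r"following\s+", text, re.IGNORECASE)
    if m:
        rest = text[m.end():]
        cut = None
        for g in groups:
            c = re.search(r",\s*(?=" + re.escape(g) + r"\b)", rest, re.IGNORECASE)
            if c and (cut is None or c.start() < cut.start()):
                cut = c
        if cut is None:
            raise ValueError("no known doer after Following: " + line)
        following = [s.strip().lower() for s in rest[:cut.start()].split(",")]
        text = rest[cut.end():]

    # First line type: the sentence must start with an introduced group.
    for g in sorted(groups, key=len, reverse=True):
        if text.lower().startswith(g.lower()):
            return {"kind": "task", "label": label, "following": following,
                    "doer": g, "text": text[len(g):].strip(),
                    "continue_on": continue_on}
    raise ValueError("line does not start with a known doer: " + line)

# The group list below is hypothetical; the paper gives no header for this line.
parsed = parse_line(
    "Following the order registry, sales manager informs the delivery service, "
    "continue on delivery",
    ["Sales manager"],
)
print(parsed)
```

Labels and keywords are lower-cased on the way in, which is one simple way to honor the rule that they are not case sensitive.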

7.3. PDL Interpreter
The PDL interpreter extracts the contents of the basic database tables from a sequence of PDL lines. The output is a text file containing the resulting SQL statements. Note that the information about nodes and the information about references are combined in the PDL lines, while the reference information depends on the records of the Nodes table. The main job of the PDL interpreter is therefore to extract and fill the two main tables, Nodes and References, simultaneously from this mixed information. The interpreter is written in standard C++ to allow an object-oriented design, and SQL was chosen as the output format so that it can be used in any RDBMS environment.

7.4. An Example
As an example, consider a simple job application process. The plain text of the process description is as follows: “For employment in a company, each applicant must fill and submit an application form to the employer. The employer then receives the form and processes it, and decides whether the application is short-listed or not. If it is short-listed, the employer interviews the applicant, otherwise it will be rejected. For successful interviews, the application is accepted and for unsuccessful ones, the rejection will take place.”

In this case, a first review of the above text shows that the subject is ‘application’ and the involved groups are ‘the applicant’ and ‘the employer’. After pre-processing, the PDL text file can be written as:

    Workflow Name: Job Application
    Involved Groups: The Applicant, The Employer
    Description:
    The applicant fills and submits the application
    The employer processes the application
    If the application is short-listed
        The employer interviews the applicant
        If the interview is successful
            The employer accepts the application
        Else if the interview is not successful Continue on Rejection
        End If
    Else if the application is weak
        Rejection: The employer rejects the application
    End If

The interpreter converts the above lines to SQL statements. Executing this set of SQL statements in an RDBMS results in the following tables (some tables and fields are dropped or merged for simplicity):

Groups
    Group Id   Description
    1          The applicant
    2          The employer

Nodes
    Id   Group   Description
    10   1       fills and submits the application
    20   2       processes the application
    30   2       interviews the applicant
    40   2       accepts the application
    50   2       rejects the application

References
    Orig. Node   Term. Node   Description
    10           20           The applicant fills and submits…
    20           30           If the application is short-listed
    20           50           If the application is weak
    30           40           If the interview is successful
    30           50           If the interview is not successful
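The step from the PDL text of this example to the rows of the Nodes and References tables can be sketched as follows. This is a deliberately reduced illustration in Python, not the C++ interpreter of section 7.3: it handles only labelled task lines, If / Else if (with an optional Continue On) and End If, it ignores Following, and the function names, the node numbering in steps of 10, and the table and column names in the generated SQL are assumptions made for the example.

```python
import re

EXAMPLE = """\
Workflow Name: Job Application
Involved Groups: The Applicant, The Employer
Description:
The applicant fills and submits the application
The employer processes the application
If the application is short-listed
    The employer interviews the applicant
    If the interview is successful
        The employer accepts the application
    Else if the interview is not successful Continue on Rejection
    End If
Else if the application is weak
    Rejection: The employer rejects the application
End If
"""

IF_RE = re.compile(r"(else\s+if|if)\s+(.*?)(?:\s+continue\s+on\s+(.+))?$", re.IGNORECASE)

def interpret(pdl_text):
    """Return (groups, nodes, references) extracted from a PDL text."""
    lines = [ln.strip() for ln in pdl_text.splitlines() if ln.strip()]
    groups = [g.strip() for g in lines[1].split(":", 1)[1].split(",")]
    start = next(i for i, ln in enumerate(lines) if ln.lower().startswith("description"))
    nodes, refs, labels, deferred = [], [], {}, []
    stack = []                    # decision nodes of the currently open If blocks
    prev, pending = None, None    # last node reached, description of the next arrow
    for ln in lines[start + 1:]:
        if ln.lower() == "end if":
            stack.pop()
            prev = None
            continue
        m = IF_RE.match(ln)
        if m:
            keyword, condition, targets = m.groups()
            if keyword.lower() == "if":
                stack.append(prev)            # the node before "If" is the decision node
            prev, pending = stack[-1], "If " + condition
            if targets:                       # arrow to a label that may not exist yet
                for t in targets.split(","):
                    deferred.append((prev, t.strip().lower(), pending))
                prev = None
            continue
        # Task line: optional "Label:" followed by "<doer> <text>".
        label, sentence = None, ln
        if ":" in ln:
            label, sentence = (part.strip() for part in ln.split(":", 1))
        doer = next(g for g in groups if sentence.lower().startswith(g.lower()))
        node_id = 10 * (len(nodes) + 1)
        nodes.append((node_id, groups.index(doer) + 1, sentence[len(doer):].strip()))
        if label:
            labels[label.lower()] = node_id
        if prev is not None:
            refs.append((prev, node_id, pending))
        prev, pending = node_id, sentence
    refs += [(orig, labels[target], desc) for orig, target, desc in deferred]
    return groups, nodes, sorted(refs)

def to_sql(groups, nodes, refs):
    """Render the extracted rows as INSERT statements (hypothetical table names)."""
    q = lambda s: "'" + s.replace("'", "''") + "'"
    sql = [f"INSERT INTO wf_groups (id, description) VALUES ({i}, {q(g)});"
           for i, g in enumerate(groups, 1)]
    sql += [f"INSERT INTO wf_nodes (id, group_id, description) VALUES ({n}, {g}, {q(d)});"
            for n, g, d in nodes]
    sql += [f"INSERT INTO wf_references (orig_node, term_node, description) "
            f"VALUES ({o}, {t}, {q(d)});" for o, t, d in refs]
    return sql

groups, nodes, refs = interpret(EXAMPLE)
print("\n".join(to_sql(groups, nodes, refs)))
```

References to a label are collected first and resolved only after all lines have been read, because a Continue On may point forward to a line that has not been seen yet (here, Rejection).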

Some other examples of PDL lines are ‘If this solution is an acid continue on acidity evaluation.’ or ‘Following the order registry, sales manager informs the delivery service, continue on delivery’ or ‘Production manager decides whether or not the project has production priority.’

8. Future Works
The main future work is designing a complete WFMS driven by the created database engine. The design described here has been used in recent years to develop a few small-scale WFMSs, which have been working successfully in small business processes for several years. Although this demonstrates the functionality of the model, a large-scale system could be developed for general-purpose applications. On the other hand, the structure described here is a minimum set, and other tables and/or fields might be added to give the resulting systems more advanced functionality. Usability is the most important factor to be considered in designing the final WFMS, as the final application must be usable by a wide range of users.

9. Conclusion
This paper focuses on designing a new database model for general-purpose workflow management, suitable for a wide range of applications. It also provides a user-friendly method for filling the underlying tables from unstructured workflow definitions. The described steps comprise a database model as a platform for developing a WFMS, a simple process definition language (PDL), and an interpreter to bridge between the two. The pre-processing of unstructured texts consists of manual but very simple guidelines for re-writing plain texts. The contribution of this approach consists of providing a centralized and open architecture for shared access to workflow information, and introducing a natural and easy “Process Definition Language” for better usability among non-IT people.

References
[1] B. Kemme. (2000). "Workflow Management Systems (Lecture Notes)." Retrieved 7/2004, from http://www.cs.mcgill.ca/~kemme/cs764/lectures/764workflow.pdf.
[2] A. Garnemark, "Workflow and Knowledge Management (Msc Thesis)," in University of Goteborg, Department of Informatics, 2002, pp. 31-43.
[3] F. Leymann and D. Roller. (1997). "Workflow-Based Applications." Retrieved 12/2005, from http://www.research.ibm.com/journal/sj/361/leymann.html.
[4] WFMC. (2004). "The Workflow-Management-Coalition." Retrieved 8/2004, from http://www.wfmc.org/.
[5] W. M. P. v. d. Aalst, H. M. W. Verbeek, and A. Kumar, "XRL/Woflan: Verification of an XML/Petri-Net Based Language for Inter-Organizational Workflows," in Proceedings of the 6th INFORMS Conference on Information Systems and Technology (CIST-2001), 2001, pp. 30-45.
[6] K. Nyberg. (2000). "Workflow Definition Languages (Seminar on Database Management)." Retrieved 12/2005, from http://www.cs.hut.fi/~kny/workflowlang/.
[7] Petri-Nets-World. (2004). "Petri Nets World: Online Services for the International Petri Nets Community." Retrieved 7/2004, from http://www.daimi.au.dk/PetriNets/.
[8] W. M. P. v. d. Aalst. (2003). "Workflow Patterns." Retrieved 8/2004, from http://www.workflowpatterns.com.
[9] W. M. P. v. d. Aalst, A. H. M. t. Hofstede, B. Kiepuszewski, and A. P. Barros, "Workflow Patterns," Distrib. Parallel Databases, vol. 14, pp. 5-51, 2003.
[10] S. A. White. (2004). "Process Modeling Notations and Workflow Patterns." Retrieved 12/2005, from http://www.omg.org/bp-corner/bpfiles/Process_Modeling_Notations.pdf.
[11] B. Kiepuszewski, "Expressiveness and Suitability of Languages for Control Flow Modeling in Workflows (Ph.D. Thesis)," in Queensland University of Technology, Brisbane, Australia, 2003.
[12] W. Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory," Commun. ACM, vol. 26, pp. 120-125, 1983.
[13] A. Pourabdollah, "A User-Friendly Process Description Language Used in Creating Database Model of Workflows," M.Sc. Thesis, The University of Nottingham, 2004.
