III. RELATED WORK

Some related work in the area of structure validation using Datalog, CHR or schema validation was discussed in section II. Additional approaches concentrate on knowledge validation at a higher level. For example, [10] presents a summary of the recent trends in this field by defining validation as the assurance of "functional accuracy or correctness of the system's performance" [10]. According to that, the types of validity are content validity, i.e. the correct representation of the problem domain, construct validity, i.e. the model represents expert behavior correctly, and empirical validity, i.e. a proper mapping between the system output and the real world. [10] also states that verification ensures the structural correctness of a system by detecting errors in the logic of the knowledge base. Other authors give similar definitions. For example, [17] describes verification as structural correctness, while stating that evaluation is the ability to reach correct conclusions. Others state that validation is concerned with building the right system, i.e. it ensures that the system does what it is supposed to do [14], [24], [26]. Other authors imply that the knowledge in the knowledge base is injected by different experts and thus the rules have to be checked for consistency [10], [14], [17], [24], [26]. However, injecting and validating rules is not considered in this paper. This work deals with the conformance of the injected facts with the model that an expert has defined.

As a consequence of the different definitions of validation, the existing methods for knowledge validation are mostly inappropriate for this work. Suggested approaches to knowledge validation include decision tables [16], graphs [26], constraints [2], [4] and even Petri nets [27]. A common approach seems to be testing [14], [15], [17], [24]. While [17] discusses completeness and consistency tests, [14] and [15] elaborate on the creation of the test sets. [24] explains validation by testing all possible input scenarios. Again, these approaches focus on a different problem domain.

IV. DESIGN PRINCIPLES

Since common conformance check approaches only cover parts of the requirements of current information management systems, we take NM as a representative example to define design principles for a comprehensive approach. To check data quality and conformance according to a (business) network data model, the quality requirements for structure and content discussed in section II have to be validated automatically, while allowing manual configuration of the validation programs.

Therefore, validation programs shall reliably check the conformance, e.g. of complex business network Datalog programs, and shall be written in a simple, intuitive, application-domain-independent Domain Specific Language (DSL) adapted to the validation domain. The resulting programming model should be usable by different personas, e.g. developers, data model and quality experts. That means even non-technical personas shall be able to write validation programs without knowing how to program. The runtime system for validation programs shall allow for efficient processing, parallelization and shared memory computing. Default runtime artifacts shall be assigned to validation program operations. Custom runtime artifacts can be deployed, which allow for extensions, e.g. filter operators, length checks, starts-with or ends-with checks for strings, and boundaries for numeric types. The validation programs shall be debuggable, to detect errors in the validation code as in any other programming language. When the validation run is finished, the program shall terminate and report on the conformance of the data. Contradictions shall be detected and visualized in an understandable, structured and readable way.
V. THE CONTENT-BASED VALIDATION APPROACH

Based on the considerations on conformance checks and design principles, a system and programming model for validation programs is defined subsequently. Fig. 2 schematically shows the compile and the runtime system, which is based on Deterministic Finite Automata (DFA) as in [13]. For the validation of data represented as Datalog facts (Facts), a validation program defines the NM model for conformance (Definition file). Given this input, the validation program compiler (VPC) computes a dependency graph as semantic model, see Fig. 1, and converts it into an execution plan based on DFA. Then the runtime system (VRS) validates the facts according to the execution plan and terminates by showing the validation result (Output (Log)).

Fig. 2. Schematic view on compile and runtime system

A. Validation Programs

The validation programs describe the conformance model in an external DSL as input to the VPC. The language follows the grammar in List. 4. For instance, facts, their predicates, arguments and the relationships between them are defined.

Listing 6. Basic validation program grammar (numbered "Listing 4." in the original)

// definition of the schema
<fact predicate>[<key index>]:<argument list>.
// definition of the argument
<argument name>({<list of fact predicates that the argument depends on>},<cardinality>).

Listing 5. Sample Definition file

1 # level-0
2 data_discx[0]: uri({},1), data({},0), content_type({},0), origin({},0).
3 # level-1
4 system_discx[2]: name({},1), description({},0), systemURI({data_discx},1).
5 host_discx[3]: name({},1), description({},0), origin({},1), hostURI({data_discx},1).
6 # level-2
7 runs_on_discx[]: systemURI({system_discx,bpartner_discx},1),
8                  hostURI({host_discx},1), origin({},1).

The key index is the integer value that indicates the index of the argument in the argument list that uniquely identifies the fact, e.g. the URI. The key index is optional, as there might be facts that do not have a single key argument, e.g. the runs_on_discx fact, which has a combined identifier of systemURI and hostURI. The key index of a fact is written in brackets ([ ]). The arguments in the argument list are separated by a comma (,). If an argument depends on more than one fact, the predicates of the lower-level facts are separated by a comma (,). The definition of a fact is terminated by a dot (.). Comments may be included using the number sign (#).

A sample definition file is shown in List. 5. The first line is a comment, then a data_discx fact is defined. Its key argument is the uri; the other three arguments are optional. None of the arguments depend on other facts. Then the definition of system_discx specifies a mandatory name, as the cardinality is 1, while the description is optional. The last argument, i.e. systemURI, at index 2 of the argument list, is a required argument which depends on a valid and matching data_discx fact. The host_discx is specified similarly. The runs_on_discx definition in line 7 has no unique URI and specifies a semantic relation between system_discx or bpartner_discx and host_discx.
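The concrete data structures of the semantic model that the VPC derives from such a definition file are not prescribed here. A minimal Java sketch of how a parsed definition-file entry could be represented follows; all class and field names are illustrative assumptions, not the actual implementation:

import java.util.List;

// Hypothetical representation of one parsed definition-file entry, e.g.
// system_discx[2]: name({},1), description({},0), systemURI({data_discx},1).
final class FactModel {
    final String predicate;              // e.g. "system_discx"
    final Integer keyIndex;              // e.g. 2; null if no single key (runs_on_discx)
    final List<ArgumentModel> arguments; // ordered argument list

    FactModel(String predicate, Integer keyIndex, List<ArgumentModel> arguments) {
        this.predicate = predicate;
        this.keyIndex = keyIndex;
        this.arguments = arguments;
    }
}

// One argument with its dependencies and cardinality (1 = mandatory, 0 = optional).
final class ArgumentModel {
    final String name;            // e.g. "systemURI"
    final List<String> dependsOn; // lower-level fact predicates, e.g. ["data_discx"]
    final int cardinality;

    ArgumentModel(String name, List<String> dependsOn, int cardinality) {
        this.name = name;
        this.dependsOn = dependsOn;
        this.cardinality = cardinality;
    }
}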
B. Compile Time System

The validation program compiler (VPC) analyzes the programs and creates a semantic model. From that, an execution plan based on DFA is compiled, which is able to check the conformance of the facts. This approach follows the idea of event-based systems [12], which stay in a state until a specific event happens. Depending on the state and the event, they react with a specified action. Thus, the state is the "memory" of the system. The event causes the machine to react and may lead to a change of the machine's internal state. Since the validation of an argument decides on the conformance, or acceptance, of a fact, and a fact impacts the consistency of a set of facts, the concept of Finite Automata [13] can be applied. More precisely, fact validation has to be done by an Acceptor Automaton [12], as it either accepts or declines a fact. When applying the concepts of an acceptor DFA [13], [12], its states and transition functions to knowledge validation, the automaton used for fact validation (FVA) will be similar to a "normal" finite automaton. The automaton definition for validation programs consists of a five-tuple, which has a set of states (Q′), a set of input symbols (Σ′), a transition function (δ′), a start state (q′0) and a set of accepting states (F′):

FVA = (Q′, Σ′, δ′, q′0, F′),

while the definition is slightly different from the DFA that accepts regular languages [13]:

Q′: The set of states. The automaton is in one of those states before the validation of an argument. Afterwards it can be in the same state as before or in another state out of Q′.

Σ′: The set of input symbols is the set of arguments of a fact, which can each be either valid or invalid. Therefore, Σ′ can be interpreted as one fact, with the restriction that its arguments are in an ordered set or, more precisely, a list.

δ′: The transition function represents the validation of one argument, i.e. it determines whether the input argument is valid or not.

q′0: For the fact validation automaton, the start state is the state that the automaton is in before it starts to validate the first argument.

F′: The set of accepting states includes only one state, which is the state the automaton goes to after successfully validating all arguments.

Therefore, the success of the validation can be determined by investigating whether the automaton is in its accepting state after validating the fact. Let qn be the state that the automaton is in after the fact validation; then qn ∈ F′ → the fact is valid, and qn ∉ F′ → the fact is not valid.
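To make the acceptor mechanics concrete, the following minimal Java sketch runs such an automaton over the ordered argument list of one fact. The interface and class names are assumptions, not the system's actual code, and the sketch adopts the convention, made explicit in the runtime section below, that a transition function signals a failed validation by returning the current state:

import java.util.List;

// A state's transition function delta': it consumes one argument and
// returns the follow-up state (possibly itself, signaling failure).
interface State {
    State validate(String argument);
}

final class FactValidationAutomaton {
    private final State startState;     // q'0
    private final State acceptingState; // the single state in F'

    FactValidationAutomaton(State startState, State acceptingState) {
        this.startState = startState;
        this.acceptingState = acceptingState;
    }

    // Runs the automaton over the fact's ordered argument list (sigma').
    // The fact is valid iff the automaton ends in its accepting state.
    boolean accepts(List<String> arguments) {
        State current = startState;
        for (String arg : arguments) {
            State next = current.validate(arg);
            if (next == current) {
                return false; // argument invalid: the fact cannot become valid any more
            }
            current = next;
        }
        return current == acceptingState; // qn in F'
    }
}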
When applying this definition to the conformance problem, the compiler generates default or custom states from the semantic model. Thereby the semantic model gives information on how to generate the FVA. For instance, an optional argument in the semantic model translates into a state which will be skipped by the transition function. Some of the default states, which are translated into the FVA execution plan similar to DB operators, are described subsequently:

Argument Validation: In the transition function of this state, an argument is validated in terms of checking whether it is specified. If the argument contains an empty string, the transition function returns the state that the automaton is currently in. Otherwise it returns the next state.

Skip Argument Validation: This transition function simply recognizes that there is an argument but does not validate it. It does not matter whether the argument is an empty string or not. Therefore, this transition function always returns the next state.

Validation of a Specific Argument of a Particular Fact: In this transition function, a fact- and argument-specific validation is implemented. For example, there is one transition function that only validates the name of system facts, which is defined to be a non-empty string that is less than 255 characters long.

Error State: This state represents any kind of unexpected input. For example, when the fact model for the given fact cannot be found, the start state of the automaton will be this state. If a fact has more arguments than the corresponding model, this state is the last state of the automaton. In any case, once the automaton is in this state, the fact cannot become valid any more. Consequently, the transition function of this state returns the same state on any input.

Fact Validation: This transition function is the most advanced one, as it has to validate all facts that the actual fact depends on.

During the definition of the fact and argument models, non-default validation operations might be required. For instance, the system name should be less than 255 characters long. Therefore, the system can be extended by a new transition function, i.e. a new state, by following a special naming schema: Validate<fact predicate><argument name>, where the name starts with the literal "Validate" followed by the fact predicate and the name of the argument that this state is for. For example, the state that validates system names would be ValidateSystem_discxName. The predicate and the argument name are the ones that were defined in the definition file.
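Such a custom state could, for instance, be sketched as follows, against the State interface assumed earlier; this is an illustration, not the system's actual code:

// Hypothetical custom state following the Validate<fact predicate><argument name>
// naming schema: it validates the name argument of system_discx facts.
final class ValidateSystem_discxName implements State {
    private final State nextState;

    ValidateSystem_discxName(State nextState) {
        this.nextState = nextState;
    }

    @Override
    public State validate(String argument) {
        // A system name must be a non-empty string of fewer than 255 characters.
        boolean valid = argument != null && !argument.isEmpty() && argument.length() < 255;
        // Returning the current state signals a failed validation to the runtime.
        return valid ? nextState : this;
    }
}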
As a remark on the implementation, the transition functions and their states are implemented in the Java programming language. Hence, in principle, all constructs of the context-free host language could be used within the programs. Since the transition function is an extension to the FVA, which natively accepts regular grammar languages, the constructs used in transition functions are limited to variable declarations, conditions and finite loops. All other constructs are considered unsafe. For instance, for infinite loops or RPCs, the termination of the programs or the correct detection of conformance contradictions cannot be guaranteed.

In general, the transition function of the automaton is defined as a complex function which allows (a) to check whether an argument is a non-empty string, (b) to skip the validation of an optional argument, (c) to include custom fact- and argument-specific validation, (d) to validate the URI of a fact, and (e) in all cases, to return the next state of the validation automaton if the validation was successful, or an error state if any of the validations mentioned above fail. While the transitions (a) to (c) are simple tests on the content of the argument, transition (d) is more complex due to the reference check.

C. Runtime System

The runtime system manages the actual validation of a fact or a set of facts by executing the (automaton) plans generated by the compile time system. The actual process of validation consists of several phases, such as preparation, applying a validation strategy (not explained in this paper), applying a distribution or shared memory schema, pre-processing, validation, and post-processing.

The preparation phase mainly consists of compiling the validation programs, generating the semantic model as well as the execution plan from the VPC. Then the runtime system is prepared by creating a context in which the facts will be validated. In general, a fact set includes a number of facts that may or may not have the same predicate. If the facts do not have the same predicate, they may depend on each other. The set of facts that will be validated later on is a subset of the context or the context itself: ValidationFactSet ⊆ Context.
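The reference check of transition (d), and likewise the Fact Validation state, has to look up dependent facts within this context. A minimal sketch under the same illustrative assumptions as before, where a Fact is modeled as a predicate plus its ordered arguments:

import java.util.List;
import java.util.Set;

// Hypothetical fact representation: a predicate with its ordered arguments.
record Fact(String predicate, List<String> arguments) {}

final class ContextLookup {
    private final Set<Fact> context;

    ContextLookup(Set<Fact> context) {
        this.context = context;
    }

    // Checks whether the context contains a lower-level fact of the given
    // predicate whose key argument matches the referencing URI, e.g. a
    // data_discx whose uri (key index 0) equals the systemURI of a system_discx.
    boolean resolves(String predicate, int keyIndex, String uri) {
        for (Fact f : context) {
            if (f.predicate().equals(predicate)
                    && keyIndex < f.arguments().size()
                    && f.arguments().get(keyIndex).equals(uri)) {
                return true;
            }
        }
        return false;
    }
}

The worked example that follows exercises exactly this kind of lookup.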
For example, let the context C be a fact set that consists of a runs_on_discx fact R with a system_discx and a host_discx fact (S and H) that each have a corresponding data_discx fact (DS and DH):

C = {R, S, H, DS, DH}

A fact set S1 that is a subset of this context may be

S1 = {S}

If the arguments of all facts are valid, the fact set S1 will be valid as well, as for the system_discx fact there is a corresponding data_discx fact in the context. Now, let another subset S2 be

S2 = {R, E}

where E is an example fact. The fact set S2 will not be valid, even though the arguments of all facts will be valid, because there is no data_discx fact DE that matches the configURI of the example configuration. Now, let the fact set S3 be

S3 = {R, S, H, DS, DH} = C

The fact set S3 will be valid, as the corresponding lower-level facts are in the fact set itself. Therefore, the automaton is not only able to validate a set of facts in the context, it is also able to validate the whole context.

When the runtime system is prepared, it applies a validation strategy and a distribution or shared memory schema derived from the semantic model. The distribution schema determines connected components within the semantic model and deploys programs and data to an arbitrary number of nodes. The shared memory schema facilitates the storage of all validated facts. When any of the already validated, lower-level facts is supposed to be validated again, the automaton can look up the shared memory and skip this validation. When disjunctive automata store their valid facts within the shared memory, the complete validation is processed efficiently. All invalid facts are stored in the local memory of the automaton for later revalidation.

Before the real validation can be started, all conditions of the system are checked, e.g. shared memory, arity and argument consistency checks, and the determination of the start state. Then the validation starts by calling the transition function of the actual state for each argument, which returns a state depending on the validation result. If the returned state is the current state of the automaton, the validation of the argument was not successful. Then the processing is stopped with one invalid argument; hence the fact cannot become valid any more. Otherwise, the current state is set to the state returned by the processed state. When a fact is validated, the post-processing handles success or failure cases and logs them appropriately.
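How the shared memory schema could skip the revalidation of already validated facts is sketched below. The class is hypothetical, but mirrors the behavior described above: valid facts are shared between automata, while invalid facts remain local for revalidation.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

// Hypothetical shared memory between parallel validation automata: facts
// that were validated successfully are recorded once and looked up before
// any revalidation.
final class SharedValidationMemory {
    private final Set<String> validFacts = ConcurrentHashMap.newKeySet();

    // Returns true if the fact is already known to be valid, so the caller
    // can skip it; otherwise the supplied validation is run and recorded.
    boolean checkOrValidate(String factKey, BooleanSupplier validator) {
        if (validFacts.contains(factKey)) {
            return true; // already validated: skip revalidation
        }
        boolean valid = validator.getAsBoolean();
        if (valid) {
            validFacts.add(factKey); // share the result with all automata
        }
        // Invalid facts stay out of shared memory (kept locally for revalidation).
        return valid;
    }
}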
VI. THE RUNTIME SYSTEM DURING PROGRAM EXECUTION

To demonstrate the defined compile- and runtime system, a sample validation program is used to debug the information represented as Datalog facts according to the NM model. Therefore, a user is assumed who creates a semantic relation between a system and a host as a runs_on_discx fact. In addition, that requires a system_discx with a name, description and a URI. In this case, there are no custom validation states deployed to the system, which would e.g. allow enhancing the validation with user-defined checks like QR1,2 or QR5,6. In the example shown in List. 6, the error comes from an invalid URI of the data_discx fact, which belongs to the system_discx but does not match its URI invalidUri (relates to QR8). For that, the definition from List. 5 is taken as validation program.

Listing 6. Invalid reference

runs_on_discx(systemUri, hostUri).
system_discx(sys1, desc, systemUri).
data_discx(invalidUri, content, content_type, origin).

The runtime system detects the inconsistency during execution. The automaton started with the validation of the runs_on_discx fact and then checked the validity of the system_discx fact. The name was validated, the description was skipped (because it is optional), and then the uri failed in the ValidateFact state due to the missing data_discx fact. The present data_discx fact is valid, but has an invalid uri (relates to QR8). This can be seen in the textual output

system_discx: Could not find a valid data_discx with the same key (systemURI: systemUri).

or in the graphical output in dot-notation [9]. When the invalidUri in the data_discx fact was corrected, the validation process was started and failed again, as shown in Fig. 3. This time the automaton did not find a valid host_discx fact required for the runs_on_discx fact (relates to QR4).

Fig. 3. Automata in error state due to missing host-fact

Guided by the textual and graphical support, this issue can be fixed (see List. 7) by creating a valid host_discx and linking it to the runs_on_discx fact. The textual output, the graphical output or both variants can be chosen. In the case of small input files, the graphical representation should be used; there, the textual output can serve complementarily as a more detailed description of the error. For larger input files, the dot-graph [9] can only help to locate the issue, while the text might be more helpful.

Listing 7. Correct fact input
runs_on_discx(systemUri, hostUri).
system_discx(sys1, desc, systemUri).
data_discx(systemUri, _, _, _).
host_discx(host1, desc, hostUri).
data_discx(hostUri, _, _, _).

VII. DISCUSSION AND FUTURE WORK

In this paper, we introduced a novel approach for conformance checks on structured data given an implicit data model. For that, we defined a programming model including validation programs as well as a compile- and runtime system based on automata theory, which is able to execute these programs. We evaluated our approach on a linked data domain, i.e. NM as part of BNM. On sample data and a data model from NM, we showed the correctness and reliability of our system. Through debugging, we illustrated how non-conform data, identifier uniqueness and model structure are checked. We hinted at how efficient, distributed validation plans are executed.

The application of our system to linked data or network domains is quite promising. Although tool support for debugging, distribution and shared memory schemas has to be improved, the application to real business network data shows good results. Especially when knowledge is added manually to the system, inconsistencies were found precisely and fast. Additionally, the system turned out to be helpful in detecting incompatible model changes, or interoperability issues between the NM discovery clients and the BNM server. Through its model-driven, DSL-based approach, our experience with the maintenance of the model and entity lifecycle management is positive. For instance, model extensions or changes can be maintained in the validation program and are automatically adapted by the compiler and the runtime system.

Future work will be conducted in the areas of the language, further debugging and tool support, and tool support for distributed or shared memory programming. In our approach, the grammar of the languages accepted by the automata is regular. According to [13], the automata could also be described by a regular expression. Consequently, it may be possible to describe the format of conform data models using regular expressions.
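As a toy illustration of this idea, and explicitly not part of the system presented above, the structural part of the system_discx model (a mandatory name, an optional description, a mandatory systemURI) could be encoded over the alphabet {v, e} and checked with a regular expression; reference checks against other facts are outside this encoding:

import java.util.List;
import java.util.regex.Pattern;

// Toy sketch: encode each argument as 'v' (non-empty) or 'e' (empty) and
// express the structural part of a fact model as a regular expression.
// For system_discx this yields v[ve]v.
final class RegexConformance {
    private static final Pattern SYSTEM_DISCX = Pattern.compile("v[ve]v");

    static boolean structurallyConform(List<String> arguments) {
        StringBuilder encoded = new StringBuilder();
        for (String arg : arguments) {
            encoded.append(arg == null || arg.isEmpty() ? 'e' : 'v');
        }
        return SYSTEM_DISCX.matcher(encoded).matches();
    }
}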
REFERENCES

[1] Berners-Lee, T., Fielding, R. & Masinter, L.: Uniform Resource Identifier (URI): Generic syntax. http://labs.apache.org/webarch/uri/rfc/rfc3986.html, 2005.
[2] Berstel, B. & Leconte, M.: Using constraints to verify properties of rule programs. Software Testing, Verification, and Validation Workshops (ICSTW), pp. 349–354, 2010.
[3] About the Eclipse Foundation. http://www.eclipse.org/org/, 2011.
[4] Elfaki, A., Muthaiyah, S., Magboul, I., Phon-Amnuaisuk, S. & Ho, C. K.: Defining variability in DSS: An intelligent method for knowledge representation and validation. System Sciences (HICSS), pp. 1–9, 2010.
[5] Frühwirth, T.: Constraint handling rules. In A. Podelski (ed.), Constraint Programming: Basics and Trends, Vol. 910 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, pp. 90–107, 1995.
[6] Frühwirth, T.: Theory and practice of constraint handling rules. The Journal of Logic Programming 37(1-3): 95–138, 1998.
[7] Frühwirth, T.: Constraint Handling Rules. Cambridge University Press, chapter 1, pp. 3–10, 2009.
[8] Fürber, C., Hepp, M.: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures. LWDM, Uppsala, 2011.
[9] Gansner, E., Koutsofios, E. & North, S.: Drawing graphs with dot. 2006.
[10] Gupta, U. G.: Validation and verification of knowledge-based systems: A survey. Applied Intelligence 3: 343–363, 1993.
[11] Harold, E. R. & Means, W. S.: XML in a Nutshell, 3rd edn, O'Reilly, 2004.
[12] Hoffmann, D. W.: Theoretische Informatik. Carl Hanser Verlag, Munich, Germany, 2009.
[13] Hopcroft, J. E., Motwani, R. & Ullman, J. D.: Introduction to Automata Theory, Languages, and Computation, 2nd edn, Addison-Wesley Longman, Amsterdam, Netherlands, 2001.
[14] Knauf, R., Gonzalez, A. & Abel, T.: A framework for validation of rule-based systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 32(3): 281–295, 2002.
[15] Knauf, R., Tsuruta, S. & Gonzalez, A. J.: Toward reducing human involvement in validation of knowledge-based systems. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 37(1): 120–131, 2007.
[16] Merlevede, P. & Vanthienen, J.: A structured approach to formalization and validation of knowledge. Developing and Managing Expert System Programs, Proceedings of the IEEE/ACM, pp. 149–158, 1991.
[17] Owoc, M. L., Ochmanska, M. & Gladysz, T.: On principles of knowledge validation. In Collected Papers from the 5th European Symposium on Validation and Verification of Knowledge Based Systems - Theory, Tools and Practice (EUROVAV '99), Kluwer, Deventer, Netherlands, pp. 25–35, 1999.
[18] Ritter, D., Bhatt, A.: Modeling Approach for Business Networks with an Integration and Business Perspective. ER 2011 Workshops, Brussels, 2011.
[19] Ritter, D., Ackermann, J., Bhatt, A., Hoffmann, F. O.: Building a Business Graph System and Network Integration Model based on BPMN. In: 3rd International Workshop on BPMN, Lucerne, 2011.
[20] Ritter, D.: From Network Mining to Large Scale Business Networks. International Workshop on Large Scale Network Analysis (LSNA), WWW Companion, Lyon, 2012.
[21] Ritter, D., Westmann, T.: Reconstructing Linked Business Networks from Network Mining Data using Datalog. Datalog 2.0, Vienna, 2012.
[22] Ritter, D.: Towards Business Network Management. Confenis: 6th International Conference on Research and Practical Issues of Enterprise Information Systems, Ghent, 2012.
[23] Resource Description Framework (RDF). http://www.w3.org/RDF/, 2012.
[24] Rosenwald, G. & Liu, C.-C.: Rule-based system validation through automatic identification of equivalence classes. IEEE Transactions on Knowledge and Data Engineering 9(1): 24–31, 1997.
[25] Ullman, J. D.: Principles of Database and Knowledge-Base Systems. Computer Science Press, Rockville, MD, USA, 1988.
[26] Wu, C.-H., Lee, S.-J. & Chou, H.-S.: Dependency analysis for knowledge validation in rule-based expert systems. In Proceedings of the Tenth Conference on Artificial Intelligence for Applications, pp. 327–333, 1994.
[27] Wu, C.-H. & Lee, S.-J.: Knowledge validation with an enhanced high-level Petri net model. In Proceedings of the 11th Conference on Artificial Intelligence for Applications, pp. 126–132, 1995.