
Application-oriented terminography in financial forensics Koen Kerremans, Isabelle Desmeytere, Rita Temmerman and Patrick Wille

This paper covers ongoing terminography work in the FF POIROT project, a European research project in which formal and shareable knowledge repositories (i.e. ontologies) and ontology-based applications are developed for the prevention of value added tax carousel fraud in the EU and the detection of securities fraud. We will emphasise that the knowledge requirements regarding users and applications determine what textual information should be structured at macro- and micro-levels of the FF POIROT multilingual terminology base. Furthermore, we will present our ideas concerning a multidisciplinary approach in terminography, called ‘Termontography’, for future application-oriented terminology development.

Keywords: application, multilingual terminology, terminography, Termontography, VAT fraud, securities fraud, financial forensics, ontology, terminology base

.

Introduction

Due to their increasing availability in electronic formats, terminological databases are now used in a way that surpasses their ‘traditional’ role as terminological dictionaries for human users. Especially their different implementations in information systems show that terminological databases need to represent (in one or several natural languages) those items of knowledge or ‘units of understanding’ (Temmerman 2000) which are considered relevant to specific purposes, applications or groups of users (Aussenac-Gilles et al. 2002). This paper covers ongoing application-oriented terminography work in the ‘Financial Fraud Prevention Oriented Information Resources Using Ontology

Terminology 11:1 (2005), 83–06. issn 0929–9971 / e-issn 1569–9994 © John Benjamins Publishing Company

84

Koen Kerremans, Isabelle Desmeytere, Rita Temmerman and Patrick Wille

Technology’ (FF POIROT) project. FF POIROT (IST–2001–38248) is a European research project (fifth framework) which aims to explore the use of tools and methodologies in order to represent, mine and use formal and shareable knowledge repositories (i.e. ontologies) in applications for the prevention of value added tax (VAT) carousel fraud in the EU and the detection of securities fraud. VAT carousel fraud is a kind of VAT fraud in which fraudsters sell goods at VAT inclusive prices and disappear without paying the VAT paid by their customers to the tax authorities. Securities fraud refers to the selling of overpriced or worthless shares, bonds, or other financial instruments to the general public (Zhao et al. 2004). The paper will emphasise that the knowledge requirements regarding users and applications determine what textual information should be added to the macro- and micro-levels of the FF POIROT multilingual terminology base (in English, Italian, French and Dutch). Furthermore, ideas will be presented concerning a multidisciplinary approach in terminography in order to better support future application-oriented terminology development. This approach, called ‘Termontography’, integrates a knowledge specification phase (Kerremans et al. 2003). We believe that, on the one hand, the knowledge specification will efficiently assist the corpus selection process. On the other hand, it will allow terminographers to establish specific extraction criteria as to what should be considered a ‘term’: i.e. the representation (in a natural language) of a unit of understanding which is considered relevant to given purposes, applications or groups of users. Furthermore, the pre-defined knowledge will also affect the terminographer’s working method as well as the software tools that will be used to support the working method (Aussenac-Gilles et al. 2002; Kerremans et al. 2004). 
This paper is structured as follows: in Section 2, we discuss the fraud applications envisaged in the FF POIROT project (Section 2.1) and the user profiles (Section 2.2). In Section 3, we show how the requirements of these users and applications are reflected in the FF POIROT multilingual terminology base. Next, in Section 4, we focus on ‘Termontography’ as a method to better support the development of application-oriented terminological databases and apply the methodology to the case study of financial forensics. In Section 5, we reflect on some important issues pertaining to Termontography. Finally, Section 6 summarises the most important findings in this article.


2. Applications, users and requirements in FF POIROT

In this section, we first discuss the applications for which the FF POIROT terminology is intended (Section 2.1). Next, we focus on the user profiles (Section 2.2).

2.1 Applications

The ontology-based applications developed in FF POIROT need to support the prevention of VAT carousel fraud in the EU as well as the detection of securities fraud. In order to evaluate the efficiency of these applications, several fraud scenarios have been collected by VAT Applications NV, a Belgian company specialised in VAT issues at international level, and by the Commissione Nazionale per le Società e la Borsa (CONSOB), the public authority responsible for regulating the Italian securities market. In order to understand the role of the multilingual terminology base in the applications envisaged, we will discuss in more detail the two types of financial fraud. In Section 2.1.1 we discuss VAT carousel fraud in the EU, while Section 2.1.2 deals with securities fraud.

2.1.1 Prevention of VAT carousel fraud in the EU

The current European Union (EU) VAT system is extremely vulnerable to missing trader or carousel fraud. This vulnerability is primarily due to the fact that the exchange of information between member states of the EU is slow, often too slow to expose fraudsters before they have disappeared again, as the fictitious companies they set up only exist for a period of between 3 and 6 months. Therefore, tax authorities should have access to several means in order to detect this type of VAT fraud. From the point of view of companies acting in good faith, one solution to avoid the risk of unwittingly getting involved in missing trader or carousel fraud is to know the suppliers with whom they are doing business.
This is extremely important as these legal companies, when involved in a fraud carousel scheme, can be held responsible in some EU member states for the payment of the ‘missing’ VAT. To determine whether a trade should be conducted or not, each company should first check whether or not its supplier has a valid VAT registration number. If this VAT number does not appear to be valid, no trade should be conducted. Other so-called ‘fraud indicators’ are: dealing in small-sized but high value goods, such as CPUs, memory cards and mobile phones, selling the goods below the market price, payments in cash and pre-arranged purchase/sale. When these elements are present, extreme caution is needed when doing business. For companies and tax authorities, it is often difficult to go through all fraud indicators in order to find out whether or not a trade is safe. This holds in particular for information about the trader, usually found on company websites or in other external sources, which is often stated in one of the 20 languages in the EU. From this perspective, a terminological database becomes an indispensable knowledge resource as the information one is handling in an international setup is likely to be presented in a natural language.

Suppose a Belgian company ‘A’ intends to sell memory cards to a British company. The Belgian company ‘A’ has been approached by the British company for this sale, whereby the British company says it already has a supplier for these goods, being another Belgian company ‘B’. This scheme indicates that the Belgian company ‘A’ may be getting involved in a missing trader fraud circuit, so more investigation is needed. As a first step, the Belgian company ‘A’ investigates who set up the British company. In order to succeed, ‘A’ should for instance know the English terms used to designate such a person, as the bylaws of the British company in which this information can be found are obviously written in English. ‘A’ finds out that an Italian person, named M. G., is associated with one of these terms. When the Belgian company then searches the Internet for this name and finds an Italian text in which this person is linked to Italian terms such as amministratore, faillita and radiotelefoni, it may derive, after consulting the multilingual terminological database, that M. G., the founder of the British company, has been the ‘director’ (amministratore) of an Italian company trading in ‘mobile phones’ (radiotelefoni) that went ‘bankrupt’ (faillita). These findings may be important enough for ‘A’ to decide not to do any business with the British company and Belgian company ‘B’, as involvement in a fraud scheme becomes likely.

2.1.2 Detection of securities fraud

Securities fraud can be achieved by establishing credibility of the financial product and/or the organization selling the product, appealing to people’s greed by offering high rates of return, and pressuring people to buy the product with stories of short-term windows of opportunity or by repeatedly contacting potential customers (Zhao et al. 2004).


As a regulatory authority, CONSOB aims to ensure the transparency and correct behaviour of securities market participants, to ensure that listed companies disclose complete and accurate information to the investing public, to guarantee the accuracy of the facts represented in the prospectuses related to offerings of transferable securities to the investing public, and to watch over the compliance with regulations by auditors entered in the Special Register (Zhao et al. 2004). An important task of CONSOB is to search on the Internet for websites offering on-line investment services to Italian residents and to determine whether or not there is any fraudulent solicitation of public savings. The current working procedure for retrieving suspicious investment websites is primarily based on keyword searches in different Internet search engines. The selected keywords are combined manually into complex queries based on the experience acquired during the ordinary supervision activity of CONSOB’s operative units (Zhao et al. 2004).

CONSOB now faces the problem that this technique is not sufficiently effective to identify all non-compliant websites, not least because fraudsters are becoming increasingly aware of keyword searches and take steps to avoid detection. For that reason, CONSOB is to use a software tool to automate the launching of queries that will find suspect websites in the languages involved and to optimise the web information retrieval results. The tool should be able to make use of the terms that occur in the multilingual terminology base with respect to the specific crime of fraudulent on-line investment services to Italian residents and abusive solicitation of public savings. The terminology base should, on the one hand, allow the tool to conduct semantic analysis of the selected pages to identify the pages containing crime information.
On the other hand, it should provide the terminology derived from CONSOB regulations in order to explain to the user, in the languages involved, its reasoning processes with respect to the identification of illegal solicitation of financial products through the web (Zhao et al. 2004).

2.2 Users

The FF POIROT multilingual terminology base is intended for different types of users. For instance, it can be used as a multilingual dictionary by ontology modellers, thereby supporting the ontology engineering process (Section 2.2.1). In case of securities fraud, regulatory authorities need to retrieve and analyse investment websites in multiple languages, based on terms they find in the terminology base. In case of VAT carousel fraud, both tax authorities and small and medium-sized enterprises (SMEs) can make use of the terminology base in order to extract specific information about traders from, for instance, company websites (Section 2.2.2).

2.2.1 Ontology modellers

Ontology modellers in the FF POIROT project follow the DOGMA approach (i.e. ‘Developing Ontology-Guided Mediation for Agents’), which advocates a two-layered approach to ontology modelling: lexons and commitments (Meersman 2000; Spyns et al. 2002). Commitments are the application-specific interpretation of lexons. Lexons are the vocabulary of the application semantics. They are grouping elements further composed of a context identifier γ, a starting term (i.e. headword) t1, a second term (i.e. tail) t2 and two roles r1 and r2. Note that a ‘term’ in a lexon base is the representation of a category. Terms and roles appear in a semantic relationship which receives, through the use of the context identifier γ, a particular meaning in a given context Γ. This ideational context is externalised by a set of resources, such as documents, graphs and databases. Through these resources, the semantic extension of a lexon is established, communicated, documented and agreed upon among ontology developers (Zhao et al. 2004).

An example of a lexon is ‘SixthDirective-MemberState-Adopt-Law-BeAdopted’ where ‘SixthDirective’ is the context identifier denoting European council directive 77/388/EEC on the harmonisation of the laws of the European member states relating to turnover taxes (henceforward: Sixth Directive). This lexon specifies that in the context of the Sixth Directive, there exists a reality, represented in an English controlled language by ‘MemberState-Adopt-Law-BeAdopted’, in which a European member state adopts a law and, vice versa, in which a law is adopted by a European member state.
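The five-part lexon structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Lexon`, its field names and the `parse` helper are hypothetical and do not come from the DOGMA tooling, and the field order (context, t1, r1, t2, r2) is inferred from the hyphenated example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexon:
    """One lexon: a context identifier, two terms and the two roles linking them."""
    context: str   # context identifier (gamma), e.g. the Sixth Directive
    head: str      # starting term t1 (headword)
    role: str      # role r1, read from head to tail
    tail: str      # second term t2
    co_role: str   # role r2, read from tail to head

    @classmethod
    def parse(cls, text: str) -> "Lexon":
        """Parse the hyphen-separated notation (assumed order: context-t1-r1-t2-r2)."""
        parts = text.split("-")
        if len(parts) != 5:
            raise ValueError(f"expected 5 hyphen-separated parts, got {len(parts)}")
        context, head, role, tail, co_role = parts
        return cls(context, head, role, tail, co_role)

    def readings(self) -> tuple[str, str]:
        """Return the two directions in which the relationship can be read."""
        return (f"{self.head} {self.role} {self.tail}",
                f"{self.tail} {self.co_role} {self.head}")


lexon = Lexon.parse("SixthDirective-MemberState-Adopt-Law-BeAdopted")
forward, backward = lexon.readings()
```

Keeping the context identifier as an explicit field mirrors the point made in the text: the same pair of terms and roles may hold in one context and not in another.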
The specification of the context identifier is important in the sense that this lexon does not hold for all situations in reality in which a law is being adopted. Sometimes a law may, for example, be adopted by a national government. This particular type of conceptual modelling is called ‘Object-Role Modelling’ (ORM). In ORM, the world is viewed in terms of objects and the roles they play (Halpin 2001). One of the tasks of the FF POIROT multilingual terminology base is to provide linguistic evidence of the formalised model. For instance, in the reality of the Sixth Directive, the lexon term ‘Law’ is lexicalised in English as ‘law’, in French as disposition législative, in Italian as disposizione legislative and in Dutch as wettelijke maatregel. Moreover, the terminology base may also provide terminological information which can be used by the ontology modellers for developing the ontology on financial forensics (Section 3).

2.2.2 Regulatory authorities and SMEs

Given the applications described in Sections 2.1.1 and 2.1.2, both regulatory authorities and SMEs will be considered important users of the multilingual terminology base. In case of securities fraud, the database will contain terms in multiple languages serving as keywords that are relevant for retrieving, for example, investment websites. This is particularly useful for regulatory authorities such as CONSOB, as they will be able to expand their on-line searches by adding keywords in multiple languages. SMEs and tax authorities are considered the two users of the application against VAT carousel fraud (Section 2.1.1). Based on the multilingual terminology base, these two users will be able to collect and understand multilingual information about traders in the EU.
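The two uses just described, expanding a search with keywords in other languages and understanding foreign-language information about a trader, can be sketched as simple lookups over term records. The sketch below is an assumption-laden illustration, not the FF POIROT implementation: the record layout, the function names `expand` and `gloss`, and the naive substring matching are hypothetical, and the Italian test sentence is invented. The term pairs themselves are those given in Section 2.1.1, with spellings as they appear there.

```python
# Each record groups the terms that designate one unit of understanding,
# keyed by language code. Entries come from the example in Section 2.1.1.
RECORDS = [
    {"en": "director", "it": "amministratore"},
    {"en": "mobile phones", "it": "radiotelefoni"},
    {"en": "bankrupt", "it": "faillita"},
]


def expand(keyword: str, records=RECORDS) -> set[str]:
    """Return the keyword plus every equivalent listed in the same record."""
    expanded = {keyword}
    for record in records:
        if keyword in record.values():
            expanded.update(record.values())
    return expanded


def gloss(text: str, source: str, target: str, records=RECORDS) -> dict[str, str]:
    """Map each source-language term found in the text to its target-language term.

    Matching is a plain case-insensitive substring test, which is enough for
    a sketch but would need tokenisation and inflection handling in practice.
    """
    lowered = text.lower()
    found = {}
    for record in records:
        if source in record and target in record and record[source] in lowered:
            found[record[source]] = record[target]
    return found


# Query expansion: search with all known equivalents of one keyword.
query_terms = expand("director")

# Glossing an invented Italian snippet into English.
hits = gloss("M. G., amministratore di una società di radiotelefoni", "it", "en")
```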

3. Requirements

Starting from the users and applications defined in the previous section, we can now discuss the requirements with respect to the use of terminology (Section 3.1) and additional terminological information (Section 3.2). These requirements have been written down in the FF POIROT user requirements report (Kingston et al. 2003).

3.1 Macro-structure

For both applications against financial fraud, the multilingual terminology base should hold terminology extracted from legal rules. These legal rules (encountered in national legislation, EU legislation, or rules published by regulatory authorities such as CONSOB) form the standard of correct behaviour in all cases relating to crime. From the viewpoint of a regulatory authority, they specify the necessary conditions for the existence of an enforceable claim. The task of assessment against standards should consist almost entirely of applying legal rules to an existing situation and determining whether the situation satisfies the legal rules or not (Kingston et al. 2003). Legal rules will need to be represented in the ontology and for that reason it is essential to add the terms (i.e. the words and linguistic patterns denoting categories and relations) found in these rules to the terminology base.


For the applications against financial fraud, the ability to reason with products or commodities is also essential. Commercial products consist of items that are offered for sale. In this project, we are concerned with various physical items (for VAT fraud) and with financial products (for securities fraud). The full list of important features of products will need to be determined through knowledge acquisition. Important properties include: the product category (e.g., electronics, petroleum products and treasury bonds), the product value-to-weight ratio (since carousel fraud is often carried out with low-bulk high-value goods), whether the product is real or fake (an important feature in detecting unauthorised investment brokers), and whether the product attracts VAT. The last feature may vary between EU countries (Kingston et al. 2003). The terminological database should cover English, French, Dutch and Italian terms that refer to these products. Moreover, each term denoting a product will be further specified in the terminology base by means of the above-listed features.

An understanding of commercial transactions is another important knowledge requirement for the applications against financial fraud. The terminology base should therefore provide the terms encountered in texts that refer to the categories typically characterising commercial transactions. These categories are for instance: a sale, a product, a vendor, a purchaser, a place of supply and a place of acquisition. Normally, the product is a good or service and the sale price is monetary. In the domain of financial fraud, it is also normal for a record of the sale to be created, usually on some kind of invoice (Kingston et al. 2003).

Terminology used to denote categories related to companies and their structure should also be placed in the terminology base.
Special attention should be paid to terms denoting: the time when a company was created, its main products, its location, its main trading partners, its parent/subsidiary companies and its head of company. The terminology base should also include terms denoting business activities, operating locations, trading record, management, audit reports as well as cash flow and financing (Kingston et al. 2003). All these terms will assist the aforementioned applications in carefully retrieving information about companies and in understanding the corporate structure.

Being able to retrieve investment websites and to determine the suspicious ones is an important feature of the application against securities fraud (Section 2.1.2). The terminology base should therefore cover words or phrases that are considered to be suspicious if found in certain contexts. In the online investment domain, for example, there are often clues when websites contain terms that either do not exist in financial circles or which are derivations of actual terms (Kingston et al. 2003).

3.2 Micro-structure

Apart from terms in multiple languages, ontology modellers in FF POIROT require the following information throughout the developing stages of the ontology: definitions of these terms, co-texts, references to the textual sources from which descriptions and co-texts have been retrieved, and relational markers that link two terms in a given context. Other useful information is a relation specifying that a certain term (e.g., ‘business’) is the building block of another term (e.g., ‘business activity’) or vice versa.

Consider for instance the category labelled in English as ‘investment firm’. If ontology modellers need to formalise this category, they can find in the multilingual terminology base that in the Italian legislative decree 58 on financial intermediation, a distinction is made between a società di intermediazione mobiliare (SIM), a European investment firm and a non-European investment firm. They can even look at the particular law sections in this legislative decree by clicking on the reference that is added as hyperlink. Furthermore, the terminological database shows that this English term also appears in the proposal called ‘Proposal for a Directive of the European Parliament and of the Council on Investment Services and Regulated Markets’. In this proposal, investment firm is defined as “any legal person whose regular occupation or business is the provision of investment services on a professional basis” (Article 3.1(1)). Apart from the definitions, ontology modellers derive from the terminology base that ‘investment firm’ is similar to the English term ‘investment company’ and that the Italian term impresa di investimento is a possible translation equivalent.
Moreover, the database gives an idea about the frequency of occurrence of the English terms ‘investment firm’ and ‘investment company’ as all co-texts have been added to the database. The database also shows that ‘investment company’ is, for example, related to the English term ‘investment service’ by means of the relational marker ‘provided’ in the context of the European council directive 93/22/EEC. The term ‘investment firm’ is considered a building block of terms such as ‘eu investment firm’, ‘non-eu investment firm’ or ‘right of investment firm’ and contains the terms ‘investment’ and ‘firm’.

Apart from the multilingual terminology covering the financial fraud and fraud-related domains (Section 3.1), both regulatory authorities and SMEs require definitions of important terms (in multiple languages) as well as references to the sources from which the definitions have been retrieved. The applications described in Section 2.1 also need the same information. For instance, as these applications need to be able to ‘recognise’ cases of VAT carousel fraud or securities fraud, they must be able to reason with definitions of fraud, sanctions for fraud, etc. in the EU countries. The differences in definitions and sanctions constitute factors that fraudsters use to select their ‘favourite’ country for the fraud operation. In some member states, for instance, the existence of an offence is linked to obtaining a profit, whereas in others it is enough for the act to have been committed (Kingston et al. 2003). The specification of the differences between EU countries may, on the one hand, come from domain-experts. On the other hand, the differences can be derived from a terminological analysis of legal texts and/or domain-specific texts on content-related issues and may be represented in the terminology base in order to support and facilitate the formalisation of this knowledge in the ontology.

Figure 1 shows an example of the terminological record listing the terms ‘chargeable event’ (English-UK), fait générateur (French-Belgium), fatto generatore (Italian-Italy) and belastbaar feit (Dutch-Belgium). The English term ‘chargeable event’ is further specified by the following fields: ‘Term occurs in’, ‘Term contains’, ‘English description’, ‘English relation’ and ‘English co-text’. Descriptions, co-text and relations all have links to the sources from which they have been retrieved. The French, Italian and Dutch terms are only characterised by a description in one of these three languages. In the latest draft, the FF POIROT terminology base contains 5043 terminological records which have been compiled manually and semi-automatically.

Figure . Example of a terminological record in the FF POIROT terminology base


In the next section, we present our ideas concerning Termontography, a terminological approach we are working out for supporting future (application-oriented) terminology development.

4. Termontography

In previous articles (Kerremans et al. 2003; Temmerman and Kerremans 2003; Kerremans et al. 2004) ‘Termontography’ was described as a multidisciplinary and functional approach for terminology description in which theories and methods in terminography are combined with methods and principles in ontology engineering. In case of multilingual terminological projects, the main idea in the Termontography approach is that multilingual terminological knowledge, e.g., terms and knowledge-rich contexts (Meyer 2001), retrieved from a multilingual textual corpus of parallel and/or non-parallel texts, is structured according to a pre-defined, language-independent and task-oriented framework of domain-specific knowledge. This so-called ‘categorisation framework’ is a graphical representation of the knowledge chunks that are relevant for supporting the applications that have been described in an initial requirements specification phase. Knowledge chunks very often consist of at least two categories and their intercategorial relation. Examples of intercategorial relations are: is an acronym of, is a hyperonym of, is a hyponym of, etc.

In the following subsections, we will have a closer look at the stages a terminographer goes through in order to develop a terminological database, following the Termontography approach. This development process is visualised in Figure 2. The process can be broken down into six methodological steps. These steps or phases are: the analysis phase (Section 4.1), the information gathering phase (Section 4.2), the search phase (Section 4.3), the refinement phase, the verification phase and the validation phase (Section 4.4). The analysis and information gathering phases are supervised and supported by domain experts.

4.1 Analysis phase

Two steps are needed to integrate the analysis of users, applications and purposes of terminology bases in Termontography. The first step is writing a user requirements report (Section 4.1.1). The other is to develop a categorisation framework in the knowledge specification phase (Section 4.1.2).


Figure 2. Methodological steps in Termontography

4.1.1 Requirements specification

Writing a requirements report (Figure 2) is considered an important methodological step in ontology engineering approaches (Jarrar and Meersman 2002; Sure and Studer 2003). In the FF POIROT project, this report is a written document summarising the requirements concerning applications and users. Part of that document has been summarised in Section 3.1 of this article. The disadvantage of a written document is that it may still leave room for interpretation. In this case, the risk becomes high that term extraction criteria, if too vaguely defined in the user requirements report, are interpreted differently by several terminographers working on the same multilingual terminology base. In order to better align their work, we therefore propose to start from a knowledge specification phase in which a categorisation framework is set up, visualising the knowledge considered relevant for the applications and users envisaged. This categorisation framework, in combination with a user requirements report, can help terminographers to select multilingual, domain-specific texts for the corpus and to determine what knowledge should be represented in the multilingual terminological database (Section 4.1.2). In this sense, the categorisation framework should be seen as a meta-model that “specifies the structure of domain knowledge and imposes appropriate constraints between various classes of concepts” (Gamper et al. 1999: 8). Furthermore, as is shown in Figure 2, it helps to outline the particular working method and to identify the tools that will be used in order to support the working method.

4.1.2 Knowledge specification

Terminographers need a solid reference framework to scope their terminology work, i.e. to determine which linguistic words/patterns are considered terms given the applications, users and purposes of the terminology base. As Le Néal observed, “a piecemeal approach is totally inadequate for studying the elaborate system of concepts that exist in most complex disciplines, and could result in major inconsistencies” (2001: 648). The framework in a multilingual terminology project may be derived from one source language only. Terms in the other languages are in that case only considered relevant if they are the equivalent of a term of a category appearing in the categorisation framework. Obviously, this method would not be useful if the applications specified in the requirements report need access to a multilingual terminological database covering for instance the domain of European VAT law (Kerremans et al. 2003). In that case, it is important for the framework not to be biased to one language and cultural setting. Instead, it should hold language-independent knowledge to which the VAT terminology in several languages can be mapped (Kerremans et al. 2003). An example of a multilingual terminological database built on the basis of what is specified on the level of the conceptual system is the OncoTerm project (Moreno Ortiz and Pérez 2000). The OntoTerm tool which has been developed in that research project even prevents a terminographer from adding a new term to the terminological database if no category for this term has been specified in the ontology.
This does not mean “that an ontology must be fully developed before starting terminology work, as the user may choose to perform both tasks concurrently, mapping terms in the Termbase Editor to concepts defined in the Ontology Editor as they are entered” (Moreno and Pérez 2001: 8). This flexible characteristic of the categorisation framework is also accounted for in the Termontography approach. Based on the feedback of a domain-expert, a terminographer decides to add categories to the framework and terms to the terminology base if they cover knowledge which is considered to be relevant within the scope of the purposes defined.
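The constraint described above, that a term is only accepted once its category exists, while the framework itself remains open to extension, can be illustrated with a short sketch. It is a hypothetical model written for this discussion and does not reproduce the OntoTerm tool or its interface; the class and method names are assumptions, and ‘invoice’ is used as an example of a category added later.

```python
class UnknownCategoryError(KeyError):
    """Raised when a term is added for a category missing from the framework."""


class TermBase:
    def __init__(self, categories):
        self.categories = set(categories)
        self.terms = {}  # category -> list of (language, term) pairs

    def add_category(self, category):
        """Extend the framework, e.g. after feedback from a domain expert."""
        self.categories.add(category)

    def add_term(self, category, language, term):
        """Store a term only if its category already exists in the framework."""
        if category not in self.categories:
            raise UnknownCategoryError(category)
        self.terms.setdefault(category, []).append((language, term))


base = TermBase({"sale", "vendor", "purchaser"})
base.add_term("vendor", "en", "vendor")

try:
    base.add_term("invoice", "en", "invoice")  # category not yet in the framework
    rejected = False
except UnknownCategoryError:
    rejected = True

base.add_category("invoice")                   # the framework is extended first
base.add_term("invoice", "en", "invoice")      # now the term is accepted
```

The two operations can be interleaved freely, which reflects the point that the framework does not have to be complete before terminology work starts.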


Depending on the level of granularity, categorisation frameworks provide detailed information with respect to the manual and semi-automatic extraction and structuring of terms from, for instance, a multilingual corpus of texts. By adhering to a common reference framework, terminographers will be able to decide more efficiently which terms are considered translation equivalents and, consequently, need to be placed in the same terminological record. In case of FF POIROT, the user requirements report has introduced a number of categories that are important to represent in a categorisation framework (Section 3.1). For instance, the categories labelled as ‘sale’, ‘vendor’, ‘purchaser’, ‘place of acquisition’, ‘product’ and ‘place of supply’ indicate which terms should be retrieved from texts explaining commercial transactions. In this section, we focus on a small part of the categorisation framework that provides information about terms that are relevant for the application against VAT carousel fraud (Section 2.1.1). In order for this application to ‘recognise’ possible fraudulent intra-community transactions, it needs to have insight into the law sections related to commercial transactions between European member states. Moreover, the application should be aware of the fact that for some intra-community transactions, no VAT is required. In the European Sixth Council Directive 77/388/EEC (henceforward: Sixth Directive), there are several sections discussing such transactions.
By presenting these sections in a semantic network-like structure, showing their lexical and conceptual relations to other categories in the domain of interest, terminographers can, for example, derive that:

– all the European VAT legislations contain a section describing transactions for which no VAT is required;
– the category of transactions for which no VAT is required is further divided into four subcategories: ‘transactions not allowing the supplier to deduct VAT’ (Sixth Directive, Section 13), ‘transactions allowing the supplier to deduct VAT’ (Sixth Directive, Section 28quater or 28 (c)), ‘transactions occurring outside the territory of the VAT legislation at stake’ (Sixth Directive, Sections 8 and 9) and ‘transactions occurring outside the scope of VAT’ (Sixth Directive, Sections 2 to 7);
– the sections in the VAT legislations describing these different categories need to be included in the conceptualised model of the domain of interest;
– the terminology referring to each category in particular needs to be presented in the multilingual terminological database.

Figure 3 visualises the mapping of terms encountered in different European VAT legislations to the categorisation framework covering transactions for which no VAT is required.
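As a rough illustration of the structure just described, the following Python sketch encodes the category of transactions for which no VAT is required, together with its four subcategories and their Sixth Directive sections, as a small tree. The class name `Category`, the dictionary keys and the helper `subcategories` are assumptions made for this example only; they are not part of the FF POIROT framework.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Category:
    """One node of a (hypothetical) categorisation framework."""
    key: str                                   # invented identifier
    label: str                                 # label as given in the text
    sixth_directive_sections: Tuple[str, ...] = ()
    parent: Optional[str] = None               # key of the parent category


# Labels and section references come from the text; the keys are invented.
FRAMEWORK: Dict[str, Category] = {
    c.key: c
    for c in [
        Category("no_vat", "transactions for which no VAT is required"),
        Category("no_deduction",
                 "transactions not allowing the supplier to deduct VAT",
                 ("13",), "no_vat"),
        Category("deduction",
                 "transactions allowing the supplier to deduct VAT",
                 ("28quater",), "no_vat"),
        Category("outside_territory",
                 "transactions occurring outside the territory of the VAT legislation at stake",
                 ("8", "9"), "no_vat"),
        Category("outside_scope",
                 "transactions occurring outside the scope of VAT",
                 ("2", "3", "4", "5", "6", "7"), "no_vat"),
    ]
}


def subcategories(framework: Dict[str, Category], parent_key: str) -> List[Category]:
    """Return the direct subcategories of the category with the given key."""
    return [c for c in framework.values() if c.parent == parent_key]


for sub in subcategories(FRAMEWORK, "no_vat"):
    print(sub.label, "-> Sixth Directive, Section(s)", ", ".join(sub.sixth_directive_sections))
```

A flat dictionary with parent keys is used here only because it is easy to read; a full semantic network would also need typed relations between categories.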


Figure 3. Example of mapping terms to the categorisation framework

The mapping of terms to categories in the framework is indicated by means of the dotted arrows. From this figure, we derive that some terms (e.g., vrijstelling in Dutch and exemption in French) can point to more than one category. Also note that the term ‘zero-rated’ in the UK VAT legislation denotes another category than the ‘zero-rated’ term in the Irish VAT legislation. For further details, we refer to Kerremans et al. (2003). As will be discussed in Section 4, the categorisation framework can be further enriched with information that is culture-specific. In the case of VAT fraud, it is important for the application to know that in all legal systems involved, i.e. the Belgian, Italian and UK legal systems, there are transactions for which no VAT is required. However, this does not suggest that in all legal systems the same list of transactions applies. Differences emerge and should therefore be made explicit through, for example, a terminological analysis.

4.2 Information gathering phase

The categorisation framework as well as the requirements report assist the terminographer in searching for relevant textual material with the purpose of compiling and managing a corpus of domain-specific texts (Figure 2). After having established a first draft of the categorisation framework, texts are searched for on the Internet or are made available in electronic format through ‘Optical Character Recognition’ (OCR). Domain-experts play an essential role in the information gathering phase. They can point the terminographer to relevant domain-specific textual material or can recommend particular websites from which to retrieve texts.

In the case of the FF POIROT project, several types of texts have been considered for terminological analysis: regulations, domain-specific texts and websites. ‘Analysis’ means extraction of terms and knowledge-rich contexts (Meyer 2001) based on what appears in the knowledge specification (Section 3.2). Regulations relevant for the application against securities fraud were available in English and Italian. The law texts relevant for the application against VAT fraud were available in English, Dutch, French and Italian. Apart from these regulations, several domain-specific texts in English, French, Dutch and Italian have been analysed as well. These texts introduce and often also explain terms referring to categories which are crucial for understanding VAT fraud and securities fraud or for understanding the relevant related (sub)domains such as ‘corporate structures’ or ‘commerce’ in the case of the former and ‘stock exchange’ or ‘on-line investment fraud’ in the case of the latter. Domain-specific texts in the corpus are used as gap-fillers, as they usually provide substantial supplementary information about terms which are mentioned, with very little additional information, in the definition of a term referring to a category encountered in a regulatory text. This is for instance the case for the terms ‘bank’ and ‘undertaking’ in the definition of the term società di intermediazione mobiliare (SIM).
In the English version of the Legislative Decree 58, this term is defined as “an undertaking, other than a bank or financial intermediary entered in the register referred to in Article 107 of the Banking Law, authorized to provide investment services having its registered office and head office in Italy” (article 1.1.e). Terms like ‘bank’ and ‘undertaking’ are not defined there because they refer to categories that are considered part of a common understanding and/or because they refer to peripheral categories for which a definition can be found by consulting another source. Nevertheless, in order to capture the full meaning of a category in a formalized conceptual model of the regulatory domain, the meanings of these underspecified peripheral categories that exist in the context of the core category need to be included as well. Hence the need for domain-specific texts (Zhao et al. forthcoming).
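The gap-filling idea can be pictured with a small Python sketch that lists, for each defined term, the vocabulary items used in its definition that have no definition of their own. The function name `underspecified_terms` and the naive substring matching are assumptions of this illustration; a real terminological workflow would rely on proper term recognition and on the judgement of the terminographer.

```python
from typing import Dict, List, Set


def underspecified_terms(definitions: Dict[str, str],
                         vocabulary: Set[str]) -> Dict[str, List[str]]:
    """For every defined term, list the vocabulary terms that occur in its
    definition but are not themselves defined (candidate gaps)."""
    defined = {term.lower() for term in definitions}
    gaps: Dict[str, List[str]] = {}
    for term, text in definitions.items():
        lowered = text.lower()
        # Naive substring test, good enough for a sketch only.
        gaps[term] = sorted(
            v for v in vocabulary
            if v.lower() in lowered and v.lower() not in defined
        )
    return gaps


SIM = "società di intermediazione mobiliare (SIM)"
definitions = {
    # Definition quoted from the English version of Legislative Decree 58.
    SIM: ("an undertaking, other than a bank or financial intermediary entered "
          "in the register referred to in Article 107 of the Banking Law, "
          "authorized to provide investment services having its registered "
          "office and head office in Italy"),
}
gaps = underspecified_terms(definitions, {"bank", "undertaking"})
print(gaps[SIM])
```

The terms reported this way are the ones for which domain-specific texts would be consulted.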


English and Italian investment websites are the third type of knowledge resource. In the case of the application against securities fraud, these resources have been added to the ‘knowledge scope’ because they provide insights into the categories that are frequently referred to by investors as well as the common terms and linguistic expressions that occur in the websites. This information is particularly useful for information retrieval systems that need to search for investment websites on the Internet. Moreover, as some investment websites have been officially identified by CONSOB as cases of securities fraud, a careful comparative analysis of the terms and linguistic expressions used in legal and illegal investment websites may eventually lead to the development of an application that is able to filter out ‘suspicious’ cases from a list of investment websites.

4.3 Search phase

Once the corpus has been established, the terminographer will be able to extract from the domain-specific texts terms denoting categories in the categorisation framework as well as words or linguistic patterns indicating the intercategorial relations. The categories and relations that have been identified in each text are mapped to the categorisation framework and structured, according to the framework, in the terminological database. Given the requirements specified in the analysis phase (Section 4.1), not only the lexicalised categories and intercategorial relations, but also knowledge-rich contexts may be extracted from the corpus and immediately linked to the lexicalised categories or relations which they further specify. It is possible that the term list resulting from the search phase exceeds the boundaries defined in the categorisation framework.
For instance, if the purpose of the terminology project is to develop a multilingual terminology base presenting domain-specific categories from the Italian legislation on VAT, the terminographer’s task may be to find, if possible, the equivalents of these Italian categories in the other languages. However, in some cases the multilingual terminology base needs to cover terminology that is used in different cultural settings. An example is the carousel fraud scheme for which the initial categorisation framework set up by a domain-expert was culture-independent and human language-independent. This framework may need to be adapted as a result of the differences that may exist between actual implementations of this fraudulent scheme in different cultural settings.
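The situation in which the extracted term list exceeds the boundaries of the framework can be sketched as follows in Python. The category labels are those mentioned earlier for commercial transactions; the candidate terms, the proposed category ‘transporter’ and the function `map_terms` are invented for this illustration.

```python
from typing import Dict, Iterable, List, Tuple


def map_terms(candidates: Iterable[Tuple[str, str]],
              categories: Iterable[str]
              ) -> Tuple[Dict[str, List[str]], List[Tuple[str, str]]]:
    """Split (term, proposed category) pairs into those that fit the
    framework and those that fall outside its current boundaries."""
    mapped: Dict[str, List[str]] = {c: [] for c in categories}
    out_of_scope: List[Tuple[str, str]] = []
    for term, category in candidates:
        if category in mapped:
            mapped[category].append(term)
        else:
            # Candidate for extending the framework, after expert feedback.
            out_of_scope.append((term, category))
    return mapped, out_of_scope


CATEGORIES = ["sale", "vendor", "purchaser",
              "place of acquisition", "product", "place of supply"]

# Hypothetical output of the search phase.
candidates = [("seller", "vendor"), ("buyer", "purchaser"),
              ("carrier", "transporter")]

mapped, out_of_scope = map_terms(candidates, CATEGORIES)
print(mapped["vendor"], out_of_scope)
```

Terms in `out_of_scope` are not discarded: they are the cases for which the terminographer, together with a domain-expert, decides whether the framework needs to be adapted.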


As was discussed in Section 3, applications and users in FF POIROT have clearly determined the term selection process as well as the extraction of supplementary information that further specifies each term. With respect to the extraction of terms, it should be stressed that each of the text types collected during the information gathering phase (law texts, domain-specific texts and websites) required a redefinition of what should be considered a ‘term’. In the case of law texts, we considered a term to be any word or expression used to express a category in the legal rule that needs to be represented in the ontology of financial forensics (Section 3.2.1). This means, for example, that from the first rule in article 41 of the Legislative Decree 58 (Section 3.3), i.e. “[a]sset management companies may market investment fund units and client-by-client portfolio management services abroad”, the following terms were extracted: ‘asset management companies’, ‘investment fund units’ and ‘client-by-client portfolio management services’. This term extraction method sometimes obliged us to extract terms which one would never encounter in a traditional terminological dictionary. Consider for instance the long expression ‘15th day of the month following that during which the chargeable event occurs’, which appears as a term in Section 28D(2) of the Sixth Directive: “For the intra-Community acquisition of goods, tax shall become chargeable on the 15th day of the month following that during which the chargeable event occurs.” This linguistic expression has been identified as a term because it refers to an important category in this rule, pointing to a specific day in the month on which tax becomes chargeable. For the same reason the term ‘supply of services other than those referred to in paragraph 5’ was extracted from article 15(8) in the Sixth Directive.
This expression denotes a category in a legal rule that fundamentally differs from a supply of services referred to in article 15(5) of the same directive.

The extraction of terms from domain-specific texts is based on other term selection criteria. In contrast to a law text, the content in a domain-specific text does not have to be formally represented in the ontology of financial forensics (unless the text explains e.g., a possible fraud scheme for VAT or on-line investment). Domain-specific texts tend to play a different role in the knowledge acquisition process. As was mentioned, they introduce and often also explain terms referring to categories which are crucial for understanding VAT fraud, securities fraud and relevant related (sub)domains, or for understanding and further specifying categories mentioned in law texts with little additional information. Therefore, the term selection process should be limited to the extraction of terms denoting categories that are considered essential for the applications. More specifically, linguistic patterns found in text were added to the terminology base if they:

– denoted core-concepts in fraud schemes (e.g., ‘puppet company’)
– were essential for understanding terms denoting core-concepts in legal rules
– denoted core-concepts in fraud-related domains (e.g., ‘e-commerce’, ‘trading’, etc.)
– had been defined in the corpus
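The four selection criteria above can be recorded per candidate pattern, for instance as in the Python sketch below. The list does not state whether the criteria are alternatives or cumulative, so the sketch only reports which criteria a candidate meets and leaves the decision to the terminographer. The class `Candidate`, its field names and the assumption that the flags are set manually are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A linguistic pattern found in a domain-specific text.
    The boolean flags are assumed to be set by a terminographer."""
    pattern: str
    core_in_fraud_scheme: bool = False
    explains_legal_core_term: bool = False
    core_in_fraud_related_domain: bool = False
    defined_in_corpus: bool = False


# (attribute name, description) pairs mirroring the four criteria in the text.
CRITERIA = (
    ("core_in_fraud_scheme", "denotes a core-concept in a fraud scheme"),
    ("explains_legal_core_term",
     "is essential for understanding a term denoting a core-concept in a legal rule"),
    ("core_in_fraud_related_domain", "denotes a core-concept in a fraud-related domain"),
    ("defined_in_corpus", "has been defined in the corpus"),
)


def matched_criteria(candidate: Candidate) -> List[str]:
    """Return the descriptions of all criteria the candidate satisfies."""
    return [desc for attr, desc in CRITERIA if getattr(candidate, attr)]


puppet = Candidate("puppet company", core_in_fraud_scheme=True)
ecommerce = Candidate("e-commerce", core_in_fraud_related_domain=True)
for cand in (puppet, ecommerce):
    print(cand.pattern, "->", matched_criteria(cand))
```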

With respect to the investment websites in the corpus, it should be noted that apart from terms, no additional terminological information is extracted. What is considered a term in an investment website differs from what is considered a term in law texts and domain-specific texts. In an investment website, a term is considered a keyword introducing a category which is used by the investment solicitor in order to attract website visitors and to convince them to invest in one or several financial products. Special attention is also paid to the words (e.g., use of adjectives, expressions, etc.) these investment solicitors use when presenting their financial products.

4.4 Refinement, verification and validation phases

Figure 2 visualises the mapping of terminological information to the categorisation framework. In the ‘knowledge structuring’ pane, the results of this mapping are reflected in the terminological database. The purpose of the ‘refinement phase’ is to further complete the terminological database by, for instance, aligning those terms that are equivalent and specifying the co-texts or concordances in which terms occur as well as the reference to the source from which each co-text was extracted. The ‘verification phase’, which follows the ‘refinement phase’ in Termontography, refers to the process in which the terminology base is checked for consistency. The ‘validation phase’ is a final check to see whether the content of the terminology base really meets the requirements specified in the ‘analysis phase’.
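One possible reading of the refinement and verification phases is sketched below in Python: each terminological record groups the terms linked to one category, and a simple check reports entries whose co-text or source reference is still missing. The classes `TermEntry` and `Record`, the category label ‘day on which tax becomes chargeable’ and the function `verify` are assumptions made for this example; the text does not prescribe which consistency checks are carried out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermEntry:
    term: str
    language: str
    cotext: str = ""   # co-text or concordance in which the term occurs
    source: str = ""   # reference to the source of the co-text


@dataclass
class Record:
    """A terminological record: the terms linked to one category."""
    category: str
    entries: List[TermEntry] = field(default_factory=list)


def verify(record: Record) -> List[str]:
    """Report entries that the refinement phase has not yet completed."""
    problems: List[str] = []
    for e in record.entries:
        if not e.cotext:
            problems.append(f"{e.term!r} ({e.language}): no co-text")
        elif not e.source:
            problems.append(f"{e.term!r} ({e.language}): co-text without source reference")
    return problems


complete = Record(
    "day on which tax becomes chargeable",  # invented category label
    [TermEntry(
        "15th day of the month following that during which the chargeable event occurs",
        "en",
        cotext=("For the intra-Community acquisition of goods, tax shall become "
                "chargeable on the 15th day of the month following that during "
                "which the chargeable event occurs."),
        source="Sixth Directive, Section 28D(2)",
    )],
)
incomplete = Record(
    "transactions allowing the supplier to deduct VAT",
    [TermEntry("zero-rated", "en")],  # deliberately left unrefined
)
print(verify(complete), verify(incomplete))
```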

5. Cultural differences in multilingual terminological databases

In the domain of VAT law, the category paraphrased in English as ‘VAT deduction on copyright publications’ (Section 285bis in the French VAT legislation) only appears in the French VAT law. Moreover, in the French, Italian and Irish VAT legislations a special kind of export license, lexicalised in Italian as ‘esportatori abituali’ (Section 8c in the Italian VAT legislation), is described which does not have a counterpart in the legislations of the other EU member states. Although the Dutch term vrijstelling used in the Belgian VAT legislation and the English term ‘zero-rated’ appearing in the UK VAT legislation both refer to transactions in which a supplier has the right to deduct VAT, it does not follow that both terms cover exactly the same list of possible transactions (Section 3.2.2). Finally, the category lexicalised in English as ‘taxable event’, defined in article 10 of the Sixth Directive, is implemented differently in the legislations of the European member states (see e.g., Section 6 of the Italian VAT legislation, Section 269 of the French VAT legislation or Section 6(2) of the UK VAT legislation).

All these examples show that cultural differences may emerge from a multilingual terminological analysis and from a comparison of terms referring to the same category. Since the applications against financial fraud need to account for the differences that can arise between related categories occurring in different cultures, it may be useful to explain those differences in the multilingual terminological database. For the moment, it has not yet been decided in what format the possible degrees of correspondence between terms denoting similar categories should be presented. One may consider using controlled language, features specification in strict templates or simply a description of the degrees of correspondence in natural language similar to the ‘interconceptual relations’ (relations interconceptuelles) specification in Dancette and Réthoré’s bilingual dictionary English-French on retailing (Dancette and Réthoré 2000; Temmerman 2003).
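Purely as an illustration of the ‘strict template’ option, and not as a format adopted in the project, a degree of correspondence could be recorded as in the Python sketch below. The enumeration values, the field names and the helper `describe` are all invented here; the example pair reuses the vrijstelling / ‘zero-rated’ case, for which the text only states that identical coverage cannot be assumed.

```python
from dataclasses import dataclass
from enum import Enum


class Degree(Enum):
    """Hypothetical scale for the degree of correspondence."""
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"
    UNDETERMINED = "undetermined"


@dataclass(frozen=True)
class Correspondence:
    category: str
    term_a: str
    system_a: str
    term_b: str
    system_b: str
    degree: Degree
    note: str = ""   # free-text explanation in natural language


def describe(c: Correspondence) -> str:
    """Render the template as a one-line description."""
    text = (f"{c.term_a} ({c.system_a}) / {c.term_b} ({c.system_b}): "
            f"{c.degree.value} correspondence for '{c.category}'")
    return f"{text}; {c.note}" if c.note else text


example = Correspondence(
    category="transactions allowing the supplier to deduct VAT",
    term_a="vrijstelling", system_a="Belgian VAT legislation",
    term_b="zero-rated", system_b="UK VAT legislation",
    degree=Degree.UNDETERMINED,
    note="same category, but the lists of covered transactions may differ",
)
print(describe(example))
```

A template of this kind is easy for an application to process, whereas a natural-language description is easier for human users to read; the free-text `note` field is one way of keeping both.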

6. Conclusion

In this paper, we have discussed ongoing work with respect to how application and user constraints can largely determine the development process of a terminological database. This was shown by referring to the FF POIROT project, in which a multilingual terminological database is developed to support applications for the detection of financial fraud as well as ontology modellers throughout the development stages of an ontology covering financial forensics. In order to better integrate user and application requirements in the terminographer’s workflow, we are working out ‘Termontography’, a terminological approach in which one structures (multilingual) terminological information, retrieved from a textual corpus, according to a task-oriented framework of domain-specific knowledge. This so-called ‘categorisation framework’, developed in the first methodological step of Termontography, i.e. the analysis phase, determines to a large extent the following steps in the development of the terminology base: the compilation of a corpus in the information gathering phase, the extraction of terminological information in the search phase as well as the further refinement and validation. Note that each methodological step is supervised by domain-experts. In order to better support the approach, a software tool will be developed which will allow the user to directly map the terminological analysis to the categorisation framework and see the results of this mapping in the terminological database (Kerremans et al. 2004). We intend to do further research on the way we should present knowledge in the categorisation framework as well as on the way to describe possible meaning variations between terms linked to the same category.

Acknowledgements

This research is performed within the scope of the FFPOIROT project (http://www.ffpoirot.org). The ideas presented in this paper do not necessarily represent the joint vision of the FFPOIROT consortium.

References

Aussenac-Gilles, N., A. Condamines and S. Szulman. 2002. “Prise en compte de l’application dans la constitution de produits terminologiques.” In Actes des 2e Assises Nationales du GDR I3. 289–302. Nancy, France.
Dancette, J. and C. Réthoré. 2000. Dictionnaire analytique de la distribution. Analytical Dictionary of Retailing. Montréal: Les Presses de l’Université de Montréal.
Gamper, J., W. Nejdl and M. Wolpers. 1999. “Combining ontologies and terminologies in information systems.” In Proceedings of the 5th International Congress on Terminology and Knowledge Engineering. 152–168. Innsbruck, Austria.
Halpin, T. 2001. Information Modeling and Relational Databases. From conceptual analysis to logical design. Salt Lake City: North Face University.
Jarrar, M. and R. Meersman. 2002. “Formal ontology engineering in the DOGMA approach.” In Meersman, R. et al. (eds.). On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE; Confederated International Conferences CoopIS, DOA, and ODBASE 2002 Proceedings. 1238–1254. Berlin: Springer-Verlag.
Kerremans, K., R. Temmerman and J. Tummers. 2003. “Representing multilingual and culture-specific knowledge in a VAT regulatory ontology: Support from the termontography approach.” In Meersman, R. and T. Zahir (eds.). On the Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE; Confederated International Conferences CoopIS, DOA, and ODBASE 2003 Proceedings. 662–674. Tübingen: Springer Verlag.
Kerremans, K., R. Temmerman and J. Tummers. 2004. “Discussion on the requirements for a workbench supporting termontography.” In Proceedings of the Eleventh EURALEX International Congress. 559–570. Lorient, France.
Kingston, J., W. Vandenberghe, R. Leary and J. Zeleznikow. 2003. User Requirements Analysis for an Ontology of Financial Fraud. FF POIROT, technical report.
Meersman, R. 1999. “The use of lexicons and other computer-linguistic tools in semantics, design and cooperation of database systems.” In Proceedings of the Conference on Cooperative Database Systems (CODAS 99). 1–14. Wollongong, Australia.
Meyer, I. 2001. “Extracting knowledge-rich contexts for terminography: A conceptual and methodological framework.” In Bourigault, D., C. Jacquemin and M.-C. L’Homme (eds.). Recent Advances in Computational Terminology. 279–302. Amsterdam/Philadelphia: John Benjamins.
Moreno Ortiz, A. and H. Pérez. 2000. “Reusing the Mikrokosmos ontology for concept-based multilingual terminology databases.” In Proceedings of the 2nd International Conference on Language Resources and Evaluation. 1061–1067. Athens, Greece.
Néal, J. Le. 2001. “Preparing multi-volume illustrated terminological dictionaries.” In Wright, S. E. and G. Budin (eds.). Handbook of Terminology Management. Volume 2. Application-Oriented Terminology Management. 645–665. Amsterdam/Philadelphia: John Benjamins.
Pianta, E., L. Bentivogli and C. Girardi. 2002. “MultiWordNet. Developing an aligned multilingual database.” In Proceedings of the 1st International Conference on Global WordNet. 293–302. Mysore, India.
Spyns, P., R. Meersman and M. Jarrar. 2002. “Data modelling versus ontology engineering.” In Proceedings of the SIGMOD Record. Special Issue on Semantic Web and Data Management. 12–17. Georgia, USA.
Sure, Y. and R. Studer. 2003. “A methodology for ontology-based knowledge management.” In Davies, J., D. Fensel and F. Van Hamelen (eds.). Towards the Semantic Web. Ontology-driven knowledge management. 33–46. New York: John Wiley & Sons.
Temmerman, R. 2000. Towards New Ways of Terminology Description. The sociocognitive approach. Amsterdam: John Benjamins.
Temmerman, R. 2003. “Innovative methods in specialised lexicography.” Terminology 9(1), 117–135.
Temmerman, R. and K. Kerremans. 2003. “Termontography: Ontology building and the sociocognitive approach to terminology description.” In Proceedings of the International Congress of Linguists (CIL17). Prague, Czech Republic.
Zhao, G., J. Kingston, K. Kerremans, R. Verlinden, F. Coppens, R. Temmerman and R. Meersman. 2004. “Engineering an ontology of financial securities fraud.” In Meersman, R. et al. (eds.). OTM 2004 Workshops: OTM Confederated International Workshops and Posters, GADA, JTRES, MIOS, WORM, WOSE, PhDS, and INTEROP 2004. 605–620. Heidelberg: Springer Verlag.


Authors’ addresses

Koen Kerremans
Erasmushogeschool Brussel
Departement Toegepaste Taalkunde
Centrum voor Vaktaal en Communicatie (http://cvc.ehb.be)
Trierstraat 84
B–1040 Brussel
Belgium
[email protected]

Rita Temmerman
Erasmushogeschool Brussel
Departement Toegepaste Taalkunde
Centrum voor Vaktaal en Communicatie (http://cvc.ehb.be)
Trierstraat 84
B–1040 Brussel
Belgium
[email protected]

Isabelle Desmeytere
VAT Applications NV (http://www.vatat.com)
O. L.Vrouwstraat 6 b4
B–1850 Grimbergen
Belgium
[email protected]

Patrick Wille
VAT Applications NV (http://www.vatat.com)
O. L.Vrouwstraat 6 b4
B–1850 Grimbergen
Belgium
[email protected]

About the authors

Koen Kerremans obtained his degree in Germanic philology from the University of Antwerp and his Master’s in computational linguistics from the University of Ghent. He works as a researcher for the Centrum voor Vaktaal en Communicatie (CVC) and is currently involved in the European FF POIROT project. He is also a lecturer at the Applied Linguistics Department of the Erasmushogeschool Brussel.

Isabelle Desmeytere studied Applied Economic Sciences at the University of Ghent. She started work as a research fellow in the tax department of the University of Ghent and obtained a Master’s degree in Taxation at the Vlerick Ghent – Leuven Management School and the Faculty of Economics and Business of the University of Ghent. She is Managing Director of VAT Forum CV, director of VAT Applications NV and author of several articles on VAT issues.

Rita Temmerman is co-ordinator of the Centrum voor Vaktaal en Communicatie (CVC) and professor at the Applied Linguistics Department of the Erasmushogeschool Brussel. She obtained her degree in Germanic Philology from the University of Antwerp, her Master’s in Translation from the State University of New York (USA) and her PhD in Linguistics from the University of Leuven. In 2000, she published Towards New Ways of Terminology Description. The Sociocognitive Approach.

Patrick Wille graduated from Ghent University as a licentiate in Business Economics and obtained a degree in fiscal sciences from the Fiscale Hogeschool. He is Managing Director of The VAT House, VAT Forum CV and VAT Applications NV. He is a lecturer at the Universities of Brussels and Ghent and at the Fiscale Hogeschool. He is a frequent speaker at Belgian and European VAT seminars and is author and co-author of various books and articles in taxation magazines and newspapers.
