Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3
Thesis by Jan Hendrik Kroeze (21345393)
Submitted in fulfilment of the requirements for the degree Philosophiae Doctor (Information Technology)
in the School of Information Technology
in the Faculty of Engineering, Built Environment and Information Technology
University of Pretoria
Pretoria
Promoter: Prof. Dr. T.J.D. Bothma
Co-promoter: Dr. M.C. Matthee
April 2008
Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3
JH Kroeze
Abstract
The thesis discusses a series of related techniques that prepare and transform raw linguistic data for advanced processing in order to unveil hidden grammatical patterns. A threedimensional array is identified as a suitable data structure to build a data cube that captures multidimensional linguistic data in a computer's temporary storage facility. It also enables online analytical processing operations, such as slicing, to be executed on this data cube in order to reveal various subsets and presentations of the data. XML is investigated as a suitable mark-up language to store such an exploitable databank of Biblical Hebrew linguistic data permanently. This concept is illustrated by tagging a phonetic transcription of Genesis 1:1-2:3 on various linguistic levels and by manipulating this databank. Transferring the data set between an XML file and a threedimensional array creates a stable environment that allows editing and advanced processing of the data in order to confirm existing knowledge or to mine for new, yet undiscovered, linguistic features. Two experiments are executed to demonstrate possible text-mining procedures. Finally, visualisation is discussed as a technique that enhances interaction between the human researcher and the computerised technologies supporting the process of knowledge creation. Although the data set is very small, there are exciting indications that the compilation and analysis of aggregate linguistic data may assist linguists to perform rigorous research, for example regarding the definitions of semantic functions and the mapping of these functions onto the syntactic module.
Summary
Title: Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3
Candidate: Jan Hendrik Kroeze (1958)
Promoter: Prof. Dr. T.J.D. Bothma
Co-promoter: Dr. M.C. Matthee
Departments: Information Science; Informatics
School: School of Information Technology
Degree: Philosophiae Doctor (Information Technology)
The thesis touches on various sub-disciplines of computational linguistics: it investigates the use of XML tagging to capture linguistic categories in the Hebrew text of Genesis 1:1-2:3 and to construct a threedimensional databank; the use of string-processing algorithms to round-trip the data between the databank and a computer program; the use of array processing to explore the semantic patterns hidden in the marked-up text; and the use of graphical visualisation to investigate the mapping of semantic and syntactic functions. The thesis hopes to make a contribution by demonstrating the rigour enforced by the application of data-warehousing and data-mining concepts to a linguistic databank. It proposes a macro-structure that may be used in future to package and integrate multidimensional linguistic data.
Chapter 2 experiments with a threedimensional data structure, using Visual Basic 6, and finds that a threedimensional array could be used to represent inherently multidimensional linguistic data regarding Biblical Hebrew clauses. Various layers of linguistic knowledge can be integrated by stacking various modules of analysis onto each other.
Chapter 3 explores data warehousing and online analytical processing concepts to find ways to render meaningful subsets of linguistic data stored in a threedimensional array. Concepts like slicing and dicing are adjusted to make them useful for the processing of linguistic data.
Chapter 4 tries to find a more elegant solution for the permanent storage of the databank using XML technology. Due to its flexibility, XML is chosen to build a text-based databank. The experiment indicates that XML is indeed a very suitable mark-up technology that can be used to store the linguistic data permanently in a separate databank, because it allows users to create their own tag sets, which may simulate a multidimensional database structure.
Chapter 5 investigates round-tripping in order to satisfy the requirement of finding a stable platform for the data, while also allowing editing and advanced processing of the data. In addition, various viewing and searching functions are discussed. Create, update and delete functionalities are added to enable users to populate and edit the clause cube while it is in the array state and to save these updates both in RAM and in permanent XML-formatted storage.
Chapter 6 focuses on the benefits of text data mining facilitated by the preceding technologies. Some data-mining concepts are applied in two experiments by aggregating aspects of the semantic and syntactic modules tagged in Genesis 1:1-2:3. Computer-assisted explorations of the semantic and syntactic data captured in the XML database illustrate the rigour enforced by such a text-mining venture.
In Chapter 7 projects are suggested (one of which is implemented) that could use the XML-based data cube of Genesis 1:1-2:3 in visualisation ventures to clearly show linguistic patterns uncovered by means of a computer program. These techniques may be used to create user-friendly interfaces that may facilitate easier and more intuitive mining of linguistic data.
Keywords

The following keywords represent the most important aspects covered in the thesis:
• Threedimensional array
• Online analytical processing (OLAP)
• XML
• Round-tripping
• Database management
• Data warehousing
• Text-data mining
• Computational linguistics
• Visualisation
• Hebrew Bible
Acknowledgements
I would like to thank the following people:
• My supervisors, Prof. Theo Bothma and Dr. Machdel Matthee, for your excellent supervision and feedback on the related papers and articles produced during the execution of the doctoral project
• The Head and Acting Head of the Department of Informatics, Prof. Carina de Villiers and Prof. Trish Alexander, the Dean of the EBIT Faculty, Prof. Roelf Sandenbergh, as well as the Department's management committee, for granting me various shorter and longer periods of study and research leave to work on the research project, and for subsidising various overseas trips to read the papers that developed parallel to the chapters of this thesis
• Mr. Danie Malan of the University of Pretoria's library, who helped me with literature searches and to find sources that were not always easy to get hold of
• All the editors and peer reviewers, as well as other colleagues, who assisted me with advice, help and comments that helped to steer the research in the right direction
• Ms. Mariëtte Postma for her excellent language editing
• My wife, Irma, for allowing me and assisting me to embark, for the second time, on the lengthy process of writing a doctoral thesis, and for helping to get aspects regarding the philosophy of science in the introduction right
• My son, Jan, for writing the Java program discussed in Chapter 7 as an example of the graphical visualisation of linguistic data
• My daughter, Christien, for creating the beautiful graphics of threedimensional data cubes, used as illustrations in Chapters 2 and 3
• My parents, Jan and Clasie Kroeze, and my mother-in-law, Babs Jansen van Rensenburg, as well as my other family, in-laws and friends, for your continued support and encouragement
Declaration of originality

I declare that this thesis, Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3, is my own work, that all the sources that I have used or quoted have been indicated and acknowledged by means of complete references, and that it has not been submitted for a degree at another university.

J.H. Kroeze
Table of contents

Chapter 1: Introduction
  1.1 Introduction
  1.2 Background: forty years of Biblical Hebrew information systems
    1.2.1 Levels of analysis
    1.2.2 Underutilisation of existing tools
    1.2.3 Integration as a solution to enhance utilisation
    1.2.4 Visualisation and flexibility
  1.3 Problem statement
  1.4 Research questions
    1.4.1 Main research question
    1.4.2 Secondary research questions
  1.5 Hypothesis
  1.6 Positioning of Linguistic Information Systems within a research discipline
  1.7 Research style and methodology
  1.8 Research plan
  1.9 Structure of the thesis
  1.10 Contribution to the field of ICT
  1.11 Definition of terms
  1.12 Conclusion

Chapter 2: Towards a multidimensional linguistic database of Biblical Hebrew
  2.1 Introduction
  2.2 The need for integration
  2.3 A clause cube as the ideal data structure
  2.4 Implementing the clause cube in cyber space
  2.5 Building and using a multidimensional database for Biblical Hebrew
  2.6 Conclusion

Chapter 3: Slicing and dicing the clause cube
  3.1 Introduction
  3.2 Using a data cube to integrate complex sets of linguistic data
  3.3 Processing the information in a clause cube
    3.3.1 Rotation
    3.3.2 Slicing
    3.3.3 Dicing
    3.3.4 Searching
  3.4 Application: slicing and dicing the Genesis 1:1-2:3 clause cube
  3.5 A comparison of data-cube concepts and clause-cube concepts
  3.6 Conclusion

Chapter 4: Building and displaying the clause cube using XML
  4.1 Introduction
  4.2 Linguistic databases and computational linguistics
  4.3 Linguistic layers
  4.4 The phrase as basic building block of the database structure
  4.5 Why should XML be explored as an option to build an exploitable database of linguistic data?
    4.5.1 Why is XML suitable for implementing a database?
    4.5.2 Why is XML suitable for linguistic data?
    4.5.3 Why is XML suitable for data exploration?
    4.5.4 What are the disadvantages of XML?
  4.6 How can XML be used to build an exploitable linguistic data cube?
  4.7 How can XML represent the syntactic and semantic analyses of free text?
  4.8 How can XML represent inherently multidimensional data?
  4.9 The structure of the Genesis 1:1-2:3 database in XML
  4.10 Critical discussion of the XML clause cube implementation
  4.11 Validating the XML document
  4.12 Viewing the XML file in a web browser
  4.13 Conclusion

Chapter 5: Conversion of the Genesis 1:1-2:3 linguistic data between the XML database and the array in Visual Basic
  5.1 Introduction
  5.2 Conversion between VB6 and XML (round-tripping)
    5.2.1 From XML to VB6
    5.2.2 From VB6 to XML
  5.3 Editing the data in the clause cube
  5.4 Conclusion

Chapter 6: Advanced exploration of the clause cube
  6.1 Introduction
  6.2 The need for advanced, computer-assisted exploration of linguistic data
  6.3 Text mining as a knowledge-invention venture
  6.4 Extraction and analysis of semantic role frameworks as a venture in text data mining
    6.4.1 Slicing off the semantic functions layer from the data cube
    6.4.2 Sorting the elements in each row
    6.4.3 Concatenation of each row
    6.4.4 Ordering the rows as units with reference to each other
    6.4.5 Identifying and counting unique semantic frameworks
  6.5 Analysis of the mapping of syntactic and semantic functions as another linguistic data-mining venture
  6.6 Conclusion

Chapter 7: Visualisation of the Biblical Hebrew linguistic data in the XML cube
  7.1 Introduction
  7.2 What is visualisation?
  7.3 Various approaches of visualisation
    7.3.1 Text-based visualisation
    7.3.2 Graphical visualisation tools
  7.4 The purpose of visualisation
  7.5 Requirements of visualisation tools
  7.6 XML's suitability for visualisation
  7.7 Some examples of visualisation of linguistic data
  7.8 Application: a graphical topic map of semantic and syntactic mappings
  7.9 Conclusion

Chapter 8: Conclusion
  8.1 Introduction
  8.2 Summary of thesis contents
  8.3 Revisiting the research questions
  8.4 Future research

Bibliography

Addenda (files on CD)
  Addendum A: Linguistic data regarding Gen. 1:1-2:3 represented by a threedimensional array in Visual Basic 6 (AddendumA_DatabankModule_20080411_Fin.pdf)
  Addendum B: Source code of Chapter 3 (AddendumB_SourceCode_Chapter3_20080411_Fin.pdf; program: Gen1Version15.exe)
  Addendum C: Phonetic transcription of the Hebrew alphabet (AddendumC_PhoneticTranscription_20080411_Fin.pdf)
  Addendum D: Identification of phrase types (AddendumD_WordGroups_20080411_Fin.pdf)
  Addendum E: Syntactic functions (AddendumE_SyntacticFunctions_20080411_Fin.pdf)
  Addendum F: Semantic functions (AddendumF_SemanticFunctions_20080411_Fin.pdf)
  Addendum G: Clause cube schema (AddendumG_ClauseCubeSchema_20080411_Fin.pdf; schema: Gen1_InputV15.xsd)
  Addendum H: XML clause cube (AddendumH_XMLClauseCube_20080411_Fin.pdf; databank: Gen1_InputV15.xml)
  Addendum I: XML clause cube corrected (AddendumI_XMLClauseCube_Corrected_20080411_Fin.pdf; databank: Gen1_InputV15b.xml)
  Addendum J: XML clause cube style sheet (AddendumJ_XMLClauseCube_StyleSheet_20080411_Fin.pdf; style sheet: Gen1XMLdb03c.css)
  Addendum K: Viewing the clause cube in a browser (AddendumK_ViewingClauseCubeInBrowser_20080411_Fin.pdf)
  Addendum L: Source code of Chapter 5 (AddendumL_SourceCode_Chapter5_20080411_Fin.pdf; program: Gen1_XML_VB6_CRUD_Beta15_Ch5.exe; databank: Gen1_InputV15_RT1.xml)
  Addendum M: Source code of Chapter 6 (AddendumM_SourceCode_Chapter6_20080411_Fin.pdf; program: Gen1_XML_VB6_CRUD_SemFSynF_Ch6.exe; databank: Gen1_InputV15_RT1.xml)
  Addendum N: Visualisation program (graphical topic map) (program: semantics.bat)
List of Figures

Figure 1.1. An example of WIVU's morphological and syntactic information, which is interactively available on the web (www.th.vu.nl/~wiweb/const/index.htm).
Figure 2.1. An excerpt of the Lexham Hebrew-English Interlinear Bible (http://www.logos.com/products/details/2055).
Figure 2.2. An interlinear analysis of Jonah 1:1a in an HTML table format (Kroeze, 2002).
Figure 2.3. A threedimensional data structure that consists of a set of 27 sub-cubes arranged according to three rows, three columns and three depth layers.
Figure 2.4. The knowledge that is represented by a collection of interlinear tables, rendered threedimensionally as a clause cube consisting of layers of clauses and analyses stacked onto each other.
Figure 2.5a. Gen. 1:1a analysed according to phrases and linguistic levels.
Figure 2.5b. Gen. 1:4c analysed according to phrases and linguistic levels.
Figure 2.5c. Gen. 1:5a analysed according to phrases and linguistic levels.
Figure 2.6. A clause cube (orthographic view) containing real linguistic data of three BH clauses, Gen. 1:1a, 4c and 5a.
Figure 2.7a. Revealing data contained in the second slice of the cube by removing the top slice.
Figure 2.7b. Revealing data contained in the bottom slice of the cube by removing the top and middle slices.
Figure 2.8. A part of the code that creates a threedimensional array and populates it with the selected layers of linguistic data.
Figure 2.9. A representation of a hierarchical syntactic structure using various members of the same dimension and by allowing measures to occupy more than one cell of a member.
Figure 3.1. A series of twodimensional tables, each containing a multidimensional linguistic analysis of one clause.
Figure 3.2. A threedimensional clause cube.
Figure 3.3. Information revealed on the front side of the clause cube.
Figure 3.4. Information revealed on the top side of the clause cube.
Figure 3.5. Information revealed on the bottom side of the clause cube.
Figure 3.6. Information revealed on the right side of the clause cube.
Figure 3.7. Information revealed on the left side of the clause cube.
Figure 3.8. Information revealed by rotating the clause cube 180 degrees in a clockwise manner.
Figure 3.9. Information revealed by rotating the clause cube 180 degrees head over heels.
Figure 3.10. The top slice of the data cube, revealing the multi-modular analysis of the first clause.
Figure 3.11. The middle slice of the data cube, revealing the multi-modular analysis of the second clause.
Figure 3.12. The bottom slice of the data cube, revealing the multi-modular analysis of the third clause.
Figure 3.13. Information revealed by slicing off the first three planes from the front side of the clause cube.
Figure 3.14. Information revealed by slicing off the first four planes from the front side of the clause cube.
Figure 3.15. Information revealed by slicing off the first two planes from the front side of the clause cube.
Figure 3.16. Information revealed by slicing off the first plane from the front side of the clause cube.
Figure 3.17. A slice of the Genesis 1:1-2:3 clause cube that reveals the multi-modular analysis of Gen. 1:17a-18a (one clause spanning two verses).
Figure 3.18. The clause cube searched on a specific parameter.
Figure 3.19. Using the "Scroll through slice of syntactic frameworks" button.
Figure 3.20. Using the "Scroll through slice of semantic frameworks" button.
Figure 4.1. The hierarchy of the Genesis 1:1-2:3 clause cube as reflected by its XML implementation.
Figure 4.2. An example of an XML schema used to annotate text (Witt et al., 2005: 105).
Figure 4.3. The basic structure of the XML database of Genesis 1:1-2:3.
Figure 4.4. Two populated clause elements in the XML database.
Figure 4.5. The XML Schema used to validate the XML database of Genesis 1:1-2:3.
Figure 4.6. The XML style sheet used to display the XML clause cube as a series of twodimensional tables in the Firefox or Opera web browser.
Figure 4.7. The first two clauses of the XML clause cube as displayed in the Firefox web browser as two twodimensional tables.
Figure 5.1. An extract of the Genesis 1:1-2:3 XML clause cube, which is representative of the hierarchy and structure of the file.
Figure 5.2. VB6 code that could be used to create a threedimensional array and populate one clause element with several layers of linguistic data.
Figure 5.3. VB6 code used to convert linguistic data from XML format into a threedimensional array.
Figure 5.4. Example of VB6 code that could be used to validate syntactic function elements during the array state.
Figure 5.5. VB6 code used to display one clause's linguistic analysis in a series of textboxes and labels on the interface.
Figure 5.6. The end-result after converting data from the XML clause cube into a threedimensional array in VB6.
Figure 5.7. VB6 code used to scroll through the clause cube data.
Figure 5.8. VB6 code used to display a required clause using its array index.
Figure 5.9. VB6 code used to perform exact searches.
Figure 5.10. VB6 code used to perform searches on parts of strings.
Figure 5.11. VB6 code used to save clause cube data from the threedimensional array into permanent XML-formatted storage.
Figure 5.12. The VB6 code used to make space for a new clause record to precede the current one.
Figure 5.13. The VB6 code used to make space for a new clause record to follow the current one.
Figure 5.14. The VB6 code used to save new or edited clause data to the RAM.
Figure 5.15. The VB6 code used to delete a clause record.
Figure 6.1. The knowledge creation paradigm (adapted from Cannataro et al., 2002: 34).
Figure 6.2. The interface of the completed program with buttons added to direct the user to the data-mining experiments.
Figure 6.3. The interface of the experiment that analyses semantic frameworks in the clause cube.
Figure 6.4. VB6 code used to slice off the module representative of the semantic role frameworks, while validating and numbering the semantic functions.
Figure 6.5. VB6 code used to display the semantic functions slice in a textbox.
Figure 6.6. The semantic frames of the 108 extracted, validated and numbered clauses, shown in a textbox.
Figure 6.7. VB6 code used to sort the semantic functions in each row.
Figure 6.8. The logically-ordered semantic role frameworks of the 108 clauses.
Figure 6.9. Code used to concatenate the semantic functions of each clause into a unit.
Figure 6.10. Concatenated semantic role frameworks of the 108 clauses.
Figure 6.11. VB6 code used to order the rows, representing individual semantic role frames, as units with reference to each other.
Figure 6.12. Semantic role frames of all clauses ordered logically with regard to each other.
Figure 6.13. VB6 code used to identify and count unique semantic role frames.
Figure 6.14. The final results of the semantic role framework experiment.
Figure 6.15. The interface of the experiment that analyses the mapping of semantic functions on syntactic functions.
Figure 6.16. VB6 code used to validate and number the syntactic function slice.
Figure 6.17. The syntactic function slice represented by a logical numbering system.
Figure 6.18. VB6 code used to validate and number the semantic function slice.
Figure 6.19. The semantic function slice represented by a logical numbering system.
Figure 6.20. VB6 code used to test all possible combinations of semantic and syntactic functions.
Figure 6.21. Initial, human-unfriendly results of the algorithm that tests for all possible combinations of semantic and syntactic functions in the data set.
Figure 6.22. VB6 code used to build and display a human-readable list of combinations of semantic and syntactic functions.
Figure 6.23. Human-readable list of all possible combinations of syntactic and semantic mappings that may appear in Genesis 1:1-2:3.
Figure 6.24. VB6 code used to prepare and show human-readable output, listing all occurring mappings of semantic and syntactic functions and their frequencies.
Figure 6.25. Final results of the experiment that analyses the mapping of semantic and syntactic functions in Genesis 1:1-2:3.
Figure 7.1. A twodimensional visualisation of a sentence's analysis on various levels (Petersen, 2004b).
Figure 7.2. Proposal for a graphical-spatial topic map showing associations between some semantic and syntactic functions based on co-locations in phrases, based on an idea for literary analysis by Bradley (2003: 198).
Figure 7.3. A proposal for a graphical visualisation of a network of semantic frameworks in Genesis 1:1-2:3 containing purpose as one of their elements (based on an idea for lexical visualisation by Manning et al., 2001: 139).
Figure 7.4. Topic map of all phrases' syntactic and semantic functions as marked up in Genesis 1:1-2:3.
Figure 7.5. A textual representation of the phrases in the database, viewable in the visualisation program "semantics.bat".
Figure 7.6. Interface used to define and fine-tune filters in the visualisation tool.
Figure 7.7. A screen shot of a visualisation of the network linking the semantic function of product to the syntactic functions of either complement or object, as found in various clauses in the dataset.
Figure 7.8. Updated graph showing the network linking the semantic functions of product and patient to the syntactic functions of complement and object, as found in various clauses in the dataset.
Chapter 1: Introduction [1]
1.1 Introduction

This chapter discusses important background issues of the thesis, such as the problem statement, research questions and hypothesis. The setting of Linguistic Information Systems [2] within the philosophy of science is also highlighted. The relation of this project to other Biblical information systems is also touched upon. The history of the creation of the thesis is discussed because it is important to understand the unique structure and set-up, which does not conform to the traditional thesis format. Following the discussion of the research approach, methodology and plan, an outline of the various chapters is given. The contribution to the field of Information and Communication Technologies (ICT), which the candidate hopes to make, is indicated. Finally, a short list of working definitions of technical terms is provided.

[1] Sections on the philosophical nature and place of Linguistic Information Systems have been published as an article ("Linguistic information [systems] - a humanistic endeavour") in Innovate (see Kroeze, 2007a).
[2] When referring to software, information system(s) is spelled with small letters. The name of the scientific discipline of Information Systems (or Informatics, as the discipline is often called in South Africa) is spelled with initial capital letters.
1.2 Background: forty years of Biblical Hebrew information systems

To write a complete historical overview of the development of Biblical information systems in a section of an introductory chapter is an impossible task. Twenty years ago a whole book was already written on the early years of this new discipline (see Hughes, 1987). More recent updates are available in Poswick (2004) and Tov (2003 & 2006). The author will, therefore, rather highlight the major trends and research topics that received attention in the past. Deficiencies in existing products will also be indicated, because they reveal the opportunities for future research that could build on the amazing body of work that has already been done.
1.2.1 Levels of analysis

Biblical Hebrew grammar can be and has been studied from many different angles. Over the past forty years much of this knowledge has been captured in various computer software systems and databases. The most basic levels are the digital representation of the text in Hebrew characters and the transliteration level, which is an exact representation in the Roman alphabet, if needed. The transliteration may be used to rebuild the text in the Hebrew alphabet, while a separate phonological transcription could embody the pronunciation of the Hebrew text, although the letters or signs used do not correspond exactly to the Hebrew spelling. These basic levels are followed by the morphological, morpho-syntactic and syntactic levels. The more advanced linguistic modules, such as the semantic and pragmatic levels, have received less attention, and one can only hope that knowledge databases and expert systems that deal with these levels will become more readily available. Bothma (1992a) indicates various levels of grammatical information that should, ideally, be available in Biblical information systems, i.e. "[d]escriptions of phonetic, morphological, morphosyntactic, syntactic and semantic phenomena".
There are currently fourteen software tools containing the Codex Leningradensis version of the Hebrew Bible and one tool containing the Aleppo Codex (Tov, 2006: 343). Eleven of these tools offer morphological analyses (Tov, 2006: 356). Only a select few contain syntactic data, for example the database of the Werkgroep Informatica of the Free University in Amsterdam (WIVU). In addition to these grammatically oriented software tools, ancient and modern translations are available, as well as critical apparatuses and other tools such as dictionaries and even one reference grammar (Tov, 2006).
According to Tov (2006: 346) "[t]he most sophisticated programs ... allow for the search of morphological features ... and also the search for combinations of lexical and grammatical information". Five main groups of morphological analysis exist (these analyses are subjective and do not agree entirely), i.e.:
• Westminster Hebrew Old Testament Morphology (Groves-Wheeler)
• Werkgroep Informatica (Talstra)
• Bar-Ilan analysis
• Academy of the Hebrew Language
• Additional commercial and private morphological analyses (ibid.)
The latest version of the Hebrew Old Testament linguistic database, developed over the past three decades by the Werkgroep Informatica at the Free University in Amsterdam (WIVU), has been included in the Stuttgart Electronic Study Bible (SESB), which has been published on the Libronix/Logos platform (Deutsche Bibelgesellschaft: www.SESB-ONLINE.com, 2001; Stuttgart Electronic Study Bible (SESB) Version 2.0 Logos, 2002-2008). This tool allows researchers to perform advanced syntactic queries, for example, to find examples of clauses having a conjunction and proper name as subject preceding an imperfect verb (Talstra, 2007: 93). According to Talstra (2007: 96), "[t]he search for syntactic data offers one way to get a better handle on the function of ordinary and extraordinary constructions in a literary composition" in order to discover grammatical tensions purposefully built into a literary text. The search engine operates mainly on formalistic characteristics and the researcher needs to "translate"/break down his/her query into these terms (cf. Talstra, 2007: 91, 93, 95), but the SESB provides a user interface allowing users to use buttons and checkboxes to select various "combinations of syntactic or grammatical features and functional categories" on word, phrase or clause level (Kummerow, 2005: 2-3). Although Gómez (2004) regards the search function as "one of the jewels of the crown" of the SESB, and although plain morphological searches are simple to execute, "there is a steep learning curve to overcome" for more complex queries. The search engine was developed in cooperation with Prof. C. Hardmeier of the University of Greifswald and Prof. A. Groves of Philadelphia, Pennsylvania (Ernst-Moritz-Arndt-Universität Greifswald, Forschungsschwerpunkt, Computergestützte Philologie und Bild-Erschliessung, 2001).

The WIVU team is currently working on a more advanced version of the database that captures pragmatic information. The researchers who are involved in this project claim that the tool will be "a powerful linguistic instrument for the research into the languages of the Old Testament" (www.th.vu.nl/~wiweb/const/quest2.htm). The ability of the tool to produce phrase parsing, clause-level parsing and clause hierarchies distinguishes it from other Biblical Hebrew information systems. There is, however, not a clear distinction between the various linguistic modules in the tagging system.
Another project that offers analysis on an advanced syntactic level is the private database developed by Andersen & Forbes. Their project analyses the elements of each clause down to the most atomic elements (morphemes). Syntactic information and structures are presented as horizontal trees. Andersen & Forbes (2003) trust that their proposal will make a contribution to the field of Biblical Hebrew linguistic information systems by moving beyond the limits of single clauses. Although their work is without any doubt very useful for the study of syntax, it neither differentiates clearly between the various linguistic modules nor facilitates multi-modular linguistic studies. In addition, one needs extensive knowledge of the symbols used to make sense of the myriad of labels that tag the nodes and leaves of their representations. Their tagging of semantics is limited to word level.
1.2.2 Underutilisation of existing tools

Poswick (2004) gives an overview of Biblical information system projects of the period 1985-2004. His impression is that, although various tools provide morphological analyses and even other levels of analysis, "classical Biblical exegesis would not appear to be benefiting as yet from the results of this type of analysis". Tov (2006: 337) agrees with Poswick that Biblical scholars still do not make optimal use of these tools. This may be due to the fact that scholars have been so focused on creating and improving the tools themselves that they have not yet maximised in-depth exploration of the huge amounts of data that have been made available by these tools (Tov, 2006: 338).
However, one has to note that many exegetical articles have been produced as a result of the Werkgroep Informatica's databases (cf. the (outdated, incomplete) bibliography available at http://www.th.vu.nl/~wiweb/cons/publicaitions.htm). A recent example of the use of their database of syntactic hierarchies may be found in Talstra (2006: 231-232), where it is used to investigate the use of yiqtol verbs in narrative prose found in Exodus, shedding new light on the exegesis of sentences where these verbal forms follow wayyiqtol forms. In her review of the SESB, which includes this database, Conybeare (2005) states that "[t]he student who was most excited by the possibilities of the SESB was the one most closely engaged already with biblical exegesis". In order to fully exploit this tool, a user would probably have to make a careful study of the "fascinating essay" in the manual that explains "how the Hebrew text was analyzed to facilitate more complex syntactical searches" (ibid.).
A feeling of information overload may be another reason for the underutilisation of Biblical information systems. Claassen & Bothma (1988: 83) highlight the problem of (electronic) information overload that already existed in Biblical research twenty years ago. According to Bothma (1992b) hypermedia may be used to minimise problems of information overload, because the network of hyperlinks allows the user to access only relevant information.
A third reason for the underutilisation of these systems may be the lack of ease of use. According to Tov (2006: 338) Biblical research software is under-used because "[t]here remains a wide gap between the knowledge of the experts creating the tools and that of the scholars for whom the tools are intended". Indeed, many of these tools are not easy to use and are in desperate need of user-friendly interfaces. The Werkgroep Informatica has already started a pilot project to make the contents of their database available in a familiar web format (www.th.vu.nl/~wiweb/const/index.htm) – see Figure 1.1 below for an example. Regrettably, only five chapters of the Old Testament are available in this format, providing information on morphological and syntactic levels. Subordinate clauses are indicated by means of indentation. The presentation of the data is done by means of Unicode and HTML (www.th.vu.nl/~wiweb/html/learn.htm).
Figure 1.1. An example of WIVU's morphological and syntactic information, which is interactively available on the web (www.th.vu.nl/~wiweb/const/index.htm).
This thesis attempts to address the problem of underutilisation by giving pointers for the integration, advanced processing and mining of sets of linguistic data. The author believes that easily understandable visual presentations of the data, and of patterns in the data, could enhance the usage of Biblical information systems.
1.2.3 Integration as a solution to enhance utilisation

Having various electronic aids for the study of the Hebrew Bible is wonderful, but also overwhelming and even frustrating, due to the fact that various tools have to be used to study different levels and to get various perspectives. Bothma (1990) proposes the use of integrated Biblical information systems to enhance the process of computer-based education and to solve the problem of Biblical languages that are often studied in isolation. These systems should integrate introductory grammars, reference grammars, sources on the cultural background of the Bible and research databases. Various levels of granularity of data should be available for users with different levels of knowledge and requirements. Poswick (2004) also indicated the use of hypermedia to take Biblical research to a new level, "from the accumulation of electronic texts to the construction of hyper-textual links between them with all the cultural data which permit their interpretation".
Systems have been suggested, and at least one has already been developed, to display multi-level analyses of Hebrew clauses, integrating the various dimensions of clausal analysis in an interlinear table format on one screen, for example the Lexham Hebrew-English Interlinear Bible (Van der Merwe, 2005). These tables resemble those found in relational databases, and this gives birth to the wish of being able to do ad hoc queries on the stored data. However, these tables cannot simply be transformed into relational database tables, because there is a separate table for each record (or clause) and the rows do not represent unique records. A closer inspection reveals that the rows actually represent various dimensions or levels of data analysis that are strongly linked to the elements in the upper row. This type of interlinear table is in fact a twodimensional representation of three- (or multi-) dimensional linguistic data structures. Bothma (1992a) proposed and successfully tested the use of SGML, of which XML is a derivative subset, to provide a platform-independent databank of the linguistic and other related data.
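To make this multidimensionality concrete before it is worked out in Chapter 2, the following minimal Visual Basic 6 sketch stacks such interlinear tables into a single threedimensional "clause cube". It is an illustration only, not the thesis's actual databank module (cf. Figure 2.8 and Addendum A): the array bounds, the ordering of the dimensions and the sample values are assumptions made for the example (only the count of 108 clauses is taken from the Chapter 6 experiments).

    ' Illustrative sketch only: bounds, dimension order and values are assumed.
    ' Dimension 1 = clause, dimension 2 = analysis layer, dimension 3 = phrase slot.
    Dim ClauseCube(1 To 108, 1 To 3, 1 To 10) As String

    ' Populate the three layers of the first phrase slot of clause 1
    ' (hypothetical values and labels for Gen. 1:1a).
    ClauseCube(1, 1, 1) = "bere'shit"   ' layer 1: phonetic transcription
    ClauseCube(1, 2, 1) = "adjunct"     ' layer 2: syntactic function
    ClauseCube(1, 3, 1) = "time"        ' layer 3: semantic function

    ' Read back one interlinear "table": all layers and phrase slots of clause 1.
    Dim layer As Integer, phrase As Integer
    For layer = 1 To 3
        For phrase = 1 To 10
            Debug.Print ClauseCube(1, layer, phrase) & " ";
        Next phrase
        Debug.Print   ' start a new line for the next layer
    Next layer

In such a structure each clause is one horizontal slice of the cube and each analysis layer one depth plane, which is what makes operations such as the rotation, slicing and dicing of Chapter 3 possible.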
This thesis addresses the need, expressed by Bothma (1992b), for syntactic and semantic databases of Biblical Hebrew. Such databases may enhance grammatical research because "manual searching for complex syntactic examples is extremely difficult and inadequate in that retrieved information is very often incomplete due to the size of the corpora of texts" (Bothma, 1992b: 340). Although various syntactic databases have become available, the author is not aware of any databases containing a separate module of semantic functions that may help users to understand the logical relations between the constituents of clauses and sentences.
The XML data structure suggested in this thesis may contribute to finding an "appropriate information model for presenting Biblical information in an electronic form", with reference to integrating and storing information from various linguistic modules (cf. Bothma, 1992b: 345).
The advantages of XML, however, are not limited to the creation of a database structure. According to Van der Merwe (1995: 419) the purpose of an electronic reference grammar "plays a major role in determining its structure and content". The extensibility and adaptability of advanced mark-up languages such as XML make them ideal to implement a custom-made macro-structure, which should, for example, fulfil the following requirements: "An electronic BH [reference grammar – JHK] should serve as a cheap up-to-date, as well as updateable, source of easily retrievable information on BH for readers of the BH text of the OT. These readers may have various degrees of receptive competency of BH" (Van der Merwe, 1995: 420).
The use of XML as mark-up language to tag the data in a bank of Biblical data may also enable learners to move between teaching and reference textbooks and to emulate deductive grammars, according to Bothma (1992a). Furthermore, it could also facilitate the move in Biblical research focus from textual aspects to communicative aspects (Poswick, 2004).
The combination of hypermedia, such as XML, and database concepts forms a strong and promising alliance of techniques, which facilitates solutions to cater for a diversity of domains, users and applications, including integrated Biblical information systems (Claassen & Bothma, 1988: 84). This thesis may be a step in the right direction to solve the problem of new requirements that may be laid down by the "shift of paradigm from exegesis based on a philological approach, to hermeneutic based on a linguistic and socio-linguistic approach" (sic) (Poswick, 2004), since the use of an extensible, multidimensional data structure could facilitate the accommodation of other types of linguistic and non-linguistic data.
1.2.4 Visualisation and flexibility

Adding visualisation techniques to the mixture of XML and databases could provide even more exciting possibilities. Claassen & Bothma (1988: 88-89) suggest the use of visualisation to direct users in finding their way through the convoluted sets of paths in hyperspace. Advanced processing and visualisation techniques may also make a contribution towards the development of user-friendly interfaces (Bothma, 1992b: 348). This thesis aims to contribute to the attainment of this goal by proposing a macro-structure for the integration and packaging of Biblical Hebrew linguistic information and by experimenting with some visualisation techniques to render captured data in innovative ways.
According to Andersen & Forbes (2003: 44) one of the requirements of a proper rendering of syntactic structures of Biblical Hebrew is that it should be pictorial, that is "clearly and concisely diagrammed". They use graphs and trees to visualise ("represent") the hierarchical syntactic structures of Biblical Hebrew clauses.
Scalability is a serious issue that needs to be addressed if one would like to represent aggregate linguistic information on a lateral level across the single units of the textual corpus. According to Andersen & Forbes (2003: 45) the text of the Hebrew Bible consists of approximately 59 000 main clauses and 13 000 embedded clauses. Although this thesis will propose ways to compile such aggregate information, it is still limited to one chapter only. Visualising lateral information of larger sections, books or the whole Hebrew Bible will surely create new and difficult challenges for researchers.
Tov (2006: 337) differentiates between non-flexible and flexible Biblical Hebrew software. Non-flexible tools reflect only the results of computer-assisted research in textual format, and the reader does not have access to the original data or tool itself; these may even have become obsolete. Flexible tools, however, allow interactive use of the tool and data. These tools, especially the flexible versions, may be used as "an extension of our own thinking" and to "improve and expand the areas of our research". The tools that are already available may be categorised according to their intended purpose, i.e. to serve as aids in authorship studies, analyses of stylistics and linguistics, as well as statistical and text-critical studies (Tov, 2006: 338-342). Making use of interactive visualisation tools could pave the way to more flexible Biblical Hebrew linguistic software.
In addition to the representation of linguistic data, a "comprehensive Biblical information system" should, according to Bothma (1995), include images of textual-critical material and cultural-historical objects in order to facilitate the preservation, publication and research of the ancient manuscripts. Multiple disciplines are involved in such a system, which necessitates teamwork, since no researcher could have all the skills needed to construct the various building blocks. Bothma (1992b: 348) highlights the cooperation between linguists, theologians and IT specialists that is needed to build well-designed Biblical information systems. Although there might not be many researchers who have an in-depth command of all of these disciplines, the members of the team should have a basic understanding of the complex nature of each other's abilities and fields.
Although some of the projects discussed above do facilitate rather advanced searches, they do not clearly differentiate between the linguistic levels of syntax and semantics; neither do they facilitate comparative studies on and between these levels.
1.3 Problem statement

As indicated above, there is still a need for more language-oriented, multidimensional Biblical information systems, in which the linguistic characteristics of the Hebrew text of the Old Testament are embedded, to enable researchers to do advanced ad hoc queries. For example, a researcher may want to do a specific search in order to find good examples of a certain syntactic structure, to study semantic role frameworks in Biblical Hebrew, or to explore the mapping of syntactic functions onto semantic functions.
One of the core foci of Information Systems is the study of databases. Among other approaches, such as object-oriented database systems, relational database management systems remain the technology most often used and taught, usually within the realm of business management. This widely used and standardised technology is, however, not necessarily the most suitable for text-based databanks.
Standard (relational) database management systems such as MS Access are, in fact, not ideal for the storage and retrieval of linguistic data, because they require groups of similar records having highly structured (usually twodimensional) data. Free text, however, is covertly structured. If the linguistic elements of a clause, for example, are captured in a standard database, the word order of the clause's elements is lost, and it results in tables with an unacceptably high number of columns, many of which contain null values, because every sentence's structure may differ considerably from the former one. According to Bourret (2003) the structure of sentences "varies enough that mapping it to a relational database results in either a large number of columns with null values (which wastes space) or a large number of tables (which is inefficient)".
Furthermore, to capture information of various linguistic modules, the relational database should contain various tables with columns for each of the syntactic and semantic functions, word groups, etc., causing even more overhead. Therefore, storing text-related data in conventional databases is not ideal. It "artificially creates lots of tuples/objects for even medium-sized documents" (Xyleme, 2001: 3) and "requires multiple index lookups and multiple disk reads" (Bourret, 2003). Complex format mappings would be needed to convert the data into a table-based relational database management system. Many joins would be needed to reconstruct the original document when queries are run (Vakali et al., 2005: 62).
Since traditional relational databases seem to be problematic, one has to look for an efficient solution elsewhere. XML has been identified by various computational linguists as a viable technology for linguistic databases and computing (cf. Witt, 2002; Witt, 2005; Bayerl et al., 2003; Burnard, 2004). Due to its flexibility, XML may be used to create either a less structured, text-oriented file or a more structured, database-oriented document.
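As a first, minimal indication of how such an XML databank can be consumed programmatically (the round-tripping that is worked out in Chapter 5; cf. Figure 5.3), the following Visual Basic 6 sketch loads the databank with the MSXML parser and walks through its clause elements. The file name is that of the databank in Addendum H, but the element name clause is an assumption made for the sake of illustration; the actual tag set is defined by the schema in Addendum G (Gen1_InputV15.xsd).

    ' Minimal sketch, assuming the databank of Addendum H and a <clause>
    ' element name; the real tag set is defined in Gen1_InputV15.xsd.
    Dim doc As Object, clauses As Object, i As Long
    Set doc = CreateObject("MSXML2.DOMDocument")   ' late-bound MSXML parser
    doc.async = False
    If doc.Load("Gen1_InputV15.xml") Then
        Set clauses = doc.selectNodes("//clause")  ' hypothetical element name
        Debug.Print clauses.length & " clauses found"
        For i = 0 To clauses.length - 1
            Debug.Print clauses.Item(i).Text       ' text content of one clause
        Next i
    Else
        Debug.Print "Parse error: " & doc.parseError.reason
    End If

From such a node list the values can be copied into a threedimensional array, and saving the edited array back to an XML file completes the round trip described in section 5.2.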
There are, of course, exceptions to the rule that XML is the best option for language-related database projects. According to Bradley (2005: 133) some projects do begin with text, but hidden below the surface is data-oriented material that better suits a relational database approach. For the foreseeable future he proposes a complementary approach that uses both relational database management systems for such projects and XML-based systems for materials that are saying "more 'subtle' things" (ibid.: 134, 141). Since XML also provides opportunities for the integration of existing databases (Golfarelli et al., s.a.), this thesis will investigate its usability to capture and explore data from various linguistic modules. The proposed structure of the XML database could be used in future as a model for the integration of information extracted from existing but divergent Biblical Hebrew databanks.
1.4 Research questions
1.4.1 Main research question

The main research question that is addressed in this thesis may be formulated as follows: How can XML be used to build an exploitable linguistic database of the text of the Hebrew Bible?
The scope of the empirical study will be limited to Genesis 1:1-2:3, the first pericope of the Hebrew Bible. Scalability issues to cover more extensive texts, such as the whole Hebrew Bible, will not be dealt with.
1.4.2 Secondary research questions

The main focus of the thesis falls on the use of XML for the permanent, platform-independent storage of linguistic data, which may be used by various algorithms, programming languages and visualisation techniques for in-depth processing. Flowing from the main research question, the following secondary research questions will be addressed in the various chapters of the thesis:
• Chapter 2: How can multidimensional Biblical Hebrew linguistic data be captured and stored in the computer's temporary memory using a programming language such as Visual Basic?
• Chapter 3: How can multidimensional Biblical Hebrew linguistic data be processed with a programming language such as Visual Basic?
• Chapter 4: How can multidimensional Biblical Hebrew linguistic data be stored permanently to allow a stable environment for editing and processing?
  o Why should XML be explored as an option to store an exploitable database of linguistic data?
  o How can XML represent inherently multidimensional data?
  o How can XML represent the phonological representation and translation, as well as the morpho-syntactic, syntactic and semantic analyses of the Biblical Hebrew text?
• Chapter 5: How can linguistic data be recovered from and saved to a permanent storage device (such as an XML database)?
• Chapter 6: How can linguistic data be explored to unveil hidden patterns in and between the various language modules?
• Chapter 7: How can visualisation be used to enhance text-mining of multidimensional linguistic data?
1.5 Hypothesis

The central theoretical statement of the thesis is: Taking advantage of the flexibility and inherent hierarchical nature of XML provides a suitable technology to transform free text and its covert linguistic characteristics into a platform-independent database that may be explored and mined to uncover hidden linguistic patterns in the Hebrew Bible text.
The candidate expects to prove the following propositions:
• A clause's word order can be kept intact while other features such as syntactic and semantic functions are marked up as related elements.
• The inherent hierarchical nature of XML is ideal to design a well-structured database containing linguistic data from various modules in one, centralised, data structure that is suitable as a permanent storage device.
• The elements from the XML database can be accessed, processed and visualised by third-generation programming languages such as Visual Basic and Java.
• A threedimensional array is an effective programming tool to process and mine the data.
To test the hypothesis and expected findings, a multi-modular analysis of Genesis 1:1-2:3 will be implemented using XML, while Visual Basic 6 will be used for the online analytical processing (OLAP) on the data. A visualisation tool, created in Java, will be used to investigate visualisation of the linguistic data as a data-mining tool.
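The kind of OLAP operation that will be used to test these propositions can be previewed in a few lines of Visual Basic 6. The fragment below "slices off" one analysis layer of the clause cube across all clauses, roughly the first step of the semantic-framework experiment in section 6.4.1 (cf. Figure 6.4). The array shape and the assumption that layer 3 holds the semantic functions are illustrative only, not the thesis's actual code.

    ' Sketch of a slice operation: copy one analysis layer (assumed here to
    ' be layer 3, the semantic functions) out of the cube for all clauses.
    Dim ClauseCube(1 To 108, 1 To 3, 1 To 10) As String   ' as in the earlier sketch
    Dim SemSlice(1 To 108, 1 To 10) As String
    Dim c As Integer, p As Integer

    For c = 1 To 108
        For p = 1 To 10
            SemSlice(c, p) = ClauseCube(c, 3, p)   ' keep only the semantic layer
        Next p
    Next c

Each row of SemSlice then represents one clause's semantic role framework; sorting, concatenating and counting these rows (sections 6.4.2-6.4.5) yields the aggregate patterns that the hypothesis anticipates.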
1.6 Positioning of Linguistic Information Systems within a research discipline

Science is traditionally regarded as having two major branches, natural sciences and human sciences. The natural sciences study subjects such as Chemistry and Physics and use mainly empiricist methods. [3] Human sciences consist of the social sciences (such as Economics and Psychology) and the humanities (Arts, Language and Philosophy). Although empiricist methods are used in some sciences, others in the humanities do not necessarily share this methodological approach. Most of the humanities use a rationalist approach that is not empiricist.

[3] Empiricism is an approach based on the idea that scientific knowledge can only be valid if it is based on empirical observation and measurement.
A non-empiricist point of departure for research is applicable when phenomena are studied which cannot be proven by hard, concrete facts and empirical observation. Cilliers (2005) pleads for alternative scientific methodologies regarding complex systems, which are modest and provisional, acknowledging that our understanding is limited and changing. Restricting the term research to empiricism is anachronistic, given the contemporary insight that knowledge is never final and beyond dispute. Modest claims about knowledge, however, invite knowledge workers to persevere in an ongoing search for meaning and generation of understanding. These "softer" research goals are the unique foci of especially the humanities.
Although it is tempting to view Information and Communication Technology (ICT) as "the epitome of rational expression", empiricist research methods are only appropriate for the study of engineering and algorithmic issues related to it. Therefore, the study of ICT is divided into three branches, i.e. Computer Science (the natural science branch), Information Science (the humanities branch) and Information Systems (the social sciences branch). [4] However neat this division may seem, many phenomena often require a mixture of these perspectives for comprehensive research. For example, Information Systems (IS) is primarily regarded as a social science, because it investigates socially constructed issues such as the influence of ICT on organisations. But it also sometimes studies harder (only factual) phenomena, typical of the natural sciences, such as programming techniques and algorithms, in order to build efficient software solutions for organisations and industry. Furthermore, it also has links with the humanities when it focuses on the use and application of ICT in education, health care and other humanistic [5] focus areas.

[4] Information Systems is primarily regarded as a social science focusing on the influence of computer technology on human society.
[5] The term humanistic is used in this thesis as an adjective of the noun humanities. It does not refer to the philosophy of Humanism. (Compare TheFreeDictionary, 2008: "humanistic - pertaining to or concerned with the humanities". Also see Aarseth, s.a.)
Information Systems should, therefore, be regarded as an interdisciplinary science. It should not only aim to add value to other disciplines, but also borrow from other contributing ICT (and non-ICT) disciplines in order to strengthen their alliances. "The power and not the weakness of IS research models is precisely that they situate IS constructs within constructs that other disciplines study" (Agarwal & Lucas, 2005: 390). For example, in one of the research foci of Information Systems, namely Human Computer Interaction, there are elements of all three branches of ICT: it studies the behaviour of computer system users, the use of professional algorithms to produce human-oriented output, and friendly design of interfaces by means of inputs from graphical design and multi-media.
As a science with strong links to the human sciences, Information Systems has indeed seen a growing acceptance that empiricist research is not the only valid scientific methodology that could be used to produce good research. Avgerou (2005: 105), for example, argues for critical research using interpretive methods in Information Systems to complement empirical and formal cognitive methods. She regards critical research as a process that aims to make sense of the investigated scenario, a radical procedure in which researchers' human capacities such as tacit knowledge and moral values are involved. "I see research as the art of putting together research questions with a critical content, multiple theories and epistemological awareness to develop claims of truth. This art cannot place confidence for producing valid knowledge on adhering to a testable theory or research practice" (ibid.: 108). Although the knowledge claims contributed by interpretive case studies should be regarded as soft facts, they are still valid and should be generalised in clear formulations aimed at identified target audiences (Barret & Walsham, 2004: 298, 310).
Bondarouk & Ruël (2004) argue for the use of discourse analysis to enable a hermeneutic approach in the analysis of information systems documents. Discourse analysis is another non-empiricist scientific method. It is essentially interpretive and constructivist. It tries to "give a meaning to a text within a framework of the interpreter's experience, knowledge, time, epoch, culture, and history". It assumes that understanding is an open, continuous process and that there is no final, authoritative interpretation.
Other non-empiricist approaches in Information Systems, which will not be discussed here in detail, are (cf. Carlsson, 2003; Du Plooy, 1998: 53-68):
• action research (the researcher collaborates with members of the organisation to experiment with possible solutions for a problem)
• actor network theory (the researcher studies the technical and social aspects of IT as a unity because values are believed to be built into software)[6]
• critical realism and adaptive theory (the researcher attempts to combine and synthesise empiricism and interpretivism)
• ethnography (the researcher participates in activities of the organisation that is studied)
• grounded theory (the researcher derives theory by means of qualitative data analysis)
• structuration theory (the researcher regards human agency and social structure as an inseparable duality)

[6] Like adaptive theory, which is epistemologically neither positivist nor interpretivist (Carlsson, 2003), actor network theory (ANT) is positioned between deterministic and constructivist theories (Cordella & Shaikh, 2003). It studies the reciprocal influence of technology and society, the interaction between the human and non-human actors that constitute a network. Reality is believed to come into existence through this interplay.
From the discussion above it should already be clear that it has become acceptable to use other, non-empiricist and interpretive methods in Information Systems research. Humanities Computing may be another tool that could introduce a "softer" view and use of computers that would be more applicable in the human sciences than the "harder" approaches that are typical of the natural sciences. The representation of socially related data is one of the basic ventures of Humanities Computing (Neyt, 2006: 2-5). This approach has a broader scope than the mere use of computers to confirm or reject hypotheses empirically, which constrains meaning (Ramsay, 2003). According to Aarseth (s.a.) Humanistic Informatics is the discipline that studies "the changing role of technology in the Humanities, as in society in general".
The study of databases also forms part of the discipline of Information Systems (IS). According to Vessey et al. (2002: 167) database management is one of the topics "at the heart of the IS discipline in that they are central to IS curricula and therefore to IS careers". The creation of knowledge databases and the exploration of these electronic repositories are thus part and parcel of Information Systems research, even if the encoded data come from other disciplines.
Mark-up languages, traditionally, fall more within the research focus of Information Science. Unsworth (2001) regards the marking up of texts as a form of knowledge representation within the field of Humanities Computing: "For humanities computing, knowledge representation is a compelling, revelatory and productive way of doing humanities research--and in many ways, it is what humanities computing has been doing, implicitly, for years." The study of mark-up and ontologies (taxonomies) is closely related to the study of language and semantics, which forms part of the broader discipline of the humanities (the study of the arts, language and literature, philosophy and human culture). Research on the structure and use of mark-up languages to indicate metadata, therefore, operates within the grey area between Linguistics and Information Science.
Due to its flexibility, XML, an extensible mark-up language, may also be used to create text-based databases. As argued above, the study of databases forms part of the discipline of Information Systems. The study of XML databases therefore falls within the field of Information Systems, with strong ties to Information Science. Aarseth (s.a.), for example, regards research on digital document representation and hypertext uses as part and parcel of Humanistic Informatics (also called Alfainformatica): "In particular, text mark-up systems such as SGML, and the potential and limits of exploratory data analysis, can and should provide extremely interesting subjects for the field." The humanistic informatician fulfils an interstitial role between two (or more) different disciplines and needs to be acquainted with the methods and problems of all of these (ibid.).
The transformation of language and texts, via mark-up, into linguistic databases may, therefore, be viewed as a subfield of Humanistic Information Systems. This subfield, which is also the broader focus of this study, may be called Linguistic Information Systems or Natural Language Information Systems.
Natural Language Processing (NLP) provides another perspective on the study of the relations between ICT and language. This interdisciplinary field focuses on the simulation of language understanding and production by means of computer algorithms. It combines Linguistics and Computer Science, a natural science. For an example of Natural Language Processing of Biblical Hebrew, compare Petersen (2004a), who wrote a program to automatically create conceptual graphs of the semantics of Genesis 1:1-3 using data from an Old Testament database.[7] The combination of Linguistic Information Systems and Natural Language Processing is called Computational Linguistics.

The creation of linguistic databases is, of course, not an end in itself. Besides being used to prove hypotheses, such databases may also be used to suggest new ideas and theories. Ramsay (2003) suggests that computing humanists should rather use software to discover a multiplicity of meanings in literary sources. Such an approach will deepen the subjectivity that is essential for the creation of critical insight. Researchers and software creators should therefore work towards alternatives for the traditional, statistics-based "forensic semiotics" in the processing of texts in order to change the computer into a tool that supports interpretive processes: "[R]ather than to extol the computer as a scientific tool that can supposedly help prove particular facts about a text, we would do better to focus on its ability to help read, explore, experiment, and play with a text" (Sinclair, 2003: 176).

[7] Petersen used XML as input and output format for his lexicon and ontology. This part of his study may be regarded as more information system-oriented.
This thesis focuses on the use of an XML database to facilitate storage, exploration and visualisation of multidimensional linguistic data. It involves a combination of research on linguistics, database management and mark-up, as well as the visualisation of the patterns entrenched in tagged data. Since it focuses on the use of a database for the study of language it may, therefore, be regarded as a research endeavour in Linguistic Information Systems.
In a study with an approach similar to this one, T. Sasaki (2004) proposed design principles for XML documents to facilitate lexicographical and grammatical studies of Modern Hebrew. He reserves a place for the creation and use of annotated text databanks within the field of Computational Linguistics, which he defines as the "interface between Hebrew linguistics and computer science" (2004: 17).[8] Even if the construction of such an annotated databank could "seem rather naïve to NLP-oriented computational linguists" (T. Sasaki, 2004: 22), it is an essential task for the computational linguist who is more interested in the use of existing information technology to enhance his/her study of language than in the creation of algorithms that simulate human language understanding or production.
An XML schema defines the structure and content of the databank containing the XML mark-up tags (Clark et al., 2003). A schema is preferred to a DTD (document type definition) since it is more advanced and "more closely maps to database terminology and features", allowing the definition of variable types and valid values for the elements (Rob & Coronel, 2007: 579). Tags that are used to digitalise text "are not merely structural delineations, but patterns of potential meaning woven through a text by a human interpreter" (Ramsay, 2003: 171).

[8] According to the exposition above it should rather be the interface between Hebrew Linguistics and Information Systems.
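As a hedged illustration of the kind of database-like constraint a schema can express (the element name and the permitted values below are hypothetical, not those of the schema developed later in this thesis), an XML schema can restrict an element to an enumerated list of valid, typed values:

<!-- Hypothetical fragment: constrain a "syntax" element to a fixed set of values -->
<xs:element name="syntax">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="Subject"/>
      <xs:enumeration value="Object"/>
      <xs:enumeration value="Main verb"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

A DTD can declare that such an element exists, but it cannot constrain the element's text content to a fixed set of typed values in this way.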
The discovery of meaningful patterns in numerical data (data mining) and in textual data (text mining) is another essential part of Information Systems research. It should be noted that this discovery of new information is an intentional process. Any new knowledge produced by it is not simply discovered, but created (cf. Du Plooy 1998: 54, 59). The patterns that the candidate wants to unveil in the Genesis 1:1-2:3 XML document are covertly embedded within other visible patterns, i.e. the overt patterns specified by the schema. The creative process of knowledge discovery should be a stimulating, but "careful and responsible development of the imagination" (Cilliers, 2005: 264). This could eventually improve linguists' understanding of language as a complex social system, because "our understanding of complex systems cannot be reduced to calculation" (ibid.).
The thesis also contains elements of Computer Science and Information Science. Algorithms are developed to process and mine the linguistic data in the XML database, and visualisation is discussed and illustrated as a means to augment the exploratory searches for patterns hidden in the data. The study aims to show how a graphical visualisation tool could be used to stimulate imaginative knowledge-creation processes in a responsible way. In order to reach this goal the researcher must be enabled to perform experimental, trial-and-error investigations of the text that could reveal exciting new patterns built on re-orderings of marked-up text.
1.7 Research style and methodology
The study will entail various literature studies and empirical programming experiments to investigate the various research questions. Although the topic and the concepts for the various subsections of the project were planned as a whole (see the main and secondary research questions above), each chapter was approached as a unit so that it could be submitted as a conference paper or journal article during the course of the PhD study. Therefore, each chapter has a literature study and a practical component. After completion of all the papers or articles, it was decided not to rewrite the whole thesis by consolidating all literature review sections into one chapter, since the information is closely related and interwoven with the application in each chapter. Keeping the basic structure of the thesis as a collection of independent, but related papers, each building on the preceding one, will also enable readers of the (unpublished) thesis to read each chapter as a unit. However, there is a logical flow from the second to the seventh chapter. Likewise, the practical programming concepts and examples also progress from very basic concepts to rather advanced data-mining and visualisation applications. The study first explores ways to store linguistic data in the computer's temporary memory and the various presentations of subsets facilitated by these data structures. Next, permanent storage and conversion between temporary and permanent storage is discussed. Once a stable platform for storage and editing has been established, advanced processing and graphical interfaces for text mining are investigated.
While the suitability of threedimensional array processing, XML and visualisation concepts is presented and evaluated in the literature studies, the empirical and experimental components are used to test these ideas against the test data provided by Genesis 1:1-2:3. Therefore, the overall, dominant research style of the thesis is constructive, i.e. developing a new framework and pursuing technical developments (cf. Cornford & Smithson, 1996: 43). Although the basic philosophical point of departure is positivistic, a large element of interpretation is built into the data analyses used as data. This part of the study may be regarded as anti-positivistic, qualitative and interpretative.
The Visual Basic programming code presented in this thesis could still be optimised and improved to a large extent. Since the main focus of this study is not on the algorithms themselves, but on their usability to store and mine linguistic data, making the code more efficient and elegant has been left for follow-up work. This could be done, for example, by migrating the project to a fully object-oriented language, such as Visual Basic 2005, and by implementing more procedures, functions, classes and objects.
1.8 Research plan
Although the six content chapters of the thesis (Chapters 2-7) had all initially been planned to constitute a coherent body of work (see 1.4 above), they were written and submitted separately as conference papers, some of which appeared in published conference proceedings, scholarly publications or on the internet after the comments of peer reviewers had been taken into account and processed in the final versions. Other papers were revised, using the valuable comments of international and local reviewers, and have eventually been published in accredited journals. Parts of the introductory chapter (Chapter 1) were published in the Innovate magazine, published by the EBIT Faculty of the University of Pretoria. An overview article, which forms part of the concluding chapter (Chapter 8), has been accepted as a paper by an international conference (Conf-IRM 2008). The feedback provided by many colleagues over the past five years has resulted in an organic development of the ideas in the thesis and is highly appreciated. See the table below for more specific details on the mapping between the various chapters of the thesis and the related research outputs.
As indicated above, the literature review does not form a separate chapter, but is integrated into the content chapters, as well as the introductory chapter. The literature studies are interleaved with interpretative and constructive research experiments, namely the analysis of a Hebrew text to be used as test data and the creation of Visual Basic and Java programs to transform, store, process and data-mine the linguistic data. The Hebrew text of Genesis 1:1-2:3 has been analysed linguistically on various levels and this data was used as test data throughout Chapters 2 to 7. The text of Leviticus 1 was also analysed and coded in the same XML structure to test the plug-in scalability of the programs. Although this test was successful, this XML document will not be included or discussed in this thesis.
Although the original papers and articles have been revised, some extensively, others to a lesser extent, in order to constitute a coherent and consistent text, the history of their origin is still present in the underlying structure of each chapter. Therefore, some information is repeated, summarised or extended in the various chapters, partly due to the fact that the information was interwoven in the original research outputs, but also having the added benefit that each chapter may still be read as a single, independent unit.
The following chapters correspond to the research outputs referred to above:

Chapter 1
• The philosophical section(s) on Linguistic Information Systems is a revised version of: KROEZE, J.H. 2007. Linguistic information [systems] - a humanistic endeavour. Innovate, 02: 38-39. (Published by the University of Pretoria, EBIT.)

Chapter 2
• KROEZE, J.H. 2004. Towards a multidimensional linguistic database of Biblical Hebrew clauses. Journal of Northwest Semitic Languages (JNSL), vol. 30, no. 2, pp. 99-120.
• Revised version of 2004 AIBI VII paper, Leuven, Belgium: Processing Hebrew clauses using threedimensional arrays.

Chapter 3
• KROEZE, J.H., BOTHMA, T.J.D. & MATTHEE, M.C. 2008? Slicing and dicing a linguistic data cube. Accepted for publication in Handbook of Research on Text and Web Mining Technologies (Nov. 2007), edited by M. Song.
• Revised version of 2004 SASNES paper, RAU, Johannesburg: Slicing and dicing cyber cubes of Biblical Hebrew clauses.

Chapter 4
• KROEZE, J.H. 2006. Building and displaying a Biblical Hebrew linguistics data cube using XML. Paper read at Israeli Seminar on Computational Linguistics (ISCOL), Haifa, Israel, 29 June 2006. Available: http://mila.cs.technion.ac.il/english/events/ISCOL2006/ISCOL20060629_KroezeJH_XML_Paper.pdf.

Chapter 5
• KROEZE, J.H. 2007. Round-tripping Biblical Hebrew linguistic data. In Managing Worldwide Operations and Communications with Information Technology (Proceedings of 2007 Information Resources Management Association, International Conference, Vancouver, British Columbia, Canada, May 19-23, 2007), edited by M. Khosrow-Pour, pp. 1010-1012. Published in book format and on CD by IGI Publishing, Hershey, PA.

Chapter 6
• KROEZE, J.H. 2007. A computer-assisted exploration of the semantic role frameworks in Genesis 1:1-2:3. Journal of Northwest Semitic Languages (JNSL), vol. 33, no. 1, pp. 55-76.
• Revised version of 2006 SASNES paper, Unisa, Pretoria: Semantic role frameworks extracted from a multidimensional database of Gen. 1.

Chapter 7
• KROEZE, J.H., BOTHMA, T.J.D., MATTHEE, M.C. & KROEZE, J.C.W. 2008. Visualizing mappings of semantic and syntactic functions. Proceedings of the Sixth International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, 27-29 March 2008. (On CD. ISBN: 977-403-290-X.) Available online (http://www.fci-cu.edu.eg/infos2008/infos/MM_10_P061-072.pdf).
• KROEZE, J.H., BOTHMA, T.J.D., MATTHEE, M.C., KROEZE, J.C.W. & KRUGER, O.C. 2008? Designing an interactive network graph of modular linguistic data in an XML database of Biblical Hebrew. Paper abstract accepted for AIBI VIII (2008). (The empirical part of this paper does not form part of the thesis, but should be regarded as a post-doctoral project building on the thesis.)

Chapter 8 (conclusion and summary)
• KROEZE, J.H., BOTHMA, T.J.D. & MATTHEE, M.C. 2008. From tags to topic maps: using marked-up Hebrew text to discover linguistic patterns. Paper accepted by Conf-IRM 2008.
• Beta version read as guest lecture, University of Leiden, 31 May 2007.
1.9 Structure of the thesis
The thesis is structured as follows:
Front matter
The front matter consists of the cover page, title page, preface, table of contents, abstract and English and Afrikaans summaries.
Chapter 1: Introduction
This chapter discusses the background and goals of the thesis (see 1.1).
Chapter 2: Towards a multidimensional linguistic database of Biblical Hebrew
This chapter discusses the use of threedimensional arrays in Visual Basic 6 to build a data cube in the computer's RAM (random access memory). The theory is applied to the Hebrew text of Genesis 1:1-2:3, using the Visual Basic 6 programming language. The linguistic data is declared and initialised in a module of the program. The linguistic data cube is called a "clause cube". The clause cube provides a structure that may be used to integrate data from various linguistic modules. Data integration is a typical warehousing problem (Xyleme, 2001: 2). More data warehousing concepts are discussed in Chapter 3.
Chapter 3: Slicing and dicing the clause cube
This chapter discusses and illustrates the application of typical online analytical processing techniques and data warehousing operations, like rotation, drilling down, and slicing and dicing on the clause cube while it resides in the RAM.
Chapter 4: Building and displaying the clause cube using XML
This chapter discusses the use of the mark-up language XML to build a database for permanent storage of the linguistic data in the clause cube. An XML database is designed and programmatically built to store the test data.
Chapter 5: Conversion of the Genesis 1:1-2:3 linguistic data between the XML database and the array in Visual Basic
This chapter discusses the process of reading the linguistic data from the XML database (permanent storage) into the threedimensional array (temporary storage) for processing, as well as saving the updated data back to permanent storage. A program is written and used to "round-trip" (convert) and edit the test data.
Chapter 6: Advanced exploration of the clause cube
This chapter discusses data mining on the linguistic data using algorithms coded in Visual Basic. The logic of the program is discussed and the source code is provided. The results of two experiments are provided in textual format and compared to existing linguistic knowledge and hypotheses. The subsets of the threedimensional array, especially those that contain processed, aggregated information, may be regarded as temporary datamarts of the XML clause cube. These datamarts may be implemented as one-, two- or threedimensional arrays. The slicing and dicing and other operations done on the data during the data-mining process may be regarded as XOLAP, i.e. OLAP done on XML databases (Wang & Dong, 2001: 50).
Chapter 7: Visualisation of the Biblical Hebrew linguistic data in the XML cube
This chapter discusses theoretical aspects of visualisation and evaluates the results of a graphical text-mining tool that maps the syntactic and semantic functions in the test data.
Chapter 8: Conclusion
This chapter gives an overview of the content of the whole thesis and recapitulates the most important conclusions.
Bibliography
All the sources referenced in the thesis are listed in the bibliography, using the Harvard method, based on the guidelines found in Botha & Du Toit (1999) and Van der Walt (2002).
Addenda
The fourteen addenda contain additional material, such as the transcription and "ontologies" (taxonomies) of word groups, syntactic functions and semantic functions, used in the tagging of the linguistic database of Genesis 1:1-2:3. The XML database, source code and executable program files are also included as addenda. Some of the addenda have two versions, an executable program as well as a textual version of the source code. Due to the extensive size of this collection it is only provided on CD. Essential parts, however, are reproduced in the chapters where the material is discussed.
1.10 Contribution to the field of ICT
This study illustrates that a threedimensional data structure can be used to represent inherently multidimensional linguistic data regarding Biblical Hebrew clauses. Knowledge from various linguistic modules is captured and integrated in a single clausal data cube (clause cube), which is regarded as an efficient way of using a threedimensional database, implemented both in an array and in an XML structure. The structure of the database may in future be used as a model for the integration of linguistic data that have been captured in various other computer software systems. The chapter on round-tripping shows how integrated linguistic data may be converted between data structures in permanent and temporary storage.
The captured data can be viewed and manipulated in various ways, for example to create stacks of twodimensional interlinear tables showing required aspects of the data of clauses. In this way the threedimensional data cube facilitates actions that are typical of online analytical data processing and data warehousing. Such software can facilitate the linguistic analysis with which any exegetical process should commence, which in turn can benefit a multidimensional approach to biblical exegesis (cf. Van der Merwe, 2002: 94). It also facilitates a format in which the biblical text is presented for readers "succinctly enough to be handled by the short-term memory", thus enhancing the success of the communication process (ibid.).
The thesis also illustrates how text data mining may be performed on the linguistic information of an ancient language. Another contribution is the application of visualisation concepts to enhance text mining procedures by using i.a. graphical topic maps.
To sum up, the thesis makes a contribution to the field of Information and Communication Technology by demonstrating how software tools and concepts borrowed from Information Systems, Information Science and Computer Science may be used and adapted in Linguistic Information Systems for knowledge representation and processing.
1.11 Definition of terms
The definitions of the terms that are provided below are not official definitions quoted from other literature, but working definitions to indicate how these concepts are understood and used in this thesis:
Clause cube: A threedimensional data structure, either in temporary or permanent storage, used to capture inherently multidimensional linguistic data on the clausal level.
Data cube: A threedimensional data structure used to store related aspects of data in a single structure for efficient processing.
Data mart: A data mart is a subset of a data warehouse and contains extracted and summarised data related to a specific, required perspective on the data.
Data mining: The uncovering of hidden patterns and trends in (usually numerical) data.
Data warehouse: A collection of basic and aggregated data used to discover business intelligence. The clause cube in this thesis only contains detailed data, but the slices and aggregated data mined from it may be regarded as data marts, while the collection of all of these data structures may be seen as a linguistic data warehouse.
Dicing: In this thesis dicing is used to refer to the extraction of specific data nuggets within the clause cube or its slices. (Dicing is also often used as a synonym for rotation.)
Drilling down: Moving from aggregated data to the underlying detailed data in the data cube.
Hyper cube: A multidimensional data cube, having four or more dimensions.
MOLAP: Multidimensional online analytical processing, including processing on a threedimensional data cube and clause cube.
OLAP: Online analytical processing, i.e. the discovery of (usually business) intelligence using software to explore databases, data marts and data warehouses to trace (business) trends. In this thesis OLAP is performed on linguistic data captured in a clause cube.
Rotation: Getting various perspectives on the data by "spinning" the cube and looking at the different external planes.
Round-tripping: The conversion of data in both directions between a permanent data storage facility (such as an XML file) and temporary computer memory (RAM).
Slicing: Filtering the data in a data cube to reveal a specific subset of the data needed for business decisions or to test academic hypotheses.
Text mining (= text data mining): The discovery of information hidden in textual data.
Threedimensional array: A collection of related variables used to store various aspects of interconnected data, which can be used to implement a data cube.
Visualisation: A graphical display of subsets of a dataset, based on attributes that are linked by means of keys, array indexes or mark-up tags in order to facilitate a preferably interactive exploration of the data.
XML: eXtensible Markup Language, similar to HTML but containing semantic value in the tags, which enhances its functionality and facilitates the creation and rendering of databases.
XOLAP: MOLAP done on a three- or multidimensional XML data structure.
1.12 Conclusion
This chapter stated the point of departure for the thesis, namely the proposition that threedimensional data structures can be used to capture integrated multidimensional linguistic data. The background, aims and outline of the thesis have been discussed. The use of threedimensional arrays and XML structures to implement and test this assumption has been indicated as the core part around which the rest of the thesis is built. Advanced data exploration and visualisation, discussed in later parts of the thesis, make use of these underlying data structures and data warehousing processes for knowledge creation. Although the various divisions have been structured and organised to form a coherent and logically flowing work when the thesis is read as a whole, enough information has been repeated so that each chapter may also be read as an independent unit. Readers are also informed of and referred to the various conference proceedings or journals where large sections of this thesis have been or will be read and published.
Chapter 2: Towards a multidimensional linguistic database of Biblical Hebrew [9]

2.1 Introduction
Biblical Hebrew clauses can be and have been studied from many different perspectives. These perspectives or layers mirror the "modules" of the human mental language machine (cf. Van der Merwe, 2002: 89). Over the past forty years much of this knowledge has been captured in various computer software systems and databases.[10] Van der Merwe (2002: 96-97) refers to some of these products. The most basic layer is the digital representation of the Hebrew text, which can be called the transliteration layer. The second layer is the phonological layer, followed by the morphological, morpho-syntactic and syntactic layers.[11] More advanced layers such as the semantic and pragmatic layers have received less attention, but it is very probable that knowledge bases and expert systems that deal with these layers will, more and more, become available. Compare Link (1995), who proposes an algebraic perspective on the semantic analysis of human language, and the computerized version of Dik's functional grammar for English, French and Dutch (Dik, 1992). Van der Merwe (2002: 94) suggests the use of the notions topic and focus to mark up pragmatic functions in Biblical Hebrew (BH).

[9] This chapter is a revised and extended version of a paper read at the AIBI VII conference, Leuven, July 2004 ("Processing Hebrew clauses using threedimensional arrays"), and of an article published in Journal of Northwest Semitic Languages, 2004, vol. 30, no. 2, pp. 99-120.
[10] Cf. Talstra (1989: 4). In 1987 ten machine-readable versions of the Masoretic Text and various Bible concordance programs already existed (Hughes, 1987: 343-384; 498-545).
[11] Sowa (2000: 182) refers to morphological, syntactic and semantic parsing as stages in analyzing a natural language sentence, saying: "Each of the three stages in sentence processing depends on a repository of linguistic knowledge". The proposed multidimensional linguistic database of Hebrew clauses can be regarded as such a repository, which integrates various layers of analysis. Also cf. Hughes (1987: 497).
From these suggestions it is already clear that there are two main approaches in creating computerized biblical information systems. According to Talstra (1989: 2), the ideal linguistic database should be created by programs applying imitated rules, "otherwise a database of biblical texts will consist only of an echo of a personal, subjective knowledge and contain linguistic information not being produced by rules but by arbitrary personal choice".[12] Ultimately, however, this is an unattainable goal, because subjectivity will also influence the formulation of the linguistic rules that are to be imitated. Even Talstra & Postma (1989: 20) had to admit that it is impossible to formulate and refine rules that will attain a correct analysis in all cases. Therefore, there should also be a place for systems that capture the tacit knowledge that exists in the heads of experienced exegetes. Database solutions that capture existing linguistic data can fill this gap. Chiaramella (1986: 129) also refers to the "strong discussion about the best way to store knowledge" (either data structures or procedural objects) and says that "successful experiments have been made for both". This thesis follows the second route by proposing a database that integrates multimodular clausal analyses.

[12] "Experimental results in cognitive psychology suggest that humans apply model-based reasoning for problem solving in a variety of domains. Consequently, a formalism that captures the representations and processes associated with model-based reasoning would facilitate the implementation of computational reasoning systems in such problem solving domains" (Glasgow & Malton, 1994: 31).

2.2 The need for integration
Having available a number of electronic aids for the study of the Hebrew Bible is wonderful, but also overwhelming and even frustrating, due to the fact that various tools have to be used to study different layers and to get various perspectives. Therefore, systems have been suggested or have been developed to display multilayer analyses of Hebrew clauses, integrating the various dimensions of clausal analysis in an interlinear table format on one screen.
The Lexham Hebrew-English Interlinear Bible, for example, shows the Hebrew text, transliteration, lemma, lemma transliteration, lexical value and literal English translation of each word in a grid format (Van der Merwe, 2005). See Figure 2.1.
Figure 2.1. An excerpt of the Lexham Hebrew-English Interlinear Bible (http://www.logos.com/products/details/2055).
Also compare Kroeze (2002) for a proposal to include syntactic and semantic functions (in Jonah) in an interlinear web format (see Figure 2.2).
Figure 2.2. An interlinear analysis of Jonah 1:1a in an HTML table format (Kroeze, 2002).
Van der Merwe (2002) suggests the use of hypertext as one possible solution to integrate various perspectives or exegetical approaches. However, it could be very difficult or even impossible to integrate all available analyses due to the huge differences in the authors' assumptions and points of departure. Compare Andersen & Forbes (2002) who demonstrate the various divergent approaches, even on elementary layers such as morphology or parts of speech. A possible solution is to show the various analyses in a parallel manner leaving the final decision to the user. Also compare De Troyer's (2002) plea for integrated biblical tools, which implies that such a semi-integrated tool could be very useful to scholars.
Interlinear tables, similar to the illustrations in Figures 2.1 and 2.2, resemble the tables found in relational databases that capture data about entities, and this gives birth to the wish to be able to do ad hoc queries on the stored data. A database allows easy access to the data and the possibility of adding new data easily (Tov, 1989: 90). This is not possible with flat files [13] or text files. RDBMSs' [14] structural and data independence features support these requirements. In order to work dynamically with the stored data it is important to use a proper database management system, which facilitates the use of linked files and complicated search and sorting functions (Niewoudt, 1989: 102).
2.3 A clause cube as the ideal data structure
Unfortunately, interlinear tables cannot simply be transformed into relational database tables, because there is a separate table for each record (or clause) and the rows do not represent unique records.[15] A closer inspection of an interlinear table reveals that the rows actually represent various dimensions or layers of data-analysis that are strongly linked to the elements in the upper row. This type of interlinear table is in fact a twodimensional subset of a three- (or multi-)dimensional linguistic data structure. According to Koutsoukis et al. (1999: 7) a stack of twodimensional spreadsheets (rows and columns) is a threedimensional cube.[16] This can be conceptualised as a data structure that consists of a set of sub-cubes arranged according to rows, columns and depth layers (see Figure 2.3).
[13] Although flat files may have rows and columns and thus look like relational database tables, they do not support relational operators such as joins, projects and selects (Hughes, 1987: 497).
[14] Relational database management systems.
[15] Chiaramella (1986: 122) identified the problem of representing text in relational, hierarchical and network database management systems: "Nothing currently exist [sic] for efficient description of texts within database systems".
[16] Compare Pietersma's (2002: 351) discussion of an interlinear Greek-Hebrew text. According to Pietersma an interlinear text is twodimensional because it has a vertical and horizontal dimension.
Figure 2.3. A threedimensional data structure that consists of a set of 27 sub-cubes arranged according to three rows, three columns and three depth layers.
The linguistic knowledge that is represented by a collection of interlinear tables can therefore be rendered threedimensionally as a clause cube consisting of a cluster of phrases and their analyses. To explain this concept a very simplified example will first be used below. Three simple English sentences will be used, and the reader should note that these are not real data from the Hebrew Bible. The word cube suggests that the lengths of the height, width and depth are equal, which is indeed the case in this simplified example (see Figure 2.4). However, as will be explained later on, this is usually not the case when working with real data. The data-cube concept has been borrowed from data warehousing terminology where the various dimensions may also have different sizes.
The horizontal dimension is divided into rows representing the various clauses - each row is a unique record or clause. The vertical dimension (columns) represents the various word groups or phrases in the clauses. Having attributes in this dimension called phrase 1, phrase 2, phrase 3, etc., at first does not seem very informative, especially if one is used to the descriptive attributes typical of relational databases.[17] However, "it is crucial to preserve the document structure (books, chapters, verses, half-verses, words) of the data, to allow access in terms of traditional categories" (Talstra, 2002: 4). And this method seems to be the most straightforward way to preserve word order.[18] Yet, the combination of these obvious elements on the horizontal and vertical dimensions with the layers on the depth dimension is indeed very illuminating. The depth dimension represents the various modules of analysis, for example, graphemes, syntactic functions and semantic functions. The unique intersections of the members of the various dimensions are the cells, and the contents of the cells the measures (Chau et al., 2002: 216). As in business data "the dimensions provide a 'natural way' to capture the existing real-world information structure" (Koutsoukis et al., 1999: 11).

[17] Compare the discussion on the choice of the phrase as primary structuring unit in Chapter 4 (4.4).
[18] In the application of this principle to complex examples embedded phrases and clauses will be indicated by square brackets where they occur. These embedded constituents will then be analysed separately.
[Figure 2.4: a clause cube with axes labelled Clause 1-3 (rows), Phrase 1-3 (columns) and Module 1: Grapheme, Module 2: Syntax, Module 3: Semantics (depth layers).]

Figure 2.4. The knowledge that is represented by a collection of interlinear tables, rendered threedimensionally as a clause cube consisting of layers of clauses and analyses stacked onto each other.[19]

[19] In order to enhance legibility, the clause cube, drawn with Blender 2.3, is shown here in the orthographic/orthonormal view, where distant objects' sizes are not rendered as smaller as in intuitive perspective viewing (Roosendaal & Selleri, 2004: 48).

When one applies this concept to a micro-text consisting of three real clauses from the Hebrew Bible (Gen. 1:1a, 4c and 5a),[20] one has to use three rows to represent three clauses, four columns to represent the number of phrases in each clause, and five depth levels to represent five layers of analysis. (Although this structure will be sufficient for this very small corpus, it should be enlarged by adding more rows, columns and depth levels to cater for more possibilities in a larger corpus and to add a unique identifier for every clause.) The three clauses are:
• bre$it bara elohim et ha$amayim ve'et ha'arets (in the beginning God created the heaven and the earth)
• vayavdel elohim ben ha'or uven haxo$ex (and God separated the light and the darkness)
• vayikra elohim la'or yom (and God called the light day)
If each clause is analysed in the suggested way, one needs 20 cells (individual data containers) to capture all of the relevant data (see Figures 2.5a - 2.5c). Each clause in this minute corpus consists of four phrases, each of which is analysed on five levels. These levels are, from bottom to top: phonetics,[21] translation, word groups, syntax, semantics.

[20] These clauses were chosen because all of them have four phrases but represent different syntactic structures. Many of the other clauses have fewer than four phrases, which would imply empty cells. One clause in Gen. 1:1-2:3 has five phrases.
[21] The decision to render the Hebrew text phonetically was made to facilitate accessibility and implies that the word order will be presented from left to right. If the Hebrew text were presented in the Hebrew alphabet the word order would have been shown from right to left.
Semantics: Time | Action | Agent | Product
Syntax: Adjunct | Main verb | Subject | Object
Word groups: PP | VP | NP | NP
Translation: in the beginning | he created | God | the heaven and the earth
Phonetics: bre$it | bara | elohim | et ha$amayim ve'et ha'arets

Figure 2.5a. Gen. 1:1a analysed according to phrases and linguistic levels.
Semantics: Action | Agent | Patient | Source
Syntax: Main verb | Subject | Complement | Complement
Word groups: VP | NP | PP | PP
Translation: and he separated | God | between the light | and between the darkness
Phonetics: vayavdel | elohim | ben ha'or | uven haxo$ex

Figure 2.5b. Gen. 1:4c analysed according to phrases and linguistic levels.
Semantics: Action | Agent | Patient | Product
Syntax: Main verb | Subject | IndObj | Complement
Word groups: VP | NP | PP | NP
Translation: and he called | God | to the light | day
Phonetics: vayikra | elohim | la'or | yom

Figure 2.5c. Gen. 1:5a analysed according to phrases and linguistic levels.
If these individual cells are lined up to form a clause cube, the data structure may be visualised as a threedimensional clause cube (the word cube is still used although the length, width and depth have different sizes). See Figure 2.6.
Figure 2.6. A clause cube (orthographic view) containing real linguistic data of three BH clauses, Gen. 1:1a, 4c and 5a.
Of course, looking at the whole cube from the outside only shows the contents of the cells placed on the outside. More data is hidden inside the cube. If one could slice off the top and middle rows, one would be able to see these hidden data (see Figures 2.7a – 2.7b). This concept will be discussed in more detail in Chapter 3.
Figure 2.7a. Revealing data contained in the second slice of the cube by removing the top slice.
Figure 2.7b. Revealing data contained in the bottom slice of the cube by removing the top and middle slices.
2.4 Implementing the clause cube in cyber space
Such a clause cube can be implemented on a computer using a threedimensional array, which can be called a cyber cube or data cube.[22] An array can be used as a knowledge representation scheme, which models entities and the relations between them in a certain problem domain, and array functions are used to generate, inspect and transform these representations (Glasgow & Malton, 1994: 8). Arrays have probably already been used in many biblical information systems, for example to sort sets of lemmatised language into sets of "identically parsed items" (Hughes, 1987: 502).
A data cube can easily be created in many computer languages by declaring a multidimensional array. For example, in Visual Basic 6 [23] a data cube with 3 rows, 3 columns and 3 layers is declared by the following statement: "Public Clause(1 To 3, 1 To 3, 1 To 3) As String".[24]
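As a minimal, hypothetical sketch (not taken from the thesis's addenda; the cell value is illustrative) of how such an array is populated and read back by its indexes, consider:

Option Explicit
' A 3x3x3 clause cube: the indexes are (clause, phrase, layer).
Public Clause(1 To 3, 1 To 3, 1 To 3) As String

Sub Demo()
    ' Store one measure: clause 1, phrase 2, layer 3.
    Clause(1, 2, 3) = "he created"
    ' Retrieve the same cell by its position in the cube.
    Debug.Print Clause(1, 2, 3)
End Sub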
The code to create such an exemplary clause cube using a threedimensional array in Visual Basic 6 may be found in Addendum A (on the included CD). It captures linguistic data describing the first 108 clauses in the Hebrew Bible (Genesis 1:1-2:3). Only the first clause is shown here (see Figure 2.8). The first dimension represents the 108 rows of clauses, the second dimension the phrases with a maximum of five per clause,[25] and the third dimension represents the layers of analysis:
• Module 1: Clause number
• Module 2: Transcription
• Module 3: Translation
• Module 4: Phrase type
• Module 5: Syntactic function
• Module 6: Semantic function
One extra layer is added on the depth dimension to those shown in Figures 2.5 - 2.7 to record the verse number, for example, Gen01v01a, as a unique identifier for each clause. This identifier, or primary key, will be recorded in the first column of the first row on the first layer, implying that empty cells will exist on the second to fifth columns of the first row on the first layer. These could have been used to record additional information about the clause, but are left empty in this project. Therefore an empty element, indicated by a hyphen in places Clause(1,2,1), Clause(1,3,1), Clause(1,4,1) and Clause(1,5,1) of the array, implies a hierarchical dependency on the clause number recorded in Clause(1,1,1). Repeating the same verse number reference in all these places would have been redundant. It should be noted that this extra layer represents a theoretical sub-level which is difficult to implement in a natural way using a threedimensional array. The array structure suggests that this element, Clause(1,1,1), is closely linked to the first word group on all the other layers. This, however, is not the case, because the information stored in Clause(1,1,1) actually is metadata that pertains to the whole clause.

[22] A data cube is a multidimensional, electronic data structure, often used in data warehousing to facilitate multidimensional views of data.
[23] Visual Basic was chosen as programming language for this experiment because it allows the use of threedimensional arrays and easy creation of executable files. More advanced features, such as four or more dimensions in arrays, as well as extensive connectivity to database management systems (Anderson, 2003: 59, 116), could be used in more detailed and complex versions of the clause cube.
[24] Such an array can be visualised as a "cube of side length m subdivided into m³ unit cubes" (cf. Banchoff, 1996: 15).
[25] Although the first clause only has four phrases, the size of this dimension is declared as five to accommodate clauses with five phrases in the rest of the data set. Empty columns are not populated.
These modules, of course, do not represent all possible levels of analysis. More levels could have been added to represent both lower and higher levels of linguistic analysis. On a more basic level, morphological analysis is possible, for example to indicate the various morphemes of each word (bre$it = preposition be + noun re$it) and to list the conjugational and declensional characteristics of verbs and nouns. Links to various available dictionaries could also be included, and on a higher level, pragmatic functions could be added. These, and other, possibilities are ignored in this study, because the primary goal is not to provide a complete linguistic analysis, but to show how various analyses could be integrated in a multidimensional, computerised data structure.[26]

[26] Not only grammatical, but also literary analyses of texts are multidimensional. According to McGann (2003: 14-15) there exists an indefinite and dynamic number of perspectives on textuality, "an array of interpretations". He says: "Since interpretive agency is a continuously evolving variable, and since the object of interpretation is a codependent function of that unfolding interpretive action, this field of textual relations must be understood as n-dimensional." In this project the grammatical analyses of the text of Genesis 1:1-2:3 are also understood as a multidimensional array of interpretations, some of which are conceptualised, quite literally, as a clause cube.
Below follows a part of the code that creates the threedimensional array and populates it with the selected layers of linguistic data (Figure 2.8).
Option Explicit
Public Clause(1 To 108, 1 To 5, 1 To 6) As String

Sub Main()
Clause(1, 1, 1) = "Gen01v01a"
Clause(1, 1, 2) = "bre$it"
Clause(1, 1, 3) = "in the beginning"
Clause(1, 1, 4) = "PP"
Clause(1, 1, 5) = "Adjunct"
Clause(1, 1, 6) = "Time"
Clause(1, 2, 1) = "-"
Clause(1, 2, 2) = "bara"
Clause(1, 2, 3) = "he created"
Clause(1, 2, 4) = "VP"
Clause(1, 2, 5) = "Main verb"
Clause(1, 2, 6) = "Action"
Clause(1, 3, 1) = "-"
Clause(1, 3, 2) = "elohim"
Clause(1, 3, 3) = "God"
Clause(1, 3, 4) = "NP"
Clause(1, 3, 5) = "Subject"
Clause(1, 3, 6) = "Agent"
Clause(1, 4, 1) = "-"
Clause(1, 4, 2) = "et ha$amayim ve'et ha'arets"
Clause(1, 4, 3) = "the heaven and the earth"
Clause(1, 4, 4) = "NP"
Clause(1, 4, 5) = "Object"
Clause(1, 4, 6) = "Product"
…
End Sub
Figure 2.8. A part of the code that creates a threedimensional array and populates it with the selected layers of linguistic data.
Data cubes are usually used to implement multidimensional databases or data warehouses [27] to enable users to "explore and analyse a collection of data from many different perspectives, usually considering three factors (dimensions) at a time" (Kay, 2004). According to Kay (2004) "we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another." In our case the data cube consists of the various interlinear clause tables all linked together in one data structure in order to enhance the analytical possibilities. Such a data warehouse is a database solution that can capture and integrate linguistic data from various sources.
One of the benefits of multidimensional arrays is the use of indexes referring to the specific position of a piece of data. These indexes can be used to extract subsets of the data very efficiently and quickly (cf. Kay, 2004). Therefore, multidimensional arrays form the basis for multidimensional online analytical processing (OLAP) tools. The possibility to do ad hoc queries is one of the essential characteristics of OLAP (Karayannidis & Sellis, 2003: 157). In business, data cubes are used for multidimensional queries, for example: how many units of a certain product were sold in a specific period in a specific place? (See, for example, Marchand et al., 2004: 3.)
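To make this index-based extraction concrete, a minimal, hypothetical sketch (not the thesis's own code) that prints one slice of the cube declared in Figure 2.8, namely all populated cells of clause 1, could look as follows:

' Print every populated cell of clause 1, i.e. a one-clause
' slice of the Clause array declared in Figure 2.8.
Sub PrintClauseSlice()
    Dim PhraseIdx As Integer, LayerIdx As Integer
    For PhraseIdx = 1 To 5
        For LayerIdx = 1 To 6
            If Clause(1, PhraseIdx, LayerIdx) <> "" Then
                Debug.Print Clause(1, PhraseIdx, LayerIdx)
            End If
        Next LayerIdx
    Next PhraseIdx
End Sub

Because the indexes encode each cell's position directly, such an ad hoc subset is a matter of two nested loops; no search through the whole data set is needed.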
Some programming languages, such as Visual Basic 6, even allow for the use of arrays with more than three dimensions, which could represent a hypercube of clauses.[28] Such a 4-D cube consists of a series of 3-D data cubes (Kay, 2004). In our application a fourth to sixth dimension could be used to break down clause constituents hierarchically into their smallest parts,[29] for example, the NP et-ha$amayim ve'et ha'arets (Gen. 1:1) consisting of 2 NPs and a conjunction, with the 2 NPs each consisting of an object marker and NP, which again consists of an article and a noun. The higher level attributes, which represent summarized values, are called aggregates, while the lower level attributes are called grouping attributes (Lee et al., 2003: 124). This level of detail, however, falls outside the scope of this thesis.

[27] A data warehouse is a multidimensional analytic database that "links otherwise disparate data items" and "allows for customised user views of the data" (Koutsoukis et al., 1999: 3).
[28] In geometry a hypercube is a basic fourdimensional structure having 16 corners and "consisting" of (bounded by) 8 cubes. A cube, which is a basic threedimensional structure, has 8 corners and "consists" of 6 squares. A square, which is a basic twodimensional object, has 4 corners and "consists" of 4 lines or segments. A line is the segment between two points, a basic onedimensional object with no corners. A point is a zerodimensional object (Banchoff, 1996: 9).
[29] Glasgow & Malton (1994: 31) found that "an array representation scheme provides an effective and efficient means for spatial reasoning" and suggest that more research should be done to test its applicability to other domains including hierarchical worlds.
Although it is very easy to add another dimension (for example, "Public Clause(1 To 3, 1 To 3, 1 To 3, 1 To 3) As String") or even more dimensions in cyber space, it becomes more difficult to visualize these types of data structures. Multidimensional arrays have another downside. With every dimension added the number of memory slots needed increases exponentially; for example, a 3x3 table needs 9 spaces, a 3x3x3 data cube needs 27, and a 3x3x3x3 hypercube needs 81.[30] The more dimensions the hypercube has, the sparser it becomes: more and more cells are empty and this wastes memory and processing time. Although compression techniques do exist to manage the problem of sparsity, they tend to destroy the multidimensional data structure's natural indexing (Kay, 2004).[31]
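A short, hypothetical check of this growth (assuming three members per dimension) makes the pattern explicit:

Sub ShowGrowth()
    ' Cell count for a cube with 3 members per dimension:
    ' 3^2 = 9, 3^3 = 27, 3^4 = 81 memory slots.
    Dim d As Integer
    For d = 2 To 4
        Debug.Print d; "dimensions need"; 3 ^ d; "cells"
    Next d
End Sub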
An alternative to adding more dimensions for hierarchical data, as suggested above, could be to add more members on the depth dimension, allowing measures to occupy more than one cell in each member of the array (cf. Glasgow & Malton, 1994: 7, 13). With reference to linguistic data the typical hierarchical Chomskyan tree structure of the syntactic structure of a clause could be represented by such an array structure (see Figure 2.9).
[30] Cf. Banchoff (1996: 15).
[31] To solve this problem Karayannidis & Sellis (2003: 156-157) proposed a chunk-based storage manager for OLAP data cubes that is both space conservative and uses a location-based data-addressing scheme. This system is also able to capture hierarchical data.
Depth member 1: NP ((1,1,1), (1,1,2), (1,1,3), (1,1,4), (1,1,5), (1,1,6), (1,1,7))
Depth member 2: NP ((1,2,1), (1,2,2), (1,2,3)) | Particle (1,2,4) | NP ((1,2,5), (1,2,6), (1,2,7))
Depth member 3: Particle (1,3,1) | NP ((1,3,2), (1,3,3)) | Particle (1,3,4) | Particle (1,3,5) | NP ((1,3,6), (1,3,7))
Depth member 4: Obj. marker (1,4,1) | Article (1,4,2) | Noun (1,4,3) | Conjunction (1,4,4) | Obj. marker (1,4,5) | Article (1,4,6) | Noun (1,4,7)
Depth member 5: et (1,5,1) | ha- (1,5,2) | $amayim (1,5,3) | ve- (1,5,4) | 'et (1,5,5) | ha- (1,5,6) | 'arets (1,5,7)

Figure 2.9. A representation of a hierarchical syntactic structure using various members of the same dimension and by allowing measures to occupy more than one cell of a member.
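A hypothetical fragment (the array name and bounds are illustrative, not taken from the thesis) shows how such a hierarchy could be populated, with the top NP node written into every cell it dominates:

' Hierarchical tree of Figure 2.9: dimension 2 holds the depth
' members (tree levels), dimension 3 the seven word positions.
Dim Tree(1 To 1, 1 To 5, 1 To 7) As String

Sub BuildTree()
    Dim Pos As Integer
    For Pos = 1 To 7
        Tree(1, 1, Pos) = "NP"   ' top node occupies all seven cells
    Next Pos
    Tree(1, 4, 1) = "Obj. marker"
    Tree(1, 5, 1) = "et"
    ' ... the remaining nodes are filled in the same way
End Sub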
Due to the huge space implications in the computer's memory and the difficulty of visualizing four or more dimensions, this research is restricted to three dimensions. Because it is possible to declare exactly the required number of rows, columns and depth layers of a threedimensional array, enough members can be created on the depth dimension to store all modules of clausal analysis.[32]
Other kinds of technology exist to implement multidimensional databases, such as relational online analytical processing systems (ROLAP), which are collections of cuboids or twodimensional relational tables and do not suffer as much from the sparsity problem, but they do not have implicit indexes (Kay, 2004). Although this technology can be researched to evaluate its suitability for solving the problem of this project, it is expected that, due to the rigorous table structures that are inherent in relational databases, this option does not lend itself as well as multidimensional arrays to capturing and extracting clausal data. According to Koutsoukis et al. (1999: 6)
32 An alternative approach is followed by Koutsoukis et al. (1999: 12) who combine sparse dimensions (year and season) "to create a 'conjoint dimension'".
"MDDBs 33 are better suited for OLAP-type applications because of their structure and embedded functionality". 34
One difference between a business data cube and a clausal data cube is that the first-mentioned contains data that have already been processed and aggregated (Kay, 2004), while a clause cube contains the basic raw data. However, Karayannidis & Sellis (2003: 157) argue that, in order to support ad hoc queries, users should be able to drill down "to the most detailed data in order to compute a result from scratch". A business data cube could, therefore, contain hierarchical data consisting of both raw and aggregated data. Compare Chau et al. (2002: 214): "The contents of a data warehouse may be a replica of part of some source data or they may be the results of preprocessed queries or both". A clause cube that contains hierarchical data, such as syntactic tree-structure information, will be similar to such a business data cube. Another important similarity between a business data cube and a clausal data cube is that both types of data are stable, if one assumes that the process of analysis and tagging has been finalised. The data does not get updated or changed like data in an online transaction processing system. The data cube concept was developed to focus on powerful analysis of business data, rather than on the fast and efficient capturing of transaction data. These characteristics support the hunch that this technology is very suitable for the storing and analysis of clausal data.
2.5 Building and using a multidimensional database for Biblical Hebrew
To build a clause cube one could integrate the results of various computerized clausal analysis systems. The process that one should follow is similar to the steps used for building a data warehouse, i.e. (Chau et al., 2002: 216):
33 Multidimensional database management systems.
34 Compare Cheung et al. (2001: 2) for a summary of the advantages and disadvantages of both ROLAP (relational online analytical processing) and MOLAP (multidimensional online analytical processing) – they propose a combination of the two approaches. For an alternative solution compare Chun et al. (2004).
• Extraction of data from existing databases and flat files
• Cleaning and integration of data
• Loading of data in the data cube or hypercube
• Transformation of data into a format that is suitable for a graphical user interface
Once a proper multidimensional data warehouse has been designed and created it can be populated using data from existing marked-up products: hypertext into hypercube! Products using the mark-up language XML 35 are especially suitable for this purpose because the XML tags can be used to convert free text into a database. "Unlike HTML, XML is meant for storing data, not displaying it" (Holzner, 2004: 40). Using XML to convert existing texts into data sources for a Biblical Hebrew linguistic data warehouse will necessitate cooperation, even more than when using HTML to tag hypertext (see Bulkeley, 2002: 649), especially if the various sources are to be integrated properly.
Combining nested loops with threedimensional arrays makes it possible to process the stored information in an efficient way. For example, it becomes possible to slice-and-dice the data cube of clauses to reveal various dimensions. Slicing the cube from the front reveals the Hebrew text, syntactic frameworks, semantic frameworks, etc. Slicing the cube from the top reveals the multi-layer analyses of subsequent clauses. One can also drill down into the cube to reveal other information that is linked to a specific cell. These aspects will be discussed in detail in the next chapter; a first minimal sketch of slicing follows below.
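As a minimal VB6 sketch of such a slicing operation (the array name, sizes and module order are illustrative assumptions modelled on the experiment above, not the thesis's own code in the addenda), holding the depth index constant at 1 while nested loops run over clauses and phrases prints the Hebrew text:

Public Clause(1 To 3, 1 To 4, 1 To 5) As String   ' (clause, phrase, module)

Public Sub ShowHebrewText()
    Dim c As Integer, p As Integer, row As String
    For c = 1 To 3                            ' every clause (row)
        row = ""
        For p = 1 To 4                        ' every phrase (column)
            row = row & Clause(c, p, 1) & " " ' module 1 = phonetic text
        Next p
        Debug.Print row                       ' one clause per line
    Next c
End Sub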
2.6 Conclusion
This experiment with a threedimensional data structure indicated that a threedimensional array could be used to represent inherently multidimensional linguistic data regarding Biblical Hebrew clauses. Various layers of linguistic
35 XML (eXtensible Mark-up Language) can be regarded as a subset of SGML (Standard Generalized Mark-up Language) (DeRose, 1997: 186, 233, 235).
knowledge can be integrated by stacking various modules of analysis onto each other. On a linguistic level the corresponding elements are linked by means of the phrases constituting the clauses. On a programmatic level they are linked by means of the indexes which are inherent and essential to the array structure.
Storing linguistic data in such a threedimensional data cube enables this data to be used in efficient ways. The captured data can be viewed and manipulated in various ways, for example to create stacks of twodimensional interlinear tables showing required aspects of clauses' data. In this way the threedimensional array facilitates actions that are typical of online analytical data processing and data warehousing. These issues will be discussed in more detail in the following chapter.
An array, as such, cannot be stored permanently on a hard disk. Therefore, a VB6 program module is used in the initial phases of this project to declare and populate the array (see Addendum A). In follow-up phases XML will be discussed as a suitable mark-up technology that can be used to permanently store the linguistic data in a separate databank (cf. Chapter 4). This will not only facilitate the recovery of the data for advanced processing in the current VB6 project (cf. Chapters 5 and 6), but it will also enable the re-use of the data on other platforms, for example in a Java program that graphically visualises the links between the layers of linguistic data (cf. Chapter 7).
Chapter 3 Slicing and dicing the clause cube 36
3.1 Introduction
This chapter suggests a way in which data warehousing concepts, such as slicing and dicing, may be used to reveal various perspectives on the linguistic data stored in a threedimensional clause cube. After a short recapitulation of the concepts discussed in Chapter 2 regarding the concept and creation of a clause cube, various ways of processing the captured information will be illustrated using a micro-text. Slicing is one such analytical technique, which reveals various dimensions of data and their relationships to other dimensions. By using this data warehousing facility the clause cube can be viewed or manipulated to reveal, for example, phrases and clauses, syntactic structures, semantic role frames, or a twodimensional representation of a particular clause's multidimensional analysis in table format.
These techniques will then be applied to the Gen. 1:1-2:3 clause cube, rendered in Addendum A. The source code is available for perusal in Addendum B (see the included CD). An executable program file, "Gen1Version15.exe", is also available on the CD. This program may be run in order to see in detail all the functionalities that are referred to in this chapter.
36 This chapter is a revised version of a paper read at the SASNES2004 conference, Rand Afrikaans University (now University of Johannesburg), August 2004 ("Slicing and dicing cyber cubes of Biblical Hebrew clauses" (Kroeze, 2004b)), and of an article accepted for publication in Handbook of Research on Text and Web Mining Technologies ("Slicing and dicing a linguistic data cube" by Kroeze, Bothma & Matthee, 2008), edited by Min Song, and to be published by Idea Group Inc, USA.
3.2 Using a data cube to integrate complex sets of linguistic data 37
The clauses constituting a text can be analysed linguistically in various ways depending on the chosen perspective of a specific researcher. These different analytical perspectives regarding a collection of clauses can be integrated into a paper-based or word-processing medium as a series of twodimensional tables, where each table represents one clause and its multidimensional analysis.
This concept can be explained with a simplified grammatical paradigm and a very small micro-text consisting of only three sentences (e.g. Gen. 1:1a, 4c and 5a) 38:
• bre$it bara elohim et ha$amayim ve'et ha'arets (in the beginning God created the heaven and the earth)
• vayavdel elohim ben ha'or uven haxo$ex (and God separated the light and the darkness)
• vayikra elohim la'or yom (and God called the light day) 39
An interlinear multidimensional analysis of this text can be done as a series of tables (see Figure 3.1).
37 This section is a short summary of the main ideas that were discussed in Chapter 2 and Kroeze (2004a). Including the most salient points here enables readers to study the chapter as an independent unit.
38 These clauses were chosen because all of them have four phrases and because they represent different syntactic structures. Many of the other clauses have fewer than four phrases, which would imply empty cells. Only one clause in Gen. 1:1-2:3 has five phrases.
39 See Section 3.4 for a discussion of the phonetic transcription used.
Gen. 1:1a:
                       | Phrase 1         | Phrase 2   | Phrase 3 | Phrase 4
Phonetic transcription | bre$it           | bara       | elohim   | et ha$amayim ve'et ha'arets
Literal translation    | in the beginning | he created | God      | the heaven and the earth
Word groups            | PP               | VP         | NP       | NP
Syntactic function     | Adjunct          | Main verb  | Subject  | Object
Semantic function      | Time             | Action     | Agent    | Product

Gen. 1:4c:
                       | Phrase 1         | Phrase 2 | Phrase 3          | Phrase 4
Phonetic transcription | vayavdel         | elohim   | ben ha'or         | uven haxo$ex
Literal translation    | and he separated | God      | between the light | and between the darkness
Word groups            | VP               | NP       | PP                | PP
Syntactic function     | Main verb        | Subject  | Complement        | Complement
Semantic function      | Action           | Agent    | Patient           | Source

Gen. 1:5a:
                       | Phrase 1      | Phrase 2 | Phrase 3     | Phrase 4
Phonetic transcription | vayikra       | elohim   | la'or        | yom
Literal translation    | and he called | God      | to the light | day
Word groups            | VP            | NP       | PP           | NP
Syntactic function     | Main verb     | Subject  | IndObj       | Complement
Semantic function      | Action        | Agent    | Patient      | Product

Figure 3.1. A series of twodimensional tables, each containing a multidimensional linguistic analysis of one clause.
The linguistic modules 40 that are represented here were chosen only to illustrate the concept of an integrated structure of linguistic data, as well as the manipulation thereof, and should not be regarded as comprehensive. In more detailed analyses additional layers of analyses, such as morphology, transliteration 41 and pragmatics could be added. A data cube provides a way in which the results of various divergent research projects may be integrated.
Although such a series of tables can be regarded as a database if it is electronically available, the tables are not combined into a single coherent data structure and they do not allow for flexible analytical operations. Given the advanced ad hoc query possibilities that database management systems facilitate on highly structured data, the ability to perform similar operations on implicitly structured linguistic data becomes attractive. Such queries would be facilitated if all the separate tables could be combined into one complex data structure. This is an example of document processing that "needs database processing for storing and manipulating data" (Kroenke, 2004: 464).
The obvious suggestion for solving this problem would be to use a relational database to capture linguistic data, but there are some prohibitive factors. There are many differences among the structures of clauses, and the result will be a very sparse database (containing many empty fields) if one were to create attributes for all possible syntactic and semantic fields. Even if this could work, an extra field would be needed to capture the word-order position of every phrase. Furthermore, relational database management systems are restricted to two dimensions: "The table in an RDBMS can only ever represent multidimensional data in two dimensions" (Connolly and Begg, 2005: 1209).
40 Cf. Van der Merwe (2002: 89). The term module is preferred here to refer to the different layers of linguistic analysis, because level is used in data cube terminology to refer to the members of a hierarchical dimension (cf. Ponniah, 2001: 360-362).
41 A transliteration is a precise rendering of text written in one alphabet by means of another alphabet. The transcription given in this thesis is a rough phonetic rendering which cannot be used to mechanically reconstruct the Hebrew text.
Closer inspection of the above-mentioned twodimensional clause tables reveals that they actually represent multidimensional data. The various rows of each table do not represent separate records (as is typical of a twodimensional relational database), but deeper modules of analysis, which are related to the data in the first row. A collection of interlinear tables is in fact a twodimensional representation of three- (or multi-)dimensional linguistic data structures. Each table represents one twodimensional "slice" of this threedimensional structure, and the whole collection is a stack of these slices. This insight holds the key to solving the problem of capturing and processing this data.
If the data is essentially multidimensional, the ideal computerised data structure with which to capture it would be a multidimensional database. This type of data structure already exists and is usually employed in businesses' data warehouses to enable multidimensional on-line analytical processing (MOLAP) (cf. Connolly and Begg, 2005: 1209; Ponniah, 2001: 365). Data cubes are used to capture threedimensional data structures and hyper cubes 42 for multidimensional data structures. They are based on threedimensional or multidimensional arrays.
Before the implementation of these concepts in terms of programming is discussed, it should first be made clear how the linguistic data referred to above can indeed be regarded as threedimensional. The knowledge that is represented by a collection of interlinear tables can be conceptualised threedimensionally as a cube similar to the famous "magic" cube toy (Rubik's cube). The cube is subdivided into rows and columns on three dimensions. The sizes of these dimensions, however, do not have to be the same and will be determined by requirements of the unique data set. Each sub-cube is a data-container and can store one piece of information. The information cube therefore consists of a cluster of clauses and their analyses. The horizontal dimension is divided into rows representing the various clauses - each row being a unique record or clause. The vertical dimension is divided into columns and represents the various phrases in the clauses. The depth dimension represents the
42 Cf. Kroenke (2004: 553).
various modules of analysis, for example, phonetic rendering, literal translation, word groups, syntactic functions and semantic functions.
The linguistic data captured in the twodimensional tables of the micro-text above can thus be stored in a threedimensional data-structure in the following way (see Figure 3.2) – cf. Figure 2.6, which is repeated here for easy reference:
Figure 3.2. A threedimensional clause cube.
Such a clausal data cube can be implemented on a computer using a threedimensional array. 43 A threedimensional array is a stack of twodimensional data variables.
43 Compare Chapter 2 and/or Kroeze (2004a) for a detailed discussion on the design and implementation of a clause cube.
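A minimal VB6 sketch may clarify this stacking (the names and sizes are illustrative, assuming the (clause, phrase, module) layout described above, and are not the thesis's own code): every depth member acts as one row of a twodimensional interlinear table, and stacking the five members yields the cube.

Public Clause(1 To 3, 1 To 4, 1 To 5) As String   ' (clause, phrase, module)

Public Sub PopulateFirstPhrase()
    ' The five depth members of clause 1, phrase 1 ("bre$it"):
    Clause(1, 1, 1) = "bre$it"            ' phonetic transcription
    Clause(1, 1, 2) = "in the beginning"  ' literal translation
    Clause(1, 1, 3) = "PP"                ' word group
    Clause(1, 1, 4) = "Adjunct"           ' syntactic function
    Clause(1, 1, 5) = "Time"              ' semantic function
End Sub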
Some programming languages, such as Visual Basic 6, also allow the use of multidimensional arrays (with four or more dimensions), which could represent a hyper cube of clauses, but due to huge space implications for the computer's memory 44 and the difficulty of visualising four or more dimensions, this thesis deals with three dimensions only. Since it is possible to declare the exact number of rows, columns and depth members of a threedimensional array, enough members can be created on the depth dimension to store all modules of clause analyses.
3.3 Processing the information in a clause cube
Combining repetition control structures such as nested loops with threedimensional arrays makes it possible to process the stored information in an efficient manner. Using three- or multidimensional tables to represent abstract data is not only a tool to store information, but also an important intermediate step in creating computerized visualizations of this information (cf. Card et al., 1999: 17, 23). 45 Koutsoukis et al. (1999: 8) differentiate between manipulation and viewing functions performed on multidimensional data. Slicing, rotating and nesting are viewing functions, while drilling-down and rolling-up are manipulation functions. A slice is a twodimensional layer of the data and implies that the dimension which is being sliced, is dropped. To rotate, dice 46 or pivot the cube means to reveal another perspective or view that consists of a different combination of dimensions. Nesting is "to display values from one dimension within another dimension" (Koutsoukis et al., 1999: 8). Drilling-down is the revelation of more detailed data, linked to a specific cell, on the deeper levels of a hierarchical dimension, while rolling-up (or drilling-up, consolidation, aggregation) refers to summarised data on the higher levels of a hierarchical dimension. In this
44 "As the number of dimensions increases, the number of the cube's cells increases exponentially" (Connolly and Begg, 2005: 1209).
45 Compare Chapter 7.
46 Some authors use "slicing-and-dicing" as one concept, while others – like Koutsoukis et al. (1999: 8) here – regard dicing as a synonym for rotation. This chapter uses dicing to indicate the retrieval of subsections of a slice of data.
way the threedimensional array facilitates actions that are typical of data warehousing and on-line analytical processing (OLAP).
In this chapter rotation, slicing and dicing, as well as simple searching functions, will be discussed in more detail. Nesting is probably not applicable to linguistic data, and rolling-up and drilling-down can only be explained by means of hierarchical analyses, such as syntactic tree diagrams. These more complex operations, including searches on more than one parameter and fuzzy searches, as well as the ordering and filtering of the sub-arrays of the clause cube, fall outside the scope of this chapter. Some of these will be explored in Chapter 6.
3.3.1 Rotation
Rotation can be regarded as a computerized version of the human ability to reflect on problem domains from various perspectives. "Different external views can be achieved … by applying rotational transformations to a multidimensional array" (Glasgow & Malton, 1994: 24).
Viewing the clause cube from the front reveals the phonetic representation of the individual sentences of the text. Retrieving these elements can be used to display the phonetic rendering of the text (compare Figures 3.2 and 3.3).
bre$it   | bara   | elohim    | et ha$amayim ve'et ha'arets
vayavdel | elohim | ben ha'or | uven haxo$ex
vayikra  | elohim | la'or     | yom
Figure 3.3. Information revealed on the front side of the clause cube.
If the cube is rotated to show the top side, the first clause's multi-modular analysis is revealed (compare Figures 3.2 and 3.4). The upside-down order is due to the structure and rotation of the cube. A more logical order can be obtained by dicing the separate nuggets of information by means of array processing and presenting it in the required order, or by slicing the cube from the bottom (see below).
Time             | Action     | Agent   | Product
Adjunct          | Main verb  | Subject | Object
PP               | VP         | NP      | NP
in the beginning | he created | God     | the heaven and the earth
bre$it           | bara       | elohim  | et ha$amayim ve'et ha'arets
Figure 3.4. Information revealed on the top side of the clause cube.
Similarly, rotating the cube to display the original bottom side as the front side will reveal the last clause's multi-modular analysis (see Figure 3.5). This time the information is presented in an expected, logical order.
vayikra       | elohim  | la'or        | yom
and he called | God     | to the light | day
VP            | NP      | PP           | NP
Main verb     | Subject | IndObj       | Complement
Action        | Agent   | Patient      | Product
Figure 3.5. Information revealed on the bottom side of the clause cube.
Looking at the original right side of the cube does not, however, reveal any meaningful perspective (unless the researcher wants to focus, for some reason, on the last constituent of each clause, for example in a study on word order) (compare Figures 3.2 and 3.6).
et ha$amayim ve'et ha'arets | the heaven and the earth | NP | Object     | Product
uven haxo$ex                | and between the darkness | PP | Complement | Source
yom                         | day                      | NP | Complement | Product
Figure 3.6. Information revealed on the right side of the clause cube.
The original left side is similar, but reveals data about the first element of each clause. This information could be used for studies in pragmatics on the fronting of clausal elements serving as a topic or focus (see Figure 3.7).
Time   | Adjunct   | PP | in the beginning | bre$it
Action | Main verb | VP | and he separated | vayavdel
Action | Main verb | VP | and he called    | vayikra
Figure 3.7. Information revealed on the left side of the clause cube.
The original back side is again very meaningful, from a semantic perspective, because it reveals the combinations of semantic functions per clause. This information can be used in a study on semantic frameworks, for example, to construct an ontological dictionary such as WORDNET or WORDNET++ (cf. Dehne et al., 2001), and to create a conceptual data model by the COLOR-X method (cf. Dehne et al., 2000). Rotating the cube from its original position in a clockwise manner to see the original back side reveals the semantic role frameworks of the clauses (see Figure 3.8), with, however, the hind part foremost.
Product | Agent   | Action | Time
Source  | Patient | Agent  | Action
Product | Patient | Agent  | Action
Figure 3.8. Information revealed by rotating the clause cube 180 degrees in a clockwise manner.
Rotating it head over heels reveals the same information but in a different, upside down, order (see Figure 3.9). The correct order can be revealed by slicing (see below).
Action | Agent  | Patient | Product
Action | Agent  | Patient | Source
Time   | Action | Agent   | Product
Figure 3.9. Information revealed by rotating the clause cube 180 degrees head over heels.
3.3.2 Slicing
Rotation is a relatively easy way to demonstrate some of the various perspectives that a researcher can glean from a multidimensional data set. However, the last two examples illustrate the fact that rotation can be confusing because the ordering of constituents differs: top can become bottom, left can become right, et cetera, depending on the manner in which the cube is spun. Slicing is better in this regard because a meaningful, easy-to-understand plane can be chosen and all the records can be viewed in the same order.
The clause cube shown in Figure 3.2 could, for example, be sliced from the top to show the three clauses' multi-modular analyses, which brings one back to where this chapter started, namely the twodimensional representation 47 of multi-modular clausal data (although in a different order of presentation when left in the default data cube ordering) (see Figures 3.10 – 3.12).
Figure 3.10. The top slice of the data cube, revealing the multi-modular analysis of the first clause.
47 Cf. Kroenke (2005: 178-179), who discusses twodimensional projections of three dimensions of student data.
Figure 3.11. The middle slice of the data cube, revealing the multi-modular analysis of the second clause.
Figure 3.12. The bottom slice of the data cube, revealing the multi-modular analysis of the third clause.
A slice is a "twodimensional plane of the cube" (Ponniah, 2001: 362). The designer of the graphical interface for the output of a slicing or dicing operation actually has the freedom to place data elements wherever they appear most user-friendly. They do not have to be displayed in a fixed and rigid order that represents their position in the data cube.
It would be very easy to change the order of the rows in these slices to a more user-friendly version of the display, showing the phonetic rendering in the top row and the semantic functions in the bottom row. As indicated above, this option is only one of many possibilities offered by the clause cube.
Another advantage of slicing is that it can reveal the elements inside the cube that cannot be seen by rotating it (like the multidimensional analysis of the second clause of Figure 3.11 revealed above). In larger cubes containing hundreds or thousands of clausal analyses, a large number of constituents will be hidden inside the cube. The more members each dimension has, the more data will be out of direct sight.
Slicing can also be used to reveal a specific, required perspective that is hidden inside the cube. Say, for example, a researcher wants to see all the syntactic frameworks of the micro-text. Even in the simple 4x3x5 cube of Figure 3.2 this perspective cannot be acquired by looking at the six outer sides of the cube. One can only see the syntactic frameworks of the first and last clauses, which would not be satisfactory had the clause cube contained many clauses. However, this perspective can be obtained by slicing off the first three planes of the front side and looking at the fourth layer to reveal the syntactic frameworks of all the clauses in the cube (see Figure 3.13).
Adjunct   | Main verb | Subject    | Object
Main verb | Subject   | Complement | Complement
Main verb | Subject   | IndObj     | Complement
Figure 3.13. Information revealed by slicing off the first three planes from the front side of the clause cube.
Similarly, slicing off four layers from the front will reveal the semantic function frameworks of all clauses in the cube (see Figure 3.14).
Time   | Action | Agent   | Product
Action | Agent  | Patient | Source
Action | Agent  | Patient | Product
Figure 3.14. Information revealed by slicing off the first four planes from the front side of the clause cube.
Slicing off the first two layers will reveal all the combinations of word groups, which may be relevant for a morpho-syntactic study (see Figure 3.15).
PP | VP | NP | NP
VP | NP | PP | PP
VP | NP | PP | NP
Figure 3.15. Information revealed by slicing off the first two planes from the front side of the clause cube.
Slicing off only the first layer reveals the literal translation of the text (see Figure 3.16).
in the beginning | he created | God               | the heaven and the earth
and he separated | God        | between the light | and between the darkness
and he called    | God        | to the light      | day
Figure 3.16. Information revealed by slicing off the first plane from the front side of the clause cube.
It should already be clear by now that a multidimensional data structure provides much more versatility in data viewing and manipulation functions than mere twodimensional tables. Slicing is not only more flexible and satisfactory than rotating, but is also closer to the manner in which a computer processes a threedimensional array. There is, of course, not a real cube that can be rotated inside the computer's memory. 48 But there are millions of memory spaces that can be numbered and filled and called up in any required order. Any slice can be acquired relatively easily by using a repetition control structure (for-loop) containing the specific number that represents the required slice as a constant index in the array reference. 49
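The following minimal VB6 sketch illustrates the point (the names are illustrative, assuming the Genesis 1:1-2:3 cube of 108 clauses with up to five phrases and the module order used above; the thesis's actual code is in Addendum B). The slice number is passed in and held constant while the loops run over the other two dimensions:

Public Clause(1 To 108, 1 To 5, 1 To 6) As String   ' (clause, phrase, module)

Public Sub ShowDepthSlice(moduleIndex As Integer)
    Dim c As Integer, p As Integer, row As String
    For c = 1 To 108
        row = ""
        For p = 1 To 5
            row = row & Clause(c, p, moduleIndex) & " | "
        Next p
        Debug.Print row   ' e.g. ShowDepthSlice 4 prints all syntactic frameworks
    Next c
End Sub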
Slicing can also be used as an alternative to rotation: slicing off and viewing the external layer on every side of the cube is the equivalent of rotating the cube. This is exactly how "rotation" is implemented in a threedimensional array in the computer's memory. Valuable slicing options in this problem space are slicing the cube from the front to reveal the Hebrew text (phonetically), literal translation, word group combinations, syntactic frameworks and semantic frameworks; and slicing the cube from the top to reveal multi-modular analyses of subsequent clauses. Slicing from the sides may be valuable in studies on word order and pragmatics.
48 A cube is a "conceptual representation of multidimensional data .... A MOLAP system stores data in an MDBMS, using propriety matrix and array technology to simulate this multidimensional cube" (Rob & Coronel, 2004: 587).
49 "[T]he dimension(s) that are held constant in a cube are called slices" (Kroenke, 2004: 554).
3.3.3 Dicing
In this thesis the term dicing is used to indicate the subdivision of data slices into smaller pieces. Dicing can be used to retrieve very specific required data. One could, for example, retrieve only syntactic functions and their related semantic functions in order to study the mapping of these linguistic modules (see the sketch below). In the micro-text above one would probably discover that the semantic function of patient may be mapped on either the syntactic function of complement or that of indirect object. Dicing may also be used to reorder a set of related data into a logical order on the user interface in order to present user-friendly information. 50 In fact, slicing is actually also acquired by means of iterative sets of dicing. Dicing requires knowledge of the structure of the data cube (implemented as a threedimensional array) because there is a very strict correspondence between the array index and the clause number, phrase number and language module.
50 The elements in the sub-arrays of one dimension can be ordered according to a specific parameter to reveal interesting patterns (Choong et al., 2003). Reordered representations can be used to easily spot syntactic constructions containing peculiar combinations such as the so-called "double accusative" (one construction containing two distinct complements) (cf. Gesenius et al., 1976: 370).
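A minimal VB6 sketch of such a dicing operation (illustrative names and indexes, assuming syntactic functions on depth member 4 and semantic functions on member 5, as in the slices above) retrieves only the syntactic-semantic pairs:

Public Sub ShowSynSemPairs()
    ' Assumes the Clause array declared in the previous sketch.
    Dim c As Integer, p As Integer
    For c = 1 To 108
        For p = 1 To 5
            If Clause(c, p, 4) <> "" Then   ' skip empty phrase slots
                Debug.Print Clause(c, p, 4) & " <- " & Clause(c, p, 5)
            End If
        Next p
    Next c
End Sub

Output lines such as "Complement <- Patient" and "IndObj <- Patient" would reveal the kind of mapping pattern mentioned above.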
3.3.4 Searching
Simple search functions can be used to look up clauses or phrases. If a specific clause's array index (which acts as a candidate key) is known, one can use it to search for the clause; for example, if one knows that one wants to look at the fiftieth clause in the databank, one could use a loop to display array elements Clause (50,1,1) – Clause (50,5,6). One can also search for examples of specific elements, such as rare syntactic or semantic functions. When a function has to search through the whole multidimensional array to find all possible matches, execution of the program must be paused after each hit to allow the user to study a relevant example before moving on to the next one. Although this may not be an optimised solution, one should remember that the research environment differs from the production environment. The functionality could be made more elegant and efficient for such a purpose.
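A minimal VB6 sketch of such a search (illustrative names; semantic functions are again assumed on depth member 5) pauses with a message box after each hit:

Public Sub FindSemanticFunction(target As String)
    ' Assumes the Clause array declared in the earlier sketch.
    Dim c As Integer, p As Integer
    For c = 1 To 108
        For p = 1 To 5
            If Clause(c, p, 5) = target Then
                ' MsgBox halts execution so the user can study each example.
                MsgBox "Clause " & c & ", phrase " & p & ": " & _
                       Clause(c, p, 1) & " (" & Clause(c, p, 2) & ")"
            End If
        Next p
    Next c
End Sub

A call such as FindSemanticFunction "Reason" would, for instance, stop at the embedded predication in Gen. 2:3b discussed in Section 3.4.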
3.4 Application: slicing and dicing the Genesis 1:1-2:3 clause cube 51
The principles discussed above were applied to the Hebrew text of Genesis 1:1-2:3. The program was created in VB6. The databank was included as a module in the program and consists of a clause cube comprising the analyses of all clauses containing a main verb in Genesis 1:1-2:3, as done by the author (see Chapter 2 and Addendum A). The linguistic modules that were analysed are:
• Phonetic transcription of phrases 52 (compare Addendum C)
• Literal translation of phrases
• Identification of phrase types (compare Addendum D)
• Syntactic functions of phrases (compare Addendum E)
• Semantic functions of phrases (based on Dik, 1997a, 1997b; compare Addendum F).
These analyses were done by the author, based on his personalised and tacit knowledge of Biblical Hebrew grammar, summarised in Addenda C – F (cf. Kroeze, 2000a and 2000b). Not everybody will necessarily agree with these categories and analyses; however, the analysis itself is not the main focus of this thesis. The primary goal is to illustrate how integrated existing knowledge can be retrieved in various informative ways.
51 This section again builds on the clause cube concepts and implementation discussed in Chapter 2 and Kroeze (2004a). The essential ideas of the construction and structure of the clause cube are repeated here to facilitate the understanding of the analytic operations performed on the data.
52 It should be possible to use Hebrew characters by means of Unicode because Visual Basic 6 uses Unicode to represent character strings. A phonetic transcription, however, makes this study more accessible for a wider audience. The same ideas could be applied in any language, and knowledge of Hebrew writing should not be a prerequisite for participating in the academic debate on the validity of this concept. An alternative could be to use the Westminster or Michigan-Claremont transliteration (see Groves, 1989: 65).
Embedded clauses have been indicated as a unit in the main clause and separately analysed in a subsequent row. 53 Embedded phrases containing an infinitive or participle have not been analysed in more detail.
The number of columns on the vertical dimension had to be enlarged to five to facilitate the analysis of a clause with five phrases in the rest of the data set (Gen. 1:17a-18a). No clause in the data set had more than five phrases. Provision was also made to capture the unique verse number of each clause, e.g. Gen01v01a, as a user-friendly primary key.
The viewing and manipulation processes performed on the Genesis 1:1-2:3 clause cube reveal that it is not only possible to view the stored data in a typical interlinear manner, but that any meaningful perspective on the data can be acquired relatively easily. Once the data has been captured in a data structure that represents its natural multidimensionality, 54 basically any query can be answered by using array-processing functions. 55
Below follow a few examples (screen shots) of the perspectives that are facilitated by slicing the Genesis 1:1-2:3 clause cube (see Figures 3.17 – 3.20).
53 Instead, a fourth dimension could have been used to capture and represent data of embedded clauses. However, it has been decided to view them as separate clauses, in order to keep the conceptualisation simpler and to minimise sparsity (empty elements in the multidimensional array).
54 "Multidimensional structures are best visualized as cubes of data, and cubes within cubes of data" (Connolly and Begg, 2005: 1209).
55 A well-designed data cube "obviates the need for multi-table joins and provides quick and direct access to arrays of data, thus significantly speeding up execution of multidimensional queries" (Connolly and Begg, 2005: 1211).
Figure 3.17 shows a slice of the cube that reveals the multi-modular analysis of Gen. 1:17a-18a (one clause spanning two verses). The user can scroll forward or backward through the stack of twodimensional analyses to study any clause's multi-modular analysis. If the clause number is known, it can be used to display that clause directly. The verse number can also be used to access data directly. Vertical scroll bars are activated in some cells to enable the user to see all the text recorded when the window is too small to show it all at once.
Figure 3.17. A slice of the Genesis 1:1-2:3 clause cube that reveals the multi-modular analysis of Gen. 1:17a-18a (one clause spanning two verses).
Figure 3.18 shows that the clause cube can be searched on a specific parameter. For example, if one wants to find an example of the semantic function of reason, a search through the threedimensional array will find and display the multi-modular analysis of Gen. 2:3b, indicating embedded clauses in verses 3c-3d. Reason is an embedded predication, in this case an embedded clause cluster (ECC), indicated by square brackets in the phonetic transcription and literal translation. 56 The embedded predications are then analysed in more detail as separately numbered clauses. The clause number (106 in this example) is displayed in a textbox at the top of the screen; one can also type any clause number (1-108) in the same textbox and click on the "Show clause detail" button to view the required clause's analysis.
Figure 3.18. The clause cube searched on a specific parameter.
56 As discussed in Chapter 6, there are still some inconsistencies in the databank, for example the level of detail of embedded clauses and embedded clause clusters rendered in the various language modules, which should be addressed in follow-up work.
In Figure 3.19 the "Scroll through slice of syntactic frameworks" button is used to scroll through the slice that reveals the syntactic structures of all 108 clauses in the cube (six per screen). See Addendum E for a detailed discussion of the syntactic theory used. 57
Figure 3.19. Using the "Scroll through slice of syntactic frameworks" button.
57 Copula-predicate is a synonym for the complement of a copula (i.a. a copulative verb) (Du Plessis, 1982: 85-86). The copula is often omitted in Biblical Hebrew. The whole predicate then consists of the copula-predicate, which may be expressed by a noun phrase, adjective phrase, participle phrase, adverb phrase or preposition phrase.
In Figure 3.20 the "Scroll through slice of semantic frameworks" button is used to scroll through the slice that reveals the combinations of semantic functions in all 108 clauses in the cube (six per screen).
Figure 3.20. Using the "Scroll through slice of semantic frameworks" button.
3.5 A comparison of data-cube concepts and clause-cube concepts
Many of the ideas and concepts used in this project have been borrowed and adapted from the theories regarding data warehouses, data marts and online analytical processing. Although the same basic ideas and technologies are used, it is, however, not exactly the same thing. In Table 3.1 below some salient business data-warehousing concepts (cf. Kudyba & Hoptroff, 2002: 5; Bellatreche, Karlapalem & Mohania, 2002: 25; Rajagopalan & Krovi, 2002: 77; Nazem & Shin, 2002: 108; Davidson, 2002: 114, 117, 123-127, 132-133; Cavero, Marcos, Piattini & Sanchez, 2002: 185-188; Viktor & Du Plooy, 2002: 198; Abramovicz, Kalczynski & Wecel, 2002: 207-210; Jung & Winter, 2002: 221; Gopalkrishnan & Karlapalem, 2002: 243, 255; Ng & Levene, 2002: 285-286; Data warehouse, 2007; Data mart, 2007; Online analytical processing, 2007) will be compared with their adapted meanings in this thesis.
Each concept below is defined first in the business realm and then in its adapted meaning in the linguistics realm.

Definition
  Business realm: Data warehousing is the process of consolidating related business data, thus revealing patterns about specific business events and objects related to time periods, in order to facilitate strategic decisions.
  Linguistics realm: In a data warehouse of linguistic data (e.g. a clause cube) the analyses of various language modules are consolidated in one data structure in order to facilitate the exploration of patterns in and across the interrelated levels.

Subject-oriented
  Business realm: All data related to a specific business event is collected and consolidated.
  Linguistics realm: All (required) data related to the units of a specific text is collected and consolidated.

Time variant
  Business realm: The data is processed to reveal patterns related to periods of time.
  Linguistics realm: The data is processed to reveal hidden patterns in the linguistic data, but these are not related to time.

Non-volatile
  Business realm: The data is not updated dynamically but is static and read-only.
  Linguistics realm: After the construction of the clause cube the data is usually not updated, but CRUD facilities may be provided to correct analyses or add more data.

Integrated
  Business realm: Data is gleaned from all related business operations.
  Linguistics realm: Linguistic units are analysed on various linguistic modules, or existing analyses could be integrated from various sources.

Architecture
  Business realm: A data mart is a subset of a data warehouse. The relational or normalised approach uses tables that are joined using primary and foreign keys. The multidimensional approach uses a separate data structure to capture pre-joined data.
  Linguistics realm: If various data structures were used to capture information from various linguistic modules, these could be regarded as data marts. In this project, however, only one multidimensional data structure is used to pre-join related data.

Storage
  Business realm: A data warehouse stores huge amounts of business data that has been reformatted to enhance analysis and retrieval. Data should be stored in a format that enables flexible, advanced processing and querying.
  Linguistics realm: The clause cube could store huge amounts of linguistic data that has been (re)formatted to enhance analysis and retrieval. In this experiment a short text was used; it was, however, analysed on five levels. The linguistic data is grouped per phrase, but this could be broken down to smaller units to enhance flexibility.

Advantages
  Business realm: OLAP facilitates the discovery of patterns in business data to prompt strategic decisions and improve customer relationship management.
  Linguistics realm: "LOLAP" (linguistic OLAP) facilitates the discovery of patterns in linguistic data to test or prompt linguistic hypotheses.

Disadvantages
  Business realm: The preparation of data and the building of a data warehouse are time-consuming and expensive. It is difficult to integrate data from various sources in different formats and platforms.
  Linguistics realm: The preparation of data and the building of a clause cube are time-consuming, especially if the construction of the clause cube is to be done manually. 58 Integrating existing data would be challenging since the underlying linguistic theories differ considerably.

58 As an alternative one could consider automatic annotation of texts. However, "[e]xtracting more advanced types of semantic information, for example, types of events (to say nothing about determining semantic arguments, 'case roles' in AI terminology), is not quite within the current information extraction capabilities, though work in this direction is ongoing" (Java et al., 2007: 51).

Table 3.1. A comparison of data-warehousing concepts in the business and linguistic realms.

Using data-warehousing concepts as metaphors for a multidimensional linguistic databank may have implications that "extend beyond technology design questions", because it may limit researchers' creative thinking by superimposing expectations and roles that are typical of the business environment on a humanities realm; therefore, it may be worthwhile to explore additional perspectives, such as data libraries, to complement strategies to integrate linguistic data (cf. Davidson, 2002: 115, 133). The quality of the information gleaned from integrated databanks and the results of data mining these repositories also depend on the extent to which "the social context of the work of data capturing" is taken into account (Viktor & Du Plooy, 2002: 203). Unfortunately, these issues and research opportunities fall outside the scope of this thesis.

3.6 Conclusion
A multidimensional clause cube can facilitate the linguistic analysis with which any exegetical process should commence, which in turn can benefit a multidimensional approach to biblical exegesis (cf. Van der Merwe, 2002: 94). It also facilitates a format in which the biblical text is processed for readers, that is "succinctly enough to be handled by the short-term memory", thus enhancing the success of the communication process (ibid.).
The Genesis 1:1-2:3 clause cube illustrated that linguistic data stored in a data cube can be viewed and manipulated with multidimensional array processing to answer a vast number of queries about the data and relationships between data on various linguistic levels. This implies that linguistic data have been transformed into information, which can again be used to facilitate knowledge acquisition and sharing.
In the following chapter a more elegant solution for the permanent storage of the databank will be investigated, using XML technology. This databank will have to satisfy the requirement of being round-tripped, i.e. imported into the VB6 program and exported to an external storage medium (XML file). It will also have to facilitate advanced processing and visualisation of the networks of linguistic data.
Chapter 4 Building and displaying the clause cube using XML 59
4.1 Introduction
The text of the Hebrew Bible is analysed from different linguistic disciplines, such as phonology, morphology, morpho-syntax, syntax, semantics, etc. It is even possible, and very helpful, to integrate these contributions using an interlinear format or table structure. A whole Bible book can, for example, be analysed clause by clause, indicating the various analyses in a collection of interlinear tables. Although this makes perfect sense for someone who studies the work in a linear fashion, it does not facilitate advanced research into linguistic structures and other phenomena. If the data could be transferred into a proper electronic database, one could create a database management system to view and manipulate the data according to the needs of linguists and exegetes.
Although the interlinear tables already resemble the tables in a relational database very closely, there is one important difference: each record or clause is represented by a unique table, while records in a relational database table are similar rows in one table, all with the same structure. A typical relational database table for capturing linguistic analyses could use syntactic functions as the names of attributes or fields. Each clause could then be a row and its elements rearranged and categorised accordingly. However, one will need a large number of columns to capture all possible syntactic functions, many of which will contain null values because the structures of sentences vary significantly. Furthermore, for every language module that is added to the data store one will have to add another set of columns, aggravating the sparsity problem even further. Alternatively, one could use a parallel table linked by unique keys or references. To extract the related data one would have to use joins to collect the data from the various tables. This implementation will also lead to much redundancy, since the words or phrases will have to be repeated in each table.
59 This chapter is a revised version of a paper read at the Israeli Seminar on Computational Linguistics (ISCOL), Haifa, Israel, 29 June 2006 ("Building and displaying a Biblical Hebrew linguistics data cube using XML" (see Kroeze, 2006)).
If one takes the word groups of the clauses as a starting point to structure the database and store data such as NP, subject, agent, etc. as attribute values, the structure problem is solved to a large extent, since each clause contains only a limited number of phrases (a maximum of five per clause in Genesis 1:1-2:3). The problem of redundancy and sparsity is minimised by using a threedimensional data cube instead of a simple twodimensional table. All the records or clauses and their linguistic analyses can then be combined into this single data structure containing more than two dimensions or a "data cube".
Such a language-oriented, multidimensional database of the linguistic characteristics of the Hebrew text of the Old Testament can enable researchers to do ad hoc queries. For example, a researcher may want to do a specific search in order to find good examples of a certain syntactic structure, or to explore the mapping of semantic functions onto syntactic functions. Once the data is stored in a properly structured database, this type of query becomes executable.
XML, a subset of SGML, is a suitable technology for transforming free text into a database. "There is a growing need to annotate a text or a whole corpus according to multiple information levels, especially in the field of linguistics. Language data are provided with SGML-based markup encoding phonological, morphological, syntactic, semantic, and pragmatic structure analyses" (Witt et al., 2005: 103). In such an XML implementation a clause's word order can be kept intact, while other features such as syntactic and semantic functions can be marked as elements or attributes. The elements or attributes from the XML "database" can be accessed and processed by a third generation programming language, such as Visual Basic 6 (VB6). A threedimensional array is probably the most effective programming tool for processing the data (see Chapters 2, 3 and 6). An alternative option could be the use of an XML query language (cf. Bourret, 2003; Deutsch et al., 1999).
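To make this concrete, a clause could be marked up along the following lines. This is only a sketch: the element and attribute names (clause, phrase, phon, trans, type, syn, sem) are illustrative assumptions and not necessarily the tag set developed later in this chapter. Note how the document order of the phrase elements preserves the clause's word order, while the analyses travel along as attributes:

<clause id="Gen01v01a">
  <phrase type="PP" syn="Adjunct" sem="Time">
    <phon>bre$it</phon>
    <trans>in the beginning</trans>
  </phrase>
  <phrase type="VP" syn="Main verb" sem="Action">
    <phon>bara</phon>
    <trans>he created</trans>
  </phrase>
  <phrase type="NP" syn="Subject" sem="Agent">
    <phon>elohim</phon>
    <trans>God</trans>
  </phrase>
  <phrase type="NP" syn="Object" sem="Product">
    <phon>et ha$amayim ve'et ha'arets</phon>
    <trans>the heaven and the earth</trans>
  </phrase>
</clause>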
This chapter will focus on the following aspects:
• Why should XML be explored as an option to build an exploitable database of linguistic data? (See Section 4.5.)
• How can XML be used to build an exploitable linguistic data cube? (See Section 4.6.)
• How can XML represent the syntactic and semantic analyses of free text? (See Section 4.7.)
• How can XML represent inherently multidimensional data? (See Section 4.8.)
However, before these questions can be answered, it is necessary to provide some background on linguistic databases and computational linguistics in general, as well as on the various linguistic layers that could be analysed and the basic building blocks that form the backbone of such a database (see sections 4.2-4.4).
4.2 Linguistic databases and computational linguistics
Researchers who study natural language processing (NLP) may wonder if a project that studies the use of XML to develop a databank of linguistic data should be regarded as proper computational linguistics, since it cannot understand, create or translate human language. However, it should be remembered that, according to Wintner (2004: 113), computational linguistics does not only include "the application of various techniques and results from linguistics to computer science" (NLP), but also "the application of various techniques and results from computer science to linguistics, in order to investigate such fundamental problems as what people know when they know a natural language, what they do when they use this knowledge, and how they acquire this knowledge in the first place". A linguistic database 60 captures and manipulates human knowledge of language, thus focusing on the first one of these basic issues in the second category (what people know about a language). This part of computational linguistics could perhaps be called natural language information systems (NLIS) because it is similar to the application of information technology to business data, studied in the Information Systems discipline, of which databases form an integral part. NLIS can improve the storage, extraction, manipulation and exploration of linguistic data. It is, however, not only an end in itself, since tagged corpora are also needed as tools to train natural language processing systems (Wintner, 2004: 131).
60 The adjective linguistic in the term linguistic database here refers to the linguistic content of the database. Jeong & Yoon (2001) use the same term, but apparently refer to the textual design of the database itself, regardless of the content. However, they do not supply a clear definition for the term. It could also refer to their proposed manipulation language. Other authors, such as Buneman et al. (2002: 480), use the term to refer to the content of the database as it is done in this thesis. Petersen (1999: 10) uses the term "text databases" for databases that store texts together with linguistic analyses of them (that is, expounded text vs. text-dominated databases that are composed mainly by means of characters).
Knowledge representation of human language, of which the tagging of documents is a part, is an interdisciplinary methodology that combines the logic and ontology of linguistics with computation (Unsworth, 2001). 61 Like databases, mark-up is a substitute or surrogate of something else (in this case the covertly structured text), 62 which enables the researcher to make his/her assumptions explicit, to test these hypotheses and to derive conclusions from it (cf. ibid.). The names of the tags, attributes and elements used for the mark-up reflect the researcher's "set of ontological commitments" (cf. ibid.). Since any knowledge representation is a fragmentary theory of intelligent reasoning, it should be accepted that no knowledge representation system can capture all the forms of "intelligent reasoning about a literary text" (cf. ibid.).
This study is limited to the study of word groups, syntactic and semantic functions, excluding other perspectives such as morphology and pragmatics. A simplified version of the semantic functions, according to the functional grammar theory of SC Dik (1997a, 1997b) was used for the semantic analysis. Equally simple systems, compiled by the author, were used for the word-group and syntactic analyses. The reader’s own views may differ from the analyses given here, but it should be kept in linguistic analyses of it (that is, expounded text vs. text-dominated databases that are composed mainly by means of characters). 61
Compare Huitfeldt's (2004) opinion that the semantic web lies "at the intersection of markup technology and knowledge representation".
62
"Semiotic and linguistic forms are incoherent because they have to be marked in order to be perceived at all" (McGann, 2003: 5).
mind that the main focus of this project is not defining a linguistic theory, but rather illustrating the digital storage and processing of text analyses. Any other linguistic system may be used as the theory underlying the analysis and tagging.
4.3 Linguistic layers
Witt (2002) suggests that various levels of linguistic data could be annotated in separate document grammars, which can be integrated via computer programs. He proposes i.a. morphology, syntax and semantics as levels to be annotated: "For the annotation of linguistic data this [i.e. a single level of annotation - JHK] could be e.g. the level of morphology, the level of syllable structures, a level of syntactic categories (e.g. noun, verb), a level of syntactic functions (e.g. subject, object), or a level of semantic roles (e.g. agent, instrument)."
In a later article, Witt (2005: 57) differentiates between linguistic levels and layers. Levels refer to divergent logical units such as text layout versus linguistic analyses, and layers or tiers refer to the various possibilities on one level (for example, syntactic and semantic functions, which are structures that order the text hierarchically). In this study the terms layers or modules are also used to refer to the various perspectives of syntax, semantics, etc. (cf. Chapters 2 and 3). However, the distinction between level and layer is not strictly maintained in references to other authors' work, where the terms are used as synonyms. T. Sasaki (2004: 22), for example, uses the term level to refer to various linguistic annotations of text, i.e. syntactic, morpho-syntactic, lexical and morphological annotation. It should, however, not cause much misunderstanding, since this study focuses only on one "logical unit", the linguistic analyses, while the verse numbers are only used for primary keys and referencing.
Furthermore, the reader should note that linguists do not necessarily use the names of language modules in exactly the same way. For example, Witt's syntactic categories are the same as Sasaki's morpho-syntactic categories (part-of-speech tagging), while morpho-syntax is used in the current study to refer to word groups.
The use of these terms is theory-bound and the user of a linguistic database should make sure that he/she knows the specific definitions used in a particular implementation. (See Addenda C – F for an overview of the phonetic transcription system, and taxonomies of phrase types, syntactic functions and semantic functions used in this study.)
4.4 The phrase as basic building block of the database structure

The problems of redundancy and sparsity were discussed above, and it was indicated that using the phrase as the basic building block of a clause cube's structure may minimise these problems. This solution is discussed in more detail in this section.
Witt (2002) proposes that linguistic database creators use the basic written text, which he calls the primary data, as the link between the layers: "… when designing the document grammar it is necessary to consider that the primary data is the link between all layers of annotation". The simplest way to deal with such an implementation is to mark up the various layers of linguistic analysis in separate documents, using the primary data to interrelate the information contained in these documents. Even if the information of all analysed layers is merged into one data structure, such as a data cube, it is still logical to use the basic text (divided into words or phrases) as the basic elements to which all other layers are related.
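To illustrate this principle, the following hypothetical sketch (the element names are invented for the example and are not those of the clause cube) shows how two separate annotation documents could be related through identical primary data:

Document 1 (syntactic functions):
   <synf type="Subject">elohim</synf>

Document 2 (semantic functions):
   <semf type="Agent">elohim</semf>

Because both documents repeat the same primary data (elohim), the two layers can be interrelated without an explicit key.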
Depending on the characteristics of the layers to be annotated, one should decide whether to use letters, words, phrases, etc. as the reference units. Compare Witt (2005: 65, 70, 72): "… in larger text single words could serve as the reference units" (as opposed to single letters in smaller text). For example, in a project that aims to study morphological analysis it would be necessary to use characters as the smallest units (Bayerl et al., 2003: 165). In this project phrases or word groups are used as the unit of reference. 63 It is, however, important to note that annotations that use different units of reference cannot easily be integrated if the text is used as the primary data (the "implicit link" between the layers). This could be solved by numbering the smallest units to be analysed and by referring to various combinations of these numbers for the divergent layers of analysis (compare Petersen, 1999: 13-14), as sketched below. 64 Although different solutions have been researched for the representation of divergent linguistic analyses, "[t]he annotation of multiple hierarchies with SGML-based markup systems is still one of the fundamental problems of text-technological research" (Witt et al., 2005: 103). Although this is not a problem in the experiment of this project, it would have to be researched if one had to integrate a word group-based analysis with other studies based on letters, morphemes, words or other different units of structure. Compare, for example, Petersen (2004b), who uses words in their original order as the basic units of reference in his textual database. He does, however, add a numbering system to facilitate the mapping of non-congruent linguistic layers.

63 See the parallel discussion in Chapter 2 (2.3, 2.6) where the same concepts are discussed in terms of array technology. In this chapter the focus is on the implementation in an annotated XML databank.
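The numbering solution may be sketched as follows (a hypothetical illustration, not Petersen's own notation): if the words of Genesis 1:1 are numbered 1 (bre$it), 2 (bara), 3 (elohim), 4 (et), 5 (ha$amayim), 6 (ve'et), 7 (ha'arets), a word-based layer can refer to the single monad {4}, a phrase-based layer can analyse the monad set {4-7} as one object phrase, and a clause-based layer can refer to {1-7}, so that layers with divergent units of reference can still be mapped onto the same primary data.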
4.5 Why should XML be explored as an option to build an exploitable database of linguistic data?

The sections above have clearly indicated why it is desirable to build a linguistic database for capturing data regarding the various linguistic layers of text, using the phrase as a basic unit of structure. The ideal solution is to keep the database separate from and independent of the program(s) that operate on it, in order to avoid structural dependence and data dependence. Structural dependence refers to the situation where a change in the structure of the databank requires all access programs to be adapted, while data dependence refers to a "condition in which data representation and manipulation are dependent on the physical data storage characteristics" (Rob & Coronel, 2007: 15, 640, 652). Therefore, it is not ideal to implement the databank as a module within the VB6 program (as was done in Chapters 2 and 3).
64 The basic elements (for example, letters or words) are numbered in order of appearance using integers called monads (Petersen, 1999: 13).
This section focuses on the choice of XML to implement a structure-independent and data-independent solution. Storing the clause-cube data in a separate, platform-independent XML file will make the data available to be used and reused by various access programs. If the structure or content of either the program or the database changes, only the interface between the two needs to be adapted to read the data to and from the threedimensional array, a procedure which will be discussed in the next chapter.
The research question in the heading of this section ("Why should XML be explored as an option to build an exploitable database of linguistic data?") can be broken down into four sub-questions, which will be discussed below:

• Why is XML suitable for implementing a database?
• Why is XML suitable for linguistic data?
• Why is XML suitable for data exploration?
• What are the disadvantages of XML?
4.5.1 Why is XML suitable for implementing a database?

The idea for this study originated while working on an earlier project about the use of HTML to represent linguistic data in a table format (Kroeze, 2002). The tables used in HTML prompted the idea to capture the data in a database, but also showed the limitations of HTML, because its tags are only used for formatting and do not contain any semantic information which can be used for structuring purposes. 65 XML, on the other hand, allows the designer of the software to define his/her own tags, which may be organised in a hierarchical manner to structure the data. 66 This built-in structure can be used not only to visualise the data in a way similar to the HTML tables referred to above, but also to process the data for more advanced functionality.
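The difference may be illustrated by a small, hypothetical contrast (the HTML fragment is invented for the example; the XML element is one of those used later in this chapter):

HTML (formatting only):
   <td><b>elohim</b></td>

XML (user-defined, structuring tag):
   <phrase3>elohim</phrase3>

The HTML cell only prescribes a display format, while the XML tag identifies its content as the third phrase of a clause, information that a program can use for structuring and querying.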
65 As is the case with unstructured web data, the lack of structure facilitated by HTML causes serious limitations on information access (Xyleme, 2001: 1).
66 Relational databases use tables or flat structures while XML uses a hierarchical structure that is "arbitrarily deep and almost unrestrictedly interrelated" (Smiljanić et al., 2002: 9).
The hierarchical nature of XML is a major benefit in comparison to simple relational database management systems that make use of collections of flat, twodimensional tables. Use of this technology would lead to sparsity and redundancy problems (see above). 67 Although more complex types of relational database technology exist that do facilitate multidimensional tables, which could provide alternative solutions for multidimensional linguistic data, this study is limited to the investigation of the use of XML as a solution.
The database facilities of XML can be ascribed to two of its features: it allows the design of unique tag sets, and it separates formatting from structure. A unique set of tags (schema), which fits the relevant data set in a natural way (Flynn, 2002: 56), can be compiled to be the equivalent of a database structure. The structuring is built into a well-designed mark-up schema, while the formatting is covered by separate style sheets. While the schema of a relational database management system exists separately from the data, in XML it coexists with the data as element names or "tags" (Deutsch et al., 1999: 1156). The benefits of "the deferral of formatting choices" include the facilitation of consistent formatting and the avoidance of many opportunities for data corruption (DeRose et al., 1990: 15, 17).
Although XML is very suitable for storing data, it should be remembered that the CRUD functions (create, retrieve, update, delete) are not performed by the XML document itself but by another program that operates on the data in the XML file. One should perhaps even consider using the term XML databank rather than database: "An XML document is a database only in the strictest sense of the term" because it is essentially only a simple file containing data, organised in a linear fashion (Bourret, 2003). Combined with its surrounding technologies, XML may be regarded as a database system, albeit in the "looser sense of the term", because it provides some of the typical functionalities of "real databases" but lacks others (ibid.). However, in conventional database terminology, database refers to the collection of tables containing related data, 68 database management system refers to the program that enables creation, reading, updating and deletion of data in the database, and database system is the combination of a database and the software used to manage it (Smiljanić et al., 2002: 8). In a database approach one may "consider an XML document to be a database and a DTD to be a database schema" (Deutsch et al., 1999: 1155). Therefore, in this experiment the XML document refers to the database, the VB6 program may be regarded as a (simple) database management system, and the combination as a database system.

67 Storing XML data in conventional databases is not ideal since it "artificially creates lots of tuples/objects for even medium-sized documents" (Xyleme, 2001: 3).
68 Or static database – a database without CRUD facilities (cf. Petersen, 1999: 11).
Although it is not implemented in this experiment, using XML to structure the data in the clause cube could facilitate the request and delivery of information through the world wide web in a similar way as is the case with business data. Huang & Su (2002), for example, combine XML technology and push and pull strategies to provide users via the Internet only with information relevant to them. Because an XML document is text-based it is ideal for storage and delivery of business data via the web, which requires a onedimensional stream of characters for efficient transfer. This text-based property of XML also renders it quite suitable for the storage and transfer of linguistic data over the Internet.
4.5.2 Why is XML suitable for linguistic data?

Since XML itself is text-based, it follows that it should provide a suitable way to capture textual data. The source text can be kept intact while additional information is added by means of semantic mark-up. Since humanities scholars do not only use texts to transmit information about other phenomena, but also study the texts themselves, it is important to preserve these texts in a form that will facilitate future research. XML provides a way to store both the original text and the results of research on it for future reuse (Huitfeldt, 2004). Due to its widespread use and adaptability to other software packages, Flynn (2002: 59) regards XML as the future "lingua franca for structured text in the humanities and elsewhere". XML was also recommended by the E-MELD project as a mark-up language in order to create a common standard for and sharing of digital linguistic data (Bird et al., 2002: 432).
XML uses terms to describe texts that are not linked to a specific formatter, such as those suggested by the OHCO model (ordered hierarchy of content objects), and therefore makes documents transportable (platform-independent) (DeRose et al., 1990: 15). "It is a non-propriety public standard independent of any commercial factor and interest" (T. Sasaki, 2004: 19).
According to T. Sasaki (2004: 18), researchers of Hebrew linguistics "can benefit enormously" from the use of XML as a medium to store and interchange their research data. An XML database that captures human linguistic analyses and facilitates data warehousing and data mining procedures 69 on this data, for example, could be very helpful to fill the gaps that cannot yet be covered by algorithms that simulate the complex processes of human language. Due to the ambiguity of human language on the various layers of phonology, morphology, syntax, semantics and pragmatics, natural language processing systems are not satisfactorily successful, especially on the higher layers of language understanding (Wintner, 2004: 114-118). 70 In fact, such a database can also provide more basic data that can be used to improve NLP systems.
XML is a very scalable medium for storing linguistic data. It is very easy to embed another layer into the hierarchical structure to capture additional information. Besides capturing data that pertains to the text itself, information about parallel texts can be represented in the same manner, thus enabling textual criticism (the process of comparing various editions of a text in order to reconstruct the original text). 71 In this regard, Aarseth (s.a.) is very positive about the prospects of hypertext technology: "Not only does hypertext promise a tool for critical annotation and the representation of intertextuality, as well as a useful method for representing complex editions of variorum texts, it also has become, for many, an incarnation of the post-structural concept of text."

69 "Data Warehousing and Knowledge Discovery technologies are emerging as key technologies to improve data analysis ... and automatic extraction of knowledge from data" (Wang & Dong, 2001: 48).
70 Even using semantic information in a dictionary does not guarantee the correct interpretation because a machine's interpretation "does not [always] fit conditions in the real world" (Ornan, 2004).
71 Due to the stability of the text of the Hebrew Bible it is not necessary to consider the use of change-centric management of the XML clause cube, which only contains analyses of a single version of the text. However, in text-critical projects of the text such an approach could be useful for users to obtain snapshots of the text's history (cf. Marian et al., 2001).
Word order is an important and often essential characteristic of language. In a database that captures linguistic analyses according to logically organised attributes (for example, subject, object, indirect object), the word order is lost and another field is needed for every word to register its word-order position. However, XML's simple linear file characteristic makes it very suitable for textual databases, since text is also ordered in a linear fashion. It allows the designer to keep the word order intact and to capture the analytical data by means of mark-up. Not only does this eliminate the need for a word-order field, but it also reduces the processing needed to rebuild the original text for output purposes. Like SGML, 72 XML can be used to annotate either more text-oriented documents or more data-oriented documents. 73 It is therefore very suitable for a linguistic data cube, which is something in between. On the one hand, the text and word order are preserved, 74 and on the other hand, the database is structured to such an extent that it can be represented by a threedimensional array in VB6. This could, therefore, serve as an example where the boundaries between document-centric and data-centric XML documents are blurred (cf. T. Sasaki, 2004: 19). 75

The characteristics of XML discussed above make it very suitable to record linguistic data, for example in a data cube. In combination with a suitable program this data can be read, updated and deleted in various combinations. A data mart could be built to summarise subsets of the data, thus enabling advanced processing and retrieval. The following section will discuss the data exploration facilities in more detail.

72 Cf. DeRose et al. (1990: 12): "It [SGML – JHK] does not prejudice whether a document is to be treated as a database, a word-processing file, or something completely different".
73 A dictionary is a typical example of a data-oriented linguistic document (cf. Bird et al., 2002).
74 This statement has to be qualified somewhat. Embedded phrases and clauses challenged the ideal to reproduce the original word order exactly. A compromise was to refer to these embedded elements by using square brackets where they do occur and to analyse them separately afterwards as individual phrases or clauses.
75 Document-centric documents are also called narrative-centric or text-centric documents. They "are not so well structured and are meant more for human consumption, while data-centric documents ... are more rigidly structured and meant mainly for machine consumption" (T. Sasaki, 2004: 19).
4.5.3 Why is XML suitable for data exploration?

An XML database facilitates complex searches, for example where two or more conditions must be true (DeRose et al., 1990: 17). Without a proper database these are done partly manually: the researcher finds all texts that satisfy one condition and then searches within that data for the other conditions. A good program or query language could automate the process of searching for data on more than one parameter within an XML document. It could also facilitate text comparison and the display and correlation of various translations of a text, provided that this data is captured in the XML database (DeRose et al., 1990: 18). This will make the task of a translator or exegete a lot easier by integrating the data from various texts and translations into a single tool.
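For example, assuming the element names of the clause cube defined later in this chapter (see Figure 4.3), a query language such as XPath could combine two conditions in a single expression (a sketch only, not part of the implementation):

   //clause[level4/phrase3='Subject' and level5/phrase3='Agent']

This would retrieve all clauses in which the third phrase functions syntactically as Subject and semantically as Agent, such as Gen01v01a in Figure 4.4.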
Data integration from various sources is a typical data warehousing activity. Data marts and data warehouses are often used to integrate and aggregate business data. XML schemas can also be used to interoperate legacy databases when migrating and integrating them into newer databases (Thuraisingham, 2002: 190). XML and its surrounding technology can provide similar benefits for humanistic studies, since the OHCO model, for example, facilitates the integration of "a wide variety of different types of data or media into a 'compound document'" (DeRose et al., 1990: 17). The suitability of XML to integrate data from various sources has been demonstrated over and over again. Mangisengi et al. (2001: 337) go one step further in their project to virtually co-locate data warehouse islands, using XML as a basis to realise the interoperability of these sources. 76 By not having to physically replicate data into a new, enormous data warehouse they ensure an efficient load balance. This demonstrates the scalability of projects built on XML technology. (Compare Chapter 3 for a more detailed discussion of typical data warehousing procedures facilitated by a clause cube.)

76 According to Wang & Dong (2001: 51) a data warehouse is "a finite set of documents (or data cubes) conforming to one of the XML schema definitions in meta data." A data warehouse is actually a collection of data marts that contain aggregated data.
Having a data warehouse is an important step towards efficient data exploration or data mining. Data mining is the process of discovering hidden patterns within large datasets. "The OHCO model treats documents and related files as a database of text elements that can be systematically manipulated …. full-text searches in textbases can specify structural conditions on patterns searched for and text to be retrieved" (DeRose et al., 1990: 17). The location of patterns is the essence of humanistic inquiry, which presumes an openness on the part of the researcher, and "databases are perhaps the most well suited to facilitating and exploiting" this enterprise (Ramsay, s.a.). It should be noted that data mining is not a coincidental process of discovery, but rather a deliberate process of knowledge invention and construction (cf. Du Plooy, 1998: 54, 59).
4.5.4 What are the disadvantages of XML?

In comparison to all these benefits of XML there are only a few disadvantages (cf. T. Sasaki, 2004: 19). XML documents can become rather large, since the tags are repeated over and over again for each element. In the clause cube experiment of this project, not only the tags but also the character data is used repetitively, because the word groups, syntactic functions and semantic functions are encoded as text elements. This design is, however, very suitable for the eventual conversion to an array structure in VB6. According to Buneman et al. (2002: 475) an XML document may be regarded as a hierarchical structure of elements, attributes and text nodes, of which only "[t]ext and element children are held in what is essentially an array".
In a later version of this project the size of the XML document(s) may be reduced dramatically by defining the names of syntactic and semantic functions as entities (for example, <!ENTITY Ben "Beneficiary">) and using repetitive entity references in the database (for example, &Ben;) instead (cf. Burnard, 2004). This provides a viable alternative to compressing techniques to reduce the size of an XML document, since "lossy" compression techniques are more suitable for database-like documents, and "lossless" compression techniques are not nearly as efficient as "lossy" techniques (Cannataro et al., 2001: 3). 77
Besides the verbosity and repetitiveness, "access to the data is slow due to parsing and text conversion" (Bourret, 2003). On the other hand, in the case of text databases, an XML implementation can actually be quite fast since whole documents are stored together and logical joins are not needed (ibid.).
If the XML code is typed using a basic text editor such as Notepad, it can be annoying and error-prone to type repetitive tags and elements, but if the file is created by electronic means, or by using special XML editors, this problem can be avoided.
The separation of data and formatting provides certain benefits as discussed above, but necessitates the creation of a separate style sheet to inform a web browser, such as Opera or Firefox, 78 how to display the text in the XML document (Flynn, 2002: 57). This is, however, a small price to pay for the database-like benefits provided by the same characteristic and the option to design different formats to suit unique requirements.
In addition, Huitfeldt (2004) mentions the following weaknesses of XML: poor support for documents enriched by multimedia, absence of well-defined semantics, and the inherent inadequacy to express overlapping hierarchies, which have to be bypassed by artificial means. Since XML itself does not contain semantics, it is important to add semantic content to mark-up in order to enable the study of the ontology it reflects (cf. F. Sasaki, 2004: 3). 79
During "lossy" compression the document structure is changed and the original document cannot be reproduced by reversing the process. If the compression is lossless the compressed data can be decoded to provide a document that is identical to the original (Cannataro et al., 2001: 2).
78 Internet Explorer does not render the tables, defined in this project's XML style sheet, correctly.
79 Mark-up semantics studies "the formal description of the meaning of document grammars and instance documents", while semantic markup "is the addition of semantic information to markup" (F. Sasaki, 2004: 3).
In comparison to the advantages, the disadvantages of XML are rather limited. Thus, one may conclude that it provides suitable technology to build a linguistic database which can be explored to construct new knowledge.
4.6 How can XML be used to build an exploitable linguistic data cube?

XML is not restricted to a predefined set of static mark-up formulas. The user may define his/her own tags to mark up the relevant text in a suitable way. Therefore, tags, elements and attributes can be designed according to the linguistic paradigm within which the researcher works. XML is also very flexible: it is possible and acceptable to map all properties to elements and child elements (Bourret, 2003), and in this experiment it was actually better to code all the linguistic information as primary data (the most basic textual elements) to implement the threedimensional data cube concept properly. 80 Primary data consists of "simple element types" (Bourret, 2003), which are usually used exclusively for the basic text itself, 81 but XML allows the user to design the structure of the database creatively, using the various building blocks available. This is called a tag-based approach, as opposed to an attribution-based one. While the attribution-based approach is more readable, the tag-based approach is more expandable and suitable for the representation of multidimensional and hierarchical data (Jeong & Yoon, 2001: 834). Using a tag-based approach to build a linguistic data cube in combination with a VB6 access program will provide a custom-made, but flexible and expandable, database management system that is both efficient and user-friendly. It is, of course, very important to use these constructs in a consistent manner. The need to reuse data intelligently (for example, for text mining) depends on a "well-planned tagging scheme" (DeRose et al., 1990: 18). To facilitate this process, schema languages are available to define the structure of the database and to test the contents of the database to ensure that all entries satisfy the schema rules (cf. T. Sasaki, 2004: 18).

80 Compare T. Sasaki's (2004: 42) example of an entry in a data-centric lexical database of Modern Hebrew where all the mark-up is also done as elements and child elements, without using attribute values. According to Deutsch et al. (1999: 1156) "[s]tructured values are called elements".
81 Compare, for example, T. Sasaki (2004: 29-30). See Huitfeldt (2004): "An SGML document therefore has a natural representation as a tree whose nodes represent elements and whose leaves represent the characters of the document."
4.7 How can XML represent the syntactic and semantic analyses of free text?

The designer has to think about the data structure as a threedimensional object having one row for each clause; five (in the case of Genesis 1:1-2:3) columns per clause, one for each phrase; and various layers of analysis, i.a. one to capture syntactic information and another to record semantic functions. If a phrase does not have a semantic function, for example in the case of conjunctions, an empty value (-) is inserted into the relevant field. Null values would also indicate the absence of a function, but could cause problems during sorting and during importing and exporting the XML file to and from a program (round-tripping 82). In XML the data cube is represented by a hierarchical structure (see below). It is important to validate the recorded data to ensure the consistent use of terminology. A proper XML schema enforces consistency and the proper organisation of stored text, which is necessary because "[n]o hardware improvements or programming ingenuity can completely overcome a flawed representation" (DeRose et al., 1990: 4). The creation and use of an XML schema will be discussed in more detail below (4.11). In addition, the syntactic and semantic functions will also be validated by the VB6 program to ensure clean data before advanced processing is done (see Chapter 6).

A schema is actually a knowledge representation or an ontology 83 that is formulated, consciously or unconsciously, based on a specific theory of language. 84 "If you want a computer to be able to process the materials you work on, whether for search and retrieval, analysis, or transformation—then those materials have to be constructed according to some explicit rules, and with an explicit model of their ontology in view" (Unsworth, 2001). Various ontologies in linguistic projects reflect the various underlying theoretical paradigms, and one can only hope that these will converge to more standardised systems in future. Divergent ontologies are not optimised to play the role of a "key factor for enabling interoperability in the semantic web" (ibid.). However, one will have to accept that linguistic ontologies are phenomena that evolve in parallel to the underlying philosophies that they reflect; since linguistics is a humanistic field of study, it will never be as rigorous as the natural sciences. XML could at least help the comparison of the various approaches. With reference to literary analysis, McGann (2003: 5) says: "Textuality is, like light, fundamentally incoherent. To bring coherence to either text or to light requires great effort and ingenuity, and in neither case can the goal of perfect coherence be attained." Although "any philosophy is destined to be incomplete", ontologies are important because "[w]ithout it, there is no hope of merging and integrating the ever expanding and multiplying databases and knowledge bases around the world" (Sowa, 2003).

82 Round-tripping will be discussed in detail in Chapter 5.
83 "An ontology is a formal conceptualization of a domain that is usable by a computer. Ontologies ... allow applications to agree on the terms that they use when communicating" (Euzenat, 2001: 21).
84 The XML schema may be regarded as the blueprint for a linguistic ontology since it provides the framework for "a catalog of the types of things that are assumed to exist in a domain of interest" (Sowa, 2003). Because the types are defined only in human language, it should be regarded as an "informal ontology".
4.8 How can XML represent inherently multidimensional data?

According to Witt (2002), using separate annotated document grammars for the various linguistic layers allows "an unlimited number of concurrent annotations". It would indeed be easier to annotate each layer in a separate XML document, but the use would be very limited. In order to study the mappings of the linguistic layers, for example, one needs an integrated structure because "separate annotations do not allow for establishing relations between the annotation tiers" (Witt, 2002). 85 Even Witt et al. (2005: 105) acknowledge the need to integrate multiple notations into a single XML representation. One could, of course, use a system of primary and foreign keys to join the various annotation tiers of separate documents, but this would cause a lot of overhead. Using a threedimensional data structure instead can eliminate a lot of conversion and programming to merge various XML databases into one. There is a natural similarity between data cubes and XML databases since both are multidimensional and hierarchical in character (Wang & Dong, 2001: 50).

85 Also see Witt et al. (2005: 112).
A data cube merges all data in one structure, eliminating a lot of overhead in terms of programming needed for the comparison of separate files and the inference of relations between their elements (cf. Witt, 2005: 56), because the various layers are already interrelated by the threedimensional data structure. It is also unlimited since more layers can be added on the depth axis to capture additional layers of analysis. In this experiment one annotation level (the third dimension) serves several linguistic modules (cf. Bayerl et al., 2003: 164): phonology, translation, word groups, syntax and semantics.
An XML database is of course a text-based document which is essentially onedimensional because text represents a stream of language utterances. Therefore, one should "collapse" the (conceptual) threedimensional data cube into a onedimensional stream of tags and primary data. The tagging structure should represent a consistent hierarchy which can be interpreted by a program to convert the stream of text into a data cube. The structure used in this experiment will be discussed in the next section.
4.9 The structure of the Genesis 1:1-2:3 database in XML

As discussed above, it is very important to design a proper structure for an XML database. "Like relational databases, there is nothing in native XML databases that forces you to normalize your data. That is, you can design bad data storage with a native XML database just as easily as you can with a relational database. Thus, it is important to consider the structure of your documents before you store them in a native XML database" (Bourret, 2003). The hierarchy of the Genesis 1:1-2:3 clause cube is shown in Figure 4.1.
Hebrew Bible            - not used in this study
Bible Book              - not used in this study
Pericope 86             - root element in this study: <pericope>
Clause                  - each clause represented by one table: <clause>
Clause Number           - each clause's ID: <clauseno>
Table Headers           - headings for each column: <headers>, <header>
Language Levels 1-5     - the various modules of analysis: <level1> ... <level5>
Level Description       - description of module per row: <leveldesc>
Phrases 1-5             - the word groups in a clause: <phrase1> ... <phrase5>
Figure 4.1. The hierarchy of the Genesis 1:1-2:3 clause cube as reflected by its XML implementation.
This hierarchy actually represents various levels and layers. Although other documents could be used to mark up other versions of analyses and the various documents connected by means of the identical textual content, these analyses may also often be combined in a single document - compare Witt et al. (2005: 104, 105): "Sometimes, the single hierarchy restriction is not perceived as a drawback because annotations with concepts from different information levels can often be integrated in a single hierarchy." In the Genesis 1:1-2:3 clause cube the structure of the text (book, pericope, clause, phrase) is mixed in a single hierarchy with the concepts of the linguistic modules (phonology, morpho-syntax, syntax, semantics) since the VB6 management program will use the tag structure to convert the rather flat XML file to build the threedimensional clause cube as a threedimensional array.
The XML schema which describes the structure of the XML database is based on the logical hierarchical structure. An example of an XML schema to annotate text, focusing only on the structure of the text, can be found in Witt et al. (2005: 105). It contains the hierarchy shown in Figure 4.2. 87
86 In this experiment Genesis 1:1-2:3, the first pericope of the Hebrew Bible, is used as the basic text and root element. Although it could be argued that Genesis 2:4a also belongs to this pericope, it was decided not to include this clause, following the masoretic division. If a longer text were used as corpus, one would have to decide whether the segmentations on this level should be done by chapter or pericope.
87 Compare T. Sasaki (2004: 23) for a similar, but different schema of mark-up for a Modern Hebrew corpus. See also Petersen (2004b) and Buneman et al. (2002: 481).
Figure 4.2. An example of an XML schema used to annotate text (Witt et al., 2005: 105).
This concept can be expanded to cover more than one level of analysis by using the hierarchy of structural and analytical elements above in the design of the structure of the XML database of Genesis 1:1-2:3, as shown in Figure 4.3 below. The five phrases per clause that have been used as the structuring backbone are sufficient for Genesis 1:1-2:3, but may have to be extended for other texts (see 4.4 above). The five linguistic layers that have been chosen here are sufficient to illustrate the multidimensionality of the data structure and may be extended to cover other needs. 88
<pericope>
   <clause>
      <clauseno> ... </clauseno>
      <headers>
         <header> ... </header>
         ...
      </headers>
      <level1>
         <leveldesc>Phon:</leveldesc>
         <phrase1> ... </phrase1>
         <phrase2> ... </phrase2>
         <phrase3> ... </phrase3>
         <phrase4> ... </phrase4>
         <phrase5> ... </phrase5>
      </level1>
      <level2>
         <leveldesc>Translation:</leveldesc>
         ...
      </level2>
      <level3>
         <leveldesc>Phrase type:</leveldesc>
         ...
      </level3>
      <level4>
         <leveldesc>SynF:</leveldesc>
         ...
      </level4>
      <level5>
         <leveldesc>SemF:</leveldesc>
         ...
      </level5>
   </clause>
   ...
</pericope>
Figure 4.3. The basic structure of the XML database of Genesis 1:1-2:3.

88 One could argue that the repetitive tagging of structural information, such as "Level", "Phrase1", "Phon:", etc., is superfluous. However, it does help to keep the XML file human-readable.
When this scheme is populated with linguistic data from Genesis 1:1-2:3, it looks as shown in Figure 4.4 (only the first two clauses are shown below as an example).
<clause>
   <clauseno>Gen01v01a</clauseno>
   <level1>
      <leveldesc>Phon:</leveldesc>
      <phrase1>bre$it</phrase1>
      <phrase2>bara</phrase2>
      <phrase3>elohim</phrase3>
      <phrase4>et ha$amayim ve'et ha'arets</phrase4>
      <phrase5>-</phrase5>
   </level1>
   <level2>
      <leveldesc>Translation:</leveldesc>
      <phrase1>in the beginning</phrase1>
      <phrase2>he created</phrase2>
      <phrase3>God</phrase3>
      <phrase4>the heaven and the earth</phrase4>
      <phrase5>-</phrase5>
   </level2>
   <level3>
      <leveldesc>Phrase type:</leveldesc>
      <phrase1>PP</phrase1>
      <phrase2>VP</phrase2>
      <phrase3>NP</phrase3>
      <phrase4>NP</phrase4>
      <phrase5>-</phrase5>
   </level3>
   <level4>
      <leveldesc>SynF:</leveldesc>
      <phrase1>Adjunct</phrase1>
      <phrase2>Main verb</phrase2>
      <phrase3>Subject</phrase3>
      <phrase4>Object</phrase4>
      <phrase5>-</phrase5>
   </level4>
   <level5>
      <leveldesc>SemF:</leveldesc>
      <phrase1>Time</phrase1>
      <phrase2>Action</phrase2>
      <phrase3>Agent</phrase3>
      <phrase4>Product</phrase4>
      <phrase5>-</phrase5>
   </level5>
</clause>
<clause>
   <clauseno>Gen01v02a</clauseno>
   <level1>
      <leveldesc>Phon:</leveldesc>
      <phrase1>veha'arets</phrase1>
      <phrase2>hayta</phrase2>
      <phrase3>tohu vavohu</phrase3>
      <phrase4>-</phrase4>
      <phrase5>-</phrase5>
   </level1>
   <level2>
      <leveldesc>Translation:</leveldesc>
      <phrase1>and the earth</phrase1>
      <phrase2>was</phrase2>
      <phrase3>an emptiness and void</phrase3>
      <phrase4>-</phrase4>
      <phrase5>-</phrase5>
   </level2>
   <level3>
      <leveldesc>Phrase type:</leveldesc>
      <phrase1>NP</phrase1>
      <phrase2>VP</phrase2>
      <phrase3>NP</phrase3>
      <phrase4>-</phrase4>
      <phrase5>-</phrase5>
   </level3>
   <level4>
      <leveldesc>SynF:</leveldesc>
      <phrase1>Subject</phrase1>
      <phrase2>Copulative verb</phrase2>
      <phrase3>Copula-predicate</phrase3>
      <phrase4>-</phrase4>
      <phrase5>-</phrase5>
   </level4>
   <level5>
      <leveldesc>SemF:</leveldesc>
      <phrase1>Zero</phrase1>
      <phrase2>State</phrase2>
      <phrase3>Classification</phrase3>
      <phrase4>-</phrase4>
      <phrase5>-</phrase5>
   </level5>
</clause>
etc.
Figure 4.4. Two populated clause elements in the XML database.
4.10 Critical discussion of the XML clause cube implementation

The threedimensional cube structure implemented in XML above provides an easy way to resolve identity conflicts, i.e. where elements on the various layers span the same range of words of the basic text (Witt et al., 2005: 107), 89 for example the exact same phrase et ha$amayim ve'et ha'arets in Genesis 1:1, which is analysed on the various levels as NP, object and product. The Genesis 1:1-2:3 experiment has many identity conflicts, since the basic unit of reference is the phrase (word group). Actually, the whole clause cube structure is built on identity conflicts – in each clause exactly the same phrases are analysed on the various levels. By ignoring conjunctions which are parts of other words (a commonly found phenomenon in Hebrew) it was possible to use exactly the same demarcations for the linguistic modules that were annotated.

89 "An identity conflict exists when two element instances from the two annotation layers span an identical portion of the text" (Witt et al., 2005: 112).
This structure facilitates the study of mapping between the chosen linguistic modules. The implication of this implementation is that more detailed information, such as morphological analyses (for example, bre$it = preposition be- + noun re$it) cannot be stored by only adding another level on the depth dimension. In order to facilitate functions like these the structure of the clause cube will have to be changed into a more complex structure where words and/or morphemes are numbered, using ranges of the numbers to demarcate phrases on the higher levels of analysis. (Cf. Witt, 2005: 70, for an example of a textual stream where each character has its own, unique identification.) This, however, falls outside the scope of this study.
In a twodimensional representation identity conflicts have to be resolved either by marking up the same texts in various XML files, or by nesting one layer's elements in another layer's elements (cf. Witt et al., 2005: 107). 90 In this project's threedimensional structure, however, the layers are described in parallel structures. In XML these parallel structures are implemented using various collections of elements which are hierarchically on the same layer but separated by descriptive tags. The various collections of sibling and child elements are grouped into units and subunits by wrapper tags. 91 This is a direct representation of the inherently threedimensional data underlying the implementation and avoids the necessity to define some layers as attributes of elements on another layer.
Although one may argue that this is a counter-intuitive implementation of inherently hierarchical linguistic data, it is typical of data-oriented XML files (cf. T. Sasaki, 2004: 31-42). 92 If one implemented the linguistic modules as attributes of the phrases, it would become much more difficult (or even impossible) to build a threedimensional XML cube, since attributes cannot be used for document-structuring purposes, while elements can (Holzner, 2004: 67-68). 93 Lack of structure would have detrimental effects on the advanced processing of the linguistic data (for example, studying the mapping of linguistic modules). According to Witt (2005: 55-56), the layers of phonology, morphology, syntax and semantics "are (relatively) independent of each other" – this supports the idea to treat them as separate elements and not as attributes of other elements, a concept which is also mirrored by the threedimensional cube consisting of an array of cells of variables organised according to rows, columns and levels (depth dimensions). In the XML schema the legitimate possibilities of the linguistic levels of morpho-syntax, syntax and semantics are defined as enumerations 94 of element values (see the section on validation below).

90 Compare Witt et al. (2005: 109-114) for a discussion of other types of relations (mappings) between various annotated layers, such as inclusion and overlap conflicts (that is, where the parts of the text that are analysed are not exactly the same). Since these types do not occur in this case study they are not discussed further.
91 Compare T. Sasaki (2004: 32) who also uses a wrapper tag to organise the various child elements of each lexeme into a unit of a data-centric XML lexical database. A wrapper element is a higher level element used to store multiple "entities" in one XML "table" or various "tables" in one XML database (cf. Bourret, 2003).
92 Compare T. Sasaki's (2004) example of a data-oriented lexicographical implementation with his example of a document-centric annotation in which the syntactic role is defined as an attribute of a phrase.
One may conclude that the hierarchy of an XML document structure does not, and does not have to, reflect the inherent clause structure. Although the phrases do have syntactic and semantic characteristics or attributes, speaking from a linguistic perspective, these may be implemented in XML as elements for the sake of threedimensional structuring and processing. To define these linguistic attributes as XML elements is, therefore, a pragmatic decision, facilitating the database functionalities needed. This "data-centric application of XML" may be quite different from the more conventional "document-centric" applications – data-centric files, which are usually processed by machines, are much more structured (cf. T. Sasaki, 2004: 19).
The original Hebrew text is not marked up using the Hebrew alphabet. Instead a simple phonological rendering is used (see Addendum C). Therefore, one would need another mechanism to link this product to, for example, the Biblia Hebraica Stuttgartensia (BHS), should the need arise. One solution could be to use standoff mark-up, 95 a way of separating mark-up from the original text to be annotated. This would require the original text (BHS) to contain basic mark-up identifying each word with a unique primary key, which could be referenced in the standoff annotation (cf. Thompson & McKelvie, 1997). For example, the phrases in Genesis 1:1 could be numbered in the BHS as follows: Gen1v1a1: bre$it, Gen1v1a2: bara, Gen1v1a3: elohim, Gen1v1a4: et-ha$amayim ve'et ha'arets. These identifiers may then be used to link the original Hebrew text (in the Hebrew alphabet) with the phonological representation used in the clause cube, in this way making explicit the inherent links between the two texts.

93 Since both attributes and elements hold data, one could use Holzner's (2004: 67) guideline (i.e. using elements to structure the file, and attributes for additional information) to choose which one should be used. Another reason for using elements rather than attributes is that "using too many attributes can make a document hard to read" (Holzner, 2004: 68).
94 "An enumeration is a set of labels with values", for example the enumeration syntactic function which has the labels of subject, direct object, indirect object, etc. (cf. Petersen, 2004b).
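The standoff solution described above may be sketched as follows (the element and attribute names are hypothetical):

Base text (BHS) with identifiers:
   <phrase id="Gen1v1a4">et-ha$amayim ve'et ha'arets</phrase>

Standoff annotation in a separate document:
   <analysis ref="Gen1v1a4" phrasetype="NP" synf="Object" semf="Product" />

The annotation document never alters the base text; it only points to it by means of the shared identifier.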
Similar to the procedure in T. Sasaki (2004: 24), only the verbal core is marked as VP. 96 Petersen (2004b) follows a similar approach: in the clause "The door was blue" only the copulative verb is marked as VP. 97 Including other phrases such as complements, direct objects and adverbials in the verb phrase would necessitate another layer of analysis and the distinction of inclusive relationships, which fall outside the scope of this study. However, in this study, preposition phrases are regarded as the combination of the preposition and its complement – this is different from T. Sasaki who regards the preposition phrase as a linking unit between the verb and its satellite (which actually is more consistent and in line with the VP scenario).
In this experiment the names of word groups, syntactic functions and semantic functions could be regarded as foreign keys – these could be used as primary keys in other "tables" or documents where definitions are supplied. This is, however, not implemented in this study. If these documents were created, one would have to ensure referential integrity between the foreign keys and primary keys. Textual child elements referring to word groups, syntactic functions and semantic functions are primary data that must be regarded as external pointers (or foreign keys) which point to valid document fragments in the related documents (cf. Bourret, 2003). One should therefore ensure that the names of these features are used absolutely consistently: it would, for example, be unacceptable to use both subj and Subject to tag the subject of a clause. Although these foreign key elements will be used over and over again, redundancy is acceptable in the case of foreign keys.

95 Standoff annotation is necessary when the original text is read-only, copyright protected or prompts overlapping hierarchies (Thompson & McKelvie, 1997).
96 In Functional Grammar a clause (or "predication") is regarded as a combination of a verb with its arguments and satellites (see Dik, 1997a: 77). This is similar to T. Sasaki's principle: "This scheme proposes to annotate syntactic argument structure with verbs as the core and other phrases as their satellites".
97 Also cf. Ornan (2004).
The verse number elements in XML (e.g., Gen01v01a) may be regarded as primary (or candidate) keys that uniquely identify every clause. These keys facilitate searches and references to specific clauses. "If XML documents are to do double duty as databases, then we shall need keys for them" (Buneman et al., 2002: 473). When the clause number is used as a reference to an embedded clause, it functions as a foreign key. It may be coded as part of another phrase, and one should be able to find it using a "fuzzy" search (where a query searches for a part of a string appearing within a bigger attribute value). In this case, the verse numbers are considered internal pointers, since they refer to another section of the same document. Relative clauses, for example, are regarded as embedded clauses (EC). The whole clause is referred to in the main clause, and the relative clause is then analysed separately. Other ECs and embedded clause clusters (ECC), such as direct speech, are treated in the same way. The ECs and ECCs are similar to the "gaps" used by Petersen (2004b) in his Emdros project. It may therefore be concluded that the clause cube has been normalised. 98
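Such a "fuzzy" search could, for example, be expressed in XPath (a sketch only, assuming the element names of Figure 4.3; the clause number is illustrative):

   //clause[contains(., 'Gen01v03a')]

This expression would find every clause whose textual content contains the string Gen01v03a, i.e. the embedded clause itself as well as any main clause that refers to it.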
4.11 Validating the XML document

A schema 99 was created using the built-in functionality of Visual Studio.Net 2003 (VS.Net 2003). 100 Although the basic schema was created automatically, three simple types and enumerations of phrase tags, as well as syntactic and semantic function tags, were coded manually and added to the schema. A simple type is a user-defined type, which enables the programmer to create custom-made types that reflect his/her exact requirements (Deitel & Deitel, 2006: 919-921); for example, one may create types to define lists (enumerations) of the possible values of phrases (word groups) and of syntactic and semantic functions. An enumeration is "a set of values that a data item can select from" (Holzner, 2004: 213). The schema (Gen1_InputV15.xsd 101) is shown in Figure 4.5 below (see also Addendum G on the included CD). The XML database itself (Gen1_InputV15.xml) was created by converting a databank module in VB6 (see Chapters 2 and 3) programmatically into a text file, a procedure to be discussed in the following chapter (see also Addendum H on the included CD). The schema was then used to test the XML database of Genesis 1:1-2:3, and this procedure revealed some inconsistencies in the tagging, for example with regard to the use of square brackets to indicate embedded clauses. After correcting these tagging errors (see Addendum I for the corrected file, Gen1_InputV15b.xml) the validation was successful.

98 Normalisation is the process of minimising redundant data in a database (Connolly & Begg, 2005: 390).
99 The structure of an XML document is represented by its schema. An "XML schema with a lower case 's' refers to any XML schema – such as a DTD, an XML Schema document, or a RELAX NG schema" (Bourret, 2003).
100 VS2003.Net was used because the XML functionality is not available in VB6. VS2005.Net allows one to automatically create an XML Schema, but not to use it directly to validate XML databases. VS2003.Net, however, facilitates both automatic creation and direct validation (using an option on the XML menu).
101 The XSD and XML files can be opened and viewed with Notepad.
(The schema listing itself is reproduced in Addendum G on the included CD; its main components are the following.)

– Declaration of the target namespace generated by VS2003.Net (tempuri.org) 102
– Enumeration of phrase types as possible elements of a simple type ("WGenum")
– Enumeration of syntactic functions as possible elements of a simple type ("synfenum")
– Enumeration of semantic functions as possible elements of a simple type ("semfenum")
– Use of simple type WGenum to validate phrase elements
– Use of simple type synfenum to validate syntactic function elements
– Use of simple type semfenum to validate semantic function elements

102 "tempuri.org is the default namespace URI used by Microsoft development products, like Visual Studio. 'tempuri' is short for Temporary Uniform Resource Identifier" (http://en.wikipedia.org/wiki/Tempuri). Namespaces are essential to avoid conflicting sets of tags (Holzner, 2004: 92).
Figure 4.5. The XML Schema used to validate the XML database of Genesis 1:1-2:3.
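As an indication of what such a definition looks like, the following is a minimal sketch of an enumerated simple type (the values shown are only those that appear in Figure 4.4; the complete lists are given in the schema in Addendum G):

   <xs:simpleType name="synfenum">
      <xs:restriction base="xs:string">
         <xs:enumeration value="Adjunct" />
         <xs:enumeration value="Main verb" />
         <xs:enumeration value="Subject" />
         <xs:enumeration value="Object" />
         <xs:enumeration value="Copulative verb" />
         <xs:enumeration value="Copula-predicate" />
         <xs:enumeration value="-" />
      </xs:restriction>
   </xs:simpleType>

The "-" value accommodates the empty slots discussed in 4.7, so that a phrase without a syntactic function still validates against the type.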
4.12 Viewing the XML file in a web browser

The data in the XML data cube can be visualised in a web browser as a series of twodimensional tables using the style sheet shown in Figure 4.6 (see Addendum J: Gen1XMLdb03c.css).
clause    {display:table; border-style:solid; margin-top:20px; margin-left:20px; padding:10px}
clauseno  {display:table-caption; font-size:20pt}
headers   {display:table-header-group}
header    {display:table-cell; padding:6px; background-color:lightblue; border-style:solid}
level1    {display:table-row}
level2    {display:table-row}
level3    {display:table-row}
level4    {display:table-row}
level5    {display:table-row}
leveldesc {display:table-cell; background-color:lightblue; border-style:solid; border-top-width:medium; border-bottom-width:medium; border-left-width:medium; border-right-width:medium; padding:6px}
phrase1   {display:table-cell; border-style:solid; border-top-width:thin; border-bottom-width:thin; border-left-width:thin; border-right-width:thin; padding:6px}
phrase2   {display:table-cell; border-style:solid; border-top-width:thin; border-bottom-width:thin; border-left-width:thin; border-right-width:thin; padding:6px}
phrase3   {display:table-cell; border-style:solid; border-top-width:thin; border-bottom-width:thin; border-left-width:thin; border-right-width:thin; padding:6px}
phrase4   {display:table-cell; border-style:solid; border-top-width:thin; border-bottom-width:thin; border-left-width:thin; border-right-width:thin; padding:6px}
phrase5   {display:table-cell; border-style:solid; border-top-width:thin; border-bottom-width:thin; border-left-width:thin; border-right-width:thin; padding:6px}
Figure 4.6. The XML style sheet used to display the XML clause cube as a series of twodimensional tables in the Firefox or Opera web browser.
When the XML database of Genesis 1:1-2:3 is displayed in the Firefox web browser using the style sheet above, the results look as shown in Figure 4.7 (only the first two clauses are shown; see Addendum K for the whole file).
Figure 4.7. The first two clauses of the XML clause cube as displayed in the Firefox web browser as two twodimensional tables.
This presentation of the threedimensional XML clause cube as a series of twodimensional tables, viewed in an internet browser, may be regarded as a simple visualisation of the data. The format and appearance of the file could be changed relatively easily by changing the style sheet.
Although this representation at first sight looks very similar to the representation in Figure 3.17, it is actually very limited. It does allow simple searches using the browser's built-in functionalities, but it does not present the required data clause by clause, because the formatted data is presented as one long web page. This limitation could become quite problematic in huge data sets. Furthermore, users cannot search the data specifically on clause numbers or verse numbers; neither can they slice off required linguistic modules or expect any new requirements to be fulfilled. The browser interface is, therefore, only suitable for simple uses and cannot facilitate advanced processing of the data. Later chapters (Chapters 5-6) will, therefore, use third generation programming languages to overcome these limitations. Some of the functionalities discussed in Chapter 3, such as slicing and dicing, will be integrated with more advanced procedures. Create, read and update functionalities will be added. Data mining and visualising the XML data in custom-made ways will be utilised to look for interesting patterns in and across the various linguistic modules.
4.13 Conclusion

The empirical exercise in this chapter proved to be quite successful. It showed that XML can be used to build a multidimensional database of linguistic data, which can be visualised as a series of twodimensional tables by using a style sheet and a web browser. It showed that a database approach to capture and manipulate linguistic data is a viable venture in computational linguistics and an example of natural language information systems. Various layers of linguistic data were captured in an XML document using the phrase as the basic building block of the data cube. The data may also be imported by a VB6 program for user-friendly viewing or editing purposes and rewritten to XML for storage. This process of round-tripping will be discussed in the next chapter. The integration of data in the data cube also facilitates data exploration (see Chapter 6). More complex visualisations of subsets of the data will be discussed in Chapter 7.
Chapter 5
Conversion of the Genesis 1:1-2:3 linguistic data between the XML database and the array in Visual Basic 103
5.1 Introduction

In the electronic processing of language, one can concentrate either on the digital simulation of human understanding and language production, or on the most appropriate way to store and use existing knowledge. Both are valid and important. This thesis falls into the second category, assuming that it is important to capture the results of linguistic analyses in well-designed, exploitable, electronic databases. XML, for example, can be used to mark up free text, to create a well-structured textual database. 104 Since the data is separated from the manipulation and display thereof, the same data can be used for various purposes, and programs or queries can be created to suit the researcher's individual needs. This, however, necessitates the conversion of the data stored in XML format into a data structure, such as a threedimensional array, 105 which can then be processed efficiently by a computer program. 106
This chapter will focus on the conversion of linguistic data of Genesis 1:1-2:3 between an XML data cube and a threedimensional array structure in Visual Basic 6 in order to eventually facilitate data access and manipulation. After a short reconsideration of the structures of the VB6 and XML databanks, conversion between the two will be discussed ("round-tripping"), as well as essential database functions (create, read, update and delete) that may be performed on the clause cube. 103
[103] This chapter is a revised and extended version of a short paper, "Round-tripping Biblical Hebrew linguistic data", read at the IRMA 2007 conference, Vancouver, British Columbia, Canada, May 19-23, 2007 (see Kroeze, 2007b).
[104] See Chapter 4.
[105] See Chapter 2.
[106] See Chapters 3, 6 and 7.
The XML document containing the text and mark-up of Genesis 1:1-2:3 may be regarded as a "native XML database" (i.e. "a database designed especially for storing XML"), while the VB6 program may be regarded as a "content management system" (i.e. "an application designed to manage documents and built on top of a native XML database") (Bourret, 2003). The native XML database stores the XML content, which consists of the original text (a phonetic version of the Hebrew text of Gen. 1:1-2:3) with all the added XML tags and mark-up (syntactic and semantic functions, etc.). The content management system is a database management system that operates on the data to allow editing and various views according to possible user needs. Although it is a very basic system, it does fulfil the basic requirements to qualify as a native XML database (cf. Vakali et al., 2005: 65, 67): the hierarchically structured XML document serves "as the fundamental unit of logical storage", the schema serves as the "logical model for the XML document itself", and the XML file saved on the permanent storage device uses a sequential, text-oriented file structure as "underlying physical storage model".[107] A complete discussion of the XML clause cube may be found in Chapter 4 (cf. Kroeze, 2006).
The hierarchical structure of the XML database is demonstrated by the extract shown in Figure 5.1, which partially repeats the contents of Figure 4.4 for the purpose of easy reference.
[107] According to Smiljanić et al. (2002: 17), however, a native XML database is not required to have the third property: it can be built on various types of databases or proprietary storage formats.
Gen01v01a
    Phon:        bre$it | bara | elohim | et ha$amayim ve'et ha'arets | -
    Translation: in the beginning | he created | God | the heaven and the earth | -
    Phrase type: PP | VP | NP | NP | -
    SynF:        Adjunct | Main verb | Subject | Object | -
    SemF:        Time | Action | Agent | Product | -
    ...

Figure 5.1. An extract of the Genesis 1:1-2:3 XML clause cube, which is representative of the hierarchy and structure of the file (element content shown without its tags).
The platform independence of XML documents allows the marked-up text to be transported to other programs "capable of making sense of the tags embedded within it" (cf. Burnard, 2004). For this project, Visual Basic 6 (VB6) was chosen for this role because XML is essentially a hierarchical system that fits the three-dimensional array data structure facilitated by VB6 very well. VB6 was chosen above Visual Basic.Net because it is easier to create an executable file for dissemination in the older version. It would, however, be relatively easy to transform the program(s) into Visual Basic 2005 format, since Visual Studio 2005 provides migration facilities. This would enable the use of pre-programmed classes, for example to extend, delete or edit the data in the databank. In this chapter, however, these CRUD (create, read, update and delete) functions had to be coded manually, since the size of an array is static and does not allow automatic insertion and deletion of records (Crawford, 1999: 219).
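By way of illustration, the following minimal sketch shows the kind of manual shifting that an insert operation requires in a static array. The procedure name InsertClauseAt is hypothetical; the module-level Clause array and the arrayMax counter are those used in the program discussed below.

'Hypothetical sketch of a manual insert into the static clause array
Public Sub InsertClauseAt(pos As Integer)
    Dim c As Integer, p As Integer, lvl As Integer
    'Shift all clauses from position pos one row down (bottom-up to avoid overwriting)
    For c = arrayMax To pos Step -1
        For p = 1 To 5
            For lvl = 1 To 6
                Clause(c + 1, p, lvl) = Clause(c, p, lvl)
            Next
        Next
    Next
    'Clear the freed row with empty-element symbols, ready for the new clause
    For p = 1 To 5
        For lvl = 1 To 6
            Clause(pos, p, lvl) = "-"
        Next
    Next
    arrayMax = arrayMax + 1   'a check against the upper bound of 200 is omitted here
End Sub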
When converted into VB6, the databank consists of a three-dimensional data structure. A multidimensional array is very suitable for a limited data set, such as the data in this project, owing to its built-in indexing. Multidimensional online analytical processing (MOLAP) products "typically run faster than other approaches, primarily because it’s possible to index directly into the data cube’s structure to collect subsets of data" (Kay, 2004). The VB6 program discussed in this chapter and the following chapter may be regarded as a simple MOLAP tool.
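As a minimal sketch of such direct indexing, the following hypothetical procedure slices the semantic-function level (level 6 in the array layout shown in Figure 5.2 below) off the cube; only the array elements needed are visited, without any scanning of the underlying XML text:

'Collect the semantic-function slice (level 6) of every phrase in the cube
Public Sub ShowSemanticSlice()
    Dim c As Integer, p As Integer
    For c = 1 To arrayMax            'only the clauses actually read from the XML file
        For p = 1 To 5               'at most five phrases per clause
            If Clause(c, p, 6) <> "-" Then
                'print the clause ID (stored at phrase 1 of level 1) and the semantic function
                Debug.Print Clause(c, 1, 1) & ": " & Clause(c, p, 6)
            End If
        Next
    Next
End Sub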
The three-dimensional array in VB6 contains the records of the 108 clauses found in Genesis 1:1-2:3. Each clause has five or fewer phrases, and each phrase has five levels of analysis. One level of analysis is added to record the verse number as primary key for reference and searching purposes (this will leave five unused data fields per clause, which may later be used for additional data). An array of 200 x 5 x 6 is used to implement this data structure. Although a size of 108 in the first dimension would be sufficient to hold all 108 clauses in the clause cube (cf. Figure 2.8), it was enlarged to 200 to allow room for appending more clauses' analyses, as discussed in section 5.3 of this chapter. If the array were populated manually with data (as was done in Chapter 2), the first clause could be coded as shown in Figure 5.2. The essential contents of Figure 2.8 are repeated in Figure 5.2 to enable readers to compare the two versions of the data that will be the outcomes of the conversion processes discussed below.
Option Explicit
Public Clause(1 To 200, 1 To 5, 1 To 6) As String

Sub Main()
    Clause(1, 1, 1) = "Gen01v01a"
    Clause(1, 1, 2) = "bre$it"
    Clause(1, 1, 3) = "in the beginning"
    Clause(1, 1, 4) = "PP"
    Clause(1, 1, 5) = "Adjunct"
    Clause(1, 1, 6) = "Time"
    Clause(1, 2, 1) = "-"
    Clause(1, 2, 2) = "bara"
    Clause(1, 2, 3) = "he created"
    Clause(1, 2, 4) = "VP"
    Clause(1, 2, 5) = "Main verb"
    Clause(1, 2, 6) = "Action"
    Clause(1, 3, 1) = "-"
    Clause(1, 3, 2) = "elohim"
    Clause(1, 3, 3) = "God"
    Clause(1, 3, 4) = "NP"
    Clause(1, 3, 5) = "Subject"
    Clause(1, 3, 6) = "Agent"
    Clause(1, 4, 1) = "-"
    Clause(1, 4, 2) = "et ha$amayim ve'et ha'arets"
    Clause(1, 4, 3) = "the heaven and the earth"
    Clause(1, 4, 4) = "NP"
    Clause(1, 4, 5) = "Object"
    Clause(1, 4, 6) = "Product"
    …
End Sub
Figure 5.2. VB6 code that could be used to create a three-dimensional array and populate one clause element with several layers of linguistic data. A complete discussion of this structure may be found in Chapter 2 (cf. Kroeze, 2004a). The same underlying structure is used in this chapter to convert the data captured in the XML document into the VB6 array.
5.2 Conversion between VB6 and XML (round-tripping)

One of the advantages of an XML database is the separation of the data from the manipulation thereof. The same data can thus be used for various purposes, and programs or queries can be created to suit the researcher's individual needs. An XML document in itself is, however, not very accessible for direct human inspection. Although it may be read in a simple text editor such as Notepad, the abundant use of tags poses an obstacle to human comprehension. One needs other software to process the data in such a repository efficiently, a tool to "bridge the gap between having a collection of structured documents and having a functional digital library" (Kumar et al., 2005: 118).[108] The VB6 program discussed in this chapter may be regarded as such a bridging tool. Another example is Petersen's (2004b) MQL query language, which enables complex searches for patterns in annotated linguistic corpora such as the database of the Hebrew Bible developed by the Werkgroep Informatica (WI) at the Free University of Amsterdam.[109] However, according to Bourret (2003), "most native XML databases can only return the data as XML".
[108] The tool described by Kumar et al. (2005) is open-source and aims to provide a customisable repository facilitating the dissemination of XML marked-up texts.
[109] Also, compare the description of XML-QL as a relationally complete query language in Deutsch et al. (1999).
Another benefit of XML is that it provides an independent public standard and cross-platform compatibility (T. Sasaki, 2004: 19). Since XML provides a platform-independent organisation of data, conversion is often necessary to make the data accessible to algorithms that implement efficient retrieval and human-friendly interfaces (cf. Ramsay, s.a.). The conversion of data encoded in XML is often necessary to satisfy very specific needs identified by researchers. For example, if different linguistic layers are annotated in separate, but related, XML databanks, it is necessary to programmatically merge these data sets into Prolog facts in order to associate them in a single database (Witt, 2005: 68, 71). Conversion into a standardised format enables researchers to compare various annotated layers in order to discover relations that exist between them (cf. Bayerl et al., 2003: 165, 169). This type of data exploration activity will be discussed in Chapter 6.
If the XML data is to be represented in a different, more human-readable format, it must first be parsed by an application. In this experiment the data is to be represented in an interlinear format, which is easier for humans to read. This requires the VB6 program to read the data into an array so that it can be printed as a series of interlinear tables on the screen (cf. Chapter 3). By removing the XML tags, the primary textual data is restored and the layers of analysis become much more comprehensible.
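The following minimal sketch indicates how such an interlinear table could be printed from the array for a single clause. The procedure name, the fixed column width and the use of the Immediate window are illustrative assumptions, not the actual display routine of the program (cf. Addendum L):

'Print one clause (row c of the array) as an interlinear table
Public Sub PrintInterlinear(c As Integer)
    Dim lvl As Integer, p As Integer, row As String
    Debug.Print "Clause: " & Clause(c, 1, 1)
    For lvl = 2 To 6                 'level 1 holds the clause ID, printed above
        row = ""
        For p = 1 To 5
            'pad every cell to a fixed width of 30 characters
            row = row & Left$(Clause(c, p, lvl) & Space$(30), 30)
        Next
        Debug.Print row
    Next
End Sub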
The next sections of this chapter describe the conversion of the linguistic data of Genesis 1:1-2:3 between the XML data cube and a three-dimensional array structure in Visual Basic 6 in order to facilitate data access and manipulation. The conversion from and to XML format is called round-tripping: the circular process of storing XML data in a database and recreating the document from the database, a process which often results in a different document (Bourret, 2003). In this experiment round-tripping refers to the process of converting the Genesis 1:1-2:3 XML document to the three-dimensional array structure in VB6 and saving it again in XML format. If no changes are made while the data reside in the array, the second XML document should be an exact copy of the first ("ideal round-tripping"; Smiljanić et al., 2002: 16). However, the array phase should facilitate updates, which should be reflected in the resulting target XML document after conversion. These
CRUD facilities will be discussed towards the end of this chapter. The complete code and program may be viewed in Addendum L.
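Whether ideal round-tripping has in fact been achieved can be verified mechanically by comparing the source and regenerated XML files character by character. The following function is a minimal sketch of such a check; it is an illustrative addition, not part of the program in Addendum L, and the name of the regenerated file in the example call is hypothetical:

'Compare two XML files to test for ideal round-tripping
Public Function IsIdealRoundTrip(file1 As String, file2 As String) As Boolean
    Dim num1 As Integer, num2 As Integer
    Dim text1 As String, text2 As String
    num1 = FreeFile
    Open file1 For Binary As #num1
    text1 = Space$(LOF(num1))        'size the buffer to the whole file
    Get #num1, , text1
    Close #num1
    num2 = FreeFile
    Open file2 For Binary As #num2
    text2 = Space$(LOF(num2))
    Get #num2, , text2
    Close #num2
    IsIdealRoundTrip = (text1 = text2)
End Function

'Example call:
'If IsIdealRoundTrip("Gen1_InputV15_RT1.xml", "Gen1_Output.xml") Then ...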
5.2.1 From XML to VB6

All data in an XML document is text (Bourret, 2003). The mark-up itself is also text only: "... markup consists of character strings carrying information about other character strings" (Huitfeldt, 2004). For a linguistic database this poses no problem, of course, since it also contains text data only. Therefore, in VB6, all the elements of the three-dimensional array are also of type String only. The limitation of arrays that all the elements should be of the same type (string, integer, boolean, etc.), therefore, poses no problem. To strip the tags from the XML code, a considerable amount of string processing has to be done (cf. Petroutsos, 1999: 784-795).
An efficient way to prepare the Genesis 1:1-2:3 data for ideal round-tripping would be to ensure that empty elements (for example, where a clause has fewer than five phrases) are represented by a dash (-). The loop that reads the clause cube elements into the three-dimensional array can then simply assume that the next line in the XML document will be the next element in the data structure. Not all phrases have syntactic or semantic functions, and these missing elements may also be rendered by a dash. This simple implementation will be used in this experiment, because it also ensures that, after ideal round-tripping, the XML document is an exact copy of the original document. However, in order to reduce file size and to save memory space, it is possible to represent null values by simply omitting these elements in the document. The conversion program will then have to evaluate the content of each line, using a selection structure (such as an if-statement), in order to ensure the correct placement in the array. This procedure causes another form of, and probably more, overhead.[110] On the VB6 side, empty elements could also be represented by zero-length string values in the array variables.
[110] The XML schema, discussed in Chapter 4, can only check the validity of data recorded in the XML file. Since absent elements are valid, another mechanism is needed to ensure the correct conversion of such elements from the XML file into the three-dimensional array.
To avoid problems during advanced array processing due to these null values, the whole array may first be populated with dashes (as the symbol of an empty element), which are then partly overwritten when the data is read in from the XML document. This ensures that all empty elements (as well as the yet unused spaces in the array reserved for new clauses to be appended) contain dashes. A sketch of the alternative, selection-based approach is given below.
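If empty elements were omitted rather than rendered as dashes, the placement of each line would have to be decided by its tag name. The following fragment is a minimal sketch of that alternative approach; the tag names <phon> and <trans>, the counter phraseNo and the helper function StripTags are all hypothetical, and this is not the approach implemented in this experiment:

'Hypothetical placement of one line according to its tag name
Public Sub PlaceLine(tempLine As String, count1 As Integer, phraseNo As Integer)
    If InStr(tempLine, "<phon>") > 0 Then
        Clause(count1, phraseNo, 2) = StripTags(tempLine)
    ElseIf InStr(tempLine, "<trans>") > 0 Then
        Clause(count1, phraseNo, 3) = StripTags(tempLine)
    '... further ElseIf branches for the other levels of analysis
    End If
End Sub

'Hypothetical helper that returns the text between the opening and closing tags
Public Function StripTags(s As String) As String
    Dim b As Integer, e As Integer
    b = InStr(s, ">") + 1
    e = InStrRev(s, "<")
    StripTags = Mid$(s, b, e - b)
End Function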
Before the data is converted, an algorithm is used to count the number of clauses appearing in the XML file, and the result is stored in the variables countclauses and arrayMax. The latter variable is used to limit processing in the rest of the VB6 program to real data only (ignoring empty clause elements); its value should, therefore, be adjusted when clauses are added or deleted during the array phase.
An extract of the code for this part of the program is shown in Figure 5.3. It is assumed that all variables have been declared.
'Read XML file from disk into array
Public Sub Command1_Click()

'Initialise all array elements with empty element symbols
For iniArr1 = 1 To 200
    For iniArr2 = 1 To 5
        For iniArr3 = 1 To 6
            Clause(iniArr1, iniArr2, iniArr3) = "-"
        Next
    Next
Next

'Count number of clauses in the XML cube:
arrayMax = 0       'Reset total number of clauses in array
countclauses = 0   'Reset counter that counts number of clauses in XML file
filenum1 = FreeFile
Open "Gen1_InputV15_RT1.xml" For Input As #filenum1
Line Input #filenum1, tempLine
Line Input #filenum1, tempLine
Line Input #filenum1, tempLine
Line Input #filenum1, tempLine
While Not EOF(filenum1)
    Line Input #filenum1, tempLine
    countclauses = countclauses + 1
    Line Input #filenum1, tempLine
    '... (further Line Input statements, shown in full in Addendum L,
    'read past the remaining lines of the clause element)
Wend
MsgBox ("There are " & countclauses & " clauses in the XML cube")
arrayMax = countclauses
Close #filenum1

'Populate array with data from XML file:
Open "Gen1_InputV15_RT1.xml" For Input As #filenum1
Line Input #filenum1, tempLine
Line Input #filenum1, tempLine
Line Input #filenum1, tempLine
For count1 = 1 To arrayMax
    Line Input #filenum1, tempLine
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 1, 1) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    'Ten Line Input statements skip the lines between the clause ID and the
    'phonological level:
    Line Input #filenum1, tempLine
    '...
    Line Input #filenum1, tempLine
    'Phonological level (level 2 of the array):
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 1, 2) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 2, 2) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 3, 2) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 4, 2) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 5, 2) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    'Four Line Input statements then skip ahead to the next level of analysis,
    'after which the same pattern of five decoded lines is repeated for the
    'translations (level 3), phrase types (level 4), syntactic functions
    '(level 5) and semantic functions (level 6):
    Line Input #filenum1, tempLine
    Line Input #filenum1, tempLine
    Line Input #filenum1, tempLine
    Line Input #filenum1, tempLine
    Line Input #filenum1, tempLine
    Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
    Clause(count1, 1, 3) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    '... (and so on, up to the last phrase of the last level)
    Clause(count1, 5, 6) = Mid(tempLine, XMLstringBeginPos, XMLstringLength)
    Line Input #filenum1, tempLine
    Line Input #filenum1, tempLine
Next
Close #filenum1
arrayflag = True
MsgBox ("XML cube Gen1_InputV15_RT1.xml converted to array in RAM")
count1 = 1
Call ShowArray
End Sub

'Function used to strip XML tags before inserting data into array
Public Sub DecodeXML(XMLstringBeginPos2 As Integer, XMLstringEndPos2 As Integer, XMLstringLength2 As Integer, templine2 As String)
XMLstringBeginPos2 = InStr(templine2, ">") + 1
XMLstringEndPos2 = InStrRev(templine2, "<")
XMLstringLength2 = XMLstringEndPos2 - XMLstringBeginPos2
End Sub
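To see the effect of this routine on a single line of the databank, consider a hypothetical tagged line (the tag name synf is illustrative only):

'tempLine = "<synf>Subject</synf>"
'Call DecodeXML(XMLstringBeginPos, XMLstringEndPos, XMLstringLength, tempLine)
'InStr finds the first ">" at position 6, so XMLstringBeginPos becomes 7;
'InStrRev finds the last "<" at position 14, so XMLstringEndPos becomes 14
'and XMLstringLength becomes 14 - 7 = 7. Consequently,
'Mid(tempLine, XMLstringBeginPos, XMLstringLength) returns "Subject".

Because the parameters are passed by reference (the VB6 default), the three position variables of the caller are updated by the call itself, which is why DecodeXML can be written as a Sub rather than a Function.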