Investigation of Regular Expression-Based Pattern Matching and Creation

Anthony Cox, Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada, [email protected]

Maryanne Fisher, Department of Psychology, York University, Toronto, Ontario, Canada, [email protected]

Abstract

In this paper, we examine the cognitive skills underlying the use of patterns expressed using regular (Chomsky type 3) languages. We predicted a relationship between accuracy and completeness such that they improve in concert, thereby indicating the application of the same cognitive skill set. As well, we hypothesized a close relationship between the tasks of pattern creation and matching, since both may rely on the same cognitive abilities. In Study 1, the first but not the second hypothesis was supported. Furthermore, the measurement of task performance using fine-grained (character level) and coarse-grained (substring level) assessment techniques was investigated and the relationship between the two techniques explored. In Study 2, we addressed the possibility that our test instrument may have accounted for the nonsignificant relationship between pattern creation and matching. Our findings verify this possibility and indicate that there is a relationship between creation and matching. It is likely that creation is a more developed ability than matching due to the necessity for pattern generation rather than application. In addition to replicating the initial findings for granularity and performance measures, we extended the study to break down performance with respect to the alternation and repetition operators. Previous research on Boolean query systems demonstrates that alternation is more difficult than other Boolean operations, but this effect has not been examined for regular expressions. Our findings indicate a similar effect for regular expressions, with alternation being more difficult than concatenation or repetition. We discuss this finding in terms of cognitive processing.

1 Introduction

In computer software such as grep, vi, and Perl, regular expressions are used to describe the targets of search operations. For this role, regular expressions are used as a pattern description mechanism, with pattern matches providing search solutions. While there has been significant research on algorithms for automated matching and manipulation of regular expressions [4], there has been little research on the human element of these systems. In this paper, we address this deficiency by examining individual performance on the tasks of creating and matching regular expressions.

This paper is organized as follows. A brief overview of regular expressions in the context of formal language theory is first provided. Then, we present and discuss the first study that we performed to investigate performance on expression matching and creation tasks. The results of a second, revised, study are then presented and discussed. Finally, the paper concludes with the presentation of some future research directions.

2 Regular Expressions

The Chomsky hierarchy of languages [1] orders languages into four classes identified by number. Each class properly includes all higher-numbered classes, giving Class 0 the largest number of languages and Class 3 the smallest. In most literature, the language classes are identified by alternative names: Recursively Enumerable or Phrase Structured (Class 0), Context Sensitive (Class 1), Context Free (Class 2) and Regular (Class 3). Every language can be described by a grammar, or set of rules, that describes valid constructions in the language. In grammar theory, the words that make up the language are considered as forming the alphabet, Σ, of the language. Thus, given an alphabet and a grammar, it is possible to decide whether a specified sequence of symbols is a member of the language described by the grammar. However, there are alternative representations of a language in addition to its grammar. It has been shown that every regular language can be described by a regular expression [4]. Regular expressions are formed by combining elements of Σ using three operations: concatenation, alternation and repetition. Figure 1 provides a definition for well-formed regular expressions.

Given an alphabet Σ, where a ∈ Σ and r and s are regular expressions:
1. a is a regular expression.
2. rs is a regular expression. (Concatenation)
3. r|s is a regular expression. (Alternation)
4. r* is a regular expression. (Repetition)
5. (r) is a regular expression. (Parenthesis)

Figure 1. Well-Formed Regular Expressions

Concatenation appends two regular expressions and is the mechanism by which longer expressions are built from shorter ones. Alternation is a selection mechanism, with the expression r|s indicating a choice in selecting either the expression r or the expression s (exclusive or). Repetition (the Kleene closure) describes the set of zero or more successive occurrences of an expression. Hence, for an alphabet containing the symbols a, b, c and d, Figure 2 provides some examples of well-formed regular expressions and their associated regular languages. In Figure 2, it can be seen that ε, the empty string, is a valid member of some languages. To avoid issues with coding results containing empty strings, we will utilize a modified version of regular expressions that replaces the *, zero or more, operator with the +, one or more, operator. This change, apart from excluding ε as an element of any regular language, can be proven to have no effect on the expressivity of regular expressions.

3 Search Specification

Regular languages are simple enough to be easily described but provide sufficient flexibility for describing the targets of searches. It is for this role that regular expressions are best known in the field of computer science. Another mechanism for specifying search results is Boolean Algebra, as used in many information retrieval and WWW search tools. Boolean Algebra also provides an alternation (or) operator, but replaces concatenation with a conjunction (and) operator. The repetition operator does not exist in Boolean Algebra, but a negation (not) operator is available. Human performance when using Boolean Algebra to specify search results has been studied by Green et al. [3]. Although not completely related, the common use of an alternation operator and their application for a similar role provide a link between Boolean Algebra, a representation of formal logic, and regular expressions, a restricted class of formal language. The two studies presented here can be seen as a first attempt at examining the relationship between language and logic. However, it should be noted that our original motivations were less theoretical. Our initial goal was to examine the relationship between matching and creation in the context of a formal pattern system.

To measure performance, we adopt the measures of precision and recall used in the discipline of information retrieval to measure the accuracy and completeness of retrieval tasks. Precision and recall have been previously used to measure performance of Boolean search specifications [8]. For a search that returns a set of solutions, S, where T is the complete set of possible solutions, the precision of the search, and hence of the search specification, is defined as:

\[ \mathrm{Precision} = \frac{|S \cap T|}{|S|} \]

The notation |X| is used to identify the cardinality or size of the set X. Precision measures the fraction of the search results that are accurate or correct. Recall measures the completeness of the search result and is the fraction of the correct results with respect to the total possible results. Recall is defined as:

\[ \mathrm{Recall} = \frac{|S \cap T|}{|T|} \]

Specifically, this research is intended to address four distinct issues. First, we wanted to explore the effects of using different granularities for recording pattern matches. It is possible that evaluating match solutions at the character level is under-sensitive since single character errors may not significantly affect results. Conversely, evaluating solutions as a whole may be overly sensitive with respect to single character errors. Second, we wished to investigate the relationship between precision and recall. Do participants use a conservative strategy and improve precision at the expense of recall, or an aggressive strategy that improves recall at the expense of precision? For example, the omission of a suspect, but correct, solution will have no effect upon precision, but will lower recall. Third, it is likely that there is a relationship between performance on pattern matching and pattern creation tasks. We believe that the two tasks utilize the same cognitive skill set. However, pattern creation may require more highly developed abilities as it requires the generation of a pattern as opposed to the application of one. Finally, it is known that in Boolean Algebra the alternation operator is more difficult to use than the conjunction operator [3]. We believe that this effect will also appear in the context of regular expressions and sought to examine the issue.

    Expression (operators not preserved)   Language
1:  ab                                     ab
2:  a b                                    ab, aab, aaab, . . .
3:  ab                                     a, ab, abb, abbb, . . .
4:  ab                                     ε, ab, abab, ababab, . . .
5:  ab                                     a, b
6:  ab cd                                  abd, acd (Note: | applies only to b and c.)
7:  ab ac                                  ab, ac
8:  ab                                     a, b, aa, ba, ab, bb, aaa, baa, aba, bba, aab, bab, abb, bbb, . . .

Figure 2. Example Regular Expressions and Their Languages
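The relationship between an expression and its language can be checked mechanically. The following is a minimal sketch, assuming that Python's `re` syntax (`|`, `*`, `+`, parentheses) corresponds to the operators described in Section 2; the helper name `language` and the length bound are illustrative choices, not part of the original study. It enumerates every string over the alphabet up to a fixed length and keeps those that the expression matches in full, which also shows how replacing `*` with `+` removes only the empty string.

```python
import re
from itertools import product

def language(pattern, alphabet="abcd", max_len=4):
    """List all strings over `alphabet`, up to `max_len` symbols,
    that the pattern matches in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return found

print(language("a(b|c)d"))  # alternation applies only to b and c
print(language("ab|ac"))    # alternation of two concatenations
print(language("(ab)*"))    # zero or more: includes the empty string
print(language("(ab)+"))    # one or more: the empty string is excluded
```

Bounding the string length keeps the enumeration finite; languages built with repetition are infinite, so only a prefix of each language is listed.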

4 Study 1

To address these issues, we conducted an investigation using a survey designed to assess performance on pattern matching and creation tasks. The scoring procedure enabled us to explore the issues of solution granularity and performance measurement using precision and recall.

4.1 Method

4.1.1 Participants

Participants were recruited as volunteers from various classes in the Department of Psychology at York University. It should be noted that the university is located in the metropolitan city of Toronto, and hence the ethnic and socioeconomic backgrounds of the participants are diverse. The final sample included 9 men (age in years, [...]) and 27 women ([...]), excluding five surveys we omitted due to incompleteness or a clearly indicated lack of task comprehension. All participants reported that they had no previous experience using regular expressions, and debriefing revealed all were naïve to the experimental hypotheses.

4.1.2 Apparatus and Procedure

In Study 1, participants were given a four-part survey. Part one was an instructional sheet used to explain the formation of regular expressions. Participants were given three minutes to study the material on the instructional sheet. The instructional sheet was not taken from the participants and the experimenter suggested that participants consult the sheet for reference when completing the remainder of the survey.

Part two of the survey was a pattern matching task. Participants were instructed to underline all occurrences of a pattern in a given string of characters. There were 10 items in the task, each having a different pattern and string. Figure 3 provides an example of an item. Participants were given five minutes to complete this part of the survey and were encouraged to utilize the entire five minutes to the best of their ability. As well, participants were instructed to attempt each item in sequence and to not return to previously attempted items.

Part three of the survey was a pattern creation task. Participants were given a written description of a search solution and instructed to generate a regular expression whose regular language matched the search solution. For the last seven of the 10 items, examples of some possible matches were provided to supplement the written description. An example of a creation task item can be found in Figure 4. Participants were also given five minutes to complete this task and were encouraged to use the entire time and to work on the items in numerical order. To determine whether any order effects were present, the survey was counter-balanced, with half the participants performing the matching task first and the other half performing the creation task first.

In part four, all participants were asked to answer a few demographic, exclusion identification and follow-up questions. Demographic data collected included the age and sex of the participants. The exclusion questions identified the field of study for the participants and their familiarity with regular expressions. The follow-up questions examined participants' satisfaction with the instructional sheet and their opinions on the relative difficulty of the two tasks.

When generating precision and recall scores, the granularity used to determine the size of the solution has a direct effect on the score generated. For example, given the string:

xxxyzzzxxxyzzxx

and the pattern xyz, the participant response:

xxxyzzzxxxyzzxx

has a precision of 6/7 at the character level (6 of 7 characters correct) and a precision of 1/2 at the substring level (1 of 2 solution substrings correct). The term substring is used to indicate that match elements are substrings of the data string. To explore the relationship between these granularities, precision and recall values were calculated at both the character and substring level.
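The effect of granularity on the score can be sketched in a few lines of Python. The underlining in the participant response above is not preserved, so the response spans below are a hypothetical reconstruction chosen to reproduce the stated figures (one exact match, plus one match extended by a trailing z); the helper names are likewise illustrative.

```python
import re
from fractions import Fraction

def spans(pattern, text):
    """Return (start, end) spans of all non-overlapping matches."""
    return [m.span() for m in re.finditer(pattern, text)]

def char_set(span_list):
    """Expand spans into the set of character positions they cover."""
    return {i for start, end in span_list for i in range(start, end)}

def precision(returned, correct):
    """Fraction of the returned elements that are correct."""
    return Fraction(len(returned & correct), len(returned))

def recall(returned, correct):
    """Fraction of the correct elements that were returned."""
    return Fraction(len(returned & correct), len(correct))

text = "xxxyzzzxxxyzzxx"
correct_spans = spans("xyz", text)       # the two true occurrences of xyz
response_spans = [(2, 5), (9, 13)]       # hypothetical: "xyz" and "xyzz"

# Character level: compare sets of underlined character positions.
char_precision = precision(char_set(response_spans), char_set(correct_spans))
char_recall = recall(char_set(response_spans), char_set(correct_spans))

# Substring level: a solution counts only if the whole span is exact.
sub_precision = precision(set(response_spans), set(correct_spans))
sub_recall = recall(set(response_spans), set(correct_spans))

print(char_precision, sub_precision)  # 6/7 1/2
print(char_recall, sub_recall)        # 1 1/2
```

A single extra character costs one seventh of the character-level precision but half of the substring-level precision, which is the sensitivity difference the two granularities are meant to expose.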

1. Pattern: bg
   String: acdbggbcgbgbedccdfabagabadefbgcccfeedbbbbbgcbabcdgcef

Figure 3. Study 1 Matching Task Item

1. A sequence of c's containing one f and that begins and ends with a c.
   e.g. cfc, ccfc, cfcc, cccfc, ccfcc, cfccc, . . .

Figure 4. Study 1 Creation Task Item
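Both sample items can be worked through with Python's `re` module. This is a sketch for illustration only: the expression `c+fc+` is one candidate answer to the creation item, not necessarily the ideal solution used for scoring, and the non-example strings are assumptions added here to probe it.

```python
import re

# Matching task (Figure 3): locate every occurrence of the pattern.
data = "acdbggbcgbgbedccdfabagabadefbgcccfeedbbbbbgcbabcdgcef"
matches = [m.span() for m in re.finditer("bg", data)]
print(matches)  # [(3, 5), (9, 11), (28, 30), (41, 43)]

# Creation task (Figure 4): a candidate expression for
# "a sequence of c's containing one f that begins and ends with a c".
candidate = re.compile(r"c+fc+")
examples = ["cfc", "ccfc", "cfcc", "cccfc", "ccfcc", "cfccc"]
non_examples = ["fc", "cf", "cffc", "ccc"]  # assumed counter-examples

print(all(candidate.fullmatch(s) for s in examples))      # True
print(any(candidate.fullmatch(s) for s in non_examples))  # False
```

The candidate uses the one-or-more operator, in line with the modified notation adopted in Section 2.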

The generation of precision and recall values for the matching task is accomplished by counting the number of underlined and correctly underlined solutions and forming the appropriate ratios. For the creation task, the created expressions were applied to a set of arbitrarily constructed "representative strings" and the precision and recall values calculated. The representative strings were generated by the same experimenter as the data strings of the matching task, with the intent that both sets of strings contain similar character orderings and constructions. It is assumed that the application of created expressions is error-free as it was performed by an individual highly experienced in the use of regular expressions.

Before beginning the survey, participants were given a consent form, and upon finishing the survey participants were debriefed. All participants were tested individually, with each subject placing their completed survey in a collection envelope. Surveys were scored by the experimenters once all data were collected.

4.2 Results

Due to the number of comparisons, we adopted a conservative significance level of [...] to reduce the possibility of committing a Type I error. Due to the unspecified direction of the hypotheses, all reported analyses are two-tailed. There were three hypotheses for Study 1. First, we predicted a difference in performance based on granularity of recording pattern matches. Second, we hypothesized the existence of a relationship between precision and recall measures. Third, we predicted a relationship between pattern matching and creation abilities.

To test the first issue, the possibility of differences in performance due to granularity, we conducted paired-samples t-tests for precision and recall scores at the character and substring levels. Individual mean performance on character precision was significantly higher than substring precision, [...]. Character precision yielded [...] whereas substring precision yielded [...]. Individual mean character recall was also significantly higher than substring recall, [...]; [...] and [...], respectively. We additionally conducted paired-samples correlations to examine the possibility that performance at the character level is related to performance at the substring level. For precision, character and substring performances were significantly related, [...]. There was a corresponding finding for recall, as character and substring performances were significantly related, [...].

The relationship between precision and recall was analysed by collapsing the data across task and granularity, thus generating an overall mean precision and recall value for each participant. A paired-samples t-test indicated significant differences, [...]. Individuals' recall values were significantly less than their precision scores; [...] and [...], respectively. In addition, a paired-samples correlation yielded a significant positive relationship between precision and recall, [...].

To examine the relationship between pattern matching and creation, we collapsed the data across granularity and performance measures to generate an overall mean matching and creation value for each participant. The paired-samples t-test yielded a significant difference, [...]. Creation scores were significantly lower than matching scores; [...] and [...], respectively. Furthermore, the scores were unrelated by paired-samples correlation, [...].

4.3 Discussion

The strong correlation between precision or recall values at the character and substring level indicates that either granularity can be used to measure performance. As expected, the values at the substring level are lower than those for the character level as a result of the smaller number of solutions at this level and the sensitivity of the solutions to single character errors.

It was found that recall scores of each participant are significantly lower than their precision scores. We believe that this effect can be partially attributed to the testing instrument. In the survey, it was observed that many participants successfully identified all but one of the possible solutions for a particular item. It is likely that this is the result of simple oversight and not of their inability to identify a correct solution. One explanation could be that the participants experienced a form of repetition blindness [6] when multiple identical solutions appeared close together.

As precision and recall positively correlate, there is no evidence of any individual strategy being used. For example, an aggressive participant could have raised all their matching task, character level, recall scores to 1.0 by simply underlining the entire data string. This strategy would significantly lower their precision score as a result of generating many invalid solutions. The fact that recall is significantly lower than precision indicates that a conservative strategy is consistently used. It is likely that participants were conscientious in their completion of the surveys and tended to err on the side of caution. It should be noted that the high means reported for precision and recall are a result of the survey design. The initial task items are intentionally easy and are intended to build subjects' confidence for the purpose of improving compliance.

Although predicted, there was no correlation between the scores for matching and creation. Examination of the completed surveys reveals that participants had considerable difficulty in creating expressions. Further examination and consultation with experienced regular expression users indicated a belief that the creation task was much more difficult than the matching task. The number of operators used in expressions for the matching task (26) is lower than for ideal solutions in the creation task (33). The number of alphabet symbols used in matching task expressions (27) is also lower than for creation task expressions (44).
Support for belief in the difference in difficulty between performance on matching and creation can be found in the significantly lower mean on the creation task than on the matching task. To further explore this issue as well as to replicate the results, we decided to extend the research by performing a second study.
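The trade-off behind the aggressive strategy described in this discussion can be illustrated with the matching item of Figure 3. This is a sketch under the assumption that character-level scores are computed over sets of underlined positions; the variable names are illustrative.

```python
import re
from fractions import Fraction

data = "acdbggbcgbgbedccdfabagabadefbgcccfeedbbbbbgcbabcdgcef"

# Character positions covered by the true occurrences of the pattern bg.
correct = {i for m in re.finditer("bg", data) for i in range(*m.span())}

# Aggressive strategy: underline every character of the data string.
aggressive = set(range(len(data)))

recall = Fraction(len(aggressive & correct), len(correct))
precision = Fraction(len(aggressive & correct), len(aggressive))
print(recall, precision)  # 1 8/53
```

Underlining everything guarantees perfect recall, but only 8 of the 53 underlined characters belong to a solution, so precision collapses.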

5 Study 2

In Study 2 we aimed to replicate the majority of the findings from the previous study and to explore the differences between alternation and repetition. The survey used in Study 2 is a revised version of that used in Study 1, modified to increase the similarity of presentation between the matching and creation tasks.

5.1 Method

5.1.1 Participants

Participants were solicited from various community locations in Toronto, including a manufacturing company, business office, retail outlet, athletic facilities, restaurant, and hospital. There were a total of 64 participants in the final sample: 30 men (age, in years, [...]) and 34 women ([...]). One participant was excluded as he had experience with regular expressions, and three were excluded due to survey incompleteness or misunderstanding of the tasks. Participants' educational history, ethnicity, and socioeconomic status were diverse. All participants had no previous experience using regular expressions, and debriefing revealed all were naïve to the experimental hypotheses.

5.1.2 Apparatus and Procedure

In Study 2, the timing restrictions were removed and participants were permitted as much time as they desired for each section. As with Study 1, the tasks were counter-balanced and administered in the reverse order to one half of the participants. The instructional sheet of Study 2 was improved in accordance with the anecdotal reports obtained from participants during debriefing for Study 1. The primary change was the inclusion of the example suite of Figure 2. Other changes included minor improvements in wording, additional instruction on the use of parentheses and deletion of the task alphabet definition.

The matching task was structured similarly, but the creation task was modified to be more like the matching task. Figure 5 provides an example of a Study 2 creation task item. The modified creation task presents participants with an underlined string, where the underlined portions represent the solutions to an applied regular expression that the participants must generate. For both matching and creation, the first six items are all structurally identical to an element of the example suite on the instructional sheet. Both tasks use the same six items but with the order varying. The remaining four items do not appear on the instructional sheet and can be considered as slightly more complex. The number of operators on both tasks is identical, although the creation task expressions have three more alphabet symbols.

As the participants are from a community-based sample, the recruiting procedure was different. Participants were approached by a female experimenter and asked if they would mind participating in a study on pattern and language formation. The remainder of the procedure was similar.

1. String: zzuuxyxyzyzxzzxxyyzyxyxzzyxzyyyyyzyzzzyyzywyxxwuwu
   Solution:

Figure 5. Study 2 Creation Task Item

5.2 Results

Similar to Study 1, we employed a conservative significance level of [...] to reduce the possibility of committing a Type I error, and all reported analyses are two-tailed. There were four hypotheses for Study 2, with the first and second stated with the intent of replicating the findings of the first study. Therefore, we hypothesized a difference in performance due to granularity of recording pattern matches, and a relationship between precision and recall. We predicted a relationship between pattern matching and creation abilities that we did not find in Study 1. Lastly, we hypothesized a difference in performance on alternation items and repetition items.

A paired-samples t-test was used to examine the possibility of differences in performance due to granularity for both precision and recall measures. Individuals' character precision was significantly higher than their performance on substring precision, [...]. Similar to Study 1, character precision resulted in [...] in contrast to substring precision that resulted in [...]. Likewise, mean character recall was significantly higher than substring recall, [...]. Character recall yielded M = 0.81 (SD = 0.12) whereas substring recall yielded [...]. Paired-samples correlations revealed significant relationships between character and substring precision, [...], and between character and substring recall, [...].

To examine the relationship between precision and recall, we collapsed the data across task and granularity to generate an overall mean for each measure. A paired-samples t-test resulted in significant differences, [...]. As we found in Study 1, participants' recall values were significantly less than their precision values; [...] and [...], respectively. The relationship between precision and recall was again significant, paired-samples correlation [...].

The possibility of a relationship between pattern matching and creation was investigated by collapsing the data across granularity and performance measures, which generated an overall mean for each task. Contrary to Study 1, a paired-samples t-test did not yield significant results, [...]. Also in contrast with Study 1, there was a significant relationship between matching and creation as revealed by a paired-samples correlation, [...]. To ensure that these findings were not due to an order effect, a repeated measures Analysis of Variance (ANOVA) was conducted. This analysis yielded nonsignificant results for the main effect of task, [...], and for the interaction of the task and version, [...].

Finally, we examined the differences in performance on items containing alternation or repetition in the creation and matching tasks. For the creation task, a paired-samples t-test indicated significant differences between alternation and repetition items, [...]. Alternation items resulted in lower values than repetition items, [...] and [...], respectively. We also compared alternation items with items containing both alternation and repetition, [...]. A final comparison of repetition items with items containing both alternation and repetition revealed a significant difference, [...]. Items with both operators resulted in significantly lower values than repetition items, [...] and [...].

The same pattern emerged for the matching task. A paired-samples t-test yielded significant differences between alternation and repetition items, [...]. Alternation resulted in significantly lower scores (M = 0.72, SD = 0.19) than repetition (M = 0.82, SD = 0.18). A comparison of alternation with items containing both repetition and alternation revealed no significant difference, paired-samples [...]. Finally, a comparison of repetition with items containing both repetition and alternation resulted in significant differences, paired-samples [...]. Repetition resulted in higher values than items containing both operators, [...] and [...], respectively.

5.3 Discussion

The modifications to the instruction sheet changed the participants' reported satisfaction with the instruction sheet from 44.4% (Study 1) to 70.4% (Study 2). Anecdotal reports during debriefing indicate that the addition of an example suite was the primary cause of this increase.

The replication of the correlation between character level and substring level measures provides additional evidence of the interchangeability of the two scores. Future researchers may utilize either recording technique without affecting results. However, it should be noted that the high sensitivity of substring level scores and the associated lower mean may obscure small effects in performance.

Replication of the correlation between precision and recall strongly suggests the absence of any significant individual strategy differences. All participants tend towards a conservative strategy and favour accuracy over completeness. While strategy differences may exist, they are displayed with respect to the amount of conservatism a specific participant employed. No evidence exists for the use of an aggressive strategy favouring recall over precision.

In Study 2, there was no significant difference in the means for the matching and creation tasks. This finding indicates that the two tasks were much closer in difficulty than in Study 1. Once the difference in task difficulty was removed, performance on matching was found to correlate with that of creation. This correlation is indicative of a common skill set being used for both tasks. The lack of an order effect also indicates the lack of a practice effect where the first task provides practice for the second. We believe that the lack of feedback given after the first task prevented individuals from improving their skill in expression manipulation. Future research is required to determine other aspects of this cognitive ability. Do nonverbal patterns reveal these same trends or are they specific to textual patterns only?

Our results confirm the hypothesis that alternation is more difficult than repetition or concatenation. Item groups containing the 'or' operator have a significantly lower mean than those not containing the operator. This effect is evidenced in both the matching and the creation task. To ensure that the alternation operator is the cause of the effect, we divided items into three groups: those containing only the alternation operator, those containing only the repetition operator and those containing both. The results indicate that there was a difference between the repetition and alternation group and between the repetition and both operator group. However, there was no difference between the alternation and both operator group. This finding indicates that it is the presence of the alternation operator that is responsible for the difference and not some form of operator interaction.

6 Conclusions and Future Work

The lower recall performance on matching tasks, as a result of missed solutions, is of interest for further research. We believe that some form of repetition blindness is occurring and that performance is affected by phenomena in addition to participants’ skill level. Future research will explore this by examining the locations of missed solutions relative to similar and identical character sequences.

We find it curious that there is no evidence of strategy use. It is possible that different tasks would create situations that require tradeoffs. It may be that, since our study did not impose the need for a strategy, participants did not use one.

Pane and Myers [7] explored the issue of pattern creation and matching in the context of Boolean algebra. They reported no difference in matching performance as a result of the format of a test item. While the use of a textual and a diagrammatic expression format had no effect on matching performance, it did significantly affect creation performance. Their data suggest, on the basis of a correct vs. incorrect scoring system, that creation is an easier task than matching. Participants in their study answered 72.5% of the matching tasks correctly and 89.5% of the creation tasks correctly, when averaged over both expression formats. No explanation was offered for their finding. In contrast, we obtained lower creation than matching scores in Study 1 and equivalent scores in Study 2. This apparent discrepancy in reported findings requires further investigation.

It is not surprising that participants had more difficulty manipulating expressions with alternation than those without the operator, since it is documented that a similar phenomenon occurs in Boolean query systems [3]. While Vakkari [9] reports that this effect decreases with improved conceptual representation of the search task domain, it is also possible that the reported improvement is due to improved skill in the use of a Boolean system. Vakkari also describes the use of alternation as a “parallel search tactic” due to the need to simultaneously identify solutions for both elements of the construct. The data of Greene et al. [3] support this concept of parallelism. Participants in their experiment took almost twice as long, 44.8 vs. 24.4 seconds, on queries with disjunction alone as compared to conjunction alone. Chui and Dillon [2] suggest that this effect is the result of a greater level of difficulty in processing disjunctive information. This explanation is supported by Johnson-Laird [5], who postulates that human processing of logical syllogisms is limited in the number of alternative models that can be simultaneously maintained in working memory. When working memory is depleted, processing will have to be performed sequentially, increasing the time needed to solve a task. It is possible that this effect is stronger in novices, as they may utilize working memory less efficiently while developing their cognitive skills.

In future work, we intend to use a timed study to test the hypothesis that expressions containing alternation are solved sequentially. If parallel processing of alternation expressions is occurring, matching of these expressions should be similar in time to that of repetition expressions. However, if sequential processing is used to match alternation expressions, the time to perform a matching task will increase with the number of alternation operations.

While the research presented here has begun the exploration of the cognitive skills needed to manipulate regular expressions, there is still much to be done. In future research, we intend to continue the line of experimentation started here and to answer some of the questions that have been raised.

7 Acknowledgments

We would like to thank Josipa Granic and Diana Smith for their assistance with the second study. Their hard work and dedication to this project are greatly appreciated.

References

[1] N. Chomsky. On certain formal properties of grammars. Information and Control, 2(2):137–167, 1959.
[2] M. Chui and A. Dillon. Speed and accuracy using four boolean query systems. In 10th AAAI Midwest Artificial Intelligence and Cognitive Science Conference, pages 36–42, Bloomington, Indiana, April 1999.
[3] S. Greene, S. Devlin, P. Cannata, and L. Gomez. No IFs, ANDs, or ORs: A study of database querying. International Journal of Man–Machine Studies, 32(3):303–326, 1990.
[4] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts, 1979.
[5] P. Johnson-Laird. Mental Models. Harvard University Press, Cambridge, Massachusetts, 1983.
[6] N. Kanwisher. Repetition blindness: Type recognition without token individuation. Cognition, 27(2):117–143, 1987.
[7] J. Pane and B. Myers. Improving user performance on boolean queries. In ACM Conference on Human Factors in Computing Systems, pages 269–270, The Hague, Netherlands, April 2000.
[8] H. Turtle. Natural language vs. boolean query evaluation: A comparison of retrieval performance. In 17th Annual International Conference on Research and Development in Information Retrieval, pages 212–220, Dublin, Ireland, July 1994. ACM SIGIR.
[9] P. Vakkari. Cognition and changes of search terms and tactics during task performance. In RIAO International Conference, pages 894–907, Paris, France, April 2000.
