Context and Cognition
A corpus-driven approach to parenthetical uses of mental predicates
Dylan Glynn, Lund University
Karolina Krawczak, Adam Mickiewicz University, Poznań
1. Introduction. Corpus linguistics and social cognition
Linguists have no direct access to cognition.¹ How, then, can we capture the intersubjective dimension of cognitive processing? Although experimentation can obtain indirect information on how humans process language, this indirect method offers no insights into social cognition or how humans, in their natural social milieu, juggle the complexity of language structure and context. By examining patterns of actual intersubjective language use, this study seeks to capture the factors that speakers consider in communication events. The basic assumption is that multifactorial patterns of intersubjective language use are an indirect means of identifying structures of social cognition. We focus on the mental predicates think, suppose and guess. These verbs externalise internal epistemic stance. Their use requires high sensitivity and responsiveness to the intersubjective context, putting them at an intersection of individual and social cognition. The study adopts a Cognitive Linguistic framework (Lakoff 1987, Langacker 1987) and applies corpus-driven methodology. In this, it joins a growing number of cognitive-functional studies that examine natural language in order to understand epistemic stance (Schneider 2005, Verhagen 2006 inter alia). We consider this in terms of social cognition – the subject’s competence in contextualised interactive communication (Zlatev 2008). The relation of social cognition to meaning and phenomenology is explained in Krawczak (2007, 2010a, 2010b).

¹ We would like to thank an anonymous reviewer for helpful comments and criticisms. Our shortcomings remain our own.
Preprint draft published in Cognitive Processes in Language, K. Kosecki & J. Badio (eds.), 87-99. Frankfurt/Main: Peter Lang (2011).

Speech is not just a combination of lexis, syntax, morphology and prosody. Speech draws on all kinds of linguistic form to negotiate a complex social milieu – subtly judging the interaction of an immense range of factors, from gender, age, and situational context to world knowledge, as well as how the speaker negotiates all of this.

Psycholinguistics elicits language production and examines this production. Such language is always artificially simple and without context. Corpus linguistics examines traces of language use and is, therefore, more indirect than experimental methods, but it can look at actual language in natural context.

This study examines a corpus of natural interactive language to see how speakers judge, engage in, and profile epistemic stance in interaction. Careful manual analysis of a large number of speech events reveals patterns in how speakers engage with each other through language. This gives us indirect insights into how speakers profile their subjective view of the world. Multivariate statistics can then help identify the patterns of usage from a multidimensional perspective. Moreover, multivariate modelling captures this complexity in a cognitively plausible way. Although the statistical modelling itself is not argued to represent how an individual processes language, it does permit us to weigh up the relative importance of the factors, and of their interaction, that speakers handle in communication. Arguably, this is similar to what speakers do pre-consciously when they coordinate the various dimensions of intersubjective communication.

The use of corpora to describe conceptual structure is basic to usage-based research in Cognitive Semantics (Gries and Stefanowitsch 2006, Stefanowitsch and Gries 2006, Glynn and Fischer 2010, Glynn and Robinson in press).
The Corpus to Conception principle, which states that patterns of language use represent patterns of conceptual structure, is specifically explored in Glynn (2009, 2010, in press).

2. Case study. Epistemic stance verbs in American and British English
The study of epistemic stance verbs, and light verbs more generally, has a long tradition in linguistics. The parenthetical uses caused much concern for generative theories of language. Similarly, the highly pragmatic meaning that interacts with the grammatical semantics of light verbs has been a focus in functional research. This study draws heavily upon the functional research in that we adopt many of the factors that were identified as relevant to understanding the use of these verbs. These factors form the basis of the manual analysis, that is, the annotation of the data.
2.1 Data and analysis
The three mental predicates, think, guess (US) and suppose (UK), are extracted in relative proportions – equal numbers of American and British think, and equal numbers of guess and suppose for American and British respectively. Although guess and suppose exist in both dialects, the epistemic use of these verbs is generally dialect specific. In total, some 250 examples are taken from an unparsed corpus developed at the University of Leuven by D. Speelman. This corpus is compiled entirely from on-line personal diaries, or ‘blogs’, and is stratified for American and British usage. A further 65 examples are taken from the spoken component of the British National Corpus (BNC) and some 81 examples from the equivalent component in the Corpus of Contemporary American English (COCA). The language in the data set is informal, the language of the blogs especially so. It is important to note that the genre of blogs differs markedly from traditional diaries, precisely because blogs are interpersonal. Unlike traditional diaries, blogs are read by many people, potentially millions, all of whom may respond to entries. The authors are aware of this in the writing, much of which is quite literally dialogic. The entries that are not in response to a specific contribution by a reader remain quasi-dialogic, since they are written in expectation of a reply.

The study examines a very wide range of variables. Due to limitations of space, we only consider a small number of these. The formal factors analysed include Syntactic Construction and Interrogativeness; for the main clause, Tense, Mood, Aspect, Transitivity, Negation, and Adverbial modification; and for the dependent clause, Tense, Mood, Aspect, Negation, Person, and Number. The semantic factors are annotated at two levels. At the utterance level, these are Humour, Parenthesis, Engagement, and Axiology. At the clause level, the subject and object types for both the main and dependent clauses are analysed, as well as the kind of adverb.
Two extralinguistic variables were included in the analysis of the data: the dialect (British versus American) and the topic of discourse. Although both are important, due to practical limitations, the discussion is restricted to the variable of dialect. Three of the variables that prove important in the following study are the parenthetical versus ‘full lexical verb’ use, the axiological value of the utterance, and the degree of engagement on the part of the speaker. All three factors are highly subjective categories and so their operationalisation and application to the data deserve careful explanation.

Parenthetical
The status of a given example as parenthetical was determined by two criteria: the mental predicate could be elided without changing the meaning of the utterance, and it could be moved between front, middle and final position.

(1)
a. It seems so silly. It was a side activity. I was, I guess, gratifying a -- a secondary need. And it, obviously -- I did it and I'm not proud of it. (Parenthetical)
b. I just thought I would update with the song that has been stuck in my head. (Non Parenthetical)
Axiology
The axiological value of the utterance is particularly difficult to determine. This is not only because a great amount of context and subjective interpretation is needed, but also because the scale is inherently continuous, and applying discrete categories to a continuum always involves a degree of arbitrariness. As a rule of thumb, the example was considered neutral unless there was a clear and good reason to suppose otherwise. Negative modifiers, emoticons, and adverbs were used as indicators of a non-neutral value.

(2)
a. I do think she's a wonderful performer. (Positive)
b. Also went to Tony and Clares but Tony was away :-( Clare was a bit strange all night and I don't think I should talk about that here, suffice to say it was the most uncomfortable night of mine. (Negative)
c. i bet you didnt know that paul played bass. yuuuuuh. i always thought he played guitar. (Neutral)
Engagement
Engagement is the degree of the speaker’s epistemic commitment. The same reasoning was used for this factor as above. Engagement was considered neutral unless there was a clear indication otherwise, evidenced by word choice such as hedging or emphatic modifiers.

(3)
a. I'm not sure how it ended. I don't suppose it matters really. (Weak)
b. I do suppose it might be outside your realm of intelligence to think of something so dirty. (Strong)
c. So yeah, that pretty much wraps everything up, I suppose. I'm sure I'm missing out on a few cool things, but considering how long as this post is already, I suppose that isn't such a bad thing. (Neutral)
2.2 Results
The manual analysis of the examples results in a large table which includes all the usage-features of all the examples. In order to identify usage-patterns, we employ the multivariate statistical technique Correspondence Analysis. This technique is exploratory and so offers no information on the explanatory power or statistical significance of the results, but it does help identify patterns relative to a range of usage-factors. The technique converts the frequencies of co-occurring features into relative distances, which are then plotted on a two-dimensional plane.
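To make the conversion from co-occurrence frequencies to plotted distances concrete, the following is a minimal sketch of simple correspondence analysis on a small cross-tabulation. It is not the procedure used for the figures below, which are based on multiple correspondence analysis of a Burt matrix; the counts, the feature labels and the function name correspondence_analysis are hypothetical and serve only as illustration.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way frequency table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardised residuals from the independence model (outer product of masses)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sigma) / np.sqrt(r)[:, None]      # principal row coordinates
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]   # principal column coordinates
    inertia = sigma ** 2                  # inertia carried by each dimension
    return rows[:, :2], cols[:, :2], inertia

# Hypothetical counts (NOT the study's data): verbs x usage-features
verbs = ["think", "guess", "suppose"]
features = ["parenthetical", "non-parenthetical",
            "weak engagement", "strong engagement"]
counts = [[10, 90, 5, 30],
          [45, 55, 25, 10],
          [40, 60, 20, 25]]

row_xy, col_xy, inertia = correspondence_analysis(counts)
for name, (x, y) in zip(verbs + features, np.vstack([row_xy, col_xy])):
    print(f"{name:18s} {x: .3f} {y: .3f}")
```

Points that lie close together on the resulting plane have similar profiles. The total inertia equals the chi-square statistic of the table divided by the number of observations, which is why the plot can be read as a map of departures from independence.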
2.2.1 Epistemic cognition verbs, Engagement, Axiology, and Parenthesis
To begin with, let us consider the analysis of the three verbs in question and how they are used relative to each other and to the factors of Engagement, Axiology, and Parenthesis. The plot in Figure 1, below, shows the guess and suppose data points positioned opposite the think data point. This suggests that the American guess and the British suppose are similar in use, relative to think. This supports the view that these two items can be seen as equivalents in the two dialects. However, the plot reveals much more than this. Firstly, we see that two usage-features, ‘weak engagement’ and ‘parenthetical use’, are distinctly associated with guess and suppose and distinctly dissociated from think. Secondly, a more subtle, but equally important, usage-pattern is revealed. Between the guess and suppose data points, we see a range of usage-features. These features are not distinctive when comparing guess-suppose with think, but they do offer interesting insights into possible distinctions between the American guess and the British suppose. The position of the ‘non-parenthetical’ data point is equidistant from all three verbs. This suggests it is entirely neutral with regard to the use of the different lexemes. However, to the left of ‘non-parenthetical’, we see that, relative to the data point of suppose, the usage-features ‘neutral axiology’ and ‘neutral engagement’ are associated with the American guess. This contrasts with the cluster to the right of ‘non-parenthetical’. Here ‘negative axiology’, ‘positive axiology’, and ‘strong engagement’ are placed clearly on the suppose side of the plot, distinct from the guess data point.
Figure 1. Epistemic THINK. Engagement, Axiology, Parenthesis Multiple correspondence analysis (Burt Matrix)
By way of summary, this analysis has shown that guess-suppose are distinct from think because they are relatively more associated with parenthetical and weakly engaged usage. However, the use of guess, relative to the use of suppose, is axiologically neutral and associated with neutral engagement. In contrast, the British suppose is associated with a usage that involves strong engagement and positive or negative axiology on the part of the speaker. This analysis has revealed a distinction in the use of the British suppose and American guess. Let us zoom in on these differences.
2.2.2 American vs. British cognition verbs: guess vs. suppose
By removing the think data, we simplify the analysis, bringing out the distinctions between the verbs suppose and guess. The results, presented in Figure 2, confirm what we saw above. ‘Non-parenthetical’ uses are not at all distinctive for the use of the two verbs. The data point for ‘parenthetical’ is again slightly more associated with the suppose usage, but not distinctly so.
Figure 2. Guess vs. suppose. Engagement, Axiology, Parenthesis, Adverb Multiple correspondence analysis (Burt Matrix)
Remembering that the position of each data point is relative to the other data points, we can see that the factor of Axiology is again crucial in distinguishing the use. ‘Negative’ and ‘positive axiology’ are distinctly associated with suppose, whereas ‘neutral axiology’ is associated with guess. Another factor has been added to this analysis – the use or non-use of adverbs. Although ‘lack of adverbs’ is clearly positioned as equally typical of both verbs, the ‘presence of adverbs’ seems distinctly associated with guess. Its position on the extreme right of the plot may also have to do with the fact that it is not used in parenthetical instances, positioned to the extreme left.

2.2.3 British epistemic stance: suppose vs. think
We now turn to a more sophisticated statistical technique, Binary Logistic Regression. Unlike the exploratory technique of Correspondence Analysis, this method produces both scores of statistical significance and explanatory power. These scores permit us to ascertain not only the chances that our results would be reproduced in further studies but also how much of the data variation our analysis explains. The power of explanation is determined by establishing how well the analysis can ‘predict’ the example. To understand how it works, one may think of the logistic regression as an algorithm – we give it the results of the analysis of the examples but hide the actual example from the model. We then ask whether a given example is one kind or the other (e.g., suppose or think). When the model can accurately determine the response outcome, the specific features that led to the choice are indicated. This permits us to ascertain which features are associated with which outcome, here suppose or think. If, based on the analysis, the model accurately predicts the example, then we can say our analysis is accurate and explanatorily powerful.

Below are presented the results of the logistic regression analysis for the British think – suppose data only. The column on the left lists the factors and features that were put into the model. The next column, to its right, lists the estimates, which indicate the relative degree of importance each feature has in predicting the outcome – here, whether the example is a ‘suppose’ example or a ‘think’ example. On the far right, the asterisks and numbers refer to the p-values, or scores of statistical significance. The final row of scores at the bottom indicates the overall explanatory power of the model. The p-values to the right show that all the variables are statistically significant in determining the outcome as suppose or think. Only one feature (‘weak engagement’), from the variable of Engagement, is not significant. The estimates in the second column show that the overall findings of the correspondence analysis were accurate. The negative scores here mean that the feature is a predictor of suppose and the positive of think.
The further the figure is from zero, the more important a predictor it is. From this, we can conclude that although Axiology and Engagement were statistically significant, they were not strong predictors. In other words, their association in use is not chance, but it is not particularly important. However, the Adverb and Parenthetical variables are both statistically significant and predictively important: given the negative estimates, the lack of an adverb and a parenthetical use point to a suppose use.
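As a worked illustration of how the estimates can be read, the sketch below converts the reported British coefficients from the log-odds scale into predicted probabilities of think. It rests on assumptions that the text does not state outright: that think is the outcome coded as 1 (consistent with positive estimates predicting think), and that the reference levels are ‘adverb present’, ‘non-parenthetical’, ‘negative axiology’ and ‘neutral engagement’. The function name prob_think is hypothetical, and ‘weak engagement’ is left out because its estimate is not significant.

```python
import math

# Estimates copied from the reported British model (log-odds scale).
# Assumption: the outcome 'think' is coded 1 and 'suppose' 0.
COEF = {
    "(Intercept)": 3.1602,
    "Adverb_NonAdverb": -2.8670,
    "Parenthetical": -2.3962,
    "Axiology_Neutral": -0.5595,
    "Axiology_Positive": 0.6566,
    "Engagement_Strong": -0.7984,
}

def prob_think(*active):
    """Predicted probability of 'think' when the named features are present."""
    logit = COEF["(Intercept)"] + sum(COEF[name] for name in active)
    return 1.0 / (1.0 + math.exp(-logit))

# Assumed reference case: adverb present, non-parenthetical,
# negative axiology, neutral engagement.
print(round(prob_think(), 2))
# No adverb and parenthetical use, other features at reference level.
print(round(prob_think("Adverb_NonAdverb", "Parenthetical"), 2))
```

Under these assumptions, the predicted probability of think falls from about 0.96 in the reference case to about 0.11 for a parenthetical use without an adverb, which is the sense in which these two features are strong predictors of suppose.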
Logistic Regression: SUPPOSE (120 examples) - THINK (135 examples)
Predictor Variables – Adverbial, Axiology, Engagement

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4763  -0.9717   0.2882   1.0557   2.3367

Coefficients:
                    Estimate  Std.Error  z value  Pr(>|z|)
(Intercept)           3.1602     0.3424    9.229   < 2e-16  ***
Adverb_NonAdverb     -2.8670     0.2942   -9.744   < 2e-16  ***
Parenthetical        -2.3962     0.3496   -6.853  7.21e-12  ***
Axiology_Neutral     -0.5595     0.2360   -2.371   0.01776  *
Axiology_Positive     0.6566     0.2547    2.578   0.00994  **
Engagement_Strong    -0.7984     0.2242   -3.560   0.00037  ***
Engagement_Weak     -15.8116   468.13     -0.034   0.97306

Null deviance: 352.62 on 254 degrees of freedom
Residual deviance: 250.14 on 248 degrees of freedom
AIC: 264.14

P      C      Dxy    R2     Brier
0      0.814  0.628  0.442  0.167
Turning to the overall explanatory power of the model, the results are strong. At the bottom, both the R2 (0.442) and C-score (0.814) show a model that distinguishes think from suppose with a high degree of accuracy. A strong result for the R2 is anything above 0.3 and for the C-score, above 0.8. From this we can say not only that we have identified the usage-features that are statistically significant in distinguishing the choice of the two lexemes, but also that these features account for a substantial share of the variation in use.

At this point, the discrepancies between the results of the Logistic Regression and the Multiple Correspondence Analysis should be explained. The fact that not all of the features identified by the Correspondence Analysis were found to be statistically significant by the Regression Analysis does not mean they are invalid findings. Firstly, the logistic model is based on a combination of the most significant features, which does not exclude the possibility that others were also significant. Secondly, with such a small data set, statistical significance is not easily achieved. Thus, differences between the results of the Correspondence Analysis and the Logistic Regression do not necessarily indicate that the findings in the former are merely chance.
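The summary scores for the British model can be related to the reported deviances with a short calculation. The sketch below assumes that the R2 is the Nagelkerke pseudo-R2 (as stated for the American model in the next section), that the sample size is the 255 examples listed in the table, and that the model has seven estimated parameters. The function names nagelkerke_r2 and aic are hypothetical.

```python
import math

def nagelkerke_r2(null_dev, resid_dev, n):
    """Nagelkerke pseudo-R2 from the null and residual deviance of a binary model."""
    cox_snell = 1.0 - math.exp(-(null_dev - resid_dev) / n)
    max_cox_snell = 1.0 - math.exp(-null_dev / n)   # upper bound of Cox-Snell R2
    return cox_snell / max_cox_snell

def aic(resid_dev, n_params):
    """Akaike Information Criterion for an ungrouped binary model."""
    return resid_dev + 2 * n_params

n = 120 + 135                                # suppose + think examples
r2 = nagelkerke_r2(352.62, 250.14, n)
model_aic = aic(250.14, 7)                   # intercept + six feature estimates
print(round(r2, 3), round(model_aic, 2))
```

Under these assumptions, the calculation reproduces the reported R2 of 0.442 and AIC of 264.14, so the three figures are mutually consistent.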
2.2.4 American epistemic stance: guess vs. think
The same factors prove important in distinguishing the American guess from think. However, despite the fact that we still obtain statistical significance, the predictive power of the model for the American lexeme guess is considerably less impressive.

Logistic Regression: GUESS (138 examples) - THINK (135 examples)
Predictor Variables – Adverbial, Axiology, Engagement

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8332  -0.9074  -0.4910   1.0293   2.0858

Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)         1.4742      0.3341    4.413  1.02e-05  ***
NonAdverb          -1.8616      0.3334   -5.584  2.34e-08  ***
Parenthetical      -1.3802      0.5653   -2.442    0.0146  *
Axiology_Neutral   -0.2872      0.3039   -0.945    0.3446
Axiology_Pos        0.7463      0.4013    1.860    0.0629  .

Null deviance: 378.43 on 272 degrees of freedom
Residual deviance: 314.71 on 268 degrees of freedom
AIC: 324.71

P      C      Dxy    R2     Brier
0      0.758  0.517  0.312  0.193
In the final row, we see the R2 and C-score. Typically, R2 scores of 0.3 or higher indicate a strong model. However, in logistic regression the R2 is not calculated directly; what is reported is a ‘pseudo’ R2, obtained with the Nagelkerke algorithm. In case of doubt, the C-score is considered more reliable. Therefore, although the R2 is respectable at 0.312, the C-score is only 0.758. This can be interpreted as the model correctly telling a guess example from a think example roughly 75% of the time, or some 25 percentage points better than chance. This means that either there are factors that have not been taken into account, the sample is too small, or there is not such an important difference in usage. Nevertheless, the results show clear tendencies and these tendencies are statistically significant. What can be interpreted from these results, and from the difference between the British and American data, is that the use of the American lexeme guess lies somewhere between the British suppose and the shared lexeme think. The two dialects distinguish the use of the two lexemes in a similar way, but British more so. However, further research is needed to verify this claim.
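The interpretation of the C-score can be illustrated with a small sketch. The C-score is the proportion of pairs, one example from each outcome, in which the model assigns the higher predicted probability to the example that actually has the outcome in question; 0.5 corresponds to chance. The Dxy score reported alongside it is a rescaling, Dxy = 2(C - 0.5). The predicted probabilities below are hypothetical and the function names c_index and somers_dxy are our own.

```python
def c_index(probs, outcomes):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def somers_dxy(c):
    """Somers' Dxy as a rescaling of the C-score."""
    return 2.0 * (c - 0.5)

# Hypothetical predicted probabilities of 'think' (NOT the study's data);
# outcome 1 = think, 0 = guess.
probs = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
outcomes = [1, 1, 1, 0, 0, 0]

c = c_index(probs, outcomes)
print(round(c, 3), round(somers_dxy(c), 3))
print(round(somers_dxy(0.814), 3), round(somers_dxy(0.758), 3))
```

Applied to the reported C-scores, the rescaling gives 0.628 for the British model and approximately 0.516 for the American model, in line with the Dxy values in the two tables once rounding of C is taken into account.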
3. Conclusion
We have established that the epistemic use of guess and suppose is similar across the two dialects and distinct from think. However, there exists a usage continuum from suppose through guess to think. The lexemes suppose / guess are more commonly used parenthetically and without adverbial modification. This suggests they are grammaticalising more quickly than think. These results demonstrate that corpora of contextualised language use not only reflect language structure, but can also reveal intersubjective structure, or social cognition, by showing the conceptualising and construing tendencies in a given community. Natural language gives us insights into communicative events, from which collective patterns emerge. These emergent patterns of contextualised verbal behaviour are constitutive of social cognition, engendered intersubjectively and observable in speech. The statistical modelling of this usage gives us a verifiable picture of how individuals, in communities, behave and cognise. One of the most significant findings of our study is that both American and British speakers tend to use the non-think verb as a parenthetical, which has important implications for the semantics of the predicates. While suppose and guess are more prone to grammaticalisation, think remains semantically richer. This is further corroborated by how think combines with adverbs and how it is less likely to be used in medial or final (parenthetical) positions. Such lexico-syntactic conventions are not arbitrary, but the result of semantic, and perhaps even iconic, structure within the language. In other words, if we assume the theoretical tenets of Cognitive Linguistics, formal variation helps identify distinctive patterns, but it is understood merely as a surface phenomenon. All formal variation is a direct result of social, functional or conceptual variation.
Since the epistemic modality in question here is necessarily intersubjective, we can suppose that the variation in speaker’s choice is indicative of structuring in social cognition. Future research, presented in Glynn and Krawczak (in press), moves towards explaining how, in functional-conceptual terms, adverbial modification and syntax profile epistemic stance.
References
Evans, V. and Pourcel, S. (eds). 2009. New directions in Cognitive Linguistics. Amsterdam: Benjamins.
Fabiszak, M. (ed.). 2007. Language and Meaning. Cognitive and functional perspectives. Frankfurt: Lang.
Glynn, D. 2009. “Polysemy, syntax, and variation. A usage-based method for Cognitive Semantics.” In: Evans, V. and Pourcel, S. (eds), 2009. 77-106.
Glynn, D. 2010. “Synonymy, lexical fields, and grammatical constructions. Developing usage-based methodology for Cognitive Semantics.” In: Schmid, H. J. and Handl, S. (eds), 2010. 89-118.
Glynn, D. In press. “Sociolinguistic Cognitive Semantics. A quantitative study of dialect effects on polysemy.” Review of Cognitive Linguistics.
Glynn, D. and Krawczak, K. In press. “Social cognition, Cognitive Grammar and corpora. A multifactorial approach to subjectification.” Language and Cognition.
Glynn, D. and Fischer, K. (eds). 2010. Quantitative Cognitive Semantics. Corpus-driven approaches. Berlin: Mouton.
Glynn, D. and Robinson, J. (eds). In press. Polysemy and Synonymy. Corpus methods and applications in Cognitive Linguistics. Amsterdam: Benjamins.
Gries, St. Th. and Stefanowitsch, A. (eds). 2006. Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis. Berlin: Mouton.
Krawczak, K. 2007. “Meaning as an epiphenomenon of cognition, social interaction and intercognition.” In: Fabiszak, M. (ed.), 2007. 187-198.
Krawczak, K. 2010a. Epiphenomenal Semantics: Cognition, context and convention. Warsaw: SWSPiZ.
Krawczak, K. 2010b. “(Inter)subjectivity and objectivity. Meaning at a crossroads.” In: Stalmaszczyk, P. (ed.), 2010. 169-178.
Lakoff, G. 1987. Women, Fire, and Dangerous Things. What categories reveal about the mind. London: UCP.
Langacker, R. 1987. Foundations of Cognitive Grammar. Theoretical prerequisites. Stanford: SUP.
Schmid, H. J. and Handl, S. (eds). 2010. Cognitive Foundations of Linguistic Usage Patterns. Berlin: Mouton.
Schneider, S. 2005. Reduced Parenthetical Clauses. A corpus study of spoken French, Italian and Spanish. Amsterdam: Benjamins.
Stalmaszczyk, P. (ed.). 2010. Philosophy of Language and Linguistics. The philosophical turn. Frankfurt: Ontos.
Stefanowitsch, A. and Gries, St. Th. (eds). 2006. Corpus-based approaches to metaphor and metonymy. Berlin: Mouton.
Van Bogaert, J. 2009. The Grammar of Complement-taking Mental Predicate Constructions in Present-day Spoken British English. PhD, Ghent University.
Verhagen, A. 2006. Constructions of Intersubjectivity. Oxford: OUP.
Willems, D. & Blanche-Benveniste, C. In press. Constructions.
Zlatev, J., Racine, T. P., Sinha, C. and Itkonen, E. (eds). 2008. The Shared Mind: Perspectives on intersubjectivity. Amsterdam: Benjamins.