Communication Monographs Vol. 72, No. 2, June 2005, pp. 117–143
The Detectability of Socio-Egocentric Group Speech: A Quasi-Turing Test

Steven R. Corman & Timothy Kuhn
Scholars interested in group phenomena generally conceive of communication as either a conduit for, or as constitutive of, group decisions. Hewes's socio-egocentric model contends that we possess no unambiguous proof of any communicative impact on decision making. This study asks whether contrived socio-egocentric group speech is distinguishable from real group speech, and therefore whether the socio-egocentric model is even a plausible depiction of interaction. We developed a simulation that produces socio-egocentric speech and report on its use in a study that asked naïve raters to discern human-generated from simulated socio-egocentric group discussion. Results indicated that participants correctly judged the source of transcripts at rates worse than chance. Furthermore, heuristics employed by the participants can explain their poor performance, because criteria that produced accurate judgments of human transcripts produced inaccurate ones for the simulated transcripts and vice versa. We conclude that the model must be taken seriously as a depiction of plausible group interaction, and that in future studies a distinction between local and global features of conversation is important for studying socio-egocentric interaction.

Keywords: Group Communication; Decision Making; Socio-Egocentric Model; Discussion Coherence; Simulation
Steven R. Corman is a professor in the Hugh Downs School of Human Communication at Arizona State University; Timothy Kuhn is an assistant professor in the Department of Communication at the University of Colorado at Boulder. The authors thank Mark Stoda and Jarrod Gold for assistance with data collection, Robert Koyak for valuable advice on analysis strategy, and James Davis for sharing data. They also thank Larry Frey, Michele Jackson, Joann Keyton, Mike Monsour, and Renee Myers for thoughtful comments on earlier drafts of this article. Correspondence to: Steven R. Corman, Hugh Downs School of Human Communication, Arizona State University, Box 871205, Tempe, Arizona, 85287-1205, USA. Email: [email protected]

ISSN 0363-7751 (print)/ISSN 1479-5787 (online) © 2005 National Communication Association DOI: 10.1080/03637750500111849

Both a long line of scholarship and common sense seem to suggest that the communication among members of decision-making groups exerts a significant influence on a group's decisions and its members' choice preferences. Of course
communication influences decision making, the assumption goes. The long hours we all spend in meetings, as well as the large quantity of ink devoted to the subject in scholarly outlets, would seem to provide clear evidence of this. A good deal of research involving group decision making, however, portrays communication as merely conveying members' pre-discussion decision preferences, affecting the group decision only by providing greater or fewer "process losses." Communication can be thought of as the "extended form" (Nash, 1953) of a game or process that inevitably culminates in a "normal form" outcome governed by some simple expression of combinatorial rules. Communication, in this view, mediates decision making by serving as a channel that brings preferences and external factors together; it carries no unique contribution to group decision making in itself.

The Conduit Metaphor of Communication and Models of Group Decision Making

Two particularly influential models have developed from this mediation perspective. The first, by Lorge and Solomon (1955), explained group decisions with objectively correct choices. This model was based on the probability that a group includes a member competent enough to arrive at the (single) correct choice; larger groups are, therefore, more likely to include at least one competent member. A second was Davis's (1969, 1973, 1982) social decision schemes (SDS), a probabilistic model that predicted group decisions on the basis of members' pre-discussion preferences and the decision rules (e.g., unanimity or majority rule) that synthesize decisions from those preferences. One successful application of SDS-inspired models was in the analysis of jury decision making (Penrod & Hastie, 1979; Stasser & Davis, 1981), in which deliberation was portrayed as exerting little influence over group decisions. Rejecting the less parsimonious explanations involving (presumably complex) decision processes, Davis argued that we could explain most verdicts as the simple application of an assigned decision-making rule to members' pre-discussion beliefs.

Both models, therefore, reduce communication to a conduit for messages; communication is portrayed as a vehicle for simply conveying thoughts and expressing preferences (Axley, 1984; Reddy, 1979). Although these perspectives do not deny that members' preferences can change during interaction, persuasion is considered a cognitive or mental process based on evaluations of the group's leanings; communication is merely the device that makes those leanings manifest. As Pavitt (1993) points out, this position suggests that either communication has no influence on group outcomes or it "implies that changes in group preferences will be in the direction of the 'best' prediscussional preference in the group" (p. 230). In either case, group interaction does not affect the content of the decision.

Influence at the Margins

In response to these models, Poole, McPhee, and Seibold (1982) proposed a model of group decision making based on communication's constitutive influence
rather than the conduit notion seen in decision schemes. They argued that the SDS's probabilistic structure obscures and controls out the effects of interaction, which makes it "impossible to find a case where some decision scheme model does not hold; the theory is nonfalsifiable, at least in comparison to the null hypothesis of no decision scheme" (p. 5). In contrast, they proposed a valence distribution model (VDM; Hoffman, 1979; McPhee, Poole, & Seibold, 1981). From this perspective, group decisions are a product of the distribution of members' statements evaluating various problem-solving-related proposals advanced in discussion. The VDM holds that during a meeting, members' statements accrue positive and negative message valences; options are chosen based on the distribution of individual valences toward the proposals at the conclusion of the discussion. The VDM differs from SDS in that it focuses on the content of members' remarks made during decision-making discussion rather than a rule or distribution of pre-discussion preferences. Poole et al. (1982) compared the two models and found that while both agreed on the lion's share of cases, the VDM was superior at the margins, where the SDS predictions failed. They concluded that valenced communication "has a direct effect on group decisions and mediates the larger proportion of the effect of decision schemes on group decisions" (p. 16). Tests of this model do not, however, provide conclusive proof of the impact of communication on group decision making. Indeed, it can be interpreted in line with several competing explanations of persuasion in group discussion (see Pavitt, 1993; Seyfarth, 2000).

Socio-Egocentric Group Speech

Hewes (1986, 1996) was not persuaded by the evidence Poole et al. (1982) offered. Indeed, he claimed that there exists no unambiguous evidence to date showing substantive effects of group communication on group outcomes. Accordingly, he said, it is unwise to simply assume that it has such effects. Specifically, Hewes observed that the large sample of remarks and the precise measurements used by Poole et al. produced statistical significance by reducing standard errors, whereas Davis and colleagues used overly casual measures of people's predispositions. This led Poole et al. to conclude that the VDM was superior, but Hewes suggested that its superiority was an artifact of measurement rather than enhanced explanatory capability. He also noted that the VDM could be construed as portraying communication as a proximal cause of decision outcomes, but that it failed to address (perhaps underlying) distal mental and societal forces. To illustrate his point about the lack of a convincing conception of communicative influence, Hewes developed what he called the socio-egocentric model (SEM) of group communication, and he challenged researchers to reject it on empirical grounds. This "baseline" model of group interaction assumes that communication does not influence group decision making.1 The essence of Hewes's model—which addresses interaction processes but does not predict outcomes, as in
SDS and VDM—concerns the cognitive processing demands placed on individuals. He contends that when interacting with others, people would ideally attend to both their own and others' perspectives, but because this is a cognitively demanding task, people tend to concentrate only on their own side of the discussion. The task for members of decision-making groups is even more complex and demanding: They must not only consider multiple others but also address a pressing problem alongside the social task of conversation management. To cope with these demands, members engage in intrapersonal decision making, working through the decision task in their own minds while creating the appearance of interacting with other members of the group.

Group members meet these demands in two ways. First, they use turn-taking devices (Sacks, Schegloff, & Jefferson, 1974) to regulate transitions between members' talking turns. This produces a sequential ordering of members reporting their opinions about the problem at hand, with expressions of opinion following one another as they would if statements affected one another. Second, they use vacuous acknowledgements, connecting devices that aid in the production of what appears to be a coherent extension of a preceding comment. Hewes (1996) provides the following example of a vacuous acknowledgement:

A: I really believe that educating students about the dangers of bike accidents would help to reduce them.

B: Right! I've been thinking that maybe we ought to improve bike routes as a step toward solving the problem. (p. 197)
Hewes explained that statements like B's show a lack of influence between turns of talk because they do nothing to support or reject the educational proposal advanced by A. Specifically, vacuous acknowledgements, such as the one beginning B's statement ("Right!"), do not "demonstrate the impact of the preceding communication. They help create the illusion of meaningful connectedness, of influence and progress in decision making, but without necessarily reflecting reality" (pp. 197–198). In summary, the SEM posits that group decision making consists of individuals managing cognitive evaluation and occasionally talking aloud about the decision, while giving the appearance of engagement through turn taking and vacuous acknowledgements. Communication is merely a conduit in this scenario as well: Interpersonal influence and coherent group discussion are impossible, given the cognitive burdens members bear and the heuristics they employ in managing them.

As mentioned above, the SEM was proposed as a baseline against which we can assess the relevance of communication processes in group decision making. The three evidentiary standards needed to refute this model are, however, exceptionally stringent. First, it must be shown that the odds of members' statements during discussion depend on preceding statements (to refute the possibility of a simple turn-taking device dictating interaction). Second, it must be demonstrated that the content of statements is substantively related to preceding statements (to refute the possibility that vacuous acknowledgements shape discussion). Third, the content of discussion must be shown
to be related to outcomes in a manner independent of input and other non-interactive factors (Pavitt, 1993). Although some research has addressed elements of Hewes's baseline model (e.g., Pavitt & Johnson, 1999; Seyfarth, 1999), no studies to date have attempted to meet all three evidentiary standards. Given the extreme difficulty involved in producing data that could address the three criteria (including, perhaps, problems of inaccessibility to cognitive data), it is important to know whether the SEM provides a potentially realistic description of group communication or whether it is better suited to critiquing claims of intra-group influence.

This study examines the utility of the SEM by turning the problem around somewhat: Instead of observing interacting groups and trying to determine whether members' speech is socio-egocentric, we can simulate socio-egocentric speech and determine whether it can be reliably differentiated from natural group interaction. If simulated socio-egocentric speech is readily distinguished from that of natural groups, there is reason to question whether the SEM is a plausible model of real group interaction, and its utility as a heuristic device and observational aid is reduced. But if socio-egocentric speech is undetectable, we must take more seriously the idea that groups may really communicate this way and, in turn, must revise common conceptions of the role of communication in group decision making by reassessing the importance of local coherence in group discussion. Because Hewes's argument is premised on socio-egocentric speech appearing coherent and uncontrived, asking individuals to assess samples of simulated socio-egocentric and natural group speech is a logical analytical approach. The Turing Test provides our framework for this task.

A Quasi-Turing Test

Turing (1950) proposed a now-famous test to determine whether machines could be considered intelligent. The evolution of the Turing Test idea is complex (see Karelis, 1986; Shanon, 1989), but the "received view" is a test situation wherein a rater interacts via keyboard with a machine and a human about similar subject matter and then makes a judgment about which interaction partner is the human and which is the machine. A machine would pass the Turing Test if raters could not distinguish it from a human at better than chance rates. This (arguably) would be sufficient evidence to call the machine intelligent.

Our concern here is not with machine intelligence, but with testing for socio-egocentric group speech. Yet we find the Turing Test metaphor2 appealing for a number of reasons. We can configure a machine to emulate socio-egocentric group communication in a way analogous to the Turing Test's use of a machine to emulate individual intelligence. Since the Turing Test relies on the practical question of whether raters can tell human communication from interaction generated by a computer, we can compare transcripts of group speech operationalizing the socio-egocentric baseline model to transcripts produced by groups of interacting humans and determine whether raters can discriminate between the two. Because the SEM is
intended as a sort of "null hypothesis," we cast our study's main hypothesis as follows.

H1: Raters are significantly more likely than chance to judge correctly a given transcript of group speech as simulated socio-egocentric or as human generated.
Because we presume that raters will not be equally good at the judgment task, we add two research questions focusing on possible differences in judgment accuracy. First, our quasi-Turing Test involves comparison of two kinds of speech. Knowing or thinking that a transcript was generated by a human group might enable participants to scrutinize the transcripts and reveal otherwise imperceptible flaws in those generated by the computer. It is important to determine, therefore, whether participants can recognize the different transcripts at face value, or whether having a point of comparison is necessary. Thus we ask:

RQ1: Does comparing two different kinds of transcripts influence a rater's ability to correctly judge a given example of group speech as simulated socio-egocentric or as human generated?
Second, because raters' comprehension and analysis strategies are likely to differ, it is reasonable to expect that the ways they make these judgments could affect accuracy. Thus:

RQ2: Do the judgment criteria used by raters influence their ability to judge correctly a given example of group speech as simulated socio-egocentric or as human generated?
In the remainder of the paper we present and discuss a study focused on these issues. To begin, we developed a simulation that produces transcripts of known socio-egocentric group "speech," tested it for face validity, and made some necessary modifications. Then, in a pre-test, we empirically compared the coherence and believability of the synthetic transcripts with those taken from actual interacting groups. For the main study, we constructed a judgment task involving the human-generated and the simulated socio-egocentric group speech and analyzed performance on that task by a large sample.

Method

Preliminary Tasks

Our capacity to test the hypothesis and answer our research questions turned on three preliminary tasks. Since a comparison of human and computer transcripts was key to our quasi-Turing Test, the first step was to obtain transcripts of human group speech, along with computer-generated speech that operationalized the socio-egocentric baseline model. A second step was to assess the suitability of the transcripts for our investigation. Specifically, we had to assess whether the human-generated speech was
itself socio-egocentric (if so, there would be no meaningful comparison between the types of transcripts) and whether both sources of transcripts were believable and coherent. Finally, we had to assess whether there were meaningful differences between the human- and computer-generated transcripts.
Step 1: Obtaining transcripts

The first task required transcripts of both human-generated and simulated socio-egocentric group speech. For the human speech, we employed a set of transcribed audiotapes of four-person interacting groups used by Davis and colleagues (see Davis, Spitzer, Nagao, & Stasser, 1978) in studies of jury decision making. Besides the fact that the Davis et al. data are well known and have been used for similar tests of the effects of communication in groups, mock juries provided a good context for generating socio-egocentric speech because there was a clear set of (two) choices for which jury members can argue on the basis of specific evidence presented in the mock trial. These transcripts assisted us in creating the simulation we describe next.

Creating known socio-egocentric speech presents a challenge because the SEM makes assumptions about the psychological motivations and intentions behind group members' utterances. While it may have been possible to train a control group of people to generate speech conforming to the SEM, there would have been no way to rule out the specific content of their arguments as a source of experimental variation when making our comparison to the speech of our human groups. We had to ensure that the content of the group arguments remained consistent while the form of the interaction varied between real and simulated groups. Therefore, we turned to models of cellular automata (CA). CA are simulations able to depict self-reproducing complex systems by prescribing the actions of a matrix of discrete agents (i.e., cells). These agents represent autonomous entities, such as individual group members, that can take on a finite set of states in some domain. Each agent's state at any particular time is determined by a transition rule involving its state at the previous time, the states of its neighbors, and parameters that can vary randomly or stochastically (see Corman, 1996, for a more detailed discussion of cellular automata and Latané, 1996, for an application to group influence).

Thus, we built a four-cell unlimited-neighborhood CA model using descriptions of cells and transition rules derived from the tenets of the SEM. This model assumed that group decision making consists of egocentric speakers following rules or schemes for contributions and decisions, comparable to an autonomous rule-guided agent in CA. In this automaton, each cell represented a fictitious jury member in the simulated deliberation. These "members" drew their statements for or against a verdict of guilty from a database of statements generated from Davis's groups, ensuring that the population of arguments was identical in the human and simulated juries.

To generate these databases, we parsed transcripts of four of Davis's groups into individual reasons for or against a verdict of "guilty." For example, one of the "for"
reasons was: “It doesn’t even have to be premeditated, if you heard the second part of the judge’s thing. It just has to be doing it and causing an injury that could kill.” One of the “against” reasons was: “It was just a fight. I mean, he hurt him, but I’m sure he didn’t mean to hurt him that bad.” (A complete listing of the reasons used and other details of the simulation are available from the authors.) The SEM holds that group members “think aloud” while rationalizing their preferred decisions, paying no substantive attention to what other group members are saying. Consequently, we assumed that each of the four members in the simulated group had an absolute preference for or against a guilty verdict that did not change over the course of the simulated interaction. To express this preference, they would draw from a common stock of arguments appropriate to their preferences, represented by the databases. That is, those members (i.e., cells) favoring a guilty verdict drew their arguments from the guilty arguments database, and those favoring the opposite verdict drew their arguments from the not-guilty arguments database. The SEM further suggests that group members give the appearance of conversational engagement by both taking turns and using vacuous acknowledgements. Turn taking was simulated by allowing only one member to “have the floor” during any given turn. Members “bid” for the floor on each turn through an equally weighted random draw, with the restriction that a member was not allowed to have two turns in a row. Incorporating vacuous acknowledgements (VAs) was less straightforward, since Hewes does not provide much detail about the mechanics of this type of conversational move. In thinking about how to simulate VAs in the context of a jury deliberation, however, it became clear that they would somehow have to take account of the valence of the previous statement made. For example, a VA like “That may be true, but . . .” suggests disagreement with the previous statement. Likewise, a VA such as “You’re right” implies agreement with what the previous speaker said. This connection to the previous utterance is why the VAs give the appearance of acknowledgement.3 To simulate this behavior, we developed two lists of VAs, one acknowledging agreement (e.g., “Right!” or “I see what you mean”) and the other acknowledging disagreement (e.g., “Well, OK, but” or “I don’t know about that”). (A complete listing of the VAs used is available from the authors.) In executing an utterance, the “member” selected at random a VA from the appropriate list (depending on agreement/disagreement with the position of the previous speaker) and prepended it to his or her statement. Thus, in our initial design of the automaton a member would bid for the floor, and on receiving it would select a statement from the appropriate database, attach a VA, and utter the statement. This continued for an arbitrary number of turns. Initial transcripts generated from this automaton demonstrated some clearly implausible group interaction patterns. The most obvious of these was verbatim repetition of arguments by different speakers on consecutive turns, or by the same speaker on repeated turns. To deal with this problem we stipulated that once an argument was selected and used, it was not a candidate for selection on the next turn by a juror with the same preference. 
This left open the possibility of a lag-two repetition, but this was unlikely for the length of transcripts we generated and never actually occurred in our transcripts.
A second problem was overuse of VAs. In the prototype every statement was preceded by a VA. Reading the transcripts, this clearly seemed unnatural, as if the jurors were "trying too hard" to connect to one another's statements. Fortunately, this matter was easily handled: Instead of selecting a VA on each turn, we gave each cell a parameter (equal for all cells) describing the probability that the member would use a VA on a given turn. On receiving the floor, if a random draw between 0 and 1 fell below this parameter the speaker would prepend a VA as described above; otherwise, he or she would simply give the selected reason. Setting the probability to .3 produced a level of VAs that seemed more natural.

Finally, we judged that, on the face of it, the length of the speaking turns was too uniform. Speakers would give only one reason per turn, and in the database items, individual reasons tended to be of a fairly uniform length. In the human groups, however, the length of the turns varied because speakers would sometimes give more than one reason in a speaking turn. To account for this, we created a parameter representing the probability that a speaker would give more than one reason in a turn. If a random draw between 0 and 1 fell below this parameter, then after giving one reason for a position, the speaker would select another one and attach it to the first with a conjunctive phrase selected at random from a list of such phrases. The probability of giving additional reasons was equal for all cells; a value of .10 seemed to produce a natural variation in utterance length. When members gave more than one reason, both reasons were excluded from selection by the next speaker with the same preference.
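To make the automaton's mechanics concrete, here is a minimal sketch of its turn-generation loop in Python. It is our reconstruction from the description above, not the authors' original program: the argument databases, acknowledgement lists, and conjunctive phrases are illustrative stand-ins, and names such as `simulate` are ours.

```python
import random

# Stand-in databases; the study drew these from parsed Davis et al. transcripts.
GUILTY_ARGS = [
    "It just has to be doing it and causing an injury that could kill.",
    "He kept hitting him after he was down.",
    "He knew what he was doing when he picked up that bottle.",
]
NOT_GUILTY_ARGS = [
    "It was just a fight. I'm sure he didn't mean to hurt him that bad.",
    "There's no proof he intended anything.",
    "Anybody could have ended up in that situation.",
]
AGREE_VAS = ["Right!", "I see what you mean."]
DISAGREE_VAS = ["Well, OK, but", "I don't know about that,"]
CONJUNCTIONS = ["And besides,", "Plus,"]  # hypothetical conjunctive phrases

P_VA = 0.3             # probability of prepending a vacuous acknowledgement
P_EXTRA_REASON = 0.10  # probability of giving a second reason in a turn

def simulate(preferences, n_turns):
    """preferences: list of True (guilty) / False (not guilty), one per juror."""
    last_speaker = None
    last_valence = None
    used_last_turn = set()
    for _ in range(n_turns):
        # Turn taking: equally weighted bid; no member takes two turns in a row.
        speaker = random.choice([i for i in range(len(preferences))
                                 if i != last_speaker])
        pool = GUILTY_ARGS if preferences[speaker] else NOT_GUILTY_ARGS
        # Reasons used on the previous turn are off-limits to the next
        # speaker drawing from the same pool.
        candidates = [r for r in pool if r not in used_last_turn]
        chosen = [random.choice(candidates)]
        if random.random() < P_EXTRA_REASON and len(candidates) > 1:
            chosen.append(random.choice([r for r in candidates if r != chosen[0]]))
        utterance = chosen[0]
        if len(chosen) == 2:
            utterance += " " + random.choice(CONJUNCTIONS) + " " + chosen[1]
        # Vacuous acknowledgement, valenced by (dis)agreement with the
        # previous speaker's position.
        if last_valence is not None and random.random() < P_VA:
            agrees = preferences[speaker] == last_valence
            va = random.choice(AGREE_VAS if agrees else DISAGREE_VAS)
            utterance = va + " " + utterance
        print(f"Juror {speaker + 1}: {utterance}")
        last_speaker = speaker
        last_valence = preferences[speaker]
        used_last_turn = set(chosen)

# Four "jurors," two favoring each verdict, speaking for 12 turns.
simulate([True, True, False, False], n_turns=12)
```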
Step 2: Evaluating coherence and believability

Step 1 allowed us to conclude that we could capture human group speech and use it to generate known socio-egocentric speech. We still needed to address the believability and coherence of each type of transcript, however. One concern here was the stage of discussion the transcripts represented. Although our CA model was capable of producing transcripts of any length, it did not simulate the opening and closing of a group discussion. Doing this would have required a much more complex automaton capable of generating several phases of group interaction, and it is reasonable to believe that socio-egocentric speech may not be the norm during all phases of group interaction. For example, during initiation and closing phases of interaction, group members would presumably be more focused on planning or concluding their task and less so on working out positions in their minds. To ensure that the human group transcripts did not include opening and closing phases that our CA model could not simulate, we compared the computer-generated transcripts with excerpts from the human group transcripts that excluded the openings and closings of discussions. There was no obvious basis for choosing the particular segments of the human-group transcripts to use, so we made these choices arbitrarily by randomly picking a starting point in a transcript and cutting out the same number of turns as in the generated transcripts. It is possible that in slicing an excerpt from a human group's transcript in this way, we could have disrupted the coherence of the discussion or made it less believable.
To control for the effects of our excerpting process, we used multiple pairs of excerpts. To test for artifacts related to cutting these from a longer conversation, we first generated short, medium, and long transcripts using the automaton described above. The short transcript contained 12 turns, the medium 17 turns, and the long 28 turns. We then selected three excerpts of equal length from approximately the middle of three different transcripts from Davis's groups, excluding those that had been used to construct the automaton.

We distributed these transcripts to a convenience sample of 150 participants drawn from a basic undergraduate course at a large university in the Southwestern U.S. In selecting this sample size, our assumptions were that there should be apparent differences between computer-generated and human-generated transcripts, and that the differences should become clearer with additional text for comparison. Assuming a moderate-to-strong effect size (f = .325) and three treatment groups with α = .05, n = 150 provides power > 0.95 (based on calculations by GPOWER; Erdfelder, Faul, & Buchner, 1996). Results were similar for a planned t-test to compare the transcript types, with a moderate-to-strong effect size (d = 0.6) requiring n = 122.
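These power figures can be approximated with standard routines. The sketch below uses statsmodels rather than the authors' GPOWER, and the one-tailed assumption in the t-test portion is our inference, not stated in the text.

```python
from statsmodels.stats.power import FTestAnovaPower, TTestIndPower

# One-way ANOVA: f = .325, alpha = .05, three groups, N = 150 total.
anova_power = FTestAnovaPower().power(effect_size=0.325, nobs=150,
                                      alpha=0.05, k_groups=3)
print(f"ANOVA power: {anova_power:.3f}")  # about .95, in line with the text

# Planned t-test: d = 0.6, power = .95; solve for the required sample size.
# A one-tailed test gives roughly 61 per group (~122 total), consistent
# with the n = 122 reported above; the one-tailed choice is our assumption.
n_per_group = TTestIndPower().solve_power(effect_size=0.6, power=0.95,
                                          alpha=0.05, alternative='larger')
print(f"required n per group: {n_per_group:.1f}")
```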
Each participant received a computer-generated transcript and a human-group transcript, the order of which was randomized; additionally, each was randomly assigned to one of nine conditions representing combinations of transcript length. Lacking existing measures of coherence and believability, we devised three items to test for each. Participants were asked to read each transcript and then to rate it on six Likert scales corresponding to these items. For coherence, the items were: (a) "The discussion in the excerpt doesn't make sense"; (b) "The discussion in the excerpt is coherent"; and (c) "The statements made by the participants fit together." For believability, the items were: (a) "The discussion in the excerpt is believable"; (b) "I believe a real jury could have a discussion like this"; and (c) "I don't think the discussion in the excerpt really could have happened."

Participants returned 144 usable transcripts. One of these was incomplete, leaving 143 observations for coherence. A reliability analysis indicated that both coherence and believability were best represented by two of their three component items. Low item-total correlations were observed for the first coherence item and the third believability item. We therefore created composite scales from the remaining items for coherence (α = .84) and believability (α = .84).

To test for the effects of transcript length, we analyzed ratings for computer-generated transcripts and human-group transcripts with separate one-way ANOVAs, with transcript size as the treatment. Only the test for coherence of the computer-generated transcripts even approached showing a significant effect for transcript size (F = 2.836, df = 2, 140, ns). The test for the believability of the computer-generated transcripts was not significant (F = 0.656, df = 2, 140, ns), nor were the tests for the human-group transcripts' believability (F = 0.522, df = 2, 141, ns) and coherence (F = 0.105, df = 2, 141, ns).

To test for differences between the computer-generated and human-group transcripts, we collapsed across transcript length to obtain means and standard deviations for the two variables. Comparisons of the means showed no significant differences between the computer-generated and human-group transcripts. For believability, the difference between means was 0.1956, compared to a pooled variance of 1.178, t(285) = 1.24, ns. For coherence, the mean difference was 0.2621, against a pooled variance of 0.1424, t(285) = 1.83, ns. We concluded from this analysis that there were no substantive differences between the computer-generated and human-group transcripts with regard to believability or coherence. There was also little evidence that the length of the transcript had an impact on these judgments, at least within the range tested. We would expect that if the excerpting process were somehow inducing incoherence in the human-group transcripts, the coherence ratings would differ by size, with larger excerpts being more coherent. There were, however, no significant differences of this kind. In one case, a computer-generated transcript was rated more coherent than the others; we eliminated this transcript and its corresponding-length human-group transcript.

Step 3: Evaluating socio-egocentric differences between transcripts

A final preparatory task concerned determining whether we had the kind of variation in interactions needed between the two types of transcript: We did not know that the human-group transcripts were not socio-egocentric. Given that we were working with data from a previous study, a test of all the criteria suggested by Pavitt (1993) was not practical. We lacked the data to demonstrate outcome effects of the discussions; moreover, no suitable category system existed for demonstrating sequential dependence of utterances. Consequently, we used an alternate method to establish that the human-group transcripts did not exhibit socio-egocentric speech. Under these circumstances, we believe that a close analysis of the local coherence of the discourse was an acceptable alternative for determining whether the human-group transcripts could represent socio-egocentric speech.

Centering theory (Grosz et al., 1995) holds that coherent communication is maintained across utterances through connections between forward-looking and backward-looking centers. Forward-looking centers are elements of language (usually noun phrases) to which subsequent statements could coherently refer. Backward-looking centers are references to forward-looking centers in previous utterances. Every utterance (except the first and last) has both kinds of centers, and the connections between them are what make the utterances coherent as a set. If we could demonstrate that such connections existed between turns in the human-group transcripts, then we could conclude that the transcripts had at least one kind of sequential structure and sequential relevance.

Our test for centering was based on discourse analysis of the transcript excerpts. We analyzed each speaking turn to identify plausible backward-looking references to previous utterances. In most cases centers are nouns and noun phrases, but in some cases the center must be inferred from ambiguous statements. Consider this example:

Person 1: He shot him in the head twice.

Person 2: Twice, right.
The second statement implies coherence because it makes reference to the number of times shots were fired. For each turn, we identified the feature of the preceding
utterance to which it referred, excluding the personal pronoun "he," which was used pervasively in the transcripts to refer to the defendant. Since centering can operate at a lag of more than one utterance, if we were unable to find a plausible center in the previous utterance, we looked back two or more positions to find a plausible reference and collected these in a list. The number of cases for which there were plausible centers, as well as their average lag, provided us with an indication of the extent to which centering was present in the transcripts.

Across the three transcripts, there were 51 potential connections between a backward-looking center in one utterance and a forward-looking center in a previous one. (These results are available from the authors upon request.) In our transcripts, 38 (74.5%) of these made reference to a center in the previous utterance. An additional 10 cases (19.6%) refer to centers two or three utterances back, which we regard as a plausible lag given potential side-sequences and backtracking in these kinds of discussions. Thus, 48 (94.1%) of the linkages were consistent with a coherent discussion. As for cases not fitting this idea, in one case there was a reference of lag five, and two utterances had no clear linkage to a previous center. On balance, we concluded that the transcripts showed clear evidence of local coherence and were therefore unlikely to have been generated from a socio-egocentric communication process. We considered performing a similar analysis on the computer transcripts, but quickly found that it was a pointless exercise because most statements had no plausible backward references except by chance and at large lags. This is not surprising since the process that selected the statements was random (except for exclusion of previously used reasons).

Accomplishing these three preliminary tasks—obtaining computer-generated and human-group transcripts of group speech, assessing believability and coherence of transcripts, and displaying evidence for the absence of socio-egocentric speech—provided confidence in our ability to test our hypothesis and answer our research questions. Doing so depended on people's ability to detect socio-egocentric group speech and to articulate judgment criteria guiding their selection. We report on this investigation next.

Detectability of Socio-Egocentric Speech

The most straightforward test of raters' ability to distinguish the computer-generated transcripts from human-group transcripts would be to give them one or the other and ask them to make a judgment. This is an unsatisfactory approach for two reasons. First, it gives raters a 50% chance of getting the judgment correct with a random guess. Using multiple transcripts creates a lower probability of chance success and, therefore, provides a clearer basis for differentiating raters' capabilities and the criteria they employed. Second, RQ1 concerns the possible effects of comparing different transcripts. Accordingly, our chosen design contained four conditions. Our second preliminary task, described above, yielded two pairs of excerpts. These were paired in a crossed design such that, depending on the condition, participants would either receive two computer transcripts, two human transcripts, or one of each.
Participants and instruments

Participants were 400 undergraduate students enrolled in basic communication classes at the university mentioned above. Participants were randomly assigned to the four conditions and were given the appropriate pair of transcripts and a response sheet with items for both. After being briefed on the nature of the study, they received the following instructions:

The packet you received contains two transcript excerpts. They may both be taken from an actual jury deliberation, they may both be computer generated, or you may have one of each. Please read the first transcript, answer the basic questions for it, then do the same for the second transcript.
Three items were given for each transcript. The first was a nominal forced choice: "Was the transcript you just read human-group or computer generated?" The second was a scale relating to the question "How confident are you that this judgment is correct?" anchored with "very confident" and "very unconfident." Finally, there was an open-ended question: "What are your reasons for making the judgment you did on this transcript?" Students received extra course credit for completing the questionnaires. We received 347 in usable condition.

Coding

The questionnaires contained open-ended responses indicating reasons why participants made judgments as they did. There were 62 cases in which responses to the open-ended question were missing or unusable; hence, analyses using the coding data are based on 285 cases. To make the responses amenable to analysis, we first unitized them to indicate individual reasons given for decisions on each of the transcripts. We defined these as "thought units" (a phrase at the sentence or subsentence level) indicating a unique reason within an answer. We used thought units rather than a more formal unit, such as sentences, because participants sometimes clearly gave more than one reason per sentence. For example, this partial answer, containing three units, is representative: "The yellow transcript was more detailed and contained more complete thoughts. The jurors seemed to be too organized." We assessed unitizing reliability by having two coders bracket thought units in a sample of 95 responses. Because all disagreements were matters of lumping and splitting (e.g., one coder would code one unit where the other would split the same set of words into two units), we used the correlation between the number of units identified by each coder as our reliability statistic. The value was r = .874, which we judged acceptable. Coders divided the task of unitizing the remaining responses.

Inspection of the reasons given by respondents revealed a great deal of variability that could potentially be relevant to accuracy of judgment. As might be expected, the participants justified their choices by identifying attributes of the transcripts they considered relevant to their choices, so we inductively derived categories for this coding function. Compiling the coded units into a large list, we manually inspected these to find clusters of similar reasons. This was an iterative process, in which we developed clusters, had independent coders classify a sample set of reasons, computed reliabilities, and discussed disagreements, repeating the process until acceptable disagreement levels were achieved. The final system appears in Table 1.

Table 1 Attribute Coding Categories and Definitions

Disfluencies: Uses words like "um," "uh," "like," "y'know," etc.
Informality: Notes informality reflected in specific speech characteristics. Includes use of vernacular terminology, informal or incorrect grammar, comments referring to grammatical errors, and language or word choice other than disfluencies, including slang, colloquialisms, and informal vocabulary.
Errors: Notes mistakes in spelling, syntax, and/or punctuation in the transcripts. Refers specifically to errors that would occur in the production of the transcript.
Naturalness: Makes global judgment referring to naturalness or "flow" of conversation, believability, real-soundingness.
Fragmentation: Notes interruptions, being jumbled, incomplete sentences, sentence fragments.
Clarity: Comments about the interaction being clear, direct, to the point, easy to understand, organized, making sense, being easy to follow, structured.
Reasoning: Deals with reasoning or persuasiveness, evidence of thinking or reasoning, or intelligence displayed by jurors.
Depth: Comments on the quantity of detail or depth of conversation or individual utterances, the extent of evidence considered, range of topics discussed, or number of viewpoints expressed.
Involvement: Comments on the tendency of the jurors to be able to empathize, emote, or place themselves in the context of the crime or trial, or to identify with the criminals/victims.
Empathy: Involves the ability of the participant to imagine themselves in the trial or on the jury, or to understand the emotions, thinking, etc. of the people involved in the trial.
Other: Does not clearly fit the above categories.

Two coders applied these to the entire set of unitized reasons; reliability (kappa) for the final coded units was k = 0.89, a value we judged acceptable.

The attribute coding system contained 11 categories. To help reduce the number of categories in this complex set, we sought to identify a smaller number of types of participants who differed systematically according to their use of the 11 attributes. For this purpose, we performed a hierarchical cluster analysis (Ward method) on Euclidean distances between participants' profiles of attribute use. Sixteen clusters were distinguishable in the resulting analysis, which hardly reduced our number of variables. We therefore found a break in the hierarchy that combined these into a reasonably sized set of higher-order clusters. The result was six clusters of participants and the attributes they used, as shown in Table 2.

Table 2 Clusters of Users Based on Attribute Profiles

1. Other (N = 44): People who base judgments on attributes that don't fit the coding system.
2. Conversational (N = 108): People who make specific judgments about conversational plausibility based heavily on some other attribute.
3. Depth (N = 21): People who base judgments on the quantity of depth or detail in the transcript.
4. Reasoning (N = 31): People who make judgments based on reasoning in, and/or form of, the conversation represented in the transcript.
5. Clarity (N = 38): People who base judgments on the extent to which the transcript is easy to follow.
6. Naturalness (N = 43): People who make global judgments about the extent to which the transcript "sounds" real.

We reckoned that because participants in these clusters made judgments based on different kinds of attributes, they may have been better or worse at distinguishing the human from computer transcripts. All but two clusters clearly encompassed cases in which respondents made heavy use of one of the attributes in a single category. One exception was Cluster 2, in which there was consistent but not heavy use of the naturalness attribute along with heavy use of some other attribute that differed by participant. We interpret this as a specific evaluation of the plausibility of the conversation for a particular criterion (as opposed to the global judgment of naturalness in Cluster 6). The other exception was Cluster 4, which contained cases that heavily emphasized either reasoning or informality attributes, which we interpret as a judgment about whether human-like reasoning is present.
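The two-stage reduction (many fine-grained clusters, then a cut higher in the hierarchy) can be sketched with standard tools. This is a schematic illustration with made-up data, not a reanalysis of the study's profiles.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = participants, columns = counts of the 11 coded attributes.
# Random data standing in for the study's 285 attribute-use profiles.
rng = np.random.default_rng(0)
profiles = rng.poisson(lam=1.0, size=(285, 11))

# Ward's method on Euclidean distances, as in the study.
tree = linkage(profiles, method="ward")

# Cutting low in the hierarchy gives many small clusters (the authors found
# 16); cutting at a higher break yields a manageable set -- here, six.
labels = fcluster(tree, t=6, criterion="maxclust")
print(np.bincount(labels)[1:])  # sizes of the six clusters
```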
Analysis strategy

Our hypothesis predicted that participants could distinguish human-group from known socio-egocentric speech at better than chance rates. One way to test this was to determine the probability that a given participant correctly classified a given transcript. A participant making a random choice would have a probability of p = .5 of making a correct judgment on a transcript. Given that participants were making judgments for two transcripts, we tested H1 by defining expected frequencies under a chance model using the joint probabilities of getting zero, one, or two judgments correct (p = .25, p = .5, and p = .25, respectively). We tested for significant deviation from this distribution by means of chi-square.

The research questions relate to differences between transcript types, but our experimental design had each participant judging two transcripts. Traditional statistical tests are questionable under such circumstances because they carry strong assumptions about the independence of observations. Therefore, we tested for the presence of dependence between participants' judgments by means of an intra-class correlation (see Gonzalez & Griffin, 1997). The value r = −.30 was significant, z = −5.41, p < .05. This means that there was indeed dependence between observations and that, in our case, traditional parametric statistics were inappropriate.

For an alternative method of analysis, we turned to classification trees (Breiman, Friedman, Olshen, & Stone, 1984) because they allow for testing of complex claims
about combinations of variables without assuming independence of observations. A classification tree is a non-parametric method that selects from a large number of nominal independent variables4 the combination that most efficiently classifies observed instances of a nominal dependent variable according to some purity criterion. In our case, this meant it revealed the combination of independent variables that best sorted out correct and incorrect judgments. In doing this, it also provided an estimate of the value or importance of the independent variables in making those classifications. The classification tree procedure begins with the selection of the independent variable that yields the purest classification of the dependent variable within its categories. It is possible that one independent variable might perfectly classify all the cases into two groups of right and wrong judgments; however, pure splits (or even close to pure splits) are rare. Thus, the procedure continues in an iterative manner, locating the independent variable from those remaining in the set that makes the purest split of the cases in the branch. The analysis proceeds for both branches of the initial split, which eventuates in a hierarchy, or tree, of purest successive splits based on the independent variables. The procedure stops when some acceptable level of purity is achieved at the “leaves” of the tree. Following the branches of the tree allows interpretation of the combinations of independent variables that lead to certain values of the dependent variable (in our case correct or incorrect judgments), and the importance of the various independent variables in that chain. The importance of the branches can also be judged in terms of the number of cases they contain. In the present study, the dependent variable of interest was a correct or incorrect judgment about the origin of a transcript, and the independent variables were the categories and coding functions described above. The classification tree analysis allowed us to understand whether and, if so, how different attributes of the transcripts led to correct or incorrect judgments, even though those judgments were not independent.
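For readers who want to reproduce this style of analysis, the sketch below grows a comparable tree with scikit-learn rather than the original CART software. The data are random stand-ins, and treating the six-level cluster variable as numeric is a simplification (CART splits nominal predictors directly; one-hot coding would be closer).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 570  # two judgments from each of the 285 participants with usable responses

# Illustrative coding of the four predictors (random stand-in data):
X = np.column_stack([
    rng.integers(1, 7, n),  # attribute-profile cluster (1-6)
    rng.integers(0, 2, n),  # type of transcript judged (0 = computer, 1 = human)
    rng.integers(0, 2, n),  # type of the other transcript in the pair
    rng.integers(0, 2, n),  # confidence (0 = low, 1 = high)
])
y = rng.integers(0, 2, n)   # 1 = correct judgment, 0 = incorrect

# Each split selects the predictor yielding the purest division of correct
# and incorrect judgments; Gini impurity is one common purity criterion.
model = DecisionTreeClassifier(criterion="gini", min_samples_leaf=20).fit(X, y)
print(export_text(model, feature_names=["cluster", "type", "other", "conf"]))
print(model.feature_importances_)  # analogue of CART's variable importances
```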
Results

Table 3 shows the results for H1. The test value of 7.41 exceeds the critical value of χ²(2, p < .025) = 7.38. Unexpectedly, however, the residuals show that the effect is in the direction of inaccuracy: Relative to what we would expect by chance, there were fewer people with one correct judgment and more people with zero correct judgments.

Table 3 Chi-square Analysis of Distribution of Correct Judgments

N correct   Observed   Expected   Residual
0           107        86.75      4.73
1           152        173.5      2.66
2           88         86.75      0.02
Total       347                   7.41
The participants did significantly worse in accurately judging the types of transcripts they were reading than chance alone would lead us to expect.
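As a check on Table 3 (ours, not the authors'), the statistic follows directly from the stated chance model; a few lines with scipy reproduce it:

```python
from scipy.stats import chisquare, chi2

observed = [107, 152, 88]  # participants with 0, 1, and 2 correct judgments
total = sum(observed)      # 347
expected = [0.25 * total, 0.50 * total, 0.25 * total]  # 86.75, 173.5, 86.75

stat, p = chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")  # 7.41, p ~ .025
print(f"critical value (df = 2, p = .025): {chi2.ppf(0.975, 2):.2f}")  # 7.38
```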
Classification Tree

The classification tree sheds some light on this result. It was produced with CART5 software using as input 570 judgments (two from each of the participants with usable open-ended responses). The dependent variable was judgment (incorrect or correct). Independent variables included the attribute profiles from the clustering described above, the type of transcript being judged (computer or human), the type of the other transcript in the pair (computer or human), and the participant's stated confidence in the judgment (high or low, based on a median split of the confidence scale). All four independent variables were used in the resulting tree.

The classification tree representing these variables appears in two halves to accommodate printing constraints. The primary split is based on type, whether the transcript being judged is human generated (Figure 1) or computer generated (Figure 2). In both figures, the transcript type is shown at the bottom, followed by a number of branches leading to leaves. Each branch and leaf is labeled with the independent variable category it represents, as well as the percentage chance of a participant correctly identifying the type of transcript for cases in that branch or leaf. The size of the leaves indicates the number of cases they contained, and the shade indicates whether the leaves were balanced toward probable misjudgments (black) or probable accurate judgments (white).6
[Figure 1. Classification tree for human-generated transcripts. Each branch and leaf is labeled with its splitting category (attribute cluster, other transcript's type, or confidence), the percentage of correct judgments, and the number of cases; details appear in the text.]
[Figure 2. Classification tree for computer-generated transcripts. Each branch and leaf is labeled with its splitting category, the percentage of correct judgments, and the number of cases; details appear in the text.]
Human transcripts

Participants had only about a 40% chance of correctly identifying the human transcripts as human generated. The most important factor was the attribute cluster employed, with participants using conversational, reasoning, and other attributes (Branch 1) doing much worse than chance. In Figure 1, it is evident from the size of Leaves 1a and 1b that most of the cases were in this branch. Participants did worst of all in this branch if the other transcript they were judging was computer generated. The more accurate judgments of human transcripts were based on depth and clarity attributes (Branch 2). These participants did somewhat better if the other transcript being judged was computer generated than if it was human generated. For the latter case (Branch 3), people using depth attributes did much better. When the other transcript was computer generated (Branch 4), having low confidence made people somewhat more accurate (Leaf 4a) and, unlike in Branch 3, people using the depth attribute did worse (Leaf 5a).

Computer transcripts

In general, participants were more likely to judge computer transcripts correctly (60.3%). Figure 2 shows that in contrast to human-transcript judgments, those with high confidence (Branch 2) were much more accurate, especially if the other transcript being judged was human generated (Branch 9). Here again, we see contrasting effects for one of the attribute clusters: Those using the natural attribute did worse than chance if the other transcript was human (Leaf 9a) but did much better than chance if the other transcript was computer generated (Leaf 8b). For those with low confidence (Branch 1), a large group using conversational or depth attributes
did better than chance (Leaf 4a). Those applying clarity attributes (Branch 6) did better if the other transcript was human (Leaf 6a). The poorest performance on the computer transcripts came from those applying reasoning, natural, and other attribute profiles (Branch 3).

Overall patterns

Depending on the type of transcript and/or type of the other transcript, there were several cases of opposite effects for the same attributes. The most striking was the conversational attribute, which was associated with correct judgments for computer transcripts (Branch 4 and Leaf 2a) but incorrect judgments for human transcripts (Branch 1). For high-confidence computer judgments, the depth/clarity attributes and the naturalness attribute had opposite effects depending on whether the other transcript in the pair was computer or human (Leaves 8a–9b). The same was true for depth versus clarity on the right portion of Figure 1.

Some additional results pertain to the research questions. CART takes a learning sample of cases from the total and constructs the purest possible tree from the sample. The value of the dependent variable for the remaining cases is predicted on the basis of that tree and is compared with the actual values. An interesting result is that while the tree correctly predicted about 73% of the inaccurate judgments, it correctly predicted only about 58% of the accurate judgments. Finally, CART produces a rating (expressed as a percentage) of the importance of the independent variables relative to the most important independent variable; in this case, it was attribute profile (100%), transcript type (56.8%), confidence (50.0%), and other transcript (45.6%). In general, then, the attributes used by participants to make their judgment had twice as much overall impact on accuracy as any of the other independent variables.

Discussion

Our study reflected a desire to examine Hewes's SEM, which proposes that group members pay scant attention to the content of one another's statements during decision making and instead focus their cognitive energy on preparing for their own chances to voice their views. We wondered whether socio-egocentric speech could be detected, which would tell us whether the SEM is even a plausible model of natural group interaction. To test this, we obtained samples of actual group speech from mock juries (Davis et al., 1978) and used these to develop an automaton that produced contrived group speech in accordance with the SEM. We then made sure that our human-group transcripts were not socio-egocentric and that there were no significant differences in coherence or believability. This exercise led us to identify one shortcoming of the SEM as it is currently formulated: Vacuous acknowledgements must be valenced to properly express agreement or disagreement with the preceding statement they acknowledge. Even in a socio-egocentric discussion, group members could not completely ignore the content of what others are saying and still accomplish this move.
Next, we conducted a quasi-Turing Test and found that our socio-egocentric automaton passed it: Participants could not reliably distinguish transcripts produced by it from transcripts generated by interacting human groups. We therefore failed to reject the null hypothesis in the test of H1. In fact, participants performed at rates worse than one would expect on the basis of chance. The reason for this poor judgment seems to be that participants were not deciding randomly but were (as a group) applying judgment strategies that simply did not work very well.

Investigation of our research questions shed light on this finding. We placed cases into attribute profiles, clusters reflecting the types of attributes participants used to make their judgments, as expressed in their open-ended answers. Our reasoning was that some people, because of the way they analyzed the text, might be better than others at distinguishing the human-group transcripts from the computer-generated ones. For example, because the SEM addresses only a global appearance of coherence, those focusing on local conversational depth and reasoning might have a better chance of detecting it. In the classification tree analysis, attribute profile (see Table 2) was the most important factor predicting a correct or incorrect judgment. Its predictive accuracy was roughly double that of any of the other independent variables. This provided an unambiguous answer to RQ2, although not the one we envisioned when writing the question: The judgment criteria used by raters did influence their ability to judge a given example of group speech correctly, but consistent use of any criterion seems likely to lead to failure or, at best, mixed results.

This answer requires unpacking. In Figure 1, most errors occurred when participants focused on the conversationalness or reasoning in the transcript. In any real group discussion, there are bound to be statements that are not well reasoned and points where the conversation does not go smoothly. Given uncertainty about the origin of the transcripts, participants were willing to attribute these disfluencies to mistakes made by a computer. Depth and clarity, which are more concerned with the specific discourse and its coherence, served raters of the human transcripts much better, though they were less popular choices.

In the context of the overall judgment task, however, we cannot conclude that avoiding conversational attributes is a good idea. Participants who employed conversational and reasoning (and other) attributes, when they also had high confidence in their judgment, had a high probability of correctly identifying a computer-generated transcript. Ironically, this leaf (Leaf 2a in Figure 2) has the highest probability of a correct judgment in the whole tree, yet it represents exactly the combination of attributes that explains inaccurate judgments of human-generated transcripts (i.e., Branch 1 of Figure 1). There were other such contradictions in the trees. For example, in Branch 2 of the human-transcript judgments (Figure 1), depth and clarity had opposite effects on accuracy depending on whether the other transcript given to the participant was human or computer generated. Thus, our results indicate that the best attributes to use for judgment depend on the type of transcript being judged and the type of transcript available for comparison, and that the attributes that increase performance for one tend to decrease performance for the other.
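A toy simulation can show how a consistently applied criterion produces exactly this pattern. Suppose raters treat disfluency as a computer "tell" when, as argued above, human talk is actually the more disfluent of the two; the disfluency rates below are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    is_human = rng.random(n) < 0.5  # half the transcripts are human generated
    # Assumed rates: human talk is messier than the simulated talk.
    disfluent = np.where(is_human, rng.random(n) < 0.7, rng.random(n) < 0.3)
    # The consistent heuristic: "disfluent, therefore computer generated."
    judged_human = ~disfluent
    print("accuracy:", np.mean(judged_human == is_human))  # about .30, below chance

Random guessing would score .50 here; it is the consistent application of a miscalibrated criterion that drives accuracy below chance, which is the pattern our raters displayed.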
These contingencies provide a partial answer to RQ1, though again it is not the one we anticipated, because of the complex interaction with attribute use. In most instances, when a split was made based on the type of other transcript (human or computer), a judgment was more likely to be correct when a participant had a contrasting type of other transcript. Relatively speaking, however, this was the least important of the independent variables in the tree. Much more important were the attributes participants used and, as noted above, these interacted in complex ways with the type of transcript and the type of other transcript.

Overall, the most important factor in participants' performance was the attributes they cited in explaining their judgments. Consistency was apparently a hindrance in this task, and we know from the intra-class correlation analysis that participants were consistent in their judgments. This explains how participants could have done worse than chance: In this comparative design, applying any kind of consistent judgment criteria seems to decrease performance. Next, we explore the implications of these findings for the coherence of communication in the SEM and for the possibility of automated interventions in group interaction.

Coherence in the Socio-Egocentric Model

A further explanation for the poor performance of our judges involves types of coherence in group speech. Above, we noted that the consistent use of attribute profiles seemed to work to the detriment of raters in their judgment tasks. In constructing the trees, our classification procedure often began its splits with combinations of the attribute profiles. Generally, when raters assessed computer transcripts, the attributes producing the best chances of success were conversational and reasoning (and, to a lesser extent, naturalness) when the other transcript was also computer generated, but depth and clarity were better when raters were able to compare with a human transcript. For transcripts produced by human groups, the most reliable attributes were depth and clarity, across several contingencies.

These findings suggest that the participants' attribute profiles comprised either local or global criteria. It seems plausible that those using the conversational and naturalness attributes, in particular, focused on holistic judgments about the transcript's believability, while those using the depth and clarity clusters attended to turn-by-turn dynamics and the degree of detail achieved through a series of speaking turns. In short, participants likely attended to different "levels" of analysis in assessing the plausibility of the excerpt. The application of local or global heuristics produced differing degrees of success depending on the source of the excerpt under consideration. In other words, the consistent application of a heuristic suited to one transcript type would often produce success in judging that type, given the presence of other variables, but would usually lead to a misperception of the other transcript type. We advance this notion of differing levels of attributes in participants' reasoning as a tentative explanation, while noting that participants rarely made explicit mention of the excerpt evidence on which they drew in making judgments.

A distinction between local and global criteria aligns well with differing versions of coherence in the group communication literature. In Hewes's model, coherent interaction exists when there is influence in message content or function between two turns of talk occurring within a segment of the discussion. Pavitt and Johnson (1999), following Tracy (1982), refer to this as coherence generated by local topical relevance, in which utterances are meaningfully connected to some aspect of the immediately preceding utterance. An alternative route to coherence is global topical relevance, in which utterances refer to the more encompassing topic of discussion. It is possible that participants who focused on conversational and naturalness attributes used a global criterion of coherence but ignored turn-to-turn dynamics. In doing so, they may have found evidence of coherence in the computer-generated excerpts because all statements were drawn from a database of reasons derived from an actual deliberation. Those who correctly assessed the source of the human transcripts tended to focus more on the degree to which statements followed from, and built upon, one another. Attention to the semblance of relevance created by VAs in a human transcript, for instance, could lead to concentration on this misleading element in a computer transcript. A misalignment of heuristics with transcripts, then, could explain the sample's lower-than-chance rates of success overall. Specifically, if heuristics that direct attention to either local or global phenomena were not evenly distributed in the sample, judgments would likely be biased in one direction.
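The distinction can be made concrete with two crude coherence measures: a local one based on word overlap between adjacent turns, and a global one based on each turn's overlap with a fixed topic vocabulary. The tokenization, the topic word list, and the example turns are all our own simplifications, not measures used in the study.

    def words(turn):
        return {w.strip(".,!?").lower() for w in turn.split()}

    def local_coherence(turns):
        # Mean Jaccard overlap between consecutive turns (turn-to-turn relevance).
        pairs = list(zip(turns, turns[1:]))
        return sum(len(words(a) & words(b)) / len(words(a) | words(b))
                   for a, b in pairs) / len(pairs)

    def global_coherence(turns, topic):
        # Mean share of each turn's words drawn from the discussion's topic set.
        return sum(len(words(t) & topic) / len(words(t)) for t in turns) / len(turns)

    TOPIC = {"defendant", "guilty", "evidence", "witness", "alibi", "verdict"}

    # Turns that are all on topic but ignore one another (socio-egocentric style).
    egocentric = ["The evidence points to the defendant.",
                  "A witness can be mistaken.",
                  "His alibi decides the verdict for me."]
    # Turns that build on the immediately preceding turn.
    engaged = ["The evidence points to the defendant.",
               "Which evidence? The witness evidence is weak.",
               "The witness evidence is weak, but the alibi evidence is not."]

    for name, turns in [("egocentric", egocentric), ("engaged", engaged)]:
        print(name, round(local_coherence(turns), 2),
              round(global_coherence(turns, TOPIC), 2))

On these toy measures the two excerpts look similar globally but diverge sharply locally, which is just the pattern that could fool a rater attending only to global relevance.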
This local/global account in no way invalidates the SEM; rather, it shifts the ground of the debate and suggests a direction for future research. Perhaps it is necessary to move from the micro-level toward more global characteristics of conversation in order to locate instances of intra-group communicative influence [7]. Similar reasoning is apparent in Meyers and Brashers (1999), who review the literature on group influence within a nine-cell framework displaying the multiple levels of messages and sources of message production in group interaction. From this perspective, it may be that our respondents' conceptions of influence reflected higher-level considerations of speech acts (such as arguments and strategies) while ignoring turn-by-turn dependencies in the content of statements. Hewes suggests that communicative influence must be shown not only in the coherence of group speech; it must also effect changes in members' cognitions (see Seyfarth, 1999) and have an impact on group decisions. Although these issues were outside the scope of our study, future research must engage with them in responding to the socio-egocentric challenge.

Monitoring Interaction in Group Decision Making

If heuristics about local and global coherence surface in assessments of transcribed group discussion, they are also likely to operate in respondents' actual group interactions. These heuristics can likewise affect judgments of one's own and others' contributions to group decision making. In instances of group decision making in which the choice has substantial consequences, as is usually the case with juries, monitoring members' interactions to ensure both local and global relevance would seem to be in the interests of functional decision making (Hirokawa, Erbert, & Hurst, 1996).
Such a monitoring role could be performed by a single member of the group whose cognitive complexity enables him or her to overcome socio-egocentric demands and take a directive stance in discussion (Gruenfeld, Thomas-Hunt, & Kim, 1998). This is not an argument for the precedence of non-interactive factors (which the SDS model suggests are predictive of decision making); it instead concerns members' personal characteristics that have the potential to shape subsequent group discussion (Haslett & Ruebush, 1999; Seibold, Meyers, & Sunwolf, 1996).

The potential influence of individual characteristics on the coherence of group discussion raises an interesting intervention possibility. Our results suggest that participants, on average, could not discern computer from human speech. This indicates that an intelligent quasi-agent (Corman, 1997) could be embedded in an interaction system, be perceived by human members as conversational and natural, and thereby shape discussion. Proponents of group decision-making systems envision an intervention along these lines in considerations of "level 3" group decision support systems, which "are characterized by machine-induced group communication patterns and can include expert advice in the selecting and arranging of rules to be applied during interpersonal interaction" (Poole & DeSanctis, 1990, p. 175). If such intelligent agents can be designed so that members see them as legitimate interactants, some of the cognitive, affiliative, and egocentric constraints on effective decision making (Gouran & Hirokawa, 1996; Kuhn & Poole, 2000) may be attenuated.
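To suggest what such an embedded quasi-agent might do, the sketch below monitors a discussion turn by turn and interjects when local relevance drops, using the same kind of crude overlap measure as earlier. The threshold, the prompt wording, and the interface are invented for illustration; an actual level 3 system would be far more sophisticated.

    def words(turn):
        return {w.strip(".,!?").lower() for w in turn.split()}

    def overlap(a, b):
        # Jaccard word overlap between two turns, a stand-in for local relevance.
        return len(words(a) & words(b)) / len(words(a) | words(b))

    def monitor(turns, threshold=0.1):
        # Yield an agent prompt after any turn with low local relevance.
        for prev, cur in zip(turns, turns[1:]):
            if overlap(prev, cur) < threshold:
                yield f'AGENT: How does "{cur}" bear on the previous point?'

    discussion = ["The evidence points to the defendant.",
                  "I had a terrible commute this morning.",
                  "The witness evidence seems weak to me."]
    for prompt in monitor(discussion):
        print(prompt)

A rule this blunt would flag legitimate topic shifts as well as socio-egocentric drift; the design point is only that a machine interactant can police local relevance that human members, on our evidence, do not reliably track.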
Limitations

Although our quasi-Turing Test provided for an examination of the SEM's assumptions, three limitations temper our conclusions. First, our automaton formalizes socio-egocentric group discussion, but it neither traces possible lagged influence nor examines covariance with a group output measure. Both are essential in a full consideration of Hewes's model. Hence, ours is a necessary test of the SEM's assumptions, but not a sufficient test of the theory.

Second, our participants were cued by being placed in a situation in which they were instructed to seek differences between transcripts (or cues within a transcript) and to explain the heuristics they employed in finding those differences. It is difficult to imagine collecting these data without such cuing. We acknowledge, however, that the cuing may have made raters more prone to look for evidence of their lay theories of interaction. It may thus have placed an analytical "frame" on the judgment task that influenced heuristic use, though it did not appear to improve raters' ability to discern transcript sources.

Finally, although our findings can be applied to decision-making groups in a variety of contexts, this undoubtedly important and frequently studied form of group decision-making interaction is only one of many types. Even groups dedicated to arriving at decisions are not always in decision-making mode (Scheerhorn, Geist, & Teboul, 1994). Moreover, juries engage in a highly circumscribed form of decision making that may exclude the sort of problem solving, embeddedness, and relationship management integral to many other groups (Putnam & Stohl, 1990), and this should be a concern in subsequent research on the SEM.
Conclusion

In response to models of group decision making that disagree about the role of communication, Hewes proposed a baseline model for assessing the influence of group discussion on decision making. Refuting his socio-egocentric model has proved a challenge for scholars, so we turned the question around: Instead of seeking unambiguous evidence of communicative influence, we asked whether people can reliably distinguish simulated socio-egocentric group speech from speech generated by real human groups. In the end, our quasi-Turing Test cannot definitively answer questions regarding the impact of communication on group decision making. We have, however, added to the knowledge of how members' statements are linked in group discussion and shed light on the resources people employ in making sense of group interaction. Our evidence suggests that people may treat criteria other than the linked content of utterances proposed by Hewes as evidence of influence in group interaction. We anticipate that the insights produced by this study will prove fruitful for research on communicative influence and group decision making, not merely concerning the SEM but for the study of group interaction more generally.
Notes

[1] Hewes does not necessarily believe communication has no effect on group outcomes, only that such an effect has not been conclusively demonstrated. He therefore offers his socio-egocentric model as a baseline or "foil" for efforts to demonstrate communicative effects.

[2] We call ours a quasi-Turing Test for two reasons. First, participants in our studies were not able to interact with a machine or with human groups; they could only read transcripts of their conversations. Second, as described in the methods section, in some conditions of the experimental design participants rated two transcripts of the same kind.

[3] If the VAs did not do this, it is hard to see how they could be judged coherent. Coherence requires that an utterance connect in some discernible way to subjects or objects (or centers: Gordon & Hendrick, 1998; Grosz, Weinstein, & Joshi, 1995) deployed in previous utterances. This suggests that group members must be at least somewhat more attentive to the content of the conversation than Hewes suggests: They could not be so self-absorbed as to completely ignore what other members are saying and still competently execute coherent vacuous acknowledgements.

[4] Classification trees can also be used with ordinal variables, and a variant of the method, called regression trees, works with value ranges of continuous variables.

[5] More information about this software is available at http://www.salford-systems.com.

[6] There are somewhat more cases in the computer branch (n_c = 298) than in the human branch (n_h = 272) because unequal numbers of questionnaires were returned in the four experimental conditions.
[7] Hoffman and Kleinman (1994) advance a similar claim in their argument for a group valence model of decision making, as opposed to the aforementioned SDS and DVM.
References

Axley, S. (1984). Managerial communication in terms of the conduit metaphor. Academy of Management Review, 9, 428–437.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Corman, S. R. (1996). Cellular automata as model of unintended consequences of organizational communication. In J. Watt, & A. VanLear (Eds.), Dynamic patterns in communication processes (pp. 191–212). Thousand Oaks, CA: Sage.
Corman, S. R. (1997). The reticulation of quasi-agents in systems of organizational communication. In G. A. Barnett, & L. Thayer (Eds.), Organization—communication: Emerging perspectives: V. The renaissance in systems thinking (pp. 65–82). Greenwich, CT: Ablex.
Davis, J. H. (1969). Group performance. Reading, MA: Addison-Wesley.
Davis, J. H. (1973). Group decision and social interaction: A theory of social decision schemes. Psychological Review, 80, 97–125.
Davis, J. H. (1982). Social interaction as a combinatorial process in group decision. In H. Brandstätter, J. H. Davis, & G. Stocker-Kreichgauer (Eds.), Group decision making (pp. 27–58). London: Academic Press.
Davis, J. H., Spitzer, C. E., Nagao, C., & Stasser, G. (1978). The nature of bias in social decisions by individuals and groups: An example from mock juries. In H. Brandstätter, J. H. Davis, & H. Schuler (Eds.), Dynamics of group decisions (pp. 33–52). Beverly Hills, CA: Sage.
Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, Instruments, and Computers, 28, 1–11.
Gonzalez, R., & Griffin, D. (1997). On the statistics of interdependence: Treating dyadic data with respect. In S. Duck (Ed.), Handbook of personal relationships: Theory, research, and interventions (pp. 271–302). New York: Wiley.
Gordon, P. C., & Hendrick, R. (1998). The representation and processing of coreference in discourse. Cognitive Science, 22, 389–424.
Gouran, D. S., & Hirokawa, R. Y. (1996). Functional theory and communication in decision-making and problem-solving groups: An expanded view. In R. Y. Hirokawa, & M. S. Poole (Eds.), Communication and group decision making (2nd ed., pp. 55–80). Thousand Oaks, CA: Sage.
Grosz, B. J., Weinstein, S., & Joshi, A. K. (1995). Centering: A framework for modeling the local coherence of a discourse. Computational Linguistics, 21, 203–225.
Gruenfeld, D. H., Thomas-Hunt, M. C., & Kim, P. H. (1998). Cognitive flexibility, communication strategy, and integrative complexity in groups: Public versus private reactions to majority and minority status. Journal of Experimental Social Psychology, 34, 202–226.
Haslett, B. B., & Ruebush, J. (1999). What differences do individual differences in groups make? The effects of individuals, culture, and group composition. In L. R. Frey, D. S. Gouran, & M. S. Poole (Eds.), The handbook of group communication theory and research (pp. 115–138). Thousand Oaks, CA: Sage.
Hewes, D. E. (1986). A socio-egocentric model of group decision-making. In R. Y. Hirokawa, & M. S. Poole (Eds.), Communication and group decision-making (pp. 265–291). Beverly Hills, CA: Sage.
Hewes, D. E. (1996). Small group communication may not influence decision making: An amplification of socio-egocentric theory. In R. Y. Hirokawa, & M. S. Poole (Eds.), Communication and group decision making (2nd ed., pp. 179–214). Thousand Oaks, CA: Sage.
Hirokawa, R. Y., Erbert, L., & Hurst, A. (1996). Communication and group decision-making effectiveness. In R. Y. Hirokawa, & M. S. Poole (Eds.), Communication and group decision making (2nd ed., pp. 269–300). Thousand Oaks, CA: Sage.
Hoffman, L. R. (1979). The group problem-solving process: Studies of a valence model. New York: Praeger.
Hoffman, L. R., & Kleinman, G. B. (1994). Individual and group in group problem-solving. Human Communication Research, 21, 36–59.
Karelis, J. (1986). Reflections on the Turing Test. Journal for the Theory of Social Behaviour, 16, 161–171.
Kuhn, T., & Poole, M. S. (2000). Do conflict management styles affect group decision-making? Evidence from a longitudinal field study. Human Communication Research, 26, 558–590.
Latané, B. (1996). Dynamic social impact: The creation of culture by communication. Journal of Communication, 46, 13–23.
Lorge, I., & Solomon, H. (1955). Two models of group behavior in the solution of eureka-type problems. Psychometrika, 29, 139–148.
McPhee, R. D., Poole, M. S., & Seibold, D. R. (1981). The valence model unveiled: A critique and alternative formulation. In M. Burgoon (Ed.), Communication yearbook 5 (pp. 259–278). New Brunswick, NJ: Transaction Books.
Meyers, R. A., & Brashers, D. E. (1999). Influence processes in group interaction. In L. R. Frey, D. S. Gouran, & M. S. Poole (Eds.), The handbook of group communication theory and research (pp. 288–312). Thousand Oaks, CA: Sage.
Nash, J. (1953). Two-person cooperative games. Econometrica, 21, 129–140.
Pavitt, C. (1993). Does communication matter in social influence during small group discussion? Five positions. Communication Studies, 44, 216–227.
Pavitt, C., & Johnson, K. K. (1999). An examination of the coherence of group discussions. Communication Research, 26, 303–321.
Penrod, S., & Hastie, R. (1979). Models of jury decision-making: A critical review. Psychological Bulletin, 86, 462–492.
Poole, M. S., & DeSanctis, G. (1990). Understanding the use of group decision support systems: The theory of adaptive structuration. In J. Fulk, & C. W. Steinfield (Eds.), Organizations and communication technology (pp. 173–193). Newbury Park, CA: Sage.
Poole, M. S., McPhee, R. D., & Seibold, D. R. (1982). A comparison of normative and interactional explanations of group decision-making: Social decision schemes versus valence distributions. Communication Monographs, 49, 1–19.
Putnam, L. L., & Stohl, C. (1990). Bona fide groups: A reconceptualization of groups in context. Communication Studies, 41, 248–265.
Reddy, M. (1979). The conduit metaphor: A case of frame conflict in our language about language. In A. Ortony (Ed.), Metaphor and thought (pp. 284–324). Cambridge, UK: Cambridge University Press.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50, 696–735.
Scheerhorn, D., Geist, P., & Teboul, J. B. (1994). Beyond decision making in decision-making groups: Implications for the study of group communication. In L. Frey (Ed.), Group communication in context: Studies of natural groups (pp. 247–262). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Seibold, D. R., Meyers, R. A., & Sunwolf (1996). Communication and influence in group decision-making. In R. Y. Hirokawa, & M. S. Poole (Eds.), Communication in group decision-making (2nd ed., pp. 242–268). Thousand Oaks, CA: Sage.
Seyfarth, B. J. (1999). Are reasons structures? A cognitive-structurational approach toward explaining small group processes. Unpublished doctoral dissertation, University of Minnesota, Minneapolis.
Seyfarth, B. (2000). Structuration theory in small group communication: A review and agenda for future research. In M. Roloff (Ed.), Communication yearbook 23 (pp. 341–379). Thousand Oaks, CA: Sage.
Shanon, B. (1989). A simple comment regarding the Turing Test. Journal for the Theory of Social Behaviour, 19, 249–256.
Stasser, G., & Davis, J. H. (1981). Group decision making and social influence: A social interaction sequence model. Psychological Review, 88, 523–551.
Sunwolf, & Seibold, D. R. (1998). Jurors' intuitive rules for deliberation: A structurational approach to communication in jury decision making. Communication Monographs, 65, 282–307.
Tracy, K. (1982). On getting the point: Distinguishing "issues" and "events," an aspect of conversational coherence. In M. Burgoon (Ed.), Communication yearbook 5 (pp. 279–301). New Brunswick, NJ: Transaction Books.
Turing, A. (1950). Computing machinery and intelligence. Mind, 59, 433–460.