Quantifying Uncertainty

Negation and the Statistical Language of Film

By Daniel Skipper Rasmussen
Master's thesis, Film and Media Studies
University of Copenhagen, 2018
Supervisor: Paisley Livingston

Abstract

This thesis is an attempt to revise two beliefs firmly ingrained within certain areas of cinema studies. The procedure will be to update the framework within which, and the premises on which, these beliefs are formed. The first belief is that cinema is not a language. The second belief is a by-product of the first and states that negation, when conceived as a logical or linguistic operation, is not and cannot be part of cinema's expressive vocabulary. I intend to show that both beliefs can be usefully reconsidered by envisaging cinema as a stochastic language. In this view, the language of cinema is too uncertain to be captured by formal principles or grammars, but structured enough to form patterns that evolve in accordance with the language of statistics and probability theory. It directly follows that, when trying to learn such a stochastic language – as is the job of any spectator – epistemic uncertainty inevitably comes along for the ride. In this statistical framework, negation takes shape not as a logical or linguistic operation, but as a distance between incompatible generative models of a stochastic cinematic phenomenon. To show how, in this updated view, negation can be said to form an integral part of cinema's language, I analyse two films, O Último Mergulho and The Duke of Burgundy, from this applied statistical perspective.


Table of Contents

INTRODUCTION
  FILM THEORETIC LANGUAGE GAMES AND CURRENT DOGMAS OF FILM THEORY
  TOWARDS A MINIMAL SPECTATOR AND "THE FILM ITSELF"
  THE ROLE OF NEGATION
  RESEARCH QUESTION AND STRUCTURE OF PAPER

THE ART OF MODELLING: THE CINEMA AS LANGUAGE HYPOTHESIS AND ITS CRITICS
  THE LANGUAGE LOSS HYPOTHESIS
  CINEMA'S TOWER OF BABEL: FROM A LANGUAGE OF CINEMA TO LANGUAGES OF CINEMA
  SUMMARY OF PART I: LANGUAGE, PERCEPTION, STATISTICS

STATISTICAL LANGUAGE LEARNING AND THE BAYESIAN BRAIN
  THE BAYESIAN BRAIN HYPOTHESIS AND THE PROBABILISTIC MIND

FORMALIZATION AND MODELLING OF CINEMATIC LANGUAGES
  WHAT VARIABLES DO WE FORMALIZE?
  INTRODUCING THE BOOLEAN LOGIC OF FORMALIZATION
  INFERENCE AND PROBABILITY FROM THE BOOLEAN DATASET: UNCERTAINTY AND DEPENDENCIES
  THE TWO MAJOR INTERPRETATIONS OF PROBABILITY THEORY
  PROBABILITY FROM THE BOOLEAN DATASET
  REDUCING THE SIZE OF THE MATRIX: SPURIOUS DEPENDENCIES
  ESTIMATION OF FREQUENTIST PROBABILITIES FROM THE BOOLEAN DATASET
  NEGATION IN THE PROBABILISTIC PERSPECTIVE

CASE STUDIES: TWO FILMS
  A NEGATION OF RELATIVE FREQUENCY: THE CASE OF O ÚLTIMO MERGULHO
  THE CONSTRUCTION OF PLAUSIBLE MODELS
  DIFFERENT MODELS WITHIN A FILM: A MULTIVARIATE ANALYSIS OF THE DUKE OF BURGUNDY
  CONSTRUCTING THE BOOLEAN DATASET
  THE OPENING SEQUENCE: A CASE OF MULTI-MODEL NEGATION

CONCLUSION

LITERATURE

APPENDIX
  APPENDIX 1: A REVIEW OF EARLIER APPROACHES TO MODELLING OF VISUAL DATA
  CINEMETRICS AND THE MATHEMATICAL ANALYSIS OF STYLISTIC VARIABLES
  STRUCTURAL-LINGUISTIC ATTEMPTS AND NON-STATISTICAL MODELLING
  FROM CINEMA STUDIES TO MACHINE LEARNING: IMAGES PARSING AND VIDEO CONTENT MODELLING
  APPENDIX 2: THE MATHEMATICS OF RATIONAL INFERENCE AND STOCHASTIC MODELLING


Introduction

"Perceptual similarity is the basis of all expectation, all learning, all habit formation. It operates through our propensity to expect perceptually similar stimulations to have sequels perceptually similar to each other. This is primitive induction" (Quine 1995: 19).

"Why does a different result compel in us the conviction that the circumstances must have changed, if ever so little? We can reconcile ourselves to the conjunction of minimally altered circumstances with very strong influence upon the result, but can never admit to the slightest change in the result in genuinely unaltered circumstances." (Schrödinger in Reichenbach 1978: 328).

It is often stated (e.g. Sainsbury (2007), Crane (2009)) that while negation is an organic property of natural language, it is not so of pictures. In short: "There is no way of simply using a picture alone to deny what it represents" (Crane 2009: 459). Jerry Fodor has called it "the familiar objection to the picture theory of ideas: There is nothing in John's not loving Mary for a picture to resemble" (Fodor 2005: 176). Noël Carroll has argued that much the same holds for moving images, for "if negation is a natural part of language, then film cannot be a normal specimen of language, since it would appear to lack the means to say "no" in its putative vocabulary" (Carroll 2008: 105). The reason, it is argued, is that pictorial representational forms are neither compositional nor propositional in nature, at least not in the traditional linguistic sense of the words, and the consequence, it is argued, is that pictorial forms in general, and cinema specifically, are not languages. Rather, the accepted view seems to be that pictorial forms are positive states of affairs mediated through resemblance to what they depict (Blumson 2010: 150). Recently, however, Ben Blumson (2014) has attempted to reconcile the view that images are mediated by resemblance with the view that images are compositional and propositional, and thus in some sense also capable of being symbolic systems and hence languages, if perhaps noisy ones. If such a view is deemed sound, it is perhaps a first step towards incorporating, with impunity, negation into the vocabulary of pictorial forms of (re)presentation, cinema included.

But let us take a few steps back, since the concept of negation we are going to employ in this paper is rather different from the concept of negation found in linguistics or logic. Put simply, in the traditional view, negation signals the absence of something, but it does not try to quantify or contextualize this absence. As such, the sentence 'John is not there' signals the absence of John and is the negation of the sentence 'John is there.' Looked at in this way, negation is a formal construct functioning independently of the gravity or weight of the situation of John's being absent. Such gravity, if accounted for, would be historically determined and depend, for instance, on the past regularity of John's presence in various contexts. In this expanded view, where negation comes in degrees, negation is no longer a purely logical operator, but is instead tied to a temporal or narrative statistical inconsistency. To consider intuitively how this rather different and weaker sense of negation might be said to form a natural part of the messy nature of human cognition, as it is used
in real life and at the cinema, it will be beneficial to bring into the picture a theory which neuroscience has supported widely in recent years: the Bayesian brain hypothesis (Friston 2010). The idea is that the brain in many respects resembles a probabilistic inference machine, even for low-level perceptual processes such as predicting the movement of a moving body. However, while in this sense the mind might be a computer, what it computes need not be precise, far less predetermined by or veridical to the contents of perception. While this impurity of imprecision and bias is a constant presence even for simple cognitive computations, it becomes especially prevalent in cases of more obvious epistemic uncertainty. Confronted with uncertainty about observable data (or, more precisely, about what has generated the observable data), say, as a result of broken patterns in the environment, any human being is likely to resort to applying various explanatory models, each model offering an explanation of what generated the observed data. But if the models explaining the data are mutually exclusive, and provide adverse results relative to the same hypotheses, the models can be said to negate each other. This negation is not given by nature, but is a cognitive fabrication. Negation is thus a type of failure in the head (paraphrasing here Inoue et al. (1998)).

It could be argued that cinema presents an especially interesting case for the study of such dynamics of uncertainty. Given its fabricated nature, cinema (perhaps especially art-cinema) can alter its own patterns in time, distribute imperfect or contradictory information, and in general work to disrupt a spectator's world-model. Such disruptions of epistemic certainty arise because the mechanisms which generate the observed environment can no longer be inferred precisely by the brain. To alleviate the problem of not being able to rationalize sensations and observations, the brain is likely to meet the updated demands of the environment by inventing new generative models. These new models might be successful, in which case they explain the phenomena at hand and are likely to sustain themselves in the pool of possible generative mechanisms, or, conversely, any new model might fail or clash with another model, in which case the experience of fragile emotions such as dissonance and general unease is the more likely outcome. If we situate the concept of negation within this inferential, pattern- and model-based framework, we end up with a slightly different and updated meaning of the word. When reasoning or inference is fundamentally challenged – that is, when a generative model fails or two or more models clash and compete – no negation in the linguistic or logical sense need have taken place, but a probabilistic negation – a negation of a convention, a statistical regularity of nature, or of an idiosyncratic expectation, given a model – has taken place.

The primary goal of this thesis is to formulate a theory of cinema in which this renewed concept of negation can indeed be said to form an integral part of its language. We will argue, then, contrary to Noël Carroll and others, that, in its own way, cinema very much has the ability to say 'no'.
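To make the idea of clashing generative models slightly more concrete before proceeding, consider the following minimal sketch. It is not part of the thesis's formal apparatus; the two candidate models, the observation sequence, and all probabilities are invented purely for illustration. The point is only that Bayes' rule lets two mutually exclusive explanations of the same data be weighed against each other, and that the gap between their posterior weights is one crude way of operationalizing what we will later call probabilistic negation.

```python
import math

# Two hypothetical generative models of the same Boolean observation
# (say, "a recurring motif appears in this scene"):
#   model "A": the motif appears with probability 0.9 (a strong regularity)
#   model "B": the motif appears with probability 0.1 (the regularity is gone)
models = {"A": 0.9, "B": 0.1}
prior = {"A": 0.5, "B": 0.5}  # start undecided between the two explanations


def update(observations, models, prior):
    """Apply Bayes' rule sequentially over a finite set of Bernoulli models."""
    post = dict(prior)
    for x in observations:  # x is 1 (motif present) or 0 (motif absent)
        for m, p in models.items():
            post[m] *= p if x == 1 else 1.0 - p
        total = sum(post.values())
        post = {m: v / total for m, v in post.items()}  # renormalize
    return post


# A run of scenes in which the motif keeps appearing, then abruptly stops.
observations = [1, 1, 1, 1, 1, 0, 0, 0]
posterior = update(observations, models, prior)

# One crude measure of how far apart the two explanations now stand:
# the log-odds separating them after the data have been seen.
log_odds = math.log(posterior["A"] / posterior["B"])
print(posterior, log_odds)
```

On these toy numbers the early scenes pull the posterior strongly towards model A, and each later deviation erodes that advantage; it is this tug between incompatible explanations of the same observations, rather than any logical operator, that the statistical notion of negation is meant to capture.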
The twofold contrary view – that, since there is no language of cinema at all, no negation can take place, or that, since cinema cannot perform negation, it cannot be a language – becomes more difficult to subscribe to when we look at the phenomenon of negation in a broader statistical light. Further, we will argue that if cinema can perform negation in this expanded sense of the word, then it is indeed some form of language. Why is this necessarily so? The reasoning here, which will be employed throughout the paper, is that language is made possible by a certain repetitiveness of the structure of communication. But language does not arise until someone engages in trying to infer this repetitiveness and learn from it. To make things simple, the perceived structure could be a cinematic work, and the one making inferences about this structure could be a
given spectator. But if no repetitiveness is perceived to be present, then no perceived negation is likely to take place. But if on the other hand negation is perceived to take place in the mind of the spectator, then it must be because this spectator has perceived the existence of a certain degree of regularity beforehand. In our view, negation is thus initiated by a collapse of regularity given a model. This perceived model-collapse leads to the initiation of further (mutually negating) models. Eventually, the hope is that a new model is arrived at which is stable enough to collapse all other models. A cycle of regularity-negation-regularity is thus formed. For observers of stochastic environments (and cinema is such an environment), such cycles of mental operations tend to replicate themselves over time. But if in this way, negation is dependent on perceived regularity in an environment, the question then becomes: what types of regularities exist in cinema, and how can we delineate them? To answer this question, there is a need to see cinema as not merely cohesive or messy states of affairs, but as consisting of autonomous parts capable of sustaining themselves over time and of engaging in different relations and configurations. Such an autonomous part might be a character in the film. A regularity might be his or her always wearing a white shirt. An irregularity in this case would be his or her unexpectedly wearing a red shirt. This will, all other things equal, admittedly be an extremely truncated form of statistical negation, but nevertheless a valid one relative to our model. To be sure, any statistical language, in contrast to a formal language, evolves in time. Mapping and tracking the evolution of autonomous parts and relations in time – shirt, white shirt, red shirt, character – and treating them as being capable of engaging in causation, means arguing for the validity of the process of formalizing and reducing the image to certain visual components deemed salient. We are thus led into a process of determining what variables are deemed superfluous, and which relevant, in a given film, and to isolate these by the act of designation and reference. This is a process of formalization through decomposition. By engaging in such a process, we are liable to go against a few existing dogmas within film theory. Responding to various critiques of the validity of the approach described above means engaging in discussions with scholars who reject the cinema as language hypothesis. Such a discussion will form the first part of this thesis. The second part will have as its goal to develop a formal method of analysing cinematic language at work in general, and negation in particular. As already noted, negation in cinema, at least in our model, has less to do with propositional forms than statistical irregularities. Surely, we might imagine a situation, as in Alain Robbe-Grillet’s La Belle Captive (1983), where a character appears to be both dead and alive at once, and ask why not model such an instance as a pure and simple negation of classical logic. The reason for not doing so is simply that whatever logic is applied in a given film might differ from the logic applied by the spectator in real life. 
In contrast, if we model the "negativity" of the instance in question as simply a gap between the spectator's prior statistical or probabilistic model with respect to the simultaneous alive/dead scenario, and his or her model of the scenario being logical within the world of the film, we have a model capable of greater plasticity. To recapitulate our argument so far: negation is a concept found mostly in logic and linguistics. However, as cinema is not a formal system with clear truth-predicates – indeed, one could argue that it does not form propositions in the traditional sense – the language of logic, linguistics, or even semiotics will be deemed too crude and inflexible to form a veridical relationship to cinematic forms of communication. Rather, we will engage in the language of statistics and employ a probabilistic view of negation. Looked at as a statistical language, with certain regularities and irregularities,
cinema has the ability to say 'no'. By employing this view, this thesis will argue that otherwise incompatible views, such as cinema as language/not language, cinema as compositional/not compositional, can in fact be reconciled. The tension in these dichotomies will be, if not eliminated, then alleviated by focusing on statistical relations across time. We will thus engage in a process of quantifying uncertainty, which is generally said to comprise the very realm of probability theory and statistics.

Let us, however, take a step back and put such a move into perspective. As in any other field, it holds for film studies as well that the certainty with which something can be said depends on the nature of the available evidence. In film studies, the material of evidence is largely constituted by three things: our general knowledge of the limitations that make up the medium, the instances of films or imagined scenarios, and thirdly spectators and our knowledge of them and their general cognitive faculties. In many cases, our knowledge is more than just limited. Evidence of specific spectator-experiences is, for the most part, non-existent. In addition, in its current state neuroscience can only guess about the nature of most functions operating within our brain. As such, the two major roadblocks that prevent cinema theory from establishing itself as, if not a hard science, then a science in which it is less cumbersome to validate statements, can perhaps be said to be the following. The first roadblock is upheld by the fundamental problems associated with any attempt to elucidate with any sort of precision the organization and structure of particular films. In later chapters, to gauge the strength of the first roadblock, we will review a few of the existing scholarly attempts at a formalization of cinema. The second roadblock can be said to be the problem of empirically discovering a precise, deictic connection of stimuli to brain. That is, to be better at answering questions such as: with what certainty can we say that the more or less particular X in the film is related to the more or less particular sensation Y in the brain? Or, to be more precise: what is the (non-)connection between an instance of, say, red on screen, and the sensation of red in the brain? While the latter roadblock seems restricted first and foremost by scientific progress in relevant areas such as functional magnetic resonance imaging and eye-tracking technology, the former – tied more generally to the problem of treating cinema as a (strict) language – might be restricted by a certain reluctance ultimately rooted in dogmas concerning the nature of cinema language. Indeed, in the first case, the evidence is right there in front of us in the form of the films themselves and the properties they carry.

Our concern in this thesis will be with the first roadblock, so far upheld by persisting difficulties in figuring out what evidence the film itself carries. We do not claim to arrive at a definitive solution to the problem. Further, while our stated goal is to arrive at a solid theory of the language of negation as occurring in the film situation in its various forms, the statements made about the viability of formalizing films – and of film as language – will be of a more general nature. Let us briefly interrogate the first of the claims made above, pertaining to the problem of language, reduction, and formalization, since this is the one we are going to engage with critically.
The problem of reducing or constructing the structure of complex systems – and cinema is one such system – is a very foundational one. Naturally, a good deal of headway has already been made. Barry Salt, already in 1974, in his Film Quarterly article "Statistical Style Analysis of Motion Pictures", proposed a data-driven approach to film theory. More recently, the study of so-called cinemetrics has popularized and extended the proposals made by Salt. Even more recently, a largely discourse-driven tradition of multimodal film analysis (e.g. Bateman (2012), Wildfeuer (2014)) has attempted
to extend the formal approach to cinema even further by including diegetic variables in the formalization process. Finally, within the field of Machine Learning, there appears to be great interest in discovering precise ways to formalize (moving) images. Rather than address these recent advancements here – we will do so in due time – I will turn to defend the validity of the very attempt to formalize non-formal languages or systems, whatever form they might take. I will do so by introducing the concept of the minimal viewer.

Film theoretic language games and current dogmas of film theory

What has restricted the language of cinema – assuming it is viable to even speak of such an entity – from being realized, made specific, formalized? The term language of cinema could be said to be somewhat of a platitude. Ironically, there seems to be no commonality in how theorists of film describe the language of cinema and its constituents, as Edward Branigan (2006) has pointed out at length in his book Projecting a Camera: Language-Games in Film Theory. If Branigan's theory is made plausible by, say, the proliferation of the tradition of designating a close-up without defining it formally, it would make most theories or statements about film, however plausible, easy prey. In other words: they would be too easy to refute, and it would be too easy to refute the refutation. A vicious circle is produced.

If such is indeed the situation, and one can argue for or against this, many reasons come to mind: film images are not just concrete and unique in a very real sense, they are also continuous multiplicities of signs, or, more objectively, signals, or, even more objectively, variables and properties. Further, signification is not limited to meaning, but includes sensations and emotions as well, a topic that has received increasing attention since the 1990s at the hands of, amongst others, Carl Plantinga and Greg M. Smith (1999). The sensations arrived at, faced with a string of close-ups at different points in time, might have a certain commonality, within or amongst spectators, or they might not, but in any case, the impact of the variable and the state of affairs it appears in might be the main transfer or signal between sender and receiver, rather than its "meaning." Further, both emotion and meaning are products of convention, ecology, and evolution, as well as of bottom-up processes, as made clear by Joseph Anderson in his book The Reality of Illusion: An Ecological Approach to Cognitive Film Theory (1998).

To put it simply, the dogma is this: an image cannot be reduced structurally, nor conceptually, since the process by which it comes to be understood is compound, contingent, and only partially dependent on the image itself. Poststructuralist circles might be the most avid in warning against the dangers of formalization, but the view is found in naturalism as well. Naturalism has, as Murray Smith (2017) explains, a limit. Indeed, as he says, it "is a form of theorizing" (Smith 2017: 56), and he underlines that a naturalized aesthetics is a process of explanation through theory construction rather than scientific explication. Naturalist methods might be demonstrational, but they are not what they demonstrate. The dogma outlined above – that an image can be neither structurally nor conceptually reduced – does not argue against the possible values of forms of formalization, but does argue against the value of any pure reduction to material and at-hand visible causes of experience. Film semiotician Christian Metz, of all people, has perhaps stated one part of the dogma most clearly: "One can decompose a shot, but one cannot reduce it" (Metz 1974: 116). Unlike the language-critics, who refute both 'decomposition' and 'reduction' – what we termed conceptual and structural reduction above – Metz believes that decomposition is meaningful, but only if it acts in favour of explicating the image's compositionality. I do not claim that the dogma so described is all-prevalent, nor that it is necessarily wrong.
I do argue, however, that it is a) present in various shades and strengths and that b) it might have contributed to a widespread unwillingness to formalize film language that is perhaps not rationally warranted. It is my argument, however, that by taking a statistical and relational approach to formalization, many of the problems arising from a semiotic conception of signification, and from a perhaps too literal conception of the art of reduction, can be alleviated.

With the above in mind, let me state the main argument of the thesis with regard to its constructionist part. Suppose initially that any film has a material or ontological basis. Or suppose, if you cannot subscribe to such a doctrine, that the lowest-base accessible reality is an (epistemological) convention or agreement. I will argue, then, that cinema in its material basis can be interpreted as a (principally infinite) number of variables interacting in time in probabilistic ways, and that these relations can be objectively constructed once a few conventions for our construction-language have been established and agreed upon, that is, once a certain model has been chosen.
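As a foretaste of the formalization and estimation chapters, the white-shirt example given earlier can be written out in a few lines. The encoding, the variable, and the numbers below are hypothetical and serve only to show what it means to treat one salient visual variable as a Boolean series, to read a relative frequency off it, and to measure how unexpected the deviating observation (the red shirt) is under that frequency.

```python
import math

# Hypothetical Boolean series for one variable across consecutive scenes:
# 1 = the character wears the white shirt, 0 = he or she does not.
white_shirt = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Frequentist estimate of the regularity: the observed relative frequency.
p_white = sum(white_shirt) / len(white_shirt)  # 0.9 on this toy series

# The "surprise" (negative log-probability) attached to each outcome under
# a model that simply expects the observed regularity to continue.
surprise_white = -math.log(p_white)      # ~0.11: the expected continuation
surprise_red = -math.log(1.0 - p_white)  # ~2.30: the deviating observation

print(p_white, surprise_white, surprise_red)
```

The red shirt contradicts nothing in the logical sense; it is simply an observation to which the current model assigns a low probability, which is the truncated, model-relative sense of statistical negation the shirt example was meant to convey.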

Towards a minimal spectator and “the film itself” In talking about “the film itself,” I realize that this is an idealization, and it will be treated as such. As such, there exists no one film, but only models of a given film. Indeed, this is part of the main argument of the thesis. Still, some model account of a film is closer to the properties of the visual evidence of the ideal “film itself” than others. As such, while it is perhaps obvious that various spectators’ idiosyncratic responses to an artwork cannot be modelled outside of hypothetical scenarios, the construction part itself, as briefly outlined above, presupposes a spectator as well – not a rogue one, but, let’s say, a minimal one: that of the formal analyst in search of a model of “the film itself.” While the idiosyncratic film-situation is one of translation and interpretation, yielding reciprocities between languages of cinema and languages of spectators and the inferential and compromising mechanisms that follow, the ideal analyst, however displaced, is in a process of translation and interpretation as well. According to this view, the two processes – that of the analyst and that of the idiosyncratic spectator – differ in degree rather than in distinction. As we embark on our task of establishing and adopting methodological conventions and assumptions for a formalization of cinema, it would thus be natural to construct these foundations from generalized knowledge about the ways in which viewers – i.e. the human brain – attend to films. One might initially find it attractive to simply designate in the most precise and least granular way possible. However, above and beyond the problem of the granularity of perception, the additional impediment appears to be that any attempt to designate the thing merely as ‘the thing in itself’, or to look for autonomous ‘smallest entities’ in the image, would come off as self-defeating; done in this way, all objects would be unique, thus preventing any useful convention from being established. To be clear: it wouldn’t be prosperous to outline relations between pixels in a film unless we have plausible evidence that our eyes will detect and extract data functionally at such a level. The project is thus one of discovering which variables in the image should be considered salient. Should we include a chair depicted in the image in our formalization, or simply ignore it? Such questions are far from banal, and there is perhaps no clear answer to such a question. One will simply have to follow a set of methodological ground-rules to achieve consistency. What these ground-rules consist in will be laid out in the chapter on formalization. It should be made clear, then, that our minimal viewer is simply a model, and while this thesis will end up choosing one set of methodological assumptions to build a probabilistic framework, it does not claim this to be the only one, nor the right one. To be clear: the objectivity of any formal language we claim to present lies not in exhausting what can be said about a given film, nor in presenting a universal language of cinema:
rather the objectivity rests on a strict coherence between what can be said and what can be formalized within a given language.

The role of negation We can now return to the, for our purposes, main use-value of the probabilistic system: that of negation. If a language of cinema works, as we argued initially, not through strict signification, but through regulatory use, then this intuitively opens the door for the negative side of language – the ability to play with and deregulate itself. As such, though we establish a framework capable of broad assessments, the way we use it will be confined mostly to problems of cinema language, in the form of messy signals. Such messy signals exist, not in the film itself, but between spectator and screen. Cinema, in its proper form – as an experience – is a mutual space of language translation and multifaceted cognitive-emotive response. Any viewer is liable to conduct statistical inference, filter information, perhaps view statistical regularities as authorial intentions and vehicles of meaning or structure, experience certain emotions, etc., all partially circumscribed by the nature of the film, but (s)he is not an entirely rational being, nor an unbiased one: any viewer will conduct him- or herself with a certain bias as product of both culture and evolution. In this way, the role of negation is as much dependent on the viewer’s idiosyncrasy as it is on the base structure of the film. But we might ask: why this choice of focusing on negation? It seems to me that one apparent starting point for a critique of formal applications to cinema might be instances where meaning is negated and where language or translation breaks down. For the claimed naturalist, however, such instances can be modelled on equal footing with positive or productive aesthetic principles. If one can fit instances of negation neatly alongside all other forms of cinema, and find a framework that can unite negative and positive principles with equal clarity and fitness, and connect these to neuroscientific theories of aspects of such experiences, then one will have made a strong case for the validity of both a formalist and naturalist approach. In the latter part of this paper, we thus turn to ask questions of the following sort: how can a film negate the regularity of use of its own signals in time, and how might such a negative self-play obstruct or influence a viewers’ response to the signal? By answering these questions, we gauge the claim that negation forms a natural part of movie-viewing when looked at from a more sensitive statistical and probabilistic perspective.

Research question and structure of paper

If we frame our hypothesis in the form of a question, it can thus be stated as follows: How might a statistical view of the language of cinema, as realized through a probabilistic, compositional and temporal framework, help alleviate the film theoretic problem of formalizing the image without crudely reducing the image structurally or conceptually? And how might the concept of negation, as travelling in and between spectators and films in the form of irregular structures and negative experiences, be said to be broadened and sensitized when formulated within a statistical framework rather than that of logic or linguistics?


The thesis will be split into four parts. The first part will review and consider the validity of some oft-applied arguments against the notion that cinema can be viewed as a language. It will argue that these arguments tend to fail, and end with a presentation of how cinema might beneficially be gauged against a statistical rather than linguistic view of language. The second part will turn to statistical language learning and the Bayesian brain hypothesis, situating the spectator within a probabilistic view of inference. The third part will introduce the method of formalization we are going to employ in this thesis, which includes an introduction to probability theory. The fourth and final part, before summing up our findings, will consist of a formal analysis of two films. Since the process of formalization rises exponentially in difficulty with the number of variables to be formalized and tracked through time, we will start off by applying our probabilistic framework to just one variable. Only then do we consider a situation of several variables interacting in time. O Último Mergulho (1991) will be used as the case for the univariate analysis, while The Duke of Burgundy (2014) will be used as the main case for the multivariate analysis.


The art of modelling: The cinema as language hypothesis and its critics It is like looking into the cabin of a locomotive. There are handles there, all looking more or less alike… (Wittgenstein 2009: 10) Imagine that you are asked to describe a certain object on the table in front of you, the only rule being that you cannot describe it by means of ostension, that is, by simply referring to it. You might describe the object as a blue cup, while admitting at once that the description is not exhaustive but neither exactly wrong, in so far as the object is in fact a cup and is in fact blue in your model of language. In every such process of description or modelling, reality and model enter a non-binding relationship with the goal of achieving some sort of mutual correspondence. To make an analogical point, we currently see reality as quantum-mechanical, but it is not the same as saying that reality is quantum-mechanical. In fact, in many ways it is not. Rather, what it means is that quantummechanics offers a model of a certain aspect of reality which, when employed, provides a distinctive and useful explanatory power relative to the certain aspect under inspection. And so, to turn to cinema, one can similarly see cinema as a variety of things which each work to elucidate certain aspects of its nature. One can see cinema as entertainment, as art, as embodied perception, as language, as a mental construction, as fiction or reality, and so forth, and test the correspondence quality of the model-object relation accordingly. Whatever model one chooses will impose constraints on the possible set of things you can say about the object at hand. Following this line of thought, to ask what it entails to conceive of cinema as a language is to ask first and foremost what model of language you are going to apply. If we choose to test a hypothesis by which we see cinema as language, the premise is to engage in another model in which we see language as something. In this way, the iterative process described above is bound to circularly repeat. For one can see language as many things, each carrying its own more or less defined set of properties. Is there an intersection to which every type of language, with its following set of inferred properties, pertains? If there is, we would likely have a hard time shedding light on what these properties exactly are. Following the above it becomes a given that if one desires to study and test the language-like qualities of a certain X against a certain model of language, one must define a set of properties belonging to or constituting that model of language. Within cognitive film theory, the tradition has been to choose rather narrow models of language. Consequently, the result has been the widespread claim that film is not a language. Within this tradition, the statement that film is not a language has not so much consisted in highlighting, say, the contingency or diversity of viewer response, as it has been to prove that cinema in its very composition differs substantially from the way natural languages, such as the English language, are organized. As such, cognitive film scholar Torben Grodal notes that there is a “radical difference between language and audiovisual communication” (Grodal 2016: 101). 
More radically, in The Matter of Vision (2017), Peter Wyeth frames the cinema as language hypothesis as “literary reductionism” on the ground that language is “logocentric.” Within other segments of film studies, scholars have used wider, often semiotic models of language (e.g. Buckland (2000)) with the obverse result that film is indeed a language.


From the above, we can infer that one should perhaps be careful about equating a negative answer to the question "is film a language?", provided by one's own model of language, with the generalizing statement that cinema cannot be any language at all. It is my claim, however, that cognitive film scholars such as Noël Carroll, Gregory Currie, and others have not only chosen an inappropriately narrow model of language, but that, even given the model they have chosen – that of the English language – we should get a largely positive answer to the question at hand. We will turn to this argument shortly. However, even if, as is the case, we will end up with the result that cinematic language indeed shares many properties with natural language, I will argue that those affinities are best studied through, not linguistics, but statistics. Thus, if one maintains a weaker conception of language, in which its elements are studied statistically, to be able to envisage a language of cinema does not require the discovery of a strict set of conventions. Neither does it require of cinema that it resemble or mimic some other specific language, such as English. In the more minimal view of language which we will employ, to treat X as a language means insisting that there is a regularity to X which, when decoded or experienced, is regular enough to yield a utility relative to a decoder who is trying to learn or experience that language at work. This is a statistical view of language. If we, using this model of language, end up with the positive answer that cinema is indeed some form of language – namely a statistical one – it is my claim that we can extract information from cinematic works which we otherwise cannot. Before turning to our argument with Carroll and Currie, however, we should briefly address an idea that quite a few scholars seem to support, namely the language loss hypothesis.

The language loss hypothesis

The idea that something akin to language loss is more intuitively associated with cinema than language is a hypothesis which effortlessly aligns itself with the popular work of scholars such as Maurice Merleau-Ponty, George Lakoff, and Michael Polanyi, whose notions of embodiment have paved the way for a deeper acknowledgement of the 'tacit' and 'tangible' as determinate of and prior to notions of both language and cognition. Evidently, such work has made its impact on film studies as well, leading many scholars to advocate variants of a language loss conception of cinema and cinematic experience (e.g. Tijana 2013). Are such views inherently incompatible with a formal approach to cinema analysis such as ours? One might try to rescue notions of the contingency of experience (of watching a film) from the realm of non-science by giving the following response: "… But if a certain quality X in the artwork always produces Y in the viewer, Y being an experience of language loss in the form of a direct material contact, then this relation X-Y will be part of a fixed language of language loss." The clear and valid counter-argument is that no such regularity is likely to exist between viewers, thus causing trouble for the language hypothesis. But let us test the notion that some regularity of experience does indeed exist. It is commonly accepted within cognitive film theory that certain "tyrannical" or primitively manipulative properties of films are likely to yield approximately similar reactions in the audience, as a consequence of us humans sharing the same primal brain and the same physical space. One such theory is the tyranny of film hypothesis as put forth by Loschky et al. (2015). The theory states that, in certain intensified continuity sequences of Hollywood films, exogenous variables – that is, the visual properties on the screen – guide the audience's attention to a significantly higher degree than endogenous variables such as higher-order cognition and inferred context. Even if we accept this theory, which the evidence says we should,
an evident trouble of generalization persists. For if one car-chase in one film is likely to yield a similarity in reaction and attention across audiences, does this mean that we can generalize to statements about other car-chases? It is not evident that we can. The trouble is not only that empirical studies into such things require great effort; the trouble is also in finding variables that persist across films to a degree that allows one to generalize. This evidently poses a problem in and of itself for the cinema as language hypothesis. But even if one finds empirical evidence of the sort that connects certain properties of an image to regularities in stimuli across audience members by, let's say, having recourse to empirical eye-tracking methods, fMRI studies, and questionnaires, it still holds that what accounts for that regularity is speculative at best. Two people might cry at a certain point in a film, but they might do so for various reasons. It is for this reason that I will turn to a regularity that, if it does exist, does so in a non-contingent way, and for which we have shared visual evidence, namely the structure and composition of the visual properties of the film itself. We will thus leave the matter of the intersection between spectator-experience and the language hypothesis for now. We will return to it in our chapter on the probabilistic brain and probabilistic reasoning. Instead, let us then turn more directly to some fundamental counter-arguments to the film-as-language hypothesis, here in the voices of philosophers and cognitive film theorists Noël Carroll and Gregory Currie.

Cinema's Tower of Babel: from a language of cinema to languages of cinema

We all know the platitude that "an image says more than a thousand words", but few are probably left wondering "yes, but how many more words than a thousand?" The idea is rather that no amount of words will do the job. While some scholars (e.g. Bateman (2012); Buckland (2000)) argue that cinema is some form of language, with semiotic rather than linguistic theories as their reference point, the more prominent contemporary view seems to be that, due to its nature, cinema cannot be any language. In this section, I review Gregory Currie's and Noël Carroll's attempts at a refutation of the cinema as language hypothesis, and I argue that their main arguments end up being inherently paradoxical. Even if their arguments are similar, they differ enough for us to deal with them separately. We will start with Currie.

Currie's argument rests on the idea that language is acontextual and conventional, and that cinema is not those things. Currie's premise is thus, in his own words, that "in sum, our language is productive and conventional, so its meaning-determining conventions are recursive, so it has meaning atoms, so it is molecular, so it is acontextual. A great deal in the argument that follows will depend heavily on these entailments" (Currie 1993: 210). In the pages that follow, I will argue that it is sound to reject this premise, but perhaps I should put my agreements with Currie on the table first. Mainly, I will agree that there is no fixed syntax of cinema; nor does cinema contain the same naturally appearing atomistic qualities. But Currie's argument becomes more spurious once he moves beyond these preliminary notions. Currie is not all that clear as to just what he means when he says that language is acontextual. That language is not acontextual, at least in one sense, namely the sense attached to its use, has been extensively argued by a host of philosophers such as Hilary Putnam, W.V.O. Quine, and Ludwig Wittgenstein, amongst others (see, for instance, Quine (1968) or Wittgenstein (2009)). More precisely, if language is acontextual in use, it is so to a degree. This intuitively invites our statistical view. But perhaps Currie's idea of acontextuality is more limited and not tied to use. We will evaluate this option shortly, but for now, let me put forth my main argument. I will argue that a) semantic meaning is not acontextual or recursive, nor is it, and for this
reason, entirely conventional, and b) that it seems ill-considered to think that our job – the job of the defender of the validity of the conception of a language of cinema – is the following: "What we require of the defender of cinema as language [my note] is that he tells us what this intrinsic, acontextual meaning possessed by cinematic images is, and that he shows (i) that this meaning is story-meaning, and (ii) that this meaning has the explanatory features of semantic meaning" (Currie 1993: 213). The fact that Currie demands of the "defender" to show the acontextuality of cinematic images is based on the premise that language is acontextual, a premise one might reasonably question. To see just what Currie means by acontextuality, let us review his own attempt to play the Devil's advocate, the role of the defender, when he notes that "what the image typically records is actors performing actions among props on a set. This sort of meaning is acontextual: it does not depend on relations between images, because it is locally determined by the conditions of the take. And by juxtaposing images one simply gets an accretion of meaning: if the meaning of image A is M(A) and that of image B is M(B), then image A followed by image B just means M(A) & M(B), where the order of juxtaposition is irrelevant to meaning… The meaning, in this sense, of a complex of images is just the logical sum of the meanings of its constituent images" (Currie 1993: 213).

We have already addressed the fact that we believe Currie's discussion is based on a flawed premise. If we put that aside for a moment, it is still not clear that his argument – while meant as an attempt to disprove his own theory – is sound. It may be true that no image can directly have an impact on another image – that no configuration alters the fixed constituents of, say, single frames – and that cinema shares this with natural language (the words themselves, in their material form, can be put together just as you like: the sentence "There is a rabbit over there" and the sentence "over there rabbit is a" have the same basic constituents). But defining meaning at this base level renders the concept of meaning, to make a pun, meaningless. Beyond this base level, I will argue that both cinema and language are contextual, compositional and – to a certain extent – conventional. The reason that Currie, even if he in this very deprived sense of meaning accepts that images are acontextual, still does not accept that cinema is a language, is, then, that this deprived 'photographic' sense of meaning is not story-meaning. Let us consider the following, perhaps slightly contrived scenario: if shades of red happen to be recognizably present in the image whenever A is happy and well-mannered, and shades of blue happen to be recognizably present whenever A is sad or fails to behave, and A then, in a scene where she is about to propose to her boyfriend, is accompanied in the image by shades of blue, while signalling a happy state of mind with a smile, does this not impact directly on story meaning and the nature of the predictions made by the audience? If we accept this, it would seem that photographic meaning – which is acontextual according to Currie – can contribute to story meaning.
But perhaps we have confused ourselves: Is Currie’s argument rather, that if blue, in this case, is seen as acontextual, as pure photographic fact, then it cannot engage in story meaning, but if we move beyond its pure photographic fact, if such can be said to exist, then it becomes contextual and can engage in story meaning, as in the example just given? But then, since cinema is story meaning, it cannot be acontextual, and so it is not. In this I am perhaps willing to agree. But my point is that we would be better off not to mention acontextuality at all as meaningful components of either natural language or cinema language. Natural language does not, I would argue, convey story meaning in an acontextual way. If we look at the word ‘dog’ as pure photographic or textual fact, that is, as three letters forming a compound and nothing more, then the same meaninglessness appears. It is thus Currie’s premise itself, rather than his reasoning from it, that I deem flawed.


Cutting to the bone of the issue, Currie argues that “the fundamental dis-analogy between language and all pictorial modes of representation” is that “it is not possible to identify any set of conventions that function to confer appearance meaning on cinematic images in anything like the way in which conventions confer (semantic) meaning in language” (Currie 1993: 214). To review his argument, let us consider the semantic meaning of the word ‘rabbit’. What does the word refer to? That it does not refer to anything in particular is perhaps the most reasonable thing to say. Rather, it is an idealized reference that only becomes meaningful in a certain language and frame of reference. As Saul Kripke reasoned at length in Names and Necessity (1980), words do not refer to concrete properties as much as to elastic, ideal mental objects. The word ‘rabbit,’ without a frame of reference, does not denote a specific object, but a certain undefined spectrum of possibility and necessity. This ‘quasi-object’ status – as Carnap (1937: 51) termed it – can be defined as being consistent with the concrete object across all the relations it might appear in, but not because its meaning is recursive or acontextual, but rather because of a certain perceived regularity of nature which has allowed humans to impose on nature the idea of it being compatible with us dividing it into classes, such as the class of all rabbits. The allowance of noise in our concepts makes room for a degree of forced permanence in the correlation of perception and language. If the word or the concept ‘dog’ stands robustly, it does so because it is an object in a language, rather than an object outside of language, but it could not be an object in a (more or less universal) language if it were not an at least partially general and recurrent object. Similarly, we might say it is a convention that the semantic meaning of ‘the statue of liberty’ is found by reference to the object that is the statue of liberty – but what exactly constitutes the object? In both images and natural language, convention and semantic meaning are noisy at best. It is for this reason, to yet again return to our main argument, that it is best dealt with statistically. In this view, any concrete object in an image is in a sense a quasi-object as well, as defined by Carnap, so long as it contains any degree of familiarity to other entities in the visual world in any one of its properties. Thus, the extreme version of the ontology of depiction found in André Bazin’s famous essay “The Ontology of the Photographic Image” (1960) – that the photographically depicted object is the object rather than a representation of it – even if accepted, tells only half the truth. Most objects we choose to designate and include in our language can tolerate large degrees of noise exactly because they are not necessary and specific objects in nature. It is for this reason that I say that cinema is more like language than contemporary thought seems to acknowledge. And so, while Currie wants us, the defender, to argue that cinema is acontextual, I would rather say that it is the fact that it is not acontextual that makes it function in similar ways to natural language – at least not in any absolute sense. More precisely: both cinematic language and natural language are acontextual, as well as conventional, to a degree. In conclusion, this invites our statistical view. 
Before turning hereto, I will briefly turn to an example given by Noël Carroll in his book The Philosophy of Motion Pictures (2008), which allows us to discuss a proposition of relevance to our aims of formalization which is not dealt with in depth by Currie, namely the proposition that images cannot, by definition or in any degree, be canonically decomposed. If Carroll is right, then any claims to the validity of a visual formalization practice are more than susceptible to being based on a fallacious premise. Put simply, Carroll’s statement is that “you can’t break down the moving picture of (a) tall man into its component parts in the way that you can analyse the sentence into minimal units such as “tall” and “man”” (Carroll 2008: 104-105). To see how this might be untrue, consider the simple observation that Carroll, in his example, decomposes his own mental image of a ‘tall man’ into the parts ‘tall’ and ‘man’ only to subsequently argue that one cannot break the image in discussion into
the properties 'tall' and 'man'. Logically speaking, if the words or properties 'tall' and 'man' did not have a quality apart from their compound meaning in the instantaneous state of affairs, it would be difficult to explain how Carroll could use the two words separately without committing a fundamental act of self-contradiction. Even if it is accurate in an absolute sense, as Carroll argues, that images are not canonically decompositional, I would argue that many decompositions of a compositional image – in so far as the depiction contains familiar objects and properties – are far from arbitrary. As we have just seen, Carroll in fact proves this point himself by choosing the words he does. The word 'tall', which is the word Carroll chooses to describe a certain property of relative height, is, while perhaps represented, not actually depicted in the image. It is thus not an objective property of the image, but exists as part of a canonical decomposition of the image made by Carroll himself. Thus, whenever the property 'tall' is inferred as a property or 'component part' of the image, as it is by Carroll, the image becomes general, language-like, and, at least to the one making the inference, conventionally or canonically decomposed. If we argue, then, contrary to Carroll, that parts of images – which, as inferred properties in the image, are themselves made language-like – can relate to parts of other images in a way that is, while not fully canonical, often far from arbitrary, this decompositional thesis of the content of images does not speak against the compositionality or compound nature of images but in fact favours it. By following such a relational logic, in which particularity does not equal uniqueness, and compositionality does not equal holism, there is no need to choose rigidly between a decompositional view of images and Carroll's "all-at-once" theory of how images work, which, as we have seen, is liable to result in contradiction.

To sum up, the view that emerges from our evaluation of Currie's and Carroll's thought is one in which the language of cinema turns out to bear important similarities to the way natural language functions. But it is also a view in which those similar properties, such as conventionality, acontextuality, and canonical decompositionality, do not operate in any absolute sense, but come in degrees. It is for this reason that we argue that the language of cinema intuitively invites itself to be treated within the framework of statistics. In this new framework of language, probabilistic negation comes to play a crucial part as the manifestation of constantly fluctuating degrees of regularity in the occurrences and co-occurrences of parts and their compound configurations. Applying the statistical view means shying away from imposing a model of a pre-existing language onto cinema. Instead, the job is to discover, not language, but languages of cinema, by means of first and foremost observing and determining which relations and properties exist in a cinematic world where that language is to have utility. Viewed as such, language is not synthetic, but consists of exactly the contextual relation between world and word – it cannot do without either. This view is not very far from Ruth Millikan's view of perception when she says that "Perception is the interpretation, a translation into mental representation, of (informational signs) found among patterns in the energies to which the outer sense organs are sensitive.
The process involved in interpreting a language is of this same kind. Setting aside several tangential peculiarities of common verbs of perception, the likeness between perception of the world and linguistic understanding is strong” (Millikan 2017: 220). As such, “the preliminary and likely much the most difficult problem for cognition is not that of artificially classifying but of locating and identifying real properties and real entity clusters in the distal world, then in learning to reidentify or “sametrack” them when encountered again in experience so as to find and follow fruitful paths for induction” (Millikan 2017:

7). Of course, in an absolute and radical sense, nothing is ever encountered again, mirroring the oft-stated fact within probability theory that any attempt to calculate the probability of particular events is meaningless. Inference in its various forms thus presupposes the idea of degrees of permanence and likeness of structure. As Keynes argues in his book A Treatise on Probability, there exists a principle – Keynes calls it The Principle of Limited Variety or The Principle of the Uniformity of Nature – in the external world which makes properties “cohere together in groups of invariable connection” (Keynes 1929: 256). In other words, to form reasonable inductions we “need an assumption… that the amount of variety in the universe is limited in such a way there is no one object so complex that its qualities fall into an infinite number of independent groups” (ibid.: 258). It is far from obvious, however, how to delineate such structures even if one agrees that they exist. The fact that cinema, much like life, is enormously complex means that its informational content and density are enormous, and far greater than those of natural languages. Unlike languages that have words, and thus obvious degrees of redundancy, no two images, or objects, or percepts are identical. Further, the number of variables in complex systems such as perception, or a live event, or even a film, is quite literally endless, resulting in what at some level appears to be a complete non-redundancy of information. This in turn makes such processes extremely difficult to code, reproduce, or model with any sort of precision. This is why we have developed various languages, mathematical, natural, or otherwise, to describe the world. But just as we allow ourselves to use different languages to map and model life, we can do so for cinema. Is one, however, sure to find a useful degree of regularity when engaging in such a process of modelling cinema? Chaotic or complex systems, which behave in constant asymptotic manners, are close to being non-languages for the reason that no repetitive patterns occur, and without regularity, a language cannot develop. The graphs below depict a two-hour financial time series:

[Figure: a two-hour financial time series]

Not too surprisingly, such chaotic financial time series look much like a graph for the messy shot distribution (shot duration) of a film we are going to analyse later. In cinema, however, such irregularity in one variable need not cause the spectator to lose completely the ability to make inferences. Irregularity in one variable might disappear when conditioned against another. It is such granularities we are going to consider when introducing our probabilistic system. To sum up, we agree with Davidson when he says that: we cannot confidently ascribe beliefs and desires and intentions to a creature that cannot use language. Beliefs, desires, and intentions are a condition of language, but language is also a condition for them. On the other hand, being able to attribute beliefs and desires to a creature is
certainly a condition of sharing a convention with that creature… (but) convention is not a condition of language. I suggest, then, that philosophers who make convention a necessary element in language have the matter backwards. The truth is rather that language is a condition for having conventions. (Davidson 2001 [1984]: 280)

Summary of Part I: Language, perception, statistics

The discussion of language contra cinema contra naturalism carried out above bears more directly on the question of formalization than perhaps seemed at first sight: for if images refuse submission to canonical forms of decomposition – structurally as well as conceptually – then our concept of the minimal viewer and its aspirations of formalization are a methodological mishap from the start. But if we agree to the naturalist view, which we do, but still hold that cinema is a language, then we must argue that the contents of perception – and such contents are generally argued to exist (see Siegel 2010) – hold a more than tangential relation to language. But what evidence is there for such a claim that perception, language (or concepts) and embodiment are ultimately tied together? As it turns out, there is more than a little evidence. Andy Clark and Gary Lupyan have argued that “words and larger verbal constructions are special kinds of perceptual inputs” (Clark & Lupyan 2015: 283), and that “words activate neural patterns overlapping with those activated by nonverbal sensory inputs” (Ibid.: 283). Correspondingly, Zwaan (2014) has argued that the gap between the symbolic and grounded views of language and cognition can be bridged if one consents to a more connectionist view of the relation between language and perception. Such a view could, importantly – at least in my view – help bridge the banners of formalism and language with that of naturalism. If language-use is an instantiation of a (more or less weak) conceptualization of a percept as it occurs ad hoc in one’s mind, this makes language more than superficially contingent on the contingent fluctuations of life. But does it go the other way around as well? Does perception hinge on language or something akin to language? More than sporadic evidence seems to suggest that it does. Contemporary neuroscience harmonizes reasonably well with the notion that emotion and cognition, or the sub-cortical and cortical parts of the brain, in fact cohabit in a dependent and interconnected way (see Clark (2017) for a review of the matter), but there is a lack of consensus on just how the relation plays out. A popular recent idea is that the foremost aim of the brain is to find the best adaptive strategy to the environment in order to minimize prediction error. It is perhaps common sense that such a strategy profits from having low-level perceptual processing and attention (the product of the evaluation of visual salience) be under the influence of priors – that is, to keep track of past probabilities that a given hypothesis about data of the sort under current inspection has turned out to be correct. But there is another factor at play, equally well recognized in the literature. Once a brain has formed a model, it exhibits a certain bias towards adhering to this model. This means that cognitive bias not only occurs in higher-order cognition, but that even “strong biases in perceptual estimates arise as a consequence of a preceding decision” (Stocker et al. 2007: 1). Briefly, the brain has a model which it does what it can to adhere to, creating a “circular perceptuo-motor causality” (Clark 2017) which helps ensure both self-consistency (Stocker et al. 2007) and energy-efficiency. What all this means is that even low-level perception and processing are conditioned on current sensory data, on priors, and on internal self-sustaining models.
If true, this suggests two things: one, that the brain structures its environment much like a language

– that is, in a certain discriminatory manner – and two, that language, in whatever form (physical gestures, visual language, or natural language), is learned statistically. It is this thesis we now turn to.

Statistical language learning and the Bayesian brain

In this chapter, we turn more formally to the apparent relation between learning – especially language learning – and statistics. As Gary Lupyan notes, if one applies a statistical and embodied rather than conceptual view of language, one can better see “how apparent stability emerges from pervasive variability” in the word-object relation (Lupyan 2015: 561). In fact, “virtually all methods for language modelling are probabilistic in nature” (Charniak et al. 2016: 7). Further, it has been argued that statistical learning in the brain is domain-general (Rebuschat 2011). Thus, rather than functioning separately, language co-evolves with vision and sound. This means, as Misyak et al. (in Rebuschat 2011) argue, that incidental learning of structure in the world is akin in process to more rigid statistical learning cases such as natural language acquisition (Misyak in Rebuschat 2011: 16). From the vantage point of learning, this more than suggests that language and perception are much alike. In fact, it has been made quite plausible by empirical studies that one of the ways in which infants learn language is by inferring conditional statistical structures or cues in the distal environment. Such cues, which lead to the formation of language, are multimodal. While focusing on infants perhaps helps make the case evident, the same learning habit appears to hold for adults who speak the same language: we can sense the meaning of a word, not by knowing its dictionary definition, but by having perceived the ways in which it is used, and we do so by inferring a variety of cues. As such, “within any sequentially distributed input, there are a priori potentially many statistical cues available to the learner: (simple) frequency, co-occurrences, transitional probabilities and ‘conditional probabilities’, more generally, which can describe nonadjacent relationships, and higher-order conditionals (e.g., second-order, third-order, . . . nth-order probabilities)” (Ibid.: 22). In this view, “statistical learning is fundamentally two complementary processes: the extraction of coherent units and integration across those units to induce further structure” (Thiessen 2015: 101). This merely reinforces the view that no compositional axiomatic theory of meaning for every part of a given language is needed to infer meaning and learn efficiently. Instead, in this broader statistical view, “learners have to figure language out: their task is, in essence, to learn the probability distribution P(interpretation|cue, context), the probability of an interpretation given a formal cue in a particular context, a mapping from form to meaning conditioned by context” (Ellis 2011: 23). If we accept that such a probabilistic model is reasonable, this also means that we should not talk about causality or causation in the usual sense. Within the theory of probabilistic causation, one talks instead about degrees of causal strength. If A, given some context, then B with a certain probability. A statistical delineation of such co-occurrences will seek to compute a stochastic grammar rather than a formal grammar. While Jurafsky (1996) has argued that language processing and construction could be usefully modelled in a probabilistic framework in which “a single probabilistic mechanism underlies the access and disambiguation of linguistic knowledge at every level” (Jurafsky 1996: 186), Mumford et al.
(2006) have argued that stochastic grammars are equally applicable to analysis of visual material. They write: “The origin of grammar in real-world signals, either language or vision, is that certain parts of a signal s tend to occur together more frequently

than by chance. Such co-occurring elements can be grouped together forming a higher order part of the signal and this process can be repeated to form increasingly larger parts. Because of their higher probability, these parts are found to re-occur in other similar signals, so they form a vocabulary of “reusable” parts” (Mumford et al. 2006: 278). Importantly, from this vocabulary of reusable parts, configurations across time gain affinity and resemblance by means of shared parts. Our cognitive “algorithms,” it seems, can extract meaning through detection of such patterns, without necessarily knowing what the meaning is. The reason is that, for even primitive, low-level processes, the brain has an inherent interest in inferring the causes of sensations and observations in the environment. This view has close ties to a wider field which is gaining substantial traction in various fields, namely the idea of the probabilistic or Bayesian brain.
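To make the notion of elements co-occurring “more frequently than by chance” concrete, the following minimal Python sketch compares observed co-occurrence frequencies with the frequencies expected under independence. The time-blocks and the parts they contain are invented for illustration and are not taken from any film or study cited here.

```python
from itertools import combinations

# Toy sequence of time-blocks; each block lists the "reusable parts" observed in it.
blocks = [
    {"street", "car", "rain"},
    {"street", "car"},
    {"kitchen", "coffee"},
    {"street", "rain"},
    {"kitchen", "coffee", "radio"},
    {"street", "car", "rain"},
]
n = len(blocks)

def freq(part):
    """Relative frequency of a single part across all time-blocks."""
    return sum(part in b for b in blocks) / n

def co_freq(a, b):
    """Relative frequency of a and b appearing in the same time-block."""
    return sum(a in blk and b in blk for blk in blocks) / n

# Compare observed co-occurrence with the frequency expected under independence.
for a, b in combinations({"street", "car", "rain", "kitchen", "coffee"}, 2):
    observed, expected = co_freq(a, b), freq(a) * freq(b)
    if observed > expected:
        print(f"({a}, {b}): observed {observed:.2f} > expected {expected:.2f} "
              "-> candidate for grouping into a higher-order part")
```

Pairs whose observed frequency exceeds the expected one are precisely the kind of groupings that, on the account quoted above, get promoted to higher-order, reusable parts of the signal.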

The Bayesian brain hypothesis and the probabilistic mind

As we have already stated, most environments or systems, natural, cinematic or otherwise, tend to be principled enough to allow for degrees of successful induction, but uncertain and stochastic enough to require the application of a sensitive learning mechanism to be understood. As it turns out, such a sensitive learning mechanism would be more than likely to benefit from familiarizing itself with the rules of probability theory. In the words of Mathys et al. (2014): “Probability theory formally prescribes how agents should learn about their environment from sensory information, given a model. This rests on sequential updating of beliefs according to Bayes Theorem, where beliefs represent inferences about hidden states of the environment in the form of posterior probability distributions. It is this process that we refer to as perception” (Mathys et al. 2014: 1). What has just been described is, essentially, the Bayesian brain hypothesis, which at its most basic level seeks to connect Bayes Theorem, the most famous theorem to prescribe how to rationally update beliefs in light of new data, with general theories of perception. The theorem is explained and derived formally in Appendix 2. For our current needs, the important point is that, when employed, and contrary to frequentist theories of probability, the theorem gives an updated subjective probability estimate of a hypothesis. More accurately, the probabilistic update is proportional to the likelihood of the new data times the prior probability of the hypothesis under question. Computing along the lines of such a model, the brain rationally responds to and learns about a changing environment. The idea that the human brain is a form of inference machine goes at least as far back as Hermann von Helmholtz (1866). The difference between then and now is that, in fact, for many (simpler) inference processes, we now know that our brains do compute something akin to probability distributions. How do our brains do that? “Until fairly recently, the classical assumption was that they didn’t. Instead, neural activity was thought to encode a single value, such as the direction of motion of an object or the identity of an object (the latent variable)… Over the last two decades, however, several groups have proposed that neural activity encodes functions of latent variables, as opposed to single values... If this is the case, then neural computations must manipulate whole functions, and must do so according to the rules of probabilistic inference” (Pouget 2013: 1172). But if we accept initially that the brain, at least for some computations, learns about its environment in relative agreement with the rules of probability theory (it does not always do this; see next paragraph), then we need to ask what the brain conditions and computes over, that is, we need to ask: what does the brain make inferences about? This question brings us into the nature of vision. The literature seems to suggest that the brain segments or decomposes its visual environment into

useful classes and recombines them into precision-weighted cognitive models. This problem is called the binding problem, which in turn is split into the segregation and combination problem. By studying “cue-combination experiments” it has been suggested “that subjects indeed seem to perform such probabilistic decomposition” (Stocker et al. 2007: 6). A recent study (Zeki et al: 2014), however, suggests that the binding of different visually salient features – such as ‘red’ and ‘the shape of hair’ – into a whole – the compositional part of the binding problem – does not (largely) happen in early stages of perception but is a largely post-perceptual process. This means that cognitive binding relies on memory, which in turn means that it is task-contingent. More probable, however, is that the brain more generally relies on hierarchical dynamical models. Indeed, many studies, e.g. Knill (2007), Knill (2012), and Mumford et al. (2003) suggest that inference through cue integration follows hierarchical Bayesian models and that the various level processes are linked as in a Markov chain. There are a lot of new words here. For now, suffice it to say that the general idea of the above is that the brain computes the best probability estimate of the generating model of the sensory data at any given level. As such, “the system as a whole moves… toward an equilibrium in which each xi (visual variable) has an optimum value given all the other x’s. In particular, at each point in time, a distribution of beliefs exists at each level. Feedback from all higher areas can ripple back to V1 (early visual cortex) and cause a shift in the preferred beliefs computed in V1, which in turn can sharpen and collapse the belief distribution in the higher areas (Mumford et al. 2003: 1436). Thus, a form of recursive computation is performed. Such inferential mechanisms are constant, and mostly unconscious; “In all areas, however, the goal is the same: compute probability distributions over variables of interest s given sensory measurements I and prior knowledge p(s)” (Pouget et al. 2013: 1171.) There appears to be evidence, then, that for a large amount of processes, the brain behaves like a rational inference machine. However, evidence also suggests that, when cognitive computation reaches more complex tasks, the picture painted above is too neat. As such, there is much uncertainty as to the principles by which the brain predicts at higher cognitive levels. Indeed, evidence suggests (see, for instance, Pouget’s (2012) paper “Not Noisy, Just Wrong”) that inference becomes fuzzy for more complex cognitive tasks. Such a view is far from new. Kahneman (2011), Gigerenzer (2016), and Festinger (1957) have repeatedly made the scientific public aware that heuristics might be just as foundational an inference model as the process of proper statistical evaluation. As the last grain in the already gritty image, there is evidence that even low-level visual computation often occurs with a certain amount of error and epistemic friction. Following Mathys (2014), there are two sources of uncertainty which can obscure prediction processes: “First, even when states are constant, the amount of sensory information will in general be too little to infer them exactly. This has been referred to as informational uncertainty or estimation uncertainty (…) The second source of uncertainty is the possibility that states change with time, i.e., environmental uncertainty” (Mathys 2014: 1). 
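As a purely illustrative sketch of the updating scheme described above (the two candidate causes and all of the numbers are invented, not taken from any of the cited studies), the snippet below multiplies priors by likelihoods and renormalizes, yielding a posterior distribution over the hidden causes of an observation.

```python
# Two candidate hidden causes of a sensory observation (hypothetical values).
priors = {"cause_A": 0.7, "cause_B": 0.3}          # beliefs before the observation
likelihoods = {"cause_A": 0.2, "cause_B": 0.6}     # P(observation | cause)

# Bayes' rule: the posterior is proportional to likelihood times prior.
unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
evidence = sum(unnormalized.values())              # P(observation)
posterior = {h: p / evidence for h, p in unnormalized.items()}

print(posterior)  # roughly {'cause_A': 0.44, 'cause_B': 0.56}
```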
The emerging view in the literature which can unite the clean statistical or probabilistic view of the brain and the fuzzy heuristic and biased view is one in which the guiding principle of the brain is not to predict per se, but to reduce uncertainty as caused by prediction error of values of observed quantities in the distal world. In the words of Clark, the “core operating principle is the reduction of precision-weighted prediction error… High-precision errors enjoy greater post-synaptic gain and (hence) increased influence. Conversely, even a large prediction error signal, if it is assigned

extremely low precision, may be rendered systemically impotent, unable to drive learning or further processing” (Clark 2017 (II): 3). This view, in which statistical learning, understood as the optimization of pattern recognition and corresponding action, allows the brain to adapt and to minimize uncertainty and prediction error (weighted by precision of optical measurement), conceptualizes the brain as biased, but as biased for a self-preserving reason. What is true in all cases is that the brain will choose the function which is expected to minimize loss of information or surprise. We have thus offered a view in which active inference means taking action that reduces uncertainty about the environment. However, there is a second factor at play in reasoning and inferential processes, which we briefly mentioned earlier, namely that of model bias. The most popular proponent theory in this regard is Festinger’s (1957) cognitive dissonance theory. Festinger’s work has shown quite uniformly that our brains are likely to reduce discomfort or uncertainty posterior to a decision being taken. What this means is that, at the cinema, for instance, once an audience member has established an interpretative model, he or she is likely to want to preserve that model. A more recent empirical study showed that “accumulating more probabilistic evidence of more complex conditional dependencies has a cost, both in terms of storage, and in terms of the computational load of performing subsequent inference. Thus, discarding information after making a decision can help to keep this storage and the computational complexity at a manageable level” (Stocker et al. 2007: 6). We might ask if such model bias is likely to only exist at a higher cognitive level. It is known that the brain plays some sort of evolutionary game: whenever it predicts well – or, whenever no data clearly prevents the plausibility of the prediction given the model – the model that assured that prediction is strengthened synaptically in a very concrete way. In fact, Suchow et al. (2017) have shown that the standard evolutionary fitness function is isomorphic with Bayesian induction. What this seems to suggest is the intuitive notion that all organismic systems are naturally self-preserving. In the increasingly popular terminology of neuroscientist Karl Friston, the idea is that the brain seeks to minimize variational free energy. The term free energy is borrowed from thermodynamics, and, within that paradigm, translates into unused energy. In the framework of the probabilistic brain this means that, when free energy is low, model evidence is high, and when model evidence is high, the brain can explain the cause of its sensations and give accurate explanations for observations by means of active inference in the form of a generative model. What does the brain do when model evidence is low? The answer, perhaps quite logically, is that it acts or makes decisions. In the words of Friston: “Crucially, action can only affect accuracy (not complexity). This means the brain will reconfigure its sensory epithelia to sample inputs that are predicted by its representations; in other words, to minimize prediction errors.” (Friston 2017: 196). This is where active inference, and Bayes Theorem, come into play. So, it seems we have two options: “We can either change our expectations or predictions (perception) or we can change the things that are predicted (action)” (Ibid.: 198). Here in closing, we might ask: what about emotions?
What about a situation of sublime experience? Does the brain perform inference in that state? It might be worth noting here that there is no certainty as to whether the Bayesian brain hypothesis is universal. It is, in short, a theory rather than a principle. Various scholars such as Pelowski et al. (2017) have noted that strong affective experience during the perception of artworks is likely to activate the Default Mode Network in the brain usually activated in states of rest. However, they also note that the DMN is likely activated because of maximal congruence between internal schemas and experience resulting in a hedonic

reward. Thus, such apparent non-inferential experience is likely entangled with, or a product of, inferential mechanisms. Let us summarize. A good part of this thesis has argued for the fact of a generality in natural environments and non-natural structured environments, such as cinema, thus allowing for primitive induction and statistical language learning. In recent pages, however, we seem to have argued for the foundational role of epistemic uncertainty. Have we thereby not contradicted ourselves? The point is that any organization which is not completely deterministic but has a stochastic quality, could be reasonably cast in a statistical light. Thus perceived, uncertainty and typicality are elements on the same granular scale, and thus no contradiction has taken place. But then, if one agrees that language learning, and learning in general, is primarily statistical, giving language, as it is instantiated in use, a stochastic quality, then how does one go about formalizing such languages? It is this question we now turn to, in which we introduce our primary method of decomposition, and our primary probabilistic conception of the language of cinema.

Formalization and modelling of cinematic languages

If by now we have a set of plausible hypotheses, that a) cinema is a form of language, namely a stochastic one which therefore needs to be learned, that b) language and language learning are statistical in nature, and that c) the brain is a statistical organ, then a couple of important questions still stand, namely: how do we choose which properties, objects, and variables to model? And how do we formalize and designate the variables which we choose to model? Before I try to answer these questions, I should perhaps briefly discuss the extent to which I believe these questions have already been answered – or at least asked – in cinema studies and elsewhere. The largest field in which actual modelling of images takes place is the field of Machine Learning (ML). Even if our approach is manual rather than automated or computerized, the logic and method of (moving) image understanding applied in the ML literature is a close relative of the philosophy of (moving) images defended above. The ML literature converges on the idea that images can be naturally represented and learned as a stochastic grammar in which some configurations are more likely to exist and persist than others. This implies, amongst other things, a strict adherence to the concept of the (de)compositionality of images. But what do visual compositions consist of? To learn the concept “green” not only in isolation, but as appearing in various configurations, it must be understood as carrying a certain permanent characteristic across time. Thus, Machine Learning relies on the principle and method of hierarchical decomposition. Most basically, this means that an image is decomposed into its reusable parts. We turn to the meaning of the concept “reusable part” already in the next section, where it comes to play a rather crucial role. I acknowledge here simply that it is a term borrowed from Machine Learning terminology. If we turn to film theory, the two major existing approaches to the modelling of visual data are a discourse-analysis inspired approach (e.g. Bateman (2012), Wildfeuer (2014), Chiao-I Tseng (2012)), and the field of cinemetrics (e.g. James Cutting et al. (2015)), the latter largely concerned with the modelling of shot-length dynamics over large corpuses of films. Even if these film theoretic approaches are related in spirit to our own approach, they stray too far from our path to be introduced here. The same goes for the field of Machine Learning, beyond the very brief

introduction given above. If this were a book-length thesis, I might have surveyed these various approaches in more depth, but, to keep a strict focus, and given that I believe my own approach is distinctive enough to stand on its own, I instead refer any interested reader to Appendix 1, in which a survey of the approaches mentioned above can be found. I now propose the main philosophy, logic, and method of formalization to be applied in this paper. I propose that a filmic sequence can be modelled and formalized, even if in a non-exhaustive way, by means of the language of statistics. This entails the proposal that the content of moving images is compositional in nature and that its compositions across time are configured and strung together by reusable parts which reside in various hierarchies in perception and cognition, starting with pixels and ranging all the way up to symbols and concepts. As promised, I will first deal with the problem of formalization. Secondly, once the method of formalization has been introduced, I turn to the problem of inference. This includes a very brief introduction to probability theory and a subsequent introduction to how cinema, when decomposed and formalized, can be properly envisaged as a stochastic language. But first, we need to ask: what variables should we, and can we, formalize?

What variables do we formalize?

To begin to answer the question posed in the title of this section, it will be useful to return to the concept of reusable parts. The term, as it originates from Machine Learning, rests on the theory that the world is built up of a small number of geometric parts which, analogous to LEGO bricks, can engage in various relations whereby observable constructions and configurations arise. A large part of Machine Learning studies, and studies on computation by image primitives in general, refers to Irving Biederman’s classic text Recognition-by-Components: A Theory of Human Image Understanding (1987) and its theory of primitive visual geometric elements called geons. Biederman’s idea is that, with just 36 geons, 154 million possible three-component objects with five possible relations at edges between geons (curvature, collinearity, symmetry, parallelism, and cotermination) can be created, well exceeding the number of objects human beings are known to know (likely well under 100,000). In theory, these geometric “image primitives” can be used, more or less recursively, to reconstruct entire images, as demonstrated in the image below:

[Figure: an image reconstructed from geometric primitives (Mumford et al. 2006: 303)]

While most fundamentally, reusable parts are thus low-level geometric properties analogous to the fundamental building blocks of the visual universe, in which case “a possibly small number of reusable parts might be sufficient to compose the large ensemble of shapes and objects that are in the repertoire of human vision” (Geman et al. 2006: 1), the term can be extended beyond low-level properties to span objects and higher-order concepts as well. To see how such a philosophy might work, let us review a thought put forward by Hilary Putnam: “Suppose I have a sensation E. Suppose I describe E; say, by asserting ‘E is a sensation of red’. If ‘red’ just means like this, then the whole assertion just means ‘E is like this’, (attending to E), i.e. E is like E – and no judgment has really been made… on the other hand, if ‘red’ is a true classifier, if I am claiming that this sensation E belongs in the same class as sensations I call ‘red’ at other times, then my judgment goes beyond what is immediately given” (Putnam 1981: 62). By following this logic, ‘red’ has become a reusable part. To accept such reasoning as valid is to accept, essentially, the conclusion we arrived at in our discussion with Currie and Carroll. The term ‘reusable part’ is thus not only a convenience principle which allows one to designate anything which can reasonably be said to have the ability to reappear across time in various relations and configurations, it is also a proper philosophy of how the world is in fact constructed. As should be clear, however, to formalize at this more granular level means to engage with a new set of problems. While the upshot of formalizing at the object-level is that the amount of data is significantly reduced, the downside is that one must now not only deal with the question of how to find the right object-granularity – one must also consider how, as uncontroversially as possible, to designate the reusable parts within the chosen spectrum of granularity. Let us look at an example from the literature to consider how it might be done for just a single image:

[Figure: hierarchical decomposition of a single image into parts (Mumford et al. 2006: 262)]

Worth noting here is that the decomposition is hierarchical, that is, dependencies are already inferred. Worth noting is also that many more reusable parts could potentially have been chosen – such as the distinct colours present in the image beyond the colour green. To see what effect such decisions about leaving out variables can have on results once we turn from single images to relations across sequences of images, consider a moving image depicting a wave moving across the ocean. Consider also that it is followed, at some later point in the narrative, by a shot of a wave in a bathtub. What reusable parts might reside in this scenario? One can imagine various ways of formalizing the shots. Let us initially designate the scenario in the first shot as the total observed action, say, ‘wave moving across the ocean’. This would mean that, later, presented with the moving image of an artificial wave in a bathtub, no similarity is accounted for between the two shots. If we,

on the other hand, designate the first shot as the larger set of reusable parts (water, wave, ocean, moving, across the ocean, wave across the ocean, wave moving across the ocean, etc.), the higher granularity allows for a more fine-grained comparative network of relations. In this new scenario, the two scenes are connected by the set (wave, water) – and possibly other factors. This means that, in a statistical perspective, the frequentist probability of seeing a wave increases for any observer, but the probability of observing the co-occurrence (wave, ocean) decreases. Even in this simple example, the idea of “wave” has achieved a certain complexity through the logic of compositionality and reusable parts, which our brains are likely to expend energy on processing. The point we make here is that, once you designate, class and particular instantiation of class cannot, and perhaps should not, be clearly separated. But the point we make too is that there is no way to determine exactly which classes or reusable parts the two “watery” instantiations – the wave in the ocean, and the undulation in the bathtub – in fact share with each other. The only certain thing to say is that what they share is, or rather, becomes, a reusable part, and that if we allow some noise in our designations of objects – following Putnam’s thought experiment and the example given above – then the number of reusable parts, and thus the depth of analysis, significantly increases. But if we can tentatively agree to this philosophy of maximising, under reasonable constraints, the number of possible reusable parts, then we face the problem of granularity head on. Since what reusable part is too small, too insignificant, to be included? Can the depth of analysis be too deep? We might here recall Keynes’ principle of limited variety, stating that “there is no one object so complex that its qualities fall into an infinite number of independent groups.” But if we flip that argument, we could argue that there might be an object so simple, so small, or so universal that its quality falls into an infinite number of independent groups. Unfortunately, to designate at such a level would mean a computational prohibition ending only in regress. If we refer to a tree, what do we really refer to? The stem, the leaves, the bark, the colour green, the colour brown, the atoms it is made of? The problem of regress was one Christian Metz was well aware of in his book Film Language (1974), in which he attempted to lay out the principles of how to formalize the moving image. To stumble upon the problem of regress, means, according to Metz, to stumble into a terminal state: “In the current state of the semiotics of film… it is impossible to locate precisely the threshold separating elements we call “large” from those we term “small”… does the threshold extend to the aspect of the filmed object (the “color” and “size” of an automobile; the “violent roar” of the train sound)? And if this is the case, how, again, is one to isolate these aspects? Perhaps, even, it is located in the parts of the filmed object (the “hood” of the car, the “beginning” of the train sound) How are these to be handled? For the time being, these problems all seem insoluble” (Metz 1974: 140), and continues: “when (semiotics) reaches the level of the “small” elements, the semiotics of the cinema encounters its limits, and its competence is no longer certain” (Ibid.: 142). But is there a need to reach these “small” elements? 
What is important to note here is that the need to reach ‘the smallest unit of meaning’ is perhaps only a vital ingredient to arguing for the existence of, and formalization of, a language of cinema so long as one believes that the concept of atomism is an indispensable part of any language. So, the answer to the resolution or granularity problem might be to simply operate with a coarse and fixed resolution. We will introduce the further requirement that objects or properties have an autonomous function. That is, if the finger acts in accordance with the hand, then only the hand is designated. And if the hand only moves in accordance with the body, then only the body is designated. If we apply these two simple rules, we have taken a first step towards avoiding the problem of regress. There is nothing inherently “right”

about this decision, however. The most important problem is not to find the right granularity, but to simply choose one, which then determines the number of reusable parts, and thus the number of possible relations and configurations. This is what I mean by choosing a fixed resolution, and one that is computationally viable. To introduce one last principle, which is perhaps the most difficult to operate with, we distinguish between a) hidden states and b) visible observations. This distinction is one typically made within modelling frameworks and is widely used in the ML literature. It is, however, not obvious how to draw the line between the two concepts. A hidden state could be a character’s mood, and the visible observation a smile. But one could equally well argue that “smile” is a hidden state, while the actual geometrical expression of the mouth area is the observable. Even further, one could argue that even that is a hidden state, and that certain wavelengths of light are in fact the true observables. But this is simply another type of regress which needs to be avoided. Let us introduce an example to clarify the logic. Given a filmic representation of a “party” on screen, we should ask if “party” figures on the screen as a visible observation, or if it is in fact a hidden state. One could argue, as we will, that what we see are various observables, such as a gathering of people, drinks, music, etc., and that what we infer is the hidden state “party.” The result of such a logic is that, while the image represents a party, it does not depict one, nor does it resemble one. This is not to say that hidden states or variables cannot have causal powers (in terms of story events), since that would be a false statement. We merely suppose, initially, that a set of observables is given from which inference can follow. This, in other words, is the data that the spectator conditions against. We have now introduced the most pertinent problems tied to the practice of formalizing images, and offered some tentative solutions for how to deal with those problems. We can now look more precisely at the variables we are going to track through time. At the most basic level, cinema is a three-tuple consisting of space, time, and sound. Given a higher resolution, however, more variables appear, as the table below shows. Certain constituents of the medium are not included, since we deal here with variables, rather than ontological properties of the medium. Variables thus always have particular instantiations. The column on the right represents variables not considered in the model to be developed in this thesis:

Variables included in the model           Variables we choose not to model
Location                                  Adjacency
Character                                 Region/location
Objects (mise-en-scène)                   Relative size in image and relative to other
Sound objects                             Dialogue
Camera movement                           Spatial and hierarchical configuration of parts
Framing
Action/event
Visual properties/stylistic operators

To model the variables found in the right column, while no doubt important to the structure of any film, would require a precision of measurement which is difficult to achieve without the assistance

of computer technology. The variables on the left are more easily modelled by manual means, and are thus included. Before I introduce the method that more concretely allows us to formalize cinema and analyse it statistically, we need to look at one last methodological problem, namely that of the time dimension. Film appears as continuous time to the human eye, but is in fact discrete. Nowadays, typically, 24, 25, or 30 frames are projected per second. Designating and formalizing at the frame level, however, is neither practical nor computationally feasible. We will thus employ a common approximation strategy within statistics and discretize further. The dilemma is the following: If we pick too fine-grained a time granularity, we enhance model precision, but end up with a large amount of data. If we pick too coarse a time granularity, we lose information and thereby forfeit model precision. In this paper, instead of rigidly formalizing at the shot or frame level, the intuitive rule of thumb will be to perform a new measurement whenever a visible change appears amongst any reusable part in the image. Whenever a measurement is made, a new time-block is initiated. We are now ready to introduce the main method of formalization.

Introducing the Boolean logic of formalization

The idea we would like to suggest in this sub-section is that every observation (decided to be modelled according to the principles above) can be modelled using Boolean logic. This follows from the above claim that we are here modelling objective observables, and not hidden states. Thus, at any time t, either the designated object is or is not represented on screen. The following rule can thus be applied: Given that a visible object, included in the vocabulary of reusable parts, appears on screen at time t, attach to it the number “1.” Given that a reusable part does not appear, attach to it the number “0.” Thus, at any time t, any visible object exists on screen with a certain frequentist probability relative to this model. Whenever an object appears on screen, its frequentist probability increases. Similarly, any co-occurrence probability rises conditional on any object or property it appears with at time t. Lastly, the co-occurrence probability of the variables it does not occur with, in this instantiation, decreases. This also means, controversially, that even if an object is not on screen, it still exists as a possible (but non-instantiated) part of the image, since it is known to be a possible configuration. In this view, every moment is an instantiation of a configuration from the total set of (included) variables. By following the above logic, we end up with a so-called observation or emission matrix for every time step, of which an example is depicted below:

Observation matrix

Object \ time                       t=1   t=2
Location (level 1): House            1     1
Location (level 2): Living room      0     1
Location (level 2): Garage           1     0
(Character): John                    1     1
(Character): Mary                    0     1

In the full model, every designated variable at every time t will be represented in the matrix. Since the matrix grows with every added variable, and the space of possible configurations grows exponentially with the number of variables, it is for practical purposes necessary to keep this number under control by adhering to the principles presented in the last section. To reiterate, the above logic holds only in situations in which there can be no doubt that the given observation holds for all observers. This means, as we noted earlier, that no hidden states are delineated in the matrix. The reason is simply that, at a certain point, it becomes convenient to say: given that we (assume that we) know A with certainty, what can we say about B, of which we are uncertain? Thus, for our database construction, we choose to use a Boolean system in which, using a certain intuition, we choose what to see as a non-objectionably observable state, and what to see as a hidden state. As long as we are strict in our intuition throughout – that is, not designating “house” at one point and neglecting to do so, under similar circumstances, at another – the problem of finding this distinction is not fatal. As mentioned earlier, the hardest problem is not that of finding the right granularity, or the right distinction between observables and hidden states, since there is none. The problem is rather to keep a fixed granularity and logic of distinction throughout the formalization process, under the restriction that the chosen logic renders computations tractable. By now, we have introduced the principles of formalization, and the Boolean method of modelling. We are now ready to introduce and establish the probabilistic perspective on the language of cinema, which, as it turns out, is the natural product of a statistical treatment of the Boolean dataset.
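For concreteness, the sketch below restates the example observation matrix in Python and computes the frequentist marginal and co-occurrence frequencies described above. The data are simply the two time-blocks from the table; the helper functions are illustrative rather than part of any established toolkit.

```python
# Boolean observation matrix: one row per reusable part, one column per time-block.
# Values restate the example table above (1 = on screen, 0 = not on screen).
matrix = {
    "Location (level 1): House":       [1, 1],
    "Location (level 2): Living room": [0, 1],
    "Location (level 2): Garage":      [1, 0],
    "(Character): John":               [1, 1],
    "(Character): Mary":               [0, 1],
}
T = 2  # number of time-blocks recorded so far

def marginal(part):
    """Relative frequency with which a part has been on screen."""
    return sum(matrix[part]) / T

def co_occurrence(part_a, part_b):
    """Relative frequency with which two parts have been on screen together."""
    return sum(a and b for a, b in zip(matrix[part_a], matrix[part_b])) / T

print(marginal("(Character): Mary"))                                      # 0.5
print(co_occurrence("(Character): John", "Location (level 2): Garage"))   # 0.5
```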

Inference and probability from the Boolean dataset: Uncertainty and dependencies

We turn now to the concept of probability as it relates to inference about cinematic structures evolving stochastically in time. Branigan (1992), Bordwell (1985), and others have recurrently mentioned probability as a mechanism of interest to the functioning of cinematic spectatorship, often in relation to the concept of cognitive schemas. More recently, Kukkonen (2016), with her paper “Bayesian Bodies: The Predictive Dimension of Embodied Cognition and Culture”, has addressed the notion of probability more directly. Neither Kukkonen, nor anyone else from the cognitive tradition, however, has turned to explicate the formal mechanisms underlying the concept. The work of sociologist Peter Abell and quantitative film theorist Nick Redfern provides two welcome exceptions in this regard. Abell (2009) introduces the term Bayesian narratives and deduces a simple log-measure of the strength of any hypothesis against its negation, as inferred by an agent making sense of an ongoing narrative, while Nick Redfern (2011) builds upon Abell’s work by considering various evidence models plausibly used by spectators. We should like to follow the work of these two scholars in applying the concept of probability and inference in a formal light. However, given the complexity and depth of the subject of probability theory, it is impossible to provide here a full overview of the field. For the interested reader who is already familiar with general probability theory, I refer to Appendix 2 for a more detailed introduction to the formal aspects of the mathematics of probability and stochastic modelling. Below, I provide a more general introduction to the subject.

The two major interpretations of probability theory

We might intuitively start off this section by evoking some of our earlier findings. Recalling the Bayesian brain hypothesis and the evidence on which it is founded, we found much reason to believe that one of the major tasks of the human brain is to predict the causes of sensory inputs.

We found that, given an uncertain environment, that is, given any environment, the brain must compute probability distributions over variables of interest to decrease future prediction error relative to these variables. If we now move beyond these findings and ask how the brain creates such a generative mechanism in the most statistically rational way possible, then probability theory enters the picture. Stated most simply, probability theory is the study of how to accurately quantify uncertainty. Thus, if there is a need for a language of plausible inference, that is, a language in which the utterer can remain consistent and rational with regard to his or her conjectures, suppositions, doubts and uncertainties, then probability theory is that language. As it turns out, there are quite a few interpretations of the subject within the field. For a light and general introduction to these various existing interpretations, I recommend Darrell P. Rowbottom’s book Probability (2015). Despite these manifold interpretations of the concept of probability, however, the main scientific disputes seem to have taken place between two broad interpretations: the orthodox frequentist view, and the Bayesian view. Both are logics which prescribe rules for how to conduct statistical inference, but beyond this general affinity there are important differences to be noted. E.T. Jaynes’ book Probability Theory – The Logic of Science (2002) provides a recommendable, even if slightly biased, overview of the main quarrels separating the two interpretations. Before turning to explicate the conflict between these two interpretations, and letting the reader know how we are going to position ourselves in the field, I will briefly engage in some definitional clarifications over which there is no dispute in the field, and which will be used as standard terminology throughout the rest of the thesis. Standardly, three types of probabilities are said to exist: marginal probability, conditional probability, and joint probability. As a fourth term, forming a special case of conditional probability, we have the concept of transitional probability. Let us explain the concepts by means of a few examples. If you estimate the probability of rain tomorrow, irrespective of any other predictor, you estimate a marginal probability. If you estimate the probability of rain tomorrow given that it rained yesterday, you estimate a conditional probability. If you estimate the probability that it will rain tomorrow and that it will be windy, then you estimate a joint probability. Finally, if you estimate the probability that the weather will transition from rainy to sunny on a given day, you estimate a conditional probability in the form of a transition probability. In later sections, we will return especially to the concept of conditional probability, which will come to play a fundamental role in what we will later define as the art of model selection. As promised, we can now return to engage with the two major interpretations in the field of probabilistic modelling. Recall that they are the (orthodox) frequentist view and the Bayesian view. It will be natural to engage first with the former. The frequentist view assumes that all phenomena are random processes that, if repeated long enough, will approximate a true distribution. This view might hold up for a coin-toss, in which case, if the coin is ‘fair’, the true distribution will tend towards 50/50 as time goes towards infinity. But the coin is not always fair.
The frequentist view thus becomes controversial when transferred to non-ideal scenarios. The reason is that most complex, particular events in nature are not repeatable. Asking what the probability of rain is at a particular place tomorrow is meaningless in the frequentist view, for it is an event which occurs only once and thus has no true distribution. The same can be said of any scene in a film. It is only if we abstract from the non-reusable aspects of such events and group them as recurrent types – or as composed

of recurrent types – that there is a chance of assigning frequencies to them. But while taking such a step might make the application of the frequentist view possible for a broader range of phenomena, it does not necessarily validate its inductive philosophy, since, we might ask: does the distribution of any (recurrent) event merely depend on its past distribution, or are there other determining factors at play? According to the Bayesian view, there are. The Bayesian view does not assume a true distribution as an existent in the world. Nor does it concern itself with repeatable events. Instead, it measures subjective probability, but does so according to a rational model. Put together, this means that probabilistic reasoning can be entirely rational, even if the model is a bad predictor of whatever it seeks to measure. The reason is that Bayesian models depend on available evidence as well as belief, thus positioning themselves as a semi-inductive framework in contradistinction to the fully inductive frequentist view. The Bayesian view thus leads us straight into a situation where, given uncertainty, multiple models, no one more “true” than the other, can rationally be employed by an agent. This further means that, if models are mutually exclusive, they can be said to negate each other probabilistically relative to a hypothesis. One factor which makes such multiple model scenarios very likely, if not inevitable, is the concept of priors, which is an exclusive asset of the Bayesian view. The term “prior” is an abbreviation of “prior knowledge”, and thus simply represents the “luggage” that goes into the probability estimations but which cannot be seen or inferred from the statistics of the environment. In summation, in this expanded framework, inference does not depend simply on a statistical frequentist distribution of variables in time, but on observations as well as an observer model, which, stated simply, manipulates frequencies according to prior beliefs, which in turn hinge on estimations of prior frequencies. From the above, we can sum up at least two approaches to probabilistic modelling:

1) Inference about the likelihood that a specified (prior) model has produced given observable variables (say, in a film sequence)
2) Inference about the frequentist statistics of an observation sequence as specified by the Boolean dataset

But the question is, then: how do the two interpretations given above each fit with our construction of the Boolean dataset and with inference at the cinema? We now turn to more practically address our probabilistic conception of the language of cinema and the problems attached to it.

Probability from the Boolean dataset

Given any point in time, the matrix specifies certain variables in play which have a certain distributed quality and a certain set of distributed relations. But the question is then: what is the marginal or conditional probability that any given variable appears, or the joint probability that any variable configuration appears, at any time t? As it turns out, there is no single (or simple) answer to this question. Given our Boolean dataset, one might initially assume that a purely frequentist probability can be calculated by simply dividing the number of occurrences of a given variable or variable configuration by the total number of possible occurrences. There are several problems with such a strategy, however.
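For concreteness, the naive estimate in question could be written as below (a sketch with an invented column of observations); the following paragraph explains why such a count, taken on its own, settles very little.

```python
def naive_frequentist_probability(column):
    """Occurrences of a variable divided by the number of recorded time-blocks."""
    return sum(column) / len(column)

# Hypothetical record of a single variable over ten time-blocks.
kitchen = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
print(naive_frequentist_probability(kitchen))  # 0.4
```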

The problem is that, in its current state, the Boolean dataset presents a stripped down datafication of occurrences and co-occurrences of a vocabulary of variables in time, but the structures it specifies could be equally meaningful, spurious or incidental. There is, in other words, no recipe for how to “count” the occurrence of a variable properly. This means, amongst other things, that the frequentist view is far from as simple as it might initially suggest. Who is to decide whether the entire past of the variable X, a subset of its past, or some other variables Y, Z… best determine the probability of X’s future distribution? To play the Devil’s advocate, one might ask if this question is not simply answered by the Boolean matrix. If so, its specifications of what variables tend to appear together is also a specification of what variables depend on each other. This logic, however, is insufficient, which can be made clear by looking at an example. The matrix might specify that the main character tends to cry nearby a table, and more often at the kitchen table than at any other table, but this pattern need not be (though it could potentially be) meaningful, i.e., it might not serve to increase prediction power. This simply means that, given an observed pattern, either consciously or unconsciously, the task of any spectator is to decide if the detected pattern is accidental, spurious, or meaningful. Sadly, the Boolean dataset does not specify this rather crucial aspect of the inference equation. If a pattern is perceived to be meaningful, this means it is perceived to be useful relative to a relevant hypothesis (a hypothesis, in this framework, is an internal probabilistic model which serves inductive ends by predicting the causes of sensations and observations). The most reasonable conclusion is thus to say that, for both the frequentist and Bayesian view, there is a fundamental need to condition data only on data perceived to form a dependent relationship to a hypothesis, which means to engage in a process of excluding nonmeaningful patterns. Thus, to perform efficient inference, a logical subset of dependencies must be inferred which in turn defines what the spectator assumes is beneficial to condition his set of beliefs against. We will turn to the intricacies of this sorting process at present.
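Before turning to that sorting process in detail, a crude and purely illustrative sketch of the decision involved might look as follows: a candidate conditioning variable is kept only if conditioning on it moves the estimate noticeably away from the marginal frequency. The threshold and the toy columns are invented; nothing in the Boolean dataset itself dictates them.

```python
def conditional(x_col, y_col):
    """Frequency of X restricted to the time-blocks in which Y is present."""
    hits = [x for x, y in zip(x_col, y_col) if y == 1]
    return sum(hits) / len(hits) if hits else 0.0

def is_worth_conditioning(x_col, y_col, threshold=0.2):
    """Keep the dependency X|Y only if conditioning changes the estimate noticeably."""
    marginal_x = sum(x_col) / len(x_col)
    return abs(conditional(x_col, y_col) - marginal_x) > threshold

# Hypothetical Boolean columns: does the main character cry, and are we at the kitchen table?
crying  = [1, 0, 0, 1, 0, 1, 0, 0]
kitchen = [1, 0, 0, 1, 0, 1, 0, 1]
print(is_worth_conditioning(crying, kitchen))  # True: P(crying | kitchen) differs from P(crying)
```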

Reducing the size of the matrix: Spurious dependencies

If someone were to suggest simply conditioning everything on everything, I would reply that this is unfortunately an unfeasible task, and that, even if it were feasible, it would be a badly chosen strategy. The reason is that the number of variables in a film is simply too large for it to be viable to include them all in, say, a co-occurrence matrix. Not only must a new matrix be specified each time a new measurement is performed – if 1000 measurements are made, 1000 unique matrices must be specified – this set of matrices must also be specified for each plausible model. To avoid dealing with such enormous amounts of data, and to better accommodate the limited cognitive faculties of a human being, a possible and computationally viable solution is to track only a few variables at a time, and to condition them only on variables deemed to form a dependent relationship. To see how one can reduce computational complexity in this way, let us introduce the idea of Bayesian networks. A Bayesian network reduces complexity by specifying dependencies and independencies amongst its variables. An example could be the following network:


[Figure: example of a Bayesian network of dependencies among variables.]
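To make the idea concrete, here is a minimal sketch of such a network in code; the variables, the dependency structure, and all probabilities are invented for illustration (they anticipate the party/official-meeting example discussed below) and are not taken from any actual film:

```python
# A tiny Bayesian network over hypothetical scene variables:
#   Gathering (party / official_meeting)  -->  Swastika (visible or not)
#   Gathering                             -->  TotalShot (used or not)
# All numbers are invented for illustration.

p_gathering = {"party": 0.5, "official_meeting": 0.5}    # prior over the hidden state
p_swastika  = {"party": 0.05, "official_meeting": 0.9}   # P(swastika | gathering)
p_totalshot = {"party": 0.2, "official_meeting": 0.8}    # P(total shot | gathering)

def posterior(swastika_seen: bool, total_shot_seen: bool) -> dict:
    """P(gathering | observations), assuming the two observations are
    conditionally independent given the type of gathering (the network structure)."""
    unnormalized = {}
    for g, prior in p_gathering.items():
        like_swastika = p_swastika[g] if swastika_seen else 1 - p_swastika[g]
        like_shot = p_totalshot[g] if total_shot_seen else 1 - p_totalshot[g]
        unnormalized[g] = prior * like_swastika * like_shot
    z = sum(unnormalized.values())
    return {g: v / z for g, v in unnormalized.items()}

print(posterior(swastika_seen=False, total_shot_seen=True))
print(posterior(swastika_seen=True, total_shot_seen=True))
```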

How does one go about creating such a network of dependencies for, say, a scene or sequence in a film? Variables might be probabilistically correlated, but perceived to be causally independent. Recall that the theory of probabilistic causation states that if the probability of A conditional on B is greater than the probability of A given not-B, then B probabilistically causes A. But as noted, probabilistic correlation must be distinguished from actual dependency. To take a concrete example, let us look at one variable in a film, such as location. Does the state of a location depend on what the location was 10 minutes ago? Does it depend on the location of the earlier shot? How about other variables such as the weather, an action, a character's mood, etc.? The question perhaps cannot be answered clearly. The process of inferring dependencies is a contingent learning process. A general observation can be made, however, which is that every observed change is only observed given an inferred dependency, that is, given a model. If one accepts this view, there is no such thing as passively being surprised at a certain rupture of a structural pattern. Any surprise is a result of a model in which you see the disruptive correlation as non-incidental and dependent. If a shift in cutting rate is observed, it is because you perceive the current cutting rate to depend on, say, the earlier cutting rate, or some other variable of interest. This dependency is inferred, rather than given, and is relative to a model. On the other hand, if the statistical irregularity goes by unnoticed, it is because it is seen to be independent of the variable under inspection. Let us look at an example. A spectator might be uncertain as to whether a certain scene in a film represents a party or something else, say, a formal gathering or a memorial. As in any such situation, the spectator will condition on the data available to resolve as much uncertainty about the generative mechanism as possible. Thus, if a character in the film utters to the man next to him the phrase 'what a lovely party', a spectator is likely to re-estimate the conditional probability P(party | observable data). But if the next observation happens to be a swastika on the back wall, any spectator might revise his or her assumptions about what, to speak in Bayesian terms, has generated the observable data. In the updated situation, our spectator might hypothetically attribute probability 0.4 to party, and probability 0.6 to official Nazi meeting, on the assumption that he sees the two hypotheses as mutually exclusive. Let us introduce another element. Suppose an earlier scene, depicting an official Nazi meeting, though a smaller and more informal one, was filmed in a fixed total shot. Suppose further that the present gathering is shot in the same way. Suppose even further, for convenience,


that only these two scenes are shot exclusively in a total shot. Our viewer might, then, rightly or not, reason with a certain probability that the given gathering is anything but a traditional party even before seeing the actual swastika on the wall. But the point is that he will be justified in this reasoning if and only if he sees the total shot as a "child" of (that is, a product of) the present gathering, and if, in the earlier scene, he saw the official Nazi meeting as a "parent" of (that is, a generator of) the total shot. It is the 'if' which is the important word here, since it emphasizes the problem that there is no unassailable way to objectively determine whether the observed second-order probabilistic correlation represents a dependency, however weak, or not. We are thus left with the idea that, to predict well, a spectator must infer the right dependencies, but, given that cinema is not a physical system which obeys deterministic principles, no such set of actual or "right" dependencies exists. What exist are merely observed correlations or patterns which, if the pattern is strong enough, can be inferred to depend on each other in a given evaluative or predictive model. Thus, we have a situation where the language of cinema is structured enough to provide fertile ground for modelling according to dependency relationships inferred from observed frequentist patterns, but where it is also messy enough to provide fertile ground for multiple model application, and thus for negation. But this does not mean that no probabilities exist; it just means that they are always a product of a contingently built model. We now move on to specify what probabilities can be inferred and estimated given such a model.

Estimation of frequentist probabilities from the Boolean dataset

We can now present the main consequence of the above for this study's conception of how the language of cinema functions. It will, as promised, be presented from a probabilistic perspective. What probabilities exist in the image? At any time t, a set of non-exclusive models, each with its set of dependence assumptions and possible priors, will specify a set of marginal, joint, and conditional probabilities for each variable of interest. The conditional probabilities might span variables within a scene, or they might take the form of probabilities of one variable or variable-set transitioning into another variable or variable-set. Thus, by seeing cinema as a statistical, multivariate distribution in time, we ask questions of the following sort: if variable X is present, then, given a model M, how does it affect the probability that variable Y is present? If variable X was present at time t−1, what is the probability that Y is present at time t? To repeat once more: the answer to the first question will represent a standard conditional probability of Y given X. The second represents a temporal conditional probability, termed a transitional probability. If we, on the other hand, ask: what is the probability of X, Y, Z, … occurring together at time t? then we seek to calculate a joint probability. In this way, we can potentially calculate the probability of a scene by decomposing it into a joint set of variables and introducing certain independence assumptions. Lastly, if we ask: what is the unconditional probability of a given node X? then we seek to calculate the marginal probability of that node. Two nodes X and Y are independent if the conditional probability of X given Y is the same as the probability of X without conditioning on Y.
The likelihood of the appearance of a variable X can be estimated by conditioning on every other variable, as well as on previous events of relevance. To reiterate, at every time t there exists a conditional probability, given a model, that any variable X, or any set of variables (a scene or part of a scene, an X, Y, Z, … set), occurs conditional on any other variable or variable set at the current time. Further, there exist transitional probabilities conditional on any variables of relevance at an earlier time-step. Second to last, and less interestingly, there exist marginal probabilities of all X, both instantiated and non-instantiated, at time t, specified by the purely frequentist rate of appearance in the image across all time-steps.
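As an illustrative sketch of these three kinds of estimate – the Boolean matrix below is a toy example with invented variables and values – the computations amount to simple selections over the dataset:

```python
import numpy as np

# Toy Boolean matrix: rows are time slices, columns are variables X and Y.
# Variables and values are invented for illustration.
data = np.array([[1, 1],
                 [0, 1],
                 [1, 1],
                 [1, 0],
                 [0, 0],
                 [1, 1]], dtype=bool)
X, Y = data[:, 0], data[:, 1]

marginal_Y     = Y.mean()              # P(Y): frequentist rate of appearance
conditional_Y  = Y[X].mean()           # P(Y | X) within the same time slice
transitional_Y = Y[1:][X[:-1]].mean()  # P(Y at t | X at t-1)

print(marginal_Y, conditional_Y, transitional_Y)
```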


Lastly, given the Boolean system, at every time t at which a variable appears, there is probability one that the given variable appears, and probability zero if it does not appear. But, as we saw in the last section, the point is that any specification of such probabilities is a product of a contingent learning process of model application and revision. As Jacob Feldman notes in his observation on what he calls the observer lattice: each observer can be identified by the set of data channels Σ to which he or she has access. One observer, for example, might estimate the probability of rain by considering the set Σ = {month, temperature}, while another might consider Σ = {month, humidity, day of week} … The set of possible observers, then, corresponds to the set of possible subsets of the RT, the total set of variables assumed in the universe at hand … Each of these agents will, on the basis of the data available to them, have an estimate of p(X). How do they compare? (Feldman 2016: 12) Let us therefore make it clear: there is no probability distribution that resembles or takes on the form of a "true" distribution. According to N models we have, at any time t, N probabilities that a hidden state X is on screen. Depending on the nature of priors, and on the nature of what any spectator chooses to condition on, probability assessments vary. Thus, there is great reason for believing that a third approach – let us call it a mixed models approach – is most veridical to the actual workings of the human brain. As we noted in the previous section on the probabilistic brain, there is strong evidence that, for more complex cognitive tasks, inference is more than likely to become fuzzy. This does not mean that reasoning is no longer probabilistic. It merely means that it is not smooth. This leads us into our final topic, namely that of negation.
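A minimal sketch of Feldman's point, with invented data: two observers who estimate the same quantity while having access to different channel sets Σ will, in general, arrive at different numbers, and neither estimate is the "true" one:

```python
import numpy as np

# Invented Boolean data: X is the variable of interest; A and B are data channels.
data = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 0, 1],
                 [1, 1, 0],
                 [0, 0, 0],
                 [0, 1, 1],
                 [1, 1, 0],
                 [0, 0, 1]], dtype=bool)
X, A, B = data.T

# Observer 1 conditions only on channel A (Σ = {A}); observer 2 conditions on the
# currently observed configuration of A and B (Σ = {A, B}). Both estimate P(X).
estimate_1 = X[A].mean()       # P(X | A)
estimate_2 = X[A & ~B].mean()  # P(X | A, not-B)
print(estimate_1, estimate_2)  # the two observers disagree
```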

Negation in the probabilistic perspective

We can now finally turn to the topic of negation, which I suspect has been long anticipated. As noted in the very beginning of this thesis, the argument that images cannot perform the operation "negation," at least in a logical or linguistic sense, is one of the most oft-used arguments against the cinema-as-language hypothesis. Edward Branigan wrote an article titled "Here Is a Picture of No Revolver!" (1986), and Sol Worth penned a similarly titled essay, "Pictures Can't Say Ain't" (1984). Noël Carroll has much the same view of the matter when he says: "Showing an image of Napoleon is scarcely the same as saying that Napoleon is not tall" (Carroll 2008: 105). Perhaps Carroll's example is a bit coarse, but it nonetheless stands to support the argument that the thesis that images cannot perform negation is deeply engrained in the literature. We will spend this section arguing that, when perceived in the empirical light of statistics and probability theory, rather than linguistics or logic, negation in fact functions as a natural constituent of cinema. To see how negation functions in cinema contra language, agree first – this is the premise of the argument – that natural language is capable of negation, and that an example of a use of negation is the following sentence: "I did not go to the cinema yesterday, since I only enjoy language." If cinema, then, as some argue, is indeed not any type of language, then it cannot perform any type of negation. Clearly, one would be hard-pressed to convey the type of negation


sketched above through cinematic images alone without doing violence to the concept of negation. The reason is that cinema is naturally positive by way of only being able to indicate that which is not by means of that which is. If one shows a room, let's say, without John, this clearly does not deterministically state that 'John is not there'. Similarly, if a scene shows John in the room, only to subsequently jump-cut to an empty room with no John in sight, it would seem the proposition is rather that 'John was there, but he is no longer there'. But the matter is not settled. Might we not accept that cinema is capable of negating convention or statistical regularity? If so, then I will argue we must also say that it is a language. Remember, we are looking for the proposition that 'John is not there' expressed visually. Imagine now a film in which John's mother is dying. The film follows John as he meticulously takes care of his mother. The viewer is presented with scene upon scene of John sitting next to his mother in the bedroom. He does not leave her at night, but chooses to sleep in the same room. He only leaves to do grocery shopping, which he usually does in the early morning. Then, near the end of the film, in the late afternoon, we see the mother alone in the room, coughing badly. The image might state quite a lot of things, and evoke emotions, but one could argue that amongst those is the proposition, inferred by the viewer, that 'John is not there'. This negation is not of a logical nature, nor are both John and not-John directly present in the picture. Rather, what has been negated is not a convention but a certain regularity. But to refer to our section on inferred dependencies, the negation only takes place insofar as there is an inferred dependency between John's being currently present or absent in the room and John's being present or absent in the room in the past. If we therefore assume a viewer who sees John's presence in the specific image as dependent only on John's past presence in that specific context, then, for this spectator, the negation 'John is not there' is, in a statistical sense, inferred to be part of the state of affairs of the image. More precisely, for this viewer, on a purely statistical level, John exists indirectly as a property in the picture with a certain probability; similarly, his negation, or the negation of his presence, exists in the picture with a certain probability. The conclusion we can draw is that something's presence negates its absence, and vice versa, in a context, insofar as its present state is inferred to depend on past states. Viewed in this way, negation comes to play a natural role in the language of cinema. The upshot of the view presented above is that dependencies that are not visible in the shot can be mapped. The rationale of the view is that frequentist probabilities of observables change even if the observables are not presently instantiated. That is, whenever X is instantiated, all its earlier children or parents are instantiated indirectly with an updated frequentist probability. Thus, at any time-step, any variable or possible configuration, either instantiated or non-instantiated, exists with a certain probability in the image. Expressed in a non-formal language, this merely represents the familiar notion that the meaning of an image is determined not only by what is in the image, but by all that it is inferred to depend on.
But it also means that any non-instantiation of a variable that a model infers to be probable produces a probabilistic negation, which can be measured by the ratio of the probability of the variable's non-occurrence relative to its probability of occurrence. In this sense, the negation is a probabilistic gap between a prior frequency and a posterior frequency. There is no need, however, to incorporate a spectator's prior world-model for such a negation to manifest itself. In fact, as we established above, frequentist negation manifests itself continually by way of probabilities, be they joint, conditional, or marginal, evolving over time.
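As a purely illustrative calculation with invented numbers: suppose that, up to time t, John has been present in the sickroom in 9 out of 10 scenes set there, so that the prior frequency of his presence is 0.9 and of his absence 0.1. If the next sickroom scene then shows the room without him, the negation it produces can be measured, on the definition just given, by the ratio

$$\frac{p(\text{John absent})}{p(\text{John present})} = \frac{0.1}{0.9} \approx 0.11$$

and, on one reading of this measure, the further the ratio lies from 1, the larger the gap between the prior frequency and the observed state; had his prior presence been only 0.6, the same observation would yield the weaker gap 0.4/0.6 ≈ 0.67.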


We might ask what happens, on the other hand, if the prior model is not a prior distribution within the film, but a prior spectator model over observables? In that case, the probabilistic gap no longer exists between various frequentist distributions within a film, but between a viewer's prior model of reality or cinema – which is largely time-invariant – and the film's inferred distribution of observables. In such a model, a viewer might expect – to go back to the earlier example – that John takes care of his mother every day, because, according to this prior model, "that is what people do when their mother is sick." This means that, given that John is observed to momentarily neglect his caretaking chores, then, above and beyond the frequentist likelihood of his behaviour, negation manifests itself between the expectations of a prior model and an observable distribution. This type of non-frequentist negation thus forms a second way negation might function in our framework. To see how a third, and last, type of negation might arise from our view, let us now complicate the picture slightly. Suppose it has been made clear to the audience that John has an appointment, say, a funeral, on the very afternoon he is observed to abstain from the caretaking of his mother. Suppose further that, on this afternoon, an arranged food plate can be observed in the room, including what seems to be a cup containing a hot beverage, and that this food plate stands on a chair next to, but out of reach of, the mother's bed. In this extended scenario, what are the various probabilities of John's absence, and how do these define the statistical negation produced by John's absence? The unconditional frequentist probability of John's absence is, as we have established in this scenario, much lower than that of his presence. However, conditioned on the event of the funeral, a prior model might estimate that John's absence is more probable than the unconditional probability alone suggests, since, let's say, "one cannot be in two places at once". Conversely, if a model conditions on the hot beverage, etc., it might estimate that John's absence is less probable than the unconditional probability of his absence, the reasoning being that the hot beverage, under the assumption that John needs to be present to feed his mother, tends to probabilistically cause John's presence. If, in the scenario just given, each conditional probability represents a model, we thus have a situation where several models simultaneously render John's absence (im)probable to various degrees. This we might term multi-model negation, which defines the third type of negation as it occurs in our statistical framework. Having now described the phenomenon of negation informally, as it occurs in our framework, we can provide a more formal measure of the amount of negation produced by the gap between distinct predictive or evaluative observer-models: following the book of statistics, we can express the clash between predictive models as a probabilistic ratio between these models, by using Bayes' theorem (see appendix two for a derivation hereof):

$$\frac{p(H|D_1)}{p(H|D_2)} = \frac{\dfrac{p(D_1|H)\cdot p(H)}{p(D_1)}}{\dfrac{p(D_2|H)\cdot p(H)}{p(D_2)}} = \frac{p(D_1|H)\cdot p(H)}{p(D_2|H)\cdot p(H)}\cdot\frac{p(D_2)}{p(D_1)}$$

This gives us the intuitive result that the ratio between two posterior model-probabilities is proportional to the ratio of the two likelihoods of the observable data given the model:


$$\frac{p(H|D_1)}{p(H|D_2)} \propto \frac{p(D_1|H)}{p(D_2|H)}$$

The left-hand side is called the Bayes factor, or simply the odds on the model or hypothesis H relative to the observable data, and is in general a good measure of (probabilistic) negation. In the example above, H is the hypothesis "John is absent", whereas D represents the observable data. In theory, the data D could be endless, but for simplicity, only two data-points are considered in the above equation. In the analytical section, the extended case with several data-points will be considered. How does a viewer react to such negations? Given our knowledge of the self-sustaining, biased nature of the brain, the result is likely that a viewer will try to translate, incorporate or transfer any perceived negation into a set of variables that better fits his or her prior model. Such transfers work to transpose signification from a set of variables that do not naturally go together in the prior model of the viewer, to a set of variables that do. The shift in signification is, unconsciously or consciously, a shift from a low-probability state to a high(er)-probability state given a prior model. This means that, in the mathematical framework above, a viewer will try not only to maximize the likelihood of the data p(D|H1) by inventing hidden states H1, H2, etc., which make the observable data the most likely observation possible; the goal is also to make the likelihoods under the various hypotheses similar, thus reducing the amount of model-negation as expressed by the Bayes ratio. We can express these two goals as follows:

$$\textbf{Viewer incentive:}\quad 1)\ \ \text{maximize } p(D\,|\,H_{1\ldots n}) \qquad \&\qquad 2)\ \ \text{make } \frac{p(D|H_1)}{p(D|H_2)} \approx 1$$

In the last example we gave, a viewer might attempt to reduce negation by reasoning that "John must be exhausted; it is only fair of him to take a break and get some rest" or that "I always suspected John was a bad boy, I could only expect him to act in such badly mannered ways!" It is important to notice, however, that such apparent shifts in the signification of a scene by the addition of hidden states (hypotheses) which increase the probabilities of observations, without there being a shift in the material which signifies, do not translate into there being no language of cinema, or into signification being entirely unstable. Neither does it mean that the process of perception and cognition functions without the application of a language-like mechanism. Rather, it simply means that one cannot derive, from a single distribution of the relative frequency of observables, a single or true probability distribution over observables, just as there is no way for a spectator to infer dependencies in a non-empirical way. The final consequence of the stochasticity of conditioning is that various non-exclusive probabilities can be rationally attached to each element in an image, scene or sequence simultaneously, depending on what any statistical learner chooses to condition on. Of course, the more complex the (cinematic) environment, the greater the need for the application of multiple models, and the more difficult it is for an observer to maximize the likelihood of observables. This is why, given a complex environment, the best approximation of evidence at the disposal of the audience becomes, not a single probability estimate, but a probabilistic distance or ratio between various models and their conflicting predictions and evaluations. This ultimately defines negation in our framework.
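A minimal sketch of this multi-model negation in code, using the John example; H is the hypothesis "John is absent", D1 the funeral, D2 the hot beverage, and all probabilities are invented for illustration:

```python
# Invented probabilities for the John scenario described above.
p_H = 0.2            # prior: John is absent in roughly 1 of 5 sickroom scenes
p_D1_given_H = 0.8   # the funeral makes John's absence quite likely
p_D2_given_H = 0.1   # the freshly served hot beverage makes his absence unlikely
p_D1 = 0.3           # marginal probabilities of the two observations
p_D2 = 0.4

# Posterior probability of H under each conditioning model (Bayes' theorem).
p_H_given_D1 = p_D1_given_H * p_H / p_D1
p_H_given_D2 = p_D2_given_H * p_H / p_D2

# The ratio between the two model posteriors: the further it lies from 1,
# the larger the multi-model negation produced by the scene.
bayes_ratio = p_H_given_D1 / p_H_given_D2
print(p_H_given_D1, p_H_given_D2, bayes_ratio)
```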


Summary

To sum up, in this chapter we have a) formally introduced Boolean decomposition as a method for creating a database for/from a given film, b) informally introduced how probabilistic patterns can be extracted from this database, and the problems connected to this endeavour, and c) informally introduced the ways in which negation naturally comes to play a part in how cinema communicates, when cinema is seen in this way as a statistical language. We now turn to apply the framework to two cases.


Case studies: Two films

This section will be split into two parts. The first part will look at how a single variable – namely shot-length – evolves in the film O Último Mergulho (The Last Dive; Portugal, dir. João César Monteiro, 1992). The film is chosen since the irregularity of shot-lengths in this film can be said to facilitate a case of the first type of negation. The first case-study will focus on only one variable, and will thus be the very simplest application of the logic of Boolean decomposition, in which only the time-dimension shifts. For the second part, we will look at a multivariate case and analyse the opening sequence as well as a few additional shots of the film The Duke of Burgundy (UK, dir. Peter Strickland, 2014). In this second case, our focus will be on the second and third type of negation. The objective of the two analytical studies is not to investigate the films broadly in the mode of traditional film appreciation and interpretation, but rather to isolate certain aspects of the films deemed to be of special interest to our statistical and inferential perspective.

A negation of relative frequency: The case of O Último Mergulho

In this first case-study, we imagine a movie spectator who is given the task of trying to evaluate and learn a single stochastic cinematic variable evolving in time. In theory, we could pick any variable which happens to have a frequentist distribution, but the case turns out to be more practical if we choose a continuous variable such as shot-length. The task, then, given to this spectator, is as follows: at any time-step, the spectator must evaluate and estimate (by forming probability distributions) the likelihood that there is going to be a cut in the next time-slice. This is the process of evaluating the likelihood of observables given some model. But how does the spectator, or rather, his brain, go about this task? We can now recall some of our findings so far. We have found that one of the major mechanisms of the brain is to minimize prediction error. We have also found that one of the ways to ensure minimal prediction error is to reason about the generating mechanism of the observed or sensed variable in question. Part and parcel of this endeavour is to generate an initial set of assumed dependencies between the variable and other variables deemed to be predictors of the variable under scrutiny. However, to simplify matters – or complicate them, depending on one's perspective – our hypothetical spectator is presented with a further restriction. The restriction is that he has only the history of this one variable – the distribution of shot-length in time – to condition against and include in his set of dependencies. This last addition might seem implausible to some, so, before moving on, we should perhaps stop and ask if such a scenario is in fact plausible. Is it reasonable to assume that some films can be said to redirect and restrict attention to inference about single continuous variables within a film, or is it a contrived picture we have set up? I would argue that, even if cinema in its traditional form is an interplay of multiple variables in time, some cinematic constructions can work to isolate and fixate the attention of the audience primarily on the distribution of a single variable or a few variables. To see how this might be true, note that, in European arthouse cinema, it is not uncommon to see a rather extreme simplicity of organization of mise-en-scène, which in turn means a sparseness of variables to pay attention to. This also means that, given one variable which functions as an outlier in the sparsely populated distribution, it is more likely to become an object of singular attraction of attention. This is all pretty abstract, so to make matters more concrete we should perhaps turn to a film which can be said to make use of such a tactic. The film we will choose to look at is João César


Monteiro's O Último Mergulho (1992). In particular, we will focus on one sequence in the film in which one variable, namely shot-length, statistically deviates in a rather extreme way from its own prior distribution in the film, thus producing prediction troubles for any spectator trying to construct a single generative model of the distribution by means of only univariate model prediction. Since this is the sequence being fed to our hypothetical spectator, let us introduce the scenario in greater depth. The sequence represents a two-part dance performance of the play Salome by Oscar Wilde. A still from midway through the sequence can be seen below.

Still from the sequence in which German actress Fabienne Babe performs the play by Oscar Wilde.

The sequence is 12 minutes in total, split into two 6-minute segments recording the dance in a fixed total shot. Between the two segments, there is a brief interlude in which we get a glimpse of the few guests witnessing the performance. However, given the brevity of the interlude, it is not far off to consider the sequence as an unbroken shot. The average shot-length prior to the 12-minute sequence is ∼ 20 seconds. The shot-length of the dance sequence thus represents a significant statistical deviation, which becomes quite visible when the accumulated length of the film is plotted:

[Graph: "Accumulated length of O Último Mergulho". X-axis: shot number (0–160); y-axis: accumulated length in seconds.]

The disruption conjured by the sequence is exacerbated by an overall stillness; there is no camera movement (or only the very slightest movement), the image is dark, and there is minimal movement within the scene, which consists of just a single character performing a play. But let us, as noted,


focus merely on this one variable – shot-length. While the two gaps within the demarcated circle in the graph above clearly illustrate the deviation, it is perhaps less clear to what extent this deviation poses a problem for any statistical learner. In its essence, the problem is that, before the initiation of the performance sequence, the past distribution of shot-length happens to be a fair predictor of the future distribution of shot-length. Thus, a statistical learner could reasonably apply a model in which the evaluation of the shot-length at time t depends on all observations up until time t−1. But, after the initiation of the performance sequence, such a model would likely result in massive prediction error. If we return now to the task of our hypothetical spectator, he must, if he is to minimize prediction error, evaluate quite carefully what has generated the observed change in the environment. The very first quote of this paper was Schrödinger's claim that the brain "can never admit to the slightest change in the result in genuinely unaltered circumstances." The question, then, is the following: what other change has occurred which justifies, according to a model, the rather radical change in shot-length? Which alteration in circumstances has happened that justifies the change in result? We might note here that, in this framework, a good interpretation is simply a high conditional probability between an observation and a model of what (hidden state(s)) has produced that observation. The set of all good interpretations can thus be said to be the set of all high-probability relations between observations and causes of observations, given a model. But the problem is that there happens to be no obvious high-probability relation which can make the deviation probable. There is no frequentist distribution of performance sequences in the film to condition against (there is, however, one additional dance sequence, which is not a performance but a dance in a bar; here, the sequence is shot in a more regular fashion). Further, as the graph above evidences, there are no other incidents of shots close to this length in the film. But then, if there are no other obvious elements to condition on which yield a high conditional probability, then no conditional probability P(shot-length (dance) | X, …, Y) can be intuitively put forward. This makes it plausible that the solution is to try to find a way to make the distribution itself the prior to condition on. But how does one do that? We can now make the task given to our hypothetical spectator more concrete. Recall once again that his task is to evaluate, throughout this 12-minute sequence, the likelihood of the observed data, which in this case is the probability of a cut. How will a spectator, faced with an increasingly deviating sequence, update his probabilities and models of what generates the sequence? First, we must remember that cinema is dynamic, and thus, no viewer is presented with the length of a shot all at once. Rather, its length is experienced incrementally. His problem can thus be visualized as a Markov chain:


Given the disruption, it is unlikely that this spectator will apply a single model. But, given our restrictions, he cannot condition on anything but the single variable that is shot-length. His task is thus to figure out, at each point, whether the shot-length depends on the prior distribution in the film, and, if so, in what way. Lastly, note that we are here not interested in prediction per se, but rather in model evaluation of observed data. We can now turn to evaluate which models a spectator could reasonably construct to minimize prediction error.

The construction of plausible models

We will look at five hypothetical but plausible models. First, let us assume that the spectator changes his probabilities according to a simple frequentist model in which all earlier data is included in the model (that is, in which new data depends on, or is generated by, all earlier data) and in which any point in the past distribution is considered to be of equal weight. Rationally, such a model would update its probability distributions by incrementally decreasing the probability that any cut will occur as fewer and fewer cuts are experienced in total. How can we form a probability distribution given such a model? What we are calculating here is in fact not probability, but likelihood, in that we estimate the probability of a new observation given a model. The likelihood, given this model, evolves in the following way, where B represents a time-slice in which a cut appears, A represents a time-slice in which no cut appears, and each time-slice lasts for one second. The table below thus represents a distribution of only the first 12 seconds of the film:

1:  A             p1 = 0
2:  AA            p2 = 0
3:  AAA           p3 = 0
4:  AAAA          p4 = 0
5:  AAAAB         p5 = 1/5
6:  AAAABA        p6 = 1/6
7:  AAAABAA       p7 = 1/7
8:  AAAABAAA      p8 = 1/8
9:  AAAABAAAA     p9 = 1/9
10: AAAABAAAAA    p10 = 1/10
11: AAAABAAAAAA   p11 = 1/11
12: AAAABAAAAAAA  p12 = 1/12
etc.
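As an illustrative sketch of this running estimate in code – assuming, as above, one Boolean observation per second, and using an invented observation sequence:

```python
# A minimal sketch of the non-weighted frequentist model described above:
# at each second t, estimate P(cut) as (number of cuts observed so far) / t.

def running_cut_likelihood(cuts):
    estimates, seen = [], 0
    for t, cut in enumerate(cuts, start=1):
        seen += int(cut)
        estimates.append(seen / t)
    return estimates

# True marks a second containing a cut; the values mirror the table above.
observations = [False, False, False, False, True, False, False, False]
print(running_cut_likelihood(observations))
# [0.0, 0.0, 0.0, 0.0, 0.2, 0.166..., 0.142..., 0.125]
```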

If we complete the calculation above for all time-slices in the film (which amounts to the number of seconds in the film, namely 5,472), we can visualize the data in the graph below.


According to this model, the first cut in the film happens with probability 0.2, which is significantly higher than the average probability of a cut per time-slice in the film. Since this causes the rest of the data to look “compressed”, and since we are only interested in the dance sequence, the graph below starts at t0 = 705 seconds. Again, I have highlighted the sequence with an oval:

[Graph: "Likelihood of a cut (abridged; t0 = 705) in O Último Mergulho". X-axis: time t in seconds (705–4,921); y-axis: probability (0.02–0.065).]

In this standard frequentist model, the probability of a cut occurring during the long dance sequence merely decreases in a steady fashion. While this is quite logical, it is, all things considered, not necessarily a very plausible spectator model if looked at in isolation. Rather, any spectator, faced with uncertainty, is likely to apply various models, each of which offers a way to explain what generated the data. What makes the multi-model approach the only statistically viable option is that there is no way, as we have repeatedly stated above, to produce the one true model, in the sense of finding a model which repeatedly best predicts the future; the reason is that there is no periodicity in a random variable such as shot-length. Put simply, there is no guarantee that the past distribution generates the future distribution. As we noted earlier in the paper, the problem this spectator faces is similar in many ways to that of analysts trying to predict and model stochastic financial time series; a non-optimal solution often applied is to try to predict, via single models, the periods which seem somewhat stable. The problem is deep, for, if the past does not seem to generate the future, then what model does generate the observables? Given that there is no single generative model, the result of the above is that minimizing uncertainty means minimizing the distance between plausible models, none of which are what we would call a "true distribution." This is the problem our restricted viewer is faced with. This viewer, to reiterate, is not a typical viewer, in the sense that we have a priori established that he or she cannot, for instance, interpret the deviation as, let's say, an intended deviation by the director or as being a symbol of a, b, c, … etc. This is a direct consequence of the univariate character of the analysis. But the point is that while relaxing restrictions and allowing for more models might make the analysis a closer approximation of real-life cognition, it does not alter the scenario in its very nature; it merely increases the space of all possible and plausible models.


Let us therefore now return to what models this viewer, within the restrictions given to him, might apply in addition to the model already constructed. So far in our probability estimates we have considered the situation where the entire history up to time t informs the estimate at time t equally. We have considered a situation where the order of events has been accounted for by means of looking at t incrementally per second, but we have failed to take the temporal order of observations into account in the estimates themselves. This can mean two things: that we haven't accounted for the role of patterns (e.g., that a brief shot is typically followed by a longer shot), or that we haven't accounted for the fact that past observed data might not be weighed equally to newer data. Secondly, we haven't considered the possible role that a prior distribution over the variable might have in influencing the estimations made by a spectator. In this analysis, we will not take up the task of pattern recognition, but restrict ourselves to a) introducing a prior and b) introducing a set of possible weighting principles. I will thus introduce a set of four extra models. This set will consist of one model in which the spectator includes a prior to influence the probability estimates, and three models which weigh the history of the distribution of shot-length differently. We might ask, before turning to construct the models, if it seems reasonable to suggest that weighting principles of such a nature play a part in the brain's inference engine. It certainly seems plausible, generally speaking, that while any viewer is influenced by what he or she has witnessed before, he or she does not weigh all past observed data equally. But the question of how to weigh past data against new data is one that changes a lot according to what variables are considered. Does the mood of a character as it instantiated itself in the beginning of a film influence the character's mood as it instantiates itself at the end of the film? Does the shot-length as it was estimated 20 minutes ago influence the shot-length as it is estimated in this moment? There is no clear answer to such questions, but I will construct three hypothetical scenarios that I believe to be somewhat plausible. Let me present them briefly before explicating them: The first model is one which weighs recent data much higher than old data. The second model is one which weighs past data more highly than recent data. The third model is a combinatorial model which weighs neither the distant past nor the very near present very highly, but which instead weighs the recent past highly. The last model I am going to introduce is, as noted, one which weighs the incoming data against a prior (spectator) model of shot-length. We will explain how to construct those models with numbers shortly, but first, it must be noted here that these models, while perhaps sufficiently plausible, are purely hypothetical in their concreteness. They are meant to be a symbolic representation of a general evaluation strategy for incoming univariate data, whose exact generating mechanism we do not know. Before turning to construct the models, and considering how they each evaluate the statistically disruptive sequence in the film, I will use a small simplifying trick which should make the problem clearer overall. The trick is a) to treat the sequence as an unbroken shot, and b) to introduce a simplified sequence that adequately mirrors the dance sequence in the film under discussion if treated as an unbroken shot.
The sequence can be represented as follows, in which “a” simply stands for a time-slice with no cut in it, while “b” represents a time-slice in which there appears a cut: aaabaaababaababaaaabaababaaabaaaa


The idea is now to introduce a's into the sequence above one at a time. This perfectly mirrors the challenge posed in Monteiro's work.

a a a b a a a b a b a a b a b a a a a b a a b a b a a a b a a a a a a a a a a a a a a a a a a a a a... etc.

We now try to construct the four models described above, each offering its own model of how to respond to the statistical disruption represented by the sequence. We will do so by introducing four weighting principles, one for each of the four models. First, let us try to establish a plausible prior model of cutting rates in films. We imagine here a fixed prior, that is, a prior that does not change as this viewer watches the sequence. We imagine further that, as the sequence moves along, and the non-biased probability of a cut becomes lower than this viewer's prior, this viewer (perhaps irrationally) starts weighting the prior more. This means that, as the shot becomes gradually longer than expected given what this viewer normally experiences at the movies, he, contrary to the frequentist model we constructed earlier, increases his estimate of the probability of a cut in the next time-slice. But what should this spectator's prior be? In theory, we can fix the number arbitrarily. If we assume the spectator watches mostly Hollywood films, we can expect his prior shot-length model to average around 5 seconds (the specific number is not important). If each time-slice is 1 second, then the prior in this scenario is P_prior = 0.2. The specific number here is not important, since there is no true number to aim at. The important thing is that there is a prior, which, whatever it is, a viewer will use to estimate the plausibility of incoming data. Potentially, we could establish an infinite number of various priors, say, one for art films, one for action films, and so forth. If we established a plausible prior model for an art-film appreciator, this model would naturally interpret the statistical deviation provided by the film as less disruptive, given the simple fact that such deviations are more probable within (a model of) that type of film. For simplicity, however, we concern ourselves with just the one prior introduced above, which then acts as a prior model average. With regard to the actual calculation, there are two probabilities to take into account: the prior, which is static, and the posterior probability which, in this case, is chosen to be given by the non-biased frequentist probability of a cut in which all past data is weighted equally. The two probabilities are then weighted and summed to arrive at a model-average probability estimate between the observed frequentist distribution and a prior. The weighting follows the same principle as the table below, which we now turn to. To create weighting principles for the three other models, we introduce a weighting scale from 0 to 1, where 0 represents the lowest possible weight and 1 the highest. The weighting principle for the viewer who attributes high weight to the present, and low weight to the past, is as follows. The now is given full weight = 1. Each row behind is given the weight x_t − (1/n) incrementally (n being the length of the sequence and x_t the weight of the current node). This means that, once a new "a" is introduced, the first data-point of the sequence disappears. We carry out this procedure until the sequence consists of all a's. The weighting principle is demonstrated below:


Sequence (oldest first) with weights:

a (0), a (0), a (0.03125), b (0.0625), a (0.09375), a (0.125), a (0.15625), b (0.1875),
a (0.21875), b (0.25), a (0.28125), a (0.3125), b (0.34375), a (0.375), b (0.40625), a (0.4375),
a (0.46875), a (0.5), a (0.53125), b (0.5625), a (0.59375), a (0.625), b (0.65625), a (0.6875),
b (0.71875), a (0.75), a (0.78125), a (0.8125), b (0.84375), a (0.875), a (0.90625), a (0.9375),
a (0.96875), a – new data point (1.0)

The model above is dynamic, so that the newest data point (in this case, always a time-slice in which there is no cut) is continually weighted highest. The weighting principle used to construct the model of the second scenario (in which a viewer puts a lot of weight on past data but allocates less weight to recent data) is an exact reverse mirror of the above table, which allows us to omit it here. The weighting principle applied in the last model, the combination model, can be demonstrated by the table below. By looking at the table one should observe the principle that the midmost observations – neither the near present nor the distant past – are given the most weight:


Sequence (oldest first) with weights:

a (0.0625), a (0.09375), a (0.125), b (0.15625), a (0.1875), a (0.21875), a (0.25), b (0.28125),
a (0.3125), b (0.34375), a (0.375), a (0.40625), b (0.4375), a (0.46875), b (0.5), a (0.53125),
a (0.5625), a (0.59375), a (0.625), b (0.65625), a (0.6875), a (0.71875), b (0.75), a (0.78125),
b (0.8125), a (0.84375), a (0.875), a (0.90625), b (0.9375), a (0.96875), a (1.0),
followed by the remaining a's, whose weights fall again in steps of 0.03125: a (0.96875), a (0.9375), …, a (0.0625), a (0.03125), a (0).

If we include the non-weighted model, which has no constructed bias, we have five models in all. The probabilities at each time-step are calculated simply by summing the weights of the time-slices in which there appears a cut, and then dividing that number by the total number of observations so far.
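A minimal sketch of this weighted calculation in code – the function names and the exact weight increments are the sketch's own assumptions; the weight profile shown is a linear "forgetting" ramp in the spirit of the first weighting table, and the observation sequence is the simplified one introduced above:

```python
# Weighted frequentist estimate: sum the weights of the time slices containing a
# cut and divide by the total number of observations so far. The other models
# only swap in a different weight profile (remembering, combination, prior average).

def forgetting_weights(n):
    """Oldest observation gets the lowest weight, the newest gets weight 1."""
    return [i / (n - 1) for i in range(n)]

def weighted_cut_probability(sequence, weights):
    weighted_cuts = sum(w for obs, w in zip(sequence, weights) if obs == "b")
    return weighted_cuts / len(sequence)

sequence = list("aaabaaababaababaaaabaababaaabaaaa")
for _ in range(5):                      # feed in new cut-free seconds one at a time
    sequence.append("a")
    weights = forgetting_weights(len(sequence))
    print(round(weighted_cut_probability(sequence, weights), 4))
```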


If we carry out this calculation for all five models, and plot the probability estimates in a single graph, we obtain the following:

[Graph: "Evolution of the subjective probability of a cut: a five-model scenario for O Último Mergulho (1992)". Lines: remembering model, combination model, forgetting model, non-weighted model, and model average of the non-weighted model and the prior model. X-axis: number of a's added subsequent to the initial state n = 32; y-axis: probability (0–0.3).]

Each model, which is represented as a principally biased probability distribution, provides a subjective estimate of how the probability of a cut evolves in time. Perhaps counter-intuitively, one can observe that, in the combination and remembering models, the subjective probability of a cut increases even if, on the "non-biased" frequentist view, the probability should decrease. This is a function of these two models interpreting new data as a "flaw", a random deviation, or untrustworthy. As the distribution changes, however, and the "new" old data becomes a distribution with minimal cuts, the probability naturally decreases. In a similar vein, in the model which weighs observed data against a prior, as soon as the probability of a cut moves below the probability threshold of what is expected given the prior model, the model increases the probability of a cut even if the observables indicate otherwise. This might intuitively seem irrational – since, one might argue, this model represents a viewer's desire for there to be a cut, rather than being a product of statistical inference – but it need not be. It simply represents one aspect of a learning process. In the forgetting model, the probability of a cut decreases steadily until it reaches zero. That it reaches zero might seem extreme, but it is perhaps not an entirely implausible scenario: imagine a viewer – not our primitive viewer, but a more advanced viewer – who is lost in the sequence and its rhythm, and who interprets it as an isolated or nested segment which plays according to its own rules; given such a model, any viewer is rational in believing that, after a certain while of having experienced no cuts, as long as the dance goes on, there will be no cut. Could we tentatively conclude that this last model, since it effectively aligns itself with the observed data and sees the sequence as nested, is the "best model" of the data? The problem with such an argument is that it is based on an observation made after the fact, in which it is known that no cuts are to appear throughout the sequence. This means that, if hypothetically a cut suddenly appeared, this spectator's model would be measurably more wrong than the others. The argument is thus a logical fallacy. To not dismiss the notion of a "best model" so quickly, however, let us briefly entertain the


following thought. Could one argue that any viewer who wants to act in a statistically rational fashion should stick, not to a "true model", since such does not exist, but to just one model, to ensure consistency? The idea is that, given non-periodic, stochastic data, such as the irregular shot-length of the film at hand, no single model of what generates future data is likely to be exempt from loss of prediction ability. The reason is that any stochastic language evolves, but it is not clear what mechanism(s) produce(s) its evolution. There is, to reiterate, no way to predict the degree to which the past determines the future, and in what way it does so. We might recall here Karl Popper's conjecture that one cannot prove a theory, but only falsify it. The reason is that one has no empirical access to the future. Even given a film with a highly regular and periodic shot-structure, there is no way to prove, given only past empirical data, that the future data in the film will depend on past data. This is one of the fundamental problems of prediction and modelling of stochastic data. Can we pinpoint, given the scenario depicted by the graph, where most dissonance is likely to be experienced by this hypothetical viewer? One might note that, at time t = 28, the "prior" model increases the probability of a cut in the next time-slice, while every other model decreases it. If we define negation as the probabilistic distance between incompatible models, then negation will rise exponentially between the "prior" model and the other models as time increases. The same tendency can be seen in the beginning, where the distance between the forgetting model and the remembering and combination models rises exponentially. Given our knowledge of the phenomenon of cognitive dissonance and the protective nature of the brain, such periods of dissonance are likely to be as brief as the brain can make them. It can thus be argued that it is the process of figuring out which model to choose which provides the most fertile space for negation to exist. It can be argued, further, that such indecision is facilitated by a disruptive environment, and that one such can be said to exist in Monteiro's film during the performance sequence. Lastly, we should note that, even if the models constructed in this section are both hypothetical and severely restricted, relaxing or changing restrictions, such as allowing for multivariate analysis, or bringing in empirical data, would merely result in many more models. The point is that if, given a large statistical deviation, no single model is likely to be chosen by a spectator which singularly precludes the need to invent other models, then all models are hypothetical. In the next case, we will turn to a multivariate analysis, and thus make greater use of our Boolean technique of decomposition. The type of negation studied in the coming section will not primarily take the form of a frequentist disruption; instead, we will analyse possible disruptions between a viewer's model of the world and the world presented in the film.


Different models within a film: A multivariate analysis of The Duke of Burgundy

The modelling case in this section is Peter Strickland's The Duke of Burgundy (2014). We turn here to complicate the problem of inference in a stochastic environment by moving from a univariate to a multivariate framework of analysis. As we will see, such a move introduces computationally intractable problems, but before we turn to see how that might be the case, let us briefly introduce the film's narrative. The film presents the slight curiosity of an all-female world in which everybody is a lepidopterist (a person who studies and collects butterflies and moths). Additionally, the two main characters, Cynthia and Evelyn, appear to engage in a further obsession – a sadomasochistic role-play. The fluctuations of this double relationship, with its shifting dynamics of submission and domination, and complex entanglements between role-play and reality, form the dramatic spine and thrust of the film. I will argue that Strickland strategically distributes contradictory information to maximize uncertainty about the nature of their relationship and to cloud a spectator's judgment of the true emotional state of the characters in any given scene. To minimize prediction error, any spectator must condition wisely to form stable hypotheses about the hidden states in question. These hypotheses will be formed from a generative model in which elements deemed to hold a dependent relationship to the hypotheses will be included by the spectator. Thus, in the analysis in this section, the main objective is to imagine a spectator who asks: given the elements I have observed so far, what is the likelihood that character A is currently in emotional state B? Given a scenario where some elements predict one emotion, and some elements another, a multi-model scenario, and thus negation, is likely to emerge. But to see if such a scenario can be said to exist within the film, it will be useful to figure out which variables appear when, and together with what other variables. It is this we now turn to.

Constructing the Boolean dataset

I have carried out a Boolean decomposition of the first 42 minutes of the film. It would take up 30 pages to lay out the data here, and I have deemed it too voluminous for an appendix as well. However, below is given a sample of the decomposition, namely of the opening sequence, which takes up the first three minutes of the film. The following variables have been included:

Location (abb.: Lo)
Character (abb.: Ch)
Character's facial expression (abb.: Ch. obs.)
Action/narrative information (abb.: Nar)
Objects (abb.: Obj.)
Music (abb.: Mus)
Lyrics (abb.: Lyr)

Left out of the analysis:

Size of objects in image
Spatial relations between objects in an image
Dialogue
Etc.

As we have already discussed, it is difficult to construct a co-occurrence table such as the one below without controversy. What variables do you leave out? And how do you designate the variables you choose to include? To reiterate, the following three principles have been applied: a) include only observables, b) include only reusable parts, and c) include only autonomous parts. Thus, a "hand" is not included, unless deemed to have autonomous power. Likewise, a character's inferred emotional state is not included, since it is not directly observable. These principles, however, do not free the methodological choices from dispute. For instance, I have included a variable such as "dark", even


though it is not clear to what extent it is a reusable part. Where is the limit that distinguishes a dark image from a non-dark one? I can only reiterate that, given a continuous variable such as "dark", only degrees of darkness exist, but since the analysis is carried out manually and by intuition, imprecision is to be expected. To avoid systematic error, the most important thing is consistency, that is: given two images which are roughly equally dark, they must both be designated in the same way. Lastly, concerning the omission of the element of dialogue, I realize that dialogue plays a major role in film comprehension. But since there is no dialogue in the opening sequence, and since the later analysis considers only a single shot, the omission is not fatal. Let us now look at the table:

[Boolean decomposition table of the opening sequence of The Duke of Burgundy (timecodes 00.52–03.40). Each column is a time-slice; each cell holds a 1 when the variable is instantiated at that timecode. The rows are grouped under the main variables as follows.
St.: Freeze-frame, Superimposition, Daylight, Dark, Shaky zoom, Framing object, FO trees, Pan, Super total, Close-up, Medium total, POV shot, Yellow, Red tint, Blue tint, Green tint, Flare, Blurry/out of focus, Macro.
Lo.: Forest, Forest staircase, City, Porch, House, Library, Outside library, Abstraction.
Ch.: Evelyn, Cynthia, Other woman.
Ch. obs.: Smile, Face obscured, Slightly open mouth, Serious expression, Cynthia, open mouth.
Obj.: Mushroom, Book(s), Larvae, Unfolding larvae, Bike, Identical bike, Cynthia's superimposed face, Painted butterfly, Butterfly, Abstract figures, Red box, Running water, Black dress, Dark leaves, Sun (behind leaves).
Nar.: Sits, Looks at sun, Studies butterflies, Larvae unfold, Paper unfolds and crumbles, Walks down stairs, Cycles, Arrives at house, Stands up on bike, Evelyn waves at woman, Woman waves at Evelyn, Reads, Takes elevator.
Mus.: "Cat's Eyes".
Lyr.: Dream/loss/love/come back, How people change.
Title: Title.]

The table above is, admittedly, not particularly expressive in itself. But while this might be true, it does hold a lot of information in the sense that it specifies, even if partially, the underlying structure of the opening sequence. Partially, because the model is not reversible (we cannot reconstruct the image(s) from the model). But how can one extract information from the table, then? Given our probabilistic perspective, we might ask: what is the probability of any possible N-size configuration at any time t? Unfortunately, such a problem is largely intractable. In the full 42-minute table, 217 variables are specified (69 in the above table). This means that, for any given model, a co-occurrence and transition matrix of size 217 times 217 can be specified for each timestep. But this is only part of the full picture. An exhaustive matrix would specify not only single values against other values, but the number of combinations (image configurations) possible. Given samples of a 10-variable size, and 217 possible variables to choose from, the number of possible combinations is \binom{217}{10} \approx 5.17 \times 10^{16}. This is one problem, which has to do with computational complexity. Another problem is of a more philosophical and fundamental nature. The problem, which has haunted us before, rests on the notion that there is no true measure of probability, since there is no definite way to count frequencies (counting depends on inferred dependencies over time), and no definite way to condition. Thus, we could specify a conditional probability matrix from the above matrix, or indeed several, but there would be no deterministic equivalence between those and the Boolean matrix itself. It is in this very radical way that negation is an inherent part of probabilistic language. What has been written above might seem a not particularly satisfying answer to the question of how to exhaustively extract data from the dataset we have constructed, but it is the only realistic answer I can come up with at this moment. Let us try, therefore, to solve a simpler problem using the Boolean dataset. Ultimately, we want to arrive at the subject of how to perform inference about the hidden (emotional) states of the characters given observable data as specified by the matrix. But before we go there, let us first look at a simple inductive problem, to avoid introducing a prior just yet. Concretely, let us try to delineate the use of the colour blue in relation to the two main characters and see if a pattern emerges over the first 42 minutes of the film. A small computational sketch of the counting involved is given below, followed by the resulting graph:
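The curves in the graph below were produced by manual counting, but the underlying computation can be sketched as follows. Assuming a Boolean shot-by-variable dataset of the kind just described, the snippet estimates the running marginal frequency of "blue" together with its conditional frequencies given Evelyn and given Cynthia. The toy 0/1 sequences and the function running_probability are hypothetical; only the counting logic matters.

```python
# Sketch of the three curves in the graph below: running relative frequencies
# estimated from 0/1 columns of a Boolean shot x variable dataset. Toy data only.

def running_probability(blue, condition=None):
    """Relative frequency of `blue` up to each shot, optionally restricted to
    shots where `condition` is 1. Returns None until a conditioning shot occurs."""
    probs = []
    hits, total = 0, 0
    for i, b in enumerate(blue):
        if condition is None or condition[i] == 1:
            total += 1
            hits += b
        probs.append(hits / total if total else None)
    return probs

# One 0/1 value per shot (toy sequences standing in for the real 353-shot data)
blue    = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
evelyn  = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]
cynthia = [0, 1, 0, 0, 0, 1, 0, 1, 0, 0]

p_blue         = running_probability(blue)            # marginal p(blue)
p_blue_evelyn  = running_probability(blue, evelyn)     # p(blue | Evelyn in shot)
p_blue_cynthia = running_probability(blue, cynthia)    # p(blue | Cynthia in shot)
print(p_blue[-1], p_blue_evelyn[-1], p_blue_cynthia[-1])
```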

[Graph: three conditional and marginal probability distributions of the colour blue for the first 42 minutes of The Duke of Burgundy. Lines: the unconditional probability of the colour blue, the conditional probability of blue given Evelyn, and the conditional probability of blue given Cynthia. Y-axis: probability (0–1); x-axis: shot number (time), shots 1–353.]

In the graph above, the probability that the colour blue occurs as time evolves is specified by three distinct models. The blue line tells us that the marginal probability that the colour blue will occur in a shot is on average ~1/10. As the graph shows, the colour blue does not appear for the first many minutes, after which its frequency escalates and then stabilizes. What about the conditional probabilities? Given that Evelyn appears in a shot, there is on average a ~7/10 chance that the colour blue occurs in it. But given that Cynthia appears, there is on average only a ~1/6 chance that the shot contains blue. Is this rather clear difference merely a coincidence, or is there a dependent relationship between the properties "blue" and "Evelyn" which a viewer could plausibly implement in a generative model? Since the probability that blue occurs is significantly higher if Evelyn is present, Evelyn could be said to probabilistically cause the colour blue to appear. But does this necessarily mean that Evelyn is a reliable predictor of the colour blue? A spectator who inductively inspects the environment behind the data above might end up with two hypotheses of (in)dependence assumptions:

[Figure: two Bayesian network models in which blue either depends or does not depend on Evelyn.]

There is perhaps no true way to argue against anyone who believes that the second model is just as valid as the first. He or she could simply maintain that the probabilistic dependence so far has been purely incidental and that the future occurrence of the colour blue does not depend on this past distribution. Thus, the table can be used to specify whatever foundation is present for induction, but it cannot properly specify actual dependencies, and therefore true models or theorems. The likelihood that blue appears in a scene depends, as we have just seen, on subjectively inferred dependencies. Thus, a likelihood of 0.2 and one of 0.7 can be "true" or "valid" at the same time. The case above is perhaps a trivial example, but the method of induction used to present it is not trivial and can be extended to any more complex case. At any rate, inductive learning does not tell anyone what the learned structure "means", but it forms the basis for indicating what is meaningful and what is not. The analysis above would naturally have to be elaborated further (i.e. include more variables) to study the use of the colour blue less trivially. Instead of going down that route, however, let us turn instead to the problem which we posed to ourselves in the beginning of this chapter, namely, the problem of inferring the hidden emotional states of the main characters given observable data. More precisely, below we will engage with the dynamics of a spectator trying to infer Evelyn's emotional state in the opening sequence of the film. The analysis will not be exhaustive, but indicative, and will focus on the (mathematical) method of induction.

The opening sequence: A case of multi-model negation

The opening sequence of The Duke of Burgundy is a montage which is both simple and complex at the same time. Except for a brief appearance of Cynthia's face in an abstract superimposition (see the stills below), only Evelyn appears. The recurring thread which holds the montage together is simple: on a bike, Evelyn moves through the forest, and later the city, before arriving at a fanciful house. Interspersed with this simple scenario is a string of less simple montage elements, some of which are shown below in the form of still images. Remember that we are interested in inferring Evelyn's mood from the observables. But the challenge is the following: we mostly see her on her bike, her face is only rarely visible, we do not hear her speak, and she doesn't engage with other people, so what can a spectator possibly know? In the following, I will argue that while a spectator cannot know much with certainty, he or she can assume a few things about Evelyn's hidden emotional state by means of implicit learning techniques. To support this reasoning, let us first make the assumption that every (major) montage element is introduced for a reason. Let us next assume that it is not unfounded to suggest that spurious dependencies might be inferred to exist between her mood and the various montage elements, since, as noted, that is all there is to condition on. The assumption is supported further by recalling the famous Kuleshov effect. The effect is a consequence of the principle that additive meaning arises through the juxtaposition of images; in other words, the meaning conferred on single images depends on the images they are juxtaposed with. Given an image of a man and an image of a bed, the man is inferred to be sleepy. This is not a meaning inherent in the first image itself, but is a hidden state added to the compound of the two images. The opening sequence presents a pertinent case of this evaluation problem of inferring the hidden states of a human being – in this case Evelyn, of whom we know nothing with any certainty at this point in the narrative – from the variables this human being co-occurs with. Let us look at some of these variables:

[Stills from the opening sequence: the sun framed through branches (low exposure); Evelyn on her bike, framed through branches; Evelyn eagerly waving to another woman; an insert of a larva, introducing the caterpillars; Evelyn studying at the library, looking serious; Evelyn in an elevator (orange tint, serious); another framing object of branches/leaves; a mushroom and a beetle; unfolding larvae in abstract colouring; Cynthia superimposed, abstract colouring.]

The images above naturally represent only a subset of the montage elements shown in the opening sequence, but as a sample they should indicate the nature of the data available to condition Evelyn's emotional states on. One additional element should be mentioned, however. The title song, "Cat's Eyes", is audible throughout most of the sequence. A recurring lyric is the line "How people change." How does such a line predict, if at all, the mood of Evelyn? And what about the rest of the observables? To dig deeper into this question, let us restrict ourselves to operating with a single hypothesis versus its negation, namely: "Evelyn is happy" vs. "Evelyn is not happy." Assume that, before the film starts, a spectator has a prior distribution over these hypotheses. Typically, the prior distribution will be set to 0.5/0.5, meaning simply that a spectator does not favour any hypothesis over the other. This standard operation is easily accepted considering the fact that, as more data comes in, the value initially set for the prior diminishes in effect. The reason is that the prior is not fixed, but dynamic, which means that the posterior probability at time t becomes the prior probability at time t+1. But what about the posterior probability, that is, the probability of the hypothesis given new data? How does the spectator reason then? We can formalize the scenario by looking at the following equation:

e(H \mid D) = e(H) + 10 \cdot \log_{10}\left[ \frac{p(D_i \mid H)}{p(D_i \mid \neg H)} \right]

In the equation, H represents the hypothesis about the hidden state (in this case, Evelyn's (binary-valued) mood). D is observable data. e(H) is the prior evidence for the hypothesis, which, given the 0.5/0.5 prior probabilities we noted above, starts at zero. The expression on the right is the important part, and specifies the ratio of the likelihoods of the observable data given the hypothesis and given the negation of this hypothesis, respectively. To make things more concrete, let us look at some of the images from the opening sequence. How obvious is it that the content of those images forms a dependent relationship to, and is thus a predictor of, Evelyn's mood? Indeed, some elements in the sequence might be inferred by a spectator to be entirely independent of Evelyn's emotional state. Given that this is the case, the ratio on the right will be one, its logarithm zero, and the evidence for the hypothesis remains unchanged (the log measure is conveniently constructed so that log10(1) = 0). To take an example, consider the image of the mushroom in the stills above. Let us assume that a spectator deems this particular observable to hold no predictive power over Evelyn's hidden emotional state. This is the same as to say that he infers conditional independence between the two variables, which can be derived mathematically by plugging the hidden state and the observable into the equation:

e(\text{Happy} \mid \text{Mushroom}) = e(\text{Happy}) + 10 \cdot \log_{10}\left[ \frac{p(\text{Mushroom} \mid \text{Happy})}{p(\text{Mushroom} \mid \neg\text{Happy})} \right]

Conditional independence: p(\text{Mushroom} \mid \text{Happy}) = p(\text{Mushroom} \mid \neg\text{Happy})

e(\text{Happy} \mid \text{Mushroom}) = e(\text{Happy}) + 10 \cdot \log_{10}[1] = e(\text{Happy}) + 0

\Rightarrow e(\text{Happy} \mid \text{Mushroom}) = e(\text{Happy})
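To show the dynamics of the evidence equation rather than any particular numbers, the sketch below accumulates evidence for the hypothesis "Evelyn is happy" over a handful of observables, each assigned a hypothetical likelihood ratio. An observable judged conditionally independent of the hypothesis, like the mushroom above, gets a ratio of one and, as just derived, contributes nothing.

```python
import math

# Sketch of evidence accumulation in decibels:
# e(H|D) = e(H) + 10 * log10( p(D|H) / p(D|not H) ).
# The likelihood values below are hypothetical, chosen only to illustrate the dynamics.

observations = [
    # (observable, p(D | Happy), p(D | not Happy))
    ("mushroom", 0.05, 0.05),   # judged independent of the hypothesis -> ratio 1
    ("smile",    0.60, 0.15),   # strong positive predictor of "Happy"
    ("larva",    0.10, 0.30),   # weak negative predictor of "Happy"
]

evidence = 0.0  # a prior of p(Happy) = 0.5 corresponds to 0 dB of prior evidence
for name, p_given_h, p_given_not_h in observations:
    delta = 10 * math.log10(p_given_h / p_given_not_h)
    evidence += delta
    print(f"{name:10s} adds {delta:+6.2f} dB -> evidence now {evidence:+6.2f} dB")

# Convert the accumulated evidence back to a posterior probability for "Happy"
odds = 10 ** (evidence / 10)
print("posterior p(Happy) =", odds / (1 + odds))
```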

Such independence assumptions between variables, unless they are revised at a later point by a statistical learner, can help keep the internal cognitive model as simple and efficient as possible by, so to speak, filtering out noise relative to the hypothesis under inspection. At the most extreme, a viewer will ignore any evidence except the most obvious indicators of the state of the inspected variable. It can be costly, however, to subsequently find out that earlier models have been too sparse or exclusive ("I should have paid more attention to a, b, c…", etc.). But while an observation such as a 'mushroom' could perhaps reasonably be excluded from a model without the omission being costly, other observations, such as a 'smile', which can be observed at least once in the opening sequence, are, for most spectators I presume, clear enough predictors of mood, even if a smile can potentially deceive, for it to be costly to ignore or overlook them. And so, the inferred degree of dependency between observables and states moves on a gradient scale. If we are to pick an example which is a bit more tricky to place on this scale, we might ask: does the larva observed in the opening sequence hold a dependent relationship to Evelyn's mood, and if so, to what degree? The question is far from trivial, perhaps because there is no clear answer. Any given observer might attempt to construct a model where the occurrence of larvae is used to help predict Evelyn's mood. The validity of such a model cannot be disputed right off the bat. One might, for instance, argue that, since the larva is a stage in a caterpillar's life, the concept of "change" forms some spurious relationship to the larva. In turn, this might indicate that Evelyn, and so her mood, might change, or is liable to fluctuate. But the point is that, even without such inferred symbolic connections, the larva can still be used as a spurious predictor. Say, for instance, that a spectator observes the curiosity that a larva "usually" can be observed in the vicinity of someone being unhappy in the film. Given a scene in which facial expressions are obscured, and given that a larva can be observed, anyone who deems the larva to be a predictor of mood, given a current model and a cinematic universe, will form the prediction of the hidden states based on this observable, amongst others. If we plug it into the model, we obtain:

e(\text{Happy} \mid \text{Larva}) = e(\text{Happy}) + 10 \cdot \log_{10}\left[ \frac{p(\text{Larva} \mid \text{Happy})}{p(\text{Larva} \mid \neg\text{Happy})} \right]

If p(\text{Larva} \mid \text{Happy}) < p(\text{Larva} \mid \neg\text{Happy}), then e(\text{Happy} \mid \text{Larva}) < e(\text{Happy}).


In general, things can come to mean something statistically by appearing consistently with some elements and not with others. Given co-occurrences, dependencies might be inferred. If we now extend the discussion to the main variables observed in the images presented earlier, we might make the following visualization of the (in)dependence problem in the shape of a simple Bayesian network. For simplicity, I have assumed conditional independence between the elements, thus restricting dependencies to those between her mood and each variable.

[Figure: a simple Bayesian network with Evelyn's mood as the hidden parent node and the observed montage elements as its children.]

For simplicity of visualization, I have assumed dependence between every element and Evelyn's mood in the graph above. This means that to each dependency a conditional probability is attached, in the form of an emission probability of an observable given a state. We could, theoretically, try to establish this set of probabilities, and observe how the state becomes more or less likely over time. Further, we could implement various such conditional models (say, one per hypothetical spectator) and see how these different priors would influence a viewer's prediction of Evelyn's hidden mental state. But you have probably noticed that I have refrained from attaching numbers to the equations. The reason is that the specific numbers a) are not available, and b) are not important in their specificity, whatever they might be. Rather, the point is that the mental dynamics of a spectator are likely to emulate the dynamics of the equation presented earlier, and these dynamics of inference remain the same no matter what numbers you operate with. But we might then ask whether the mathematical model above presents a complete picture, given a set of inferred dependencies, of how a (rational) spectator infers hidden states from observables. In the model above, evidence is accumulated in proportion to the likelihood of the observed data given a hypothesis and a prior. But we might introduce a further element. As we know, the mind corrects for imprecision in optical measurement. Thus, imagine the following scenario. In the third still above, in which Evelyn can be seen to wave, a spectator might think he sees a smile, without being certain. This means that, even if the probability of the state "happy" given the observation "smile" might be high, the evidence for the state might still be uncertain because the imprecision of optical measurement casts doubt on the validity of the model. We can thus introduce the principle of model weighting. In the above case, we might imagine two operating models, say, M_1 = p(\text{happy} \mid \text{smile}) and M_2 = p(\text{happy} \mid \text{"half-smile"}). We might hypothesize that this spectator allocates weight 0.7 to the first model, while allocating weight 0.3 to the second model. This will create two different posterior probabilities, one given each model, and each weighted as noted. We can thus write the following equation, which essentially is the formula for model averaging in a Bayesian set-up:

p(H \mid D) = \sum_{i=1}^{K} p(H \mid M_i, D) \cdot p(M_i \mid D)
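A minimal sketch of this model-averaging step is given below, using the 0.7/0.3 weights from the smile/"half-smile" example; the two per-model posteriors are made-up values, since, as noted above, it is the weighted-sum structure rather than the numbers that matters.

```python
# Sketch of Bayesian model averaging: the posterior for the hypothesis is the
# sum of its posterior under each candidate model, weighted by the belief in that model.
# All numbers are hypothetical stand-ins for the smile / "half-smile" example.

models = [
    # (description,                       p(Happy | M_i, D),  p(M_i | D))
    ("M1: a genuine smile was seen",      0.85,               0.7),
    ("M2: only a 'half-smile' was seen",  0.55,               0.3),
]

p_happy = sum(p_h_given_m * p_m for _, p_h_given_m, p_m in models)
print("model-averaged p(Happy | D) =", p_happy)   # 0.85*0.7 + 0.55*0.3 = 0.76
```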

Note that this model-averaging equation, too, is an expression of the probability of a hypothesis given some newly observed data. But here, the posterior probability p(H|D) is weighted by the observer's belief in each model which might explain the observed data. Thus, the posterior probability is the sum of each posterior probability given a model (the left term within the summation), weighted by the belief in that model (the right term within the summation). If we plug in the data from the example, we obtain:

p(H \mid D) = p(\text{Happy} \mid \text{genuine smile } (M_1)) \cdot 0.7 + p(\text{Happy} \mid \text{non-genuine smile } (M_2)) \cdot 0.3

The two posterior probabilities are calculated as usual by Bayes' theorem (i.e. by combining the likelihood of the observed data with the prior). The only new element introduced here is the model weighting principle. This principle offers the brain, or analyst, a way to allocate precision to measurements of observations, and thus assign validity to models. This in turn, if applied well, works to minimize prediction error. In the opening sequence, for instance, we could anticipate a viewer who standardly allocates low weight to optically uncertain evidence. This must naturally be kept distinct from the act of inferring independence, since we are here not dealing with model exclusion. So far, we have considered two combinable ways of statistically evaluating hidden states given observable data (and a prior model): the accumulation of evidence, and the precision weighting of that evidence. But what role does (probabilistic) negation play in the act of accumulating evidence in the way indicated above? Recall that one interpretation of negation in our framework is the distance (or divergence) between posterior and prior models relative to the same hypothesis (e.g. a change between posteriors from one scene to the next, or a clash of posteriors within a scene). Between two models M1 and M2, and two observables D1 and D2, the distance between them can be expressed as follows, using a log measure:

\mathrm{negation}(M_2 \mid M_1) = p(M_2)\, \log\left( \frac{p(M_2)}{p(M_1)} \right) = p(H \mid D_1, D_2)\, \log\left( \frac{p(H \mid D_1, D_2)}{p(H \mid D_1)} \right)

Within information theory, a term of this form is what the Kullback-Leibler divergence sums over; within Bayesian statistics, the ratio inside the logarithm is closely related to the Bayes factor. If we want to measure not how an observer combines evidence to form a single hypothesis, but how an observer experiences the difference or distance between various (conflicting) pieces of evidence, the formula above is useful. To give an example, a viewer might be uncertain as to whether the newly observed data, the lyric "How people change", is a predictor of Evelyn's mood, and therefore operate with two separate distributions relative to the hypothesis, the other conditioned on the predictor 'smile'. If we plug the example into the equation, leaving out numbers for clarity, we obtain:

N = p(\text{Happy} \mid \text{"How people change"}, \text{smile}) \cdot \log\left( \frac{p(\text{Happy} \mid \text{"How people change"}, \text{smile})}{p(\text{Happy} \mid \text{smile})} \right)

This negation expression represents the amount of information gained or lost by using the earlier model M1 (smile) instead of the new model M2 (smile + "How people change"). It thus represents the cost of not updating the mental model in light of new evidence. To sum up, during the opening sequence, a viewer is likely to rationally accumulate implicit evidence of Evelyn's emotional state on the basis of inferred dependencies and independencies relative to that state, and to weight that evidence proportionally to the precision of measurement. However, given high degrees of uncertainty, say, if the distance between two models is large, a viewer might operate with distinct conflicting models until new evidence is strong enough to resolve the epistemic conflict and collapse all models into one. Another viewer might simply ignore the contradictory evidence. Thus, viewing the opening sequence, one spectator might attempt to initially reduce the experience of negation by attributing low weight to all observables except the observable "smile", relative to the hidden emotional state of Evelyn, and thus conclude with a fair amount of certainty that Evelyn is happy. Another spectator might experience his or her predictions or evaluations of Evelyn's mood as being constantly deflected from their curve by the wealth of various objects in the opening sequence, each carrying their own more or less spurious connection to Evelyn's character. However, from the viewpoint of a brain trying to minimize prediction error, neither of these two spectators' models, or any other hypothetical spectator's model for that matter, can be said to be exactly right or wrong in the moment of measurement. Such terms can only be used ex post facto, when the set of all necessary observations has been exposed. The reason is that, as facts arrive, models fail or succeed – but not until then. This encapsulates the stochastic and dynamic quality of statistical learning.

Summary

To summarize, cognitive science tells us that the brain is an organ which, to achieve self-consistency, constantly evaluates how best to minimize the distance between internal models and observations from the environment. This is the process of finding the generative mechanism which minimizes the total prediction error over time. In this chapter, I have argued that the opening sequence of The Duke of Burgundy, and the dance performance in O Último Mergulho, represent two cases in which this process of inferring the generative mechanism of a cinematic environment, by means of an iterative process of discovering what evidence the evolution of a chosen variable (hypothesis) depends on – shot-length in the latter example, and the mental state of a character in the former – can be argued to be conflicting or deviating enough to make the statistical learning process cloudy and uncertain. In conclusion, we have argued that João César Monteiro and Peter Strickland, each in their own way, set the scene for negation. From the cognitive point of view, however, the state or experience of negation is neither terminal nor fatal. Rather, it is a necessary calibrating mechanism which dynamically aligns internal models with each other and with observations. In sum, in this updated framework, negation is a mechanism which helps facilitate learning.


Conclusion

In this thesis, I have attempted to show how, by changing the framework for the question of whether cinema is a language or not, cinema can indeed be said to be a language. If modelled against the English language, cinema turns out to manifest many of the same properties. But the conclusion we arrived at in our discussion with Carroll and Currie was that these similar properties are best studied statistically, since they come in degrees rather than principles. This means that, if cinematic language is conventional, or canonically decompositional, it is so to a degree. But if this is true, then cinematic language must be said to possess stochastic qualities that can only be inferred, rather than decoded. For this inference process to be successful, it would require the application of an empirically sensitive learning mechanism, rather than a rule-based system. As it turns out, such a line of reasoning happens to be isomorphic not only with recent neuroscientific claims about the probabilistic nature of the brain, but also with more established theories of the statistical roots of language learning. Briefly, the idea is that any contingent learning process takes shape as a continuous refinement of cognitive generative models that seek to explain the causes of observations and sensations from the environment. Stated roughly, such models can either be good or bad. If a model is good, this means that it generates predictive power by maximizing the likelihood of observations, in which case it is likely to self-sustain as an interpretative framework in the brain from which further inductive learning can evolve. Given an uncertain environment, however, models are often bad and will eventually break. Translated, this means that if observations become increasingly unlikely given a model, then the model must be updated, or new models invented, to continually maximize the likelihood of observations. Ultimately, this fosters a perpetual "model" – "model failure" – "new model" cycle, in which the probabilistic distance between the models defines and measures the concept of negation in our framework. Formulated in this way, and given that cinema is a stochastic environment, negation can be said to play an integral part in its language, which was what we set out to argue in the first place. To showcase the principle of negation at play, I analysed two films. In my analysis of O Último Mergulho, I argued that a large statistical deviation in shot-length was likely to lead a limited univariate observer to apply various mutually negating evaluative models. Similarly, in my analysis of The Duke of Burgundy, I argued that its opening sequence was likely to produce a many-faced and difficult epistemic situation for a spectator who attempts to infer the mood of one of the main characters by means of implicit learning techniques and prior models. While in both cases negation could be said to both impair and foster statistical learning, the latter analysis was an attempt at a multivariate analysis using the proposed logic of Boolean decomposition. As we saw, however, major computational problems arise in the multivariate case. When the network gets big, dependency structures are difficult to delineate, visualize or compute. This explains the dire verdict by Phung et al. (2005) when they argue that the "organization of content in generic videos (e.g., movies) is too diverse to be fully characterized by statistical models" (Phung et al. 2005).
Optimally, an exhaustive function of the evolution of the entire set of conditional probabilities in a cinematic work would be delineated. Unfortunately, while in theory possible, it turns out that such a function is largely intractable to compute using current modelling techniques. But if this suggests, which it might very well do, that the brain cannot compute such functions either, then the upshot is that this merely supports the theory that, when data gets messy and vast, multi-model cognition and negation leads the way.


Literature Abell, Peter (2009). “Singular Mechanisms and Bayesian Narratives” in Demeulenaere, Pierre, (ed.) Analytical Sociology and Social Mechanisms. Cambridge University Press: Cambridge, UK. ISBN: 9780521154352. Anderson, Joseph (1998). The Reality of Illusion: An Ecological Approach to Film Theory. Southern Illinois University Press: Carbondale. ISBN: 0-8093-2196-3. Bateman, John A. et al. (2012). Multimodal Film Analysis: How Films Mean. Routledge: New York. ISBN: 978-0-415-88351-1. Bateman, John A. et al. (2016). ”Towards Next Generation Visual Archives: Image, Film, and Discourse” in Visual Studies 31:2. pp. 131-154. Bazin, André (1960). “The Ontology of the Photographic Image” in Film Quarterly, Vol. 13, No. 4. pp. 4-9. Biederman, Irving (1987). “Recognition-by-components: A Theory of Human Image Understanding” in Psychological Review Vol. 94, No. 2. pp. 115-147. Blumson, Ben (2010). “Defining Depiction” in The British Journal of Aesthetics, Volume 49, Issue 2, 1. pp. 143-157. Blumson, Ben (2014). Resemblance and Representation: An Essay in the Philosophy of Pictures. Open Book Publishers: Cambridge. ISBN: 978-1-78374-072-7. Bordwell, David (1985). Narration in the Fiction Film. University of Wisconsin Press. ISBN: 0-29910174-6. Branigan, Edward (1986). “’Here is a Picture of No Revolver!’: The Negation of Images and Methods for Analyzing the Structure of Pictorial Statements” in Wide Angle 8, 3–4. pp. 3-17. Branigan, Edward (1992). Narrative Comprehension and Film. Routledge: New York. ISBN: 0-41507-511-4. Branigan, Edward (2006). Projecting a Camera: Language Games in Film Theory. Routledge: New York. ISBN: 978-0-415-942539. Buckland, Warren (2000). The Cognitive Semiotics of Film. Cambridge University Press: Cambridge: ISBN: 0-521-78-005-5. Carnap, Rudolf (1937). The Logical Structure of the World. Open Court Publishing: Illinois. ISBN: 08126-9523-2.


Carroll, Noël (2008). The Philosophy of Motion Pictures. Blackwell Publishing: Oxford. ISBN: 978-14051-2025-8. Charniak, Eugene (2016). Language Modeling and Probability. Brown University. Access: https://cs.brown.edu/courses/csci1460/assets/files/langmod.pdf Clark, Andy; Lupyan, Gary (2015). “Words and the World: Predictive Coding and the LanguagePerception-Cognition Interface” In Current Directions in Psychological Science Volume 24 (4). pp. 279-284. Clark, Andy (2017). “Busting Out: Predictive Brains, Embodied Minds, and the Puzzle of the Evidentiary Veil” in NOUS 51 (4). pp. 727-753. Clark, Andy (2017 (II)). “Predictive Processing and the Active Pursuit for Novelty” in Phenomenology and the Cognitive Sciences. pp. 1-14. Crane, Tim (2009). “Is Perception a Propositional Attitude?” in The Philosophical Quarterly. Volume 59. Issue 236. pp. 452-469. Currie, Gregory (1993). “The Long Goodbye: The Imaginary Language of Film” in The British Journal of Aesthetics, Volume 33, No. 3. pp. 207-218. Cutting, James E. (2015). “Shot durations, shot classes, and the increased pace of popular movies” in Projections Vol. 9, Issue 2. pp. 40-62. Davidson, Donald (2001 1984). Enquiries into Truth and Interpretation. Oxford University Press: Oxford. ISBN: 0-19-824617-X. Ehrat, Johannes (2015). Cinema and Semiotic: Pierce and Film Aesthetics, Narration, and Representation. University of Toronto Press: Toronto. ISBN: 0-8020-3912-X. Ellis, Nick C. (2011). “Optimizing the Input: Frequency and Sampling in Usage-based and Formfocussed Learning” in Michael Long and Cathy Dorthy (eds.) The Handbook of Second and Foreign Language Teaching. Wiley Blackwell. ISBN: 978-1-4051-5489-5. Ewerth, Raplh et al. (2005). “Videana: A Software Toolkit for Scientific Film Studies”. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.157.9398 Feldman, Jacob (2016). “What are the True Statistics of the Environment?” in Cognitive Science (2016). pp. 1-33. Festinger, Leon (1957). A Theory of Cognitive Dissonance. Stanford University Press. ISBN: 9780804709118 Fodor, Jerry A. (2005). LOT 2. The Language of Thought Revisited. Clarendon Press: Oxford. ISBN: 978-0-19-954877-4.


Friston, Karl (2010). “The Free-Energy Principle: A Unified Brain-Theory?” in Nature Reviews. Macmillan Publisheres Limited. Friston, Karl (2017). “The Variational Principles of Cognition” in Igor S. Aronson et al. (eds.) Advances in Dynamics, Patterns, Cognition: Challenges in Complexity. Springer International Publishing. pp. 184-210. ISBN: 978-3-319-53672-9. Geman, Stuart (2006). “Context and Hierarchy in a Probabilistic Image Model” in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 2145-2152 Grodal, Torben (2016). “Film, Metaphor, and Qualia Salience” In Fahlenbrach (ed.) Embodied Metaphors in Film, Television, and Video Games: A Cognitive Approach. Routledge: New York. ISBN: 978-1-138-85083-5. pp. 101-114. Gigerenzer, Gerd (2016). “Towards a Rational Theory of Heuristics” in Frantz R., Marsh L. (eds.) Minds, Models and Milieux. Archival Insights into the Evolution of Economics. Palgrave Macmillan: London. Helmholtz, Hermann von (1866). Treatise on Physiological Optics. Optical Society of America. Inoue, Katsumi; et al. (1998). “Negation as Failure in the Head” in The Journal of Logical Programming 35. pp. 39-78. I-Tseng, Chiao (2013). Cohesion in Film: Tracking Film Elements. Palgrave Macmillan: New York. ISBN: 978-1-349-45050-3. Jaynes, E. T. (2002). Probability Theory: The Logic of Science. Cambridge University Press: New York. ISBN: 978-0-521-59271-0. Johnson, Justin et al. (2015). “Image Retrieval Using Scene Graphs” in IEEE Conference on Computer Vision and Pattern Recognition (2015). http://hci.stanford.edu/publications/2015/scenegraphs/JohnsonCVPR2015.pdf Jurafsky, Daniel (1996). “A probabilistic Model of Lexical and Syntactic Access and Disambiguation” in Cognitive Science 20. pp. 137-194. Kahneman, Daniel (2011). Thinking Fast and Slow. Farrar, Straus, and Giroux. ISBN: 9780374275631 Keynes, J.M. (1929). A Treatise on Probability. Macmillan and co. Ltd: London. Krishna, Ranjay et al. (2017). “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. ” Preprint: https://arxiv.org/abs/1602.07332


Knill, David C. (2007). “Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant” in Journal Of Vision, Vol. 7, No. 7. pp. 1-24. Knill, David C. et al. (2012). “Ideal-observer Models of Cue Integration” In Trommershäuser et al. (eds.) Sensory Cue Integration. Oxford University Press: Oxford. ISBN: 978-0195387247. Kovács, András Bálint et al. (2016). “Shot Scale Distribution in Art Films” in Multimedia Tools and Applications, Vol. 75, Issue 23. 9p. 16499-16527. Kripke, Saul (1980). Naming and Necessity. Blackwell-Wiley. ISBN: 978-0631128014. Kukkonen, Karin (2016). “Bayesian Bodies: The Predictive Dimension of Embodied Cognition and Culture” In Garratt P. (eds) The Cognitive Humanities. Palgrave Macmillan, London. Laptev, Ivan et al. (2008). “Learning Realistic Human Actions from Movies” in IEEE Conference on Computer Vision and Pattern Recognition (2008). http://www.irisa.fr/vista/Papers/2008_cvpr_laptev.pdf Loschky, Lester C. (2015) “What Would Jaws Do? The Tyranny of Film and the Relationship between Gaze and Higher-Level Narrative Film Comprehension” in PLoS One 10. pp. 1-23. Lupyan, Gary (2015). “All Concepts are Ad Hoc Concepts” In E. Margolis & S. Laurence (Eds.) The Conceptual Mind: New directions in the study of concepts. pp. 543-566. Cambridge: MIT Press. Mathys et al (2014). “Uncertainty in perception and the Hierarchical Gaussian Filter” in Frontiers in Neuroscience, 8, 825. 1-24. Metz, Christian (1974). Film Language: A Semiotics of the Cinema. Oxford University Press: Oxford. ISBN: 978-3-11-081604-4. Millikan, Ruth Garrett (2017). Beyond Concepts: Unicepts, Language, and Natural Information. Oxford University Press: Oxford. 978-0-19-871719-5. Mumford, David et al. (2006). “A Stochastic Grammar of Images” in Foundations and Trends in Computer Graphics and Vision Vol 2, No. 4. pp. 259-362. Mumford, David et al. (2003). “Hierarchical Bayesian Inference in the Visual Cortex” in Journal of the Open Optical Society in America, Vol 20, No. 7. pp. 1434-1448. Pelowski, Matthew (2017). “Move me, astonish me. . . delight my eyes and brain: The Vienna Integrated Model of top-down and bottom-up processes in Art Perception (VIMAP) and corresponding affective, evaluative, and neurophysiological correlates” In Physics of Life Reviews.


Plantinga, Carl R.; Smith, Greg M. (1999). Passionate Views: Film Cognition and Emotion. Johns Hopkins University Press: Baltimore. ISBN: 0-8018-6011-3. Pouget, Alexander et al. (2012). “Not Noisy, Just Wrong: The Role of Suboptimal Inference in Behavioral Variability” in Neuron 74. pp. 30-37. Pouget, Alexander et al. (2013). “Probabilistic Brains: Knowns and Unknowns” in Nature Neuroscience, Vol 16, No 9. pp. 1170-1178. Putnam, Hilary (1981). Reason, Truth, and History. Cambridge University Press. ISBN: 9780521297769. Quine, W. V. (1995). From Stimulus to Science. Harvard University Press: London. ISBN: 0-67432636-9. Quine, W. V. (1968). “Ontological Relativity” in The Journal of Philosophy, Volume 65, No. 7. pp. 185-212. Redfern, Nick (2011). “Modelling Inference in the Comprehension of Cinematic Narratives” https://www.semanticscholar.org/paper/Modelling-inference-in-the-comprehension-of-cinemaRedfern/811b5ec8b70569f23714f4a3640fefac8b9c582a Russell, Stuart et al. (2010). Artificial Intelligence: A Modern Approach. Prentice Hall: Upper Saddle River. ISBN: 978-0-13-604259-4. Phung, D. Q. et al. (2005). “Topic Transition Detecting Using Hierarchical Hidden Markov Models and Semi-Markov Models” in Proceedings of the 13th annual ACM international conference on Multimedia. pp. 11-20. Rowbottom, Darrell P (2015). Probability. Polity Press: Cambridge. ISBN: 978-0-7456-5256-6. Merabti, Bilal (2015). “A Virtual Director Using Hidden Markov Models” in Computer Graphics Forum 4 (2015), Wiley. Misyak et al. (2011). “Statistical-sequential Learning in Development” in Rebuschat, Patrick et al. (eds.) Statistical Learning and Language Acquisition. De Greuter: Berlin. ISBN: 9781934078235. Sainsbury, R.M. (2005). Reference Without Referents. Clarendon Press: Oxford. ISBN: 978-0-19924180-4. Salt, Barry (1974). “Statistical Style Analysis of Motion Pictures” in Film Quarterly, Volume 28, No. 1. pp. 13-22. Seppälä, Jaakko (2015). “On the Heterogenity of Cinematography in the Films of Aki Kaurismäki” in Projections Vol. 9, No. 2. pp. 20-39.


Schrödinger, Erwin (1978 1924). “Notes on the Problem of Causality” in Maria Reichenbach & Robert S. Cohen (ed.) Hans Reichenbach: Selected Writings 1909-1953, Volume II. pp. 328-332. ISBN: 978-90-277-0910-3. Siegel, Susanna (2010). The Contents of Visual Experience. Oxford University Press. ISBN: 978-0-19530529-6. Smith, Murray (2017). Film, Art, and the Third Culture: A Naturalized Aesthetics of Film. Oxford University Press: Oxford. ISBN: 978-0-198-79064-8. Stocker, A. A. et al. (2007). “A Bayesian Model of Conditioned Perception” in Adv Neural Inf Process Syst. pp. 1419-1416. Thiessen, Erik D. et al (2015). ”Statistical learning of language: Theory, validity, and predictions of a statistical learning account of language acquisition” in Developmental Review 37. pp. 66-108. Tijana, Mamula (2013). Cinema and Language Loss: Displacement, Visuality, and the Filmic Image. Routledge: London. ISBN: 978-0-415-80718-0. Vo, Nam M. et al. (2014). “From Stochastic Grammar to Bayes Network: Probabilistic Parsing of Complex Activity” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014). http://openaccess.thecvf.com/content_cvpr_2014/papers/Vo_From_Stochastic_Grammar_2014_ CVPR_paper.pdf Wildfeur, Janina (2014). Towards a New Paradigm for Multimodal Film Analysis. Routledge: New York. ISBN: 978-0-415-84115-3. Wittgenstein, Ludwig (2009). Philosophical Investigations. Blackwell Publishing Ltd.: Oxford. 978-0415-5929-6. Worth, Sol (1984). “Pictures Can’t Say Ain’t” in Larry Gross (eds.) Studying Visual Communication. University of Pennsylvania Press. ISBN: 0-8122-1116-2. Wyeth, Peter (2017). The Matter of Vision. Indiana University Press. ISBN: 978-0-861-96712-4. Zeki, Semir (2014). “Non-binding Relationship Between Visual Features” in Frontiers in Human Neuroscience, Vol 8, Article 749. pp. 1-9. Zwaan, Rolf A. (2014). “Embodiment and Language Comprehension: Reframing the Discussion” in Trends in Cognitive Sciences, Vol. 18, No. 5. pp. 229-233.


Appendix

Appendix 1: In this first appendix, we review a few new currents within the formal side of film theory. They are included here to provide a sense of how film theory currently tackles the formalization of the medium. However, since these studies are not of vital importance to our own approach, which has a finer granularity of analysis and is thus closer in affinity to the approach within machine learning, I present the material here in the appendix, for the interested reader.

A review of earlier approaches to modelling of visual data

In this appendix, we provide an overview of film scholars who have attempted to answer questions pertaining to the formal or structural nature of cinema, video, or images in general, and who have thus, in various ways, engaged in a process of modelling. Currently, one can observe two general strands of formal approaches to modelling of visual data. The first approach traces stylistic principles and patterns under observation in various works, traditions, or historical epochs. This, I believe, is what David Bordwell would term a "poetics of cinema". The second and more linguistically inspired strand pursues the effort of deriving general systemic rules and logics of film structure and organization. In the following, we will review these tendencies in brief.

Cinemetrics and the mathematical analysis of stylistic variables

To give history its due, this section should start with the name of Barry Salt, who, with his seminal paper "Statistical Style Analysis of Motion Pictures" (1974), initiated the field of formal film studies. While a lone wolf in the field for decades, Salt has recently become part of a considerably more unified attempt at answering formal questions relating to cinema through the concept and study of cinemetrics. The approach, first and foremost interested in analysing patterns in shot-duration over large corpuses of films, has gained traction and expanded in recent years through the work of James E. Cutting and others. One might for instance encounter the study "On the Heterogeneity of Cinematography in the Films of Aki Kaurismäki" by Jaakko Seppälä (2015), which traces stylistic regularities and irregularities between the works of the Finnish director over variables such as cutting rate, type of shot, shot relations, relations between shot type and shot length, and other factors, or a broader study such as "Shot Durations, Shot Classes and the Increased Pace of Popular Movies" (Cutting et al. 2015). As embodied in the latter study, the quantification or formalization of cinema has usually been used to quantify differences between works. As a general rule, numbers seem to speak the most when situated in what we might call comparative networks. This is probably the main reason why the larger part of formal studies analyses only a few variables, but does so over a wide range of films. When I say 'a few variables,' what I really mean is 'a few continuous variables.' A continuous variable is, for instance, shot-length, because it can take on an infinite number of values. A melon, say, or a car, on the other hand, is a categorical variable. Categorical variables are practically never studied formally, and perhaps for good reason. When they are, they take the form of extra-diegetic universal constituents of the moving image such as shot classes. Since these variables appear naturally in every film, comparison of variables across films comes naturally. The main text of the thesis indicates an attempt to introduce a methodological


framework for a multivariate analysis of categorical variables. As noted, it was done without computer power, which is otherwise widely used in the field of cinemetrics.

Computer technology and visual feature extraction

Most of the recent studies in the field of cinemetrics use various machine learning algorithms and computer technology to automate, as much as possible, the desired measurements. Thus, the study "Shot scale distribution in art films" by Kovacs et al. (2016) uses computer vision and so-called random forest algorithms to detect shot scales in films by Michelangelo Antonioni. The paper shows that, by means of machine learning, a computer is able to predict, with a certain degree of accuracy, the shot scale structure of a film it has not yet analysed. The process is one of transduction, which basically means inductive training on test cases which is then applied to specific cases for analysis. For continuous variables, transductive training can be performed with little noise. However, for categorical variables, the task becomes exponentially harder. A software toolkit called Videana has been developed to automatically perform "shot boundary detection, camera motion estimation, detection and recognition of superimposed text, detection and recognition of faces in a video, and audio segmentation" (Ewerth et al. 2005: 1). Some have tried to go even further. In a paper by Bateman et al. (2016), the approach is extended to extract significant visual patterns by using a so-called SIFT algorithm which "detects highly structured regions in images… believed to comprise more important information than unstructured image regions" (Bateman et al. 2016).

[Figure: automatic detection of structured image regions (Bateman et al. 2016: 140).]

Such extraction comprises the "general approach of decomposing recognition into freely recomposable bundles of properties" (Bateman et al. 2016: 141). Comparing the similarity of the extracted properties allows for similarity observations, which in turn allow the viewer to form inferences about proper segmentation. On a timeline, a computer can thus, with a certain confidence, track similarities across scenes:

[Figure: tracking similarities across scenes on a timeline (Bateman et al. 2016: 144).]

The advantage gained by using automation is not only that it is in some respects more precise; it also significantly reduces manual labour and makes modelling of large numbers of films possible. In a similar vein, Cutting et al. have used a combination of computer vision and the computer program Matlab to produce a metric of movement in cinema – camera movement as well as diegetic movement of actors etc. – called VAI (abb. of Visual Activity Index). In an earlier paper, I analysed a corpus of films from the catalogue of a local cinema, using a shot detection algorithm and the mentioned VAI algorithm, to study whether the screened films had become more visually dramatic


over the years. I compared, for instance, how visual activity in the film affects shot structure. Let us take a short look at the structure of Park Row (1952) directed by Samuel Fuller:

[Figure: two graphs for Park Row (1952): visual activity across the film (left) and shot length (right).]

One thing to note is that, around 2/3 through the film, the visual activity seems to peak. The right graph tells us that, simultaneously, the shot length increases visibly (the bump on the right graph). Similarly, in the first five shots or so, the activity is very high as well. The same structure is seen on the right: for the first five shots or so, the graph is quite steep relative to the general regression, i.e., the shot-length is relatively high. This inverse relationship between activity and shot-length in this particular film could form the basis of a tentative hypothesis that such tendencies might be present in other films by the same director as well. Obviously, one would have to support such bare single-variable data by looking at the film itself in order to carry the analysis further. The point is that data visualisations might reveal information you might not otherwise have paid attention to. The studies above have, by dealing only with stylistic variables, not concerned themselves with the formalization of narrative mechanisms. In the next section, I will review a recent tradition which has as one of its aims to formalize the structure of cinema more broadly.

Structural-linguistic attempts and non-statistical modelling

Within cinema, there are a few scholars, primarily professors of linguistics such as John Bateman (2011), Janina Wildfeur (2014), and Chiao i-Tseng (2013), who, in publication, have pondered extensively on the intricacies and possible benefits of going further down the road paved by the field of cinemetrics by means of putting a more structural and semiotic, rather than mathematical, view at the centre of attention. John Bateman articulates their problem-goal set as follows: "Finding these 'systems' that contribute to filmic interpretation and provide sense to the stream of technical features deployed has proved itself to be a major stumbling block for taking film analysis further. More detailed characterizations of the "formal play of differences" underlying every "process of signification" have proved particularly resistant to isolation: it is simply unclear which groupings of filmic technical features will do the job. Without systems of contrasts, there are no formal 'differences' out of which processes of signification can grow and the entire


process does not get off the ground. We see this as one, perhaps the, central problem of film analysis”1 (Bateman 2011: 18) The common goal of the scholars of the multimodal, discursive tradition is to find the elements that link shots together and thus allow viewers to find meaning through perceived cohesion. While Bateman focuses on syntagmatic and paradigmatic discourse structures, Wildfeuer seeks to construct necessary logical forms, and i-Tseng purports to build so-called identity-chains. Common to their approach is that the focus is, rather than on the film material itself, to classify and designate clusters and, more importantly, relations. They thus engage in a form of mapping between preordained concepts and cinematic structure, with the consequence that the studies in this tradition take on a more visibly deductive bent. Chiao-I Tseng mentions that the overall contribution is directed at providing “methods for the future development of an empirically-grounded framework for film discourse” (Tseng 2013: 155). However, as Bateman says, tied to their focus on meaning is a focus on more high-level salient features in the image: “We do not consider… individual visual components of films that are no doubt necessary for low-level perception to do its job (such as corners, changes in brightness, texture, etc.), since we do not have unmediated access to this level of perception; we see what is shown, not uninterpreted moving patches of shade and light but houses, dogs, cars, people, dinosaurs and three meter tall blue aliens and the activities any of these may participate in.” (Bateman 2011: 163) Is such a restriction a shortcoming? One should perhaps not a priori exclude that low-level perception guides higher forms of perception and to a certain extent can be autonomous; the question of which variables and features of an image are salient, and which one can marginalize over, is a fundamental one. We will return to this question in our chapter on depiction. In the following, to restrict myself, I will give a short review of one of the methods of three studies mentioned above. Wildfeur’s so-called logical forms, in schematic form, looks like below. The shots are from Das Leben der Anderen:

[Figure: a logical form for two shots from Das Leben der Anderen (Wildfeur 2014: 186).]

1. The notion of cinema as a 'formal play of differences' was coined by Branigan (1984: 29) as a way to describe the fuzziness of film signification.


Wildfeur calls a schema such as the above a logical form, meant to provide evidence of a film's construction principles and cohesive chains across shots. As the argument goes, such observed uniformities help guide comprehension. Above, 'piano' and 'hands' create the eventuality 'play', whereas 'Wiesler' and 'piano playing' create the eventuality 'listening.' Wildfeur develops a full-bodied logical system of the possible ways in which shots can relate to one another, but it is too expansive to be studied here. Let it suffice to say that, in the case above, the relation between the two shots is said to be of the form 'parallel', since, Wildfeuer argues, the camera-pan which moves in the same direction across shots, as well as the sound-bridge, creates coherence between two otherwise disparate shots. Thus, a pattern is recognized and formalized. When, in the next few shots, Wiesler begins to cry, the relation shifts to one of 'result.' While in spirit our approach will have a lot in common with such a logical approach, in method it will lean closer to the formal methods of the former sub-section. From the viewpoint of our statistical and inductive perspective, the "formalization by means of mapping" approach leaves a few things to be desired. By partitioning film material into abstract and artificially created classifications and clusters, rather than explicating the structure of visual elements depicted in the film, a certain granularity and veridicality is lost. One does not see 'a sequence' as one watches a film, unless perhaps one is a film scholar. The argument in this paper is that, if film is not a traditional language, but more like a statistical (or stochastic) language, then we must treat it as such. If the point of discourse-analysis is to classify things abstractly, then such a method is not veridical or precise enough for our purposes. If we combine the attempts so far, a tendency emerges. While the study of cinemetrics formalizes mathematically over a few continuous non-diegetic variables, multimodal discourse studies formalize in structurally descriptive and non-mathematical ways, but do so over a larger set of both non-diegetic, diegetic, continuous and categorical variables. As should be clear from the thesis, my contribution is an attempt to combine the two approaches. One field which also has as its aim to combine mathematics with conceptual analysis is the field of Machine Learning, to which we now turn.

From cinema studies to machine learning: images parsing and video content modelling

If humans naturally 'read' images, then why is that so, and how is that so? How have they learned to automatically predict the causes of stimuli with generally little loss? Within machine learning, the question is asked in the form: how do we get a computer to understand images? Such a question has led to an abundance of studies on the modelling of visual material. While employing a host of different methods, the studies converge on the simple fact that every form of modelling requires a principled model of some type of structural decomposition. This entails the view that images possess compositional architecture, and that they are canonically decomposable to a probabilistic degree. The philosophy of these studies is thus much in line with the view we propose in this paper. In the words of Geman et al. (2006), "the principle of compositionality holds that humans perceive and organize information as a syntactically constrained hierarchy of reusable parts" (Geman et al. 2006: 2). The term reusable part is somewhat of a commonplace within the machine vision literature. A reusable part is a visual property which has a non-unique quality that allows it to engage in various configurations. From a vocabulary of reusable parts, various visual configurations can be built. As


such, “a possibly small number of reusable parts might be sufficient to compose the large ensemble of shapes and objects that are in the repertoire of human vision” (Geman et al. 2006: 1) If Geman et al. are right, one could imagine an image browser containing the space of all possible valid visual configurations. In fact, such databases already to a certain extent exist. ImageNet, Visual Genome, and Lotus Hill are all examples hereof. If reliable, such datasets can help construct tentative probability distributions of the configuration of our visual world. Johnson et al. (2015), for instance, portrays “the 150 most frequently occurring (object, relation- ship, object) and (object, attribute) tuples” (Johnson et. al. 2015: 5.) Krishna et al. (2017), who participates in the Visual Genome project, names this type of formalization of visual data “relationship statistics (…) if a man is swinging a bat, we write swinging(man, bat)” (Krishna et al. 2017: 27) The idea is that ‘swinging’, ‘man’, and ‘bat’ are reusable parts which can engage in other valid configurations according to some probabilistic production rule. The idea of reusable parts does not entail a strict object-focus. Even given the visual entropy of, say, a Stan Brakhage film, such that one can arbitrarily reconfigure the image without altering its semantic function drastically, the image is still made up of low-level geometric reusable parts. Reusable parts can thus occupy any spot on the low-level to high-level spectrum of vision, from pixels to semantic concepts and objects. According to Mumford and Zhu (2006), one major issue which kept image grammar tendencies from establishing itself in the 70’s was the semantic gap between pixels and symbols. The general problem is one of connecting low-level visual properties with high-level semantics. While today, Machine Learning algorithms have largely solved this particular problem, many problems continue to exist. We might at this point decompose the formalization problem into three sub-problems. The first is related to the low-high mapping referred to above which we might term vertical or hierarchical organization – such that a pixel is part of a visual cluster which is part of a finger which is part of a hand which is part of an arm, which is part of a part of a picture, and so forth; a second and perhaps more computationally demanding, not to say intractable problem is to model horizontal or temporal organization and dependencies over potentially high-dimensional space and long time-scales – such that a hand moves towards a glass and moves it towards the mouth, associated with the body, to which the hand belongs, etc. The third and last problem consists in formalizing and modelling the two aspects in combination. To deal with these issues, the literature standardly applies a type of computational and/or probabilistic graphical model. Often, some type of dynamic Bayesian network, such as augmented Markov models, stochastic grammars, or neural networks are typically employed. Since these models can be quite complex, I have chosen to present (excluding neural networks) this material in Appendix 2. In the following section, I will thus deal mainly with the purely formal (and not inferential) part of the approach. We will start with the formalization of the hierarchical configuration of images. A first general step in the process of formalization of visual data is one of decomposition of the material into reusable parts. 
Mumford describes the process in the following way: “any visual pattern can be conceptualized as a statistical ensemble that observes a certain statistical description. For a complex object pattern, its statistical ensemble must include a large number of distinct configurations. Thus our objective is to define an And–Or graph, thus its image grammar, such that its language, i.e., the set of valid configurations that it produces, reproduces the ensemble of instances for the visual pattern.” (Mumford 2007: 321). We will define an And-Or graph anon, but first, consider the image below, representing a decomposition of a football match scene:


[Figure: hierarchical decomposition of a football match scene. Source: Mumford/Zhu 2006: 262]

The image below generalizes the picture above by depicting one possible configuration and its path from high-level to low-level resolution parts:

[Figure: hierarchical, compositional image model built from reusable "bricks". Source: Geman et al. 2006: 2]

The graph above depicts a hierarchical and compositional image model. The lowermost white circles represent pixels. All other circles are hierarchically different "bricks" of the image, as in "LEGO bricks"; these represent reusable parts. Any brick can be on or off and has a probability vector associated with it: red signals on, white signals off. Together, the red circles specify a configuration which represents an interpretation of a visual scene. The terminal bricks are the lowermost semantic parts, while the broken ovals signify the possible valid children of a brick; these limit the number of compositions. See Geman et al. (2006) for further details. The figure above forms a so-called stochastic context-sensitive grammar and can be represented as an and-or graph. Here, or-nodes choose a specific abstract component of the image (such as the hands on a clock) – and thus denote the vocabulary – whilst and-nodes specify an actual possible composition or instantiation (often termed a production) of the or-node (two hands or three hands on the clock). As such, each or-node has attached to it a probability distribution. More precisely, an and-or graph is defined as the 6-tuple (S, V_N, V_T, R, Σ, P):

S: the root node
V_N: the set of non-terminal and-nodes and or-nodes
V_T: the set of terminal nodes, including primitives, parts, and objects
R: the relations between nodes (in the clock example, this might be an adjacency relation plus a causal relation between the minute hand and the second hand)
Σ: the set of all valid configurations on the graph
P: the probability model, specified as the product of the (probabilistic) frequencies at each or-node

An and-or graph can quickly become huge if a low-resolution image-analysis strategy is chosen. Lin et al. (2009) can be consulted for examples of actual decompositions of specific objects. For the general representation of an and-or graph, see below:

[Figure: general representation of an and-or graph. Source: Mumford, Zhu (2003): 270]

Depending on the resolution of the analysis, the terminal bottom nodes might be pixels, parts of objects, or even objects. Mumford et al. operate with three levels: image-primitive configurations, part-object configurations, and the highest object level (concepts). The terminal nodes might also be actions, in case one models moving images. In event modelling, an action is decomposed into action primitives (Bimbo 2010: 283), which are then bundled together to form a trajectory. The lowermost level in the graph thus represents these atomic actions (see Vo et al. (2014) and Laptev (2008) for further information). A large part of the studies above, and computation by image primitives in general, refers to Irving Biederman's Recognition-by-Components: A Theory of Human Image Understanding (1987) and its theory of primitive visual geometric elements called geons. Biederman's idea is that, with just 36 geons and five possible relations at the edges between geons (curvature, collinearity, symmetry, parallelism, and cotermination), some 154 million possible three-component objects can be created, well exceeding the number of objects human beings are known to know (likely well under 100,000). In the image below by Mumford et al. (2006), one can get a glimpse of the computational use-value of modelling with image primitives.


[Figure: reconstruction of an input image from image primitives. Source: Mumford et al. 2006: 303]

The input image is thus recursively reconstructed from the bottom up. To reiterate, we will not perform our analysis at this very low-resolution level. Operating at this level introduces a set of further problems which we gladly dodge. As Mumford mentions, "Images not only have very regular and highly structured objects which can be composed by production rules, they also contain very stochastic patterns, such as clutter and texture which are better represented by Markov random field models" (ibid.: 295). Since we are interested in knowledge accumulated in time across scenes and images, such resolution issues are not decisive in our case. Logically, the resolution level decides the dictionary or vocabulary; a mouth seen at a distance might not have any configurations at all, since it is only barely visible, whereas a close-up of a mouth might have a whole range of configurations generated by, for instance, emotional states. These primitives are so detailed that they need have no name. As Mumford and Zhu mention, "each non-terminal node may exit the production for a low resolution case" (ibid.: 218). Our dictionary is going to be objects or reusable object parts, for the simple reason that no computer modelling will be employed in this paper.
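Although no computer modelling is employed in the thesis, the and-or graph formalism reviewed above can be made concrete with a small sketch. The node names, the toy clock vocabulary, and the branching probabilities below are all invented for illustration.

```python
import random

# A toy and-or graph in the spirit of the 6-tuple (S, V_N, V_T, R, Sigma, P).
# Or-nodes choose among alternative productions according to a probability vector;
# and-nodes decompose into all of their children.
or_nodes = {
    "CLOCK": [(0.7, ["TWO_HANDS"]), (0.3, ["THREE_HANDS"])],   # alternative compositions
}
and_nodes = {
    "TWO_HANDS":   ["hour_hand", "minute_hand"],
    "THREE_HANDS": ["hour_hand", "minute_hand", "second_hand"],
}
terminals = {"hour_hand", "minute_hand", "second_hand"}

def sample_configuration(symbol):
    """Recursively expand a symbol into one valid configuration of terminal parts."""
    if symbol in terminals:
        return [symbol]
    if symbol in or_nodes:
        probs, alternatives = zip(*or_nodes[symbol])
        children = random.choices(alternatives, weights=probs)[0]   # pick one production
    else:
        children = and_nodes[symbol]                                 # expand all children
    return [part for c in children for part in sample_configuration(c)]

print(sample_configuration("CLOCK"))  # e.g. ['hour_hand', 'minute_hand']
```

The set of strings such a sampler can produce corresponds to the language Σ of the graph, and the branching probabilities at the or-nodes correspond to the probability model P.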

Appendix 2: The mathematics of rational inference and stochastic modelling

This second appendix reviews the mathematics of rational inference in depth. It focuses in particular on Bayesian statistics and Markov modelling, which are distinct but related forms of probabilistic modelling. To start off, let us for simplicity assume that some sensory data is given to an observer in a cinematic situation. Let that sensory data be a close-up of a face holding a certain expression. As the canon goes, in order to minimize free energy, any observer is likely to reason about the mechanism which generated the observed data; that is, the observer seeks to establish a generative model. A generative model is a model which attaches values (probabilities) to all states inferred by an observer. States are traditionally split into hidden states and observables, and the hidden states are said to generate the observables. In the Bayesian view, the hidden states are often simply called hypotheses, the reason being that they are always a product of subjective inference. Let us say the expression on the face is a smile, and assume this as an observable. The hidden state which has generated this observation might be the emotional state "happy." In theory, however, an observer can choose freely what to treat as hidden and what as observable. Thus, if we decrease the granularity of analysis, we might say that the particular "expression," given by some visual attributes, represents the observable data, whereby the "smile" becomes an inferred hidden state of the image. What is the probability that the concept "smile" has generated the particular expression? According to the metrics of probability theory, this probability is given as the joint probability of the two elements divided by the overall probability of the occurrence of the expression. If the denominator and the numerator are equal, then the expression only ever occurs together with a smile, and the probability is 1. If we formalize, we obtain:

p(smile | expression) = p(expression and smile) / p(expression)

In the same way, the probability that exactly this expression occurs, given that it is true within a model that someone smiles, is:

p(expression | smile) = p(expression and smile) / p(smile)

These are essentially formulas for conditional probability, a central concept of probability theory which represents the probability of one event occurring given that another has occurred. This measure captures, more informally, the causal strength between the observed data (the expression) and the hidden data (the smile), which, to reiterate, represents the hypothesis under inspection. A second important concept of probability theory is expressed in the numerator of the above equation, namely joint probability, which can be generalized by rewriting the equation above as the chain rule:

p(x_1, ..., x_n) = p(x_1 | x_2, ..., x_n) · p(x_2, ..., x_n)

This equation allows an analyst to calculate the total (joint) probability of any multivariate event – such as a scene in a film. Looked at in one way, any event is unique. But any event is also likely to consist of parts which are not unique in themselves. Thus, the event can be decomposed into parts and dealt with probabilistically. The concept of joint probability, however, is less important to our study than that of conditional probability, the reason being that conditional probability better captures the process of figuring out what data to condition on and what to leave out. Luckily, the expression above can be rewritten using only conditional (and marginal) probabilities. As stated, we want to know the probability that the expression is a smile, that is, we want to know p(smile | expression). By rewriting the joint probability as p(expression and smile) = p(expression | smile) · p(smile) and inserting this into the first equation, we obtain:

p(smile | expression) = p(expression | smile) · p(smile) / p(expression)

The equation above is essentially Bayes Theorem, or Bayes Rule, as seen through a specific example. If we generalize, we obtain:


p(H | D) = p(D | H) · p(H) / p(D)

Here, D is the observed data, H is the hypothesis being tested, p(D | H) is the likelihood, and p(H) is the prior probability of the hypothesis. p(H | D) represents the posterior probability of the hypothesis being true, given the new data. As Karl Friston says, "the posterior density is the probability of causes after seeing their consequences" (Friston 2017: 194). In the example above, p(smile) is the prior probability of the hypothesis smile before the observation of the particular expression took place, p(expression | smile) represents the probability of seeing the observed expression given that the hypothesis smile is true, and p(smile | expression) represents the probability of the hypothesis given the new data (the expression). One might say that the above equation only makes sense when somebody is testing a hypothesis in time. As such, the probability of the hypothesis smile might, before the smile, be close to zero. Once the person starts actually smiling, the data changes, and the probability changes according to the above formula for rational belief. Thus, if the observer observes the formation of the smile, the data (the expression) is a vector, and the belief change would strictly have to be calculated as a continuous integral. In this thesis, however, we will discretize such continuous functions in order to simplify. The process is one of figuring out the generating mechanism of observed data. But there may be more than one hypothesis at play as regards what that generating mechanism is. The probability of the data, p(D), is then given by the sum of its probability conditional on each of the various hypotheses at play, weighted by their priors:

p(D) = Σ_i p(H_i) · p(D | H_i)

This is the Rule of Total Probability. If we insert this into Bayes Theorem, we obtain the full expression:

p(H_i | D) = p(D | H_i) · p(H_i) / Σ_j p(H_j) · p(D | H_j)

This equation tells us how one hypothesis holds up against another relative to emergent data. Note here that while the likelihood function p(D | H_i) need not sum to one, any probability distribution, such as the posterior distribution p(H_i | D), must sum to 1. As a general rule, any observer will seek the hypothesis that maximizes the probability of the observed data, which we may write:

best estimate = arg max_{H_i} [p(D | H_i) · p(H_i)]

This amounts to maximizing the (prior-weighted) evidence for the hypothesis. However, watching a film, a constant stream of images is put forward. It is the job of any spectator to figure out what data to condition on relative to whatever hypotheses he or she – or his or her brain – is interested in. For this reason, many models may be applied. The problem is, however, that in a big, multivariate space, conditioning can quickly grow enormous due to the number of possibly relevant pieces of information relative to each hypothesis. Thus, with just three pieces of data A, B, and C, we can condition A on B, A on C, and A on BC. Likewise, we can condition B on A, B on C, and B on AC, and so forth. Now, suppose we have a spectator interested in weighing the hypothesis that a character is "happy" at time t against the background of all information gathered up until time t. Clearly, any spectator is likely to treat some elements as having a greater influence on the hypothesis than others. One would need a good reason to condition the state "happy" on, say, a particular chair, unless that chair for some good reason is deemed to have causal powers over the character's mental state. Suppose that the data A, B, C is deemed by the spectator to have a somewhat dependent relationship to the character's emotional state, that is, it is deemed to be part of the generating mechanism of the hidden state. Suppose then, for example, that A is the presence of a friend, B a certain dress, and C a location, say, the beach. The probability of H conditioned on the other data available is then given by the following expression, which is expanded using the rule of conditional probability (the chain rule):

p(happy | friend, dress, beach) = p(friend, dress, beach, happy) / p(friend, dress, beach)

= [p(friend | dress, beach, happy) · p(dress | beach, happy) · p(beach | happy) · p(happy)] / [p(friend | dress, beach) · p(dress | beach) · p(beach)]

Imagine a situation where the character sits with her back to the camera. In such a case, it is necessary to infer her mental state by conditioning on other variables present. We could carry out the decomposition of the equation further, but there is perhaps no need to, since we might not have all the data that would allow us to make use of such an equation. The above specifies an ideal situation in which one has the data needed to use such a full-fledged model. As this is often not the case, simpler models will have to be chosen. Much of probabilistic reasoning consists in figuring out what simplifying assumptions can be made to arrive at best-bet approximations. One option is to treat the three variables as independent cues, by which we achieve the rough simplified model:

p(happy | friend, dress, beach) ≈ p(happy | friend) · p(happy | dress) · p(happy | beach)

Naturally, such an expression is less computationally prohibitive. But imagine now a spectator who only carries knowledge about the character's general emotional state and her state when she is with this particular friend. Or suppose that a spectator decides that the dress and the beach carry no causal impact on her mood, and thus marginalizes them out. Then the model becomes:

p(happy | friend, dress, beach) ≈ p(happy | friend) = p(friend, happy) / p(friend) = p(friend | happy) · p(happy) / p(friend)
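To make tangible how different conditioning choices yield different estimates, here is a small numeric sketch; all probabilities are invented and simply stand in for a spectator's assumed internal statistics.

```python
# Invented spectator statistics for the running example.
p_happy = 0.4                     # prior p(happy)
p_friend = 0.5                    # marginal p(friend present)
p_friend_given_happy = 0.8        # p(friend | happy)
p_happy_given_dress = 0.45        # p(happy | dress)
p_happy_given_beach = 0.6         # p(happy | beach)

# Model 1: condition on the friend only, via Bayes' rule:
# p(happy | friend) = p(friend | happy) * p(happy) / p(friend)
m1 = p_friend_given_happy * p_happy / p_friend

# Model 2: the rough "independent cues" shortcut used above,
# multiplying the separate conditional estimates.
m2 = m1 * p_happy_given_dress * p_happy_given_beach

print(m1, m2)  # different models, different estimates of the same hidden state
```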

The point is that a host of models can be applied, each in essence rational, but each resulting in different estimations about observable data. Notice that, so far, we have only considered conditioning on data which is present in the image under scrutiny, but the principles remain the same no matter what model is applied. The process of reducing a network to its dependent constituents ties neatly into the next subject, namely that of Bayesian networks.

Tractable inference and Bayesian networks
Expressed simply, a Bayesian network specifies dependencies and independencies in a network. Given assumed independencies, inference in large networks can be made tractable. These (in)dependencies are typically depicted in a so-called Directed Acyclic Graph. To take a simple example, consider the following Bayesian network, here leaving out the associated probabilities for simplicity:

The lines represent explicit dependencies from top to bottom. A conditional probability is specified for every dependent relation in the network. Thus, the joint probability is not given by exhaustively conditioning on every possible variable, but by conditioning only on variables deemed to have a dependent relationship to the hypothesis. These variables are called the parents. In equation form, we can express the probability of such a network as:

p(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} p(X_i | X_{Π_i})

where Π_i denotes the set of parents of each X_i. The idea is that eliminating everything but the parents of the variable under inspection not only increases computational efficiency but also accuracy. Since we are after all mainly dealing with time-series prediction and modelling (cinema is of course a temporal medium), we should introduce the aspect of temporality or dynamicity into our system of dependencies. A Dynamic Bayesian Network does exactly that. The network above can be made dynamic quite simply by replicating its structure over two time steps and adding the extra parameter of transition probabilities. Modelling temporal (time-series) data, however, brings significant difficulties to the table. When the network gets big, dependency structures are difficult to delineate, visualize, or compute. The dire verdict by Phung et al. (2005) is that "organization of content in generic videos (e.g., movies) is too diverse to be fully characterized by statistical models" (Phung et al. 2005). One of the major problems is thus to represent dynamic relationships over large time-scales. As with non-temporal visual material, the process is one of connecting image to language through decomposition and (re-)composition. In the literature, largely two approaches are used to model and handle such temporal video data in a formal way: 1) stochastic grammars, for instance context-free (or context-sensitive) stochastic grammars, and 2) a special subset of the dynamic Bayesian network (DBN), such as the hidden Markov model (HMM) or augmented HMMs. Below, we describe how to use the Markov model to model time-series data such as a film sequence, as well as its shortcomings.

Time series modelling and Markov models
There is a close connection between the above and the so-called Hidden Markov Model (HMM). In this model, some hidden state is inferred from a given observation. But there is an important difference between the Bayesian approach and the Markov approach: in the Markov approach, the prior is not static but dynamic. Thus, we operate with an additional probability factor called the transition probability. The Markov and Bayes frameworks can still be combined, however, which we will do in the section on inference in the HMM. The Markov model is the simplest version of a Dynamic Bayesian Network (DBN). Stated simply, it models state changes in a time series – such as the change of location in a film – and the associated probabilities of those state changes. The most often used variant of the Markov model is the Hidden Markov Model, which has been used extensively in image processing and speech recognition, and increasingly in other applications, such as the modelling of cinematic sequences. The image below by Merabti et al. (2015) is a standard representation of an HMM, here used to represent a model of a cinematic sequence in the Markov framework:

[Figure: standard representation of an HMM applied to a cinematic sequence. Source: Merabti et al. 2015]

In the above model, the S's represent the so-called hidden states, while the e's represent the observables, which are said to be generated by the hidden states. In general, the observable depends on the state, but the state is independent of the observable. Further, a strong assumption – called the Markov property – is built into the model: the current state depends only on the immediately preceding state. We will come back to discuss the validity and consequences of this assumption at the end of the chapter. Let us consider the same example which we used earlier. Suppose a spectator is interested in modelling a character's mood in a film. Naturally, the hidden states will then represent the character's mood at time t, which is unknown. At a given time-step in the cinematic time-series, the hidden state might be "happy." This state is then said to generate, with a certain probability, an observable at that time-step, say, a "smile", which is known. So far, the approach is similar to the general Bayesian approach. The idea is, then, that each state transition – as depicted in the above image – and each observable has a probability associated with it according to a specified model, which is simply the HMM. This HMM could, for instance, represent a spectator's internal model of the (probabilistic) relationship between observations and hidden states. More formally, the model can be represented by two matrices – an observation (emission) matrix and a transition matrix – plus a prior probability vector over the initial state. The observation matrix, specifying the so-called emission probabilities, denotes the probability of the observation given the state, and the transition matrix specifies the probability of transitioning from one state to another. All in all, the language of an HMM can thus be described by the 5-tuple (states, observations, emission probabilities, transition probabilities, initial state probabilities). A transition matrix for a character's emotional state might look like this:

[Table: example transition matrix between emotional states; the figures are fictional]

The probabilities here represent the probability of transitioning from one emotional state into another; the numbers are merely fictional. The model might represent a given viewer's internal model, or it might be a model learned from a set of observed data. The emission matrix takes the same form but specifies the probability of observations given the states. Together, these two matrices specify the parameters of the HMM. Three problems are typically solved with HMMs, respectively named the evaluation, decoding, and learning problems. If a string of visual symbols or observations is given, and an HMM is specified, you have an evaluation problem. The evaluation problem consists in finding the probability that the internal model has generated the observed data. One could argue that the evaluation problem resembles a primary task of the brain during cinema watching. At each point in time, the brain has some prior model, and it has a set of observations which evolves in time. Given this, its task is to evaluate the observations given the model in the form of the question: what is the probability that my internal model X has generated the observed data? At the start of a film, the audience does not have a lot of data, so one could argue that the problem initially is largely one of evaluation. But what if we assume that our observer knows the states and simply wants to track the observables and learn their structure, that is, to familiarize itself with the language as it evolves? Then the problem is one of learning, in which case the job is to discover and update the proper transition and emission probabilities as time unfolds. In practice, the learning problem and the evaluation problem will intertwine and coexist. The decoding problem concerns finding the most likely sequence of states, which, since our focus is on negation, is of less relevance. Having now informally introduced the Markov model, we turn to the mathematics of how to perform inference in it.
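A minimal sketch of such an HMM for the mood example might look as follows; the states, observations, and all probabilities are fictional, chosen only to show the shape of the parameters.

```python
import numpy as np

states = ["happy", "sad", "angry"]          # hidden states: the character's mood
observations = ["smile", "cry", "frown"]    # observables emitted by the states

# Transition matrix A: A[i, j] = p(state j at time t | state i at time t-1). Rows sum to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

# Emission matrix B: B[i, k] = p(observation k | state i). Rows sum to 1.
B = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.3, 0.6]])

# Prior probability vector over the initial state.
pi = np.array([0.5, 0.3, 0.2])

print(A.sum(axis=1), B.sum(axis=1), pi.sum())  # sanity check: rows and prior sum to 1
```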


Inference in the Hidden Markov Model
We will look at how probabilistic reasoning functions over time. The evaluation problem is solved by what is called the forward algorithm. The idea behind the forward algorithm is to evaluate the probability distribution over all possible states at time t, given all past observations, by using the Markov assumption. It is thus a form of epistemic updating mechanism. Let the state be x and the observations y. The Markov assumption then says that, given x's and y's parents, x and y are independent of the past. More formally, x_t depends only on x_{t-1}, and y_t depends only on x_t. Using this, we can express the probability of the current state given the evidence up to and including time t:

p(x_t | y_{1:t}) = p(x_t | y_t, y_{1:t-1})

Using Bayes rule and the Markov assumption, we obtain:

p(x_t | y_{1:t}) = p(y_t | x_t) · p(x_t | y_{1:t-1}) / Σ_{x_t} p(y_t | x_t) · p(x_t | y_{1:t-1})

The denominator is a normalizing constant. If we absorb it into a constant α, we can simplify:

p(x_t | y_{1:t}) = α · p(y_t | x_t) · p(x_t | y_{1:t-1})

The second term is a prediction (since it conditions on the past evidence), while the first term is an update (it conditions the current evidence on the current state). The first term is thus directly derived from the emission matrix. The second term, being a prediction, must be summed over all states at the previous time-step. Mathematically, it looks like this:

p(x_t | y_{1:t}) = α · p(y_t | x_t) · Σ_{x_{t-1}} p(x_t | x_{t-1}) · p(x_{t-1} | y_{1:t-1})

This is the forward algorithm. The first term within the summation represents the transition probability, that is, the probability of the current state given the previous state. The second term within the summation represents the previous belief, that is, the probability of the previous state given all past observations. These two factors are multiplied and summed over all possible previous states. The equation thus provides a way to calculate the probability of the current hypothesis about the hidden state, given all observations so far, by marginalizing over the states at the previous time-step. One can visualize the above calculation, whose process is an evaluation of a language (a cinematic sequence) given some prior language (a viewer's internal model), by the so-called forward trellis:


[Figure: forward trellis. Source: Dan Jurafsky, Stanford CS224S lecture slides, https://web.stanford.edu/class/cs224s/lectures/224s.17.lec3.pdf]

In the image above, squares represent observations, while circles represent (hidden) states. The trellis visualizes the computation of the likelihood that a model (a viewer's internal world-model) has generated a particular observation sequence (a cinematic sequence), at each time step, given all past observations. The idea is to compute the likelihood recursively: the likelihood of a state at each time step, as the trellis attempts to show, is computed by multiplying the likelihood of each state at the previous time-step with the transition probability to the current state, multiplied by the probability of the particular observation given the model. The equations above have been used to predict the motion and activity of objects in video, in speech recognition, and generally within temporal pattern recognition. The computation depends on the observation model and on the predictive distribution, and is essentially a sequential (recursive) update of an interpretation of a hidden state given some sequence of observations. There is a major assumption, however, that it would be good to mention again. In an HMM, it is assumed that the emission and transition probabilities are given and thus time-independent. That is, we already have an HMM, and from this we can predict the probabilities of sequences. The HMM is assumed not to change, since it is what the data – which does change – is conditioned against. Thus, two assumptions are usually made in the Markov framework. The first is the Markov property, which is a limited-horizon principle. The second is the time-invariance principle, which states that the model is stationary (not static). To achieve tractability, such simplifying assumptions are often necessary. Whether they are "good" assumptions depends on the thing modelled. How they hold up against the case of cinema will be discussed in the next chapter, but we can slightly anticipate the discussion here. One solution is to construct several HMMs, one for each inferred dependency relationship in the cinematic world, each specifying different priors. Each model can then be evaluated against the data. The spectator must not only infer dependency relationships, but also faces the task of



determining which model to employ to evaluate the observations between the chosen dependent variables. This concerns the general task of conditioning and of model application. Does the location in a film play a part in generating the mood of a given character? Does the location at time t depend on the location at time t-1? Any viewer's decision with regard to such questions, unconsciously made or not, will influence the generative model he or she applies. Finally, most readers have probably noticed that the above model deals with a single observation variable and a single state variable – a limitation beyond the two Markov assumptions. Given that cinema is a complex set of observations and inferred states, the above model as it stands cannot capture a cinematic sequence in its full complexity. In general, modelling all combinations and dependencies in a large combinatorial network is intractable. One can extend the above model in some ways, however. Recall that the Markov model is the simplest version of a Dynamic Bayesian Network. In theory, every complex network can be translated into an HMM. As Stuart Russell notes, one can "combine all the state variables in the DBN into a single state variable whose values are all possible tuples of values of the individual state variables" (Russell et al. 2010: 590). This is not a tractable solution, however: the transition and emission matrices increase exponentially in size as the number of variables rises. What this means is that HMMs are best used to model simplified scenarios involving just one variable – such as cutting rhythm, the position of an object, or the mood of a character – rather than full-scale, high-dimensional analyses. A host of augmented Markov models have been suggested to deal with more complicated scenarios; the number of different models makes it difficult to review them here. The crucial point is that all of these augmented models can in principle be represented as a simple HMM. To take an example, so-called factorial HMMs extend classical HMMs by letting each state be a collection of hidden variables independent of each other. Another example is the segment-HMM, in which every state can produce a sequence of observations rather than just one observation. We will present a couple of ways to use such augmented models to segment a cinematic scene or sequence in the next chapter. The problem, however, is twofold: first, exact inference may be intractable due to high complexity; second, and more certain, no single model represents the true configuration and probability distribution. Two solutions thus suggest themselves. One is to construct relevant and simplified DBNs which specify what connections are worth looking at, and which thus help us predict in the most efficient way possible. This specification of dependencies, statistical in nature, lays out the language of film, as it is a form of causal structure learning; we will return to this idea in relation to film in the next chapter. The other, which we will focus on presently, follows from the fact that such dependencies are not given in advance (since cinema is not a formal language or deterministic physical system) but must be learned: multiple models are likely to be employed as time moves forward (say, in a film). The pathway between the necessity of multiple models and the idea of negation in the probabilistic sense is quite short.
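Before turning to negation, the forward recursion described above can be made concrete with a short sketch. It reuses the fictional mood-HMM from earlier; everything here is illustrative rather than an analysis of any actual film.

```python
import numpy as np

# Toy mood-HMM (fictional parameters, as above).
A  = np.array([[0.7, 0.2, 0.1],    # transition probabilities p(x_t | x_{t-1})
               [0.3, 0.5, 0.2],
               [0.2, 0.3, 0.5]])
B  = np.array([[0.8, 0.1, 0.1],    # emission probabilities p(y_t | x_t)
               [0.1, 0.7, 0.2],
               [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.3, 0.2])     # initial state distribution

def forward(obs):
    """Forward algorithm: returns p(observation sequence | model)
    and the filtered posterior p(x_t | y_{1:t}) at the final step."""
    alpha = pi * B[:, obs[0]]                  # initialise with the first observation
    likelihood = alpha.sum()
    alpha /= likelihood                        # normalise to keep the numbers stable
    for y in obs[1:]:
        alpha = B[:, y] * (A.T @ alpha)        # predict (transition), then update (emission)
        step_evidence = alpha.sum()            # p(y_t | y_{1:t-1})
        likelihood *= step_evidence
        alpha /= step_evidence
    return likelihood, alpha

# Observation indices: 0 = smile, 1 = cry, 2 = frown.
seq_likelihood, posterior = forward([0, 0, 1])
print(seq_likelihood, posterior)
```

The returned likelihood answers the evaluation question (how probable is this sequence under my internal model?), while the final vector is the viewer's updated belief about the hidden mood.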
Negation in the probabilistic view – model difference, selection, and averaging
This is where negation comes to play a part. As we have argued elsewhere, negation is always relational, and the relation is always one of a produced probabilistic gap, but the gap can have various sources and be of various types. To recapitulate, negation can either take the form of a low-probability event given a model, or be a clash between two or more models relative to a given hypothesis. Since it is relative to a hypothesis, the model parameters will be dependent; given independent data, any talk of negation is less meaningful. If we are interested in the full experience of a scene, we focus on all variables across all combinations, but calculations of this sort are likely to be intractable. Further, since there is no one model that attaches a true probability to events, various models will be assumed, and there is no way to use the same model for different variables, since their values are accumulative to different degrees. We will return to this subject later. For now, assume simply that we are justified in applying various models if this move is believed to increase the efficiency of prediction. But one might struggle to find out which model to choose. We can measure the information or surprise carried by a given sample by how much the probabilities have moved between a prior and a posterior distribution, or between two models. In seeing a film, someone might have one model of evidence relative to real life and one relative to the film, and these models might clash. We can express the relation between two hypotheses probabilistically as a ratio of posteriors:

p(H1 | D) / p(H2 | D) = [p(D | H1) · p(H1) / p(D)] / [p(D | H2) · p(H2) / p(D)] = [p(D | H1) · p(H1)] / [p(D | H2) · p(H2)]

This gives us the intuitive result that the ratio of the two posteriors is proportional to the ratio of the two likelihoods:

p(H1 | D) / p(H2 | D) ∝ p(D | H1) / p(D | H2)

The left-hand side is the posterior odds on the model or hypothesis H1 relative to the model or hypothesis H2, and the likelihood ratio on the right-hand side is known as the Bayes factor; both are in general good measures of negation. One model could be an HMM, and another an observed distribution. Often H1 and H2 will be mutually exclusive, so that if H1 is true, H2 is false. It might be a spectator's propositions "John is happy" versus "John is not happy" conditioned on the very same data, e.g. a certain expression on John's face. Let us therefore write H2 as ¬H, meaning "not-H." The posterior odds given new data are then given by:

odds(H | D) = odds(H) · p(D | H) / p(D | ¬H)

This means that, in a binary situation of mutually exclusive hypotheses or models, the odds that the hypothesis is true equal the prior odds of the hypothesis multiplied by the likelihood ratio. This is a good notion of surprise. Jaynes (2003) takes the base-10 logarithm and multiplies by ten in order to measure evidence in decibels; on this scale, multiplicative factors become additive terms. We no longer call it odds, since we are no longer strictly measuring on a probability scale. After Jaynes, we call the evidence e and obtain:

e(H | D) = e(H) + 10 · log10[ p(D | H) / p(D | ¬H) ]

where e(H) = 10 · log10(odds(H)). This can be viewed as a measure of the causal strength of H on D; that is: what is the probability that the hidden state, say, "happy", was what caused the "smile"? In the decibel rescaling, an evidence of zero corresponds to a probability of 0.5 for H, while an evidence of 40 decibels corresponds to a probability of roughly 0.9999. Now, in a film, there might be a lot of data either supporting or not supporting the hypotheses at play, so we might need to distinguish between various pieces of data in the image. Since evidence lives on an additive (logarithmic) scale, the contributions of the separate pieces of data – assuming they are conditionally independent given the hypotheses – simply add up:

e(H | D) = e(H) + 10 · log10[ p(D1 | H) / p(D1 | ¬H) ] + 10 · log10[ p(D2 | H) / p(D2 | ¬H) ] + ...

which we can write compactly as:

e(H | D) = e(H) + 10 · Σ_i log10[ p(D_i | H) / p(D_i | ¬H) ]
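A short sketch of this evidence bookkeeping, with invented likelihood ratios for a hypothesis H against ¬H (nothing here is drawn from an actual film):

```python
import math

def decibels(ratio):
    """Evidence contribution of one likelihood ratio, on Jaynes' decibel scale."""
    return 10 * math.log10(ratio)

prior_odds = 1.0                    # invented: H and not-H start out equally plausible
evidence = decibels(prior_odds)     # e(H) = 10 * log10(odds(H)) = 0 dB

# Invented likelihood ratios p(D_i | H) / p(D_i | not-H) for successive pieces of data.
likelihood_ratios = [0.2,   # she is crying: five times more probable under not-H
                     3.0,   # she is with a close friend: favours H
                     1.0]   # a neutral detail: moves nothing

for lr in likelihood_ratios:
    evidence += decibels(lr)        # likelihood ratios multiply, so decibels add

posterior_odds = 10 ** (evidence / 10)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(evidence, posterior_prob)
```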

An evidence calculation of this kind might tell us what the evidence is for the hypothesis that, let us say, "Cynthia is happy," given a whole string of evidence present in a single scene and some prior knowledge regarding the hypothesis before the new data is perceived. When a piece of data is equally probable under H and ¬H, the likelihood ratio is 1 and contributes nothing, so the evidence simply remains at its prior value. For every piece of new data D, say, she is crying, there is a likelihood ratio telling us how strongly this speaks for happy as against ¬happy – that is: given data D (crying), how plausible is the hypothesis? This likelihood ratio can in some sense be described as a loss function, in that the ratio explicates how much information is gained or lost by going from one model to the other. If the two models predict the data equally well, they are, for this purpose, the same model. But there may be many models at work. One model might condition on real life, and two models on the cinematic world. An observer might weight these different models differently in different situations, according to the principle of minimization of free energy. Whatever model does the job does the job. The model (or model average) which yields the highest posterior probability is likely to be chosen. If a model assigns low probabilities to an observed state, this model is likely to be given low weight by the observer, at least if we are to believe the free energy hypothesis. One model might take in as much dissonant information as it can carry; another might try to reduce dissonance immediately. The one is proactive, the other not. Any observer might choose which data to condition on; this will guide his or her inference processes. Given any uncertain data, we ask: what model has produced this data? If we cannot find a model, uncertainty and negation persist. A model is, we remember, a set of probability distributions. But, "in most situations, evaluating surprise or model evidence is an intractable problem" (Friston 2017: 193). So how does one do it?

Weighting and model averaging
The fundamental process is that of conditioning, which means explaining a complex event in terms of simpler events by the act of decomposition. But what if conditioning on different data leads to widely different predictions? We saw earlier that different models may be used ad hoc. Model averaging means simply averaging over the posteriors relative to every model, each weighted by the probability that the model in question is correct. The formula is typically given as:

p(H | D) = Σ_{i=1}^{K} p(H | M_i, D) · p(M_i | D)
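A sketch of such averaging over two candidate models; the posteriors are invented purely for illustration.

```python
# Invented numbers: two candidate models of a character's mood.
# p(M_i | D): how probable each model is, given the data seen so far.
p_model = [0.7, 0.3]

# p(H | M_i, D): what each model says about the hypothesis "happy", given the data.
p_happy_given_model = [0.9, 0.2]

# Bayesian model averaging: weight each model's verdict by the model's own posterior.
p_happy = sum(ph * pm for ph, pm in zip(p_happy_given_model, p_model))
print(p_happy)  # 0.69
```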

In this formula, the posterior of each model, p(M_i | D), is itself given by Bayes Theorem and acts as the weighting factor. The question is whether there is evidence that the brain performs such averaging; it plausibly could. Another tactic to reduce variance is the application of heuristics. According to Gigerenzer (2015), heuristics form a fundamental part of everyday reasoning. Here, a certain bias will be strategically and systematically employed in order to minimize variance in predictions. Thus, instead of model averaging, one might simply select the model which maximizes the probability of the data, and ignore the space of all other possible models. If, on the other hand, the data in one way or another forces incongruence, simple heuristics might fail, leaving the subject unable to find a model which explains the data at hand. Arthouse cinema is known for using such tactics, thereby throwing the viewer off the rails. In the analytical section, we saw a case where this last outcome, model-confusion and thus negation, rather than model selection or model averaging, seems the most plausible description of the process. This concludes the mathematical introduction to the modelling of (and inference in) stochastic environments. For a more in-depth introduction to the subject, I refer to Stuart Russell and Peter Norvig's book Artificial Intelligence: A Modern Approach (2010), as well as E.T. Jaynes' book Probability Theory: The Logic of Science (2003).

