Accepted Posters

Reference versus Information

Daniel Couto Vale

RWTH Aachen University, daniel.couto-vale [at] ifaar.rwth-aachen.de

Geographic Referring Expression Generation

Rodrigo de Oliveira, Somayajulu Sripada and Ehud Reiter

University of Aberdeen, rodrigodeoliveira [at] abdn.ac.uk

Referring to the location of things on a map involves at least two strategies: one to partition the area depicted by the map, and another to select the ‘best’ partition(s) to be used as the description of the target location. The partitioning strategy has been labelled Geocharacterisation (Turner, 2010): this is the stage where one decides, for example, whether the difference between “coast” and “inland” is relevant, and where the boundary between them lies. The selection strategy is what we refer to as a strategy (or algorithm) for Geographic Referring Expression Generation (G-REG). In the coast vs. inland example, the job of the G-REG algorithm would be to select “coast” as the final description (and reject “inland”). In addition to Geocharacterisation and G-REG, yet another strategy may take place: one to partition the thing being located. In many domains (for example, meteorology), the things we locate cover a large portion of the map. For instance, when describing where rain will fall tomorrow, the extent of the ‘rainy area’ may be wide enough to allow partitioning of the rain itself.

With so many strategies to heed, it is no surprise that, for the same data set, different people can generate a variety of different descriptions. In fact, our experiments suggest that full agreement between humans is impossible to achieve. People select different descriptors and/or subdivide the thing being located (e.g. the many data points that correspond to rain on a map) differently. It is also not clear whether they agree on the boundaries of different areas (e.g. between northeast and east), or even on whether those areas are mutually exclusive. This became evident when we evaluated our current, deterministic G-REG algorithm (de Oliveira et al., 2015) by comparing its output with that of humans and obtained very disappointing results. With so much variation in the gold standard, it was impossible to reproduce what everyone said, because our algorithm outputs a single final description.

We now believe that a probabilistic algorithm would be a better model of human G-REG. The current idea is to consider all possible descriptions and assign a probability to each, i.e. how likely it is that each description describes the location of a phenomenon on a geographic scene.
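As a rough sketch of what such a probabilistic algorithm could look like (the scoring function and the toy partitions below are illustrative assumptions, not the actual algorithm), each candidate description receives a probability instead of being accepted or rejected outright:

    # Sketch: score each candidate descriptor for a target area and
    # normalise the scores into a probability distribution.
    def describe(target_cells, partitions):
        # partitions maps a descriptor (e.g. 'coast') to the set of map
        # cells it covers; target_cells holds the cells where the
        # phenomenon (e.g. rain) occurs.
        scores = {}
        for descriptor, cells in partitions.items():
            overlap = len(target_cells & cells)
            if overlap:
                # reward coverage of the target, penalise extra area
                scores[descriptor] = overlap / len(cells)
        total = sum(scores.values())
        return {d: s / total for d, s in scores.items()}

    probs = describe({1, 2, 3}, {'coast': {1, 2, 3, 4}, 'inland': {3, 5, 6}})
    # -> {'coast': 0.69..., 'inland': 0.30...}, rather than a single winner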

Integrating language processing and vision: learning word meaning from situated dialogue

Arash Eshghi, Yanchao Yu and Oliver Lemon

Heriot-Watt University, eshghi.a [at] gmail.com

We present work in progress addressing the problem of interactively learning perceptually grounded word meanings from the external world and a human teacher. We focus on the design and integration of language and visual processing systems for supporting a concept learning process with a tutor in Natural Language (NL), where the system does not know the meaning of words beforehand. Our work is close in spirit to the Cog-X George System [5], but they, like most other work in this area, do little in the way of dialogue processing (e.g. semantic parsing/generation). Our system, on the other hand, makes use of an existing incremental semantic parser and generator (DS-TTR, Dynamic Syntax (DS) and Type Theory with Records (TTR) [3, 1]), producing semantic representations (logical forms) in TTR [2], which are then grounded in visual classifiers acquired through dialogue (cf. [4], who ground words directly in visual classifiers).

As noted, the system is made up of two key components – a Vision system and the DS-TTR parser/generator. The Vision system classifies a (visual) situation, i.e. deems it to be of a particular type, expressed as TTR Record Types (RT). This is done by deploying a set of binary attribute classifiers that ground the simple types (atoms) in the system (e.g. ‘red’, ‘square’) and composing their output to construct the more complex type of the (visual) situation. This representation then acts as (1) the non-linguistic context of the dialogue for DS-TTR, for definite reference/pronoun/indexical/ellipsis resolution; and (2) the logical database from which answers to questions about the objects’ attributes are retrieved - the question is parsed and its representation acts directly as a query on the non-linguistic/visual context to retrieve its answer. Conversely, the system can form questions about the scene, where the teacher’s answer then acts as a training instance for the classifiers (basic, atomic types) involved.
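The following toy sketch illustrates this flow of information in Python; it is only a schematic stand-in for the actual DS-TTR/TTR machinery, and the classifiers and feature encodings are invented:

    # Binary attribute classifiers ground the atomic types (hypothetical
    # hand-written stand-ins for the trained visual classifiers).
    classifiers = {
        'red':    lambda obj: obj['hue'] < 30,
        'square': lambda obj: obj['corners'] == 4,
    }

    def type_scene(objects):
        # Compose classifier outputs into a record-type-like structure.
        return {oid: {t for t, clf in classifiers.items() if clf(feats)}
                for oid, feats in objects.items()}

    scene = type_scene({'o1': {'hue': 12, 'corners': 4}})
    # A parsed question such as "is there a red square?" then acts as
    # a query on this visual context:
    answer = any({'red', 'square'} <= types for types in scene.values())  # True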

The Vision module focuses on extracting robust visual features for particular objects using state-of-the-art extraction methods – Bag-of-Visual-Words (BoVW) by Vlfeat [6] – and then learns to classify novel objects with visual attributes. This baseline system currently supports two attribute categories: six colors and three shapes. Prior work [7, 8] shows that standard offline binary/multi-label classification models are not suitable in an interactive learning setting, where examples are provided one by one, and so here we use incremental Logistic Regression (Stochastic Gradient Descent (SGD) from the Weka library) instead of an offline model.
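In Python terms, this incremental set-up corresponds to updating a logistic-regression classifier one example at a time; the sketch below uses scikit-learn's SGDClassifier with partial_fit as an analogue of the Weka SGD component named above (the feature vectors are toy stand-ins for BoVW histograms):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss='log_loss')  # logistic regression trained by SGD
    classes = np.array([0, 1])            # 1 = object has the attribute, e.g. 'red'

    def learn_from_answer(bovw_histogram, tutor_says_yes):
        # One tutor answer = one online training instance.
        X = np.asarray(bovw_histogram).reshape(1, -1)
        clf.partial_fit(X, [int(tutor_says_yes)], classes=classes)

    learn_from_answer([0.2, 0.7, 0.1], True)   # toy 3-bin BoVW histogram
    learn_from_answer([0.9, 0.0, 0.1], False)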

We will show an interactive demonstration of this prototype system at the conference, illustrating how questions are asked, parsed, and answers retrieved and generated. Work in progress addresses: (1) data-driven dialogue management with more complex dialogues; (2) an incremental model of reference resolution within this general framework; and (3) experimentation on the rate of incremental concept learning given different parameter settings in DS-TTR – allowing or disallowing specific interactional phenomena, such as corrections, clarification interaction and self-repair – and comparison to other similar systems.

References

[1] R. Cann, R. Kempson, and L. Marten. The Dynamics of Language. Elsevier, Oxford, 2005.

[2] R. Cooper. Records and record types in semantic theory. Journal of Logic and Computation, 15(2):99–112, 2005.

[3] A. Eshghi, J. Hough, M. Purver, R. Kempson, and E. Gregoromichelaki. Conversational interactions: Capturing dialogue dynamics. In S. Larsson and L. Borin, editors, From Quantification to Conversation: Festschrift for Robin Cooper on the occasion of his 65th birthday, volume 19 of Tributes, pages 325–349. College Publications, London, 2012.

[4] C. Kennington and D. Schlangen. Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In Proceedings of the Conference for the Association for Computational Linguistics (ACL-IJCNLP). Association for Computational Linguistics, 2015.

[5] D. Skocaj, M. Kristan, A. Vrecko, M. Mahnic, M. Janicek, G. M. Kruijff, M. Hanheide, N. Hawes, T. Keller, M. Zillich, and K. Zhou. A system for interactive learning in dialogue with a tutor. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2011, San Francisco, CA, USA, September 25-30, 2011, pages 3387–3394, 2011.

[6] A. Vedaldi and B. Fulkerson. Vlfeat: an open and portable library of computer vision algorithms. In Proceedings of the 18th International Conference on Multimedia 2010, Firenze, Italy, October 25-29, 2010, pages 1469–1472, 2010.

[7] Y. Yu, A. Eshghi, and O. Lemon. Comparing attribute classifiers for interactive language grounding. In Proceedings of the EMNLP workshop on Vision and Language, 2015.

[8] Y. Yu, A. Eshghi, and O. Lemon. Interactive learning through dialogue for multimodal language grounding. In Proceedings of SEMDIAL, 2015.

Redundancy, not Ambiguity, Immediately Disrupts Comprehension During Reading

Kumiko Fukumura

University of Strathclyde, kumiko.fukumura [at] strath.ac.uk

While a common assumption is that referential ambiguity causes immediate comprehension difficulty, it remains controversial whether redundancy also disrupts comprehension immediately (Arts et al., 2011; Davies & Katsos, 2009; Engelhardt et al. 2011). Three eye-movement experiments examined violations of the Gricean maxim of quantity during reading, specifically asking whether both ambiguity and redundancy result in immediate comprehension difficulties (Altmann & Steedman, 1988).

In Experiment 1, English speakers read one of the context sentences in (1a-b), followed by one of the target sentences in (3a-b). The bare noun, such as “towel” in (3a), was ambiguous following (1b) (two-referent), but not following (1a) (one-referent). The size adjective, such as “small towel” in (3b), was redundant following (1a), but not following (1b). First-pass and regression-path times for size-contrasted descriptions were longer after a one-referent (1a) than a two-referent (1b) context, and redundancy also resulted in longer regression-path times in the final region (“on the floor”). In contrast, first-pass and regression-path times for bare nouns (3a) did not slow down after (1b) relative to (1a). Regression-path times for the final region were longer when the bare noun followed a two- rather than one-referent context, indicating that ambiguity had an effect during later reading.

In Experiment 2, participants read sentence (3a) or (3b) after (2a) or (2b), where the referents were preceded by a numeral. Redundancy led to longer first-pass, regression-path and total times for size-contrasted descriptions (3b) after a one-referent than a two-referent context. Regression-path and total times for bare nouns (3a) were longer after a two-referent than a one-referent context. But first-pass times for bare nouns were unaffected by ambiguity, even when the critical region was expanded to include the next word (“the towel was”).

Experiment 3 compared the effect of context on size and colour adjectives, as in (4). First-pass times for the size-modified noun (4b) were longer in the two-referent than the one-referent condition, whereas the context did not significantly affect the reading times of colour-modified descriptions (4a) in the critical region.

Thus, our results contrast with research suggesting that redundancy facilitates rather than hinders comprehension (Arts et al., 2011). We also observed that ambiguity does not immediately disrupt comprehension unless it is explicitly signalled by a numeral. An early effect of redundancy was found for size but not for colour adjectives, indicating that redundancy disrupts comprehension immediately when the meaning of the modifier falsely signals a contrasting competitor.

Exp 1 context

(1) a. There was a small towel and a large robe in the bathroom. (one referent)

b. There was a small towel and a large towel in the bathroom. (two referents)

Exp 2 context

(2) a. There was one towel in the bathroom. (one referent)

b. There were two towels in the bathroom. (two referents)

Exp 1 & Exp 2 target

(3) a. The towel/was soaking/on the floor.

b. The small towel/was soaking/on the floor.

Exp 3 target

(4) a. The white towel/was soaking/on the floor.

b. The small towel/was soaking/on the floor.

Predicting the Success of Referring Expressions of Objects in Real-World Images

Dimitra Gkatzia

Edinburgh Napier University, dgatzia [at] gmail.com

Referring Expressions (REs) have attracted major attention from the NLG community over the last 20 years (Krahmer and van Deemter, 2011; Gatt et al., 2014). Research has mainly focused on the generation of REs rather than on predicting whether an RE will be successful. In addition, relevant work has used computer-generated scenes rather than real-world scenes. Real-world scenes are more complex than computer-generated scenes; for example, many similar objects, such as buildings, can be present in one image (Figure 1).

We are interested in investigating two closely related questions:

1. How does the scene complexity affect the generation and the resolution of REs?

2. Can we develop accurate models for success prediction of REs in cluttered real-world images?

To shed some light on these questions, we performed experiments with the REAL corpus (Gkatzia et al., 2015). The REAL corpus contains a collection of images of real-world urban scenes together with verbal descriptions of target objects generated by humans, paired with data on how successful other people were in identifying the same object based on these descriptions. Example descriptions include:

1. The Austrian looking white house with the dark wooden beams at the water side.

2. The white building with the x-shape balconies. It seems it’s new.

3. The white building with the balconies by the river.

4. The nearest house on right side. It’s black and white.

5. The white and black building on the far right, it has lots of triangles in its design.

6. The rightmost house with white walls and wood finishings.

We looked into the semantic and syntactic elements that contribute to the success of a RE in the REAL corpus and the findings inspired the creation of success prediction models of REs. A significant element of the models is that they account for the complexity of the image.
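A minimal version of such a model might look as follows; the feature set (number of attributes, use of location and colour, and a scene-complexity score) is purely illustrative, not the actual feature set of our models:

    from sklearn.linear_model import LogisticRegression

    # one row per RE: [n_attributes, uses_location, uses_colour, scene_complexity]
    X = [[4, 1, 1, 0.9],
         [2, 0, 1, 0.4],
         [3, 1, 0, 0.9]]
    y = [1, 1, 0]   # 1 = other people identified the intended object

    model = LogisticRegression().fit(X, y)
    p_success = model.predict_proba([[3, 1, 1, 0.7]])[0, 1]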

Following this line of research, we are interested in investigating more accurate and more general prediction models than those described above. In particular, we are interested in how scene perception influences both the generation and the resolution of REs in real-world scenes. Experiments with eye trackers can reveal specific features of objects that attract people’s attention. For instance, using eye trackers we can identify the dominant objects in an image and relate this information to the referring expressions used.

References

Albert Gatt, Emiel Krahmer, and Kees van Deemter. 2014. Models and empirical data for the production of referring expressions. Language, Cognition and Neuroscience, 29(8):899 – 911.

Dimitra Gkatzia, Verena Rieser, Phil Bartie, and William Mackaness. 2015. From the virtual to the real world: Referring to objects in spatial real-world scenes. In EMNLP.

Emiel Krahmer and Kees van Deemter. 2011. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218.

Summarising Incomplete Data

Stephanie Inglis

University of Aberdeen, r01si14 [at] abdn.ac.uk

Natural Language Generation (NLG) systems rely on input data to produce their text. If the data is not accurate or complete, the NLG system’s output will consequently be inaccurate or incomplete. As a result, users drawing conclusions from a system may misinterpret the information given, or the information itself may be wrong. What, then, is the best way for an NLG system to handle unreliable data?

After an initial exercise of using SimpleNLG to produce text about deaths by Ebola in Africa (data provided by a report from the World Health Organisation), issues of completeness, consistency and correctness arose. Incomplete data was the most dominant issue, and so this issue is my initial focus.

Possible ways of handling missing data (each sketched in the code after this list) are:

• To use only data known to be accurate

• To ignore the issues, interpolate missing data and treat this data as being correct

• To explicitly tell the user that there is an issue with the data, and what this issue is
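A minimal sketch of the three strategies on a toy version of the road-fatalities data (the numbers are invented):

    import pandas as pd

    fatalities = pd.Series({2007: 310.0, 2010: None, 2013: 255.0})

    known_only   = fatalities.dropna()        # 1: use only known values
    interpolated = fatalities.interpolate()   # 2: fill the gap, treat as correct
    flagged = ("No figure was reported for 2010."   # 3: tell the user explicitly
               if fatalities.isna().any() else "")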

In order to investigate the best way to do this, various experiments were conducted to see how people expressed gaps in presented data, or whether these gaps were mentioned at all. Subjects were asked to summarise data on the number of road traffic fatalities in various countries for the years 2007, 2010 and 2013 as tweets. Constraining subjects to 140 characters forced them to summarise concisely and to decide which aspects of the information were worth including, such as explicitly stating that data was missing. Pilot experiments were run to explore possible designs before deploying the experiment on Mechanical Turk. At the time of writing, the Mechanical Turk experiment has just been completed by 100 subjects, and the results have not yet been analysed. Points of investigation include whether data quality is mentioned; whether, more specifically, subjects mention that data is missing; whether it matters which year is missing; and whether subjects vary in their techniques for summarising the information. How people report the data will help determine which of the ways of handling missing data listed above is most suitable for an NLG system.

By identifying how best to handle these situations, future NLG systems could become more beneficial to users in aiding them to understand the stories being told by large quantities of data more accurately.

The effect of time pressure on referential overspecification

Ruud Koolen, Albert Gatt, Roger van Gompel, Emiel Krahmer and Kees van Deemter

Tilburg University, r.m.f.koolen [at] uvt.nl

Speakers often produce definite referring expressions that are overspecified. For example, they may refer to a chair as “the green chair” in a situation where only one chair is present. Overspecification has been attributed to several factors, for example that speakers rely on quick heuristics during attribute selection and therefore prefer to include perceptually salient attributes such as color. After all, salient attributes are easily and quickly perceived by the visual system.

We argue that the use of heuristics suggests the existence of a two-stage model of attribute selection. In the early stage, perceptually salient attributes are selected irrespective of whether they rule out any of the distractor objects in the scene. In the later stage, the scene is scanned object by object to see if there are any distractors left. The early stage is speaker-oriented, while the later stage is more listener-oriented (Arnold, 2008).
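A schematic sketch of this two-stage process (the attribute inventory, salience ordering, and object encodings are illustrative, not a committed implementation):

    SALIENT = ['colour']   # attributes that 'pop out' of the scene

    def refer(target, distractors):
        # Stage 1 (speaker-oriented): include salient attributes
        # regardless of whether they rule out any distractor.
        description = {a: target[a] for a in SALIENT}
        remaining = [d for d in distractors
                     if all(d.get(a) == v for a, v in description.items())]
        # Stage 2 (listener-oriented): scan the remaining distractors and
        # add attributes until the target is uniquely identified. Under
        # time pressure this stage may be cut short.
        for a in ('size', 'type'):
            if not remaining:
                break
            description[a] = target[a]
            remaining = [d for d in remaining if d.get(a) == target[a]]
        return description

    refer({'colour': 'green', 'size': 'big', 'type': 'chair'},
          [{'colour': 'green', 'size': 'small', 'type': 'chair'}])
    # -> {'colour': 'green', 'size': 'big'}, i.e. 'the big green chair',
    #    where 'green' is redundant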

To find evidence for such a two-stage model, we performed a reference production experiment with a manipulation of Time Pressure. Half of the speakers took part in the self-paced condition, and took as much time as needed to inspect the scene and describe the target. The other half took part in the system-paced condition and performed the task under time pressure: although they too could take as much time as needed to describe a target, the visual scenes disappeared automatically after 1000 milliseconds in each trial. Within participants, we manipulated the visual scenes in terms of the attributes required to identify the target (color; size; color or size).

We expected to find that speakers are more likely to overspecify their referring expressions in the system-paced condition than in the self-paced condition. This expectation was indeed borne out by the data: irrespective of the attributes that were required in a particular scene, redundant modifiers were more frequent in the system-paced condition (40.7%) than in the self-paced condition (31.3%), β = 0.72; SE = 0.57; p < .05. Thus, under pressure, speakers did not seem to have enough time to take the addressee perspective into account (Horton & Keysar, 1996), and mainly included attributes that simply pop out of the scene. We regard our results as evidence for a two-stage model of attribute selection, akin to recent computational models of attribute selection (e.g., Mitchell et al., 2013).

Mapping numerical information to words: towards a fully statistical approach

Xiao Li

University of Aberdeen, xiao.li [at] abdn.ac.uk

This work sketches the outlines of a new approach to the generation of textual summaries of numerical information (e.g., in weather forecasts). The generated descriptions will be represented probabilistically, as a vector of word probabilities.

Researchers in Natural Language Generation (NLG) often try to convert numerical data (e.g., projected weather data) into text (e.g., a written weather forecast). For example, time phrases such as ‘midday’, ‘by afternoon’, etc. are often used to refer to times, so the system has to choose which of these words to use in a specific case (e.g., Reiter et al. 1997, 2005). These words are generally selected from a corpus of textual summaries written by domain experts. Some systems model the use of these words by crisp thresholds (e.g. ‘midday’ could be used to denote any time from 10:00–14:00). Other systems select words on the basis of frequencies in a data-text corpus (which couples texts with the data they describe), but the candidate words themselves are still given by experts. This limits the generality of the approach and makes it difficult to scale up (e.g., to all the words in a long text).

Ultimately, we would like to predict various features of written weather forecasts, given the data. Because the same data may be described by different texts, we want our predictions to be probabilistic. At the moment we are focusing on the probability that a given word or phrase occurs in the text. We aim to construct an algorithm that avoids any expert-based rules. The input of the algorithm is the numerical data in the data-text corpus (e.g., the temperature at various times on a given day); the output is a probability vector: each entry gives us, for a given word w (e.g., the word “hot”), the probability that w occurs in a summary in the corpus.

This algorithm differs from traditional approaches: instead of producing a single word (or phrase) as the outcome of lexicalization, it gives a vector of probabilities over candidate words (or phrases), indicating how likely it is that each word (or phrase) should be used to describe the given data. In addition, the algorithm implements a many-to-many mapping from the extracted features of the data to candidate words (or phrases); that is, to decide the usage of each word (or phrase), it considers every feature rather than some specific subset.
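As an illustration of the intended input-output behaviour (the features, vocabulary, and training pairs below are invented), one simple realisation is a bank of per-word classifiers whose outputs form the probability vector:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier

    X = np.array([[31.0], [12.0], [27.0], [8.0]])   # e.g. max temperature
    Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # columns: 'hot', 'cold'

    model = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
    vector = [est.predict_proba([[25.0]])[0, 1] for est in model.estimators_]
    # -> P('hot' occurs in the text), P('cold' occurs) for the new data point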

Making invisible trouble visible: the effect of self-repairs on referential coordination

Gregory Mills, Gisela Redeker

University of Groningen, g.j.mills [at] rug.nl

One of the central findings in dialogue research is that interlocutors rapidly converge on a shared set of contracted referring expressions (Krauss and Weinheimer, 1966; Clark, 1996), which become progressively systematized and abstract. This occurs for a wide range of referents, e.g. when referring to spatial locations (Garrod and Doherty, 1994; Galantucci, 2005), music (Healey, 2002), concepts (Schwartz, 1995), confidence (Fusaroli et al., 2012), and temporal sequences (Mills, 2011). Cumulatively, these findings suggest that interaction in dialogue places important constraints on the semantics of referring expressions. However, there is currently no consensus about how best to account for how coordination develops: the interactive alignment model of Pickering and Garrod (2004) favours alignment processes, the grounding model of Clark (1996) emphasizes the role of positive feedback, while Healey (2002) demonstrates the importance of encountering and dealing with miscommunication.

To investigate the development of referential coordination in closer detail, we report a variant of the “maze task” (Pickering and Garrod, 2004). Participants communicate with each other via an experimental chat tool (Healey and Mills, 2006), which selectively transforms participants' private turn-revisions into public self-repairs that are visible to the other participant. For example, if participant A types:

A: "Now go to the square on the left, next to the big block on top"

and then before sending, A revises the turn to:

A: "Now go to the square on the left, next to the third column"

The chat server automatically detects the revised text and inserts a self-repair marker (e.g. "umm" or "uhhh") immediately preceding the revision. This yields the following turn, sent to B:

A: "Now go to the square on the left next to the big block on top umm I meant next to the third column"

Participants who received these transformed turns used more abstract and systematized referring expressions, but performed worse at the task. We argue that this is because (1) the artificial self-repairs amplify naturally occurring miscommunication (cf. Healey et al., 2013), yielding enhanced problem detection and recovery from error and thereby promoting the systematization of referring expressions; but (2) once these coordination problems are resolved, the public self-repairs have an opposite, deleterious effect on coordination, as they decrease participants' confidence in the referring conventions established during the task.

Exploring the dynamic relationship between emotion and referential communication

David Moulds, Janet McLean and Vera Kempe

Abertay University, d.moulds [at] abertay.ac.uk

This study investigated how communicative interaction affects the mood of interlocutors. Previous research has demonstrated a link between emotion and language production, showing that positive emotional valence was associated with increased ambiguity of referring expressions (Kempe, Rookes, & Swarbrigg, 2012) – a manifestation of the less deliberate, more heuristic processing style typically found for positive emotions. However, communicative exchanges may extend over considerable periods of time, during which mood itself may, in turn, be altered by the process of communication. The present study explored this potential feedback link between dyadic communication and emotion.

Dyads of participants were induced into negative mood before participating in a referential communication task. In this task, a Director had to describe a series of depicted objects to a Matcher, who had to select target objects from a set of distractors. The objects were designed so as to require participants to mention three physical dimensions in order to produce unambiguous descriptions. Dyads were randomly assigned to a Broadcast condition, in which Matchers were prevented from giving feedback and making clarification requests, with roles switching after the entire set of objects had been described by the Director; a Partial Interaction condition, where Matchers did provide feedback; or a Full Interaction condition, where Matchers provided feedback and roles were also switched after each object. Mood was measured before and after the communication task using self-ratings.

Mood change during the interaction was compared between groups using multilevel modelling, with subject and dyad as random factors. The results showed a change from negative to positive mood only in the Partial Interaction condition, which was associated with a higher frequency of communicative exchanges compared to the other two conditions. These findings suggest that mutual communicative engagement, as evidenced by verbal feedback, is inherently rewarding. That this mood improvement did not occur in the Full Interaction condition is unexpected; one possible explanation is that preparation of the next card's description interferes with this engagement. These issues are currently being explored in further research.

Combining the speaker’s and addressee’s views to explain patterns of reference production: A Probabilistic computational model

Mindaugas Mozuraitis, Daphna Heller and Suzanne Stevenson

Saarland University, mindauga [at] coli.uni-saarland.de

While speakers tailor Referring Expressions (REs) to the knowledge of their addressees, they do so imperfectly [1,2,3,4]. Our goal here is to provide an explanation for this type of pattern by extending a probabilistic model introduced to explain perspective-taking behavior in comprehension [5]. Using novel production data from a type of knowledge mismatch not previously investigated, we show that production patterns can also be explained as arising from the combination of the speaker’s and the addressee’s perspectives.

We used Visually-Misleading Objects (VMOs), like a crayon shaped as a Lego [cf. 6]. VMOs were paired with a target that shared Appearance (A: a real Lego) or Function (B: a typical crayon). In a referential communication task, the participant-speaker directed a confederate-addressee to move the target. The knowledge mismatch concerned object function: in PRIVILEGED, only the speaker was presented with the function of the VMO; in SHARED, both participants were (in Control, the VMO was replaced by an unrelated object).

Using a modifier rather than a bare noun (“the shiny Lego” vs. “the Lego”) indicates that the speaker conceptualizes the target as the same category as the VMO. If speakers were egocentric, modification rates in PRIVILEGED should be like those in SHARED. If speakers adapt to the addressee, modification should be at ceiling in Appearance-PRIVILEGED and at floor (same as Control) in Function-PRIVILEGED, because the addressee did not know the real function of the objects and, thus, should be assumed to conceptualize them by appearance. Indeed, the higher modification rate in Appearance-PRIVILEGED shows that speakers used appearance more for categorization, in accordance with the addressee’s false belief, but Function-PRIVILEGED was significantly higher than Control, indicating that speakers did not simply adopt the addressee’s perspective (it was lower than Function-SHARED).

Consider:

P(RE | target) = P(RE | target, d=speaker) * P(d=speaker) + P(RE | target, d=addressee) * P(d=addressee)

where d is the referential domain
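Plugging illustrative numbers into this equation shows how intermediate modification rates arise (all probabilities below are invented for the example):

    # P(modified RE | target, d) for each referential domain, and the
    # probability of adopting each domain, in Function-PRIVILEGED.
    p_given_speaker   = 0.90   # speaker's own knowledge: the VMO is a competitor
    p_given_addressee = 0.10   # addressee's view: no competitor, bare noun suffices
    w_speaker, w_addressee = 0.40, 0.60

    p_modified = p_given_speaker * w_speaker + p_given_addressee * w_addressee
    # -> 0.42: above the Control floor but below Function-SHARED, matching
    #    the qualitative pattern reported above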

The most significant aspect of the model is that it probabilistically weighs the contributions of both the speaker’s perspective and the addressee’s perspective. Combining the two allows us to predict the quantitative pattern of modification observed. Conceptually, using both perspectives reflects the fact that when the function of the VMO is not presented to the addressee (PRIVILEGED), the speaker faces uncertainty about how to categorize the VMO. These results show the applicability of the multiple-perspectives approach to language production, and to different types of knowledge mismatch between conversational partners.

[1] Horton & Keysar (1996), Cognition

[2] Wardlow Lane & Ferreira (2008), JEP:LMC

[3] Heller, Gorman & Tanenhaus (2012), TopiCS

[4] Yoon, Koh & Brown-Schmidt (2012), PB&R

[5] Heller, Parisien, & Stevenson (in press), Cognition

[6] Mozuraitis, Chambers & Daneman (2015), Cognition

What do you know? ERP evidence for immediate use of common ground during online reference resolution

Les Sikos, Sam Tomlinson, Conor Heins and Dan Grodner

Saarland University, sikos [at] coli.uni-saarland.de

Recent evidence on the time-course of reference resolution is mixed. Some results suggest that addressees quickly use common-ground information to narrow the set of potential referents [1-2]. Other findings suggest that addressees initially ignore such information, essentially adopting an “egocentric” perspective [3-4]. A key result in support of the egocentric view comes from visual-world eye-tracking: addressees systematically look at privileged-ground competitors despite knowing that the speaker cannot see them and hence could not be referring to those objects [5-6]. One concern with this conclusion is that people do not just fixate entities that are candidates for referential description, but also entities that are merely semantically related to a target (e.g., increased fixations to a hammer when hearing “nail”) [7]. Thus increased eye-movements to a privileged competitor might not reflect referential ambiguity.

We used event-related potentials (ERPs) to assess the time-course of perspective use in reference resolution. Definite descriptions with multiple potential referents elicit a sustained negative shift of the ERP which begins 300-600 ms after word onset (the Nref effect [8]). Participants (N=35) were instructed by a speaker, who could see only three of the animals, to select a referent from a cartoon display of four animals. For instance, participants heard “Click on the triceratops with the party hat” (target word underlined) while viewing a display that contained two triceratops, one wearing a party hat and the other glasses. The critical manipulation was whether the competitor (e.g., the triceratops with glasses) was in common or privileged ground.

If participants are initially egocentric (i.e., consider the privileged object to be a candidate for reference), both the privileged-ground competitor (PGC) and common-ground competitor (CGC) conditions should induce referential ambiguity and elicit an Nref effect relative to a no-competitor (NoC) control. If instead participants immediately take the speaker’s perspective, only the CGC condition should elicit an Nref compared to the NoC and PGC conditions.

Results showed that CGC elicited an Nref effect relative to both NoC and PGC; PGC did not differ from NoC. This pattern of results suggests that there is no point at which addressees consider a privileged competitor on equal footing with a common ground target (contra [3]). Thus, these findings are consistent with theories of language processing that allow socio-pragmatic information to rapidly influence online comprehension.

[1] Heller, Grodner & Tanenhaus, 2008

[2] Hanna, Tanenhaus & Trueswell, 2003

[3] Barr, 2008

[4] Keysar et al., 2000

[5] Barr, 2015

[6] Brown-Schmidt, 2009

[7] Yee & Sedivy, 2006

[8] Van Berkum et al., 2007