Pragmatic annotation

A (sub)version of the CHLG with annotation of roughly half the corpus for information status (given/mediated/new) is currently available for searching as CHLG-1.0-PRAG.

The annotation broadly follows the guidelines for information status annotation in Götze et al. (2007: §3). These guidelines are outlined below with specific reference to their implementation in CHLG.

 

Included texts

The specific texts annotated for information status and included in CHLG-1.0-PRAG are:

  • arznei
  • braunschweig
  • bremen
  • buxtehude
  • duder1
  • engelhus
  • greifswald
  • griseldis
  • rostock
  • ruethen
  • schwerin
  • willeken

 

Markables

Broadly speaking, the targets of the information status annotation are overt referential noun phrases which are arguments, or relate to arguments in some way.

 

As such, the following types of noun phrase are expected to bear an information status label:

  • NP-SBJ
  • NP-OB1
  • NP-OB2
  • NP-PRD
  • NP-SMC
  • NP-POS
  • NP-COM
  • NP-MSR
  • NP-PRN
  • NP-LFD

 

Excluded from the information status annotation are the following:

  • NP-SBJ *con*
  • NP-SBJ *pro*
  • NP-SBJ *exp*
  • NP-TMP
  • NP-ADT*
  • NP (e.g. noun phrases in sentence fragments, noun phrases as complements of prepositions)
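
As a rough illustration, the two lists above can be read as a simple membership test. The sketch below is not part of the corpus tooling: the function name is_markable and its arguments are hypothetical, and it assumes the tag is passed without any information status suffix or coindex.

```python
# Function tags expected to bear an information status label (list above).
MARKABLE_TAGS = {
    "NP-SBJ", "NP-OB1", "NP-OB2", "NP-PRD", "NP-SMC",
    "NP-POS", "NP-COM", "NP-MSR", "NP-PRN", "NP-LFD",
}
# Empty subject types, excluded even though their tag is NP-SBJ.
EMPTY_SUBJECTS = {"*con*", "*pro*", "*exp*"}


def is_markable(tag, leaf=None):
    """Return True if an NP with this function tag should bear a status label.

    tag:  syntactic label without status suffix or coindex, e.g. "NP-SBJ"
          (assumption: the caller has already stripped these).
    leaf: the terminal string if the NP consists of a single empty
          element, e.g. "*pro*"; otherwise None.
    """
    if leaf in EMPTY_SUBJECTS:
        return False
    # NP-TMP, NP-ADT* and bare NP are absent from the set, so they fail here.
    return tag in MARKABLE_TAGS


print(is_markable("NP-OB1"))           # True
print(is_markable("NP-SBJ", "*pro*"))  # False
print(is_markable("NP-TMP"))           # False
```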

The three-way distinction

The annotation involves a three-way distinction between GIV(en), MED(iated) and NEW.

  • GIV(en): explicitly mentioned in the previous discourse
  • MED(iated): not explicitly mentioned in the previous discourse, but can be inferred (mediated) via some kind of relation to a referent in the previous discourse, the situational context, and/or the assumed general world knowledge of the hearer
  • NEW: neither explicitly mentioned in the previous discourse nor inferable (mediated) via some kind of relation to a referent in the previous discourse, the situational context, and/or the assumed general world knowledge of the hearer

 

(Note that MED(iated) replaces Götze et al.’s label ACC(essible), to avoid confusion with accusative case.)

 

GIV(en)

The label GIV(en) applies to noun phrases whose referents are explicitly mentioned in the previous discourse. That is, there must be at least one expression earlier in the text that refers to the referent.

 

A further subdivision is made within GIV(en), capturing whether the discourse referent is given and active (GIVA) or given but inactive (GIVI), as follows:

  • GIVA: previous mention is within the immediately preceding or current IP-MAT token
  • GIVI: previous mention is earlier than the immediately preceding IP-MAT token

 

Note that, for the purpose of looking for the previous mention, only overt referents qualify. In other words, the previous mention cannot be an unexpressed referential subject (NP-SBJ *con* or NP-SBJ *pro*).
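
The GIVA/GIVI distinction can be sketched as a distance check over IP-MAT tokens. This is only an illustration under stated assumptions: the function name given_subtype is hypothetical, IP-MAT tokens are assumed to be numbered consecutively, and the caller is assumed to pass only overt mentions that precede the noun phrase in question.

```python
def given_subtype(current_token, overt_mention_tokens):
    """Classify a given referent as "GIVA" or "GIVI".

    current_token: index of the IP-MAT token containing the noun phrase
        (assumption: tokens are numbered consecutively).
    overt_mention_tokens: IP-MAT token indices of earlier overt mentions
        of the same referent; unexpressed subjects (*con*, *pro*) must
        already have been filtered out by the caller.
    Returns None if there is no previous overt mention, in which case
    the referent is not given at all.
    """
    earlier = [t for t in overt_mention_tokens if t <= current_token]
    if not earlier:
        return None
    most_recent = max(earlier)
    # Active: last mentioned in the current or immediately preceding token.
    if current_token - most_recent <= 1:
        return "GIVA"
    return "GIVI"


print(given_subtype(5, [1, 4]))  # GIVA
print(given_subtype(5, [1, 3]))  # GIVI
print(given_subtype(5, []))      # None
```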

 

MED(iated)

The MED(iated) label is given to noun phrases whose referents are not explicitly mentioned in the previous discourse, but can be inferred (mediated) via some kind of relation to a referent in the previous discourse, the situational context, and/or the assumed general world knowledge of the hearer.

 

A further subdivision is made within MED(iated), capturing the way in which the referent can be inferred, as follows (see also Nissim et al. 2004 for useful discussion):

  • MEDR (mediated – relational):
    • referent is in a part-whole relation with a referent already mentioned, e.g. “the herb” > “its root”; “the man” > “his arm”
    • referent is in a subset/superset relation to a referent already mentioned, e.g. “the flowers” > “the red flowers”
    • referent constitutes an attribute of a referent already mentioned, e.g. “the spice” > “its smell”; “the King” > “his power”
  • MEDS (mediated – situational):
    • reference to speaker/addressee (mediated via the situational context), e.g. “I”, “you” – always treated as MED(iated) even when mentioned multiple times in a text
    • textual self-references (mediated via the situational context), e.g. “this letter”, “this book” – always treated as MED(iated) even when mentioned multiple times in a text
  • MEDG (mediated – general):
    • referent is a set/type of object (mediated via world knowledge), e.g. “dogs”, “children”
    • referent is a unique object (mediated via world knowledge), e.g. “the sun”, “the moon”
    • impersonals, e.g. “one”, “whoever”, “someone” – always treated as MED(iated) even when mentioned multiple times in a text

 

Note that, with the exceptions directly above, MED(iated) is only applied to the first mention of these types of referent (as an alternative to NEW). Subsequent mentions of a discourse referent which qualified as MED(iated) at first mention are labelled as given (GIVA/GIVI), just like any other referent which occurs more than once in a text.

NEW

The label NEW is given to noun phrases whose referents are neither explicitly mentioned in the previous discourse nor inferable (mediated) via some kind of relation to a referent in the previous discourse, the situational context, and/or the assumed general world knowledge of the hearer.

Mixed cases

Where a noun phrase involves coordination of two nominals with more than one discourse referent, and where the information status of these discourse referents differs, a “mixed” label is applied, as follows:

  • GAIX: at least one discourse referent is given and active, and at least one discourse referent is given but inactive
  • GMX: at least one discourse referent is given (active or inactive), and at least one discourse referent is mediated
  • GNX: at least one discourse referent is given (active or inactive), and at least one discourse referent is new
  • MNX: at least one discourse referent is mediated, and at least one discourse referent is new
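
The mixed labels can be thought of as a lookup over the statuses of the individual conjuncts. The following is a minimal sketch with a hypothetical function name (mixed_label). Combinations that the list above does not cover, such as given plus mediated plus new, or two different MED subtypes, are returned as None here rather than guessed.

```python
def mixed_label(statuses):
    """Return the label for a coordinated NP from its conjuncts' statuses.

    statuses: labels of the individual discourse referents, e.g.
        ["GIVA", "NEW"]. Returns None for combinations that the
        guidelines above do not explicitly cover.
    """
    distinct = set(statuses)
    if not distinct:
        raise ValueError("at least one status is required")
    if len(distinct) == 1:
        # All referents share one status: no mixed label is needed.
        return distinct.pop()
    if distinct == {"GIVA", "GIVI"}:
        return "GAIX"
    # Reduce each label to its coarse category: "GIV", "MED" or "NEW".
    kinds = frozenset(s[:3] for s in distinct)
    table = {
        frozenset({"GIV", "MED"}): "GMX",
        frozenset({"GIV", "NEW"}): "GNX",
        frozenset({"MED", "NEW"}): "MNX",
    }
    return table.get(kinds)


print(mixed_label(["GIVA", "GIVI"]))  # GAIX
print(mixed_label(["GIVI", "MEDR"]))  # GMX
print(mixed_label(["MEDG", "NEW"]))   # MNX
```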

 

Annotation format

Since information status is annotated at NP-level, the annotations are added as extended tags after the function tags on NPs.

So a subject noun phrase whose referent is given (active) will be annotated as NP-SBJ-GIVA; a (primary) object noun phrase whose referent is mediated will be annotated as NP-OB1-MED, and so on.

 

Note that, because certain types of noun phrase which typically occur within larger noun phrases are themselves annotated for information status (NP-POS, NP-COM, NP-MSR, NP-PRN), a (clause-level) noun phrase can bear one information status label while a noun phrase within it bears a different one:

(NP-OB1-MED (NP-POS-GIVA (DDARTA des)
                         (NA bibotes))
            (NA wortelen)               
)
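
When working with search output, it can be convenient to separate the information status label from the rest of the node label. The sketch below is one possible way to do this. The function name split_tag is hypothetical, and the label inventory is simply collected from the labels that appear in this documentation, so it may not be exhaustive for the corpus itself.

```python
# Information status labels that appear in this documentation
# (assumption: this inventory is complete enough for illustration).
STATUS_LABELS = {
    "GIVA", "GIVI", "MED", "MEDR", "MEDS", "MEDG", "NEW",
    "GAIX", "GMX", "GNX", "MNX",
}


def split_tag(tag):
    """Split a node label into (syntactic_tag, status).

    "NP-OB1-MED" -> ("NP-OB1", "MED"). The status is None when the
    label carries no information status tag.
    """
    parts = tag.split("-")
    status = next((p for p in parts if p in STATUS_LABELS), None)
    rest = [p for p in parts if p != status]
    return "-".join(rest), status


print(split_tag("NP-OB1-MED"))   # ('NP-OB1', 'MED')
print(split_tag("NP-POS-GIVA"))  # ('NP-POS', 'GIVA')
print(split_tag("NP-TMP"))       # ('NP-TMP', None)
```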

 

Expletive NPs

Expletive NPs are not annotated for information status. In such cases, rather than the NP getting one of the pragmatic extended tags outlined above, it gets a special expletive tag -EXPL after the function tag:

(IP-MAT (NP-SBJ-EXPL (PPER Et))
        (VVFIN was)
        (PP (APPR by)
            (NP (DDARTA der)
                (ADJA teynden)
                (NA stunde)))              
)
(IP-MAT (ADVP-TMP (AVD Do))
        (VVFIN bekande)
        (NP-OB1-EXPL-1 (PPER et))
        (NP-SBJ-GIVI (DDARTA de)
                     (NA uader))
        (CP-THT-1 (KOUS dat)
                   ...)              
)
(IP-SUB (VVFIN is)
        (NP-SBJ-EXPL-1 (PPER s))
        (CP-THT-1 (KOUS dat))              
)

 

Note that, in the case of existential/presentational constructions, the associate NP of the expletive does get an information-status tag just like any prototypical NP:

(IP-SUB (VVFIN is)
        (NP-SBJ-EXPL-1 (PPER s))
        (NP-NEW-1 (DIARTA eyn)
                  (ADJA rot)
                  (NA steyn))              
)

 

Note that the use of an EXPL tag is particular to CHLG-1.0-PRAG and is not present in the earlier versions of the corpus (CHLG-0.9 and CHLG-1.0) where expletive constructions are treated as per here.

 

References

Götze, Michael, Thomas Weskott, Cornelia Endriss, Ines Fiedler, Stefan Hinterwimmer, Svetlana Petrova, Anne Schwarz, Stavros Skopeteas & Ruben Stoel. 2007. Information structure. In Steffi Dipper, Michael Götze & Stavros Skopeteas (eds.), Information structure in cross-linguistic corpora: Annotation guidelines for phonology, morphology, syntax, semantics, and information structure, 147–187. Potsdam: Universitätsverlag Potsdam.

Nissim, Malvina, Shipra Dingare, Jean Carletta & Mark Steedman. 2004. An annotation scheme for information status in dialogue. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).