Token boundaries
Broadly speaking, one token corresponds to one matrix sentence (IP-MAT), including an embedded clause if present.
However, note the following:
- When two independent finite clauses are conjoined, the two clauses are treated as separate tokens:
(IP-MAT (NP-OB1 (DPDS Dat))              ← first independent clause as a token
        (VVFIN beleueden)
        (NP-SBJ (PPER se) (DIN alle))
        (PP (APPR myt) (NP (NA willen))) )
(IP-MAT (KON und)                        ← second independent clause as a token
        (NP-SBJ *con*)
        (VVFIN scheiden)
        (PP (APPR van) (NP (PPER em))) )
- Direct speech which constitutes an IP-MAT can sit within a higher IP-MAT introducing the speech. The direct speech matrix clause receives the extension -SPE (IP-MAT-SPE):
(IP-MAT (ADVP (ADV DO))
        (VVFIN sprak)
        (NP-SBJ (PPER he))
        (IP-MAT-SPE (NP-SBJ (PPER ik))   ← IP which is direct speech
                    (PTKNEG ne)
                    (VVFIN bin)) )
- In contrast to the treatment of direct speech (see above), when a citation that is itself an IP-MAT is introduced by e.g. 'X writes', the introducing clause and the citation are treated as separate tokens:
(IP-MAT (NP-SBJ (NE Salustius))          ← first token
        (VVFIN scrift) )
(IP-MAT (NP-SBJ (DDARTA de)              ← second token
                (FM Troyani))
        (VAFIN hebben)
        (NP-OB1 (NE rome))
        (VVPP ghebuwet) )
- Chapter and section headings are treated as standalone tokens and are tagged IP-MAT if they constitute an independent finite clause, or otherwise FRAG:
(FRAG (PP (APPR Van) (NP (DDARTA dem) (NA Borchgherichte))) )
- Places and dates given for the time of writing are also treated as standalone tokens and are tagged FRAG:
(FRAG (FM proximo) (FM libro) (FM de) (FM ciuitate) (FM dei) )
(FRAG (XY 2.000dcclxxx) (FM Abbon) )
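The boundary rules above mean that every top-level bracket is one token, while an IP-MAT-SPE nested inside a higher IP-MAT is not counted separately. The following Python sketch illustrates this by splitting bracketed text on parenthesis depth. It assumes the trees appear bare, as in the examples in this section, without the explanatory arrows and without any wrapper bracket a corpus file may add; the function names `split_tokens` and `root_label` are hypothetical and not part of any corpus tool.

```python
def split_tokens(text):
    """Return the top-level bracketed trees in text, one string per token."""
    tokens, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i          # a new token opens at depth 0
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                tokens.append(text[start:i + 1])
    return tokens


def root_label(token):
    """Return the label of the outermost node, e.g. IP-MAT or FRAG."""
    return token[1:].split(None, 1)[0]


# Trees copied from the examples in this section (annotations removed).
sample = """
(IP-MAT (ADVP (ADV DO)) (VVFIN sprak) (NP-SBJ (PPER he))
        (IP-MAT-SPE (NP-SBJ (PPER ik)) (PTKNEG ne) (VVFIN bin)) )
(IP-MAT (NP-SBJ (NE Salustius)) (VVFIN scrift) )
(FRAG (PP (APPR Van) (NP (DDARTA dem) (NA Borchgherichte))) )
"""

tokens = split_tokens(sample)
labels = [root_label(t) for t in tokens]
print(labels)
```

The direct speech clause stays inside the first token because it never returns to depth 0, whereas the citation-introducing clause and the heading each form their own token.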
Token IDs
Each token has a unique ID of the form ID TEXT.DIALECT.GENRE.NUMBER, for example:
ID ARZNEI.WP.SCI.1, ID ARZNEI.WP.SCI.2, ID ARZNEI.WP.SCI.3 etc.
ID ENGELHUS.EP.HIS.1, ID ENGELHUS.EP.HIS.2, ID ENGELHUS.EP.HIS.3 etc.
ID STRALSUND.EE.CHART.1, ID STRALSUND.EE.CHART.2, ID STRALSUND.EE.CHART.3 etc.
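As an illustration of the ID format, the following Python sketch splits an ID into its four components. It assumes that the text, dialect and genre fields contain no dots or whitespace and that the number is a plain integer, which holds for the examples given here but is not stated as a rule; `TokenID` and `parse_token_id` are hypothetical names.

```python
import re
from typing import NamedTuple


class TokenID(NamedTuple):
    text: str
    dialect: str
    genre: str
    number: int


# Assumed shape: "ID TEXT.DIALECT.GENRE.NUMBER", fields free of dots and spaces.
ID_PATTERN = re.compile(r"ID\s+([^.\s]+)\.([^.\s]+)\.([^.\s]+)\.(\d+)")


def parse_token_id(s):
    """Parse a token ID string into its components, or raise ValueError."""
    m = ID_PATTERN.fullmatch(s.strip())
    if m is None:
        raise ValueError(f"not a token ID: {s!r}")
    text, dialect, genre, number = m.groups()
    return TokenID(text, dialect, genre, int(number))


tid = parse_token_id("ID STRALSUND.EE.CHART.3")
print(tid.text, tid.dialect, tid.genre, tid.number)
```

Storing the number as an integer lets tokens from the same text be sorted in document order rather than lexicographically.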