Discontinuous mentions in the coref model #1468

Open
@501Good

Hello!

Is your feature request related to a problem? Please describe.
Currently, the closing index of a discontinuous mention part, such as e44523[2/2]), is not captured by the SPANS regex in the convert_udcoref.py script.

For example, a conllu file like this (taken from CorefUD_Norwegian-BokmaalNARC):

# newdoc id = ap~20081210-1411542
# global.Entity = eid-etype-head-other
# newpar
# sent_id = 016148
# text = Jeg synes de er ganske nye, alle sammen, tidligst fra 1500-tallet.
1	Jeg	jeg	PRON	_	Animacy=Hum|Case=Nom|Number=Sing|Person=1|PronType=Prs	2	nsubj	_	Entity=(e44528--1)
2	synes	synes	VERB	_	Mood=Ind|Tense=Pres|VerbForm=Fin	0	root	_	_
3	de	de	PRON	_	Case=Nom|Number=Plur|Person=3|PronType=Prs	6	nsubj	_	Entity=(e44523[1/2]--1)
4	er	være	AUX	_	Mood=Ind|Tense=Pres|VerbForm=Fin	6	cop	_	_
5	ganske	ganske	ADV	_	_	6	advmod	_	_
6	nye	ny	ADJ	_	Degree=Pos|Number=Plur	2	ccomp	_	SpaceAfter=No
7	,	$,	PUNCT	_	_	8	punct	_	_
8	alle	all	DET	_	Number=Plur|PronType=Tot	3	det	_	Entity=(e44523[2/2]--1
9	sammen	sammen	ADV	_	_	8	advmod	_	Entity=e44523[2/2])|SpaceAfter=No
10	,	$,	PUNCT	_	_	8	punct	_	_
11	tidligst	tidlig	ADJ	_	Definite=Ind|Degree=Sup	13	advmod	_	_
12	fra	fra	ADP	_	_	13	case	_	_
13	1500-tallet	1500-tall	NOUN	_	Definite=Def|Gender=Neut|Number=Sing	6	conj	_	Entity=(e44532--1)|SpaceAfter=No
14	.	$.	PUNCT	_	_	2	punct	_	_

will be converted into the following json:

[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": [
      "jeg",
      "synes",
      "de",
      "er",
      "ganske",
      "nye",
      ",",
      "alle",
      "sammen",
      ",",
      "tidligst",
      "fra",
      "1500-tallet",
      "."
    ],
    "sent_id": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "part_id": 0,
    "deprel": [
      "nsubj",
      "root",
      "nsubj",
      "cop",
      "advmod",
      "ccomp",
      "punct",
      "det",
      "advmod",
      "punct",
      "advmod",
      "case",
      "conj",
      "punct"
    ],
    "head": [
      1,
      "null",
      5,
      5,
      5,
      1,
      7,
      2,
      7,
      7,
      12,
      12,
      5,
      1
    ],
    "span_clusters": [
      [
        [
          0,
          1
        ]
      ],
      [
        [
          2,
          3
        ]
      ],
      [
        [
          12,
          13
        ]
      ]
    ],
    "word_clusters": [
      [
        0
      ],
      [
        2
      ],
      [
        12
      ]
    ],
    "head2span": [
      [
        0,
        0,
        1
      ],
      [
        2,
        2,
        3
      ],
      [
        12,
        12,
        13
      ]
    ],
    "lang": "no"
  }
]

As you can see, the second part of the mention e44523 is completely missing from the converted json.
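To make the failure concrete, here is a minimal check that runs the regex quoted from the script against the three Entity values of e44523 in the sentence above. The names entity_values and matches are only for illustration; nothing else from the script is assumed.

```python
import re

# Regex as quoted from the conversion script in this issue.
SPANS = re.compile(r"(\(\w+|[%\w]+\))")

# Entity values from tokens 3, 8 and 9 of the sentence above.
entity_values = [
    "(e44523[1/2]--1)",
    "(e44523[2/2]--1",
    "e44523[2/2])",
]

# Map each Entity value to the opening/closing pieces the regex finds.
matches = {value: SPANS.findall(value) for value in entity_values}
for value, found in matches.items():
    print(f"{value!r:22} -> {found}")
```

The closing value e44523[2/2]) yields no match at all, because the pattern requires a word character directly before ")" and here the preceding character is "]". The span opened on token 8 is therefore never closed.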

Describe the solution you'd like
It is not clear what the best solution is here, since to my knowledge the coref model does not support discontinuous mention spans.

One potential workaround would be to treat discontinuous parts of the same mention as separate mentions under a single entity.

For example, if we change this part of the conversion script:

        head2span = []
        word_total = 0
        SPANS = re.compile(r"(\(\w+|[%\w]+\))")
        for parsed_sentence in doc.sentences:
            # spans regex
            # parse the misc column, leaving on "Entity" entries
            misc = [[k.split("=")
                    for k in j
                    if k.split("=")[0] == "Entity"]
                    for i in parsed_sentence.words
                    for j in [i.misc.split("|") if i.misc else []]]
            # and extract the Entity entry values
            entities = [i[0][1] if len(i) > 0 else None for i in misc]
            # extract reference information
            refs = [SPANS.findall(i) if i else [] for i in entities]
            # and calculate spans: the basic rule is (e... begins a reference
            # and ) without e before ends the most recent reference
            # every single time we get a closing element, we pop it off
            # the refdict and insert the pair to final_refs
            refdict = defaultdict(list)
            final_refs = defaultdict(list)
            last_ref = None
            for indx, i in enumerate(refs):
                for j in i:
                    # this is the beginning of a reference
                    if j[0] == "(":
                        refdict[j[1+UDCOREF_ADDN:]].append(indx)
                        last_ref = j[1+UDCOREF_ADDN:]
                    # at the end of a reference, if we got exxxxx, that ends
                    # a particular refereenc; otherwise, it ends the last reference
                    elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                        if (not UDCOREF_ADDN) or j[0] == "e":
                            try:
                                final_refs[j[UDCOREF_ADDN:-1]].append((refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
                            except IndexError:
                                # this is probably zero anaphora
                                continue
                    elif j[-1] == ")":
                        final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                        last_ref = None
            final_refs = dict(final_refs)

like this (changed the regex on line 71 and added a condition on line 94):

        head2span = []
        word_total = 0
        SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
        for parsed_sentence in doc.sentences:
            # spans regex
            # parse the misc column, leaving on "Entity" entries
            misc = [[k.split("=")
                    for k in j
                    if k.split("=")[0] == "Entity"]
                    for i in parsed_sentence.words
                    for j in [i.misc.split("|") if i.misc else []]]
            # and extract the Entity entry values
            entities = [i[0][1] if len(i) > 0 else None for i in misc]
            # extract reference information
            refs = [SPANS.findall(i) if i else [] for i in entities]
            # and calculate spans: the basic rule is (e... begins a reference
            # and ) without e before ends the most recent reference
            # every single time we get a closing element, we pop it off
            # the refdict and insert the pair to final_refs
            refdict = defaultdict(list)
            final_refs = defaultdict(list)
            last_ref = None
            for indx, i in enumerate(refs):
                for j in i:
                    # remove the discontinuous part from a closing index, e.g. "e1[1/2])" -> "e1)"
                    if j[-1] == ")" and j[-2] == "]":
                        j = re.sub(r"\[[\d/]+\]", "", j)
                    # this is the beginning of a reference
                    if j[0] == "(":
                        refdict[j[1+UDCOREF_ADDN:]].append(indx)
                        last_ref = j[1+UDCOREF_ADDN:]
                    # at the end of a reference, if we got exxxxx, that ends
                    # a particular refereenc; otherwise, it ends the last reference
                    elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                        if (not UDCOREF_ADDN) or j[0] == "e":
                            try:
                                final_refs[j[UDCOREF_ADDN:-1]].append((refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
                            except IndexError:
                                # this is probably zero anaphora
                                continue
                    elif j[-1] == ")":
                        final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                        last_ref = None
            final_refs = dict(final_refs)

then the discontinuous mention is included in the converted json:

[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": [
      "jeg",
      "synes",
      "de",
      "er",
      "ganske",
      "nye",
      ",",
      "alle",
      "sammen",
      ",",
      "tidligst",
      "fra",
      "1500-tallet",
      "."
    ],
    "sent_id": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "part_id": 0,
    "deprel": [
      "nsubj",
      "root",
      "nsubj",
      "cop",
      "advmod",
      "ccomp",
      "punct",
      "det",
      "advmod",
      "punct",
      "advmod",
      "case",
      "conj",
      "punct"
    ],
    "head": [
      1,
      "null",
      5,
      5,
      5,
      1,
      7,
      2,
      7,
      7,
      12,
      12,
      5,
      1
    ],
    "span_clusters": [
      [
        [
          0,
          1
        ]
      ],
      [
        [
          2,
          3
        ],
        [
          7,
          9
        ]
      ],
      [
        [
          12,
          13
        ]
      ]
    ],
    "word_clusters": [
      [
        0
      ],
      [
        2,
        7
      ],
      [
        12
      ]
    ],
    "head2span": [
      [
        0,
        0,
        1
      ],
      [
        2,
        2,
        3
      ],
      [
        7,
        7,
        9
      ],
      [
        12,
        12,
        13
      ]
    ],
    "lang": "no"
  }
]
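For anyone who wants to try the change without running the whole conversion, the following is a standalone sketch of the patched span-matching loop applied to the Entity values of the sample sentence. It rests on several assumptions: UDCOREF_ADDN is set to 1 (skipping the leading "e" of entity ids), the names extract_spans and sample_entities are hypothetical, and the guard on the last branch is an addition of this sketch that is not in the script.

```python
import re
from collections import defaultdict

UDCOREF_ADDN = 1  # assumption: offset that skips the leading "e" of an entity id
SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
PART = re.compile(r"\[[\d/]+\]")


def extract_spans(entities):
    """Return {entity id: [(start, end_inclusive), ...]} for one sentence.

    `entities` holds one Entity value (or None) per token.
    """
    refdict = defaultdict(list)
    final_refs = defaultdict(list)
    last_ref = None
    for indx, value in enumerate(entities):
        for j in (SPANS.findall(value) if value else []):
            # strip the part marker from a closing index: "e1[1/2])" -> "e1)"
            if j.endswith("])"):
                j = PART.sub("", j)
            if j[0] == "(":
                # opening bracket: remember where this entity starts
                refdict[j[1 + UDCOREF_ADDN:]].append(indx)
                last_ref = j[1 + UDCOREF_ADDN:]
            elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                # closing bracket that names its entity explicitly
                if (not UDCOREF_ADDN) or j[0] == "e":
                    try:
                        start = refdict[j[UDCOREF_ADDN:-1]].pop(-1)
                        final_refs[j[UDCOREF_ADDN:-1]].append((start, indx))
                    except IndexError:
                        continue
            elif j[-1] == ")" and last_ref is not None and refdict[last_ref]:
                # anonymous closing bracket: ends the most recent opening
                # (the extra guard is specific to this sketch)
                final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                last_ref = None
    return dict(final_refs)


# Entity values of the 14 tokens in the sample sentence (None = no Entity).
sample_entities = [
    "(e44528--1)", None, "(e44523[1/2]--1)", None, None, None, None,
    "(e44523[2/2]--1", "e44523[2/2])", None, None, None, "(e44532--1)", None,
]
spans = extract_spans(sample_entities)
print(spans)
```

The sketch returns inclusive token indices, as the loop does; the converted json above lists the same spans with an exclusive end, so (7, 8) here corresponds to [7, 9] there.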

Describe alternatives you've considered
Another alternative would be to completely ignore all the discontinuous mentions since they are not very frequent.
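
A rough sketch of this alternative, assuming it would be enough to filter the Entity value before it reaches the span regex: split the value into opening and closing pieces and drop every piece that carries a part marker such as [1/2]. The helper drop_discontinuous and both patterns are hypothetical and not part of the script, and the sketch assumes that both the opening and the closing index of a discontinuous part carry the marker, as in the example above.

```python
import re

# One piece is either "(eid-..." with an optional ")" or a closing "eid)".
PIECES = re.compile(r"\([^()]+\)?|[^()]+\)")
# Part marker of a discontinuous mention, e.g. "[1/2]".
PART_MARKER = re.compile(r"\[\d+/\d+\]")


def drop_discontinuous(entity_value):
    """Remove all pieces of discontinuous mentions from an Entity value.

    Returns the remaining value, or None if nothing is left.
    """
    kept = [piece for piece in PIECES.findall(entity_value)
            if not PART_MARKER.search(piece)]
    return "".join(kept) or None


print(drop_discontinuous("(e44528--1)"))
print(drop_discontinuous("(e44523[1/2]--1)"))
print(drop_discontinuous("e44523[2/2])"))
```

With this filter, both parts of e44523 would disappear from the converted json, including the first part that is currently kept.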
