Description
Hello!
Is your feature request related to a problem? Please describe.
Currently, the closing index of a discontinuous mention, such as e44523[2/2]), is not captured by the regex in the convert_udcoref.py script.
For example, a CoNLL-U file like this (taken from CorefUD_Norwegian-BokmaalNARC):
# newdoc id = ap~20081210-1411542
# global.Entity = eid-etype-head-other
# newpar
# sent_id = 016148
# text = Jeg synes de er ganske nye, alle sammen, tidligst fra 1500-tallet.
1 Jeg jeg PRON _ Animacy=Hum|Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ Entity=(e44528--1)
2 synes synes VERB _ Mood=Ind|Tense=Pres|VerbForm=Fin 0 root _ _
3 de de PRON _ Case=Nom|Number=Plur|Person=3|PronType=Prs 6 nsubj _ Entity=(e44523[1/2]--1)
4 er være AUX _ Mood=Ind|Tense=Pres|VerbForm=Fin 6 cop _ _
5 ganske ganske ADV _ _ 6 advmod _ _
6 nye ny ADJ _ Degree=Pos|Number=Plur 2 ccomp _ SpaceAfter=No
7 , $, PUNCT _ _ 8 punct _ _
8 alle all DET _ Number=Plur|PronType=Tot 3 det _ Entity=(e44523[2/2]--1
9 sammen sammen ADV _ _ 8 advmod _ Entity=e44523[2/2])|SpaceAfter=No
10 , $, PUNCT _ _ 8 punct _ _
11 tidligst tidlig ADJ _ Definite=Ind|Degree=Sup 13 advmod _ _
12 fra fra ADP _ _ 13 case _ _
13 1500-tallet 1500-tall NOUN _ Definite=Def|Gender=Neut|Number=Sing 6 conj _ Entity=(e44532--1)|SpaceAfter=No
14 . $. PUNCT _ _ 2 punct _ _
will be converted into the following JSON:
[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": ["jeg", "synes", "de", "er", "ganske", "nye", ",", "alle", "sammen", ",", "tidligst", "fra", "1500-tallet", "."],
    "sent_id": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "part_id": 0,
    "deprel": ["nsubj", "root", "nsubj", "cop", "advmod", "ccomp", "punct", "det", "advmod", "punct", "advmod", "case", "conj", "punct"],
    "head": [1, "null", 5, 5, 5, 1, 7, 2, 7, 7, 12, 12, 5, 1],
    "span_clusters": [[[0, 1]], [[2, 3]], [[12, 13]]],
    "word_clusters": [[0], [2], [12]],
    "head2span": [[0, 0, 1], [2, 2, 3], [12, 12, 13]],
    "lang": "no"
  }
]
As you can see, the second part of the mention e44523 (the words "alle sammen") is completely missing from the converted JSON.
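The failure is easy to reproduce in isolation. A minimal sketch (the "old" pattern below is only an approximation of the script's current regex, used for illustration):

```python
import re

# The closing annotation for the second part of e44523 in the example
closing = "e44523[2/2])"

# A pattern in the spirit of the script's current one, which only
# recognizes a bare entity id before ")" (an assumption for
# illustration; see convert_udcoref.py for the real pattern):
OLD = re.compile(r"(\(\w+|\w+\))")
print(OLD.findall(closing))   # [] -- the "[2/2]" part index breaks the match

# The proposed pattern additionally allows an optional "[i/n]" part index:
NEW = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
print(NEW.findall(closing))   # ['e44523[2/2])']
```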
Describe the solution you'd like
It is not clear what the best solution is here, since, to my knowledge, the coref model does not support discontinuous mention spans.
One potential workaround would be to treat discontinuous parts of the same mention as separate mentions under a single entity.
For example, if we change this part of the conversion script (stanza/stanza/utils/datasets/coref/convert_udcoref.py, lines 69 to 109 at af3d42b) like this (changed the regex on line 71 and added a condition on line 94):
head2span = []
word_total = 0

# spans regex
SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")
for parsed_sentence in doc.sentences:
    # parse the misc column, keeping only the "Entity" entries
    misc = [[k.split("=")
             for k in j
             if k.split("=")[0] == "Entity"]
            for i in parsed_sentence.words
            for j in [i.misc.split("|") if i.misc else []]]
    # and extract the Entity entry values
    entities = [i[0][1] if len(i) > 0 else None for i in misc]
    # extract reference information
    refs = [SPANS.findall(i) if i else [] for i in entities]
    # and calculate spans: the basic rule is (e... begins a reference
    # and ) without e before ends the most recent reference;
    # every single time we get a closing element, we pop it off
    # the refdict and insert the pair into final_refs
    refdict = defaultdict(list)
    final_refs = defaultdict(list)
    last_ref = None
    for indx, i in enumerate(refs):
        for j in i:
            # remove the discontinuous part index from a closing
            # bracket, e.g. "e1[1/2])" -> "e1)"
            if j[-1] == ")" and j[-2] == "]":
                j = re.sub(r"\[[\d/]+\]", "", j)
            # this is the beginning of a reference
            if j[0] == "(":
                refdict[j[1 + UDCOREF_ADDN:]].append(indx)
                last_ref = j[1 + UDCOREF_ADDN:]
            # at the end of a reference: if we got exxxxx, that ends
            # a particular reference; otherwise, it ends the last reference
            elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
                if (not UDCOREF_ADDN) or j[0] == "e":
                    try:
                        final_refs[j[UDCOREF_ADDN:-1]].append((refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
                    except IndexError:
                        # this is probably zero anaphora
                        continue
            elif j[-1] == ")":
                final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
                last_ref = None
    final_refs = dict(final_refs)
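The modified loop can be traced on the example sentence with a self-contained sketch. Here UDCOREF_ADDN is hard-coded to 1 (an assumption, corresponding to CorefUD entity ids carrying an "e" prefix), and the zero-anaphora try/except is omitted for brevity:

```python
import re
from collections import defaultdict

# Stand-alone trace of the modified span extraction on the example
# sentence. UDCOREF_ADDN = 1 is an assumption about the CorefUD setting.
UDCOREF_ADDN = 1
SPANS = re.compile(r"(\(\w+|[%\w]+(?:\[[\d/]+\])?\))")

# Entity values by 0-based word index, taken from the CoNLL-U example
entities = {0: "(e44528--1)", 2: "(e44523[1/2]--1)", 7: "(e44523[2/2]--1",
            8: "e44523[2/2])", 12: "(e44532--1)"}

refdict = defaultdict(list)
final_refs = defaultdict(list)
last_ref = None
for indx in range(14):
    for j in SPANS.findall(entities.get(indx, "")):
        if j[-1] == ")" and j[-2] == "]":
            # strip the "[i/n]" part index from a closing bracket
            j = re.sub(r"\[[\d/]+\]", "", j)
        if j[0] == "(":                 # opening bracket of a mention
            refdict[j[1 + UDCOREF_ADDN:]].append(indx)
            last_ref = j[1 + UDCOREF_ADDN:]
        elif j[-1] == ")" and j[UDCOREF_ADDN:-1].isnumeric():
            # explicit close naming the entity, e.g. "e44523)"
            final_refs[j[UDCOREF_ADDN:-1]].append(
                (refdict[j[UDCOREF_ADDN:-1]].pop(-1), indx))
        elif j[-1] == ")":              # close of the last opened mention
            final_refs[last_ref].append((refdict[last_ref].pop(-1), indx))
            last_ref = None

print(dict(final_refs))
# {'44528': [(0, 0)], '44523': [(2, 2), (7, 8)], '44532': [(12, 12)]}
```

The two parts of e44523 come out as separate (start, end) pairs under a single entity id, matching the span_clusters entries [2, 3] and [7, 9] once the script converts the ends to exclusive indices.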
then the discontinuous mention is included in the converted JSON:
[
  {
    "document_id": "ap~20081210-1411542",
    "cased_words": ["jeg", "synes", "de", "er", "ganske", "nye", ",", "alle", "sammen", ",", "tidligst", "fra", "1500-tallet", "."],
    "sent_id": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "part_id": 0,
    "deprel": ["nsubj", "root", "nsubj", "cop", "advmod", "ccomp", "punct", "det", "advmod", "punct", "advmod", "case", "conj", "punct"],
    "head": [1, "null", 5, 5, 5, 1, 7, 2, 7, 7, 12, 12, 5, 1],
    "span_clusters": [[[0, 1]], [[2, 3], [7, 9]], [[12, 13]]],
    "word_clusters": [[0], [2, 7], [12]],
    "head2span": [[0, 0, 1], [2, 2, 3], [7, 7, 9], [12, 12, 13]],
    "lang": "no"
  }
]
Describe alternatives you've considered
Another alternative would be to ignore discontinuous mentions entirely, since they are not very frequent.
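That alternative could be sketched as a filtering pass: collect every entity id that ever appears with an "[i/n]" part index and drop its spans during conversion (the variable names below are illustrative, not from the script):

```python
import re

# Entity values from the example sentence above
entity_values = ["(e44528--1)", "(e44523[1/2]--1)", "(e44523[2/2]--1",
                 "e44523[2/2])", "(e44532--1)"]

# Any id followed by "[i/n]" marks a discontinuous mention
discontinuous = {m.group(1)
                 for value in entity_values
                 for m in re.finditer(r"(\w+)\[\d+/\d+\]", value)}
print(discontinuous)  # {'e44523'} -- skip this entity's spans entirely
```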