Description
Hello!
We have been using Stanza 1.10.1 with single-document processing but want to switch to batch processing to increase throughput. To that end, we ran some benchmarks, which included comparing the results of single processing and batch processing.
From what we can see, the results of processing a text individually are stable: we always get the same tagging result. If we process the same texts via batch processing, we get results that differ from individual processing.
We have tried:
- different batch sizes (25, 50, 100, 500)
- different types of data (user-generated content, Wikipedia text)
- different models (DE, FR, KO)
- checking for multiple line breaks in our data, to rule out issues with the concatenation of documents in the tokeniser
We are seeing differences in sentence splitting and tokenisation (especially around end-of-sentence punctuation), which then propagate further down the pipeline (lemmatisation, dependency relations).
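To illustrate the kind of divergence, here is a minimal sketch of the two modes side by side (the texts below are placeholders, not our actual data):

```python
import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline(lang='de',
                      processors='tokenize,mwt,pos,lemma,depparse',
                      package='gsd')

texts = ["Erster Beispieltext.", "Zweiter Beispieltext."]  # placeholders

single = [nlp(text) for text in texts]                     # one call per text
batch = nlp([Document([], text=text) for text in texts])   # one call for all

for i, (s, b) in enumerate(zip(single, batch)):
    s_sents = [sent.text for sent in s.sentences]
    b_sents = [sent.text for sent in b.sentences]
    if s_sents != b_sents:
        print(f"document {i}: sentence splits differ")
        print("  single:", s_sents)
        print("  batch: ", b_sents)
```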
Key code snippets
```python
import stanza
from stanza.models.common.doc import Document

self.nlp = stanza.Pipeline(lang='de',
                           processors='tokenize,mwt,pos,lemma,depparse',
                           package='gsd')

def get_docs(self, texts):
    # wrap each raw text in an empty Document so the pipeline
    # processes the whole list as a single batch
    documents = [Document([], text=doc_content) for doc_content in texts]
    return self.nlp(documents)
```
```python
def get_documents(model, data, lang):
    for batch in data:
        # each batch is a list of (key, text) pairs
        keys, text = map(list, zip(*batch))
        result_documents = model.get_docs(text)
```
The comparison is done on the JSON representation of the Stanza document.
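Concretely, the check looks roughly like this (a sketch; `single_doc` and `batch_doc` are the outputs of the two modes for the same input text):

```python
import json

def same_annotation(single_doc, batch_doc):
    # Document.to_dict() exposes the same sentence/token/word structure the
    # JSON serialisation is built from, so comparing the dumps catches any
    # difference in tokenisation, lemmas, POS tags or dependency relations
    single_json = json.dumps(single_doc.to_dict(), ensure_ascii=False, sort_keys=True)
    batch_json = json.dumps(batch_doc.to_dict(), ensure_ascii=False, sort_keys=True)
    return single_json == batch_json
```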
Attached is the test data we used. Out of 1000 tagged documents, 13 show differences between the two modes. Is there any way to mitigate this issue?
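One experiment we could run ourselves, if the divergence comes from batching inside the tokeniser, is shrinking its internal minibatch size; a sketch, assuming `tokenize_batch_size` is the relevant knob:

```python
import stanza

# Probe whether the tokeniser's internal batching is the culprit by forcing
# minimal minibatches (at the cost of speed); whether this option actually
# affects the issue is exactly what we are unsure about.
nlp_probe = stanza.Pipeline(lang='de',
                            processors='tokenize,mwt,pos,lemma,depparse',
                            package='gsd',
                            tokenize_batch_size=1)
```

If that made the batch output match the single-document output, it would at least narrow the cause down to the tokeniser's batching.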