Description
Hello!
We have been using Stanza 1.10.1 with single-document processing but want to switch to batch processing to increase throughput. To that end, we ran some benchmarks, which included comparing the results of single processing and batch processing.
From what we can see, the results of processing a text individually are stable: we always get the same tagging result. If we process the same texts via batch processing, we get results that differ from individual processing.
We have tried:
- different batch sizes (25, 50, 100, 500)
- different types of data (user-generated content, Wikipedia text)
- different models (DE, FR, KO)
- checking for multiple line breaks in our data, to rule out issues with the concatenation of documents in the tokeniser
We are seeing differences in sentence splitting and tokenisation (especially around end-of-sentence punctuation), which then propagate further down the pipeline (lemmatisation, dependency relations).
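To illustrate the kind of divergence, here is a minimal sketch of the two modes side by side (the texts below are placeholders, not our actual data):

```python
import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline(lang='de',
                      processors='tokenize,mwt,pos,lemma,depparse',
                      package='gsd')

texts = ["Erster Beispieltext.", "Zweiter Beispieltext."]  # placeholders

single = [nlp(text) for text in texts]                     # one call per text
batch = nlp([Document([], text=text) for text in texts])   # one call for all

for i, (s, b) in enumerate(zip(single, batch)):
    s_sents = [sent.text for sent in s.sentences]
    b_sents = [sent.text for sent in b.sentences]
    if s_sents != b_sents:
        print(f"document {i}: sentence splits differ")
        print("  single:", s_sents)
        print("  batch: ", b_sents)
```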
Key code snippets
```python
import stanza
from stanza.models.common.doc import Document

self.nlp = stanza.Pipeline(lang='de',
                           processors='tokenize,mwt,pos,lemma,depparse',
                           package='gsd')

def get_docs(self, texts):
    # wrap each raw text in an empty Document so the pipeline
    # processes the whole list as a single batch
    documents = [Document([], text=doc_content) for doc_content in texts]
    return self.nlp(documents)
```
```python
def get_documents(model, data, lang):
    for batch in data:
        # each batch is a list of (key, text) pairs
        keys, text = map(list, zip(*batch))
        result_documents = model.get_docs(text)
```
The comparison is done on the JSON representation of the Stanza document.
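Concretely, the check looks roughly like this (a sketch; `single_doc` and `batch_doc` are the outputs of the two modes for the same input text):

```python
import json

def same_annotation(single_doc, batch_doc):
    # Document.to_dict() exposes the same sentence/token/word structure the
    # JSON serialisation is built from, so comparing the dumps catches any
    # difference in tokenisation, lemmas, POS tags or dependency relations
    single_json = json.dumps(single_doc.to_dict(), ensure_ascii=False, sort_keys=True)
    batch_json = json.dumps(batch_doc.to_dict(), ensure_ascii=False, sort_keys=True)
    return single_json == batch_json
```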
Attached is the test data we used. Out of 1000 tagged documents, 13 show differences between the two modes. Is there any way to mitigate this issue?
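One experiment we could run ourselves, if the divergence comes from batching inside the tokeniser, is shrinking its internal minibatch size; a sketch, assuming `tokenize_batch_size` is the relevant knob:

```python
import stanza

# Probe whether the tokeniser's internal batching is the culprit by forcing
# minimal minibatches (at the cost of speed); whether this option actually
# affects the issue is exactly what we are unsure about.
nlp_probe = stanza.Pipeline(lang='de',
                            processors='tokenize,mwt,pos,lemma,depparse',
                            package='gsd',
                            tokenize_batch_size=1)
```

If that made the batch output match the single-document output, it would at least narrow the cause down to the tokeniser's batching.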