Batch processing differs in tagging output from single document processing #1472

Open
@DZNLP

Hello!

We have been using Stanza 1.10.1 with single document processing but want to switch to batch processing to increase speed. For that, we ran some benchmarks, among other things comparing the results of single processing and batch processing.

From what we can see, the results of processing a text individually are stable: we always get the same tagging result. When we process the same texts via batch processing, the results differ from those of individual processing.

We have tried

  • different batch sizes (25, 50, 100, 500)
  • different types of data (user generated content, Wikipedia text)
  • different models (DE, FR, KO)
  • checking for multiple line breaks in our data to avoid issues with the concatenation of documents in the tokeniser
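For reference, the line-break check in the last item can be sketched roughly as follows. This is a minimal illustration, not our production code: the helper names `texts_with_blank_lines` and `collapse_blank_lines` are made up for this example, and treating whitespace-only lines as blank is an assumption.

```python
import re

# Two line breaks with nothing but whitespace between them (assumed definition
# of a "multiple line break" for this sketch).
BLANK_LINE = re.compile(r"\n\s*\n")


def texts_with_blank_lines(texts):
    """Return the indices of all texts that contain at least one blank line."""
    return [i for i, text in enumerate(texts) if BLANK_LINE.search(text)]


def collapse_blank_lines(text):
    """Replace every run of blank lines with a single line break."""
    return BLANK_LINE.sub("\n", text)
```

Running `texts_with_blank_lines` over the input before tagging shows which texts would need cleaning; `collapse_blank_lines` is one possible way to clean them.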

We're seeing differences in sentence splitting and tokenisation (especially around end-of-sentence punctuation), which then lead to changes further down the pipeline (lemmatisation, dependency relations).
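To narrow a difference down to the first diverging sentence, something like the sketch below can help. It assumes each document is given in the nested form we believe `Document.to_dict()` produces (a list of sentences, each a list of token dictionaries with a `text` key); the function names are hypothetical.

```python
def sentence_texts(doc_dict):
    """Reduce a document to its token texts, grouped by sentence."""
    return [[token["text"] for token in sentence] for sentence in doc_dict]


def first_sentence_mismatch(single_doc, batch_doc):
    """Return (index, single_tokens, batch_tokens) for the first sentence whose
    tokenisation differs, or None if both documents are split identically."""
    single = sentence_texts(single_doc)
    batch = sentence_texts(batch_doc)
    for i, (s_tokens, b_tokens) in enumerate(zip(single, batch)):
        if s_tokens != b_tokens:
            return i, s_tokens, b_tokens
    if len(single) != len(batch):
        # One document has extra sentences after the shared prefix.
        i = min(len(single), len(batch))
        s_tokens = single[i] if i < len(single) else None
        b_tokens = batch[i] if i < len(batch) else None
        return i, s_tokens, b_tokens
    return None
```

Because this only looks at token texts, it isolates sentence splitting and tokenisation differences from the downstream lemma and dependency changes they cause.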

Key code snippets

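    import stanza
    from stanza.models.common.doc import Document

    # In the constructor of our model wrapper class:
    self.nlp = stanza.Pipeline(lang='de',
                               processors='tokenize,mwt,pos,lemma,depparse',
                               package='gsd')

    # Method of the same class: wrap each text in an empty Document
    # and pass the whole list to the pipeline as one batch.
    def get_docs(self, texts):
        documents = [Document([], text=doc_content) for doc_content in texts]
        return self.nlp(documents)

    # Driver: each batch is a list of (key, text) pairs.
    def get_documents(model, data, lang):
        for batch in data:
            keys, text = map(list, zip(*batch))
            result_documents = model.get_docs(text)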

The comparison is done on the JSON representation of the Stanza document.
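A simplified sketch of that comparison is shown below. It is not our exact benchmark code: the names `canonical` and `differing_indices` are hypothetical, and it assumes each result has already been converted to a JSON-serialisable structure (for example via `Document.to_dict()`).

```python
import json


def canonical(doc_dict):
    """Serialise a document structure to a stable JSON string."""
    return json.dumps(doc_dict, sort_keys=True, ensure_ascii=False)


def differing_indices(single_results, batch_results):
    """Return the positions at which single and batch results disagree."""
    if len(single_results) != len(batch_results):
        raise ValueError("result lists must have the same length")
    return [
        i
        for i, (single, batch) in enumerate(zip(single_results, batch_results))
        if canonical(single) != canonical(batch)
    ]
```

Sorting the keys keeps the comparison independent of dictionary ordering, so only real content differences are reported.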

Attached is the test data we used. Of the 1000 documents tagged, 13 show differences. Is there any way to mitigate this issue?

wiki_de.csv
