Skip to content

Verbose logging on messages signalling data loss in producer #1553

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 10, 2018

Conversation

sibiryakov
Copy link
Contributor

@sibiryakov sibiryakov commented Jul 19, 2018

Recently, I've spend some human days trying to find where in our multi step pipeline we were having a data loss. It turned out one of our producers were generating batches too often resulting to their expiration. But there were no signs of any problems in logs. This PR attempts to fix this issue.


This change is Reviewable

@sibiryakov sibiryakov changed the title Verbose logging on messages signalling data loss Verbose logging on messages signalling data loss in producer Jul 19, 2018
@dpkp
Copy link
Owner

dpkp commented Aug 31, 2018

You can also achieve this in application code by adding error logging to your produce messages via an callback / errback. Would that approach work for you instead?

@sibiryakov
Copy link
Contributor Author

sibiryakov commented Sep 20, 2018

Yes, I can @dpkp but this is awkward and non-obvious. Look from the user prospective: When user is starting to use the library it is assuming that Producer is reliable, and default settings are sufficient. Then he pushes too much small messages and it could take him days to debug where the problem is, depending on the infrastructure complexity. From my experience when messages are silently dropped in Kafka-based sequence of workers it is extremely hard to find where the problem is, and implies checking the whole sequence.

So my opinion that even there is a way to be notified of the problem, it will be rarely used and therefore result to a bad user experience.

@jeffwidman
Copy link
Contributor

jeffwidman commented Oct 31, 2018

I used to be against this PR, but after thinking about it more (and trying to support many dev teams that aren't familiar with the intricacies of this library), I do agree that the default user experience is better with this.

New users will look for an errback/callback mechanism to let them handle retries or dumping the message into another database for further analysis. But for checking if an application is healthy, the first thing I see most devs do is check the logs and if no errors then they assume everything is fine...

@sibiryakov
Copy link
Contributor Author

sibiryakov commented Nov 2, 2018

But for checking if an application is healthy, the first thing I see most devs do is check the logs and if no errors then they assume everything is fine...

And this is natural. I consider design poor if it requires code to be rewritten to diagnose that there is a data drop happens.

@dpkp dpkp merged commit cd47701 into dpkp:master Nov 10, 2018
@dpkp
Copy link
Owner

dpkp commented Nov 10, 2018

+1 -- thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants