[SERVER-29854] Secondary members stop syncing until we restart them (oplog blocked) Created: 26/Jun/17 Updated: 07/Dec/17 Resolved: 07/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.3, 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Slawomir Lukiewski | Assignee: | Mark Agarunov |
| Resolution: | Done | Votes: | 2 |
| Labels: | MAREF, SWNA | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Hello, Since we upgraded our cluster from 3.2.8 to 3.4.3, secondary members of our most loaded shard regularly stop syncing from the primary. When that happens, the "oplog last event time" field returned by the rs.printReplicationInfo() command stays blocked indefinitely. The secondary member simply falls behind, and we need to restart it to free the oplog and let it catch up with the primary. When this happens, just before the service shutdown, we can see this error in the logs:
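As a minimal sketch of what "falling behind" means here: replication lag is the gap between the newest oplog entry the primary has written and the newest one the secondary has applied. The timestamps below are hypothetical, not taken from this cluster; on a healthy member the secondary's "oplog last event time" keeps advancing, while in the situation described above it stays frozen and the lag grows without bound.

```python
from datetime import datetime, timezone

def replication_lag_seconds(primary_last_event: datetime,
                            secondary_last_event: datetime) -> float:
    """Lag = how far the secondary's newest oplog entry trails the primary's."""
    return (primary_last_event - secondary_last_event).total_seconds()

# Hypothetical "oplog last event time" readings (illustrative values only).
primary = datetime(2017, 6, 23, 3, 0, 0, tzinfo=timezone.utc)
secondary = datetime(2017, 6, 23, 2, 0, 0, tzinfo=timezone.utc)

print(replication_lag_seconds(primary, secondary))  # 3600.0 -> one hour behind
```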
The "took 39149258ms, timeout was set to 65000ms" is strange; it seems the oplog query ran far longer than its timeout. The shard on which this happens receives a huge amount of reads and writes, so it is not shocking to us that an oplog query fails. But the secondary should be able to retry it. And sometimes, when the oplog is blocked and we restart the affected member, it produces these messages in the logs:
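To put the reported numbers in perspective, a quick back-of-the-envelope check (using only the two values quoted from the log line above) shows just how far past its timeout the query ran:

```python
# Values quoted from the log line above; the arithmetic is illustrative.
took_ms = 39_149_258   # reported duration of the oplog query
timeout_ms = 65_000    # configured timeout

overrun_factor = took_ms / timeout_ms   # how many times the timeout it ran
hours = took_ms / 3_600_000             # duration in hours

print(f"query ran ~{hours:.1f} hours, about {overrun_factor:.0f}x its timeout")
```

That is roughly 10.9 hours for a query whose timeout was 65 seconds, about 600 times over, which is why the reporter finds the log line surprising.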
In that case only a "kill -9" lets the member shut down. Our Mongo cluster details:
Thank you in advance for your help. Best regards, |
| Comments |
| Comment by Mark Agarunov [ 07/Nov/17 ] |
|
Hello slluk-sa, Thank you for the additional information. After investigating the provided data, this appears to be due to a network issue: slow communication between mongod nodes keeps locks from yielding, which causes the behavior you're seeing. We've opened a ticket to address this. Thanks, |
| Comment by Slawomir Lukiewski [ 27/Jun/17 ] |
|
Hi Thomas, Thank you for your answer.
You should look at the events of the night of June 23, on the member whose name ends in 24. That day we hit the problem twice (the second time with a failed mongod shutdown).
Best regards, |
| Comment by Kelsey Schubert [ 26/Jun/17 ] |
|
Hi slluk-sa, Thanks for reporting this issue. We'll need some additional information to better understand what is going on here. To help us investigate, would you please provide the complete mongod log files covering this issue as well as an archive of the diagnostic.data for the three nodes in the affected replica set? I've created a secure upload portal for you to provide these files. Files uploaded to the portal are visible only to MongoDB employees investigating the issue and are routinely deleted after some time. Kind regards, |