[SERVER-33446] PowerPC rollback failure Created: 22/Feb/18 Updated: 02/Apr/18 Resolved: 10/Mar/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.7.2 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Kevin Albertson | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Participants: |
| Description |
|
Tests for the C driver had a failure on PowerPC that looks like a mongod failure. I haven't yet been able to reproduce it. Looking at the logs, we see a secondary that is unable to roll back. The replica set is initiated with the following config:
The logs show the roles each node transitions to: The secondary later fasserts with a failure:
It looks like the secondary starts rollback on this line:
It isn't clear to me that this is a bug, but it also seems unlikely that the C driver tests are generating so much data that the secondary's oplog rolls off. Can someone confirm whether this is a server bug, or help explain what is going on here? |
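For anyone trying to reproduce this, one quick check of the oplog-rollover theory is to look at the secondary's oplog window from the shell. This is only a sketch using standard shell helpers, not output from the failing run:

    // Print the configured oplog size and the time span covered by the
    // first and last oplog entries (the "oplog window").
    rs.printReplicationInfo()

    // The same data as a document, useful for scripting the check:
    var info = db.getReplicationInfo()
    printjson({
        logSizeMB: info.logSizeMB,          // configured oplog size
        usedMB: info.usedMB,                // oplog data currently held
        timeDiffHours: info.timeDiffHours   // window between first and last entry
    })

If that window is very short, a node that falls behind for longer than the window can no longer find a common point with its sync source, which makes a scenario like this one much more likely.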
| Comments |
| Comment by Kelsey Schubert [ 06/Mar/18 ] |
|
I think this is the most likely explanation. If you wanted to continue to look into it, I'd suggest reviewing the diagnostic.data after trying to reproduce the issue and looking at the oplog stats as well as the repl lag metrics. Let me know if you'd like help continuing to investigate or if you're comfortable closing this ticket. |
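For reference, the replication-lag side of that check can also be done from the shell (a sketch; on 3.x-era shells the helper still carries its old name):

    // Run on the primary: prints, for each secondary, how far its last
    // applied oplog entry is behind the primary's newest entry.
    rs.printSlaveReplicationInfo()

    // Raw stats for the oplog collection itself, on any member:
    var stats = db.getSiblingDB("local").oplog.rs.stats()
    printjson({ size: stats.size, maxSize: stats.maxSize })  // bytes used vs. capped size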
| Comment by Kevin Albertson [ 22/Feb/18 ] |
|
Looking at our logs, each node has the oplog size set to 100MB. To get at least a rough idea of how much data is generated, I ran the tests on a clean replica set and checked the server status afterwards (attached).
So it looks like the tests are applying more operations than I suspected. Perhaps this, combined with the small oplog, is causing the secondary's oplog to roll over? |
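(For anyone repeating the measurement above: the operation counts can be pulled from serverStatus before and after a run. A minimal sketch; the surrounding test-runner wiring is assumed:)

    // Snapshot opcounters before and after the test suite, then diff
    // them to estimate how many operations the tests generated.
    var before = db.serverStatus().opcounters
    // ... run the C driver test suite here ...
    var after = db.serverStatus().opcounters
    printjson({
        inserts:  after.insert  - before.insert,
        updates:  after.update  - before.update,
        deletes:  after.delete  - before.delete,
        commands: after.command - before.command
    })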
| Comment by William Schultz (Inactive) [ 22/Feb/18 ] |
|
As a rough start, do you have any idea how much data was generated during the span of that test, what the oplog size of each node was, and how much data was in each node's oplog at the end of the test? The rollback failure you see may be due to some other issue, but eliminating the oplog-rollover case is a good start. |
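One way to capture that last data point at the end of a run is to read the first and last oplog entries on each member (a sketch; on a 3.x secondary, allow reads first with rs.slaveOk()):

    // The oldest and newest oplog entries bound the window rollback can
    // use to find a common point with the other members.
    rs.slaveOk()  // permit reads on a secondary (3.x shell helper)
    var oplog = db.getSiblingDB("local").oplog.rs
    var first = oplog.find().sort({$natural: 1}).limit(1).next()
    var last  = oplog.find().sort({$natural: -1}).limit(1).next()
    print("oldest entry: " + tojson(first.ts) + ", newest entry: " + tojson(last.ts))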