[SERVER-43399] Add logging to DeleteOpIsIdBased test to debug rare test hang in waitForAllEarlierOplogWritesToBeVisible Created: 20/Sep/19 Updated: 29/Oct/23 Resolved: 17/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Repl 2019-10-07, Repl 2019-10-21 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 13 | ||||||||
| Description |
|
A hang was observed in the DeleteOpIsIdBased unittest in repltests.cpp. The test performs several deletes (which create delete oplog entries) and immediately queries the oplog, triggering a call to waitForAllEarlierOplogWritesToBeVisible. The stack trace is approximately:
This hang was observed approximately once in Evergreen. It seems likely to be a race involving the WTOplogJournalThread and the main thread, where the main thread is expecting the WTOplogJournalThread to call _setOplogReadTimestamp but it already has / never does. As lingzhi.deng showed me, it may be because waitForAllEarlierOplogWritesToBeVisible increments _opsWaitingForVisibility tell this thread that someone is waiting for it, but the thread checks a different member, _opsWaitingForJournal, to determine if there are any waiters. |
| Comments |
| Comment by Githook User [ 16/Oct/19 ] | ||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}Message: Also clean up unused function in repltests.cpp. | ||
| Comment by Matthew Russotto [ 04/Oct/19 ] | ||
|
I think BF-14344 is the same. In that case, both primary and secondary were stuck at the same timestamp. | ||
| Comment by Matthew Russotto [ 04/Oct/19 ] | ||
|
BF-14344 shows a node in this state, though I don't know if it causes this BF. (It's not waiting for oplog writes, but lastVisible is not advancing) | ||
| Comment by A. Jesse Jiryu Davis [ 21/Sep/19 ] | ||
|
Thanks. Yes, we verified that code is run when the test writes two oplog entries. | ||
| Comment by Daniel Gottlieb (Inactive) [ 21/Sep/19 ] | ||
|
FWIW, this is the code that triggers a journal flush on transaction completion. | ||
| Comment by A. Jesse Jiryu Davis [ 20/Sep/19 ] | ||
|
If we disable the WTOplogJournalThread in the test, then it hangs similarly to how it hung in the build failure, so the thread is definitely critical for the test to complete. It seems like this line in the test:
... ought to cause a journal flush which would advance the visible timestamp, so this line in waitForAllEarlierOplogWritesToBeVisible:
... should return true and finish waiting. It appears like the visible timestamp wasn't advanced, however. One solution might just be to increase the log level for this test to get more info if it fails again. |