[SERVER-43399] Add logging to DeleteOpIsIdBased test to debug rare test hang in waitForAllEarlierOplogWritesToBeVisible Created: 20/Sep/19  Updated: 29/Oct/23  Resolved: 17/Oct/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-44196 Complete TODO listed in SERVER-43399 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-10-07, Repl 2019-10-21
Participants:
Linked BF Score: 13

 Description   

A hang was observed in the DeleteOpIsIdBased unittest in repltests.cpp. The test performs several deletes (which create delete oplog entries) and immediately queries the oplog, triggering a call to waitForAllEarlierOplogWritesToBeVisible. The stack trace is approximately:

Thread 1: "testsuite" (Thread 0x7fdf3e0f7ac0 (LWP 67799))
 .
#10 0x000055f6e9f94225 in mongo::Interruptible::waitForConditionOrInterrupt
#11 mongo::WiredTigerOplogManager::waitForAllEarlierOplogWritesToBeVisible
 .
#16 0x000055f6eb3e54fb in mongo::(anonymous namespace)::FindCmd::Invocation::run
 .
#28 0x000055f6e97c822d in ReplTests::Base::applyAllOperations
#29 0x000055f6e9824b59 in ReplTests::DeleteOpIsIdBased::run 

This hang was observed approximately once in Evergreen. It seems likely to be a race involving the WTOplogJournalThread and the main thread, where the main thread is expecting the WTOplogJournalThread to call _setOplogReadTimestamp but it already has / never does. As lingzhi.deng showed me, it may be because waitForAllEarlierOplogWritesToBeVisible increments _opsWaitingForVisibility tell this thread that someone is waiting for it, but the thread checks a different member, _opsWaitingForJournal, to determine if there are any waiters.



 Comments   
Comment by Githook User [ 16/Oct/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: SERVER-43399 Log more in DeleteOpIsIdBased test

Also clean up unused function in repltests.cpp.
Branch: master
https://github.com/mongodb/mongo/commit/8b32d754d7e12a11e8d4a8479312f9169147e22e

Comment by Matthew Russotto [ 04/Oct/19 ]

I think BF-14344 is the same. In that case, both primary and secondary were stuck at the same timestamp.

Comment by Matthew Russotto [ 04/Oct/19 ]

BF-14344 shows a node in this state, though I don't know if it causes this BF. (It's not waiting for oplog writes, but lastVisible is not advancing)

Comment by A. Jesse Jiryu Davis [ 21/Sep/19 ]

Thanks. Yes, we verified that code is run when the test writes two oplog entries.

Comment by Daniel Gottlieb (Inactive) [ 21/Sep/19 ]

FWIW, this is the code that triggers a journal flush on transaction completion.

Comment by A. Jesse Jiryu Davis [ 20/Sep/19 ]

If we disable the WTOplogJournalThread in the test, then it hangs similarly to how it hung in the build failure, so the thread is definitely critical for the test to complete. It seems like this line in the test:

_client.remove(ns(), BSON("a" << 10)); 

... ought to cause a journal flush which would advance the visible timestamp, so this line in waitForAllEarlierOplogWritesToBeVisible:

        return latestVisible >= waitingFor; 

... should return true and finish waiting. It appears like the visible timestamp wasn't advanced, however. One solution might just be to increase the log level for this test to get more info if it fails again.

Generated at Thu Feb 08 05:03:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.