[SERVER-57219] JournalFlusher not shut down by OpMsgFuzzerFixture Created: 26/May/21  Updated: 04/Feb/22  Resolved: 04/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Dianna Hohensee (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Sprint: Execution Team 2021-07-26, Execution Team 2022-02-21
Participants:
Linked BF Score: 0

 Description   

In BF-20646, we can see that a JournalFlusher thread was still running after main() had exited in a very simple unit test of the OpMsgFuzzerFixture. The log contained neither of the messages that we'd expect from the JournalFlusher::shutdown call.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 26/May/21 ]

Yeah, that seems likely, but then I'm still confused why the JournalFlusher::shutdown() function doesn't log a message, because the JournalFlusher must be created serially, even if its internal thread doesn't start serially. So JournalFlusher::shutdown() should still log, even if it doesn't have a thread to wait on, right?

An angle to explore would be what shutdown() does – like maybe nothing? – if the thread has not yet started. Then maybe the thread starts after the fact and runs into the memory error. But that wouldn't explain the lack of logging...

Comment by Billy Donahue [ 26/May/21 ]

Maybe the ping test is over so quickly that the JournalFlusher thread never really gets started before the test body finishes, and there's some kind of race there.

Comment by Dianna Hohensee (Inactive) [ 26/May/21 ]

I ran the test locally, and it doesn't log anything by the journal flusher

~/mongo (SERVER-49769-radix_iterator)*$ ./build/install/bin/op_msg_fuzzer_fixture_test
{"t":{"$date":"2021-05-26T17:02:40.969Z"},"s":"I",  "c":"NETWORK",  "id":4915701, "ctx":"-","msg":"Initialized wire specification","attr":{"spec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":13},"incomingInternalClient":{"minWireVersion":0,"maxWireVersion":13},"outgoing":{"minWireVersion":0,"maxWireVersion":13},"isInternalClient":false}}}
{"t":{"$date":"2021-05-26T17:02:40.984Z"},"s":"I",  "c":"TEST",     "id":23063,   "ctx":"main","msg":"Running","attr":{"suite":"OpMsgFuzzerFixtureTest"}}
{"t":{"$date":"2021-05-26T17:02:40.984Z"},"s":"I",  "c":"TEST",     "id":23059,   "ctx":"main","msg":"Running","attr":{"test":"Ping","rep":1,"reps":1}}
{"t":{"$date":"2021-05-26T17:02:40.984Z"},"s":"I",  "c":"TEST",     "id":23051,   "ctx":"main","msg":"Created temporary directory: {path}","attr":{"path":"/tmp/op_msg_fuzzer_fixture-2189-f2e3-56a6-ba0f"}}
{"t":{"$date":"2021-05-26T17:02:40.991Z"},"s":"E",  "c":"-",        "id":23077,   "ctx":"test","msg":"Assertion","attr":{"error":"Location34348: cannot translate opcode 1834942464","file":"src/mongo/rpc/message.h","line":120}}
{"t":{"$date":"2021-05-26T17:02:40.996Z"},"s":"E",  "c":"COMMAND",  "id":4879802, "ctx":"test","msg":"Failed to handle request","attr":{"error":"Location34348: cannot translate opcode 1834942464"}}
{"t":{"$date":"2021-05-26T17:02:40.996Z"},"s":"I",  "c":"STORAGE",  "id":22320,   "ctx":"test","msg":"Shutting down journal flusher thread"}
{"t":{"$date":"2021-05-26T17:02:40.996Z"},"s":"I",  "c":"STORAGE",  "id":22321,   "ctx":"test","msg":"Finished shutting down journal flusher thread"}
{"t":{"$date":"2021-05-26T17:02:40.997Z"},"s":"I",  "c":"STORAGE",  "id":20282,   "ctx":"test","msg":"Deregistering all the collections"}
{"t":{"$date":"2021-05-26T17:02:40.997Z"},"s":"I",  "c":"TEST",     "id":23060,   "ctx":"test","msg":"Done running tests"}
{"t":{"$date":"2021-05-26T17:02:40.997Z"},"s":"I",  "c":"TEST",     "id":4680101, "ctx":"test","msg":"Result","attr":{"suite":{"name":"OpMsgFuzzerFixtureTest","tests":1,"fails":0,"asserts":0,"time":{"durationMillis":12},"failures":[]}}}
{"t":{"$date":"2021-05-26T17:02:40.997Z"},"s":"I",  "c":"TEST",     "id":23065,   "ctx":"test","msg":"Totals","attr":{"totals":{"name":"TOTALS","tests":1,"fails":0,"asserts":0,"time":{"durationMillis":12},"failures":[]}}}
{"t":{"$date":"2021-05-26T17:02:40.997Z"},"s":"I",  "c":"TEST",     "id":23069,   "ctx":"test","msg":"SUCCESS - All tests in all suites passed"}

I then added some additional logging to try to see what's going on

~/mongo (SERVER-49769-radix_iterator)*$ ./build/install/bin/op_msg_fuzzer_fixture_test
{"t":{"$date":"2021-05-26T17:12:44.251Z"},"s":"I",  "c":"NETWORK",  "id":4915701, "ctx":"-","msg":"Initialized wire specification","attr":{"spec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":13},"incomingInternalClient":{"minWireVersion":0,"maxWireVersion":13},"outgoing":{"minWireVersion":0,"maxWireVersion":13},"isInternalClient":false}}}
{"t":{"$date":"2021-05-26T17:12:44.263Z"},"s":"I",  "c":"TEST",     "id":23063,   "ctx":"main","msg":"Running","attr":{"suite":"OpMsgFuzzerFixtureTest"}}
{"t":{"$date":"2021-05-26T17:12:44.263Z"},"s":"I",  "c":"TEST",     "id":23059,   "ctx":"main","msg":"Running","attr":{"test":"Ping","rep":1,"reps":1}}
{"t":{"$date":"2021-05-26T17:12:44.263Z"},"s":"I",  "c":"TEST",     "id":23051,   "ctx":"main","msg":"Created temporary directory: {path}","attr":{"path":"/tmp/op_msg_fuzzer_fixture-37c2-7ae1-ee9d-2844"}}
{"t":{"$date":"2021-05-26T17:12:44.268Z"},"s":"I",  "c":"STORAGE",  "id":0,       "ctx":"test","msg":"~~~startStorageControls"}
{"t":{"$date":"2021-05-26T17:12:44.268Z"},"s":"I",  "c":"STORAGE",  "id":0,       "ctx":"test","msg":"~~~startStorageControls, instantiating journal flusher"}
{"t":{"$date":"2021-05-26T17:12:44.268Z"},"s":"I",  "c":"STORAGE",  "id":0,       "ctx":"test","msg":"~~~startStorageControls, done instantiating journal flusher"}
{"t":{"$date":"2021-05-26T17:12:44.268Z"},"s":"I",  "c":"STORAGE",  "id":0,       "ctx":"JournalFlusher","msg":"~~~JournalFlusher::run(), starting"}
{"t":{"$date":"2021-05-26T17:12:44.268Z"},"s":"E",  "c":"-",        "id":23077,   "ctx":"test","msg":"Assertion","attr":{"error":"Location34348: cannot translate opcode 1834942464","file":"src/mongo/rpc/message.h","line":120}}
{"t":{"$date":"2021-05-26T17:12:44.269Z"},"s":"E",  "c":"COMMAND",  "id":4879802, "ctx":"test","msg":"Failed to handle request","attr":{"error":"Location34348: cannot translate opcode 1834942464"}}
{"t":{"$date":"2021-05-26T17:12:44.269Z"},"s":"I",  "c":"STORAGE",  "id":0,       "ctx":"test","msg":"~~~stopStorageControls"}
{"t":{"$date":"2021-05-26T17:12:44.269Z"},"s":"I",  "c":"STORAGE",  "id":0,       "ctx":"test","msg":"~~~JournalFlusher::shutdown(), stopping"}
{"t":{"$date":"2021-05-26T17:12:44.269Z"},"s":"I",  "c":"STORAGE",  "id":22320,   "ctx":"test","msg":"Shutting down journal flusher thread"}
{"t":{"$date":"2021-05-26T17:12:44.270Z"},"s":"I",  "c":"STORAGE",  "id":22321,   "ctx":"test","msg":"Finished shutting down journal flusher thread"}
{"t":{"$date":"2021-05-26T17:12:44.270Z"},"s":"I",  "c":"STORAGE",  "id":20282,   "ctx":"test","msg":"Deregistering all the collections"}
{"t":{"$date":"2021-05-26T17:12:44.270Z"},"s":"I",  "c":"TEST",     "id":23060,   "ctx":"test","msg":"Done running tests"}
{"t":{"$date":"2021-05-26T17:12:44.270Z"},"s":"I",  "c":"TEST",     "id":4680101, "ctx":"test","msg":"Result","attr":{"suite":{"name":"OpMsgFuzzerFixtureTest","tests":1,"fails":0,"asserts":0,"time":{"durationMillis":6},"failures":[]}}}
{"t":{"$date":"2021-05-26T17:12:44.270Z"},"s":"I",  "c":"TEST",     "id":23065,   "ctx":"test","msg":"Totals","attr":{"totals":{"name":"TOTALS","tests":1,"fails":0,"asserts":0,"time":{"durationMillis":6},"failures":[]}}}
{"t":{"$date":"2021-05-26T17:12:44.270Z"},"s":"I",  "c":"TEST",     "id":23069,   "ctx":"test","msg":"SUCCESS - All tests in all suites passed"}

Seems like the extra logging slowed something down such that the regular JournalFlusher logging comes through? Weird... Logging

Generated at Thu Feb 08 05:41:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.