[SERVER-35519] NULL dereference in LogicalSessionCacheReap Created: 09/Jun/18  Updated: 23/Jul/18  Resolved: 26/Jun/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.1.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Blake Oler
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.log    
Operating System: ALL
Participants:

 Description   

The transactions test suite for both the C Driver and Motor are able to pretty often crash the server, version 4.1.0-263-gbb2de3700e, configured as a 3-node RS on macOS with test commands enabled:

2018-06-08T23:14:25.068-0400 F -        [LogicalSessionCacheReap] Invalid access at address: 0x0
2018-06-08T23:14:25.106-0400 F -        [LogicalSessionCacheReap] Got signal: 11 (Segmentation fault: 11).
...
  ??? [0x0]
 mongod(_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE+0x9F8) [0x10fa20cd8]
 mongod(_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x3B) [0x10fa1820b]
 mongod(_ZN5mongo12_GLOBAL__N_121loopbackBuildResponseEPNS_16OperationContextEPNS_9LastErrorERNS_7MessageE+0xCF) [0x110a2924f]
 mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPNSt3__112basic_stringIcNS3_11char_traitsIcEENS3_9allocatorIcEEEE+0x37) [0x110a290e7]
 mongod(_ZN5mongo14DBClientCursor4initEv+0x6D) [0x110ccc90d]
 mongod(_ZN5mongo12DBClientBase5queryERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEENS_5QueryEiiPKNS_7BSONObjEii+0x68) [0x110cc5c48]
 mongod(_ZN5mongo14DBDirectClient5queryERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEENS_5QueryEiiPKNS_7BSONObjEii+0x55) [0x110a29425]
 mongod(_ZN5mongo12_GLOBAL__N_121TransactionReaperImplINS0_11ReplHandlerEE4reapEPNS_16OperationContextE+0x2B9) [0x10fd82639]
 mongod(_ZN5mongo23LogicalSessionCacheImpl5_reapEPNS_6ClientE+0x1B3) [0x110be70e3]
 mongod(_ZN5mongo23LogicalSessionCacheImpl13_periodicReapEPNS_6ClientE+0x30) [0x110be7630]
 mongod(_ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN5mongo18PeriodicRunnerImpl15PeriodicJobImpl3runEvE3$_1EEEEEPvSC_+0xD8) [0x1107b71c8]
 libsystem_pthread.dylib(_pthread_body+0x154) [0x7fff71b8b661]



 Comments   
Comment by Kaloian Manassiev [ 26/Jun/18 ]

Spoke with Jesse on email and since he was using locally built binaries which were lost, we are unable to recover the exact line of the crash and from the mangled stack trace is it not obvious which part of the command execution pipeline is nullptr.

I am closing it as "cannot reproduce".

jesse, since the description says "pretty often", please reopen this ticket if it happens again and make sure you save the symbols.

Comment by Blake Oler [ 12/Jun/18 ]

jesse and I discussed this offline. He's going to turn on core dumps to give us more info the next time this crashes. In the meantime, I'm going to investigate the sequence of operations stressed by the Python driver (Motor) tests to tease out any race conditions.

Comment by Eric Milkie [ 11/Jun/18 ]

Indeed the default LogicalSessionDefaultRefresh time is 5 minutes.

Comment by A. Jesse Jiryu Davis [ 11/Jun/18 ]

Dammit I can't reproduce this now, sorry. I didn't have core dumps enabled at the time I observed the crashes. I only observed it with 4.1.0-263-gbb2de3700e, I haven't been testing with 4.0.0-rc0 recently.

I believe this is related to me developing the driver transactions tests: I would begin a test that would open a session and start a transaction, then I would stop the test and start over. As a result, there were open sessions with transactions in progress when my test called "killAllSessions", and the server would crash then or some time afterward.

I notice that the log is suspiciously close to 5 minutes long, but I don't know if there's a 5-minute timer involved or not.

Comment by Kaloian Manassiev [ 11/Jun/18 ]

jesse, does this only impact master (4.1 codebase) or it is happening on 4.0 as well?

Do you have any repro steps and/or the core dump from one of these occurrences?

Generated at Thu Feb 08 04:40:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.