[SERVER-21078] Segfault from race between getlasterror with fsync:true and clean database shutdown Created: 22/Oct/15  Updated: 25/Nov/15  Resolved: 20/Nov/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.0-rc0
Fix Version/s: 3.2.0-rc4

Type: Bug Priority: Minor - P4
Reporter: Andy Schwerin Assignee: Geert Bosch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File fsyncwt.sh    
Issue Links:
Duplicate
is duplicated by SERVER-21582 Sharding tests' run time has doubled Closed
Related
is related to SERVER-20856 During CSRS upgrade, config server ge... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Run attached bash shell script, fsyncwt.sh, from a directory that contains "mongo" and "mongod" binaries.

It does the following:

  1. launches mongod
  2. starts a background job that runs getlasterror with fsync:true in a loop
  3. shuts down the mongod using the shutdown command
  4. expects mongod to exit with code 0

The actual behavior is that eventually (rarely in more than 5 runs for me) the mongod segfaults.

Sprint: QuInt C (11/23/15)
Participants:

 Description   

There appears to be a race between shutting down a mongod instance and running getlasterror with fsync:true when the storage engine is wired tiger and journaling is disabled. When the race goes poorly, mongod segfaults.

The stack trace via addr2line -i is as follows, with a build of mongod at commit 737bb20fcb9176eb5f664bd874cdaece779d4012.

/data/rbuild/mongo/src/mongo/util/stacktrace_posix.cpp:171
/data/rbuild/mongo/src/mongo/util/signal_handlers_synchronous.cpp:179
/data/rbuild/mongo/src/mongo/util/signal_handlers_synchronous.cpp:274
??:0
/data/rbuild/mongo/src/third_party/wiredtiger/src/txn/txn_ckpt.c:322
/data/rbuild/mongo/src/third_party/wiredtiger/src/txn/txn_ckpt.c:529 (discriminator 1)
/data/rbuild/mongo/src/third_party/wiredtiger/src/session/session_api.c:1086 (discriminator 3)
/data/rbuild/mongo/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:299
/data/rbuild/mongo/src/mongo/db/storage/kv/kv_storage_engine.cpp:247
/data/rbuild/mongo/src/mongo/db/write_concern.cpp:244
/data/rbuild/mongo/src/mongo/db/commands/get_last_error.cpp:274 (discriminator 1)
/data/rbuild/mongo/src/mongo/db/dbcommands.cpp:1385
/data/rbuild/mongo/src/mongo/db/dbcommands.cpp:1290
/data/rbuild/mongo/src/mongo/db/commands.cpp:496
/data/rbuild/mongo/src/mongo/db/instance.cpp:293
/data/rbuild/mongo/src/mongo/db/instance.cpp:526
/data/rbuild/mongo/src/mongo/db/db.cpp:170 (discriminator 1)
/data/rbuild/mongo/src/mongo/util/net/message_server_port.cpp:229



 Comments   
Comment by Githook User [ 20/Nov/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-21078: Avoid unsafe sessionCache usage in waitUntilDurable
Branch: master
https://github.com/mongodb/mongo/commit/3504bc8ef46733ae79863e8d34765735a67fcb44

Comment by Githook User [ 20/Nov/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: Revert "SERVER-21078 - Fix race conditions by using the session cache more within the wtkv engine code"

This reverts commit 1944e28410ee687c7314e848d96582d5a9d54ff6.
Branch: master
https://github.com/mongodb/mongo/commit/d2b49fe116b8d8012220fbbe0419fd2f57265054

Comment by Githook User [ 19/Nov/15 ]

Author:

{u'username': u'daveh86', u'name': u'David Hows', u'email': u'howsdav@gmail.com'}

Message: SERVER-21078 - Fix race conditions by using the session cache more within the wtkv engine code
Branch: master
https://github.com/mongodb/mongo/commit/1944e28410ee687c7314e848d96582d5a9d54ff6

Comment by David Hows [ 16/Nov/15 ]

Just rebased this patch against the tip of master and kicked off.

Will follow up with the results.

Comment by Michael Cahill (Inactive) [ 16/Nov/15 ]

daveh86, I see the CR here https://mongodbcr.appspot.com/28910002

Is there a patch build for the current version of the change?

Comment by David Hows [ 26/Oct/15 ]

Been looking into this, looks like the segfault is caused by the WTKV Engine using a session that is not taken from its internal cache. This means that when shutting down the server we can close the underlying WT connection while this session is still active and trying to write a checkpoint.

I'm doing some short testing on an initial fix and there may been need to remediate further things within the WTKV Engine code as I can see at least 3 other places that use sessions not taken from the Engines internal cache.

Comment by Andy Schwerin [ 22/Oct/15 ]

I have yet to repro this failure with journaling enabled. I've run the attached script through 170 iterations with no failures.

Comment by Andy Schwerin [ 22/Oct/15 ]

Noticed this in the logs attached to SERVER-20856, though it's probably not the cause of that bug.

Generated at Thu Feb 08 03:56:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.