[SERVER-21078] Segfault from race between getlasterror with fsync:true and clean database shutdown Created: 22/Oct/15 Updated: 25/Nov/15 Resolved: 20/Nov/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.0-rc0 |
| Fix Version/s: | 3.2.0-rc4 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Andy Schwerin | Assignee: | Geert Bosch |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Run attached bash shell script, fsyncwt.sh, from a directory that contains "mongo" and "mongod" binaries. It does the following:
The actual behavior is that mongod eventually segfaults (it rarely takes more than 5 runs for me). |
||||||||||||||||
| Sprint: | QuInt C (11/23/15) | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
There appears to be a race between shutting down a mongod instance and running getlasterror with fsync:true when the storage engine is WiredTiger and journaling is disabled. When the race goes poorly, mongod segfaults. The stack trace via addr2line -i is as follows, with a build of mongod at commit 737bb20fcb9176eb5f664bd874cdaece779d4012.
|
| Comments |
| Comment by Githook User [ 20/Nov/15 ] |
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com> Message: |
| Comment by Githook User [ 20/Nov/15 ] |
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com> Message: Revert " This reverts commit 1944e28410ee687c7314e848d96582d5a9d54ff6. |
| Comment by Githook User [ 19/Nov/15 ] |
|
Author: David Hows (daveh86) <howsdav@gmail.com> Message: |
| Comment by David Hows [ 16/Nov/15 ] |
|
Just rebased this patch against the tip of master and kicked off. Will follow up with the results. |
| Comment by Michael Cahill (Inactive) [ 16/Nov/15 ] |
|
daveh86, I see the CR at https://mongodbcr.appspot.com/28910002 but is there a patch build for the current version of the change? |
| Comment by David Hows [ 26/Oct/15 ] |
|
I've been looking into this. It looks like the segfault is caused by the WTKV Engine using a session that is not taken from its internal cache. This means that when shutting down the server we can close the underlying WT connection while this session is still active and trying to write a checkpoint. I'm doing some short testing on an initial fix, and there may be a need to remediate further things within the WTKV Engine code, as I can see at least 3 other places that use sessions not taken from the engine's internal cache. |
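To illustrate the idea in the comment above, here is a minimal Python sketch of why routing every session through a cache helps: the cache can count sessions that are still in use and make shutdown wait for them before closing the connection. This is not the MongoDB or WiredTiger code. The names `Connection`, `Session`, `SessionCache`, and `run_demo` are hypothetical and exist only for this illustration, and the actual fix in the server may work differently.

```python
import threading


class Connection:
    """Hypothetical stand-in for a storage-engine connection."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class Session:
    """Hypothetical session that is only usable while its connection is open."""

    def __init__(self, conn):
        self._conn = conn

    def checkpoint(self):
        if not self._conn.open:
            # In a native engine this is roughly where a crash could occur.
            raise RuntimeError("checkpoint on closed connection")
        return "checkpoint ok"


class SessionCache:
    """Hands out sessions and tracks how many are still in use."""

    def __init__(self, conn):
        self._conn = conn
        self._cond = threading.Condition()
        self._in_use = 0
        self._shutting_down = False

    def acquire(self):
        with self._cond:
            if self._shutting_down:
                raise RuntimeError("shutting down")
            self._in_use += 1
        return Session(self._conn)

    def release(self, session):
        with self._cond:
            self._in_use -= 1
            self._cond.notify_all()

    def shutdown(self):
        with self._cond:
            # Refuse new sessions, then wait for outstanding ones to finish.
            self._shutting_down = True
            while self._in_use > 0:
                self._cond.wait()
            self._conn.close()


def run_demo():
    conn = Connection()
    cache = SessionCache(conn)
    session = cache.acquire()

    # Shutdown runs concurrently but cannot close the connection yet,
    # because one session obtained from the cache is still in use.
    stopper = threading.Thread(target=cache.shutdown)
    stopper.start()

    result = session.checkpoint()
    cache.release(session)
    stopper.join()
    return [result, conn.open]
```

A session created directly with `Session(conn)`, bypassing `SessionCache.acquire`, would be invisible to `shutdown`, which is the kind of gap the comment describes.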
| Comment by Andy Schwerin [ 22/Oct/15 ] |
|
I have yet to repro this failure with journaling enabled. I've run the attached script through 170 iterations with no failures. |
| Comment by Andy Schwerin [ 22/Oct/15 ] |
|
Noticed this in the logs attached to |