[SERVER-22869] large $sample size causes seg fault on wiredtiger Created: 26/Feb/16 Updated: 19/Nov/16 Resolved: 31/Mar/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.3 |
| Fix Version/s: | 3.2.5, 3.3.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Thomas Rueckstiess | Assignee: | Geert Bosch |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-only | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Steps To Reproduce: | Mac OS X 10.11.3, SSD, mongodb 3.2.3, node.js driver 2.1.7 node.js script to cause the seg fault is attached (requires mongodb driver). I was unable to reproduce it in the shell (possibly due to much smaller batch sizes?) |
| Sprint: | Integration 12 (04/04/16) |
| Participants: |
| Description |
|
I'm executing a $sample aggregation through the node.js driver with a large sample size (e.g. 10000 on a collection of size 1M), which reproducibly causes a seg fault and crashes the server. In several attempts with smaller sample sizes (100, 1000) the crash did not occur. I was only able to reproduce on WiredTiger, not MMAP. Every time it crashes, I can see "duplicate document" messages in the log file in the getMore command, followed by an "Invalid access at address: 0x90." Relevant log lines and stack trace:
Full log file (verbosity 2) attached. |
| Comments |
| Comment by Githook User [ 04/Apr/16 ] |
|
Author: Geert Bosch (GeertBosch, geert@mongodb.com) Message: |
| Comment by Githook User [ 01/Apr/16 ] |
|
Author: Geert Bosch (GeertBosch, geert@mongodb.com) Message: (cherry picked from commit 4d24b36e82e5b2ddba32b30c5bc2499d635d1397) |
| Comment by Max Hirschhorn [ 26/Feb/16 ] |
|
Running the script on a debug build gives the useful information that we are segfaulting when trying to close the _cursor member of the RandomCursor. By instrumenting the build with some logging statements in the RandomCursor methods, it appears we are calling
without calling RandomCursor::restore() between steps #3 and #4. According to the doc comments for the RecordCursor class, this sequence of operations is legitimate, because reattachToOperationContext() puts the cursor back in a "saved" state. The WiredTiger integration layer should probably handle this by guarding the _cursor->close(_cursor) call with if (_cursor) {...}. Additionally, it seems desirable to check the return value of the WiredTiger API call with invariantWTOK().
CC geert.bosch |
| Comment by Thomas Rueckstiess [ 26/Feb/16 ] |
|
collection stats of this collection:
|