[SERVER-19990] abort in bt_discard.c, 419 Created: 17/Aug/15 Updated: 22/Sep/15 Resolved: 28/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.1.7 |
| Fix Version/s: | 3.1.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Eric Milkie | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||
| Description |
|
https://logkeeper.mongodb.org/build/55d24df390413011a25f61c1/all This was a replication jstest running on the head of master branch. The secondary node hit this error and shut down:
This appears to have happened after three attempts by replication to drop a collection. It appears three times in the log, so my guess is that it may have hit a WriteConflictException and retried. I believe there is really only one drop instruction in the replication stream.
|
| Comments |
| Comment by Michael Cahill (Inactive) [ 22/Sep/15 ] |
|
ramon.fernandez, WiredTiger only checks assertions when configured for diagnostic builds, so we could only see this in debug builds on MongoDB. Triggering this would require a combination of high insert rates and collection drops on a debug build (and hasn't shown up in Evergreen in 3.0, as far as I know). tl;dr I am happy not to backport this. |
| Comment by Michael Cahill (Inactive) [ 28/Aug/15 ] |
|
Should be fixed by the latest WT drop into MongoDB master. |
| Comment by Eric Milkie [ 26/Aug/15 ] |
|
Recording another test failure instance here, for tracking purposes: |
| Comment by Githook User [ 25/Aug/15 ] |
|
Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}Message: Merge pull request #2145 from wiredtiger/
|
| Comment by Githook User [ 25/Aug/15 ] |
|
Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}Message: Merge pull request #2145 from wiredtiger/
|
| Comment by Githook User [ 25/Aug/15 ] |
|
Author: {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}Message: |
| Comment by Michael Cahill (Inactive) [ 24/Aug/15 ] |
|
tl;dr I think I understand this, and the assertion is too strong. I'll fix. I believe a tree has been marked "dead" by WiredTiger, which happens during a drop operation so the file can be removed from cache in the background (to keep the application drop operation fast). Once in this state, it is okay to throw away all of the updates from the tree, and that usually happens quickly when WiredTiger's "sweep" thread notices a dead tree and discards it. However, if there is enough cache pressure, eviction will also try to discard pages from dead trees before the sweep server gets to them, and in that case, this assertion should not fire. It is likely that recent changes to sweep have made this more likely, which is why we are seeing these assertions now. |
| Comment by Eric Milkie [ 21/Aug/15 ] |
|
Another example of a failure, on Windows, but in the same test suite as the others: This might be incidental, but the failure looks like it typically comes immediately after a foreground index build, which means WT's bulk inserter was exercised. |
| Comment by Andy Schwerin [ 21/Aug/15 ] |
|
Looks like the signal handlers are configured with SA_RESETHAND, which indeed makes double aborts lose stack traces. I believe that is fixable, at least. |
| Comment by Eric Milkie [ 21/Aug/15 ] |
|
Oh, it might be because two threads called abort() simultaneously. The second abort() might just kill the process? Signals are hard. |
| Comment by Eric Milkie [ 21/Aug/15 ] |
|
I cannot explain why we aren't getting a stacktrace, as it seems that we are definitely calling abort(), which ought to fall into the signal handler for SIGABRT and produce a stack trace. Odd. |
| Comment by Eric Milkie [ 21/Aug/15 ] |
|
Details: https://logkeeper.mongodb.org/build/55d66c3d90413011a27bc90f/test/55d6730dbe07c47abf7bf953 An aside: I've noticed that for these types of failures, no stack trace is printed in the log. Is this unintentional? |
| Comment by Eric Milkie [ 21/Aug/15 ] |
|
Unfortunately, this happened again with the latest build from master. |
| Comment by Michael Cahill (Inactive) [ 18/Aug/15 ] |
|
P.S. For clarity, I know the assertion is not exactly the same, but the underlying bug was that WiredTiger's "oldest ID" (which determines the oldest update we have to keep available to readers) was moving too far forward. There are various assertions in WiredTiger that check for this, including the one here as well as the one that triggered |
| Comment by Michael Cahill (Inactive) [ 18/Aug/15 ] |
|
I believe this is a dup of |
| Comment by Eric Milkie [ 17/Aug/15 ] |
|
I take back what I said about the collection drop; the "cursora.js" test indeed drops that collection as a routine matter of testing. The test actually tests four different ways of inserting and removing data, and it drops the test collection in between each phase. |
| Comment by Eric Milkie [ 17/Aug/15 ] |
|
Here is a link to the test run itself: |