[SERVER-19990] abort in bt_discard.c, 419 Created: 17/Aug/15  Updated: 22/Sep/15  Resolved: 28/Aug/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.1.7
Fix Version/s: 3.1.8

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
Duplicate
duplicates WT-2021 test/thread test failure Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

https://logkeeper.mongodb.org/build/55d24df390413011a25f61c1/all

This was a replication jstest running on the head of master branch. The secondary node hit this error and shut down:

[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:17.531+0000 E STORAGE  [thread1] WiredTiger (0) [1439846057:531093][2736:2005873552], file:collection-992-5875615686962012966.wt, eviction-worker: src\third_party\wiredtiger\src\btree\bt_discard.c, 419: F_ISSET(session, WT_SESSION_DISCARD_FORCE) || upd->txnid == WT_TXN_ABORTED || __wt_txn_visible_all(session, upd->txnid)
[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:17.531+0000 E STORAGE  [thread2] WiredTiger (0) [1439846057:531093][2736:2005873552], file:index-993-5875615686962012966.wt, eviction-worker: src\third_party\wiredtiger\src\btree\bt_discard.c, 419: F_ISSET(session, WT_SESSION_DISCARD_FORCE) || upd->txnid == WT_TXN_ABORTED || __wt_txn_visible_all(session, upd->txnid)
[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:17.531+0000 E STORAGE  [thread1] WiredTiger (0) [1439846057:531093][2736:2005873552], file:collection-992-5875615686962012966.wt, eviction-worker: aborting WiredTiger library
[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:17.531+0000 E STORAGE  [thread2] WiredTiger (0) [1439846057:531093][2736:2005873552], file:index-993-5875615686962012966.wt, eviction-worker: aborting WiredTiger library

This appears to have happened after three attempts by replication to drop a collection. It appears three times in the log, so my guess is that it may have hit a WriteConflictException and retried. I believe there is really only one drop instruction in the replication stream.

[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:12.544+0000 I COMMAND  [repl writer worker 2] CMD: drop test.cursora
[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:16.079+0000 I COMMAND  [repl writer worker 1] CMD: drop test.cursora
[ReplicaSetFixture:job0:secondary] 2015-08-17T21:14:17.117+0000 I COMMAND  [repl writer worker 1] CMD: drop test.cursora



 Comments   
Comment by Michael Cahill (Inactive) [ 22/Sep/15 ]

ramon.fernandez, WiredTiger only checks assertions when configured for diagnostic builds, so we could only see this in debug builds on MongoDB. Triggering this would require a combination of high insert rates and collection drops on a debug build (and hasn't shown up in Evergreen in 3.0, as far as I know).

tl;dr I am happy not to backport this.

Comment by Michael Cahill (Inactive) [ 28/Aug/15 ]

Should be fixed by the latest WT drop into MongoDB master.

Comment by Eric Milkie [ 26/Aug/15 ]

Recording another test failure instance here, for tracking purposes:
https://evergreen.mongodb.com/task/mongodb_mongo_master_windows_64_2k8_debug_jsCore_small_oplog_rs_WT_5d76623abc23f453c6530b1f8543476c5d65c4e9_15_08_26_04_49_10

Comment by Githook User [ 25/Aug/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2145 from wiredtiger/SERVER-19990

SERVER-19990 Don't assert on eviction of live updates from dead trees.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/3e46e79990170b9fd4c03e087c6876c22733fe5f

Comment by Githook User [ 25/Aug/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2145 from wiredtiger/SERVER-19990

SERVER-19990 Don't assert on eviction of live updates from dead trees.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/3e46e79990170b9fd4c03e087c6876c22733fe5f

Comment by Githook User [ 25/Aug/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-19990 Don't assert on eviction of live updates from dead trees.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/470648d1fe51287f260d133282adf8b3f7c7004d

Comment by Michael Cahill (Inactive) [ 24/Aug/15 ]

tl;dr I think I understand this, and the assertion is too strong. I'll fix.

I believe a tree has been marked "dead" by WiredTiger, which happens during a drop operation so the file can be removed from cache in the background (to keep the application drop operation fast). Once in this state, it is okay to throw away all of the updates from the tree, and that usually happens quickly when WiredTiger's "sweep" thread notices a dead tree and discards it.

However, if there is enough cache pressure, eviction will also try to discard pages from dead trees before the sweep server gets to them, and in that case, this assertion should not fire. It is likely that recent changes to sweep have made this more likely, which is why we are seeing these assertions now.

Comment by Eric Milkie [ 21/Aug/15 ]

Another example of a failure, on Windows, but in the same test suite as the others:
https://evergreen.mongodb.com/task/mongodb_mongo_master_windows_64_2k8_debug_jsCore_small_oplog_rs_WT_7d7f4fb3b6f6a171eacf53384053df0fe728db42_15_08_20_21_16_10

This might be incidental, but the failure looks like it typically comes immediately after a foreground index build, which means WT's bulk inserter was exercised.

Comment by Andy Schwerin [ 21/Aug/15 ]

Looks like the signal handlers are configured with SA_RESETHAND, which indeed makes double aborts lose stack traces. I believe that is fixable, at least.

Comment by Eric Milkie [ 21/Aug/15 ]

Oh, it might be because two threads called abort() simultaneously. The second abort() might just kill the process? Signals are hard.

Comment by Eric Milkie [ 21/Aug/15 ]

I cannot explain why we aren't getting a stacktrace, as it seems that we are definitely calling abort(), which ought to fall into the signal handler for SIGABRT and produce a stack trace. Odd.

Comment by Eric Milkie [ 21/Aug/15 ]

Details:

https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1410_debug_asan_jsCore_small_oplog_rs_WT_7d7f4fb3b6f6a171eacf53384053df0fe728db42_15_08_20_21_16_10

https://logkeeper.mongodb.org/build/55d66c3d90413011a27bc90f/test/55d6730dbe07c47abf7bf953

An aside: I've noticed that for these types of failures, no stack trace is printed in the log. Is this unintentional?

Comment by Eric Milkie [ 21/Aug/15 ]

Unfortunately, this happened again with the latest build from master.

Comment by Michael Cahill (Inactive) [ 18/Aug/15 ]

P.S. For clarity, I know the assertion is not exactly the same, but the underlying bug was that WiredTiger's "oldest ID" (which determines the oldest update we have to keep available to readers) was moving too far forward. There are various assertions in WiredTiger that check for this, including the one here as well as the one that triggered WT-2021.

Comment by Michael Cahill (Inactive) [ 18/Aug/15 ]

I believe this is a dup of WT-2021, fixed in today's drop to MongoDB master.

Comment by Eric Milkie [ 17/Aug/15 ]

I take back what I said about the collection drop; the "cursora.js" test indeed drops that collection as a routine matter of testing. The test actually tests four different ways of inserting and removing data, and it drops the test collection in between each phase.

Comment by Eric Milkie [ 17/Aug/15 ]

Here is a link to the test run itself:
https://evergreen.mongodb.com/task/mongodb_mongo_master_windows_64_2k8_debug_jsCore_small_oplog_rs_WT_afd0f15913e95c5e530f25272e60254770350c89_15_08_17_20_15_23

Generated at Thu Feb 08 03:52:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.