[SERVER-15539] Invariant failure keyOffset >= 0 during 16 thread write command remove operation Created: 06/Oct/14 Updated: 30/Jan/17 Resolved: 07/Jan/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Storage |
| Affects Version/s: | 2.7.5 |
| Fix Version/s: | 2.8.0-rc5 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Quentin Conner | Assignee: | Geert Bosch |
| Resolution: | Done | Votes: | 0 |
| Labels: | 28qa | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Operating System: | Windows | ||||||||||||||||||||||||
| Steps To Reproduce: | Use mongod 78a778 from a green MCI build. > db.serverBuildInfo() Start mongod: Use a mongo shell with version 2.7.7 (hand compiled). Obtain util/utils.js from the mongo-perf project (mongodb/mongo-perf master branch).
Use the following block to reproduce. The "alternative" is less unwieldy. ALTERNATIVELY: from the mongo-perf top level directory: The following will run single db first, then multi db second:
The failure occurred in the second run of the test suite, with multi database (4 databases active). |
||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Description |
|
Hit an Invariant failure in mmap_v1/btree/btree_logic.cpp at line 2059. The operation that encountered the error was a remove, sent via the write command. 16 threads were executing the mongo-perf Remove.v2.IntNonIdIndex test. The issue may be intermittent and difficult to reproduce. It may depend on preconditions such as the loading of other data or execution of other tests in the mongo-perf suite. In other words, the issue may not reproduce unless the mongo-perf suite is run. I was not able to reproduce with a single invocation of benchRun() or runTests in util/utils.js. mongod is commit 78a778b23bed9018e56b8573b052ee2117beac46 This occurred while running mongo-perf against the Remove.v2.IntNonIdIndex test. The test ran a few trials before the server died: mongod log file shows the following:
|
| Comments |
| Comment by Githook User [ 22/Jan/15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 07/Jan/15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 09/Dec/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}Message: This fixes a regression introduced in commit 29a700ad. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 08/Dec/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Confirmed with a second test. 230490 causes the BTree invariant. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 08/Dec/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
The mongo-perf git bisect is (again) complete with 2304904687c5d29e228d86da95244682dc62caa1 identified as reintroducing the BTree invariant. I'm running a second test on 489690cfbccc24cf69b6ae7848581303ec8f4b0e to check for a false negative.
| ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 03/Dec/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry, 4c221b5... tested bad after testing good. Bisect is going to carry on a little longer. Currently retesting the Previous GOOD. Here is my bisect log so far:
And some notes on conditions to reproduce: --> mod mongo-perf to turn off FETCHMCI mode, compile from source | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 02/Dec/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Bisect is complete. Issue reintroduced with 5c9b4a6... commit 5c9b4a636753e98de0ba7d9d518bfb6516f843d7 While state is saved, threads pass in their own OperationContext to I am re-running 4c221b5ce50c3eaabc0348432b6df6c41aeabee5 to double-check and ensure this previous commit is not a false negative. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 24/Nov/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
On Windows, mongod was started like this (note the generous syncdelay): The Invariant occurs during MULTI DB testing with mongo-perf. This is where WriteCommand Remove ops are dispatched to multiple dbs concurrently. We can reproduce consistently (mongo-perf cannot complete due to this) with 609a149 by running the Remove.v3.IntNonIdIndex test in mongo-perf. In the run I captured, mongod failed at 28 concurrent worker threads.
Lastly, the mongo shell invocation by the mongo-perf test harness looks like this:
| ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 23/Nov/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
2.8.0-rc2 MMAPv1 storage engine invariant failure during mongo-perf. Two examples. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 23/Nov/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
This invariant has shown up again as of 2.8.0-rc1 (609a149). I have a couple of failed mongo-perf runs from 22 November 2014, with log files attached. This happens on our 64-core Windows 2012 R2 test slave. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 20/Nov/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
still good at c0a623c | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 10/Nov/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
I noticed at git hash f91d3efd69a154d46a73caf2e6b5a5f632b56061 (in master) that we aren't seeing the assert any longer. I have not bisected to see which checkin fixed it, but have several complete (good) mongo-perf runs with MMAPv1 (same two-socket hardware). | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 20/Oct/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
This keyOffset >=0 assert has now been observed on an EC2 box running Windows 2008. The box is believed to be a single socket, unlike the other environment (Linux). | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 10/Oct/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Found in Version should be 2.7.5. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 10/Oct/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
git bisect is complete. It appears the "Use correct algorithm in BtreeLogic::Builder" commit is to blame. Here is the git bisect log:
And the commit comment: commit 9b5f950603376b36eb2659ae5b8fdae2ee8714cf Both algorithms cover phases 2 and 3 of forground index building. Importantly OLD ALGORITHM: Phase 3:
NEW ALGORITHM: Phase 3: Non-exhaustive list of benefits to the new algorthm:
:040000 040000 819555cfa90cd1a28c8281b4eaa102212a262eaa 4a316f83e664c80c8c33693ae76d15ec21bf7edb M src | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Geert Bosch [ 08/Oct/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Quentin, Let's look at this tomorrow. It would be really helpful if we can narrow down the commit around which the error was introduced. Is the r2.7.6 server known-good? -Geert | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Quentin Conner [ 06/Oct/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
benchRun invocation looks like this:
|