[SERVER-15580] Evaluating candidate query plans with concurrent writes on same collection may crash mongod Created: 09/Oct/14  Updated: 26/Sep/17  Resolved: 31/Oct/14

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.6.3
Fix Version/s: 2.6.6, 2.8.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Kalle Varisvirta Assignee: J Rassi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-15053 MongoDB throws segmentation fault whi... Closed
is duplicated by SERVER-15936 Segment fault crash on primary Closed
is duplicated by SERVER-15669 Mongod crashes from time to time Closed
Related
is related to SERVER-16549 Segmentation fault in mongodb Closed
is related to SERVER-15913 QueryStageKeep::KeepStageFlagAddition... Closed
Operating System: ALL

 Description   

Our MongoDB deployment (a three-server replica set) was running normally until the primary suddenly segfaulted. This is what's in the log (just the tail of it):

2014-10-09T14:38:18.441+0300 [conn11661296] killcursors  keyUpdates:0 numYields:0 locks(micros) r:212 127ms
2014-10-09T14:38:18.450+0300 [conn11667693] query crossbow_cos.node query: { $query: { _bundle: "xb4_m4_data_video", _type: "node", cos_deleted: 0, status: 1, _id: { $in: [ 1313690, 1250602, 1314508, 1300482, 933372, 1304961, 1311114, 533803, 1315776 ] }, cos_field_publishing_channels: "nelonenmedia", cos_private_tags: null }, orderby: { cos_created: -1 } } planSummary: IXSCAN { _id: 1 }, IXSCAN { _id: 1 } ntoreturn:10 ntoskip:0 nscanned:34 nscannedObjects:18 keyUpdates:0 numYields:2 locks(micros) r:18881 nreturned:7 reslen:52504 1025ms
2014-10-09T14:38:18.513+0300 [conn11661120] killcursors: found 0 of 1
2014-10-09T14:38:18.593+0300 [conn11661509] SEVERE: Invalid access at address: 0
2014-10-09T14:38:18.694+0300 [conn11667971] killcursors: found 0 of 1
2014-10-09T14:38:18.990+0300 [conn11661509] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0x11da801 0x11d9bde 0x11d9ccf 0x3b8000f710 0xaa7241 0xd4cccc 0xd4d69d 0xd4f81e 0xc12931 0xa37c44 0xa2295a 0xa241b7 0xa26269 0xd549f7 0xb9c972 0xb9ef52 0x76f2bf 0x1190afb 0x3b800079d1 0x3b7fce8b6d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11da801]
 /usr/bin/mongod() [0x11d9bde]
 /usr/bin/mongod() [0x11d9ccf]
 /lib64/libpthread.so.0() [0x3b8000f710]
 /usr/bin/mongod(_ZN5mongo18KeepMutationsStage4workEPm+0xc1) [0xaa7241]
 /usr/bin/mongod(_ZN5mongo15MultiPlanRunner12workAllPlansEPNS_7BSONObjEm+0x13c) [0xd4cccc]
 /usr/bin/mongod(_ZN5mongo15MultiPlanRunner12pickBestPlanEPmPNS_7BSONObjE+0xed) [0xd4d69d]
 /usr/bin/mongod(_ZN5mongo15MultiPlanRunner7getNextEPNS_7BSONObjEPNS_7DiskLocE+0x34e) [0xd4f81e]
 /usr/bin/mongod(_ZN5mongo8runCountERKSsRKNS_7BSONObjERSsRi+0xe11) [0xc12931]
 /usr/bin/mongod(_ZN5mongo8CmdCount3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x114) [0xa37c44]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa2295a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xe87) [0xa241b7]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c9) [0xa26269]
 /usr/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x2307) [0xd549f7]
 /usr/bin/mongod() [0xb9c972]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x442) [0xb9ef52]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9f) [0x76f2bf]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4fb) [0x1190afb]
 /lib64/libpthread.so.0() [0x3b800079d1]
 /lib64/libc.so.6(clone+0x6d) [0x3b7fce8b6d]

If we caused this somehow, any advice on how to avoid it would be hugely useful.



 Comments   
Comment by Ramon Fernandez Marina [ 05/Nov/14 ]

mike@meshfire.com, a 2.6.6 release is currently scheduled for December. Please tune in to mongodb-announce to receive the release announcement when 2.6.6 is published.

Comment by Michael Templeman [ 04/Nov/14 ]

Any timeline on the release of 2.6.6? We are hinting all of our potentially long-running queries; our logs indicated that a few long-running queries may be causing this problem.

Comment by Githook User [ 31/Oct/14 ]

Author: Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-15580 KeepMutationsStage shouldn't deref invalidated iterator
Branch: master
https://github.com/mongodb/mongo/commit/e1ef3e6371f59c6a6396e78b2a4c4d9a975e9396

Comment by Githook User [ 31/Oct/14 ]

Author: Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-15580 KeepMutationsStage shouldn't deref invalidated iterator

(cherry picked from commit e1ef3e6371f59c6a6396e78b2a4c4d9a975e9396)
Branch: v2.6
https://github.com/mongodb/mongo/commit/3898bb1e160e1118a84114620bec62e63254ed77

Comment by Kalle Varisvirta [ 10/Oct/14 ]

Thank you! We'll try hinting the long queries next.

Comment by J Rassi [ 10/Oct/14 ]

Kalle: I forgot to mention that the second stack trace you posted indeed points to the same root cause.

Comment by J Rassi [ 10/Oct/14 ]

The root cause of this bug is that documents can be added to the working set's "flagged documents" list while the KEEP_MUTATIONS stage is in the middle of iterating through that list. This is a regression introduced in version 2.6.0 of the server.

The conditions to reproduce this are:

  1. Client 1 issues a query that uses the MultiPlanRunner, which picks and starts evaluating candidate plans. One of the candidate plans is a KEEP_MUTATIONS <= FETCH <= IXSCAN plan. This candidate plan produces a few results and then yields the read lock.
  2. Client 2 issues a multi-update, which invalidates some of the documents that the candidate plan from step 1 started to produce. The multi-update then yields the write lock.
  3. Client 1 reacquires the read lock and finishes the index scan. It then proceeds to produce results from the set of documents that were flagged for mutations, and then yields the read lock.
  4. Client 2 reacquires the write lock and then invalidates additional documents that the candidate plan originally produced, and yields the write lock.
  5. Client 1 reacquires the read lock and attempts to return more results from the set of documents that were flagged for mutations.

At step 3 above, KeepMutationsStage::work() saves a const_iterator referring into _workingSet's _flagged set, which is of type std::unordered_set<size_t>.

At step 4, MultiPlanRunner::invalidate() calls ws->flagForReview(), which invokes insert() on the unordered_set. If the insert triggers a rehash of the set, then KeepMutationsStage's saved iterator is invalidated.

At step 5, KeepMutationsStage::work() dereferences the invalidated iterator, which is undefined behavior.

I can reproduce this with the following shell snippet:

// Create a collection with two single-field indexes so the query below has
// multiple candidate plans and is evaluated by the MultiPlanRunner.
db.dropDatabase();
for (var i = 0; i < 100; i++) {
    db.foo.insert({a: 1, b: 1, c: 1});
}
db.foo.ensureIndex({a: 1});
db.foo.ensureIndex({b: 1});
// Client 1: a deliberately slow query (the $where sleeps) that yields while its
// candidate plans are still being evaluated.
startParallelShell("while (true) { db.foo.find({a: 1, b: 1, $where: function(){ sleep(10); return true; }}).itcount() }");
// Client 2: a concurrent multi-update that invalidates documents the candidate
// plans have already produced, flagging them for review in the working set.
while (true) { db.foo.update({$where: function(){ sleep(1); return true; } },
                             {$inc: {c: 1}}, {multi: true}); }

When running this against the 2.6.5 Linux-64 mongod debug binary, the server outputs the following:

2014-10-10T10:56:32.770-0400 [conn1] command test.$cmd command: update { update: "foo", updates: [ { q: { $where: function (){ sleep(1); return true; } }, u: { $inc: { c: 1.0 } }, multi: true, upsert: false } ], ordered: true } keyUpdates:0 numYields:0  reslen:55 459ms
2014-10-10T10:56:32.815-0400 [conn2] SEVERE: Invalid access at address: 0
2014-10-10T10:56:32.827-0400 [conn2] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0x124d3d1 0x124c7ae 0x124c89f 0x7f73d1d9a340 0xadeda1 0xd956bc 0xd9621d 0xd991bd 0xd9c42b 0xbde810 0xbdfe90 0x78bfde 0x12029e4 0x7f73d1d92182 0x7f73d109730d
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x124d3d1]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod() [0x124c7ae]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod() [0x124c89f]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f73d1d9a340]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo18KeepMutationsStage4workEPm+0xc1) [0xadeda1]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo15MultiPlanRunner12workAllPlansEPNS_7BSONObjEm+0x13c) [0xd956bc]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo15MultiPlanRunner12pickBestPlanEPmPNS_7BSONObjE+0xed) [0xd9621d]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo15MultiPlanRunner7getNextEPNS_7BSONObjEPNS_7DiskLocE+0x30d) [0xd991bd]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x92b) [0xd9c42b]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod() [0xbde810]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x530) [0xbdfe90]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9e) [0x78bfde]
 ./mongodb-linux-x86_64-2.6.5/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x514) [0x12029e4]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7f73d1d92182]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f73d109730d]

In my assessment, the only workaround available for this issue is to add hints to any long-running queries, which will disable use of the MultiPlanRunner. Kalle, I would advise that you add a hint to your long-running count queries, since the stack traces you posted indicated that both crashes happened during execution of a count. If this does not fix the issue for you, then you will need to downgrade to version 2.4.11 until version 2.6.6 is released.
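For illustration, a minimal sketch of that workaround against the repro collection above, assuming the {a: 1} index created there (substitute your own namespace, predicate, and index). Forcing an index with a hint leaves the planner with a single plan, so the MultiPlanRunner code path is never entered:

// Hint an ordinary query to the {a: 1} index so only that plan is considered.
db.foo.find({a: 1, b: 1}).hint({a: 1}).itcount();
// Counts can be hinted as well, by passing the hint to the count command.
db.runCommand({count: "foo", query: {a: 1, b: 1}, hint: {a: 1}});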

Setting this ticket's fix version to 2.6.6, 2.7.8.

Comment by Kalle Varisvirta [ 10/Oct/14 ]

The same machine crashed again today:

2014-10-10T10:24:28.786+0300 [conn3391533] SEVERE: Invalid access at address: 0x3a
2014-10-10T10:24:29.294+0300 [conn3391533] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0x11da801 0x11d9bde 0x11d9ccf 0x3b8000f710 0xae2822 0xae2849 0xa92901 0xbe3b62 0xbf65f0 0xbf5ec2 0xaa7295 0xd4cccc 0xd4d69d 0xd4f81e 0xc12931 0xa37c44 0xa2295a 0xa241b7 0xa26269 0xd549f7 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11da801]
 /usr/bin/mongod() [0x11d9bde]
 /usr/bin/mongod() [0x11d9ccf]
 /lib64/libpthread.so.0() [0x3b8000f710]
 /usr/bin/mongod(_ZNK5mongo16WorkingSetMember11hasOwnedObjEv+0x2) [0xae2822]
 /usr/bin/mongod(_ZNK5mongo16WorkingSetMember6hasObjEv+0x9) [0xae2849]
 /usr/bin/mongod(_ZNK5mongo27WorkingSetMatchableDocument16allocateIteratorEPKNS_11ElementPathE+0x21) [0xa92901]
 /usr/bin/mongod(_ZNK5mongo19LeafMatchExpression7matchesEPKNS_17MatchableDocumentEPNS_12MatchDetailsE+0x22) [0xbe3b62]
 /usr/bin/mongod(_ZNK5mongo18NotMatchExpression7matchesEPKNS_17MatchableDocumentEPNS_12MatchDetailsE+0x10) [0xbf65f0]
 /usr/bin/mongod(_ZNK5mongo18AndMatchExpression7matchesEPKNS_17MatchableDocumentEPNS_12MatchDetailsE+0x32) [0xbf5ec2]
 /usr/bin/mongod(_ZN5mongo18KeepMutationsStage4workEPm+0x115) [0xaa7295]
 /usr/bin/mongod(_ZN5mongo15MultiPlanRunner12workAllPlansEPNS_7BSONObjEm+0x13c) [0xd4cccc]
 /usr/bin/mongod(_ZN5mongo15MultiPlanRunner12pickBestPlanEPmPNS_7BSONObjE+0xed) [0xd4d69d]
 /usr/bin/mongod(_ZN5mongo15MultiPlanRunner7getNextEPNS_7BSONObjEPNS_7DiskLocE+0x34e) [0xd4f81e]
 /usr/bin/mongod(_ZN5mongo8runCountERKSsRKNS_7BSONObjERSsRi+0xe11) [0xc12931]
 /usr/bin/mongod(_ZN5mongo8CmdCount3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x114) [0xa37c44]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa2295a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xe87) [0xa241b7]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c9) [0xa26269]
 /usr/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x2307) [0xd549f7]

I'm not sure whether this crash has the same root cause.

Comment by Thomas Rueckstiess [ 09/Oct/14 ]

Hi Kalle,

Thanks for reporting this problem. We've found a potential concurrency bug that may in rare cases cause the issue you're seeing. We will investigate further and update the ticket once we know more.

Regards,
Thomas
