[SERVER-17499] Using eval command to run getMore on aggregation cursor trips fatal assertion Created: 06/Mar/15  Updated: 18/Sep/15  Resolved: 09/Mar/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.0
Fix Version/s: 3.0.1, 3.1.0

Type: Bug Priority: Critical - P2
Reporter: Patrig Droumaguet [X] Assignee: J Rassi
Resolution: Done Votes: 0
Labels: 3.0, ET, upgrading
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.04.2
Linux kernel 3.13.0-46-generic and 3.10.23-xxxx-std-ipv6-64
Hosting : OVH (France)


Issue Links:
Duplicate
Backwards Compatibility: Fully Compatible
Operating System: Linux
Backport Completed:
Participants:

 Description   

Hi,

We had a replica set of three dedicated servers on Ubuntu 14.04. Everything was working fine on 2.6.8. We have stopped one of them to reinstall Ubuntu (clean slate) and to install MongoDB 3.0 with WiredTiger. The initialization has gone OK, the replication was working great (the primary was still on 2.6.8). Using rs.slaveOk(), we could see that the data was up-to-date.

We then have tried to push him as primary by calling a rs.freeze(600) on the other secondary and a stepDown() on the primary. After less than 10 seconds, mongod crashes. Starting mongod again, the server works again as secondary.

As it is written that each server should be on 2.6.8 or preferably 3.0, we've tried to update the second secondary server, but this time without reinstalling everything (so we've only updated the Ubuntu sources list and launch sudo apt-get update / sudo apt-get upgrade). The server updated, and restarted as secondary without any issue. It was possible to call queries from the shell without problem.

Then we've called a rs.freeze(600) on the secondary on 3.0 WiredTiger and a rs.stepDown() on the primary still on 2.6.8. Very quickly the server has crashed and the previous primary turns back as primary. The log was identical as the crash log on the first server.

After restarting the second 3.0 server as secondary, we've hoped that if all the servers on the replica set were on 3.0 it could work… So we've called a rs.freeze(600) on the first 3.0 secondary server (the one using WiredTiger) and we've shut down the primary by calling sudo service mongod stop on Ubuntu shell. The server on 3.0 using MMAPv1 turns primary, and crashes after few seconds. We then have started the server on 2.6.8 and after a very short rollback it has gone back primary.

Here is an example of what we could see in the log on the 3.0 servers after the crash :

2015-03-06T19:48:09.990+0100 I -        [conn50916] Invariant failure txn->recoveryUnit() == cc->getUnownedRecoveryUnit() src/mongo/db/query/find$
2015-03-06T19:48:09.997+0100 I CONTROL  [conn50916]
 0xf3f529 0xedfd81 0xec50c2 0xb96a2c 0xaadda7 0xab0eab 0x9e339c 0x874dbd 0x8753f7 0xeb5567 0xea4128 0x106934c 0x32facf106362
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B3F529"},{"b":"400000","o":"ADFD81"},{"b":"400000","o":"AC50C2"},{"b":"400000","o":"796A2C"},{"b":"400000","o":"6ADDA7"},{"b":"400000","o":"6B0EAB"},{"b":"400000","o":"5E339C"},{"b":"400000","o":"474DBD"},{"b":"400000","o":"4753F7"},{"b":"400000","o":"AB5567"},{"b":"400000","o":"AA4128"},{"b":"400000","o":"C6934C"},{"b":"0","o":"32FACF106362"}],"processInfo":{ "mongodbVersion" : "3.0.0", "gitVersion" : "a841fd6394365954886924a35076691b4d149168", "uname" : { "sysname" : "Linux","release" : "3.10.23-xxxx-std-ipv6-64", "version" : "#1 SMP Tue Mar 18 14:48:24 CET 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "5F365548FE8312E027817D98FEECA3A18F0F60FB" }, { "b" : "7FFF15CFF000", "elfType" : 3,"buildId" : "0D096D82CBE15973DC310445C75DA4831706A73B" }, { "b" : "7FBFD29B9000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0","elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FBFD275B000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF43D0947510134A8A494063A3C1CF3CEBB27791" }, { "b" : "7FBFD2381000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "379F80D2768BA6A21F52781895EE9F47B34A0A85" }, { "b" : "7FBFD2179000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FBFD1F75000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FBFD1C71000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7FBFD196B000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FBFD1755000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1","elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FBFD1390000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7FBFD2BD7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf3f529]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xedfd81]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xB2) [0xec50c2]
 mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixRNS_5CurOpEiRbPbb+0x58C) [0xb96a2c]
 mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpEb+0x367) [0xaadda7]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0x13CB) [0xab0eab]
 mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x9C) [0x9e339c]
 mongod(_ZN5mongo14DBClientCursor11requestMoreEv+0x20D) [0x874dbd]
 mongod(_ZN5mongo14DBClientCursor4moreEv+0x47) [0x8753f7]
 mongod(_ZN5mongo21internalCursorHasNextEPNS_7V8ScopeERKN2v89ArgumentsE+0x17) [0xeb5567]
 mongod(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0x78) [0xea4128]
 mongod(+0xC6934C) [0x106934c]
 ??? [0x32facf106362]
-----  END BACKTRACE  -----
2015-03-06T19:48:09.997+0100 I -        [conn50916]
 
***aborting after invariant() failure

We can see in the log that some queries have managed to be called just before the crash, for example :

2015-03-06T19:48:09.104+0100 I QUERY    [conn48972] query asknread.keywords query: { first_word: "mejor", lang: "es", categories_ids: ObjectId('511e0150c19219cb25000000') } planSummary: IXSCAN { first_word: 1.0, lang: 1.0, status: 1.0 } ntoskip:0 nscanned:15675 nscannedObjects:15675 keyUpdates:0 writeConflicts:0 numYields:1366 nreturned:0 reslen:20 locks:{} 470ms

So now we have a primary on 2.6.8 and two secondaries on 3.0, which can't be primary.



 Comments   
Comment by Githook User [ 09/Mar/15 ]

Author:

{u'username': u'jrassi', u'name': u'Jason Rassi', u'email': u'rassi@10gen.com'}

Message: SERVER-17499 Don't set owned RU if creating DBDirectClient agg cursor
Branch: v3.0
https://github.com/mongodb/mongo/commit/d6b7c2351e47992bf350131c2b60e15644ce4b69

Comment by Githook User [ 09/Mar/15 ]

Author:

{u'username': u'jrassi', u'name': u'Jason Rassi', u'email': u'rassi@10gen.com'}

Message: SERVER-17499 Unify GodScope and fromDBDirectClient bool
Branch: v3.0
https://github.com/mongodb/mongo/commit/568052f3aa85a803fbced9b848688892fdbe0290

Comment by Githook User [ 09/Mar/15 ]

Author:

{u'username': u'jrassi', u'name': u'Jason Rassi', u'email': u'rassi@10gen.com'}

Message: SERVER-17499 Don't set owned RU if creating DBDirectClient agg cursor
Branch: master
https://github.com/mongodb/mongo/commit/78ac7a453c78d429e0ef294ae350a61c3bb83b26

Comment by Githook User [ 09/Mar/15 ]

Author:

{u'username': u'jrassi', u'name': u'Jason Rassi', u'email': u'rassi@10gen.com'}

Message: SERVER-17499 Unify GodScope and fromDBDirectClient bool
Branch: master
https://github.com/mongodb/mongo/commit/0c388e562f6b1549b583486b151164df17b37d94

Comment by J Rassi [ 06/Mar/15 ]

Thanks for reporting this issue. We can confirm that the root cause is that the server hits a fatal assertion when the eval command is used to run a getMore on an aggregation cursor.

As a workaround, you should indeed change your application code to use the MongoCollection::aggregate() helper as follows:

// Before:
$results = $db->execute('return db.collection.aggregate(...)');
 
// After:
$results = $collection->aggregate(...);

Please continue to watch this ticket for updates on when a fix will be available.

~ Jason Rassi

Comment by Patrig Droumaguet [X] [ 06/Mar/15 ]

Hi David,

I have to figure out how to extract only the part of the log that is interesting, but apparently the crash is connected to some $eval commands. A bit earlier than the crash, I can read this in the log :

2015-03-06T23:23:06.078+0100 D COMMAND  [conn386442] warning: log line attempted (59k) over max size (10k), printing beginning and end ...  run command asknread.$cmd { $eval: CodeWScope( return db.articles.aggregate([
[my agregation command]

I have to say that I had used MongoDB::execute() to call aggregations because I was using a framework which previously couldn't support the PHP drivers after 1.3.5 (when Mongo has changed to MongoClient), and the aggregate() collection method didn't existed then (if I'm not wrong). But recently (and to prepare MongoDB 3) I had upgraded my code and the framework to be able to use the 1.6+ driver. And still on MongoDB 2.6.8 the execute() method for aggregations was working fine.

I may need to change all my aggregations calls. Thanks for your help !

Comment by David Storch [ 06/Mar/15 ]

Thanks Sicabol. It looks like the crash is happening inside the eval command. Would you be able to try reproducing the problem while running with logLevel 2? You can increase the logLevel using the shell's logLevel parameter, e.g.:

var old = db.adminCommand({setParameter: 1, logLevel: 2});
// Reproduce the crash
db.adminCommand({setParameter: 1, logLevel:  old.was}); // Restore the old log level.

The verbose logs prior to and including the time of the crash would be very useful.

Best,
Dave

Comment by Patrig Droumaguet [X] [ 06/Mar/15 ]

OK, now that's very interesting !

I've tried in PHP to use an MongoDB::execute() command on my database (a find().explain()) but by forcing the read preference on SECONDARY, and I've tried to do the same by detailing each step. The first try shows that execute() doesn't use a secondary server but the primary, the second try was using a secondary server. Maybe it's related, as I call several MongoDB::execute()…

        $m = new MongoClient("mongodb://one.secondary.server", array(
            'connect' => TRUE,
            'readPreference' => MongoClient::RP_SECONDARY,
            'replicaSet' => 'rs',
        ));
        $db = $m->mydatabase;
 
// This method shows that the query has been executed on the primary server
        $results = $db->execute("return db.keywords.find({lang:'en', name:'my keyword'}).explain();");
        print_r($results);
        print_r($db->getReadPreference());
 
// This method shows that the query has been correctly executed on a secondary server
        $col = $db->keywords;
        $cursor = $col->find(array('name' => 'my keyword'), array('name'));
        var_dump($cursor->explain());
        print_r($db->getReadPreference());

Comment by Patrig Droumaguet [X] [ 06/Mar/15 ]

Hi,

The application using the database is developed with PHP. We use the 1.6.4 driver.

There are some agregations called with MongoDB::execute() method, which (AFAIK) is equivalent to MongoDB::command() with an eval. We use neither MapReduce nor $where.

Here is an example of a command. $profiles_ids_str is an array of ObjectIds (in string, as in "ObjectId('xxxx')").

        $command = 'db.archives.distinct("article_id", {
                status: {
                    $exists: true
                },
                profiles_ids: {
                    $in: ['.implode(', ',$profiles_ids_str).']
                }
            }).length;';
        $results = $db->execute('return '.$command);

Comment by David Storch [ 06/Mar/15 ]

Hi Sicabol,

Thanks for reporting this issue. We are trying to reproduce on our end, but in the meantime I would like to gather some additional information. Are you running anything that would create a JavaScript context on the server? This includes:

  1. eval,
  2. mapReduce,
  3. $where

Thanks,
Dave

Generated at Thu Feb 08 03:44:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.