Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-17499

Using eval command to run getMore on aggregation cursor trips fatal assertion

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Critical - P2 Critical - P2
    • 3.0.1, 3.1.0
    • Affects Version/s: 3.0.0
    • Component/s: Replication
    • Labels:
    • Environment:
      Ubuntu 14.04.2
      Linux kernel 3.13.0-46-generic and 3.10.23-xxxx-std-ipv6-64
      Hosting : OVH (France)
    • Fully Compatible
    • Linux

      Hi,

      We had a replica set of three dedicated servers on Ubuntu 14.04. Everything was working fine on 2.6.8. We have stopped one of them to reinstall Ubuntu (clean slate) and to install MongoDB 3.0 with WiredTiger. The initialization has gone OK, the replication was working great (the primary was still on 2.6.8). Using rs.slaveOk(), we could see that the data was up-to-date.

      We then have tried to push him as primary by calling a rs.freeze(600) on the other secondary and a stepDown() on the primary. After less than 10 seconds, mongod crashes. Starting mongod again, the server works again as secondary.

      As it is written that each server should be on 2.6.8 or preferably 3.0, we've tried to update the second secondary server, but this time without reinstalling everything (so we've only updated the Ubuntu sources list and launch sudo apt-get update / sudo apt-get upgrade). The server updated, and restarted as secondary without any issue. It was possible to call queries from the shell without problem.

      Then we've called a rs.freeze(600) on the secondary on 3.0 WiredTiger and a rs.stepDown() on the primary still on 2.6.8. Very quickly the server has crashed and the previous primary turns back as primary. The log was identical as the crash log on the first server.

      After restarting the second 3.0 server as secondary, we've hoped that if all the servers on the replica set were on 3.0 it could work… So we've called a rs.freeze(600) on the first 3.0 secondary server (the one using WiredTiger) and we've shut down the primary by calling sudo service mongod stop on Ubuntu shell. The server on 3.0 using MMAPv1 turns primary, and crashes after few seconds. We then have started the server on 2.6.8 and after a very short rollback it has gone back primary.

      Here is an example of what we could see in the log on the 3.0 servers after the crash :

      2015-03-06T19:48:09.990+0100 I -        [conn50916] Invariant failure txn->recoveryUnit() == cc->getUnownedRecoveryUnit() src/mongo/db/query/find$
      2015-03-06T19:48:09.997+0100 I CONTROL  [conn50916]
       0xf3f529 0xedfd81 0xec50c2 0xb96a2c 0xaadda7 0xab0eab 0x9e339c 0x874dbd 0x8753f7 0xeb5567 0xea4128 0x106934c 0x32facf106362
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"B3F529"},{"b":"400000","o":"ADFD81"},{"b":"400000","o":"AC50C2"},{"b":"400000","o":"796A2C"},{"b":"400000","o":"6ADDA7"},{"b":"400000","o":"6B0EAB"},{"b":"400000","o":"5E339C"},{"b":"400000","o":"474DBD"},{"b":"400000","o":"4753F7"},{"b":"400000","o":"AB5567"},{"b":"400000","o":"AA4128"},{"b":"400000","o":"C6934C"},{"b":"0","o":"32FACF106362"}],"processInfo":{ "mongodbVersion" : "3.0.0", "gitVersion" : "a841fd6394365954886924a35076691b4d149168", "uname" : { "sysname" : "Linux","release" : "3.10.23-xxxx-std-ipv6-64", "version" : "#1 SMP Tue Mar 18 14:48:24 CET 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "5F365548FE8312E027817D98FEECA3A18F0F60FB" }, { "b" : "7FFF15CFF000", "elfType" : 3,"buildId" : "0D096D82CBE15973DC310445C75DA4831706A73B" }, { "b" : "7FBFD29B9000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0","elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FBFD275B000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF43D0947510134A8A494063A3C1CF3CEBB27791" }, { "b" : "7FBFD2381000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "379F80D2768BA6A21F52781895EE9F47B34A0A85" }, { "b" : "7FBFD2179000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FBFD1F75000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FBFD1C71000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7FBFD196B000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FBFD1755000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1","elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FBFD1390000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7FBFD2BD7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf3f529]
       mongod(_ZN5mongo10logContextEPKc+0xE1) [0xedfd81]
       mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xB2) [0xec50c2]
       mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixRNS_5CurOpEiRbPbb+0x58C) [0xb96a2c]
       mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpEb+0x367) [0xaadda7]
       mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0x13CB) [0xab0eab]
       mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x9C) [0x9e339c]
       mongod(_ZN5mongo14DBClientCursor11requestMoreEv+0x20D) [0x874dbd]
       mongod(_ZN5mongo14DBClientCursor4moreEv+0x47) [0x8753f7]
       mongod(_ZN5mongo21internalCursorHasNextEPNS_7V8ScopeERKN2v89ArgumentsE+0x17) [0xeb5567]
       mongod(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0x78) [0xea4128]
       mongod(+0xC6934C) [0x106934c]
       ??? [0x32facf106362]
      -----  END BACKTRACE  -----
      2015-03-06T19:48:09.997+0100 I -        [conn50916]
      
      ***aborting after invariant() failure
      

      We can see in the log that some queries have managed to be called just before the crash, for example :

      2015-03-06T19:48:09.104+0100 I QUERY    [conn48972] query asknread.keywords query: { first_word: "mejor", lang: "es", categories_ids: ObjectId('511e0150c19219cb25000000') } planSummary: IXSCAN { first_word: 1.0, lang: 1.0, status: 1.0 } ntoskip:0 nscanned:15675 nscannedObjects:15675 keyUpdates:0 writeConflicts:0 numYields:1366 nreturned:0 reslen:20 locks:{} 470ms
      

      So now we have a primary on 2.6.8 and two secondaries on 3.0, which can't be primary.

            Assignee:
            rassi J Rassi
            Reporter:
            Sicabol Patrig Droumaguet [X]
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: