Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-18006

Shard replicas crash after 3.0.2 upgrade with "Didn't find RecordId in WiredTigerRecordStore"

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: WiredTiger
    • Labels:
      None
    • Operating System:
      ALL

      Description

      Upgraded a 4-shard replicated 3.0.1 cluster to 3.0.2. 1 hour later, two of the shard replicas crashed with the following stack trace:

      2015-04-10T20:42:23.166+0000 I -        [repl writer worker 14] Assertion: 28556:Didn't find RecordId in WiredTigerRecordStore
      2015-04-10T20:42:23.174+0000 I CONTROL  [repl writer worker 14] 
       0xf6b4a9 0xf0bca1 0xef0eff 0xd835bc 0x92e4d0 0xa2ac15 0xa1caa0 0xbed174 0xbed524 0xbedb5d 0xb1a536 0xc577e1 0xcaea6e 0xcb1255 0xf02abb 0xfb8a04 0x7f68e4785df3 0x7f68e32351ad
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"B6B4A9"},{"b":"400000","o":"B0BCA1"},{"b":"400000","o":"AF0EFF"},{"b":"400000","o":"9835BC"},{"b":"400000","o":"52E4D0"},{"b":"400000","o":"62AC15"},{"b":"400000","o":"61CAA0"},{"b":"400000","o":"7ED174"},{"b":"400000","o":"7ED524"},{"b":"400000","o":"7EDB5D"},{"b":"400000","o":"71A536"},{"b":"400000","o":"8577E1"},{"b":"400000","o":"8AEA6E"},{"b":"400000","o":"8B1255"},{"b":"400000","o":"B02ABB"},{"b":"400000","o":"BB8A04"},{"b":"7F68E477E000","o":"7DF3"},{"b":"7F68E313F000","o":"F61AD"}],"processInfo":{ "mongodbVersion" : "3.0.2", "gitVersion" : "6201872043ecbbc0a4cc169b5482dcf385fc464f", "uname" : { "sysname" : "Linux", "release" : "3.4.73-64.112.amzn1.x86_64", "version" : "#1 SMP Tue Dec 10 01:50:05 UTC 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "810BAA4B0BDE1EBB076BF6A544BF2724A13FAE84" }, { "b" : "7FFFF29DA000", "elfType" : 3, "buildId" : "6A7E58423C94E707E8B4A45F098E78D557DD5A7D" }, { "b" : "7F68E477E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7F68E4511000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "A5C6FCE56D21542B9535D5FBCFE74EBA71D24613" }, { "b" : "7F68E412D000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "75565934551BECB21A027BE0A3AE28A62AEEB7EE" }, { "b" : "7F68E3F25000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7F68E3D21000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7F68E3A18000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "66F1CF311C61879639BD3DC0034DEE0D6D042261" }, { "b" : "7F68E3716000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7F68E3500000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "C52958E393BDF8E8D090F36DE0F4E620D8736FBF" }, { "b" : "7F68E313F000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "77E48C0DBAE0843560558F1E51A5FC61602ABC5F" }, { "b" : "7F68E499A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7F68E2EFC000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9DF61878D8918F25CC74AD01F417FDB051DFE3DA" }, { "b" : "7F68E2C17000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6F1DB0F811D1B210520443442D4437BC43BF9A80" }, { "b" : "7F68E2A14000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "1A6E97644CC9149C2E1871C6AE1DB51975E78A41" }, { "b" : "7F68E27E9000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "F7DF34078FD7BFD684FE46D5F677EEDA1D9B9DC9" }, { "b" : "7F68E25D3000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "87B4EBF2183C8EA4AB657212203EFFE6340E2F4F" }, { "b" : "7F68E23C8000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "381960ACAB9C39461D58BDE7B272C4F61BB3582F" }, { "b" : "7F68E21C5000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "BF48CD5658DE95CE058C4B828E81C97E2AE19643" }, { "b" : "7F68E1FAB000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "17295EFA7A45ADE184C2A7A3294460BDC8B43277" }, { "b" : "7F68E1D8A000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "803D7EF21A989677D056E52BAEB9AB5B154FB9D9" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6b4a9]
       mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0bca1]
       mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xef0eff]
       mongod(_ZNK5mongo21WiredTigerRecordStore7dataForEPNS_16OperationContextERKNS_8RecordIdE+0xDC) [0xd835bc]
       mongod(_ZNK5mongo10Collection6docForEPNS_16OperationContextERKNS_8RecordIdE+0x20) [0x92e4d0]
       mongod(_ZN5mongo11IDHackStage4workEPm+0x185) [0xa2ac15]
       mongod(_ZN5mongo11DeleteStage4workEPm+0x70) [0xa1caa0]
       mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbed174]
       mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbed524]
       mongod(_ZN5mongo12PlanExecutor11executePlanEv+0x3D) [0xbedb5d]
       mongod(_ZN5mongo13deleteObjectsEPNS_16OperationContextEPNS_8DatabaseERKNS_10StringDataENS_7BSONObjENS_12PlanExecutor11YieldPolicyEbbbb+0x226) [0xb1a536]
       mongod(_ZN5mongo4repl21applyOperation_inlockEPNS_16OperationContextEPNS_8DatabaseERKNS_7BSONObjEbb+0x1041) [0xc577e1]
       mongod(_ZN5mongo4repl8SyncTail9syncApplyEPNS_16OperationContextERKNS_7BSONObjEb+0x2EE) [0xcaea6e]
       mongod(_ZN5mongo4repl14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x65) [0xcb1255]
       mongod(_ZN5mongo10threadpool6Worker4loopERKSs+0x2FB) [0xf02abb]
       mongod(+0xBB8A04) [0xfb8a04]
       libpthread.so.0(+0x7DF3) [0x7f68e4785df3]
       libc.so.6(clone+0x6D) [0x7f68e32351ad]
      -----  END BACKTRACE  -----
      2015-04-10T20:42:23.174+0000 E REPL     [repl writer worker 14] writer worker caught exception:  :: caused by :: 28556 Didn't find RecordId in WiredTigerRecordStore on: { ts: Timestamp 1428698543000|107, h: 3163678979668444329, v: 2, op: "d", ns: "abc.xyz", fromMigrate: true, o: { _id: ObjectId('5509b725e4b0bd119005d2b6') } }
      2015-04-10T20:42:23.174+0000 I -        [repl writer worker 14] Fatal Assertion 16360
      2015-04-10T20:42:23.174+0000 I -        [repl writer worker 14] 
       
      ***aborting after fassert() failure
      

      Second replica same trace! But a different collection and about 5 minutes later in time.

      Both of these failures are preceded by a chunk move (in the mongos logs), a few minutes after. Here is the only line in the MongoS that's pertaining:

      2015-04-10T20:37:14.516+0000 I SHARDING [Balancer] moving chunk ns: abc.xyz moving ( ns: abc.xyz, shard: rs_prod1_event_shard2:rs_prod1_event_shard2/p1e2r1.prod.evergage.com:27017,p1e2r2.prod.evergage.com:27017, lastmod: 7|1||000000000000000000000000, min: { user.id: "8812601eb0ca0000" }, max: { user.id: "b0adb576c86b0000" }) rs_prod1_event_shard2:rs_prod1_event_shard2/p1e2r1.prod.evergage.com:27017,p1e2r2.prod.evergage.com:27017 -> rs_prod1_event_shard3:rs_prod1_event_shard3/p1e3r1.prod.evergage.com:27017,p1e3r2.prod.evergage.com:27017

      A little later in the mongos, two minutes before failure is the only other log line related to this same collection:

      2015-04-10T20:40:09.779+0000 I SHARDING [Balancer] ChunkManager: time to load chunks for abc.xyz: 0ms sequenceNumber: 2667 version: 10|1||54ee6b6d4c0a4ab1d04d2986 based on: 9|1||54ee6b6d4c0a4ab1d04d2986
      

      So it seems like this was a result of a chunk move. We turned off the load balancer and ran without crashes for the last two days. Of course, it's not sustainable to leave the load balancer off.

      Replicas would NOT restart after this crash and kept crashing with the same stack trace on the same collection. They would start w/o the replSet argument, but with replSet, it would fail. We ended up removing the collection locally (started w/o replSet) and that allowed the node to restart as a replication member. (We then dropped that collection clusterwide)

      About an hour before the failure, the entire cluster has been upgraded from 3.0.1 to 3.0.2, specifically the mongodb-linux-x86_64-amazon-3.0.2.tgz release. Here's the uname -a output:

      $ uname -a
      Linux p1e1r2 3.4.73-64.112.amzn1.x86_64 #1 SMP Tue Dec 10 01:50:05 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                0 Vote for this issue
                Watchers:
                11 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: