Core Server / SERVER-21943

MR crashes in a sharded WT environment

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.0.4
    • Component/s: MapReduce, Sharding, WiredTiger
    • Labels: None
    • ALL

      1. Install a sharded MongoDB 3.0.4 setup w/
      2. Run an MR every 5 min
      3. After a few days the MR (and the daemon) will crash with the following error
      4. The same does being recreated in MMAPv1

      We got an error similar to SERVER-16429 on version 3.0.4.
      It happens once a week in a sharded environment with the WT engine on the mongod instances.
      The evidence is a crashed shard, while on the other shard a leftover tmp table remains that was not deleted.
      Attached is the error log from the failed shard:

      2015-12-12T00:35:40.814+0000 I COMMAND  [conn92487] mr failed, removing collection :: caused by :: WriteConflict
      2015-12-12T00:35:40.818+0000 I COMMAND  [conn92487] CMD: drop XXXX.tmp.mr.account_231015
      2015-12-12T00:35:40.822+0000 I NETWORK  [initandlisten] connection accepted from XXX.XXX.XXX.XXX:XXXXX #110728 (47 connections now open)
      2015-12-12T00:35:40.920+0000 I COMMAND  [conn92487] command XXXX.$cmd command: drop { drop: "tmp.mr.account_231015" } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:122 locks:{ Global: { acquireCount: { r: 8, w: 4 } }, Database: { acquireCount: { r: 1, w: 1, R: 1, W: 4 }, acquireWaitCount: { W: 4 }, timeAcquiringMicros: { W: 7627049380 } }, Collection: { acquireCount: { r: 1, w: 1, W: 1 } } } 102ms
      2015-12-12T00:35:40.920+0000 I QUERY    [conn110694] query XXXX.endpoints query: { $query: { gw: { $gt: 0 }, $or: [ { status: "unmanaged" }, { status: "managed" } ] }, $readPreference: { mode: "secondaryPreferred" } } planSummary: IXSCAN { gw: -1.0, status: -1.0 } ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 3690878996 } }, Collection: { acquireCount: { r: 2 } } } 106ms
      2015-12-12T00:35:40.920+0000 I QUERY    [conn110692] query XXXX.endpoints query: { $query: { gw: { $gt: 0 }, $or: [ { status: "unmanaged" }, { status: "managed" } ] }, $readPreference: { mode: "secondaryPreferred" } } planSummary: IXSCAN { gw: -1.0, status: -1.0 } ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:3 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 8 } }, Database: { acquireCount: { r: 4 }, acquireWaitCount: { r: 4 }, timeAcquiringMicros: { r: 3690908386 } }, Collection: { acquireCount: { r: 4 } } } 182338ms
      2015-12-12T00:35:40.927+0000 I COMMAND  [conn92491] command admin.$cmd command: listDatabases { listDatabases: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:290 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 14825915727 } } } 179101ms
      2015-12-12T00:35:40.978+0000 E NETWORK  [conn92487] Uncaught std::exception: std::exception, terminating
      2015-12-12T00:35:40.978+0000 I CONTROL  [conn92487] dbexit:  rc: 100
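
The `timeAcquiringMicros` values in the log above are easier to compare once converted to seconds. The following is a minimal sketch, assuming the 3.0-style log line layout shown in this report; the helper name `lock_waits` and the regular expressions are illustrative assumptions, not part of any MongoDB tool.

```python
import re

# Captures the body of each "timeAcquiringMicros: { ... }" entry in a log line.
TIME_ACQUIRING = re.compile(r"timeAcquiringMicros: \{ ([^}]*) \}")
# Captures "mode: microseconds" pairs such as "W: 7627049380".
MODE_VALUE = re.compile(r"(\w+): (\d+)")
# Captures the first bracketed context name, e.g. "conn92487".
CONTEXT = re.compile(r"\[(\w+)\]")


def lock_waits(line):
    """Return (context, {lock mode: seconds waited}) for one log line.

    Hypothetical helper: sums every timeAcquiringMicros value per lock mode
    and converts microseconds to seconds.
    """
    context = CONTEXT.search(line)
    waits = {}
    for body in TIME_ACQUIRING.findall(line):
        for mode, micros in MODE_VALUE.findall(body):
            waits[mode] = waits.get(mode, 0.0) + int(micros) / 1_000_000
    return (context.group(1) if context else None, waits)


if __name__ == "__main__":
    # Shortened copy of the drop command line from the log above.
    sample = (
        '2015-12-12T00:35:40.920+0000 I COMMAND  [conn92487] command XXXX.$cmd '
        'command: drop { drop: "tmp.mr.account_231015" } locks:{ Database: '
        '{ acquireWaitCount: { W: 4 }, timeAcquiringMicros: { W: 7627049380 } } } 102ms'
    )
    print(lock_waits(sample))
```

Running the helper over each line of the failed shard's log gives a quick per-connection view of how long operations waited on locks before the crash.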
      

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            MosheKaplan Moshe Kaplan [X]
            Votes:
            0
            Watchers:
            9