Core Server / SERVER-26943

Non-replacement updates to the config.shards collection can crash the CSRS secondary after rollback

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.4.0-rc2
    • Fix Version/s: 3.4.0-rc4
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      No deterministic way. Hit through the continuous stepdown suite.
    • Sprint:
      Sharding 2016-11-21
    • Linked BF Score:
      0

      Description

      The config servers have a special opObserver insert hook to intercept updates from a legacy v3.2 mongos to the config.shards collection and maintain the shard identity.

      This hook always expects a complete shard document to be inserted, which holds on the primaries. However, on a secondary that is recovering from a rollback, an update followed by a delete may be re-applied after the deletion has already been applied. The update is then converted to an upsert, which produces an incomplete shard document and causes an invariant failure.
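      The update-to-upsert conversion described above can be sketched with a toy model. This is not MongoDB code: `apply_oplog_entry` is a hypothetical helper, the collection is a plain dict keyed by `_id`, and the only assumption taken from the description is that a replicated update whose target document is missing is applied as an upsert.

```python
def apply_oplog_entry(collection, entry):
    """Apply one oplog-style entry to a dict keyed by _id (toy model only)."""
    op = entry["op"]
    if op == "i":
        collection[entry["o"]["_id"]] = dict(entry["o"])
    elif op == "d":
        # Deleting a document that is already gone is a no-op.
        collection.pop(entry["o"]["_id"], None)
    elif op == "u":
        _id = entry["o2"]["_id"]
        # Assumed upsert semantics: a missing target is created from the
        # query fields, then the $set modifiers are applied on top.
        doc = collection.setdefault(_id, {"_id": _id})
        doc.update(entry["o"]["$set"])
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Oplog entries shaped like the ones in this ticket (ts, t, h, v omitted).
update = {"op": "u", "ns": "config.shards",
          "o2": {"_id": "shard0001"}, "o": {"$set": {"draining": True}}}
delete = {"op": "d", "ns": "config.shards", "o": {"_id": "shard0001"}}

# State on the recovering secondary: the delete has already taken effect,
# so the shard document is absent when the update is applied again.
shards = {}
apply_oplog_entry(shards, update)
resurrected = shards["shard0001"]  # {'_id': 'shard0001', 'draining': True}

apply_oplog_entry(shards, delete)  # the delete then removes it again
```

      In this model the re-applied update briefly creates a shard document that has no `host` field, which is the incomplete document the hook is not prepared for.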

      For example, the following sequence:

      c23012| 2016-11-07T19:01:26.092+0000 D ASIO     [NetworkInterfaceASIO-RS-0] Request 286 finished with response: { cursor: { firstBatch: [ { ts: Timestamp 1478545283000|1, t: 4, h: 565510199539623323, v: 2, op: "u", ns: "config.shards", o2: { _id: "shard0001" }, o: { $set: { draining: true } } }, { ts: Timestamp 1478545285000|8, t: 4, h: -4558147567226493446, v: 2, op: "d", ns: "config.shards", o: { _id: "shard0001" } }, ok: 1.0 }
       
      c23012| 2016-11-07T19:01:26.092+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1478545283000|1, t: 3 }. source's GTE: { ts: Timestamp 1478545283000|1, t: 4 } hashes: (-6821259113153738378/565510199539623323)
       
      c23012| 2016-11-07T19:01:26.107+0000 D ASIO     [rsBackgroundSync] startCommand: RemoteCommand 298 -- target:ip-10-152-38-201:23013 db:local expDate:2016-11-07T19:01:31.107+0000 cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1478545272000|5 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 }
       
      c23012| 2016-11-07T19:01:26.108+0000 D ASIO     [NetworkInterfaceASIO-RS-0] Request 298 finished with response: { cursor: { firstBatch: [ { ts: Timestamp 1478545283000|1, t: 4, h: 565510199539623323, v: 2, op: "u", ns: "config.shards", o2: { _id: "shard0001" }, o: { $set: { draining: true } } }, { ts: Timestamp 1478545285000|8, t: 4, h: -4558147567226493446, v: 2, op: "d", ns: "config.shards", o: { _id: "shard0001" } }, ok: 1.0 }
      

      Results in this fatal exception:

      c23012| 2016-11-07T19:01:26.109+0000 F REPL     [repl writer worker 15] writer worker caught exception: 4 Missing expected field "host" on: { ts: Timestamp 1478545283000|1, t: 4, h: 565510199539623323, v: 2, op: "u", ns: "config.shards", o2: { _id: "shard0001" }, o: { $set: { draining: true } } }
      c23012| 2016-11-07T19:01:26.109+0000 I -        [repl writer worker 15] Fatal assertion 16359 NoSuchKey: Missing expected field "host" at src/mongo/db/repl/sync_tail.cpp 1054
      c23012| 2016-11-07T19:01:26.109+0000 I -        [repl writer worker 15]
      c23012|
      c23012| ***aborting after fassert() failure
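
      As a rough illustration of why the log reports `Missing expected field "host"`, the sketch below models a strict validator that assumes every inserted shard document is complete. The names `NoSuchKey`, `REQUIRED_SHARD_FIELDS` and `validate_shard_document` are hypothetical, the host string is a made-up value, and the set of required fields is an assumption based only on the field named in the error above. The real hook lives in the server's C++ code and is not reproduced here.

```python
class NoSuchKey(Exception):
    """Toy stand-in for the NoSuchKey error code seen in the log."""


# Assumption: only the fields visible in this ticket are checked.
REQUIRED_SHARD_FIELDS = ("_id", "host")


def validate_shard_document(doc):
    """Return doc unchanged if it has every required field, else raise."""
    for field in REQUIRED_SHARD_FIELDS:
        if field not in doc:
            raise NoSuchKey(f'Missing expected field "{field}"')
    return doc


# A complete document, as a primary would insert it, passes validation.
complete = {"_id": "shard0001", "host": "example-host:20001"}
validate_shard_document(complete)

# The document produced by an update-turned-upsert has no "host" field.
partial = {"_id": "shard0001", "draining": True}
try:
    validate_shard_document(partial)
    message = None
except NoSuchKey as exc:
    message = str(exc)
```

      On a primary the check is harmless because inserts always carry the full document. On the recovering secondary the same check turns a transient, soon-to-be-deleted document into a fatal error.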
      
