Core Server / SERVER-32696

All replica set nodes freeze during heavy update with enableMajorityReadConcern

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Environment: Ubuntu 16.04
    • ALL

      I have a 3-node replica set running version 3.4.10 on Ubuntu 16.04.

      I ran a schema update that touched all 7 million documents in a collection with a $set and a $rename. Because one of the secondaries is about 30 ms away in Azure, I used majority write concern to throttle the update and make sure at least one of the secondaries would stay in sync.
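      For reference, the update was of this general shape (collection and field names below are placeholders, not the real ones; I can share the actual query privately):

```javascript
// Hypothetical reconstruction of the migration, run in the 3.4 mongo shell.
// w: "majority" makes each write wait until a majority of the replica set
// has acknowledged it, throttling the update to the slowest voting member.
db.mycollection.update(
    {},                                      // all ~7 million documents
    {
        $set:    { schemaVersion: 2 },       // placeholder new field
        $rename: { oldName: "newName" }      // placeholder rename
    },
    { multi: true, writeConcern: { w: "majority", wtimeout: 30000 } }
);
```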

      The query started at 14:19:29. At that point the Azure secondary was probably 3-5 minutes behind because of earlier schema migrations. But by 14:27:00, the main secondary's oplog queries against the primary were timing out:

      Jan 13 14:27:00 secondary mongod.27017[28273]: [replication-163] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 18482564 -- target:primary:27017 db:local expDate:2018-01-13T14:27:00.216+0000 cmd:{ getMore: 16483339842, collection: "oplog.rs", maxTimeMS: 5000, term: 25, lastKnownCommittedOpTime: { ts: Timestamp 1515853539000|8343, t: 25 } }. Last fetched optime (with hash): { ts: Timestamp 1515853555000|1852, t: 25 }[1175973526525408650]. Restarts remaining: 3

      That's also the time the replica set stopped accepting connections from clients.
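      While the set was still responsive, this is the kind of lag data I could have captured (a sketch using the 3.4-era shell helpers, run on the primary; I'll gather this if I reproduce it again):

```javascript
// Summary of each secondary's lag behind the primary (3.4 helper name).
rs.printSlaveReplicationInfo();

// Raw per-member state and optimes, if the summary isn't enough.
rs.status().members.forEach(function (m) {
    print(m.name + "  state=" + m.stateStr + "  optime=" + tojson(m.optimeDate));
});
```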

      To get things running again I had to kill all three mongod processes (escalating to kill -9, because a clean shutdown tends to hang while the set is in this state).

      After letting the nodes sync up, I was able to reproduce this again with the same query.

      I can provide logs and the query privately if that would be useful.

      Just guessing based on what I learned in SERVER-32398: maybe the primary froze up because it ran out of cache while waiting for the secondary to apply changes. But the update was running with majority write concern, so I would have thought the secondary couldn't have fallen far enough behind for that to occur.
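      If it helps confirm or rule out that theory, next time I reproduce this I plan to watch the WiredTiger cache on the primary while the update runs (these are standard serverStatus cache statistics):

```javascript
// Run on the primary while the update is in flight. If bytes in cache
// approach the configured maximum while the commit point is held back,
// cache pressure is a plausible culprit for the freeze.
var c = db.serverStatus().wiredTiger.cache;
print("cache used:  " + c["bytes currently in the cache"]);
print("cache max:   " + c["maximum bytes configured"]);
print("dirty bytes: " + c["tracked dirty bytes in the cache"]);
```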

            bruce.lucas@mongodb.com Bruce Lucas (Inactive)
            mzs Michael Smith