  Core Server
  SERVER-43296

3.6.13 ReplicaSet freeze during initial sync / batch sync

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Environment: ALL

      We are migrating a 3.2 standalone server to a 3.6.13 sharded/replicated cluster

      We have 5 RHEL7 nodes with plenty of RAM and SSD disks:

      • node1: mongos, config_server1
      • node2: mongod_shard1_primary, mongod_shard2_arbiter
      • node3: mongod_shard1_secondary, config_server2
      • node4: mongod_shard2_primary, config_server3
      • node5: mongod_shard2_secondary, mongod_shard1_arbiter

      The shard secondaries are hidden, with priority 0.

      Clients write with write concern w=1.
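      For reference, a hidden priority-0 secondary in this kind of topology is configured roughly as follows (a sketch only; the hostnames, ports, and member index are assumptions, not the actual cluster values):

```shell
# Connect to the shard1 primary and mark the secondary as hidden
# with priority 0 (hostnames/ports are illustrative only).
mongo --host node2 --port 27018 <<'EOF'
cfg = rs.conf()
// members[1] is assumed to be the secondary on node3
cfg.members[1].hidden = true
cfg.members[1].priority = 0
rs.reconfig(cfg)
EOF

# Clients then write through mongos with write concern w:1, e.g.:
mongo --host node1 --port 27017 <<'EOF'
db.getSiblingDB("test").coll.insert({ x: 1 }, { writeConcern: { w: 1 } })
EOF
```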

      For the first 10 days after startup, data ingestion was fine and our dataset reached 100 GB on each data shard (we process a live flow plus a migration flow from the 3.2 standalone).

      Then, for some reason, we had a first crash on both the shard1 primary and secondary.

      After this crash the secondary was several hours behind the primary.

      We now cannot stabilize the shard1 replica set. When we start the shard1 nodes, read/write performance degrades severely, and both the shard1 primary and secondary end up freezing, deadlocking clients.

      We can see this in the secondary's log:

      2019-09-12T11:25:37.926-0400 I REPL [replication-4] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out
      2019-09-12T11:25:37.926-0400 I REPL [replication-4] Finished fetching oplog during initial sync: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime and hash: { ts: Timestamp(1568301901, 325), t: 43 }[8532403056184220739]
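      The replication lag behind the primary can be inspected with rs.printSlaveReplicationInfo() (the 3.6-era shell helper) or by comparing member optimes in rs.status(). A sketch, with an assumed hostname/port:

```shell
# Check how far the secondary is behind the primary
# (hostname/port are illustrative; run against any shard1 member).
mongo --host node2 --port 27018 <<'EOF'
// Per-member lag summary (named printSlaveReplicationInfo in 3.6)
rs.printSlaveReplicationInfo()
// Or compare member optimes directly:
rs.status().members.forEach(function (m) {
  print(m.name + " " + m.stateStr + " optime: " + m.optimeDate)
})
EOF
```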

      If I disable replication (running shard1 standalone), it works like a charm.

      If I try an initial sync of the secondary, it ends with a freeze after a few GB of data have synced.

      At the moment the shard1 replica set seems OK after syncing the replica via a direct data file transfer (the network throughput from node2 to node1 during the file transfer was ~100 MB/s).
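      Seeding a secondary from a file copy of the primary's data files generally looks like the following (a hedged sketch; paths, ports, and hostnames are assumptions, and the source mongod must be cleanly shut down, or a filesystem snapshot used, so that the copied files are consistent):

```shell
# 1. Cleanly stop the source mongod so the data files are consistent
#    (or take a filesystem/LVM snapshot instead).
mongo --host node2 --port 27018 \
  --eval 'db.getSiblingDB("admin").shutdownServer()'

# 2. Copy the dbPath to the secondary (paths are illustrative).
rsync -av /data/shard1/ node3:/data/shard1/

# 3. Restart both nodes with their usual replica set configuration;
#    the seeded secondary then only replays recent oplog entries
#    instead of performing a full initial sync.
mongod --config /etc/mongod_shard1.conf
```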

      But I'm afraid of a crash if any secondary/primary lag occurs.

       

            Assignee:
            Danny Hatcher (Inactive) (daniel.hatcher@mongodb.com)
            Reporter:
            FRANCK LEFEBURE (franck.lefebure@gmail.com)
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: