Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
Labels: None
Operating System: ALL
We are migrating a 3.2 standalone server to a 3.6.13 sharded/replicated cluster.
We have 5 RHEL7 nodes with plenty of RAM and SSD disks:
- node1: mongos, config_server1
- node2: mongod_shard1_primary, mongod_shard2_arbiter
- node3: mongod_shard1_secondary, config_server2
- node4: mongod_shard2_primary, config_server3
- node5: mongod_shard2_secondary, mongod_shard1_arbiter
The shard secondaries are hidden with priority 0, and clients write with writeConcern w:1 (see the configuration sketch below).
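For reference, a minimal sketch of how such a member would be configured from the mongo shell (the member index and collection name are assumptions for illustration, not our exact config):

// Run on the shard1 primary; member index 1 is assumed to be the secondary
cfg = rs.conf()
cfg.members[1].priority = 0    // never eligible for election
cfg.members[1].hidden = true   // invisible to client reads
rs.reconfig(cfg)

// Clients write with w:1, i.e. acknowledged by the primary only:
db.mycoll.insert({ event: "sample" }, { writeConcern: { w: 1 } })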
For the first 10 days after startup, data ingestion was fine and our dataset reached 100 GB on each data shard (we process a live flow plus a migration flow from the 3.2 standalone).
Then, for some reason, we had a first crash on both the shard1 primary and secondary.
After this crash the secondary was several hours behind the primary.
We now cannot stabilize the shard1 replica set. When we start the shard1 nodes, read/write performance degrades severely and both the shard1 primary and secondary end up frozen, with clients deadlocked.
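When the freeze happens, something like the following generic sketch (the thresholds and collection name are illustrative) at least shows which operations are stuck, and bounds how long a single client query may block:

// List operations that have been running for more than 30 seconds
db.currentOp({ "secs_running": { $gt: 30 } })

// Cap how long one query may block a client (value illustrative)
db.mycoll.find({ event: "sample" }).maxTimeMS(5000)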
We can see this in the secondary's log:
2019-09-12T11:25:37.926-0400 I REPL [replication-4] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out
2019-09-12T11:25:37.926-0400 I REPL [replication-4] Finished fetching oplog during initial sync: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime and hash: { ts: Timestamp(1568301901, 325), t: 43 }[8532403056184220739]
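The "no more query restarts left" part suggests the oplog fetcher exhausted its retry budget. As an assumption about tuning (not a claimed fix), that budget can be inspected and raised via server parameters on the syncing member; the value 10 below is purely illustrative:

// Inspect the current restart budgets for the oplog fetcher
db.adminCommand({ getParameter: 1, oplogFetcherInitialSyncMaxFetcherRestarts: 1 })
db.adminCommand({ getParameter: 1, oplogFetcherSteadyStateMaxFetcherRestarts: 1 })

// Raise the initial-sync restart budget (value illustrative)
db.adminCommand({ setParameter: 1, oplogFetcherInitialSyncMaxFetcherRestarts: 10 })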
If I disable replication (run shard1 standalone), it works like a charm.
If I try an initial sync of the secondary, it freezes after a few GB of data have been synced.
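To see where the initial sync stalls, replSetGetStatus can report initial sync progress when run against the syncing member:

// Returns an initialSyncStatus section with per-collection copy progress
db.adminCommand({ replSetGetStatus: 1, initialSync: 1 })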
For now the shard1 replica set seems OK after resyncing the secondary by copying the data files directly (the network throughput from node2 to node1 during the file transfer was ~100M/s).
But I'm afraid of another crash if the secondary ever lags behind the primary again.
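A sketch for keeping an eye on the lag, so we can react before the secondary falls hours behind again:

// Print how far each secondary is behind, per member
rs.printSlaveReplicationInfo()

// Or compute the lag in seconds from rs.status()
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
  if (m.stateStr === "SECONDARY") {
    print(m.name + " lag (s): " + (primary.optimeDate - m.optimeDate) / 1000);
  }
});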