[SERVER-43296] 3.6.13 ReplicaSet freeze during initial sync / batch sync Created: 12/Sep/19  Updated: 16/Sep/19  Resolved: 16/Sep/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: FRANCK LEFEBURE Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We are migrating a 3.2 standalone server to a 3.6.13 sharded/replicated cluster

We have 5 RHEL7 nodes with plenty of RAM and SSD disks:

  • node1: mongos, config_server1
  • node2: mongod_shard1_primary, mongodb_shard2_arbiter
  • node3: mongod_shard1_secondary, config_server2
  • node4: mongod_shard2_primary, config_server3
  • node5: mongod_shard2_secondary, mongodb_shard1_arbiter

The shard secondaries are hidden, with priority 0.

Clients write with write concern w=1.
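
For reference, each shard1 data-bearing node in this layout would have a mongod.conf roughly like the sketch below (the dbPath, replica set name and port are illustrative assumptions, not values from this ticket). Note that the hidden / priority 0 attributes are not config-file settings; they are set on the member document in the replica set configuration via rs.reconfig():

storage:
  dbPath: /data/shard1          # hypothetical data path
  engine: wiredTiger
replication:
  replSetName: shard1           # hypothetical replica set name
sharding:
  clusterRole: shardsvr
net:
  port: 27018                   # example port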

For the first 10 days after startup, data ingestion was fine and our dataset reached 100 GB on each data shard (we process a live flow plus a migration flow from the 3.2 standalone).

Then, for some reason, we had a first crash on both the shard1 primary and secondary.

After this crash the secondary was some hours behind the primary.

We now cannot stabilize the shard1 replica set. When we start the shard1 nodes, read/write performance is heavily degraded, and both the shard1 primary and secondary end up freezing, with clients deadlocking.

We can see this in the log of the secondary:

2019-09-12T11:25:37.926-0400 I REPL [replication-4] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out
2019-09-12T11:25:37.926-0400 I REPL [replication-4] Finished fetching oplog during initial sync: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime and hash: { ts: Timestamp(1568301901, 325), t: 43 }[8532403056184220739]

If I disable replication (run shard1 standalone), it works like a charm.

If I try an initial sync of the secondary, it ends with a freeze after a few GB of data have been synced.

Currently the shard1 replica set seems OK after resyncing the secondary from a direct data file copy (the network throughput from node2 to node1 during the file transfer was ~100 MB/s).

But I'm afraid of another crash if the secondary ever lags behind the primary again.

 



 Comments   
Comment by Danny Hatcher (Inactive) [ 16/Sep/19 ]

Glad to hear it!

Comment by FRANCK LEFEBURE [ 13/Sep/19 ]

Hi Daniel,

I really appreciate your quick comment.

The situation has been fixed by applying the following on all data shards (a sketch of the resulting config fragment follows this list):

  • enableMajorityReadConcern: false
  • cacheSizeGB: raised from 1 back to the default of ~50% of host RAM (we use a third-party packaged MongoDB that ships with this unfortunate 1 GB default)
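
A hedged sketch of the corresponding mongod.conf fragment on each data shard (only the two settings named above come from this ticket; the layout and comments are assumptions):

replication:
  enableMajorityReadConcern: false   # turn off "majority" read concern
# storage:
#   wiredTiger:
#     engineConfig:
#       cacheSizeGB: 1               # <- the packaged 1 GB override was removed, letting WiredTiger
#                                    #    default to roughly 50% of host RAM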

These pages helped me:

 

 

Comment by Danny Hatcher (Inactive) [ 12/Sep/19 ]

It is possible that this is an environmental issue rather than a bug, in which case we will be limited in the amount of help we can provide. However, if you provide some files to our Secure Uploader, I can take a look. Please rest assured that any files uploaded to that link will only be viewable by MongoDB engineers.

I would like the log files and "diagnostic.data" directories from each of the following:

  • all shard nodes that had problems
  • the config server Primary
  • the mongos