[SERVER-43296] 3.6.13 ReplicaSet freeze during initial sync / batch sync Created: 12/Sep/19 Updated: 16/Sep/19 Resolved: 16/Sep/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | FRANCK LEFEBURE | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
We are migrating a 3.2 standalone server to a 3.6.13 sharded/replicated cluster. We have 5 RHEL7 nodes with plenty of RAM and SSD disks. The shard secondaries are hidden with priority 0, and clients write with write concern 1.

For the first 10 days after startup, data ingestion was fine and our dataset reached 100 GB on each data shard (we process a live flow plus a migration flow from the 3.2 standalone). Then, for some reason, we had a first crash on both the shard1 primary and secondary. After this crash the secondary was some hours behind the primary.

We now cannot stabilize the shard1 replica set. When we start the shard1 nodes, read/write performance is severely degraded and both the shard1 primary and secondary end up frozen, with clients deadlocked. We can see this in the secondary's log:

2019-09-12T11:25:37.926-0400 I REPL [replication-4] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out

If I disable replication (run shard1 standalone), it works like a charm. If I try to initial sync the secondary, it ends up frozen after a few GB of data have synced.

For now the shard1 replica set seems OK after re-seeding the secondary with a direct copy of the data files (the network throughput from node2 to node1 during the transfer was ~100M/s), but I'm afraid of another crash if the secondary ever lags behind the primary again.
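
For context, a hidden, priority-0 shard secondary like the one described above is configured through the replica set configuration document. The following is a minimal sketch in the 3.6 mongo shell; the member index, collection name, and sample document are illustrative assumptions, not values taken from this ticket:

    // run against the shard1 primary: set the secondary's election priority to 0 and hide it
    cfg = rs.conf()
    cfg.members[1].priority = 0     // member can never become primary
    cfg.members[1].hidden = true    // member is invisible to client reads
    rs.reconfig(cfg)

    // clients write with write concern w:1 (acknowledged by the primary only)
    db.events.insert({ ts: new Date() }, { writeConcern: { w: 1 } })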
|
| Comments |
| Comment by Danny Hatcher (Inactive) [ 16/Sep/19 ] |
|
Glad to hear it! |
| Comment by FRANCK LEFEBURE [ 13/Sep/19 ] |
|
Hi Daniel, I really appreciate your quick comment. The situation has been fixed by doing the following on all data shards:

These pages helped me:
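
One general way to confirm that a shard replica set has stabilized after a fix like this is to check member state and replication lag from the 3.6 mongo shell. This is only a sketch of a health check, not the reporter's actual procedure:

    // replication lag of each secondary relative to the primary (3.6 helper name)
    rs.printSlaveReplicationInfo()

    // state and last applied optime of every member
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  " + m.optimeDate)
    })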
|
| Comment by Danny Hatcher (Inactive) [ 12/Sep/19 ] |
|
It is possible that this is an environmental issue rather than a bug, in which case we will be limited in the amount of help we can provide. However, if you upload some files to our Secure Uploader, I can take a look. Please rest assured that files uploaded through that link are only viewable by MongoDB engineers. I would like the log files and "diagnostic.data" directories from each of the following:
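
In case it helps with gathering those directories: the "diagnostic.data" (FTDC) files live under each mongod's dbPath, which can be read back from a running node. A minimal sketch in the mongo shell, assuming dbPath was set explicitly in the node's configuration:

    // prints the data directory of the node you are connected to;
    // the metrics requested above are in <dbPath>/diagnostic.data
    db.adminCommand({ getCmdLineOpts: 1 }).parsed.storage.dbPath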
|