Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
Labels: None
Operating System: ALL
We are migrating a 3.2 standalone server to a 3.6.13 sharded/replicated cluster.
We have 5 RHEL7 nodes with plenty of RAM and SSD disks:
- node1: mongos, config_server1
- node2: mongod_shard1_primary, mongod_shard2_arbiter
- node3: mongod_shard1_secondary, config_server2
- node4: mongod_shard2_primary, config_server3
- node5: mongod_shard2_secondary, mongod_shard1_arbiter
The shard secondaries are hidden with priority 0, and clients write with writeConcern w:1 (see the configuration sketch below).
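For reference, a minimal sketch of how such a member would be configured from the mongo shell (the member index and collection name are assumptions for illustration, not our exact config):

// Run on the shard1 primary; member index 1 is assumed to be the secondary
cfg = rs.conf()
cfg.members[1].priority = 0    // never eligible for election
cfg.members[1].hidden = true   // invisible to client reads
rs.reconfig(cfg)

// Clients write with w:1, i.e. acknowledged by the primary only:
db.mycoll.insert({ event: "sample" }, { writeConcern: { w: 1 } })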
For the first 10 days after startup, data ingestion was fine and our dataset reached 100 GB on each data shard (we process a live flow plus a migration flow from the 3.2 standalone).
Then, for some reason, we had a first crash on both the shard1 primary and secondary.
After this crash the secondary was several hours behind the primary.
We now cannot stabilize the shard1 replica set. When we start the shard1 nodes, read/write performance degrades severely and both the shard1 primary and secondary end up frozen, with clients deadlocked.
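When the freeze happens, something like the following generic sketch (the thresholds and collection name are illustrative) at least shows which operations are stuck, and bounds how long a single client query may block:

// List operations that have been running for more than 30 seconds
db.currentOp({ "secs_running": { $gt: 30 } })

// Cap how long one query may block a client (value illustrative)
db.mycoll.find({ event: "sample" }).maxTimeMS(5000)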
We can see this in the secondary's log:
2019-09-12T11:25:37.926-0400 I REPL [replication-4] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out
2019-09-12T11:25:37.926-0400 I REPL [replication-4] Finished fetching oplog during initial sync: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime and hash: { ts: Timestamp(1568301901, 325), t: 43 }[8532403056184220739]
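The "no more query restarts left" part suggests the oplog fetcher exhausted its retry budget. As an assumption about tuning (not a claimed fix), that budget can be inspected and raised via server parameters on the syncing member; the value 10 below is purely illustrative:

// Inspect the current restart budgets for the oplog fetcher
db.adminCommand({ getParameter: 1, oplogFetcherInitialSyncMaxFetcherRestarts: 1 })
db.adminCommand({ getParameter: 1, oplogFetcherSteadyStateMaxFetcherRestarts: 1 })

// Raise the initial-sync restart budget (value illustrative)
db.adminCommand({ setParameter: 1, oplogFetcherInitialSyncMaxFetcherRestarts: 10 })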
If I disable replication (run shard1 standalone), it works like a charm.
If I try an initial sync of the secondary, it freezes after a few GB of data have been synced.
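To see where the initial sync stalls, replSetGetStatus can report initial sync progress when run against the syncing member:

// Returns an initialSyncStatus section with per-collection copy progress
db.adminCommand({ replSetGetStatus: 1, initialSync: 1 })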
For now the shard1 replica set seems OK after resyncing the secondary by copying the data files directly (the network throughput from node2 to node1 during the file transfer was ~100M/s).
But I'm afraid of another crash if the secondary ever lags behind the primary again.
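A sketch for keeping an eye on the lag, so we can react before the secondary falls hours behind again:

// Print how far each secondary is behind, per member
rs.printSlaveReplicationInfo()

// Or compute the lag in seconds from rs.status()
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
  if (m.stateStr === "SECONDARY") {
    print(m.name + " lag (s): " + (primary.optimeDate - m.optimeDate) / 1000);
  }
});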