Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 3.2.1
Component/s: Replication
Hi, one node of our 3-node replica set failed with a bad disk, and we had to stop it. After 2 days or so, we got the node back online. We no longer had the dataDir, so we let the node resync automatically (which we have done before). The resync took about 18 hours (~500 GB). The node then appears to have started reading from the oplog, at which point the member became too stale to recover. So basically, we resynced for 18 hours and then the node died while reading the oplog, apparently because the oplog contained less than 18 hours of data.
I have no idea why this would be the case; maybe someone can shed some light on it. We could never size the oplog to hold everything needed for a resync (all the new data that was loaded) since a node went down. What if it was down for 2 weeks? This wasn't even a lot of data. What if it took a week to resync 5 TB? I'm not sure how this is supposed to work. I am going to try to attach the logs.
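
For anyone checking the same arithmetic: as far as I understand it, the oplog on the sync source has to cover the duration of the initial sync itself, not the time the node was down, because the syncing member must replay everything written since the clone started. Below is a minimal Python sketch of that check. The 18-hour sync time comes from the report above; the oplog size, the churn rate, the 1.5x safety factor, and all function names are hypothetical values chosen for illustration, not figures from this cluster. The actual window can be read from the `log length start to end` line printed by `rs.printReplicationInfo()` in the mongo shell.

```python
def oplog_window_hours(oplog_size_gb: float, oplog_gb_per_hour: float) -> float:
    """Hours of writes the oplog can hold, assuming a steady churn rate."""
    if oplog_gb_per_hour <= 0:
        raise ValueError("oplog_gb_per_hour must be positive")
    return oplog_size_gb / oplog_gb_per_hour


def survives_initial_sync(window_hours: float, sync_hours: float,
                          safety_factor: float = 1.5) -> bool:
    """True if the oplog window covers the whole initial sync with headroom."""
    return window_hours >= sync_hours * safety_factor


def min_oplog_size_gb(sync_hours: float, oplog_gb_per_hour: float,
                      safety_factor: float = 1.5) -> float:
    """Smallest oplog size (GB) that should outlast a sync of sync_hours."""
    return sync_hours * oplog_gb_per_hour * safety_factor


if __name__ == "__main__":
    SYNC_HOURS = 18.0          # from the report: resync took about 18 hours
    OPLOG_SIZE_GB = 20.0       # hypothetical oplog size
    OPLOG_GB_PER_HOUR = 2.0    # hypothetical oplog churn rate

    window = oplog_window_hours(OPLOG_SIZE_GB, OPLOG_GB_PER_HOUR)
    print("oplog window (hours):", window)
    print("covers initial sync:", survives_initial_sync(window, SYNC_HOURS))
    print("suggested minimum oplog size (GB):",
          min_oplog_size_gb(SYNC_HOURS, OPLOG_GB_PER_HOUR))
```

With these made-up inputs the window is shorter than the sync, which would match the "too stale" outcome described above. Under the same assumption, a week-long resync of 5 TB would need an oplog window of at least a week, however long the node had been offline beforehand.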