Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 2.0.5
Component/s: Replication
Environment: EC2 instances using Amazon AMIs and YUM installations of MongoDB 2.0.5 in a PRIMARY, SECONDARY, ARBITER setup
Operating System: Linux
We set up a replica set a couple of weeks ago and were happily using it until about yesterday, when our monitoring alerted us that a secondary had become too stale to sync (RS102), so we investigated.
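For reference, the staleness can be confirmed from the mongo shell with the standard replication helpers (the hostname below is a placeholder, not our actual instance):

    # Run against the primary; prints the oplog window and how far each
    # secondary's last applied operation lags behind it.
    mongo --host primary.example.internal --eval 'db.printReplicationInfo(); db.printSlaveReplicationInfo()'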
For reasons I do not yet understand, the secondary had fallen so far behind that it was unable to keep up with replication. After consulting the docs, I attempted a couple of full resyncs, at first without restarting mongod, but when that process crashed I was forced to restart the server.
Standard replication started, connections to the primary and arbiter were made, hands were shaken and data started to be transferred, up until about 50% of a 120GB collection, when the process stopped again. The inter-instance connectivity is capable of sustained loads, so the collection should have been resynced within two hours at the latest, but it was not.
Long story short: a secondary became too stale to sync. Attempting to resync it by a) keeping the local.* files did not work, and b) removing all local.* and other DB files and starting completely fresh did not work either. Changing the oplog size to 100GB (it was 20GB before) and starting the process anew also yielded no results.
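For clarity, attempt b) boiled down to roughly the following; this is only a sketch, and the dbpath /data/db, the replica set name rs0, the service name, and the log path are assumptions (the 102400 MB oplog corresponds to the 100GB mentioned above):

    # Stop the stale secondary and wipe its data directory (assumed /data/db),
    # which removes local.* along with everything else.
    sudo service mongod stop
    rm -rf /data/db/*
    # Restart with a 100GB oplog (102400 MB) so the member performs a full initial sync.
    mongod --replSet rs0 --oplogSize 102400 --dbpath /data/db --fork --logpath /var/log/mongodb/mongod.log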
The secondary stays in the RECOVERING state but never recovers. I am hesitant to add another secondary at this point, as I believe it too would be unable to sync itself from the primary.
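This is how the stuck state can be observed (the hostname is again a placeholder); the member simply sits in state 3 / RECOVERING:

    # Print each member's name and current state as seen from the stuck secondary.
    mongo --host secondary.example.internal --eval 'rs.status().members.forEach(function(m) { print(m.name + " : " + m.stateStr); })'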
Any thoughts on what is going on here?