[SERVER-6699] secondary will not emerge from RECOVERING state Created: 03/Aug/12 Updated: 11/Jul/16 Resolved: 07/Aug/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kerim Satirli | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
EC2 instances using Amazon AMIs and YUM installations of MongoDB 2.0.5 in a PRIMARY, SECONDARY, ARBITER setup |
||
| Operating System: | Linux |
| Participants: |
| Description |
|
We set up a replica set a couple of weeks ago and were happily using it until about yesterday, when our monitoring alerted us that a secondary had become too stale to sync (error RS102), so we investigated. For reasons I do not yet understand, the secondary had fallen so far behind that it was unable to keep up with replication. Following the docs, I first attempted a couple of full resyncs without restarting mongod, but when that process crashed I was forced to restart the server. Standard replication then started: connections to the primary and arbiter were made, hands were shaken, and data began to transfer, up until about 50% of a 120GB collection, when the process stopped again. The inter-instance connectivity can sustain heavy loads, so the collection should have been resynced within two hours at the latest, but it was not. Long story short: a secondary became too stale to sync, and attempting to resync it by a.) keeping the local.* files did not work, but b.) removing all local.* and other DB files and starting completely fresh did not work either. Changing the oplogSize to 100GB (it was 20GB before) and restarting the process also yielded no results. The secondary stays in RECOVERING state but never recovers. I am hesitant to add another secondary at this point, as I believe it too would be unable to sync itself from the primary. Any thoughts on what is going on here? |
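For reference, a full resync of the kind described in option b.) is typically performed along these lines. This is a hedged sketch only: the service name, dbpath, and init system are assumptions and must be adapted to the actual deployment; it is not the reporter's exact procedure.

```shell
# Sketch of a full resync of a stale secondary (paths/service names are assumptions).
# 1. Stop mongod on the stale secondary.
sudo service mongod stop

# 2. Remove the member's data files so it performs a fresh initial sync.
#    Adjust the path to match the configured dbpath.
rm -rf /var/lib/mongo/*

# 3. Restart mongod; the member rejoins the set and begins initial sync
#    from scratch, staying in RECOVERING/STARTUP2 until it catches up.
sudo service mongod start
```

Note that changing oplogSize in the config only affects an oplog that is created anew; on the primary, the existing oplog keeps its old size unless it is explicitly rebuilt, which is consistent with the resize attempt here having no visible effect.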
| Comments |
| Comment by Kristina Chodorow (Inactive) [ 07/Aug/12 ] |
|
Great, glad it worked out! |
| Comment by Kerim Satirli [ 07/Aug/12 ] |
|
Kristina, thanks for the pointers. Going back to "MongoDB: The Definitive Guide" and your pointers (in both places) helped me figure this out. The primary's oplog size was, indeed, too small. We ended up dumping the databases we needed and reimporting them. The secondary was able to keep up just fine, and I added another secondary to mitigate problems like this in the future. |
| Comment by Kristina Chodorow (Inactive) [ 06/Aug/12 ] |
|
It sounds like your primary's oplog is too small. Is there a less busy time when you can sync your secondary? You can run db.printReplicationInfo() on the primary to see how long (in seconds) your oplog is: that's how long you have for initial sync. If you don't have a long enough primary oplog, you'll have to take some downtime to get the secondary up. I can walk you through a few options for that. Longer term, you're going to want a longer oplog on the primary. |
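The check Kristina describes can be run from the mongo shell on the primary; the output below is illustrative only (the numbers are invented, not from this deployment):

```shell
# On the PRIMARY, report the oplog's configured size and time window.
mongo --eval "db.printReplicationInfo()"
# Illustrative output:
#   configured oplog size:   20480MB
#   log length start to end: 7200secs (2hrs)
#   oplog first event time:  ...
#   oplog last event time:   ...
```

The "log length start to end" value is the replication window: if an initial sync (or a lagging secondary's catch-up) takes longer than this, the secondary falls off the end of the oplog and becomes too stale to sync, which matches the behavior reported here.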