[SERVER-6255] RS102 should automatically attempt repair Created: 29/Jun/12 Updated: 06/Dec/22 Resolved: 14/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.5 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Jed Smith | Assignee: | Backlog - Replication Team |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Amazon Linux |
| Assigned Teams: | Replication |
| Participants: |
| Description |
|
When a replica falls back to RS102, is it safe to try to fix that automatically by resyncing the DB? Administrator intervention is in most cases a good thing, so I'm just testing the water. I'd imagine there's a worry about data corruption here that I'm overlooking. Perhaps this should be an option? (If I hit RS102, resync from scratch?) We missed a few backup replicas because we don't really monitor them, and it would have rocked had they fixed themselves. |
| Comments |
| Comment by Jed Smith [ 29/Jun/12 ] |
|
Sure, agreed on the manual load, just seeing if it's something you guys have thought about making optional. I did indeed fully resync these replicas. Thanks for the response. |
| Comment by Scott Hernandez (Inactive) [ 29/Jun/12 ] |
|
I'm sure you have already read this, since it is printed in the logs when this comes up, but I'll summarize some important points below: http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member RS102 means that replication fell behind for too long, or a replica was offline too long, to catch up again. It generally means you need a larger oplog to cover that downtime or replication delay. Recovery is a manual step because of the added load it places on another system: the stale member has to do a full resync from the primary. This is a case where no single behavior fits everyone's needs, so it is left to the admin to manage. One thing some people do is wipe the data dir during a restart to initiate the full resync, but that is something you need to script yourself. |
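
For anyone who wants to script the data dir wipe mentioned above, here is a minimal sketch in Python. It is not an official tool: `is_too_stale` and `wipe_dbpath` are hypothetical helper names, and the staleness check assumes you have already obtained the two timestamps yourself (the member's last applied op time and the oldest entry in the sync source's oplog). The wipe defaults to a dry run, and it should only ever be run while mongod on that member is stopped.

```python
import shutil
from pathlib import Path
from typing import List


def is_too_stale(member_last_applied: int, source_oldest_oplog: int) -> bool:
    """Hypothetical check: the member cannot catch up from the oplog when its
    last applied op is older than the oldest entry the sync source still holds.
    Both arguments are assumed to be comparable timestamps (e.g. epoch seconds)."""
    return member_last_applied < source_oldest_oplog


def wipe_dbpath(dbpath: str, dry_run: bool = True) -> List[str]:
    """Remove everything inside dbpath (but not dbpath itself).

    Returns the sorted names of the entries that were (or would be) removed.
    Assumption: mongod for this dbpath is stopped before this is called.
    """
    root = Path(dbpath)
    if not root.is_dir():
        raise ValueError(f"not a directory: {dbpath}")
    removed = []
    for entry in sorted(root.iterdir()):
        removed.append(entry.name)
        if dry_run:
            continue  # only report what would be deleted
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return removed
```

A wrapper around the restart could call `wipe_dbpath(path)` first to log what would be deleted, then `wipe_dbpath(path, dry_run=False)` only when `is_too_stale(...)` is true, and then start mongod so that it performs the full resync. Whether to do that automatically is exactly the load trade-off described above.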