[SERVER-9752] Resyncing a Stale Member, Stuck at STARTUP2 Created: 22/May/13 Updated: 17/Jul/13 Resolved: 18/Jun/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.3 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Taha Jahangir | Assignee: | Thomas Rueckstiess |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | FreeBSD 9.0 amd64 |
| Issue Links: | |
| Participants: | |
| Description |
|
We have a large dataset with a stale member, and we wanted to resync it from the primary (initial sync). After removing its data directory and starting it again, it went into the STARTUP2 state and started cloning data. This portion of the log file is:

I don't know whether the initial sync was successful or not, but we were seeing `initialSyncOplogApplication` log lines every ~5 minutes, and the sync time was advancing very slowly (it moved forward 1 hour after 5 hours!). We restarted the mongod service, but unfortunately it started the sync again from scratch, with a log like this:

Wed May 22 10:43:28.548 [rsStart] replSet I am 172.20.43.11:27118

I think the state of the server should be RECOVERING, not STARTUP2. Is this correct? |
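For anyone watching the same symptom, a minimal mongo shell sketch for checking the state and lag of a member doing an initial sync (host names here are illustrative, not from this ticket):

```javascript
// Run against any member of the replica set (e.g. the primary).
// rs.status() reports each member's state; a member performing an
// initial sync shows stateStr "STARTUP2" until the sync completes.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optime: " + tojson(m.optimeDate));
});

// On the primary, this prints how far each secondary is behind.
db.printSlaveReplicationInfo();
```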
| Comments |
| Comment by Taha Jahangir [ 29/May/13 ] |
|
But, unrelated to my problem: the initial sync has three stages (initial copy / building indexes / catching up from the oplog). I think the member state should change to RECOVERING after stage 2, so the member can be stopped and started again to catch up. In our scenario, we stopped the server during the catch-up stage, and all of the copied data was gone! |
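A rough way to judge whether a member stopped after the copy stage could still catch up is to compare its downtime with the primary's oplog window. A small sketch, assuming shell access to the primary:

```javascript
// On the primary: prints the configured oplog size and the time span
// between the first and last oplog entries ("log length start to end").
// If the member was down longer than this window, it cannot catch up
// and a full resync is required.
db.printReplicationInfo();
```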
| Comment by Taha Jahangir [ 29/May/13 ] |
|
Thanks, Thomas. You're exactly right: the secondary cannot catch up to the primary, but not because of network bandwidth, limited resources, or disk speed; it's because of extensive write locks! Our application is a write-heavy (analytical) app, and it seems MongoDB does not operate well in write-heavy environments. We love MongoDB's map/reduce framework, but the write overhead is not acceptable. We are searching for a DB with high write capability (like Cassandra) and with realtime map/reduce (Hadoop is suitable mostly for offline map/reduce jobs). |
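For what it's worth, write-lock pressure on a 2.4 primary can be roughly gauged from serverStatus; a sketch (field names may differ between versions):

```javascript
// Queue lengths under the global lock: a persistently large "writers"
// queue suggests writes are serializing behind the lock and starving
// replication reads on the secondaries' sync source.
var gl = db.serverStatus().globalLock;
printjson(gl.currentQueue);   // { total: ..., readers: ..., writers: ... }
printjson(gl.activeClients);  // clients currently holding or waiting on the lock
```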
| Comment by Thomas Rueckstiess [ 29/May/13 ] |
|
Hi Taha,

Sorry to hear that you are having trouble with your secondary catching up with your primary node. This sounds like you're writing to the primary faster than the secondary is able to replicate. Have you looked at whether the network is the bottleneck here? And are your nodes set up in MMS? That would help us better understand what the problem might be.

To work around the issue, you could take a snapshot of the other secondary (if available) and copy that to the node, as described on our page Resync a Member of a Replica Set. The fact that mongod deletes all data and starts the resync from the beginning is expected behavior, and will be addressed in

Regards, |
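If you take the snapshot route, the mongo shell side of it looks roughly like the sketch below (the copy mechanics and paths are up to you; this only illustrates the approach from the Resync a Member of a Replica Set page):

```javascript
// On a healthy SECONDARY that will serve as the copy source
// (do not run this on the primary):
db.fsyncLock();    // flush pending writes to disk and block further writes

// ... copy the dbpath contents to the stale member's data directory
//     with your usual tools (scp/rsync/filesystem snapshot) ...

db.fsyncUnlock();  // release the lock once the copy is finished
```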