[SERVER-4417] Replication stops in one data centre when reIndex issued on one secondary in that datacentre Created: 02/Dec/11 Updated: 29/Feb/12 Resolved: 07/Dec/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pierre Dane | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Windows 64 Mutex build |
| Operating System: | Windows |
| Participants: |
| Description |
|
We have three data centres. The master is in one, and we have pairs of secondaries in the other two; these have priority = 0 and sometimes votes = 0. Only one of the secondaries in each data centre is read from.

Today I issued a reIndex on the secondary that was not being read from, and replication to the one that was being read from halted. As soon as the reIndex completed, the other secondary started syncing again. Servers in the other data centres were not affected and continued syncing.

Here is a snippet of the mongostat:

insert query update delete getmore command flushes mapped vsize res locked % idx miss % qr|qw ar|aw netIn netOut c
localhost:3003 0 1239 301 0 96 7 0 140g 141g 12g 3 0 0|0 0|0 254k 4m
localhost:3003 0 1205 307 0 107 6 0 140g 141g 12g 4.6 0 0|0 0|0 253k 3m
localhost:3003 0 1389 355 0 102 8 0 140g 141g 12g 0 0 0|0 0|0 290k 4m
localhost:3003 0 1327 360 0 109 5 0 140g 141g 12g 3 0 1|0 1|0 285k 4m
localhost:3003 0 1231 311 0 101 8 0 140g 141g 12g 3 0 0|0 0|0 255k 4m
localhost:3003 0 1185 350 0 98 8 0 140g 141g 12g 3 0 0|0 1|0 264k 4m
localhost:3003 0 1207 328 0 102 5 0 140g 141g 12g 3 0 0|0 0|0 259k 4m
localhost:3003 0 1356 323 0 106 8 0 140g 141g 12g 4.6 0 0|0 0|0 275k 5m
localhost:3003 0 1265 333 0 116 6 0 140g 141g 12g 7.6 0 0|0 0|0 268k 5m
localhost:3003 0 1286 337 0 117 19 0 140g 141g 12g 12.3 0 0|0 0|0 273k 9m
localhost:3003 0 1119 52 0 18 6 0 140g 141g 12g 0 0 0|0 20|0 146k 4m |
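For reference, a minimal mongo shell sketch of a replica set configuration following the pattern described above; the set name, hostnames and ports are hypothetical, and only the priority = 0 / votes = 0 arrangement of the remote secondaries is taken from the description:

```
// Hypothetical names; only the priority/votes pattern for the remote
// data-centre secondaries comes from the description above.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "dc1-a:27017" },                         // master data centre
    { _id: 1, host: "dc2-a:27017", priority: 0 },            // secondary that is read from
    { _id: 2, host: "dc2-b:27017", priority: 0, votes: 0 },  // secondary that was reIndexed
    { _id: 3, host: "dc3-a:27017", priority: 0 },
    { _id: 4, host: "dc3-b:27017", priority: 0, votes: 0 }
  ]
})
```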
| Comments |
| Comment by Pierre Dane [ 07/Dec/11 ] |
|
Well, I see that my secondaries are syncing from the other secondary in that data centre, which means no double sync across the Atlantic. Woop woop. Thanks. |
| Comment by Kristina Chodorow (Inactive) [ 07/Dec/11 ] |
|
Yes, 2.0 members will choose the member with the lowest ping time (who's ahead of them in operations) to sync from. Check rs.status() to see ping times and who members are syncing from. |
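A quick way to check this from the mongo shell; field names follow the 2.0-era rs.status() output and may vary slightly by version, so treat this as a sketch rather than an exact script:

```
// Print each member's ping time and sync source, per the advice above.
// pingMs and syncingTo are only reported for members other than the one
// the shell is connected to.
rs.status().members.forEach(function (m) {
  print(m.name + "  " + m.stateStr +
        "  pingMs: " + m.pingMs +
        "  syncingTo: " + m.syncingTo);
});
```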
| Comment by Pierre Dane [ 07/Dec/11 ] |
|
Thanks Kristina - I thought that after the initial sync completed, all secondaries synced off the master, and that sync chaining was still something being worked on. Did I miss this new functionality? Great if slave chaining is implemented. |
| Comment by Kristina Chodorow (Inactive) [ 07/Dec/11 ] |
|
I'm guessing that the other server in the data center was syncing from the one that was being reIndexed. Did you happen to run rs.status() on the stuck member during the reIndex?

Secondaries try to sync from the nearest member (in terms of ping time), so it's likely for members in a data center to sync from other members in the same data center.

In the future, I'd recommend running rs.status() on other members before kicking off a reIndex. Check the "syncingTo" field to make sure no one is syncing from the member being reindexed, as reindexing will block replication.

If someone (A) is syncing from a member you want to reindex (B), one option is to take B offline and start it without the --replSet option on a different port. This makes it a stand-alone server that the replica set can't find (because it's listening on a different port), and then you can do the reIndex without affecting the set at all. Meanwhile, A will choose someone else to sync from. |
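A rough sketch of that workaround from the mongo shell; the dbpath, ports, and database/collection names are placeholders, not values from this ticket:

```
// Rough sketch of the workaround above; names and ports are placeholders.
// First shut B down cleanly, then restart it WITHOUT --replSet on a
// different port so the replica set cannot find it, e.g.:
//   mongod --dbpath /data/b --port 37017        (note: no --replSet)
// Then, connected to that stand-alone instance with `mongo --port 37017`:
db.getSiblingDB("mydb").mycoll.reIndex()   // blocks only this stand-alone server
// Afterwards, shut it down and restart with the original --replSet options
// so it rejoins the set and catches up.
```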
| Comment by Pierre Dane [ 07/Dec/11 ] |
|
This happened again. I tried killing the operation with no success, and could not shut down the server either. I had to wait until the index build completed, and so was operating on stale data for a while. |
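For reference, this is roughly what "killing the operation" looks like from the mongo shell; as observed here, an in-progress foreground index build often cannot be killed, so this may have no effect:

```
// List in-progress operations (an index build reports progress in "msg"),
// then attempt to kill by opid. The opid below is illustrative only.
db.currentOp().inprog.forEach(function (op) {
  if (op.msg) {
    printjson({ opid: op.opid, op: op.op, msg: op.msg });
  }
});
// db.killOp(12345)   // illustrative opid; may not stop a running index build
```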
| Comment by Pierre Dane [ 02/Dec/11 ] |
|
Further problems. The problem above was occurring on another instance on that server. All of a sudden, both servers started syncing, including the one that was reindexing. I restarted the server being queried (not the one reindexing) shortly afterwards.

Fri Dec 02 14:41:11 [initandlisten] connection accepted from 127.0.0.1:63062 #25733
background ntoreturn:1 reslen:325 124ms
ntoreturn:1 reslen:325 202ms
ntoreturn:1 reslen:325 202ms
ntoreturn:1 reslen:325 202ms
} cursorid:80337332639598 nreturned:3 reslen:665 6036794ms |
| Comment by Pierre Dane [ 02/Dec/11 ] |
|
Could you please make this ticket private, if possible? Thanks. |