[SERVER-12727] index building can make replica set member unreachable / unresponsive Created: 14/Feb/14 Updated: 10/Dec/14 Resolved: 30/Apr/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | John Greenall | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||||||
| Steps To Reproduce: | We have a 3-member replica set with no arbiters. We built an index on a large collection (~40 GB, 17M docs) with background=True. This seemed to work okay on the primary, but 30 minutes later (when both secondaries were told to build the index) our replica set went down, as the secondaries became entirely unresponsive and were unable to vote. |
| Description |
|
There is already an issue relating to the behaviour of background indexes on secondaries, listed as FIXED for 2.5. Separate from that issue, though, I believe the behaviour of the secondaries whilst building foreground indices is not entirely acceptable. It is fine that the database is locked, but the member shouldn't become entirely unresponsive for the time it takes to build the index. |
| Comments |
| Comment by John Greenall [ 01/May/14 ] |
|
@Thomas I hadn't forgotten about this log but have not yet set up a replica set on 2.6 since my 2.6.0 test server fell over on the first set of unit tests I ran (array updates). Now waiting for release of 2.6.1... Will re-raise this log if the issue persists. |
| Comment by Thomas Rueckstiess [ 30/Apr/14 ] |
|
Hi John, We haven't heard back from you in some time. As we are unable to reproduce the issue without further information, I'll go ahead and resolve the ticket now. If this is still an issue after you have had the chance to upgrade your environment to 2.6 and you'd like to follow up, feel free to re-open the ticket and provide further details. Regards, |
| Comment by Matt Dannenberg [ 17/Apr/14 ] |
|
Hey John, Have you upgraded to 2.6.0? If so, were you able to reproduce the problem? Thanks, |
| Comment by Matt Dannenberg [ 20/Mar/14 ] |
|
Sure, that plan sounds good. |
| Comment by John Greenall [ 20/Mar/14 ] |
|
I don't want to try to recreate this on our live server, and setting up a toy replica set from snapshots of our live data is probably an hour or two's work that I'd rather not spend right now. We do plan to move to 2.6.0, however, as soon as a stable release is available, and I will probably have another replica set up for testing at that point. This would be a natural time for me to try to recreate the problem, particularly if there is a chance the issue has already been fixed by work done on related issues. Can we leave this issue open until then? |
| Comment by Matt Dannenberg [ 20/Mar/14 ] |
|
Hi John, We have been unable to reproduce the reported problem. Is this something you are able to reproduce? If so, can you provide us with a step by step reproduction? If not, I will resolve the ticket as can't reproduce. Thanks, |
| Comment by John Greenall [ 24/Feb/14 ] |
|
don't think there's anything relevant in our mongod.conf:
logappend=true
port = 27017
dbpath=/data
keyFile = /etc/mongo_keyfile
wirewaxReplicaSet:PRIMARY> rs.config()
, , { "_id" : 8, "host" : "mdb-ourserverasia.wirewax.com:27017", "priority" : 0.3 } ] |
| Comment by Daniel Pasette (Inactive) [ 23/Feb/14 ] |
|
Hi John, I've tried to reproduce your scenario a few different ways and can't seem to trigger it. Could you include your config file or startup parameters? |
| Comment by John Greenall [ 18/Feb/14 ] |
|
Here is the relevant chunk of log for one of the secondaries that went down (database host names changed).
Fri Feb 14 10:33:38.733 [repl writer worker 3] info: indexing in foreground on this replica; was a background index build on the primary
Fri Feb 14 10:33:39.080 [initandlisten] connection accepted from 54.199.76.123:34090 #571961 (155 connections now open)
Fri Feb 14 10:33:40.158 [conn571958] end connection 54.248.180.187:41423 (154 connections now open)
Fri Feb 14 10:33:45.846 [conn571959] end connection 54.217.193.118:38161 (154 connections now open)
Fri Feb 14 10:33:48.003 [repl writer worker 3] Index: (1/3) External Sort Progress: 78200/17364881 0%
Fri Feb 14 10:33:54.683 [rsHealthPoll] replset info our-server-name-1.com:27017 thinks that we are down
Fri Feb 14 10:33:59.014 [rsHealthPoll] replset info our-server-name-1.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:05.511 [rsHealthPoll] replset info our-server-name-1:27017 thinks that we are down |
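As a reading aid for the log excerpt above, here is a minimal Python sketch that measures how long after the foreground index build started the member was first reported as down. It is not part of the original report. The helper names (`parse`, `seconds_until_reported_down`) are illustrative, and the year 2014 is an assumption taken from the ticket's creation date, since the log lines themselves carry no year.

```python
import re
from datetime import datetime

# Log lines copied from the comment above (host names were already changed by the reporter).
LOG = """\
Fri Feb 14 10:33:38.733 [repl writer worker 3] info: indexing in foreground on this replica; was a background index build on the primary
Fri Feb 14 10:33:39.080 [initandlisten] connection accepted from 54.199.76.123:34090 #571961 (155 connections now open)
Fri Feb 14 10:33:40.158 [conn571958] end connection 54.248.180.187:41423 (154 connections now open)
Fri Feb 14 10:33:45.846 [conn571959] end connection 54.217.193.118:38161 (154 connections now open)
Fri Feb 14 10:33:48.003 [repl writer worker 3] Index: (1/3) External Sort Progress: 78200/17364881 0%
Fri Feb 14 10:33:54.683 [rsHealthPoll] replset info our-server-name-1.com:27017 thinks that we are down
Fri Feb 14 10:33:59.014 [rsHealthPoll] replset info our-server-name-1.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:05.511 [rsHealthPoll] replset info our-server-name-1:27017 thinks that we are down
"""

# Weekday, then "Mon DD HH:MM:SS.mmm", then "[thread]", then the message.
LINE_RE = re.compile(r"^\w{3} (\w{3} +\d+ \d\d:\d\d:\d\d\.\d{3}) \[([^\]]+)\] (.*)$")


def parse(line, year=2014):
    """Split one log line into (timestamp, thread, message); None if it does not match.

    The year is an assumption: these log lines do not include one.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, thread, msg = m.groups()
    ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
    return ts, thread, msg


def seconds_until_reported_down(log_text):
    """Seconds from the foreground index build notice to the first 'thinks that we are down'."""
    start = None
    for line in log_text.splitlines():
        parsed = parse(line)
        if parsed is None:
            continue
        ts, _thread, msg = parsed
        if start is None and "indexing in foreground" in msg:
            start = ts
        elif start is not None and msg.endswith("thinks that we are down"):
            return (ts - start).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_until_reported_down(LOG))
```

On this excerpt the gap works out to roughly 16 seconds, i.e. the other member reported this secondary as down almost immediately after the foreground build began.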
| Comment by Daniel Pasette (Inactive) [ 18/Feb/14 ] |
|
Hi John |