[SERVER-12727] index building can make replica set member unreachable / unresponsive Created: 14/Feb/14  Updated: 10/Dec/14  Resolved: 30/Apr/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: John Greenall Assignee: Matt Dannenberg
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-12264 replicaset nodes appear to be down du... Closed
related to SERVER-2771 Background index builds on replica se... Closed
Operating System: ALL
Steps To Reproduce:

We have a 3-member replica set with no arbiters. We built an index on a large collection (~40GB, 17M docs) with background=True. This seemed to work fine on the primary, but 30 minutes later, when the secondaries were both told to build the index, our replica set went down: both secondaries became entirely unresponsive and were unable to vote.
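
A minimal mongo shell sketch of the kind of command involved (the metrics.metrics collection and the { uuId: 1 } key are taken from the log excerpt in the comments below; the exact invocation shown here is an assumption, not necessarily the command we ran):

// Hypothetical reconstruction using 2.4-era shell syntax.
// Builds the index in the background on the primary; the secondaries then
// replicate the build (in the foreground on 2.4, per the log below).
db.getSiblingDB("metrics").metrics.ensureIndex({ uuId: 1 }, { background: true })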

Participants:

 Description   

There is already an issue about the behaviour of background index builds on secondaries, listed as fixed for 2.5:
https://jira.mongodb.org/browse/SERVER-2771
It is not entirely clear, however, how that issue was fixed. Are indexes now built in the background on secondaries, as they are on the primary, and/or are they built sequentially rather than simultaneously across all secondaries? It would be good to have clarification on this.

Separate from that issue, I believe the behaviour of the secondaries while building foreground indexes is not acceptable. It is fine that the database is locked, but the member should not become entirely unresponsive for the time it takes to build the index.
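
As an illustration only, a rough sketch (assumed, not our actual tooling) of how member health can be watched from a still-reachable node while the secondaries build the index; members that have stopped responding report health 0, which matches the "thinks that we are down" messages in the log below:

// Run from the primary or any reachable member while the build is in progress.
// Polls rs.status() every 5 seconds and prints each member's state.
for (var i = 0; i < 60; i++) {
    rs.status().members.forEach(function (m) {
        print(m.name + "  state=" + m.stateStr + "  health=" + m.health);
    });
    sleep(5000); // mongo shell sleep(), in milliseconds
}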



 Comments   
Comment by John Greenall [ 01/May/14 ]

@Thomas I hadn't forgotten about this ticket, but I have not yet set up a replica set on 2.6, since my 2.6.0 test server fell over on the first set of unit tests I ran (array updates). Now waiting for the release of 2.6.1... Will re-raise this if the issue persists.

Comment by Thomas Rueckstiess [ 30/Apr/14 ]

Hi John,

We haven't heard back from you in some time. As we are unable to reproduce the issue without further information, I'll go ahead and resolve the ticket now. If this is still an issue after you've had the chance to upgrade your environment to 2.6 and you'd like to follow up, feel free to re-open the ticket and provide further details.

Regards,
Thomas

Comment by Matt Dannenberg [ 17/Apr/14 ]

Hey John,

Have you upgraded to 2.6.0? If so, were you able to reproduce the problem?

Thanks,
Matt

Comment by Matt Dannenberg [ 20/Mar/14 ]

Sure, that plan sounds good.

Comment by John Greenall [ 20/Mar/14 ]

I don't want to try to recreate this on our live server, and setting up a toy replica set from snapshots of our live data is probably an hour or two's work I'd rather not spend right now.

We do plan to move to 2.6.0 as soon as a stable release is available, however, and I will probably have another replica set up for testing at that point. That would be a natural time for me to try to recreate the problem, particularly if there is a chance the issue has already been fixed by work done on related issues. Can we leave this issue open until then?

Comment by Matt Dannenberg [ 20/Mar/14 ]

Hi John,

We have been unable to reproduce the reported problem. Is this something you are able to reproduce? If so, can you provide us with a step-by-step reproduction? If not, I will resolve the ticket as cannot reproduce.

Thanks,
Matt

Comment by John Greenall [ 24/Feb/14 ]

don't think there's anything relevant in our mongod.conf:
***********************************************************************
# where to log
logpath=/log/mongod.log
logappend=true

# fork and run in background
fork = true

port = 27017

dbpath=/data
pidfilepath = /var/run/mongodb/mongod.pid

keyFile = /etc/mongo_keyfile

# Replication Options
replSet=wirewaxReplicaSet
***********************************************************************

wirewaxReplicaSet:PRIMARY> rs.config()
{
    "_id" : "wirewaxReplicaSet",
    "version" : 751647,
    "members" : [
        { "_id" : 4, "host" : "mdb-ourservereuwest.wirewax.com:27017", "priority" : 2 },
        { "_id" : 7, "host" : "mdb-ourserveruseast.wirewax.com:27017", "priority" : 0.5 },
        { "_id" : 8, "host" : "mdb-ourserverasia.wirewax.com:27017", "priority" : 0.3 }
    ]
}

Comment by Daniel Pasette (Inactive) [ 23/Feb/14 ]

Hi John,

I've tried to reproduce your scenario a few different ways and can't seem to trigger it. Could you include your config file or startup parameters?

Comment by John Greenall [ 18/Feb/14 ]

Here is the relevant chunk of log for one of the secondaries that went down (database host names changed).

Fri Feb 14 10:33:38.733 [repl writer worker 3] info: indexing in foreground on this replica; was a background index build on the primary
Fri Feb 14 10:33:38.733 [repl writer worker 3] build index metrics.metrics { uuId: 1 }
Fri Feb 14 10:33:39.080 [initandlisten] connection accepted from 54.199.76.123:34090 #571961 (155 connections now open)
Fri Feb 14 10:33:39.405 [conn571961] authenticate db: admin { authenticate: 1, user: "admin", nonce: "9d0829c4c62ad51d", key: "ac5a65979b9b3742d86934f3fc46423d" }
Fri Feb 14 10:33:40.158 [conn571958] end connection 54.248.180.187:41423 (154 connections now open)
Fri Feb 14 10:33:40.322 [initandlisten] connection accepted from 54.248.180.187:41425 #571962 (155 connections now open)
Fri Feb 14 10:33:40.481 [conn571962] authenticate db: local { authenticate: 1, nonce: "b0abb7102f1fbf3a", user: "__system", key: "9f270b557ea7fcb6acbbf32142e80e56" }
Fri Feb 14 10:33:45.846 [conn571959] end connection 54.217.193.118:38161 (154 connections now open)
Fri Feb 14 10:33:45.949 [initandlisten] connection accepted from 54.217.193.118:38163 #571963 (155 connections now open)
Fri Feb 14 10:33:46.045 [conn571963] authenticate db: local { authenticate: 1, nonce: "a7bef18306f052c9", user: "__system", key: "5c2d8ed1605b0df34b39a9d9cb8bca2b" }
Fri Feb 14 10:33:48.003 [repl writer worker 3] Index: (1/3) External Sort Progress: 78200/17364881 0%
Fri Feb 14 10:33:52.517 [rsHealthPoll] replset info our-server-name-1.com:27017 thinks that we are down
Fri Feb 14 10:33:52.650 [initandlisten] connection accepted from 54.248.180.187:41426 #571964 (156 connections now open)
Fri Feb 14 10:33:52.816 [conn571964] authenticate db: local { authenticate: 1, nonce: "89d74ef60d4b651c", user: "__system", key: "df103d6b4543b6326c169b25e0d254ef" }
Fri Feb 14 10:33:54.683 [rsHealthPoll] replset info our-server-name-1.com:27017 thinks that we are down
Fri Feb 14 10:33:56.848 [rsHealthPoll] replset info our-server-name-1.com:27017 thinks that we are down
Fri Feb 14 10:33:57.993 [rsHealthPoll] replset info our-server-name-2.com:27017 thinks that we are down
Fri Feb 14 10:33:58.012 [repl writer worker 3] Index: (1/3) External Sort Progress: 161400/17364881 0%
Fri Feb 14 10:33:58.179 [initandlisten] connection accepted from 54.217.193.118:38167 #571965 (157 connections now open)
Fri Feb 14 10:33:58.276 [conn571965] authenticate db: local { authenticate: 1, nonce: "b10e50a8333b1c4c", user: "__system", key: "2ab8ade423011ccee56686decc6d4f84" }
Fri Feb 14 10:33:59.014 [rsHealthPoll] replset info our-server-name-1.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:00.091 [rsHealthPoll] replset info our-server-name-2.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:01.179 [rsHealthPoll] replset info our-server-name-1.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:02.191 [rsHealthPoll] replset info our-server-name-2.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:03.345 [rsHealthPoll] replset info our-server-name-1.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:04.290 [rsHealthPoll] replset info our-server-name-2.wirewax.com:27017 thinks that we are down
Fri Feb 14 10:34:04.979 [initandlisten] connection accepted from 54.248.180.187:41428 #571966 (158 connections now open)
Fri Feb 14 10:34:05.142 [conn571966] authenticate db: local { authenticate: 1, nonce: "2d0879fd1e2cb532", user: "__system", key: "543feb73a4ba86cd140dbe5b55787e45" }
Fri Feb 14 10:34:05.511 [rsHealthPoll] replset info our-server-name-1:27017 thinks that we are down
Fri Feb 14 10:34:06.389 [rsHealthPoll] replset info our-server-name-2:27017 thinks that we are down
Fri Feb 14 10:34:06.600 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: our-server-name-2:27017
Fri Feb 14 10:34:06.767 [conn570843] end connection 54.247.158.181:36952 (157 connections now open)
Fri Feb 14 10:34:06.920 [conn570861] end connection 54.217.112.132:38332 (156 connections now open)
Fri Feb 14 10:34:07.029 [conn571842] end connection 54.217.112.132:60824 (155 connections now open)
Fri Feb 14 10:34:07.404 [conn570846] end connection 54.247.158.181:36959 (154 connections now open)
Fri Feb 14 10:34:07.621 [conn571636] end connection 54.217.112.132:47578 (153 connections now open)
Fri Feb 14 10:34:07.756 [conn571817] end connection 54.247.158.181:41304 (152 connections now open)
Fri Feb 14 10:34:08.012 [repl writer worker 3] Index: (1/3) External Sort Progress: 252000/17364881 1%
Fri Feb 14 10:34:08.144 [conn571516] end connection 54.247.158.181:42112 (151 connections now open)
Fri Feb 14 10:34:08.424 [conn571014] end connection 54.217.112.132:47701 (150 connections now open)
Fri Feb 14 10:34:08.488 [rsHealthPoll] replset info our-server-name-1.com:27017 thinks that we are down
Fri Feb 14 10:34:08.488 [rsHealthPoll] replSet member our-server-name-2.com:27017 is now in state SECONDARY

Comment by Daniel Pasette (Inactive) [ 18/Feb/14 ]

Hi John,
The replica set shouldn't normally go down as a result of a foreground index build. Do you still have log files from the time of the incident?
