[SERVER-25842] Secondary started accepting queries before syncing (replSet still syncing, not yet to minValid optime) Created: 29/Aug/16  Updated: 30/Aug/16  Resolved: 30/Aug/16

Status: Closed
Project: Core Server
Component/s: Querying, Replication
Affects Version/s: 2.6.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Manoj Vivek Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-4936 Server support for "maxStalenessMS" r... Closed
Operating System: ALL
Steps To Reproduce:
  1. Add a replica that is out of sync with the primary (2 days behind, in this case)
  2. Observe the logs of the new replica for:

    replSet still syncing, not yet to minValid optime 57bb0c9d:f

  3. Also check rs.status() on the replica set, which will show the state of the new member as "SECONDARY" (which I guess should be "STARTUP2" or "RECOVERING")
  4. Any application using this replica set will now start sending queries to this member
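Step 3's check can be scripted in the mongo shell; this sketch prints each member's reported state and last applied optime so the lag is visible at a glance:

```javascript
// Run from any member; lists state and replication position per member.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optime: " + m.optimeDate);
});
```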
Participants:

 Description   

I brought up a new replica from an EBS snapshot (2 days old) of an existing secondary and then added it to the replica set using rs.add(). But once added, it started serving queries immediately, even before fully syncing with the primary.

rs.status() also shows the status of this member as "SECONDARY".

Logs:

2016-08-23T05:32:42.470+0000 [rsBackgroundSync] replSet syncing to: 10.0.1.211:27017
2016-08-23T05:32:42.470+0000 [rsSync] replSet still syncing, not yet to minValid optime 57bb0c9d:f
2016-08-23T05:32:42.471+0000 [rsBackgroundSync] replset setting syncSourceFeedback to 10.0.1.211:27017
2016-08-23T05:32:43.635+0000 [initandlisten] connection accepted from 127.0.0.1:43066 #104 (104 connections now open)
2016-08-23T05:32:43.636+0000 [conn104]  authenticate db: admin { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-08-23T05:32:51.112+0000 [rsSync] replSet SECONDARY
2016-08-23T05:32:51.182+0000 [initandlisten] connection accepted from 10.0.0.31:39528 #105 (105 connections now open)
2016-08-23T05:32:51.188+0000 [conn105]  authenticate db: <db> { authenticate: 1, user: <user>, nonce: "xxx", key: "xxx" }
2016-08-23T05:32:51.190+0000 [initandlisten] connection accepted from 10.0.0.31:39530 #106 (106 connections now open)
2016-08-23T05:32:51.190+0000 [conn106]  authenticate db: <db> { authenticate: 1, user: <user>, nonce: "xxx", key: "xxx" }
2016-08-23T05:32:51.357+0000 [initandlisten] connection accepted from 10.0.0.31:39542 #107 (107 connections now open)
2016-08-23T05:32:51.357+0000 [initandlisten] connection accepted from 10.0.0.31:39544 #108 (108 connections now open)
2016-08-23T05:32:51.357+0000 [conn107]  authenticate db: <db> { authenticate: 1, user: <user>, nonce: "xxx", key: "xxx" }
2016-08-23T05:32:51.357+0000 [conn108]  authenticate db: <db> { authenticate: 1, user: <user>, nonce: "xxx", key: "xxx" }
2016-08-23T05:32:51.431+0000 [conn105] query <db.coll> query: { <query> } planSummary: IDHACK ntoskip:0 nscanned:1 nscannedObjects:1 idhack:1 keyUpdates:0 numYields:1 locks(micros) r:336 nreturned:1 reslen:2081 131ms
2016-08-23T05:32:51.455+0000 [initandlisten] connection accepted from 10.0.0.31:39558 #109 (109 connections now open)
2016-08-23T05:32:51.456+0000 [conn109]  authenticate db: <db> { authenticate: 1, user: <user>, nonce: "xxx", key: "xxx" }
2016-08-23T05:32:51.463+0000 [conn107] query <db.coll> query: { <query> } planSummary: IXSCAN { <index> } ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:1 locks(micros) r:206862 nreturned:1 reslen:367 104ms
2016-08-23T05:32:51.534+0000 [initandlisten] connection accepted from 10.0.0.41:46923 #110 (110 connections now open)

As a workaround, I removed the member and re-added it as a hidden secondary until it got synced.
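That workaround can be sketched in the mongo shell as follows; the member _id (7) and host are taken from the logs below, and the member index in cfg.members is an assumption to verify before running:

```javascript
// Run against the PRIMARY. Add the new member as hidden with priority 0
// so drivers never route reads to it while it catches up.
rs.add({ _id: 7, host: "10.0.1.215:27017", priority: 0, hidden: true })

// Once rs.printSlaveReplicationInfo() shows the member has caught up,
// reconfigure it back to a normal, readable secondary.
cfg = rs.conf()
cfg.members[3].priority = 1   // index of the new member -- verify first
cfg.members[3].hidden = false
rs.reconfig(cfg)
```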



 Comments   
Comment by Daniel Pasette (Inactive) [ 30/Aug/16 ]

Manoj, I think what you're seeing is "Worked as Designed." A read preference that allows clients to read from secondary nodes does not check how far behind the primary a secondary is. In the next version of MongoDB (v3.4), users will be able to set a read preference specifying the maximum staleness a client will accept before falling back to querying the primary. You can see the details here: SERVER-4936.
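For illustration, the option from SERVER-4936 ultimately shipped in MongoDB 3.4 as maxStalenessSeconds, set alongside the read preference in the driver connection string. The host and values below are placeholders, not the syntax available at the time of this comment:

```
mongodb://app.example.net:27017/?replicaSet=rs0&readPreference=secondaryPreferred&maxStalenessSeconds=120
```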

In the meantime, adding the node as hidden while it catches up, or changing the read preference to primary only, is the best workaround for this situation.

Thanks
Dan

Comment by Manoj Vivek [ 30/Aug/16 ]

ramon.fernandez I have raised this in the mongodb-user group here https://groups.google.com/forum/#!topic/mongodb-user/jC8_t6jeojU

To reply to your comment: even after the state changed to SECONDARY, rs.status() showed that this new member was more than 10 hours behind the primary and catching up (while serving queries).

Anyway, this is something I witnessed, but I didn't gather enough data at the time to back it up now. Reproducing it involves quite a bit of time and effort; I will get back if I get a chance to try this again.

Thanks for your time.

Comment by Ramon Fernandez Marina [ 30/Aug/16 ]

vivek_jonam, I don't see any queries on the secondary before it transitions to SECONDARY. Note that the node can accept connections, but it will not serve any data.

One thing you can try is to configure your secondary to log all operations. If you see any queries, please send the logs; but for now I'm closing this ticket, as I don't see evidence of a bug.
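One way to capture all operations on the secondary, sketched in the mongo shell (the database name is a placeholder, matching the <db> convention used in the logs above); note that profiling is per-database and level 2 records every operation into the system.profile collection:

```javascript
// On the secondary; reads require slaveOk in 2.6.
rs.slaveOk()

use <db>                  // placeholder database name
db.setProfilingLevel(2)   // level 2: profile all operations

// Later, inspect captured reads with their timestamps:
db.system.profile.find({ op: "query" }).sort({ ts: -1 }).limit(10)
```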

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by Manoj Vivek [ 30/Aug/16 ]

ramon.fernandez Please ignore the logs in the description and consider the following logs instead. Apologies for the different time zones in the primary and new secondary logs.
Logs in Primary:

2016-08-22T22:28:24.280-0700 [conn5067651] replSet info saving a newer config version to local.system.replset: { _id: "<name>", version: 211493, members: [ { _id: 4, host: "10.0.1.211:27017" }, { _id: 5, host: "10.0.1.212:27017" }, { _id: 6, host: "10.0.0.141:27017", priority: 0.0, hidden: true }, { _id: 7, host: "10.0.1.215:27017" } ] }
2016-08-22T22:28:24.281-0700 [conn5067651] replSet saveConfigLocally done
2016-08-22T22:28:24.281-0700 [conn5067651] replSet info : additive change to configuration
2016-08-22T22:28:24.281-0700 [conn5067651] replSet replSetReconfig new config saved locally
2016-08-22T22:28:24.282-0700 [rsHealthPoll] replSet member 10.0.1.215:27017 is up
2016-08-22T22:28:24.282-0700 [rsMgr] replSet total number of votes is even - add arbiter or give one member an extra vote
2016-08-22T22:28:38.690-0700 [conn5067753] end connection 10.0.1.212:34231 (411 connections now open)
2016-08-22T22:28:38.690-0700 [initandlisten] connection accepted from 10.0.1.212:34238 #5067757 (413 connections now open)
2016-08-22T22:28:38.691-0700 [conn5067757]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2016-08-22T22:28:41.143-0700 [initandlisten] connection accepted from 10.0.1.215:55926 #5067758 (413 connections now open)
2016-08-22T22:28:41.144-0700 [conn5067758]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2016-08-22T22:28:42.829-0700 [conn5067755] end connection 10.0.0.141:56912 (412 connections now open)
2016-08-22T22:28:42.829-0700 [initandlisten] connection accepted from 10.0.0.141:56919 #5067759 (413 connections now open)
2016-08-22T22:28:42.830-0700 [conn5067759]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2016-08-22T22:29:00.052-0700 [conn3475109]  authenticate db: admin { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-08-22T22:29:00.934-0700 [conn3475109]  authenticate db: admin { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-08-22T22:29:02.290-0700 [rsHealthPoll] replSet member 10.0.1.215:27017 is now in state SECONDARY

Logs in new secondary:

2016-08-23T05:28:41.063+0000 [rsStart] trying to contact 10.0.0.141:27017
2016-08-23T05:28:41.065+0000 [rsStart] trying to contact 10.0.1.211:27017
2016-08-23T05:28:41.066+0000 [rsStart] trying to contact 10.0.1.212:27017
2016-08-23T05:28:41.068+0000 [rsStart] replSet I am 10.0.1.215:27017
2016-08-23T05:28:41.068+0000 [rsStart] replSet got config version 211493 from a remote, saving locally
2016-08-23T05:28:41.068+0000 [rsStart] replSet info saving a newer config version to local.system.replset: { _id: "<name>", version: 211493, members: [ { _id: 4, host: "10.0.1.211:27017" }, { _id: 5, host: "10.0.1.212:27017" }, { _id: 6, host: "10.0.0.141:27017", priority: 0.0, hidden: true }, { _id: 7, host: "10.0.1.215:27017" } ] }
2016-08-23T05:28:41.068+0000 [rsStart] replSet saveConfigLocally done
2016-08-23T05:28:41.068+0000 [rsStart] replSet info Couldn't load config yet. Sleeping 20sec and will try again.
2016-08-23T05:28:54.209+0000 [conn6] end connection 10.0.1.211:53294 (99 connections now open)
2016-08-23T05:28:54.209+0000 [initandlisten] connection accepted from 10.0.1.211:53298 #109 (100 connections now open)
2016-08-23T05:28:54.210+0000 [conn109]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2016-08-23T05:28:54.625+0000 [conn9] end connection 10.0.1.212:60296 (99 connections now open)
2016-08-23T05:28:54.625+0000 [initandlisten] connection accepted from 10.0.1.212:60299 #110 (100 connections now open)
2016-08-23T05:28:54.626+0000 [conn110]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2016-08-23T05:28:54.770+0000 [conn16] end connection 10.0.0.141:35555 (99 connections now open)
2016-08-23T05:28:54.771+0000 [initandlisten] connection accepted from 10.0.0.141:35564 #111 (100 connections now open)
2016-08-23T05:28:54.771+0000 [conn111]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2016-08-23T05:29:01.069+0000 [rsStart] trying to contact 10.0.0.141:27017
2016-08-23T05:29:01.069+0000 [rsStart] trying to contact 10.0.1.211:27017
2016-08-23T05:29:01.070+0000 [rsStart] trying to contact 10.0.1.212:27017
2016-08-23T05:29:01.071+0000 [rsStart] replSet I am 10.0.1.215:27017
2016-08-23T05:29:01.071+0000 [rsHealthPoll] replSet member 10.0.1.212:27017 is up
2016-08-23T05:29:01.071+0000 [rsHealthPoll] replSet member 10.0.1.211:27017 is up
2016-08-23T05:29:01.071+0000 [rsHealthPoll] replSet member 10.0.1.212:27017 is now in state SECONDARY
2016-08-23T05:29:01.071+0000 [rsHealthPoll] replSet member 10.0.1.211:27017 is now in state PRIMARY
2016-08-23T05:29:01.210+0000 [rsStart] replSet STARTUP2
2016-08-23T05:29:01.210+0000 [rsMgr] replSet total number of votes is even - add arbiter or give one member an extra vote
2016-08-23T05:29:01.212+0000 [rsSync] replSet SECONDARY
2016-08-23T05:29:03.071+0000 [rsHealthPoll] replSet member 10.0.0.141:27017 is up
2016-08-23T05:29:03.071+0000 [rsHealthPoll] replSet member 10.0.0.141:27017 is now in state SECONDARY

Let me know if you need anything else.

Comment by Ramon Fernandez Marina [ 29/Aug/16 ]

vivek_jonam, if I'm not mistaken, what the log snippet above shows is that the replica set is still syncing at 5:32:42, but it finishes syncing at 5:32:51 and then transitions to SECONDARY, at which point it is ready to accept queries. If you upload the full logs from this secondary from startup, as well as the corresponding logs from the primary for the same time period, we may be able to see more.

Generated at Thu Feb 08 04:10:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.