[SERVER-31581] mongodb config server can not stepdown Created: 16/Oct/17  Updated: 27/Oct/23  Resolved: 20/Oct/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.9
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: SuperSuJJ [X] Assignee: Mark Agarunov
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File config.tar     File configparmarylog.gz     File configsecondary1log.gz     File configsecondary2log.gz    
Issue Links:
Duplicate
is duplicated by SERVER-31577 mongodb upgrade to 3.4 from 3.2,and c... Closed
Operating System: ALL
Participants:

 Description   

After upgrading the config servers to a replica set, the primary (parmary) cannot step down; rs.stepDown() returns:

rs.stepDown()
{
        "ok" : 0,
        "errmsg" : "No electable secondaries caught up as of 2017-10-16T13:44:02.216+0800. Please use {force: true} to force node to step down.",
        "code" : 50,
        "codeName" : "ExceededTimeLimit"

I use keyfile authentication for the database. The secondary's log shows:

product_operation_listing_management.sku_images command: count { count: "sku_images", query: { skuId: { $in: [ "a11120200ux0542", "a12092700ux0316", "a12110600ux0434", "a14081900ux0677", "a15051500ux0050", "a15051500ux0071", "a15052900ux0414", "a15061700ux0806", "a15102900ux0336", "a15110300ux0055", "a15112400ux0300", "a15113000ux2072", "a15121900ux1795", "a16111900ux0056", "a16111900ux0071", "a16111900ux0072", "a16111900ux0074", "a16112100ux0044", "a16112100ux0050", "a16112100ux0053", "a16112100ux0059", "a16112100ux0070", "a16112100ux0191", "a16112100ux0220", "a16112100ux0263", "a16112100ux1217", "a16112100ux1218", "a16112100ux1223", "a16112100ux1232", "a16112300ux0171", "a13062600ux0709", "a14120200ux0042" ] } } } planSummary: IXSCAN { skuId: "hashed" } keysExamined:26 docsExamined:11 numYields:2 reslen:124 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 154ms

The replica set log shows:

2017-10-16T18:59:46.385+0800 I NETWORK  [conn137] received client metadata from 192.168.169.60:36295 conn137: { driver: { name: "NetworkInterfaceASIO-ShardRegistry", version: "3.4.9" }, os: { type: "Linux", name: "CentOS release 6.6 (Final)", architecture: "x86_64", version: "Kernel 2.6.32-504.16.2.el6.x86_64" } }
2017-10-16T18:59:46.388+0800 I ACCESS   [conn137] Successfully authenticated as principal __system on local
2017-10-16T19:00:13.677+0800 I ACCESS   [conn37] SCRAM-SHA-1 authentication failed for admin on admin from client 192.168.169.62:24256 ; AuthenticationFailed: SCRAM-SHA-1 authentication failed, storedKey mismatch
2017-10-16T19:02:13.787+0800 I REPL     [replication-5] Restarting oplog query due to error: InterruptedDueToReplStateChange: operation was interrupted. Last fetched optime (with hash): { ts: Timestamp 1508151723000|2, t: 6 }[-7454072674456839620]. Restarts remaining: 3
2017-10-16T19:02:13.788+0800 I REPL     [replication-5] Scheduled new oplog query Fetcher source: 192.168.169.60:22004 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1508151723000|2 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 6 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 142029 -- target:192.168.169.60:22004 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1508151723000|2 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 6 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2017-10-16T19:02:13.797+0800 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source's last applied OpTime { ts: Timestamp 1508151723000|2, t: 6 } is not greater than our last fetched OpTime { ts: Timestamp 1508151723000|2, t: 6 }. Choosing new sync source.
2017-10-16T19:02:13.798+0800 I REPL     [rsBackgroundSync] could not find member to sync from
2017-10-16T19:02:18.753+0800 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to 192.168.169.60:22004: InvalidSyncSource: Sync source was cleared. Was 192.168.169.60:22004
2017-10-16T19:02:23.800+0800 I REPL     [rsBackgroundSync] sync source candidate: 192.168.169.60:22004
2017-10-16T19:02:23.803+0800 I COMMAND  [conn111] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508151742000|13, t: 6 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:405 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 1486ms
2017-10-16T19:02:23.803+0800 I COMMAND  [conn81] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508151742000|13, t: 6 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:396 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 1486ms
2017-10-16T19:02:24.833+0800 I NETWORK  [thread2] connection accepted from 192.168.169.61:63533 #138 (23 connections now open)
2017-10-16T19:02:24.834+0800 I NETWORK  [conn138] received client metadata from 192.168.169.61:63533 conn138: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.9" }, os: { type: "Linux", name: "CentOS release 6.6 (Final)", architecture: "x86_64", version: "Kernel 2.6.32-504.16.2.el6.x86_64" } }
2017-10-16T19:02:24.838+0800 I ACCESS   [conn138] Successfully authenticated as principal __system on local
2017-10-16T19:02:24.840+0800 I NETWORK  [thread2] connection accepted from 192.168.169.61:63534 #139 (24 connections now open)
2017-10-16T19:02:24.840+0800 I NETWORK  [conn139] received client metadata from 192.168.169.61:63534 conn139: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.9" }, os: { type: "Linux", name: "CentOS release 6.6 (Final)", architecture: "x86_64", version: "Kernel 2.6.32-504.16.2.el6.x86_64" } }
2017-10-16T19:02:24.843+0800 I ACCESS   [conn139] Successfully authenticated as principal __system on local



 Comments   
Comment by Mark Agarunov [ 20/Oct/17 ]

Hello Su,

Thank you for the response. I'm glad to hear that the config server not stepping down is no longer an issue. My recommendation regarding the auth error would be to remove and then reconfigure the authentication and keyfiles. This should resolve any discrepancies in the auth configuration.
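
A minimal sketch of what resetting the admin user's credentials might look like from the mongo shell, assuming the "storedKey mismatch" comes from a stale "admin" user document rather than the keyfile; the user name, password, and role below are placeholders:

// Connect to the primary and reset the stored credentials for the admin user:
var adminDB = db.getSiblingDB("admin")
adminDB.changeUserPassword("admin", "newStrongPassword")
// Or drop and recreate the user entirely:
adminDB.dropUser("admin")
adminDB.createUser({ user: "admin", pwd: "newStrongPassword", roles: [ { role: "root", db: "admin" } ] })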

Thanks,
Mark

Comment by SuperSuJJ [X] [ 19/Oct/17 ]

I think I found out why it could not step down: I made a mistake when upgrading the config servers from mirrored config servers to a replica set.

But the auth error still occurs, even though the sharded cluster is working normally. Is there a way to find out why this auth failure happens?
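
One way to get more detail on the failing authentication attempts is to raise the log verbosity for the access control component on the affected node; a minimal sketch, where level 2 is just an example and the default should be restored afterwards:

// Increase accessControl logging to see which user and mechanism each attempt uses:
db.setLogLevel(2, "accessControl")
// ...reproduce the failure, inspect mongod.log, then restore the default:
db.setLogLevel(0, "accessControl")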

Comment by SuperSuJJ [X] [ 18/Oct/17 ]

And the config server can now step down normally.

Comment by SuperSuJJ [X] [ 18/Oct/17 ]

I can see that sharding is healthy and the config server is healthy too:

> rs.status()
{
        "set" : "confReplSet",
        "date" : ISODate("2017-10-18T01:35:20.265Z"),
        "myState" : 2,
        "term" : NumberLong(7),
        "syncingTo" : "secondary2:22004",
        "configsvr" : true,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "optimes" : {
                "lastCommittedOpTime" : {
                        "ts" : Timestamp(1508290515, 1),
                        "t" : NumberLong(7)
                },
                "readConcernMajorityOpTime" : {
                        "ts" : Timestamp(1508290515, 1),
                        "t" : NumberLong(7)
                },
                "appliedOpTime" : {
                        "ts" : Timestamp(1508290515, 1),
                        "t" : NumberLong(7)
                },
                "durableOpTime" : {
                        "ts" : Timestamp(1508290515, 1),
                        "t" : NumberLong(7)
                }
        },
        "members" : [
                {
                        "_id" : 0,
                        "name" : "parmary:22004",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 74903,
                        "optime" : {
                                "ts" : Timestamp(1508290515, 1),
                                "t" : NumberLong(7)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(1508290515, 1),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2017-10-18T01:35:15Z"),
                        "optimeDurableDate" : ISODate("2017-10-18T01:35:15Z"),
                        "lastHeartbeat" : ISODate("2017-10-18T01:35:18.927Z"),
                        "lastHeartbeatRecv" : ISODate("2017-10-18T01:35:19.882Z"),
                        "pingMs" : NumberLong(0),
                        "electionTime" : Timestamp(1508215712, 1),
                        "electionDate" : ISODate("2017-10-17T04:48:32Z"),
                        "configVersion" : 13
                },
                {
                        "_id" : 1,
                        "name" : "secondary1:22004",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 209368,
                        "optime" : {
                                "ts" : Timestamp(1508290515, 1),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2017-10-18T01:35:15Z"),
                        "syncingTo" : "parmary:22004",
                        "configVersion" : 13,
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "secondary2:22004",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 74903,
                        "optime" : {
                                "ts" : Timestamp(1508290515, 1),
                                "t" : NumberLong(7)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(1508290515, 1),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2017-10-18T01:35:15Z"),
                        "optimeDurableDate" : ISODate("2017-10-18T01:35:15Z"),
                        "lastHeartbeat" : ISODate("2017-10-18T01:35:19.024Z"),
                        "lastHeartbeatRecv" : ISODate("2017-10-18T01:35:19.648Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "secondary1:22004",
                        "configVersion" : 13
                }
        ],
        "ok" : 1
}

But the log still has errors like this:
2017-10-18T09:39:08.767+0800 I ACCESS [conn593] SCRAM-SHA-1 authentication failed for admin on admin from client parmary:55026 ; AuthenticationFailed: SCRAM-SHA-1 authentication failed, storedKey mismatch

Comment by Mark Agarunov [ 17/Oct/17 ]

Hello Su,

The primary could not step down because it did not see any up-to-date secondaries that could become the primary. If this has solved your issue, the auth failure message is likely unrelated. Could you confirm that everything is now working as expected?
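
To check whether any secondary is close enough to the primary to be electable, the member optimes reported by rs.status() can be compared; a minimal sketch:

// Print each member's state and last applied optime (mongo shell):
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + tojson(m.optime));
});
// rs.printSlaveReplicationInfo() also summarizes how far each secondary lags behind.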

Thanks,
Mark

Comment by SuperSuJJ [X] [ 17/Oct/17 ]

I found that the log still has auth failure records.

Comment by SuperSuJJ [X] [ 17/Oct/17 ]

I just removed the secondary members and reconfigured the replica set:
rs.remove("secondary1:port")
rs.remove("secondary2:port")

config = {
    _id: "confReplSet",
    members: [
        { _id: 1, host: 'parmary:port' },
        { _id: 0, host: 'secondary1:port' },
        { _id: 2, host: 'secondary2:port' }
    ],
    configsvr: true,
    version: 1
}

rs.reconfig(config)

and that solved it. Is it OK now?
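
To confirm the reconfiguration was applied, the active configuration and member states can be checked from the mongo shell; a minimal sketch:

// The config version should have increased and all three members should be listed:
rs.conf().version
rs.conf().members.map(function (m) { return m._id + " " + m.host; })
// All members should report PRIMARY/SECONDARY with health 1:
rs.status().members.map(function (m) { return m.name + " " + m.stateStr; })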

Comment by SuperSuJJ [X] [ 17/Oct/17 ]

I chowned the keyfile and restarted the config server mongod process, but it still has the same problem.

Comment by SuperSuJJ [X] [ 17/Oct/17 ]

Hi Mark,
Thanks for the help. I uploaded all the config server log files and the configuration file; the configuration file is the same on all servers.

And I found a problem: I run the mongod and mongos processes as the mongodb user, but the keyfile's group owner is mongod:
rw------ 1 mongodb mongod 813 Jul 19 18:11 keyfile

Could this be why it cannot step down? But the other shard on the same machine uses the same keyfile and it can step down. Do the config server and the shard replica sets handle keyfile auth differently?
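
One way to verify that every node (config servers, shard members, and mongos) is started with the same keyfile path and security settings is to ask each running process for its parsed options; a minimal sketch (the keyfile contents must also be identical on every node, not just the path):

// Run against each mongod and mongos and compare the results:
db.adminCommand({ getCmdLineOpts: 1 }).parsed.security   // parsed security options (keyFile, authorization)
db.adminCommand({ getCmdLineOpts: 1 }).argv              // raw command line, in case options were passed there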

Comment by Mark Agarunov [ 16/Oct/17 ]

Hello Su,

Thank you for the report. To get a better idea of what may be causing this behavior, could you please provide the following:

  • The complete log files from all mongod nodes while this issue is present
  • The complete configuration file from all mongod nodes

This should give some insight into the issue.

Thanks,
Mark
