[SERVER-66471] 5.0 mongos write hangs on PSA shard after second is shutdown Created: 16/May/22  Updated: 07/Nov/22  Resolved: 07/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: jing xu Assignee: Ali Mir
Resolution: Won't Fix Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2022-05-16-14-18-00-561.png     PNG File screenshot-1.png    
Operating System: ALL
Sprint: Repl 2022-11-14
Participants:

 Description   

Hello,

I read in the docs that the MongoDB 5.0 implicit default writeConcern for a PSA replica set is w:1, and I confirmed that writes succeed on a PSA replica set after the secondary is shut down. However, the same write hangs when issued through mongos against a sharded cluster. So the default write concern on a mongos cluster is apparently not w:1, while on the PSA replica set it is w:1. Is this a bug in mongos?

if [ (#arbiters > 0) AND (#non-arbiters <= majority(#voting-nodes)) ]
    defaultWriteConcern = { w: 1 }
else
    defaultWriteConcern = { w: "majority" }
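The rule above can be sketched in Python (a paraphrase of the documented implicit-default logic, not the server's actual code):

```python
def implicit_default_write_concern(num_arbiters, num_non_arbiters):
    """Sketch of the implicit default write concern rule (5.0+).

    A PSA set (1 arbiter, 2 data-bearing voters out of 3 voting nodes)
    falls into the w:1 branch, because 2 <= majority(3) == 2.
    """
    num_voting = num_arbiters + num_non_arbiters
    majority = num_voting // 2 + 1
    if num_arbiters > 0 and num_non_arbiters <= majority:
        return {"w": 1}
    return {"w": "majority"}

# PSA: primary + secondary + arbiter
print(implicit_default_write_concern(1, 2))   # {'w': 1}
# PSS: three data-bearing members
print(implicit_default_write_concern(0, 3))   # {'w': 'majority'}
```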

To reproduce on a PSA replica set:

shard2:PRIMARY> version()
5.0.2
shard2:PRIMARY> status.members.length
shard2:PRIMARY> status.members[0].stateStr
PRIMARY
shard2:PRIMARY> status.members[1].stateStr
(not reachable/healthy)
shard2:PRIMARY> status.members[2].stateStr
ARBITER
shard2:PRIMARY> db.xiaoxu.insert({testWriteConcer:1})
WriteResult({ "nInserted" : 1 })

To reproduce using mongos on a single PSA shard (where it does not work):
The cluster has one mongos, a single PSA shard, and a one-member config server replica set. When the PSA shard's secondary is shut down, the write hangs.

mongos> db.testWriteConcern.insert({_id:10,name:"xiaoxu"}, {w:1})
WriteResult({ "nInserted" : 1 })

With no explicit write concern (and no wtimeout), the insert never returns:

mongos> db.testWriteConcern.insert({_id:3,name:"xiaoxu"})
.....
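The write hanging through mongos but succeeding against the replica set directly is consistent with 5.0 behavior: a replica set derives its implicit default write concern from its own topology, while mongos, with no cluster-wide write concern (CWWC) set, falls back to w:"majority". A hypothetical Python sketch of that decision (function name and logic are illustrative, not server code):

```python
def effective_default_wc(is_mongos, cwwc, num_arbiters, num_non_arbiters):
    """Illustrative sketch of how the effective default write concern
    differs between mongos and a direct replica-set connection on 5.0.

    cwwc: cluster-wide write concern set via setDefaultRWConcern, or None.
    """
    if cwwc is not None:
        return cwwc  # an explicit CWWC wins everywhere
    if is_mongos:
        # mongos does not apply a shard's topology-derived default;
        # without a CWWC it uses w:"majority"
        return {"w": "majority"}
    # replica set: implicit default from its own topology
    voting = num_arbiters + num_non_arbiters
    majority = voting // 2 + 1
    if num_arbiters > 0 and num_non_arbiters <= majority:
        return {"w": 1}
    return {"w": "majority"}

# Direct connection to the PSA shard: w:1, so the insert succeeds.
print(effective_default_wc(False, None, 1, 2))
# Through mongos with no CWWC: w:"majority", which hangs while the secondary is down.
print(effective_default_wc(True, None, 1, 2))
# After db.adminCommand({setDefaultRWConcern: 1, defaultWriteConcern: {w: 1}}):
print(effective_default_wc(True, {"w": 1}, 1, 2))
```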

anon@127.0.0.1:31002:PRIMARY:[db]test> rs.status();
{
"set" : "shard2",
"date" : ISODate("2022-05-16T06:09:18.876Z"),
"myState" : 1,
"term" : NumberLong(5),
"syncSourceHost" : "",
"syncSourceId" : -1,
"heartbeatIntervalMillis" : NumberLong(2000),
"majorityVoteCount" : 2,
"writeMajorityCount" : 2,
"votingMembersCount" : 3,
"writableVotingMembersCount" : 2,
"optimes" : {
"lastCommittedOpTime" :

{ "ts" : Timestamp(1652671774, 1), "t" : NumberLong(5) }

,
"lastCommittedWallTime" : ISODate("2022-05-16T03:29:34.156Z"),
"readConcernMajorityOpTime" :

{ "ts" : Timestamp(1652671774, 1), "t" : NumberLong(5) }

,
"appliedOpTime" :

{ "ts" : Timestamp(1652681354, 1), "t" : NumberLong(5) }

,
"durableOpTime" :

{ "ts" : Timestamp(1652681354, 1), "t" : NumberLong(5) }

,
"lastAppliedWallTime" : ISODate("2022-05-16T06:09:14.351Z"),
"lastDurableWallTime" : ISODate("2022-05-16T06:09:14.351Z")
},
"lastStableRecoveryTimestamp" : Timestamp(1652671774, 1),
"electionCandidateMetrics" : {
"lastElectionReason" : "electionTimeout",
"lastElectionDate" : ISODate("2022-05-16T03:17:24.137Z"),
"electionTerm" : NumberLong(5),
"lastCommittedOpTimeAtElection" :

{ "ts" : Timestamp(1652671027, 1), "t" : NumberLong(3) }

,
"lastSeenOpTimeAtElection" :

{ "ts" : Timestamp(1652671027, 1), "t" : NumberLong(3) }

,
"numVotesNeeded" : 2,
"priorityAtElection" : 1,
"electionTimeoutMillis" : NumberLong(10000),
"numCatchUpOps" : NumberLong(0),
"newTermStartDate" : ISODate("2022-05-16T03:17:24.143Z"),
"wMajorityWriteAvailabilityDate" : ISODate("2022-05-16T03:17:24.780Z")
},
"electionParticipantMetrics" : {
"votedForCandidate" : true,
"electionTerm" : NumberLong(3),
"lastVoteDate" : ISODate("2022-05-16T03:15:47.936Z"),
"electionCandidateMemberId" : 0,
"voteReason" : "",
"lastAppliedOpTimeAtElection" :

{ "ts" : Timestamp(1652590596, 1), "t" : NumberLong(2) }

,
"maxAppliedOpTimeInSet" :

{ "ts" : Timestamp(1652670907, 1), "t" : NumberLong(2) }

,
"priorityAtElection" : 1
},
"members" : [
{
"_id" : 0,
"name" : "10.230.10.150:31002",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "ts" : Timestamp(0, 0), "t" : NumberLong(-1) }

,
"optimeDurable" :

{ "ts" : Timestamp(0, 0), "t" : NumberLong(-1) }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2022-05-16T06:09:18.627Z"),
"lastHeartbeatRecv" : ISODate("2022-05-16T03:29:34.936Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "Error connecting to 10.130.10.150:31002 :: caused by :: Connection refused",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"configVersion" : 4,
"configTerm" : 5
},
{
"_id" : 1,
"name" : "10.230.10.149:31002",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 10413,
"optime" :

{ "ts" : Timestamp(1652681354, 1), "t" : NumberLong(5) }

,
"optimeDate" : ISODate("2022-05-16T06:09:14Z"),
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"electionTime" : Timestamp(1652671044, 1),
"electionDate" : ISODate("2022-05-16T03:17:24Z"),
"configVersion" : 4,
"configTerm" : 5,
"self" : true,
"lastHeartbeatMessage" : ""
},

{ "_id" : 2, "name" : "10.230.9.150:31002", "health" : 1, "state" : 7, "stateStr" : "ARBITER", "uptime" : 10400, "lastHeartbeat" : ISODate("2022-05-16T06:09:18.161Z"), "lastHeartbeatRecv" : ISODate("2022-05-16T06:09:18.167Z"), "pingMs" : NumberLong(0), "lastHeartbeatMessage" : "", "syncSourceHost" : "", "syncSourceId" : -1, "infoMessage" : "", "configVersion" : 4, "configTerm" : 5 }

],
"ok" : 1,



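The rs.status() output above explains the hang: writeMajorityCount is 2, but with the secondary down the primary is the only data-bearing voter available, and the arbiter cannot acknowledge writes, so a w:"majority" write can never commit. A small sketch of that arithmetic (illustrative only):

```python
def majority_write_can_succeed(writable_voters_up, voting_members):
    """A w:"majority" write commits only if enough *data-bearing*
    voters are up to form a majority of all voting members.
    Arbiters vote in elections but never acknowledge writes."""
    write_majority = voting_members // 2 + 1
    return writable_voters_up >= write_majority

# From the rs.status() above: 3 voting members, writeMajorityCount = 2.
# Secondary down: only the primary (1 writable voter) is up.
print(majority_write_can_succeed(1, 3))  # False -> w:"majority" hangs
# Secondary up: 2 writable voters.
print(majority_write_can_succeed(2, 3))  # True
```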
 Comments   
Comment by Ali Mir [ 07/Nov/22 ]

Hey there 601290552@qq.com! Thanks for this ticket. I'm on the replication team here at MongoDB, and we worked on updating the write concern default to w: "majority" in 5.0.

Please note that this bug around sharded clusters and PSA sets has been fixed in later versions of MongoDB. If you upgrade to 6.0, you will not see this issue. In later versions, if you attempt to start a sharded cluster with any shard that is a PSA set, you'll receive an error on startup. To avoid the error, you'll need to set a cluster-wide write concern via the setDefaultRWConcern command (as chris.kelly@mongodb.com mentioned).

To get around this issue on 5.0.2, please follow the steps outlined by Chris. Namely, you should set the CWWC with:

db.adminCommand( {setDefaultRWConcern: 1, defaultWriteConcern:{w:1}})

to set a default of w:1 for the cluster.

I'm going to close out this ticket, but feel free to reply with any additional questions. Thanks!

Comment by Chris Kelly [ 03/Jun/22 ]

Hi Jing,

Thank you for your report! I went ahead and replicated your situation by creating a 2-shard cluster with a primary, secondary, and arbiter in each shard. I shut down the secondary on shard #2 and attempted your query on mongos.

When you run:

db.test.insert({"test": 1})

before shutting down a secondary node on shard #2, it will work. After the secondary on shard #2 is stopped, the query will hang.

If you run the query specifying writeConcern: 1 instead on mongos, it will work:

db.test.insert({"test":1}, {writeConcern: {w:1}} )

Alternatively, you can run setDefaultRWConcern on mongos to set the default to w: 1 yourself, so inserts work again without specifying the write concern on each query.

db.adminCommand( {setDefaultRWConcern: 1, defaultWriteConcern:{w:1}})

I will follow up on whether the writeConcern is supposed to be different on mongos by default, regardless of whether arbiters exist on the shards, but you can use this to remediate the hanging for now. You should also avoid the use of arbiters if at all possible.

Regards,
Christopher

 

Generated at Thu Feb 08 06:05:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.