[SERVER-46969] Test that we eliminate retries of short-running operations due to planned mongos shutdown (with streamable isMaster) Created: 18/Mar/20  Updated: 28/May/20  Resolved: 28/May/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Jason Chan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongos_quiesce_mode_streamable_ismaster_with_hanging_read.txt     Text File mongos_quiesce_mode_streamable_ismaster_with_writes.txt    
Issue Links:
Depends
depends on SERVER-46964 Test locally that we eliminate retrie... Closed
depends on SERVER-46968 Test locally that we eliminate retrie... Closed
Sprint: Repl 2020-06-01
Participants:

 Description   

Test the mongos workload created inĀ SERVER-46968 with the latest version of pymongo.



 Comments   
Comment by Jason Chan [ 28/May/20 ]

Similarly, we expect that if a mongos enters quiesce mode while a read is still in progress, the read should return successfully to the driver even if the driver has already marked the node as UNKNOWN.

Our observations for this test case in a sharded cluster with two mongos nodes:

We set two failpoints - one to hang a read on the mongos, and one to cause the same mongos to hang once it enters quiesce mode:

[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:15.988-0400 s20024| 2020-05-27T20:12:15.988-04:00 W  CONTROL  23829   [conn12] "Set failpoint","attr":{"failPointName":"waitInFindBeforeMakingBatch","failPoint":{"mode":1,"data":{},"timesEntered":0}}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:15.989-0400 s20024| 2020-05-27T20:12:15.988-04:00 I  COMMAND  51803   [conn12] "Slow query","attr":{"type":"command","ns":"admin.waitInFindBeforeMakingBatch","appName":"MongoDB Shell","command":{"configureFailPoint":"waitInFindBeforeMakingBatch","mode":"alwaysOn","data":{},"lsid":{"id":{"$uuid":"aeadc76f-41a5-43d4-ad86-9f38f8db4351"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590624735,"i":7}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"admin"},"numYields":0,"reslen":174,"protocol":"op_msg","durationMillis":0}
...
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.093-0400 s20024| 2020-05-27T20:12:16.092-04:00 W  CONTROL  23829   [conn12] "Set failpoint","attr":{"failPointName":"hangDuringQuiesceMode","failPoint":{"mode":1,"data":{},"timesEntered":0}}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.093-0400 s20024| 2020-05-27T20:12:16.093-04:00 I  COMMAND  51803   [conn12] "Slow query","attr":{"type":"command","ns":"admin.hangDuringQuiesceMode","appName":"MongoDB Shell","command":{"configureFailPoint":"hangDuringQuiesceMode","mode":"alwaysOn","data":{},"lsid":{"id":{"$uuid":"aeadc76f-41a5-43d4-ad86-9f38f8db4351"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590624735,"i":7}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"admin"},"numYields":0,"reslen":174,"protocol":"op_msg","durationMillis":0}

Wait for the read to hang on mongos node s20024:

[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.074-0400 s20024| 2020-05-27T20:12:16.074-04:00 I  QUERY    20908   [conn16] "Waiting in find before making batch for query - {cq_Short}","attr":{"cq_Short":"ns: test.state query: {} sort: {} projection: {} limit: 1"}

Call shutdown on mongos node s20024 and wait for it to hang in quiesce mode:

[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.100-0400 s20024| 2020-05-27T20:12:16.099-04:00 I  CONTROL  23377   [SignalHandler] "got signal {sig} ({strsignal_sig})","attr":{"sig":15,"strsignal_sig":"Terminated"}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.100-0400 s20024| 2020-05-27T20:12:16.099-04:00 I  CONTROL  23378   [SignalHandler] "kill from pid:{si_si_pid} uid:{si_si_uid}","attr":{"si_si_pid":13110,"si_si_uid":1000}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.100-0400 s20024| 2020-05-27T20:12:16.099-04:00 I  CONTROL  23381   [SignalHandler] "will terminate after current cmd ends"
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.101-0400 s20024| 2020-05-27T20:12:16.099-04:00 I  COMMAND  4695700 [SignalHandler] "hangDuringQuiesceMode failpoint enabled"

The node s20024 returns a ShutdownInProgress isMaster response to the driver immediately:

[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.102-0400 s20024| 2020-05-27T20:12:16.102-04:00 I  NETWORK  51800   [conn21] "client metadata","attr":{"remote":"127.0.0.1:35914","client":"conn21","doc":{"driver":{"name":"PyMongo","version":"3.11.0b1.dev0"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"4.13.0-46-generic"},"platform":"CPython 3.7.0.final.0"}}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.103-0400 s20024| 2020-05-27T20:12:16.102-04:00 D1 -        23074   [conn21] "User assertion","attr":{"error":"ShutdownInProgress: Mongos is in quiesce mode and will shut down","file":"src/mongo/s/mongos_topology_coordinator.cpp","line":117}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.103-0400 s20024| 2020-05-27T20:12:16.102-04:00 D1 SHARDING 22772   [conn21] "Exception thrown while processing command","attr":{"db":"admin","headerId":1102520059,"error":"ShutdownInProgress: Mongos is in quiesce mode and will shut down"}
[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:16.103-0400 s20024| 2020-05-27T20:12:16.102-04:00 I  COMMAND  51803   [conn21] "Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"ismaster":1,"client":{"driver":{"name":"PyMongo","version":"3.11.0b1.dev0"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"4.13.0-46-generic"},"platform":"CPython 3.7.0.final.0"},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Mongos is in quiesce mode and will shut down","errName":91,"errCode":91,"reslen":340,"protocol":"op_query","durationMillis":0}'

Allow the hanging read to finish on s20024. The driver should not raise any errors here:

[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:17.105-0400 s20024| 2020-05-27T20:12:17.105-04:00 I  COMMAND  51803   [conn16] "Slow query","attr":{"type":"command","ns":"test.state","command":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"018e25e2-5ad8-4225-aaee-5e12e09a115d"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590624735,"i":7}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"nShards":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":272,"protocol":"op_msg","durationMillis":1042}

We see that all future reads target the other mongos node s20025. Node s20024 is not targeted for the rest of its quiesce period:

[js_test:mongos_quiesce_mode_streamable_ismaster_with_hanging_read] 2020-05-27T20:12:17.358-0400 s20025| 2020-05-27T20:12:17.358-04:00 I  COMMAND  51803   [conn14] "Slow query","attr":{"type":"command","ns":"test.state","command":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"018e25e2-5ad8-4225-aaee-5e12e09a115d"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590624735,"i":7}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"nShards":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":269,"protocol":"op_msg","durationMillis":1}

The test completes successfully with driver option retryReads set to false, indicating that all reads completed successfully without the driver ever having to retry.

Example logs of test case: mongos_quiesce_mode_streamable_ismaster_with_hanging_read.txt

Comment by Jason Chan [ 28/May/20 ]

For this test, we expect to observe that as a mongos enters quiesce mode, it should still be able to service short-running reads and writes. Since we are using the streamable isMaster protocol, the mongos should return a ShutdownInProgress error to the driver immediately upon entering quiesce mode. The driver should then set the server to be in the UNKNOWN state and no longer consider it as a target for future operations.

Our observations for this test case in a sharded cluster with two mongos nodes:

Mongos node s20024 services a series of short-running reads and writes:

[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.511-0400 s20024| 2020-05-27T23:56:34.510-04:00 I  COMMAND  51803   [conn16] "Slow query","attr":{"type":"command","ns":"test.state","command":{"update":"state","ordered":true,"lsid":{"id":{"$uuid":"3379383f-b7b2-447d-9485-d0ab8fd5f169"}},"txnNumber":3,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590638194,"i":8}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"nShards":1,"nMatched":1,"nModified":1,"numYields":0,"reslen":185,"protocol":"op_msg","durationMillis":1}
...
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.514-0400 s20024| 2020-05-27T23:56:34.513-04:00 I  COMMAND  51803   [conn16] "Slow query","attr":{"type":"command","ns":"test.state","command":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"3379383f-b7b2-447d-9485-d0ab8fd5f169"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590638194,"i":9}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"nShards":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":279,"protocol":"op_msg","durationMillis":1}

We call shutdown on mongos node s20024, causing it to enter quiesce mode:

[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.765-0400 s20024| 2020-05-27T23:56:34.764-04:00 I  CONTROL  23377   [SignalHandler] "got signal {sig} ({strsignal_sig})","attr":{"sig":15,"strsignal_sig":"Terminated"}
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.766-0400 s20024| 2020-05-27T23:56:34.764-04:00 I  CONTROL  23378   [SignalHandler] "kill from pid:{si_si_pid} uid:{si_si_uid}","attr":{"si_si_pid":15306,"si_si_uid":1000}
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.766-0400 s20024| 2020-05-27T23:56:34.765-04:00 I  CONTROL  23381   [SignalHandler] "will terminate after current cmd ends"
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.766-0400 s20024| 2020-05-27T23:56:34.765-04:00 I  COMMAND  4695701 [SignalHandler] "Entering quiesce mode for mongos shutdown","attr":{"quiesceTimeMillis":5000}

Mongos node s20024 responds to the driver immediately with a ShutdownInProgressError:

[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.768-0400 s20024| 2020-05-27T23:56:34.767-04:00 I  NETWORK  51800   [conn20] "client metadata","attr":{"remote":"127.0.0.1:36304","client":"conn20","doc":{"driver":{"name":"PyMongo","version":"3.11.0b1.dev0"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"4.13.0-46-generic"},"platform":"CPython 3.7.0.final.0"}}
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.768-0400 s20024| 2020-05-27T23:56:34.767-04:00 D1 -        23074   [conn20] "User assertion","attr":{"error":"ShutdownInProgress: Mongos is in quiesce mode and will shut down","file":"src/mongo/s/mongos_topology_coordinator.cpp","line":117}
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.768-0400 s20024| 2020-05-27T23:56:34.767-04:00 D1 SHARDING 22772   [conn20] "Exception thrown while processing command","attr":{"db":"admin","headerId":2044897763,"error":"ShutdownInProgress: Mongos is in quiesce mode and will shut down"}
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:34.769-0400 s20024| 2020-05-27T23:56:34.768-04:00 I  COMMAND  51803   [conn20] "Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"ismaster":1,"client":{"driver":{"name":"PyMongo","version":"3.11.0b1.dev0"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"4.13.0-46-generic"},"platform":"CPython 3.7.0.final.0"},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Mongos is in quiesce mode and will shut down","errName":91,"errCode":91,"reslen":340,"protocol":"op_query","durationMillis":0}

The driver sends all future reads and writes to the other mongos node s20025:

[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:35.023-0400 s20025| 2020-05-27T23:56:35.023-04:00 I  COMMAND  51803   [conn16] "Slow query","attr":{"type":"command","ns":"test.state","command":{"update":"state","ordered":true,"lsid":{"id":{"$uuid":"3379383f-b7b2-447d-9485-d0ab8fd5f169"}},"txnNumber":5,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590638194,"i":10}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"nShards":1,"nMatched":1,"nModified":1,"numYields":0,"reslen":185,"protocol":"op_msg","durationMillis":1}
...
[js_test:mongos_quiesce_mode_streamable_ismaster] 2020-05-27T23:56:35.026-0400 s20025| 2020-05-27T23:56:35.026-04:00 I  COMMAND  51803   [conn16] "Slow query","attr":{"type":"command","ns":"test.state","command":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"3379383f-b7b2-447d-9485-d0ab8fd5f169"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1590638195,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"nShards":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":279,"protocol":"op_msg","durationMillis":1}

The test completes successfully with both driver options retryReads and retryWrites set to false, indicating that all operations completed successfully without the driver ever having to retry.

Example logs for this test case: mongos_quiesce_mode_streamable_ismaster_with_writes.txt

Generated at Thu Feb 08 05:12:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.