[SERVER-46963] Test locally that we eliminate retries of short-running reads due to planned secondary shutdown (without streamable isMaster) Created: 18/Mar/20  Updated: 19/May/20  Resolved: 19/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Jason Chan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-46952 Create server parameter shutdownTimeo... Closed
depends on SERVER-47017 Attach topologyVersion to ShutdownErr... Closed
is depended on by SERVER-46964 Test locally that we eliminate retrie... Closed
is depended on by SERVER-46965 Test that we eliminate retries of sho... Closed
is depended on by SERVER-46968 Test locally that we eliminate retrie... Closed
Sprint: Repl 2020-05-18, Repl 2020-06-01
Participants:

 Description   

Write a workload of short-running reads (<1 second) with readPreference “secondary”. This should be run against a ReplSetTest with 3 nodes, where the two secondaries are alternately shut down. Run this workload with the 4.2 version of pymongo to test without streamable isMaster (consider using a smaller client heartbeat frequency or a longer quiesce mode, in order to be robust to slow machines). Simulate 1 second of network latency between the driver and the secondaries using mongobridge. The test should disable retries in pymongo and assert that all commands succeed. This demonstrates that the need for retries is eliminated. Consider only running this test on variants that are not typically slow.
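A minimal sketch of the client side of such a workload, assuming a local three-node replica set on placeholder ports 20020-20022 (the URI, namespace, and option values are illustrative, not part of the actual test):

from pymongo import MongoClient

# Hypothetical local ReplSetTest deployment; ports and replica set name are placeholders.
client = MongoClient(
    "mongodb://localhost:20020,localhost:20021,localhost:20022/?replicaSet=rs",
    retryReads=False,            # disable retries so any failure surfaces to the test
    readPreference="secondary",  # short-running reads target the secondaries
    heartbeatFrequencyMS=500,    # smaller client heartbeat frequency, per the description
)
coll = client.test.state

coll.find_one()  # each read is short-running (<1 second)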



 Comments   
Comment by Jason Chan [ 19/May/20 ]

For this testing ticket and similar ones, we decided to perform the testing locally and not commit any of the work, as there is currently a race condition in the drivers (DRIVERS-1262) that prevents us from eliminating retries completely. Instead, the goal of this testing ticket is to show two things:
(1) A find issued to a secondary S while it is in quiesce mode (i.e., before the driver has recognized that S is in the UNKNOWN state) should still complete successfully.
(2) A find issued after the driver recognizes that S is UNKNOWN should be sent to a different available secondary (a driver-side check for this is sketched below).
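From the driver, goal (2) can be checked with pymongo's Cursor.address property, which reports the node that served a given find. This is a sketch under the same placeholder host/port assumptions as the sketch in the description, and it only applies once the driver has already marked the quiesced secondary UNKNOWN:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://localhost:20020,localhost:20021,localhost:20022/?replicaSet=rs",
    retryReads=False,
    readPreference="secondary",
)
coll = client.test.state

# Once the driver has marked d20021 (placeholder port 20021) as UNKNOWN, each
# find should be served by the other secondary.
cursor = coll.find({}, limit=1)
next(cursor, None)  # exhaust the single-batch result
assert cursor.address != ("localhost", 20021), "find was routed to the quiesced secondary"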

We can see the following sequence of events in our logs from local testing after increasing the log verbosity for the command component.
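The D2-level "About to run the command" entries below only appear once the command component's verbosity is raised to at least 2. A minimal sketch of doing that from pymongo, assuming a direct connection to one node (repeat for each member):

from pymongo import MongoClient

# Direct connection to a single member (placeholder port); with no replicaSet
# option, PyMongo 3.x talks to this node only.
node = MongoClient("localhost", 20021)
node.admin.command({"setParameter": 1, "logComponentVerbosity": {"command": {"verbosity": 2}}})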

Secondary node d20021 enters quiesce mode:

[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:54.733+0000 d20021| 2020-05-18T23:34:54.733+00:00 I  STORAGE  4695102 [SignalHandler] "Entering quiesce mode for shutdown","attr":{"quiesceTimeMillis":5000}

d20021 continues to service reads once in quiesce mode:

[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:55.325+0000 d20021| 2020-05-18T23:34:55.325+00:00 D2 COMMAND  21965   [conn20] "About to run the command","attr":{"db":"test","commandArgs":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"19665e22-f14e-44f4-b1cf-a15309cfecd4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1589844894,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"secondary"}}}
[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:55.326+0000 d20021| 2020-05-18T23:34:55.325+00:00 I  COMMAND  51803   [conn20] "Slow query","attr":{"type":"command","ns":"test.state","command":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"19665e22-f14e-44f4-b1cf-a15309cfecd4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1589844894,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"secondary"}},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":272,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":0}

Secondary d20021 returns a ShutdownInProgress error in response to the driver's isMaster:

[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.540+0000 d20021| 2020-05-18T23:34:56.540+00:00 D2 COMMAND  21965   [conn21] "About to run the command","attr":{"db":"admin","commandArgs":{"ismaster":1,"client":{"driver":{"name":"PyMongo","version":"3.10.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"2.6.32-220.el6.x86_64"},"platform":"CPython 3.7.0.final.0"},"$db":"admin"}}
[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.540+0000 d20021| 2020-05-18T23:34:56.540+00:00 I  NETWORK  51800   [conn21] "client metadata","attr":{"remote":"10.122.11.183:43715","client":"conn21","doc":{"driver":{"name":"PyMongo","version":"3.10.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"2.6.32-220.el6.x86_64"},"platform":"CPython 3.7.0.final.0"}}
[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.541+0000 d20021| 2020-05-18T23:34:56.540+00:00 D1 COMMAND  21962   [conn21] "Assertion while executing command","attr":{"command":"ismaster","db":"admin","commandArgs":{"ismaster":1,"client":{"driver":{"name":"PyMongo","version":"3.10.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"2.6.32-220.el6.x86_64"},"platform":"CPython 3.7.0.final.0"},"$db":"admin"},"error":"ShutdownInProgress: The server is in quiesce mode and will shut down"}
[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.541+0000 d20021| 2020-05-18T23:34:56.541+00:00 I  COMMAND  51803   [conn21] "Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"ismaster":1,"client":{"driver":{"name":"PyMongo","version":"3.10.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"2.6.32-220.el6.x86_64"},"platform":"CPython 3.7.0.final.0"},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"The server is in quiesce mode and will shut down","errName":91,"errCode":91,"reslen":344,"locks":{},"protocol":"op_query","durationMillis":0}

We see that the finds are now being sent to the other secondary, d20022, and we no longer see d20021 servicing reads as it transitions out of quiesce mode:

[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.588+0000 d20022| 2020-05-18T23:34:56.588+00:00 D2 COMMAND  21965   [conn16] "About to run the command","attr":{"db":"test","commandArgs":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"19665e22-f14e-44f4-b1cf-a15309cfecd4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1589844894,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"secondary"}}}
[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.589+0000 d20022| 2020-05-18T23:34:56.588+00:00 I  COMMAND  51803   [conn16] "Slow query","attr":{"type":"command","ns":"test.state","command":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"19665e22-f14e-44f4-b1cf-a15309cfecd4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1589844894,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"secondary"}},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":272,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":0}
[js_test:quiesce_mode_non_streamable_ismaster] 2020-05-18T23:34:56.840+0000 d20022| 2020-05-18T23:34:56.840+00:00 D2 COMMAND  21965   [conn16] "About to run the command","attr":{"db":"test","commandArgs":{"find":"state","filter":{},"limit":1,"singleBatch":true,"lsid":{"id":{"$uuid":"19665e22-f14e-44f4-b1cf-a15309cfecd4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1589844894,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"secondary"}}}
...
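The same handoff is visible from the driver's side: once the isMaster error above is processed, d20021 drops out of the client's view of the topology. A sketch using pymongo's MongoClient.nodes property, with placeholder host/port values:

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:20020,localhost:20021,localhost:20022/?replicaSet=rs")

# client.nodes is the set of (host, port) pairs the driver currently considers
# available; poll it until the quiesced secondary has been marked UNKNOWN and
# removed, after which secondary reads go to d20022.
while ("localhost", 20021) in client.nodes:
    time.sleep(0.1)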

The test passes with the driver option retryReads set to false, indicating that all reads succeeded without the driver ever having to retry. With this, the test shows that goals (1) and (2) are achieved.
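A sketch of that success assertion under the same placeholder setup: with retryReads=False the driver never masks a transient failure, so the loop below passing means every read succeeded on its first attempt:

from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient(
    "mongodb://localhost:20020,localhost:20021,localhost:20022/?replicaSet=rs",
    retryReads=False,
    readPreference="secondary",
)
coll = client.test.state

errors = []
for i in range(1000):  # 1000 is an arbitrary placeholder for the workload size
    try:
        coll.find_one()
    except PyMongoError as exc:
        errors.append((i, exc))

# With retries disabled, an empty list means no read ever needed a retry.
assert not errors, errors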

Example logs here.

Comment by Jason Chan [ 18/May/20 ]

For this ticket, we decided that we will no longer make simulating network latency a test requirement, since the mongobridge API currently only supports forwarding messages between two servers, not between a driver and a server.
