[SERVER-67609] Sharded time-series insert can return Interrupted after a stepdown (delayed BF-investigation) Created: 28/Jun/22  Updated: 29/Oct/23  Resolved: 12/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Dan Larkin-York Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2023-03-06, Execution Team 2023-03-20, Execution Team 2023-04-17
Participants:
Linked BF Score: 4

 Description   

It looks like a time-series insert can return a plain 'Interrupted' error code after a stepdown (at least in certain circumstances) rather than the expected 'InterruptedDueToReplStateChange', resulting in the operation not being retried.

This resulted in BF-25238. We have a band-aid fix (SERVER-67533) to exclude the test from the relevant concurrency passthroughs for the time being, but we would like to get to the root cause. It only seems to affect time-series inserts, not regular collections, and it only seems to happen in the sharded concurrency passthroughs with stepdowns involved.

In the continuous stepdown suites, you get the 'Interrupted' error code with a default error message. In the kill_primary suites, you get the 'Interrupted' error code with a more descriptive error message about read preference, like "Write results unavailable from failing to target a host in the shard shard-rs0 :: caused by :: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0".



 Comments   
Comment by Githook User [ 12/Apr/23 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-67609 Remove does_not_support_stepdowns exclusion test tag from timeseries_insert_idle_bucket_expiration.js
Branch: master
https://github.com/mongodb/mongo/commit/87e8a710fddd94b0e80ab969eb97b2d0cd281e9c

Comment by Dan Larkin-York [ 15/Sep/22 ]

arun.banala@mongodb.com That does help narrow it down. I can dig into the time-series-specific write paths a bit and see if there's anything that can explain it. Thanks for taking a look!

Comment by Sergi Mateo Bellido [ 01/Jul/22 ]

Passing this ticket to the query-execution team since they own this part. Feel free to reach me if you believe that the problem is sharding-related, I spent a bunch of hours but I didn't find any interesting lead.

Thank you!

Comment by Sergi Mateo Bellido [ 01/Jul/22 ]

I took a look at the two kind of BFGs that we have:

  • kill_primary suites: it looks like the test started to perform inserts without having a primary node in all shards (logs here)

    [j0:s0] Killing the primary on port 20003 of replica set 'shard-rs0'.
    ...
    [j0:s0:n1] | 2022-05-16T05:22:42.708+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
    ...
    [j0:s1:n1] | 2022-05-16T05:22:53.705+00:00 I  CONTROL  23378   [SignalHandler] "Signal was sent by kill(2)","attr":{"pid":7179,"uid":1000}
    ...
    [j0:s1:n0] | 2022-05-16T05:22:53.774+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
    ...
    [j0:c] Killing the primary on port 20002 of replica set 'config-rs'.
    ...
    [j0:c:n1] | 2022-05-16T05:23:12.171+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
    ...
    [j0:s0] Killing the primary on port 20004 of replica set 'shard-rs0'.
    ...
    [j0:s1] | 2022-05-16T05:23:48.540+00:00 I  NETWORK  4333208 [ReplicaSetMonitor-TaskExecutor] "RSM host selection timeout","attr":{"replicaSet":"shard-rs0","error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0"}
    [j0:s1] | 2022-05-16T05:23:48.541+00:00 I  COMMAND  51803   [conn647] "Slow query","attr":{"type":"command","ns":"test9_fsmdb0.timeseries_insert_idle_bucket_expiration_fsmcoll0","appName":"tid:9","command":{"insert":"timeseries_insert_idle_bucket_expiration_fsmcoll0","ordered":false,"lsid":{"id":{"$uuid":"af0a5c7a-a048-4215-877f-0682fe319da4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1652678613,"i":743}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"txnNumber":88,"writeConcern":{"w":"majority","wtimeout":300321},"$db":"test9_fsmdb0"},"nShards":1,"ninserted":0,"numYields":0,"reslen":22178,"readConcern":{"level":"local","provenance":"implicitDefault"},"remote":"127.0.0.1:45108","protocol":"op_msg","durationMillis":15035}
    ...
    [j0:s0:n1] | 2022-05-16T05:24:18.197+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
    ...
    [j0:s1] Killing the primary on port 20006 of replica set 'shard-rs1'.
    ...
    [j0:s1:n1] | 2022-05-16T05:24:18.272+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
    ...
    [j0:c] Killing the primary on port 20001 of replica set 'config-rs'.
    ...
    [j0:c:n0] | 2022-05-16T05:24:24.393+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
    ...
    [fsm_workload_test:timeseries_insert_idle_bucket_expiration]         			"errmsg" : "Write results unavailable from failing to target a host in the shard shard-rs0 :: caused by :: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0"
    

  • continuous_stepdown: I didn't find anything interesting here, perhaps are we stepping down too often?
Generated at Thu Feb 08 06:08:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.