[SERVER-29160] Sharding commonly uses write concern timeouts of 15 seconds and these are timing out in migration related operations and causing BFs Created: 12/May/17  Updated: 30/Oct/23  Resolved: 25/Sep/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.6.9, 4.0.4, 4.1.4

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Misha Tyulenev
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-32693 Increase write concern timeout used d... Closed
Related
related to SERVER-37377 Consider increasing default distribut... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6
Sprint: Sharding 2018-09-24, Sharding 2018-10-08
Participants:
Linked BF Score: 25

 Description   

Sharding's writeConcern timeouts related to writes performed throughout the migration process should be bumped higher to prevent BFs. This write specifically caused the linked BF. Any other related writes that can be bumped without seriously affecting the rest of the system should be as well.

Proposing a bump to 30 second timeouts rather than the 15 second timeout that's the norm in sharding.

suggested fix

As there are 20 different kMajorityWriteConcern values defined in anonymous namespaces, most of which are still connected, we can add the durations to write_concern_options.h:

    static constexpr Seconds kWriteConcernTimeoutSharding{30};
    static constexpr Seconds kWriteConcernTimeoutMigration{60};
    static constexpr Seconds kWriteConcernTimeoutUserCommand{60};

and use them as follows. Instead of

const WriteConcernOptions kMajorityWriteConcern(WriteConcernOptions::kMajority,
                                                WriteConcernOptions::SyncMode::UNSET,
                                                Seconds(30));

use

const WriteConcernOptions kMajorityWriteConcern(WriteConcernOptions::kMajority,
                                                WriteConcernOptions::SyncMode::UNSET,
                                                WriteConcernOptions::kWriteConcernTimeoutSharding);
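
A minimal, self-contained sketch of the suggestion above. The WriteConcernOptions struct here is a hypothetical stand-in, not the real mongo::WriteConcernOptions class (which has more members and a different layout), and the two make...WriteConcern helpers are invented for illustration only. It assumes C++17, so the static constexpr members need no out-of-class definition.

```cpp
#include <chrono>
#include <string>
#include <utility>

using Seconds = std::chrono::seconds;
using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for mongo::WriteConcernOptions (assumption, not the real class).
struct WriteConcernOptions {
    enum class SyncMode { UNSET, NONE, FSYNC, JOURNAL };

    static constexpr const char* kMajority = "majority";

    // Durations proposed in the description, defined once instead of per file.
    static constexpr Seconds kWriteConcernTimeoutSharding{30};
    static constexpr Seconds kWriteConcernTimeoutMigration{60};
    static constexpr Seconds kWriteConcernTimeoutUserCommand{60};

    WriteConcernOptions(std::string mode, SyncMode sync, Milliseconds timeout)
        : wMode(std::move(mode)), syncMode(sync), wTimeout(timeout) {}

    std::string wMode;
    SyncMode syncMode;
    Milliseconds wTimeout;  // Seconds converts to Milliseconds implicitly and losslessly.
};

// Illustrative helper (hypothetical name): majority write concern for general sharding writes.
inline WriteConcernOptions makeShardingMajorityWriteConcern() {
    return WriteConcernOptions(WriteConcernOptions::kMajority,
                               WriteConcernOptions::SyncMode::UNSET,
                               WriteConcernOptions::kWriteConcernTimeoutSharding);
}

// Illustrative helper (hypothetical name): majority write concern for migration writes.
inline WriteConcernOptions makeMigrationMajorityWriteConcern() {
    return WriteConcernOptions(WriteConcernOptions::kMajority,
                               WriteConcernOptions::SyncMode::UNSET,
                               WriteConcernOptions::kWriteConcernTimeoutMigration);
}
```

With the durations named in one place, a later bump to a timeout touches a single constant rather than each anonymous-namespace definition.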



 Comments   
Comment by Githook User [ 03/Oct/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-29160 bump timeout for migration operations to 60sec
Branch: v3.6
https://github.com/mongodb/mongo/commit/5f8dfb3ca2bac43ba65f2fb9907953deb96fba91

Comment by Githook User [ 03/Oct/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-29160 bump timeout for migration operations
Branch: v4.0
https://github.com/mongodb/mongo/commit/54cf97e5366ad421ed775d6fd74d2aa4fddaed02

Comment by Githook User [ 29/Sep/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-29160 follow up change wtimeout in unit tests to match
Branch: v3.6
https://github.com/mongodb/mongo/commit/890e5c76917c6bad1e00eb1e5d6dfadfd94db573

Comment by Githook User [ 28/Sep/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-29160 bump timeout for migration operations

(cherry picked from commit 3f618b86df0473ab905cc4a0ad78f4be8d3428e3)
Branch: v3.6
https://github.com/mongodb/mongo/commit/ef72d37036d96ba77f7b528df1b8952441cc66ad

Comment by Githook User [ 25/Sep/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-29160 bump timeout for migration operations
Branch: master
https://github.com/mongodb/mongo/commit/3f618b86df0473ab905cc4a0ad78f4be8d3428e3

Comment by Misha Tyulenev [ 10/Sep/18 ]

For the default in commands.cpp: https://github.com/mongodb/mongo/blob/master/src/mongo/db/commands.cpp#L74

Comment by Esha Maharishi (Inactive) [ 10/Sep/18 ]

Looks great!

One question, why do we need kWriteConcernTimeoutUserCommand?

Comment by Misha Tyulenev [ 10/Sep/18 ]

esha.maharishi please ack the approach outlined in the description

Comment by Dianna Hohensee (Inactive) [ 12/Oct/17 ]

Linking BF-6834 because it has similar issues, though not migration related commands. It's also a bit odd looking. It shows a writeConcern: { w: "majority", wtimeout: 15000 } 15 second timeout, but takes 39 seconds to complete, and completes after the test fails. Perhaps a config set network timeout of 30 seconds, and then the write on the shard had a 15 second timeout set.

Comment by Dianna Hohensee (Inactive) [ 16/Jun/17 ]

BF-5723's scenario is startCommit timing out after 30 seconds, followed closely by the migrateThread timing out (and failing the migration) after 15 seconds. Consider upping one of those timeouts. However, how SERVER-29698 is solved might have some effect on those timeouts.

Generated at Thu Feb 08 04:20:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.