[SERVER-31934] Setting orphanCleanupDelaySecs=0 in our testing infrastructure is unsafe
Created: 13/Nov/17  Updated: 30/Oct/23  Resolved: 21/Nov/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.6.0-rc5, 3.7.1

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Misha Tyulenev
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-31000 Investigate concurrency suite timeout... Closed
Related
related to SERVER-30984 Investigate agg_base.js workload with... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v3.6
Sprint: Sharding 2017-12-04
Participants:
Linked BF Score: 0

 Description   

Our testing infrastructure defaults to orphanCleanupDelaySecs=0, which causes safe secondary reads to fail in our tests.

If cursors are established on secondaries, and then a migration completes, the range deleter can proceed to delete documents out from under the cursors, whose filtering information does not change, without interrupting them.
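The race described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions: time is modeled in abstract "ticks", the range deleter is assumed to remove the whole range in one step, and the function name and its parameters are hypothetical, not part of the server or the test infrastructure.

```python
def simulate_secondary_read(num_docs, docs_per_tick, migration_tick, cleanup_delay_ticks):
    """Return the keys a secondary cursor sees while a range deleter runs.

    Hypothetical model: the cursor's filtering information is fixed when it
    opens, so it keeps scanning the migrated range and is never interrupted.
    """
    storage = set(range(num_docs))   # documents of the migrated chunk on the donor shard
    returned = []
    next_key = 0
    tick = 0
    # The range deleter fires once the cleanup delay has elapsed after the migration.
    delete_at = migration_tick + cleanup_delay_ticks
    while next_key < num_docs:
        if tick >= delete_at:
            storage.clear()          # documents vanish out from under the cursor
        # The cursor reads its next batch; deleted documents are silently skipped.
        for key in range(next_key, min(next_key + docs_per_tick, num_docs)):
            if key in storage:
                returned.append(key)
        next_key += docs_per_tick
        tick += 1
    return returned


# Delay of 0: deletion starts as soon as the migration completes.
no_delay = simulate_secondary_read(100, 10, migration_tick=3, cleanup_delay_ticks=0)
# A delay longer than the scan lets this particular cursor finish.
long_delay = simulate_secondary_read(100, 10, migration_tick=3, cleanup_delay_ticks=10)
# The same delay on a slower machine (fewer documents per tick) is not enough.
slow_machine = simulate_secondary_read(100, 5, migration_tick=3, cleanup_delay_ticks=10)
print(len(no_delay), len(long_delay), len(slow_machine))
```

In this model a larger delay only helps when the scan finishes before the delay expires; it does not remove the race.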



 Comments   
Comment by Githook User [ 21/Nov/17 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-31934 set orphanCleanupDelaySecs=1 for all tests

(cherry picked from commit a094c5b4b4783895d5fe168541bd97103a0d05f5)
Branch: v3.6
https://github.com/mongodb/mongo/commit/26981405f83426e94f38772a773f9e4d4e55801e

Comment by Githook User [ 21/Nov/17 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-31934 set orphanCleanupDelaySecs=1 for all tests
Branch: master
https://github.com/mongodb/mongo/commit/a094c5b4b4783895d5fe168541bd97103a0d05f5

Comment by Kaloian Manassiev [ 20/Nov/17 ]

3.6 Required no longer means that it is a blocker for the release. It means that we should fix it as soon as possible and backport it to 3.6.

Comment by Misha Tyulenev [ 20/Nov/17 ]

kaloian.manassiev, please clarify why this is 3.6 Required. This change makes the agg_base.js test run more reliably, and the fix does not affect any code.

Comment by Dianna Hohensee (Inactive) [ 13/Nov/17 ]

max.hirschhorn, I think the conclusion is to increase 'orphanCleanupDelaySecs' and leave the suite active, under the assumption that after the increase the failure will occur only very rarely. It's also desirable to add something that makes the cause of failure obvious if the suite does occasionally fail this way.

Comment by Max Hirschhorn [ 13/Nov/17 ]

dianna.hohensee, kaloian.manassiev, should we just remove the concurrency_sharded_causal_consistency_and_balancer.yml test suite if reading from a secondary is permitted to return an incomplete result set (as compared to what the primary would have done) when chunk migrations are happening?

Comment by Dianna Hohensee (Inactive) [ 13/Nov/17 ]

max.hirschhorn, setting "orphanCleanupDelaySecs" greater than 0 will not be safe. It will merely reduce the frequency of build failures: slow machines could still allow the range deletion delay to expire before the secondary queries are complete. I really don't like how this part of the system behaves. The only way to make it safe is to have range deletion interrupt secondary cursors, or wait for existing dependent cursors to close before range deletion.
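The second alternative mentioned above, waiting for dependent cursors to close before deleting, might look roughly like the following. This is a minimal sketch and not the server's implementation: the class OrphanRange and all of its methods are hypothetical names introduced only for illustration.

```python
class OrphanRange:
    """Hypothetical model of a migrated range whose deletion is deferred
    until every cursor that may still read it has closed."""

    def __init__(self, keys):
        self.docs = set(keys)
        self._dependent_cursors = set()
        self._delete_requested = False
        self._next_cursor_id = 0

    def open_cursor(self):
        # Cursors opened after the migration would use updated filtering
        # information and never read this range, so they are rejected here.
        if self._delete_requested:
            raise RuntimeError("range is no longer owned by this shard")
        cursor_id = self._next_cursor_id
        self._next_cursor_id += 1
        self._dependent_cursors.add(cursor_id)
        return cursor_id

    def close_cursor(self, cursor_id):
        self._dependent_cursors.discard(cursor_id)
        self._maybe_delete()

    def request_delete(self):
        # Called once the migration has completed; deletion runs immediately
        # only if no dependent cursor is still open.
        self._delete_requested = True
        self._maybe_delete()

    def _maybe_delete(self):
        if self._delete_requested and not self._dependent_cursors:
            self.docs.clear()


orphans = OrphanRange(range(5))
cursor = orphans.open_cursor()
orphans.request_delete()
print(len(orphans.docs))   # documents are still present while the cursor is open
orphans.close_cursor(cursor)
print(len(orphans.docs))   # deletion ran after the last dependent cursor closed
```

Unlike a fixed delay, this approach does not depend on how fast the machine is, at the cost of letting a long-lived cursor postpone cleanup indefinitely.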

Comment by Max Hirschhorn [ 13/Nov/17 ]

dianna.hohensee, could you explain why a value for the "orphanCleanupDelaySecs" server parameter strictly greater than 0 would always be safe? During the code review of the changes from c63465a4 as part of SERVER-29405, there was an idea of setting it to 2 seconds; however, it's unclear to me why cursors established on a secondary wouldn't still skip over documents.

Generated at Thu Feb 08 04:28:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.