[SERVER-31934] Setting orphanCleanupDelaySecs=0 in our testing infrastructure is unsafe Created: 13/Nov/17 Updated: 30/Oct/23 Resolved: 21/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.0-rc5, 3.7.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Misha Tyulenev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||
| Backport Requested: | v3.6 |
| Sprint: | Sharding 2017-12-04 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||||||||||
| Description |
|
Our testing has a default of orphanCleanupDelaySecs=0, which is causing safe secondary reads to fail in our tests. If cursors are established on secondaries, and then a migration completes, the range deleter can proceed to delete documents out from under the cursors, whose filtering information does not change, without interrupting them. |
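The race described above can be sketched with a small deterministic model. This is only an illustration in Python, not MongoDB code: the function `read_during_migration`, its parameters, the 10-document collection, and the logical clock are all assumptions made for the example. The cursor keeps reading from the donor's copy of the data after the migration commits, and the range deleter removes the migrated range once the delay has elapsed.

```python
def read_during_migration(orphan_cleanup_delay_secs, secs_per_batch=1, batch_size=2):
    """Toy model (hypothetical): return the _ids a long-lived cursor sees.

    The cursor opens at logical time 0 on a node holding _ids 0..9.
    A migration of the range [5, 10) commits at time 1, and the range
    deleter removes that range `orphan_cleanup_delay_secs` later.
    The cursor's filtering information never changes, so it keeps
    expecting to find the migrated documents locally.
    """
    docs = {i: {"_id": i} for i in range(10)}  # donor's local copy
    migrated_ids = range(5, 10)
    migration_commit_time = 1
    delete_time = migration_commit_time + orphan_cleanup_delay_secs

    seen = []
    now = 0
    next_id = 0
    deleted = False
    while next_id < 10:
        # The range deleter runs as soon as the delay has expired.
        if not deleted and now >= delete_time:
            for i in migrated_ids:
                docs.pop(i, None)
            deleted = True
        # The cursor fetches its next batch from whatever is still present.
        batch = range(next_id, min(next_id + batch_size, 10))
        seen.extend(i for i in batch if i in docs)
        next_id += batch_size
        now += secs_per_batch
    return seen


print(read_during_migration(orphan_cleanup_delay_secs=0))   # incomplete result
print(read_during_migration(orphan_cleanup_delay_secs=10))  # complete result
print(read_during_migration(orphan_cleanup_delay_secs=2, secs_per_batch=3))  # slow reader
```

In this model a delay of 0 lets the cursor see only _ids 0 through 4, a delay of 10 lets it finish with all ten documents, and a positive delay of 2 still yields an incomplete result when each batch takes 3 time units.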
| Comments |
| Comment by Githook User [ 21/Nov/17 ] |
|
Author: Misha Tyulenev (mikety) <misha@mongodb.com> Message: (cherry picked from commit a094c5b4b4783895d5fe168541bd97103a0d05f5) |
| Comment by Githook User [ 21/Nov/17 ] |
|
Author: Misha Tyulenev (mikety) <misha@mongodb.com> Message: |
| Comment by Kaloian Manassiev [ 20/Nov/17 ] |
|
3.6 Required no longer means that it is a blocker for the release. It means that we should fix it as soon as possible and backport it to 3.6. |
| Comment by Misha Tyulenev [ 20/Nov/17 ] |
|
kaloian.manassiev, please clarify why it's 3.6 Required. This change makes the agg_base.js test run more reliably, and the fix does not affect any code. |
| Comment by Dianna Hohensee (Inactive) [ 13/Nov/17 ] |
|
max.hirschhorn, I think the conclusion is to increase 'orphanCleanupDelaySecs' and leave the suite active, under the assumption that after the increase the failure will occur very rarely. It's also desirable to add something that makes the cause of failure obvious if it does occasionally fail this way. |
| Comment by Max Hirschhorn [ 13/Nov/17 ] |
|
dianna.hohensee, kaloian.manassiev, should we just remove the concurrency_sharded_causal_consistency_and_balancer.yml test suite if reading from a secondary is permitted to return an incomplete result set (as compared to what the primary would have done) when chunk migrations are happening? |
| Comment by Dianna Hohensee (Inactive) [ 13/Nov/17 ] |
|
max.hirschhorn, setting "orphanCleanupDelaySecs" greater than 0 will not be safe. It will merely reduce the frequency of build failures: slow machines could still cause the range deletion delay to expire before the secondary queries are complete. I really don't like how this part of the system behaves. The only way to make it safe is to have range deletion interrupt secondary cursors, or wait for existing dependent cursors to close before range deletion. |
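The second option mentioned above, waiting for existing dependent cursors to close before deleting a range, could look roughly like the following. This is a minimal sketch in Python under stated assumptions, not the server's range deleter: the class `ToyShard` and its methods are hypothetical, and for simplicity every cursor open at scheduling time is treated as dependent on the range.

```python
class ToyShard:
    """Hypothetical model of a node that defers range deletion.

    A scheduled deletion is blocked by every cursor that was open at the
    time it was scheduled; cursors opened afterwards do not block it.
    """

    def __init__(self, ids):
        self.docs = set(ids)
        self.open_cursors = set()
        self.pending = []  # list of {"range": (lo, hi), "blockers": set}

    def open_cursor(self, cursor_id):
        self.open_cursors.add(cursor_id)

    def close_cursor(self, cursor_id):
        self.open_cursors.discard(cursor_id)
        for entry in self.pending:
            entry["blockers"].discard(cursor_id)
        self._run_ready()

    def schedule_range_deletion(self, lo, hi):
        # Snapshot the cursors that might still depend on [lo, hi).
        self.pending.append(
            {"range": (lo, hi), "blockers": set(self.open_cursors)}
        )
        self._run_ready()

    def _run_ready(self):
        # Delete every pending range whose blocking cursors have all closed.
        still_blocked = []
        for entry in self.pending:
            if entry["blockers"]:
                still_blocked.append(entry)
            else:
                lo, hi = entry["range"]
                self.docs -= set(range(lo, hi))
        self.pending = still_blocked


shard = ToyShard(range(10))
shard.open_cursor("c1")
shard.schedule_range_deletion(5, 10)   # blocked by c1
shard.open_cursor("c2")                # opened later, does not block
shard.close_cursor("c1")               # deletion of [5, 10) now proceeds
```

Unlike a fixed delay, this approach does not depend on how fast the reader is, at the cost of letting a long-lived cursor postpone cleanup indefinitely.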
| Comment by Max Hirschhorn [ 13/Nov/17 ] |
|
dianna.hohensee, could you explain why a value for the "orphanCleanupDelaySecs" server parameter strictly greater than 0 would always be safe? During the code review of the changes from c63465a4 as part of |