[SERVER-34956] big_object1.js is not resilient to unexpected stepdowns Created: 11/May/18 Updated: 29/Oct/23 Resolved: 25/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.6, 4.0.1, 4.1.1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kyle Suarez | Assignee: | James Wahlin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | neweng | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Backport Requested: |
v4.0, v3.6
|
||||||||
| Sprint: | Query 2018-07-02 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 16 | ||||||||
| Description |
|
In one instance when this test ran in the retryable_writes_jscore_stepdown_passthrough suite, the primary received a replSetStepDown command while the test's main insert loop was executing.
Reading the code for this test, it looks like any sort of error is going to cause the while-loop to break prematurely.
Having exited the loop without performing the needed inserts, the assertion fails.
If it were important for this test to be resilient in the face of unexpected stepdowns, we should rewrite the main insert loop, as it seems to be expecting to exit the loop for a particular reason (and not necessarily because a write failed due to a state change). However, I think that it would be easier to simply blacklist the test from the retryable_writes_jscore_stepdown_passthrough suite, as I can't really imagine how the coverage in the passthrough provides anything more than what is already achieved by running this test in jscore. |
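For reference, a stepdown-tolerant version of such a loop could be sketched as follows. This is a hypothetical, self-contained model in plain JavaScript, not the code in big_object1.js: `makeFakeInsert`, `growUntilTooLarge`, the `{ok, code}` result shape, and the `DOC_TOO_LARGE` placeholder are all assumptions made for illustration. Only error code 11602 comes from this ticket. The idea is that the loop ends only for the reason it expects (the document became too large) and treats a state-change error as something to retry a bounded number of times.

```javascript
// Hypothetical sketch; none of these names come from big_object1.js.
const INTERRUPTED_DUE_TO_REPL_STATE_CHANGE = 11602; // code cited in this ticket
const DOC_TOO_LARGE = -1; // placeholder value, NOT a real server error code

// Stand-in for a collection insert. It fails `failuresBeforeSuccess` times
// with a state-change error, then accepts documents up to `maxSize`.
function makeFakeInsert(maxSize, failuresBeforeSuccess) {
    let remainingFailures = failuresBeforeSuccess;
    return function insert(doc) {
        if (remainingFailures > 0) {
            remainingFailures--;
            return {ok: false, code: INTERRUPTED_DUE_TO_REPL_STATE_CHANGE};
        }
        if (doc.size > maxSize) {
            return {ok: false, code: DOC_TOO_LARGE};
        }
        return {ok: true};
    };
}

// Grow the document until the insert is rejected for being too large.
// State-change errors are retried up to `maxAttemptsPerDoc` attempts per
// document; any other failure is surfaced instead of silently ending the loop.
function growUntilTooLarge(insert, maxAttemptsPerDoc) {
    let size = 1;
    let lastInserted = null;
    while (true) {
        let res;
        let attempts = 0;
        do {
            res = insert({size: size});
            attempts++;
        } while (!res.ok &&
                 res.code === INTERRUPTED_DUE_TO_REPL_STATE_CHANGE &&
                 attempts < maxAttemptsPerDoc);

        if (res.ok) {
            lastInserted = size;
            size++;
            continue;
        }
        if (res.code === DOC_TOO_LARGE) {
            return lastInserted; // the expected reason to leave the loop
        }
        throw new Error("insert failed for an unexpected reason, code " + res.code);
    }
}
```

In this model, a caller sees an explicit exception when the retries are exhausted, instead of a later size assertion failing for an unrelated reason.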
| Comments |
| Comment by Githook User [ 29/Jun/18 ] | ||||||||
|
Author: {'username': 'jameswahlin', 'name': 'James Wahlin', 'email': 'james@mongodb.com'} Message: (cherry picked from commit 051262cc7b3e9584602c364b8cf803d31d47d5f8) | ||||||||
| Comment by Githook User [ 29/Jun/18 ] | ||||||||
|
Author: {'username': 'jameswahlin', 'name': 'James Wahlin', 'email': 'james@mongodb.com'} Message: (cherry picked from commit 051262cc7b3e9584602c364b8cf803d31d47d5f8) | ||||||||
| Comment by Githook User [ 25/Jun/18 ] | ||||||||
|
Author: {'username': 'jameswahlin', 'name': 'James Wahlin', 'email': 'james@mongodb.com'} Message: | ||||||||
| Comment by James Wahlin [ 22/Jun/18 ] | ||||||||
|
I just triaged another recent failure of big_object1.js. Looking at the BFG history, this test fails a few times a week. Rather than rewrite or fix it, I think we should extend max_doc_size.js (which also tests the boundaries of BSONObj size and is more stable) to read large documents as well. With this change, max_doc_size.js will cover a superset of big_object1.js, and we can retire big_object1.js. | ||||||||
| Comment by Kyle Suarez [ 11/May/18 ] | ||||||||
|
Here's our first retry:
Error code 11602 is InterruptedDueToReplStateChange, which is caused by the replSetStepDown command run by the stepdown thread, which starts before this test activates. Each retry hits the same error.
Though the error is retryable in theory, all inserts will fail because the node has stepped down. I suppose the failover didn't happen fast enough for us to hit NotMaster, which would have let the retry logic know that the error is permanent and not retryable (at least for that host). This is why I am proposing that we simply blacklist this test from the stepdown passthrough; I don't see any benefit in keeping it. | ||||||||
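The difference between the two error responses can be illustrated with a toy model. This is only a sketch of the behaviour described in this comment, not the mongo shell's retry implementation or the auto_retry_on_network_error.js override: `makeNode`, `writeWithRetries`, and the string error names used as return values are hypothetical. It assumes a retry policy that stays on the same host after InterruptedDueToReplStateChange and moves to another host after NotMaster.

```javascript
// Hypothetical model; not the shell's actual retry logic.
// `responses` is a list of error names (or null for success);
// the last entry repeats once the list is exhausted.
function makeNode(name, responses) {
    let i = 0;
    return {
        name: name,
        write() {
            const r = responses[Math.min(i, responses.length - 1)];
            i++;
            return r;
        },
    };
}

// Attempt one write with up to `maxRetries` retries across `nodes`.
function writeWithRetries(nodes, maxRetries) {
    let current = 0;
    const log = [];
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const err = nodes[current].write();
        log.push(nodes[current].name + ":" + (err === null ? "ok" : err));
        if (err === null) {
            return {ok: true, log: log};
        }
        if (err === "NotMaster") {
            // Treated as permanent for this host: try a different one.
            current = (current + 1) % nodes.length;
        } else if (err !== "InterruptedDueToReplStateChange") {
            break; // not retryable in this model
        }
        // InterruptedDueToReplStateChange: retry against the same host.
    }
    return {ok: false, log: log};
}
```

In this model, a stepped-down node that keeps answering InterruptedDueToReplStateChange uses up every retry on the same host, while a single NotMaster response redirects the next attempt to another node.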
| Comment by Max Hirschhorn [ 11/May/18 ] | ||||||||
|
kyle.suarez, why aren't the automatic retries from retryable writes in the mongo shell as well as from the auto_retry_on_network_error.js override file sufficient to transparently handle the error response from the server? Is result.hasWriteError() returning true because there is some non-retryable error that's occurring as a result of the stepdown somehow? |