[SERVER-34956] big_object1.js is not resilient to unexpected stepdowns Created: 11/May/18  Updated: 29/Oct/23  Resolved: 25/Jun/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.6.6, 4.0.1, 4.1.1

Type: Improvement
Priority: Major - P3
Reporter: Kyle Suarez
Assignee: James Wahlin
Resolution: Fixed
Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6
Sprint: Query 2018-07-02
Participants:
Linked BF Score: 16

 Description   

In one instance when this test ran in the retryable_writes_jscore_stepdown_passthrough suite, the primary received a replSetStepDown command while the test's main insert loop was executing.

[ReplicaSetFixture:job3:node2] 2018-05-01T18:15:45.659+0000 I COMMAND  [conn149] Attempting to step down in response to replSetStepDown command
[ReplicaSetFixture:job3:node2] 2018-05-01T18:15:45.659+0000 I REPL     [conn149] transition to SECONDARY from PRIMARY

Reading the code for this test, it looks like any sort of error is going to cause the while-loop to break prematurely.

    while (true) {
        var result;
        n = {_id: x, a: []};
        for (i = 0; i < 14 + x; i++)
            n.a.push(s);
        try {
            result = t.insert(n);
            o = n;
        } catch (e) {
            break;
        }
 
        if (result.hasWriteError())
            break;
        x++;
    }
 
    printjson(t.stats(1024 * 1024));
 
    assert.lt(15 * 1024 * 1024, Object.bsonsize(o), "A1");
    assert.gt(17 * 1024 * 1024, Object.bsonsize(o), "A2");

Because the loop exits before performing the needed inserts, the assertion fails.

[js_test:big_object1] 2018-05-01T17:55:36.315+0000 2018-05-01T17:55:35.871+0000 E QUERY    [js] Error: 15728640 is not less than 13926560 : A1 :
[js_test:big_object1] 2018-05-01T17:55:36.316+0000 doassert@src/mongo/shell/assert.js:18:14
[js_test:big_object1] 2018-05-01T17:55:36.316+0000 _assertCompare@src/mongo/shell/assert.js:756:9
[js_test:big_object1] 2018-05-01T17:55:36.317+0000 assert.lt@src/mongo/shell/assert.js:760:1
[js_test:big_object1] 2018-05-01T17:55:36.317+0000 @jstests/core/big_object1.js:32:1
[js_test:big_object1] 2018-05-01T17:55:36.320+0000 failed to load: jstests/core/big_object1.js

If it is important for this test to be resilient in the face of unexpected stepdowns, we should rewrite the main insert loop, which seems to expect to exit for one particular reason (and not because a write failed due to a state change).

However, I think that it would be easier to simply blacklist the test from the retryable_writes_jscore_stepdown_passthrough suite, as I can't really imagine how the coverage in the passthrough provides anything more than what is already achieved by running this test in jscore.
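
For reference, the loop rewrite mentioned above might look roughly like the following sketch. It is plain JavaScript rather than code against the real collection: insertFn stands in for t.insert and is assumed to return null on success or an error object on failure, and isSizeError, isStateChangeError, and maxRetries are hypothetical names that do not exist in the test. The idea is that only a size rejection ends the loop normally; a state-change error is retried a bounded number of times, and anything else fails loudly.

```javascript
// Sketch only, not the actual test. `insertFn` is a stand-in for t.insert()
// (assumed: returns null on success, an error object on failure, or throws).
// `isSizeError` and `isStateChangeError` are hypothetical caller-supplied predicates.
function growUntilTooLarge(insertFn, isSizeError, isStateChangeError, maxRetries = 3) {
    let x = 0;
    let retries = 0;
    let lastInserted = null;
    while (true) {
        // Same growth pattern as the original loop: 14 + x array entries.
        const doc = {_id: x, a: new Array(14 + x).fill("filler")};
        let err = null;
        try {
            err = insertFn(doc);
        } catch (e) {
            err = e;
        }
        if (err === null) {
            lastInserted = doc;
            x++;
            retries = 0;
            continue;
        }
        if (isSizeError(err)) {
            // The one expected way out of the loop.
            return lastInserted;
        }
        if (isStateChangeError(err) && retries < maxRetries) {
            // Retry the same document instead of leaving the loop early.
            retries++;
            continue;
        }
        throw new Error("unexpected insert failure, code=" + err.code);
    }
}
```

With a loop shaped like this, a stepdown would surface as an explicit error rather than as a confusing size assertion failure later in the test.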



 Comments   
Comment by Githook User [ 29/Jun/18 ]

Author:

{'username': 'jameswahlin', 'name': 'James Wahlin', 'email': 'james@mongodb.com'}

Message: SERVER-34956 Replace big_object1.js with max_doc_size.js

(cherry picked from commit 051262cc7b3e9584602c364b8cf803d31d47d5f8)
Branch: v3.6
https://github.com/mongodb/mongo/commit/a86f311e3f8f50f8f2ed37f223da17f12da855a7

Comment by Githook User [ 29/Jun/18 ]

Author:

{'username': 'jameswahlin', 'name': 'James Wahlin', 'email': 'james@mongodb.com'}

Message: SERVER-34956 Replace big_object1.js with max_doc_size.js

(cherry picked from commit 051262cc7b3e9584602c364b8cf803d31d47d5f8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/337a07588639701e9042b06a56690a55c65993bc

Comment by Githook User [ 25/Jun/18 ]

Author:

{'username': 'jameswahlin', 'name': 'James Wahlin', 'email': 'james@mongodb.com'}

Message: SERVER-34956 Replace big_object1.js with max_doc_size.js
Branch: master
https://github.com/mongodb/mongo/commit/051262cc7b3e9584602c364b8cf803d31d47d5f8

Comment by James Wahlin [ 22/Jun/18 ]

I just triaged another recent failure of big_object1.js. According to the BFG history, this test fails a few times a week.

Rather than rewrite or fix it, I think we should extend max_doc_size.js (which also tests the boundaries of BSONObj size and is more stable) to read large documents as well. With this change, max_doc_size.js will cover a superset of big_object1.js and we can retire big_object1.js.

Comment by Kyle Suarez [ 11/May/18 ]

Here's our first retry:

[jsTest] Retrying insert due to retryable write concern error (code=11602), subsequent retries remaining: 2

Error code 11602 is InterruptedDueToReplStateChange, which is caused by the replSetStepDown command run by the stepdown thread, which starts before this test activates. Each retry hits the same error.

[jsTest] ----
[jsTest] Retrying insert due to network error, subsequent retries remaining: 1
[jsTest] ----
...
[jsTest] ----
[jsTest] Retrying insert due to retryable write concern error (code=11602), subsequent retries remaining: 0
[jsTest] ----

Though the error is retryable in theory, all inserts will fail because the node has stepped down. I suppose the failover didn't happen fast enough for us to hit NotMaster, which would have let the retry logic know that the error is permanent and not retryable (at least for that host).

This is why I am proposing that we should simply blacklist this test from the stepdown passthrough; I don't see any benefit in keeping it.
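
To illustrate the retry behavior described above, here is a minimal sketch of a bounded retry loop. It is not the shell's retry code or the override file's implementation: runOnce, isRetryable, and retryInsert are hypothetical names, and runOnce is assumed to return null on success or an error object with a code field. It shows how a node that keeps returning 11602 uses up every retry, while an error classified as non-retryable would stop immediately.

```javascript
// Sketch only: `runOnce` stands in for a single insert attempt and
// `isRetryable` for the error classification; both are hypothetical.
function retryInsert(runOnce, isRetryable, maxRetries) {
    const log = [];
    for (let remaining = maxRetries; ; remaining--) {
        const err = runOnce();
        if (err === null) {
            return {ok: true, log};
        }
        if (!isRetryable(err) || remaining === 0) {
            // Either a permanent error or no retries left: give up.
            return {ok: false, err, log};
        }
        log.push(`Retrying insert (code=${err.code}), subsequent retries remaining: ${remaining - 1}`);
    }
}

// A node that has stepped down but has not yet started reporting NotMaster:
// every attempt fails with the same retryable code.
let attempts = 0;
const steppedDownNode = () => {
    attempts++;
    return {code: 11602};
};
const outcome = retryInsert(steppedDownNode, (e) => e.code === 11602, 3);
console.log(outcome.ok, attempts, outcome.log.length);
```

Under these assumptions, three retries mean four attempts in total, all against the same stepped-down node, so the insert never succeeds.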

Comment by Max Hirschhorn [ 11/May/18 ]

kyle.suarez, why aren't the automatic retries from retryable writes in the mongo shell as well as from the auto_retry_on_network_error.js override file sufficient to transparently handle the error response from the server? Is result.hasWriteError() returning true because there is some non-retryable error that's occurring as a result of the stepdown somehow?
