[SERVER-32865] Dropped write during insert loop with retry_writes on Created: 23/Jan/18  Updated: 21/Feb/18  Resolved: 24/Jan/18

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Timothy Olsen (Inactive) Assignee: Max Hirschhorn
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: new-primary.log, old-primary.log
Operating System: ALL
Steps To Reproduce:

1. Start a MongoDB 3.6.2 replica set with FCV 3.6
2. Start a loop in the mongo shell to insert 10,000 documents with retry_writes set to true:

use test
db = db.getMongo().startSession({retry_writes: true}).getDatabase(db.getName());
for (i = 0; i < 10000; i++) { db.test.insert({}); }

3. While the inserts are happening, issue an rs.stepDown() command on the primary (a sketch of this step follows the list)
4. The insert loop will eventually finish, but db.test.count() will indicate that only 9,999 documents have been inserted.
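
For step 3, a minimal sketch of the stepdown, assuming a second mongo shell connected directly to the current primary (the host below reuses the reporter's first node as a placeholder):

// In a second shell: mongo --host host1.local.10gen.cc:27000
rs.stepDown(60)  // primary steps down and stays ineligible for re-election for 60 seconds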


Description

max.hirschhorn suggested I file this ticket.

If I start an insert loop that attempts to insert 10,000 documents with retry_writes on and then step down the primary, only 9,999 documents are inserted.

MongoDB is running with FCV 3.6.



Comments
Comment by Timothy Olsen (Inactive) [ 24/Jan/18 ]

Yep, that was it! I found that parameter from someone's blog. I should have checked our official documentation. You can close this ticket.

Comment by Max Hirschhorn [ 24/Jan/18 ]

> I believe I'm using a replica set connection string. Here is the command line I used to start the mongo shell:
>
> /Users/tim/mongodb-osx-x86_64-3.6.2/bin/mongo "mongodb://host1.local.10gen.cc:27000,host2.local.10gen.cc:27010,host3.local.10gen.cc:27020/admin?replicaSet=proxytest" --username u --password p --ssl --sslCAFile=main/ca.pem

Yeah, that looks like a replica set connection string.

> db = db.getMongo().startSession({retry_writes: true}).getDatabase(db.getName());

tim.olsen, it looks like you may not actually be enabling retryable writes: the session option is "retryWrites" (camelCase), not "retry_writes" (snake_case). Additionally, the --retryWrites command line option to the mongo shell is intended to be the user-facing way to enable retryable writes.
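
For illustration, a corrected version of the snippet from the steps to reproduce; a sketch, where the only change is the camelCase option name:

db = db.getMongo().startSession({retryWrites: true}).getDatabase(db.getName());
for (i = 0; i < 10000; i++) { db.test.insert({}); }

Alternatively, pass the user-facing flag when starting the shell, reusing the reporter's connection string:

/Users/tim/mongodb-osx-x86_64-3.6.2/bin/mongo --retryWrites "mongodb://host1.local.10gen.cc:27000,host2.local.10gen.cc:27010,host3.local.10gen.cc:27020/admin?replicaSet=proxytest" --username u --password p --ssl --sslCAFile=main/ca.pem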

Comment by Timothy Olsen (Inactive) [ 24/Jan/18 ]

I believe I'm using a replica set connection string. Here is the command line I used to start the mongo shell:

/Users/tim/mongodb-osx-x86_64-3.6.2/bin/mongo "mongodb://host1.local.10gen.cc:27000,host2.local.10gen.cc:27010,host3.local.10gen.cc:27020/admin?replicaSet=proxytest" --username u --password p --ssl --sslCAFile=main/ca.pem

Comment by Max Hirschhorn [ 24/Jan/18 ]

tim.olsen, could you confirm that the mongo shell you were using to perform the insert operations was using a replica set connection string? I think the "not master" error you're seeing after adding the assert.writeOK() call potentially indicates that the mongo shell attempted to retry the failed insert operation on the same server rather than trying to discover the new primary. This would happen if you attempted to use retryable writes while directly connected to the primary you're stepping down. Below is a codified version of what I understood your procedure to be; I haven't seen it fail.

(function() {
    "use strict";
 
    load("jstests/libs/parallelTester.js");
 
    function stepDownPrimary(primaryHost, reconnectedEvent) {
        try {
            load("jstests/replsets/rslib.js");
 
            const conn = new Mongo(primaryHost);
            const error = assert.throws(function() {
                const res = conn.adminCommand({replSetStepDown: 1, force: true});
                print("replSetStepDown did not throw exception but returned: " + tojson(res));
            });
            assert(isNetworkError(error),
                   "replSetStepDown did not disconnect client; failed with " + tojson(error));
 
            // We use the reconnect() function to run a command against the former primary that
            // acquires the global lock to ensure that it has finished stepping down and has
            // therefore closed all of its client connections. This ensures commands sent on other
            // connections to the former primary trigger a network error rather than potentially
            // returning a "not master" error while the server is in the midst of closing client
            // connections.
            reconnect(conn);
 
            return {ok: 1};
        } catch (e) {
            return {ok: 0, err: e.toString(), stack: e.stack};
        } finally {
            // Signal to the main thread that the primary would have closed all of its connections
            // by now and that the stepdown has therefore completely finished.
            reconnectedEvent.countDown();
        }
    }
 
    const rst = new ReplSetTest({nodes: 3});
 
    rst.startSet();
    rst.initiate();
 
    const primary = rst.getPrimary();
    const rsConn = new Mongo(rst.getURL());
    const db = rsConn.startSession({retryWrites: true}).getDatabase("test");
 
    // Use 'reconnectedEvent' to have the stepdown thread signal to the main thread when the primary
    // stepdown has completely finished.
    const reconnectedEvent = new CountDownLatch(1);
    const thread = new ScopedThread(stepDownPrimary, primary.host, reconnectedEvent);
    thread.start();
 
    let numIterations = 0;
 
    function doInsert() {
        assert.writeOK(db.mycoll.insert({}));
        ++numIterations;
    }
 
    while (numIterations < 10000 || reconnectedEvent.getCount() === 1) {
        doInsert();
    }
 
    // Do another insert to guarantee that the main thread observed a network error at least once
    // while performing an insert.
    doInsert();
 
    // There should be as many documents as insert attempts because a failed attempt would have been
    // automatically retried.
    assert.eq(numIterations, db.mycoll.find().itcount());
    print(`${numIterations} of ${numIterations} inserts present`);
 
    thread.join();
    assert.commandWorked(thread.returnData(), "stepdown thread exited with an error");
 
    db.getSession().endSession();
    rst.stopSet();
})();
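
(If you want to run this repro yourself: the script loads jstests helpers by relative path, so, assuming it is saved as repro.js, it would be run from the root of a mongodb source checkout with a standalone shell, e.g. mongo --nodb repro.js.)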

Comment by Timothy Olsen (Inactive) [ 24/Jan/18 ]

Here is the shell output with TestData = {logRetryAttempts: true} and using assert.writeOK(). In this case, however, the final document count is 7,349. I assume that means the loop got interrupted somehow.

proxytest:PRIMARY> TestData={logRetryAttempts: true}
{ "logRetryAttempts" : true }
proxytest:PRIMARY> use test
switched to db test
proxytest:PRIMARY> db = db.getMongo().startSession({retry_writes: true}).getDatabase(db.getName());
test
proxytest:PRIMARY> for (i = 0; i < 10000; i++) { assert.writeOK(db.test.insert({})); }
2018-01-24T11:42:18.820-0500 I NETWORK  [thread1] Marking host host1.local.10gen.cc:27000 as failed :: caused by :: NotMaster: got not master from: host1.local.10gen.cc:27000 of repl set: proxytest
assert: write failed with error: { "writeError" : { "code" : 10107, "errmsg" : "not master" } }
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:16:14
assert.writeOK@src/mongo/shell/assert.js:564:9
@(shell):1:31
 
2018-01-24T11:42:18.824-0500 E QUERY    [thread1] Error: write failed with error: { "writeError" : { "code" : 10107, "errmsg" : "not master" } } :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:16:14
assert.writeOK@src/mongo/shell/assert.js:564:9
@(shell):1:31
2018-01-24T11:42:18.828-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:19.330-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:19.572-0500 W NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Unable to reach primary for set proxytest
2018-01-24T11:42:19.834-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:20.339-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:20.843-0500 I NETWORK  [thread1] Detected bad connection created at 1516812109563129 microSec, clearing pool for host1.local.10gen.cc:27000 of 0 connections
2018-01-24T11:42:20.843-0500 I NETWORK  [thread1] Dropping all pooled connections to host1.local.10gen.cc:27000(with timeout of 5 seconds)
2018-01-24T11:42:20.843-0500 I NETWORK  [thread1] Ending connection to host host1.local.10gen.cc:27000(with timeout of 5 seconds) due to bad connection status; 0 connections to that host remain open
2018-01-24T11:42:20.843-0500 I NETWORK  [thread1] Marking host host1.local.10gen.cc:27000 as failed :: caused by :: HostUnreachable: network error while attempting to run command 'ismaster' on host 'host1.local.10gen.cc:27000' 
2018-01-24T11:42:20.844-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:21.355-0500 I NETWORK  [thread1] Successfully connected to host1.local.10gen.cc:27000 (1 connections now open to host1.local.10gen.cc:27000 with a 5 second timeout)
2018-01-24T11:42:21.356-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:21.861-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:22.364-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:22.869-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:23.374-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:23.878-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:24.382-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:24.884-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:25.389-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:25.893-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:26.399-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:26.904-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:27.408-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:27.913-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:28.418-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:28.923-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:29.427-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:29.933-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:30.437-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:30.943-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
2018-01-24T11:42:31.449-0500 W NETWORK  [thread1] Unable to reach primary for set proxytest
proxytest:PRIMARY> db.test.count()
7349
proxytest:PRIMARY> db.test.count()

Comment by Timothy Olsen (Inactive) [ 24/Jan/18 ]

I've attached the logs from the old primary and new primary.

Comment by Max Hirschhorn [ 23/Jan/18 ]

kaloian.manassiev, I had Tim define TestData={logRetryAttempts: true} during his local setup to see if there'd be evidence of the mongo shell actually doing the retry following a stepdown, and we didn't see the print() statements that Jack had added in SERVER-31194. I think it is possible that the mongo shell isn't actually retrying for some reason.

Comment by Kaloian Manassiev [ 23/Jan/18 ]

Would it be possible to attach the logs from both the new and the old primary? I am curious if there are any errors reported.

max.hirschhorn, just curious - have you discounted a possible bug in the shell?

CC jack.mulrow
