[SERVER-12146] writeback listener may not get correct code back from ClientInfo::getLastError
Created: 17/Dec/13  Updated: 11/Jul/16  Resolved: 21/Dec/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 2.2.7, 2.4.9

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: Greg Studer
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File writeback_retry.js    
Operating System: ALL

 Description   
Issue Status as of January 2nd, 2014

ISSUE SUMMARY
Under very rare circumstances mongos may incorrectly report a write as successful. The bug can manifest in the unlikely event that the mongos reuses a previously-used connection from the shared pool which contains a stale writeback field. In this situation, mongos cannot guarantee the correct post-migration location of writes and thus may incorrectly report the write as successful. Since mongos outgoing connections are tied to incoming client connections, this can only occur in cases of high connection turnover and low latency. The bug is difficult to trigger, but has caused a lost write in one known case.

This race condition can only occur on the first occurrence of a writeback being queued for a shard. Once a writeback is queued, the connection is cached.

USER IMPACT

Affected Version: All versions of MongoDB prior to and including v2.4.8.
Conditions Required: Sharded cluster with balancing enabled and active.
Frequency: Extremely rare.
Root Cause: In certain cases, it is possible for the getLastError aggregation in mongos ClientInfo to not return the correct code to the writeback listener. We ignore any previous writebacks when reprocessing a write in the writeback listener, but incorrectly do not append the other getLastError fields contained in "res" (the getLastError result from the shard).

In short, when retrying a write via the writeback listener, it is possible for the writeback listener to miss the special stale config code it needs to continue retrying.

SOLUTION
Always aggregate results from getLastError even in the presence of previous writebacks.

WORKAROUNDS
Temporarily disable the balancer until all mongos are updated to ensure your sharded cluster is not susceptible to this bug.

PATCHES
Production releases v2.4.9 and v2.2.7 contain the fix for this issue, and production release v2.6.0 will contain it as well. Upgrading all mongos processes to MongoDB v2.4.9 or MongoDB v2.2.7 is required to avoid this issue.

Original Description

In certain cases, it seems possible for the getLastError aggregation in mongos ClientInfo to not return the correct code to the writeback listener.

The core issue is here:

            if ( writebacks.size() ){
                vector<BSONObj> v = _handleWriteBacks( writebacks , fromWriteBackListener );
                if ( v.size() == 0 && fromWriteBackListener ) {
                    // ok
                }
                ...
            }
            else {
                result.append( "singleShard" , theShard );
                result.appendElements( res );
            }

We ignore any writebacks when reprocessing a write in the writeback listener (WBL), but when writebacks are present we incorrectly skip appending the other getLastError fields contained in "res" (the getLastError result from the shard).

In short, when retrying a command in the WBL, it's possible for the WBL to miss the special stale config code it needs to continue retrying.
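
The effect of the excerpt above can be sketched with a toy model. This is a simplified illustration, not the mongos source: `Gle`, `kStaleConfig`, `appendShardResult`, `aggregateOriginal`, `aggregateFixed` and `listenerShouldRetry` are hypothetical names, a `std::map` stands in for `BSONObj`, and the condition used in `aggregateFixed` is only an assumption based on the commit message ("do not check writebacks if calling gle from wbl").

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a getLastError reply; the real code uses BSONObj.
using Gle = std::map<std::string, std::string>;

// Placeholder marker for the stale config code (the ticket does not give its value).
const std::string kStaleConfig = "STALE_CONFIG";

// Copies the shard's fields into the aggregated result (models the else branch).
void appendShardResult(Gle& result, const std::string& shard, const Gle& res) {
    result["singleShard"] = shard;
    for (const auto& kv : res) result[kv.first] = kv.second;
}

// Models the excerpt: any pending writeback skips the branch that copies `res`.
Gle aggregateOriginal(const std::string& shard, const Gle& res,
                      const std::vector<std::string>& writebacks) {
    Gle result;
    if (!writebacks.empty()) {
        // Writebacks would be handled here; `res` is never appended.
    } else {
        appendShardResult(result, shard, res);
    }
    return result;
}

// Assumed shape of the fix: a call from the listener always aggregates `res`.
Gle aggregateFixed(const std::string& shard, const Gle& res,
                   const std::vector<std::string>& writebacks,
                   bool fromWriteBackListener) {
    Gle result;
    if (!writebacks.empty() && !fromWriteBackListener) {
        // Normal client path: writebacks would be handled here.
    } else {
        appendShardResult(result, shard, res);
    }
    return result;
}

// The listener keeps retrying only while it sees the stale config marker.
bool listenerShouldRetry(const Gle& result) {
    auto it = result.find("code");
    return it != result.end() && it->second == kStaleConfig;
}
```

With a shard reply that carries the stale config marker and a leftover writeback on the connection, `aggregateOriginal` returns a result without `code`, so `listenerShouldRetry` is false and the retry stops early. `aggregateFixed` called with `fromWriteBackListener` set to true keeps the marker, so the listener continues retrying.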



 Comments   
Comment by Githook User [ 06/Jan/14 ]

Author: Dan Pasette (monkey101) <dan@10gen.com>

Message: SERVER-12146 do not check writebacks if calling gle from wbl
Branch: v2.2
https://github.com/mongodb/mongo/commit/307fb42c66350981525d64ca8f6a2dbfe6a3d8f4

Comment by Daniel Pasette (Inactive) [ 02/Jan/14 ]

This patch will be backported to the 2.2 branch.

Comment by Githook User [ 21/Dec/13 ]

Author: Dan Pasette (monkey101) <dan@10gen.com>

Message: SERVER-12146 do not check writebacks if calling gle from wbl
Branch: v2.4
https://github.com/mongodb/mongo/commit/bd3553dff93786447130c242c274678f969cd513

Comment by Greg Studer [ 19/Dec/13 ]

The attached test case reproduces the issue using two fail points in the WBL, since it is difficult to trigger deterministically.
