Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: 2.2.7, 2.4.9
Affects Version/s: None
Component/s: Sharding
Labels: None
Operating System: ALL
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Issue Status as of January 2nd, 2014

ISSUE SUMMARY
Under very rare circumstances mongos may incorrectly report a write as successful. The bug can manifest in the unlikely event that the mongos reuses a previously-used connection from the shared pool which contains a stale writeback field. In this situation, mongos cannot guarantee the correct post-migration location of writes and thus may incorrectly report the write as successful. Since mongos outgoing connections are tied to incoming client connections, this can only occur in cases of high connection turnover and low latency. The bug is difficult to trigger, but has caused a lost write in one known case.

This race condition can only occur on the first occurrence of a writeback being queued for a shard. Once a writeback is queued, the connection is cached.

USER IMPACT

Affected Version: All versions of MongoDB prior to and including v2.4.8.
Conditions Required: Sharded cluster with balancing enabled and active.
Frequency: Extremely rare.
Root Cause: In certain cases, it is possible for the getLastError aggregation in mongos ClientInfo to not return the correct code to the writeback listener. We ignore any previous writebacks when reprocessing a write in the writeback listener, but incorrectly do not append the other getLastError fields contained in "res" (the getLastError result from the shard).

In short, when retrying a write via the writeback listener, it is possible for the writeback listener to miss the special stale config code it needs to continue retrying.

SOLUTION
Always aggregate results from getLastError even in the presence of previous writebacks.
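The change can be illustrated with a minimal sketch. This is not the actual patch: `Doc`, `aggregateBuggy`, and `aggregateFixed` are hypothetical stand-ins for the BSON result and the aggregation logic in mongos ClientInfo, and the writeback handling itself is left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a BSON getLastError document.
using Doc = std::map<std::string, std::string>;

// Behaviour described in the root cause: the shard's getLastError fields
// ("res") are copied only when no writebacks were queued.
Doc aggregateBuggy(const std::vector<std::string>& writebacks, const Doc& res) {
    Doc result;
    if (!writebacks.empty()) {
        // Writebacks would be handled here; the fields of res are dropped.
    } else {
        result.insert(res.begin(), res.end());
    }
    return result;
}

// Sketch of the fix: copy the fields of res whether or not writebacks exist.
Doc aggregateFixed(const std::vector<std::string>& writebacks, const Doc& res) {
    Doc result;
    if (!writebacks.empty()) {
        // Writebacks would be handled here (omitted in this sketch).
    }
    result.insert(res.begin(), res.end());
    return result;
}
```

With a queued writeback and a shard result carrying a `code` field, `aggregateBuggy` returns a document without `code`, while `aggregateFixed` keeps it.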

WORKAROUNDS
Temporarily disable the balancer until all mongos are updated to ensure your sharded cluster is not susceptible to this bug.

PATCHES
Production releases v2.4.9 and v2.2.7 contain the fix for this issue, and production release v2.6.0 will contain it as well. To avoid this issue, all mongos processes must be upgraded to MongoDB v2.4.9 or MongoDB v2.2.7.

Original Description

In certain cases, it seems possible for the getLastError aggregation in mongos ClientInfo to not return the correct code to the writeback listener.

The core issue is here:

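            if ( writebacks.size() ){
                // Writebacks were queued. Bug: "res" is not appended to "result" on this branch.
                vector<BSONObj> v = _handleWriteBacks( writebacks , fromWriteBackListener );
                if ( v.size() == 0 && fromWriteBackListener ) {
                    // ok
                }
                ...
            }
            else {
                // No writebacks: report the shard and copy its getLastError fields.
                result.append( "singleShard" , theShard );
                result.appendElements( res );
            }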

We ignore any writebacks when reprocessing a write in the WBL, but incorrectly do not append the other getLastError fields contained in "res" (the getLastError result from the shard).

In short, when retrying a command in the WBL, it's possible for the WBL to not get the special stale config code it needs to continue retrying.
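To make the consequence concrete, here is a rough sketch of a retry loop that depends on seeing the stale config code. It is an illustration only, not the WBL implementation: `Doc`, `kStaleConfig`, and `retryWhileStale` are hypothetical names, and the marker value is a placeholder because the ticket does not state the real code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for a BSON getLastError document.
using Doc = std::map<std::string, std::string>;

// Placeholder marker; the real stale config code is not given in this ticket.
const std::string kStaleConfig = "StaleConfig";

// Calls attempt(n) until its result no longer carries the stale config
// marker, or until maxAttempts is reached. Returns the number of attempts.
int retryWhileStale(const std::function<Doc(int)>& attempt, int maxAttempts) {
    int n = 0;
    while (n < maxAttempts) {
        Doc gle = attempt(n);
        ++n;
        Doc::const_iterator it = gle.find("code");
        if (it == gle.end() || it->second != kStaleConfig) {
            break;  // No stale config marker: the write is treated as done.
        }
    }
    return n;
}
```

If the aggregation drops the `code` field, a loop of this shape stops after the first attempt and the write is reported as successful even though it still needed to be retried.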


Attachment: writeback_retry.js (4 kB, Dec 19 2013 06:03:04 PM UTC)

Assignee: Greg Studer (Inactive)
Reporter: Greg Studer (Inactive)
Participants: Daniel Pasette, Githook User, Greg Studer
Votes: 0
Watchers: 13

Created: Dec 17 2013 09:02:16 PM UTC
Updated: Jul 11 2016 05:40:19 PM UTC
Resolved: Dec 21 2013 03:48:02 PM UTC
