[SERVER-30705] Concurrent updates and FCV change can cause dbhash mismatch between primary and secondary
Created: 16/Aug/17  Updated: 30/Oct/23  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: Querying, Write Ops
Affects Version/s: None
Fix Version/s: 3.6.0-rc0

Type: Bug
Priority: Critical - P2
Reporter: David Storch
Assignee: Justin Seyster
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-5030 Document equality should be independe... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2017-08-21, Query 2017-09-11, Query 2017-10-02
Participants:

 Description   

3.5.x versions of the server have two implementations of the update subsystem: the "old" (3.4 and earlier) system, and the new system in src/mongo/db/update, which is both more performant and supports more expressive array updates. The old and new systems behave differently with respect to field ordering. To ensure that field ordering is consistent across all nodes in the replica set, the primary and secondaries must use the same version of the update subsystem. This is achieved via the feature compatibility version mechanism: users must set the feature compatibility version (FCV) to "3.6" in order to enable the new update system.

The FCV check, however, does not guarantee that a given update uses the same version of the update code on every node. Consider the following sequence of events:

  1. A two node replica set is started. Both nodes are version 3.6 but have FCV "3.4".
  2. The client concurrently issues an update and setFeatureCompatibilityVersion("3.6"). These operations take compatible locks, and therefore execute concurrently on the server.
  3. The setFCV command logs its update to admin.system.version in the oplog at optime t.
  4. After this oplog entry is written but before the in-memory FCV state changes, the update is logged with some optime greater than t. This uses the old update system, since the FCV in-memory state has not yet been changed.
  5. The two oplog entries are applied on the secondary. Since the admin.system.version write has an earlier optime (and must be applied in its own batch), the update uses the new update system.
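
The sequence above can be sketched as a toy model in plain JavaScript. This is not server code: applyOldSet, applyNewSet, and simulateRace are hypothetical names, and the ordering rules are simplifying assumptions (the old system appends new fields in the order given in the update, the new system appends them in lexicographic order).

```javascript
// Assumed model of the old update system: new fields are appended in the
// order in which they appear in the $set specification.
function applyOldSet(doc, setSpec) {
    const out = Object.assign({}, doc);
    for (const key of Object.keys(setSpec)) {
        out[key] = setSpec[key];
    }
    return out;
}

// Assumed model of the new update system: new fields are appended in
// lexicographic order of their names.
function applyNewSet(doc, setSpec) {
    const out = Object.assign({}, doc);
    for (const key of Object.keys(setSpec).sort()) {
        out[key] = setSpec[key];
    }
    return out;
}

// Picks an update implementation from an FCV string, as both nodes do
// in this model before any fix.
function chooseByFcv(fcv) {
    return fcv === "3.6" ? applyNewSet : applyOldSet;
}

function simulateRace() {
    const oplog = [];
    const setSpec = {b: 1, a: 1};

    // Step 1: both nodes start with FCV "3.4".
    let primaryFcv = "3.4";
    let primaryDoc = {_id: 0};

    // Step 3: setFCV is logged at optime 1, but the primary's in-memory
    // FCV has not changed yet.
    oplog.push({ts: 1, op: "setFCV", version: "3.6"});

    // Step 4: the update runs on the primary with the old system and is
    // logged at optime 2.
    primaryDoc = chooseByFcv(primaryFcv)(primaryDoc, setSpec);
    oplog.push({ts: 2, op: "update", set: setSpec});
    primaryFcv = "3.6";

    // Step 5: the secondary applies entries in optime order, so it sees
    // FCV "3.6" before it applies the update.
    let secondaryFcv = "3.4";
    let secondaryDoc = {_id: 0};
    for (const entry of oplog) {
        if (entry.op === "setFCV") {
            secondaryFcv = entry.version;
        } else {
            secondaryDoc = chooseByFcv(secondaryFcv)(secondaryDoc, entry.set);
        }
    }

    return {
        primary: JSON.stringify(primaryDoc),
        secondary: JSON.stringify(secondaryDoc)
    };
}
```

In this model the two nodes end up with the same field values in a different order, which is the kind of difference a hash over the serialized documents would detect.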

I was able to reproduce a dbhash mismatch against a two-node 3.5.x replica set by running two scripts concurrently from two shells connected to the primary node. The first script repeatedly issues an update that sets two new fields in a single $set, which results in different field ordering depending on which version of the update implementation is used:

(function() {
    "use strict";
 
    db.c.drop();
    for (var i = 0; i < 1000; i++) {
        assert.writeOK(db.c.insert({_id: i}));
        assert.writeOK(db.c.update({_id: i}, {$set: {b: 1, a: 1}}));
    }
}());

The second script repeatedly sets the FCV from "3.4" to "3.6" and back again:

(function() {
    "use strict";
 
    while (true) {
        assert.commandWorked(db.adminCommand({setFeatureCompatibilityVersion: "3.4"}));
        assert.commandWorked(db.adminCommand({setFeatureCompatibilityVersion: "3.6"}));
    }
}());

After the first script completes, running the dbHash command against the test database on each node should show different hashes for test.c.
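
A minimal sketch of how the two dbHash results could be compared. The helper name mismatchedCollections is hypothetical; it only assumes that each result carries a "collections" sub-document mapping collection names to hash strings, as dbHash output does.

```javascript
// Returns the sorted names of collections whose hashes differ between two
// dbHash results, including collections present on only one node.
function mismatchedCollections(primaryResult, secondaryResult) {
    const primaryHashes = primaryResult.collections || {};
    const secondaryHashes = secondaryResult.collections || {};
    const names = new Set(
        Object.keys(primaryHashes).concat(Object.keys(secondaryHashes)));
    const mismatched = [];
    for (const name of names) {
        if (primaryHashes[name] !== secondaryHashes[name]) {
            mismatched.push(name);
        }
    }
    return mismatched.sort();
}
```

The inputs would be the documents returned by running db.getSiblingDB("test").runCommand({dbHash: 1}) on each node; for the reproduction above, the returned list would be expected to contain "c".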



 Comments   
Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

Author:

Justin Seyster (jseyster) <justin.seyster@mongodb.com>

Message: SERVER-30705 Add $v field for update semantics in oplog updates.

With the new UpdateNodes class hierarchy, there are two code paths for
applying an update to a document that have slightly different
semantics. The order of fields in the resulting document can vary
depending on which code path is used to apply an update. A difference
in ordering between documents in a replica set is considered a
"mismatch," so we need to ensure that secondaries always apply updates
using the same update system that the primary uses.

When an update executes as part of the application of an oplog entry,
the update is now allowed to have a $v field, which allows it to
specify which semantics were used by the operation that we are
replicating by applying the entry. When the primary uses the new
semantics (because it is a 3.6 mongod with featureCompatibilityVersion
set to 3.6), it includes {$v: 1} in the oplog's update document to
indicate that the secondary should apply with the newer 'UpdateNode'
semantics.

There are two other places where we need this behavior:
1) In role_graph_update.cpp, where the handleOplogUpdate observer
needs to update its in-memory BSON representation of a role to
reflect an update in the admin database and
2) in the applyOps command, which is used for testing how oplog
entries get applied.

Both these code paths set the fromOplogApplication flag, which
replaces the old fromReplication flag, and they also gain behavior
that used to be exclusive to oplog applications from
replication. (Specifically, they skip update validation checks, which
should have already passed before the oplog entry was created.)
Branch: master
https://github.com/mongodb/mongo/commit/390e5f47f00dcf133f361e3f9027e4da7d08d628
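
The idea in the commit message above can be illustrated with a toy model in plain JavaScript. This is not the server's implementation: the function names are hypothetical, the oplog entry is reduced to the update document "o", and the two ordering rules are simplifying assumptions (old system: new fields in specification order; new system: new fields in lexicographic order). The model also assumes that an entry without $v is applied with the old semantics.

```javascript
// Assumed model of the old update system: fields appended in $set order.
function applyOldSet(doc, setSpec) {
    const out = Object.assign({}, doc);
    for (const key of Object.keys(setSpec)) {
        out[key] = setSpec[key];
    }
    return out;
}

// Assumed model of the new update system: fields appended in
// lexicographic order.
function applyNewSet(doc, setSpec) {
    const out = Object.assign({}, doc);
    for (const key of Object.keys(setSpec).sort()) {
        out[key] = setSpec[key];
    }
    return out;
}

// The primary records which semantics it used by adding {$v: 1} to the
// logged update document when it ran the new system.
function logUpdate(primaryUsedNewSystem, setSpec) {
    const o = primaryUsedNewSystem ? {$v: 1, $set: setSpec} : {$set: setSpec};
    return {op: "u", o: o};
}

// The secondary ignores its own FCV and selects the code path from the
// logged $v field alone.
function applyOplogUpdate(doc, entry) {
    const useNewSystem = entry.o.$v === 1;
    const apply = useNewSystem ? applyNewSet : applyOldSet;
    return apply(doc, entry.o.$set);
}
```

Because the choice of code path travels with the oplog entry, the secondary's result no longer depends on when the FCV change is applied relative to the update.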

Comment by Tess Avitabile (Inactive) [ 18/Aug/17 ]

I think it is unlikely we will do SERVER-5030 and make the query language order-independent for 3.6 (though it is something to consider for future work on query language semantics), so I would be in favor of fixing this and SERVER-30470.

Comment by Spencer Brody (Inactive) [ 17/Aug/17 ]

This brings up the bigger question about whether we consider field ordering a meaningful property of a document that we want to ensure stays consistent across replica set members. This came up recently in SERVER-30470 and is related to SERVER-5030, which is currently unscheduled. If we decide that MongoDB provides no guarantees about field orderings (and thus decide to implement SERVER-5030), then this and SERVER-30470 become irrelevant. If not, then we probably need to fix both.

Comment by David Storch [ 17/Aug/17 ]

Per our in-person discussion today, we plan to pursue tess.avitabile's idea for how to fix this, since it is much simpler to implement. schwerin, we can definitely just throw the old version out once we branch for 3.8.

Comment by Andy Schwerin [ 16/Aug/17 ]

If we do as tess.avitabile proposes first, and we have a performance problem, we can address it in a point release. If there is no problem, we can decide what to do in 3.8 separately. Perhaps the new update system will need an order-preserving mode, or perhaps we can just throw the old version out in 3.8.

Comment by Tess Avitabile (Inactive) [ 16/Aug/17 ]

An alternative is that secondaries always use the old system, which creates new fields in the order specified by the primary. The advantage is that this requires no changes to the oplog format, and the disadvantage is that secondaries do not get any perf improvement for updates. I'm not sure which solution is better.

I don't have much work scheduled for next sprint--I think I'll be working on expressive lookup.

Comment by David Storch [ 16/Aug/17 ]

tess.avitabile, after discussing with Andy, I think the problem here is that secondaries should never rely on FCV checks. Instead, the primary should explicitly log which update system it used in the oplog. The secondary should interpret this information and select the appropriate code path. This is akin to how we require primaries to explicitly include the index version in createIndex oplog entries.

I think this needs to be addressed in 3.5. Do you or justin.seyster have time to take it on this sprint or next?

Generated at Thu Feb 08 04:24:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.