[SERVER-61878] Secondary nodes filter out legit write operations believing that they were on orphans Created: 03/Dec/21  Updated: 27/Oct/23  Resolved: 30/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sergi Mateo Bellido Assignee: Antonio Fuschetto
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Sharding EMEA 2021-12-13, Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
Participants:

 Description   

The goal of this ticket is to investigate why secondary nodes are filtering out oplog entries in the belief that they apply to orphaned documents. This should not happen, and it likely means that the filtering information on the shard is not up to date.

We managed to reproduce it with the following test, after removing the skips that we have on secondaries and raising an error whenever something is filtered out on a secondary.

(function() {
'use strict';
var st = new ShardingTest({
    shards: 2,
    rs: { nodes: 2 }
});
assert.commandWorked(st.s.adminCommand({enablesharding: "test"}));
assert.commandWorked(st.s.adminCommand({
    setDefaultRWConcern: 1,
    defaultReadConcern: {level: "available"},
    defaultWriteConcern: {w: 1},
    writeConcern: {w: "majority"}
}));
st.ensurePrimaryShard('test', st.shard0.shardName);
// Add a third node to shard0's replica set.
st.rs0.add({'shardsvr': ""});
try {
    st.rs0.reInitiate();
} catch (e) {
    print(e);
}
st.rs0.awaitReplication();
st.rs0.waitForState(st.rs0.getSecondaries(), ReplSetTest.State.SECONDARY, 180 * 1000);
// Shard the collection and split it into two chunks at {x: 50}.
assert.commandWorked(st.s0.adminCommand({shardcollection: "test.foo", key: {x: 1}}));
assert.commandWorked(st.s0.adminCommand({split: "test.foo", middle: {x: 50}}));
// Move the {x < 50} chunk to the other shard, waiting for the range deletion.
var other = st.config.shards.findOne({_id: {$ne: st.shard0.shardName}});
assert.commandWorked(st.getDB('admin').runCommand({
    moveChunk: "test.foo",
    find: {x: 10},
    to: other._id,
    _secondaryThrottle: true,
    writeConcern: {w: 2},
    _waitForDelete: true
}));
st.rs0.awaitReplication();
// A secondaryOk read through a fresh connection triggers the filtering
// metadata refresh on one secondary only.
var m = new Mongo(st.s.name);
var ts = m.getDB("test").foo;
m.setSecondaryOk();
printjson(ts.find().batchSize(5).explain()); // THIS TRIGGERS THE PROBLEM!!
const coll = st.s.getCollection("test.foo");
assert.commandWorked(coll.insert({primaryOnly: true, x: 60}));
print("DEBUG-9 ---#1---");
assert.commandWorked(coll.remove({primaryOnly: true, x: 60}, {writeConcern: {w: 3}}));
print("DEBUG-9 ---#2---");
st.stop();
})();



 Comments   
Comment by Antonio Fuschetto [ 30/Dec/21 ]

Problem

Secondary nodes filter oplog entries in the belief that they contain writes on orphaned documents (orphans). This is unnecessary because the primary node has already filtered out writes on orphans, so the oplog does not contain any operations that need filtering.

Although this activity on secondary nodes does not produce negative effects, it is functionally useless, as there are no operations to filter out.

The previous implementation for identifying writes on orphans (the final one is completely different, but this problem still merited investigation) was based on the presence or absence of filtering metadata for the affected collection. This analysis confirms that the logic was wrong.
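A minimal model of that flawed rule (illustrative JavaScript only; the function and field names are invented for this sketch and are not actual server code) shows why the mere presence of cached metadata is a bad signal: two secondaries of the same shard, holding identical data, reach different filtering decisions just because their caches differ.

```javascript
// Illustrative model only -- not actual server code. The flawed rule decides
// to filter based purely on whether filtering metadata happens to be cached.
function flawedShouldFilterOrphans(cachedFilteringMetadata) {
    // Metadata may be absent simply because this node never refreshed it.
    return cachedFilteringMetadata !== null;
}

// Two secondaries of the same shard, identical data, different cache states:
const coldSecondary = {cachedFilteringMetadata: null};       // never refreshed
const warmSecondary = {cachedFilteringMetadata: {epoch: 1}}; // refreshed by a read

console.log(flawedShouldFilterOrphans(coldSecondary.cachedFilteringMetadata)); // false
console.log(flawedShouldFilterOrphans(warmSecondary.cachedFilteringMetadata)); // true
```

Under this rule, any event that warms one secondary's cache (such as the secondaryOk read in the repro) silently changes that node's behavior relative to its peers.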

Analysis

When a secondary processes an oplog entry, it must use the right filtering information. The previous implementation did not recover the filtering information when it was unknown, leading secondaries to make wrong ownership decisions.

Moreover, if a client is configured to read data from a secondary node, the read triggers a collection metadata refresh on that specific node only, making its cached state diverge from the other secondaries. For example:

var m = new Mongo(st.s.name);
var ts = m.getDB("test").foo;
m.setSecondaryOk();
ts.find().batchSize(5).explain(); // This triggers the refresh of filtering metadata on a secondary

Conclusion

A scenario where secondary nodes hold different information in their caches is expected. The filtering logic cannot be based on the availability of filtering metadata; more generally, secondaries should not check the oplog for write operations on orphans at all.
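As a hedged sketch of that conclusion (illustrative JavaScript with invented names, not server code), a role-based rule keeps every node's decision consistent regardless of its cache state:

```javascript
// Illustrative model only -- not actual server code. Per the conclusion above,
// only the primary filters writes on orphans; secondaries apply the oplog
// as-is, because the primary has already done the filtering.
function shouldFilterOrphans(node) {
    return node.isPrimary;
}

const primary = {isPrimary: true};
const coldSecondary = {isPrimary: false, cachedFilteringMetadata: null};
const warmSecondary = {isPrimary: false, cachedFilteringMetadata: {epoch: 1}};

// Cache state no longer influences the decision:
console.log(shouldFilterOrphans(primary));       // true
console.log(shouldFilterOrphans(coldSecondary)); // false
console.log(shouldFilterOrphans(warmSecondary)); // false
```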

To validate the correctness of the logic (and only for this purpose), the new implementation was successfully tested by enabling the orphan-write filtering logic on the secondary nodes.

Comment by Kaloian Manassiev [ 08/Dec/21 ]

antonio.fuschetto, as we discussed yesterday, it is not "unexpected" to get non-null collection metadata if a secondary has previously received a request, which I believe is what happens in the repro. The problem might be that on the recipient shard we do not wait for the just-refreshed filtering information to be flushed to the config.system.cache.chunks. collection, as we do on the donor.

Comment by Antonio Fuschetto [ 07/Dec/21 ]

The problem is that, in some cases, secondary nodes are unexpectedly able to obtain non-null collection metadata, causing them to execute the new orphan-filtering logic. I therefore added an explicit, temporary safety measure to prevent secondaries from executing the new logic. Beyond that, restricting the new logic to the primary node is also required to prevent the FeatureFlag::isEnabled function from hitting its invariant when the FCV is not yet initialized at cluster startup.

Although the described problem is no longer related to PM-2423, since the implementation for filtering out orphaned documents has been replaced by one using the public Sharding API (so there is no longer any need to deal with collection metadata), it deserves investigation.

Generated at Thu Feb 08 05:53:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.