[SERVER-38722] CollectionCloner should handle QueryPlanKilled on collection drop Created: 20/Dec/18  Updated: 29/Oct/23  Resolved: 28/Feb/19

Status: Closed
Project: Core Server
Component/s: Querying, Replication
Affects Version/s: 3.4.19, 3.6.10, 4.0.6
Fix Version/s: 3.6.12, 4.0.7

Type: Bug Priority: Major - P3
Reporter: David Storch Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Documented
is documented by DOCS-12519 Docs for SERVER-38722: CollectionClon... Closed
Related
is related to SERVER-37451 Move all cursor ownership to the glob... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6, v3.4
Sprint: Query 2019-02-25, Query 2019-03-11
Participants:

 Description   

The CollectionCloner, used in the first phase of initial sync, has code to interpret error codes from the getMore command as indicative of a collection drop:

https://github.com/mongodb/mongo/blob/bf58b1ab2abfb2a3ab7a86c154f9f5954ed6f98c/src/mongo/db/repl/collection_cloner.cpp#L576-L582

In 4.0 and older branches, it handles OperationFailed and CursorNotFound. However, collection drops can result in a getMore returning QueryPlanKilled. This error code should be handled as well. This was fixed in 4.1.7 by SERVER-37451, but it still needs to be fixed in older branches. As part of moving ClientCursor ownership to the global cursor manager, SERVER-37451 changed the server's behavior such that collection drops result in QueryPlanKilled rather than CursorNotFound. This necessitated an immediate fix in master in order to ensure that initial sync remains resilient to collection drops. This ticket tracks the remaining backport work.



 Comments   
Comment by Githook User [ 06/Mar/19 ]

Author: David Storch (dstorch, david.storch@10gen.com)

Message: SERVER-38722 Make CollectionCloner tolerate QueryPlanKilled error on collection drop.
Branch: v3.6
https://github.com/mongodb/mongo/commit/03e8ccfaa221646f58e79c53b705f8ea71c1101d

Comment by David Storch [ 28/Feb/19 ]

The fix for this ticket will be included in the 4.0.7 release. This ensures that 4.0 nodes which are 4.0.7 and newer should be able to correctly tolerate collection drop when syncing from a 4.2 node.

Note that in SERVER-37451, we decided to make 4.2 nodes always return QueryPlanKilled in response to a collection drop, without gating this change on featureCompatibilityVersion. Therefore, 4.0.6 and older minor versions of 4.0 may fail to initial sync from a 4.2 node when a collection being cloned is dropped between getMore commands. Let's say that you have a 4.0.6 node which is initial syncing from a 4.2 node. While the CollectionCloner is cloning some collection, the collection is dropped between getMores. This will cause the collection clone to fail (and I think the initial sync attempt will have to restart?). Such a scenario could only occur during the 3.6 => 4.0 upgrade if the collection was dropped during the getMore, so the situation has gotten slightly worse for users who upgrade to 4.2 directly from 4.0.6 or an older minor release in the 4.0 series.

Users who may drop a collection during initial sync, and who may initial sync a 4.0 node from a 4.2 node during the 4.0 => 4.2 upgrade, should perform a minor version upgrade to at least 4.0.7 before attempting the major version upgrade to 4.2. I expect this situation to be unusual; most users should be able to upgrade directly to 4.2 from any 4.0 minor release.
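
As a rough aid for the upgrade advice above, the following sketch encodes which releases contain the CollectionCloner fix, based only on the versions named in this ticket (3.6.12, 4.0.7, and 4.1.7 on master). The `Version` struct and the `clonerToleratesQueryPlanKilled` function are hypothetical and are not part of the server.

```cpp
#include <tuple>

// Hypothetical version triple, e.g. {4, 0, 7} for release 4.0.7.
struct Version {
    int majorVer;
    int minorVer;
    int patchVer;
};

// Returns true if, per this ticket, the CollectionCloner in the given release
// tolerates QueryPlanKilled on collection drop.
bool clonerToleratesQueryPlanKilled(const Version& v) {
    // Backport to the 3.6 branch landed in 3.6.12.
    if (v.majorVer == 3 && v.minorVer == 6) return v.patchVer >= 12;
    // Backport to the 4.0 branch landed in 4.0.7.
    if (v.majorVer == 4 && v.minorVer == 0) return v.patchVer >= 7;
    // Master was fixed in 4.1.7; 3.4 and older never received the fix.
    return std::make_tuple(v.majorVer, v.minorVer, v.patchVer) >=
           std::make_tuple(4, 1, 7);
}
```

Under these assumptions, a 4.0.6 sync target returns false and a 4.0.7 one returns true, which matches the recommendation to reach at least 4.0.7 first.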

Comment by Githook User [ 28/Feb/19 ]

Author: David Storch (dstorch, david.storch@10gen.com)

Message: SERVER-38722 Make CollectionCloner tolerate QueryPlanKilled error on collection drop.
Branch: v4.0
https://github.com/mongodb/mongo/commit/c592fea5fa07a08a82c7e11f7cddbc0b17a98f40

Comment by Judah Schvimer [ 27/Feb/19 ]

Since SERVER-31267 added the OperationFailed and CursorNotFound checks and it was not backported to 3.4, I think this should not be backported to 3.4 either.

Comment by David Storch [ 27/Feb/19 ]

judah.schvimer tess.avitabile, given that SERVER-31267 was fixed in 3.6 and backport to 3.4 was declined, can you confirm that the backport of this ticket to 3.4 should be declined as well?

The fix for this ticket applies cleanly on 4.0 and 3.6, so I do plan to backport it with your approval to these newest two stable branches.

Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ]

I don't think this will backport cleanly to 3.4, since the check for OperationFailed and CursorNotFound was added in SERVER-31267.

Comment by Judah Schvimer [ 02/Jan/19 ]

The CollectionCloner code should be very similar in v3.4, so I'd vote for it if it's a clean backport as expected.

Comment by David Storch [ 20/Dec/18 ]

tess.avitabile, yep, exactly. In fact, I think it might be the case that 4.0 raises QueryPlanKilled or CursorNotFound, but never OperationFailed. Thanks, I'll request backport, at least to 4.0. Should we backport even further back as well?

Comment by Tess Avitabile (Inactive) [ 20/Dec/18 ]

Are you saying that it's possible for 4.0 to raise QueryPlanKilled if there is a collection drop? If so, then yes, we would be interested in a backport for the second piece.

Comment by David Storch [ 20/Dec/18 ]

tess.avitabile siyuan.zhou, I believe there is a bug affecting 4.0. There are two pieces to this work:

  • Make query always raise QueryPlanKilled instead of OperationFailed in FCV 4.2.
  • Fix the CollectionCloner to tolerate QueryPlanKilled.

Are you interested in a backport for the second piece? The first piece cannot be backported, so the CollectionCloner must continue to handle CursorNotFound and OperationFailed in older branches.

Generated at Thu Feb 08 04:49:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.