[SERVER-31766] Deprecate creation of new sharded output collection in mapReduce Created: 30/Oct/17  Updated: 06/Dec/22  Resolved: 27/Nov/19

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Daniel Pasette (Inactive)
Assignee: Backlog - Query Team (Inactive)
Resolution: Duplicate
Votes: 0
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-42134 Deprecate mapparams, out.sharded, and... Closed
Related
related to DOCS-9834 MapReduce output on sharded clusters Closed
Assigned Teams:
Query
Sprint: Query 2017-11-13, Query 2019-12-16
Participants:

 Description   

Starting in v3.6, creation of new sharded output collections should be deprecated.

When this capability is used, log a warning and increment a serverStatus counter so that we can track how often it's used.



 Comments   
Comment by Nicholas Zolnierz [ 07/Nov/19 ]

Flagging for scheduling. I believe we have this covered by generically deprecating the 'out.sharded' option.

Comment by Gregory McKeon (Inactive) [ 08/Feb/18 ]

david.storch, based on an offline conversation with Kal, we've found a workaround that gets rid of the distributed lock without removing this capability, so it is no longer a blocker for sharding.

Bumping the ticket back to Query to triage whether to deprecate this capability in 4.0 for other reasons, such as the difficulty of selecting a shard key.

Comment by Nicholas Zolnierz [ 30/Jan/18 ]

greg.mckeon, this is certainly possible. The most logical place is in this function:

https://github.com/mongodb/mongo/blob/master/src/mongo/s/commands/cluster_map_reduce_cmd.cpp#L635

IIRC, there are potential issues with this approach: mongos may think the sharded output collection exists when it doesn't, and vice versa. Are we OK with this sort of uncertainty when logging the warning? For the stats, it seemed much more important that the number be accurate.

CC david.storch in case we want to bring this into the current sprint.

Comment by Gregory McKeon (Inactive) [ 30/Jan/18 ]

Tracking an offline conversation with Dan: it's OK if we just log a deprecation warning occasionally on any mapReduce a user runs (sharded or not), as long as the warning clarifies that it is specifically the creation of a new sharded output collection that is deprecated. nicholas.zolnierz, does that make this possible? If so, we'd like to get a deprecation message into the docs and backported to 3.6 ASAP.
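
For illustration only, "log a deprecation warning occasionally" could look roughly like the rate-limited logger below. This is a Python sketch, not server code: the class name OccasionalWarning, the 60-second default interval, and the message wording are all assumptions made for the example.

```python
import logging
import time

# Hypothetical message text; the ticket only requires that the warning make clear
# that creating a new sharded output collection is the deprecated part.
DEPRECATION_MESSAGE = (
    "mapReduce: creation of a new sharded output collection is deprecated"
)


class OccasionalWarning:
    """Logs a message at most once per min_interval_secs (hypothetical helper)."""

    def __init__(self, message, min_interval_secs=60.0, clock=time.monotonic):
        self._message = message
        self._min_interval = min_interval_secs
        self._clock = clock  # injectable so the behavior can be tested
        self._last_logged = None

    def maybe_log(self, logger):
        """Log the message unless it was logged too recently. Returns True if logged."""
        now = self._clock()
        if self._last_logged is not None and now - self._last_logged < self._min_interval:
            return False
        self._last_logged = now
        logger.warning(self._message)
        return True
```

Calling maybe_log on every mapReduce, sharded or not, sidesteps the question of whether the router knows the true state of the output collection, at the cost of warning users who never touch the deprecated path.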

CC: schwerin kaloian.manassiev

Comment by Nicholas Zolnierz [ 13/Nov/17 ]

Moving back to the Query team backlog as this has fallen out of scope for 3.6. Based on latest discussion (in-person and CR), tracking the counter only in mongos would be inaccurate as the sharding catalog cache is not guaranteed to be up-to-date. One option to consider when revisiting this ticket is to modify the ConfigSvrShardCollectionCommand, which is sent from mongos to the configsvr, to include a detail in the response indicating whether the collection previously existed or not. Additional information may be needed in the request to allow the configsvr to identify that the originating command is a cluster mapReduce through mongos.
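
A toy model of that option, written in Python purely to show the data flow. The classes and the field names collectionPreviouslyExisted and fromMapReduce are hypothetical; the ticket does not define the actual request or response format.

```python
class ConfigServer:
    """Toy stand-in for the config server, the authority on sharded collections."""

    def __init__(self):
        self._sharded = set()
        self.map_reduce_creations = 0  # tracked with authoritative knowledge

    def shard_collection(self, namespace, from_map_reduce=False):
        # from_map_reduce models the extra request detail (hypothetical name)
        # that identifies the originating command as a cluster mapReduce.
        previously_existed = namespace in self._sharded
        self._sharded.add(namespace)
        if from_map_reduce and not previously_existed:
            self.map_reduce_creations += 1
        # The reply field name is an assumption made for this sketch.
        return {"ok": 1, "collectionPreviouslyExisted": previously_existed}


class Mongos:
    """Toy router that trusts the reply instead of its own catalog cache."""

    def __init__(self, config_server):
        self._config = config_server
        self.new_sharded_output_count = 0

    def map_reduce_sharded_out(self, namespace):
        reply = self._config.shard_collection(namespace, from_map_reduce=True)
        if not reply["collectionPreviouslyExisted"]:
            self.new_sharded_output_count += 1
        return reply
```

With two routers sharing one config server, only the router whose request actually created the collection increments its counter, regardless of what either router had cached.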

Comment by David Storch [ 09/Nov/17 ]

Per offline discussion, our plan is to deprecate the creation of a new sharded output collection, or a mapReduce to an existing unsharded collection with the sharded:true flag. In other words, we will deprecate mapReduce's ability to automatically shard and pre-split an output collection. Output to a preexisting sharded collection will not be deprecated, nor will output to a new unsharded collection. This description holds across all merge modes (replace, merge, and reduce).
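
The rule above can be restated as a small predicate. This is a Python sketch of the decision only; OutputTarget and is_deprecated_output are names invented for the example and do not exist in the server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputTarget:
    """State of the output collection before the mapReduce runs (hypothetical type)."""
    exists: bool
    is_sharded: bool = False  # only meaningful when exists is True


def is_deprecated_output(sharded_flag, target):
    """True when mapReduce itself would have to shard and pre-split the output.

    The merge mode (replace, merge, reduce) is deliberately not a parameter,
    since the rule is stated to hold across all of them.
    """
    if not sharded_flag:
        # Output to an unsharded collection, new or existing, is not deprecated.
        return False
    if target.exists and target.is_sharded:
        # Output to a preexisting sharded collection is not deprecated.
        return False
    # sharded: true with a nonexistent or existing-but-unsharded target.
    return True
```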

Comment by David Storch [ 03/Nov/17 ]

nicholas.zolnierz, your proposal for the new counter seems reasonable to me, though we should verify with downstream consumers of serverStatus() that their code will be tolerant of this addition. Looks like it should be fine to add a new ServerStatusMetricField<Counter64> inside the map reduce command.

The only remaining question I have is whether we need to deprecate the replace mode with sharded output. As schwerin mentioned, this might not work correctly without being able to rename a sharded collection.

Comment by Nicholas Zolnierz [ 03/Nov/17 ]

david.storch, schwerin As discussed in person, the first step is to understand the behavior of MR to an output collection with the various modes. Below is a table of the tests I ran and the results (the top row represents the "state" of the output collection, and the 1st column is the mode). I used this test as a reference:

Mode    | Non-existent | Sharded from output of previous MR | Sharded but with different shard key | Exists but not sharded
Merge   | pass         | pass                               | Fails to run if previous docs exist  | Fails with incorrect count if previous docs exist
Replace | pass         | pass                               | pass                                 | pass
Reduce  | pass         | pass                               | Fails to run if previous docs exist  | Fails with incorrect count if previous docs exist

As for the counter, I think the best place for it would be under metrics.commands.mapReduce.shardedOutNewCollection. However, I'm not sure the counters/commands allow for arbitrary fields without including the counter in all commands. Thoughts?
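
To make the proposed location concrete, here is a Python sketch that treats the serverStatus output as a nested dictionary and increments a counter at that dotted path. The helper bump_counter is hypothetical and says nothing about how the server registers metrics; it only shows the intended shape of the document.

```python
def bump_counter(server_status, dotted_path, amount=1):
    """Increment the counter at dotted_path, creating parent documents as needed.

    Returns the new counter value. Hypothetical helper for illustration only.
    """
    node = server_status
    *parents, leaf = dotted_path.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = node.get(leaf, 0) + amount
    return node[leaf]


# Path proposed in the comment above.
PROPOSED_PATH = "metrics.commands.mapReduce.shardedOutNewCollection"
```

Applied to a document that already has per-command fields such as failed and total under metrics.commands.mapReduce, the new field would appear as a sibling of those fields, which is why downstream consumers of serverStatus need to tolerate an unexpected key there.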

Comment by David Storch [ 31/Oct/17 ]

Clarifying that we only intend to deprecate the creation of new sharded collections. Also, should this include deprecation of sharded: true with the replace action, regardless of whether the output collection already exists?
