[SERVER-48128] mapreduce and aggregation with output don't work on rs to cluster upgrade Created: 12/May/20  Updated: 29/Oct/23  Resolved: 27/Jul/20

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 4.5.1
Fix Version/s: 4.7.0, 4.4.2

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Bernard Gorman
Resolution: Fixed Votes: 0
Labels: qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-59924 Error executing aggregate with $out w... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Steps To Reproduce:
  1. Set up a replica set
  2. Add some data
  3. Restart the replica set with --shardsvr
  4. Add the replica set to a sharded cluster
  5. Run a mapreduce with output on the primary directly
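
The operations in step 5 can be sketched as plain command documents. This is a minimal illustration, not the test added by SERVER-47701: the source collection name, the output collection for the aggregation, and the map/reduce bodies are hypothetical. Only the database "test" and the output collection "mrOutput" appear in this ticket.

```python
# Sketch of the two kinds of "command with output" from step 5.
# Collection names and the map/reduce functions are assumptions for illustration.

def build_map_reduce_command(source_coll, out_coll):
    """Build a mapReduce command document that writes its result to out_coll."""
    return {
        "mapReduce": source_coll,  # the command name must be the first key
        "map": "function() { emit(this.key, 1); }",
        "reduce": "function(key, values) { return Array.sum(values); }",
        "out": {"replace": out_coll},
    }


def build_out_aggregate_command(source_coll, out_coll):
    """Build an aggregate command whose pipeline ends in a $out stage."""
    return {
        "aggregate": source_coll,  # the command name must be the first key
        "pipeline": [{"$out": out_coll}],
        "cursor": {},
    }


if __name__ == "__main__":
    print(build_map_reduce_command("coll", "mrOutput"))
    print(build_out_aggregate_command("coll", "aggOutput"))
```

To reproduce, documents like these would be sent to the "test" database over a connection made directly to the shard primary rather than through mongos.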
Sprint: Query 2020-06-01, Query 2020-06-15, Query 2020-06-29, Query 2020-07-13, Query 2020-07-27, Query 2020-08-10
Participants:

 Description   

DOCSP-10021 describes the basic steps followed on Atlas to upgrade from a replica set to a sharded cluster. SERVER-47701 adds a test to ensure this process works. However, mapReduce and aggregation commands with output do not work when connected directly to the primary mongod; they fail with an error similar to this:

 uncaught exception: Error: command failed: {
 	"operationTime" : Timestamp(1589285086, 3),
 	"ok" : 0,
 	"errmsg" : "MapReduce internal error :: caused by :: don't know dbVersion for database test",
 	"code" : 249,
 	"codeName" : "StaleDbVersion",
 	"db" : "test",
 	"vReceived" : {
 		"uuid" : UUID("f4c5c7df-b0bc-4401-a1b4-98135ec255bf"),
 		"lastMod" : 1
 	},
 	"$gleStats" : {
 		"lastOpTime" : {
 			"ts" : Timestamp(1589285085, 4),
 			"t" : NumberLong(2)
 		},
 		"electionId" : ObjectId("7fffffff0000000000000002")
 	},
 	"lastCommittedOpTime" : Timestamp(1589285086, 3),
 	"$configServerState" : {
 		"opTime" : {
 			"ts" : Timestamp(1589285086, 1),
 			"t" : NumberLong(1)
 		}
 	},
 	"$clusterTime" : {
 		"clusterTime" : Timestamp(1589285086, 3),
 		"signature" : {
 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
 			"keyId" : NumberLong(0)
 		}
 	}
 }

And it looks like it fails while doing a listCollections:

d20020| 2020-05-12T14:19:53.932+02:00 D2 COMMAND  [conn35] About to run the command{"db":"test","commandArgs":{"listCollections":1,"filter":{"name":"mrOutput"},"databaseVersion":{"uuid":{"$uuid":"110b9724-a042-4c7d-85aa-2a9dfcc31c0d"},"lastMod":1},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1589285993,"i":4}},"signature":{"hash":{"$binary":{"base64":"PkCmXSWB1yC5LPJZOPOYdjkFbt0=","subType":"0"}},"keyId":6825931235076866061}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1589285992,"i":8}},"t":1}},"$db":"test"}}
d20020| 2020-05-12T14:19:53.933+02:00 W  -        [conn35] DBException thrown{"error":{"code":249,"codeName":"StaleDbVersion","errmsg":"don't know dbVersion for database test","db":"test","vReceived":{"uuid":{"$uuid":"110b9724-a042-4c7d-85aa-2a9dfcc31c0d"},"lastMod":1}}}

With the following stacktrace:

...
mongo::uassertedWithLocation+0x315 [.../src/mongo/util/assert_util.cpp @ 256]
<lambda_e2366172eb6715dc0c9d847effa7a2b3>::operator()+0x25C [.../src/mongo/db/s/database_sharding_state.cpp @ 149]
mongo::DatabaseShardingState::checkDbVersion+0x2A0 [.../src/mongo/db/s/database_sharding_state.cpp @ 149]
mongo::AutoGetDb::AutoGetDb+0x156 [.../src/mongo/db/catalog_raii.cpp @ 54]
mongo::`anonymous namespace'::CmdListCollections::run+0x63B [.../src/mongo/db/commands/list_collections.cpp @ 292]
mongo::BasicCommand::runWithReplyBuilder+0xAA [.../src/mongo/db/commands.h @ 807]
mongo::BasicCommandWithReplyBuilderInterface::Invocation::run+0x178 [.../src/mongo/db/commands.cpp @ 770]
mongo::CommandHelpers::runCommandInvocation+0xDC [.../src/mongo/db/commands.cpp @ 186]
...



 Comments   
Comment by Githook User [ 11/Sep/20 ]

Author:

{'name': 'Bernard Gorman', 'email': 'bernard.gorman@gmail.com', 'username': 'gormanb'}

Message: SERVER-48128 ShardServerProcessInterface should only version internal commands if the parent op is versioned

(cherry picked from commit 64c7ccfac9ae6b4765481d6158e6447a69b2914b)
Branch: v4.4
https://github.com/mongodb/mongo/commit/9b571a2576cb9b584625df82e84ce169d69745ca

Comment by Githook User [ 27/Jul/20 ]

Author:

{'name': 'Bernard Gorman', 'email': 'bernard.gorman@gmail.com', 'username': 'gormanb'}

Message: SERVER-48128 ShardServerProcessInterface should only version internal commands if the parent op is versioned
Branch: master
https://github.com/mongodb/mongo/commit/64c7ccfac9ae6b4765481d6158e6447a69b2914b

Comment by Arun Banala [ 16/Jun/20 ]

The issue here is that the aggregation request makes an internal listCollections request as part of the $out stage. We append a dbVersion to this request. The listCollections command validates the dbVersion it receives against the dbVersion present in the cache (DatabaseShardingState). If there is a mismatch, it throws an error which propagates all the way to the client.

One possible fix is to treat the requests sent directly to a node as un-versioned. We could attach the dbVersion to the internal commands only when the client is mongos.

We need to fix this issue in all the previous versions as well, since this is part of the Atlas workflow for upgrading from a replica set. I've tested the workflow on 4.2 and the aggregate $out command doesn't fail there, so this seems to be an issue only on 4.4 and master.
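
The failure mode and the eventual fix ("only version internal commands if the parent op is versioned") can be modeled with a toy sketch. This is not MongoDB source code: ShardNode, run_out_stage, and their parameters are hypothetical names, and the version check is simplified to an equality test against a per-database cache.

```python
from dataclasses import dataclass
from typing import Optional


class StaleDbVersion(Exception):
    """Toy counterpart of the StaleDbVersion error (code 249) in this ticket."""
    code = 249


@dataclass(frozen=True)
class DbVersion:
    uuid: str
    last_mod: int


class ShardNode:
    """Hypothetical stand-in for a shard and its cached database versions."""

    def __init__(self):
        self.cached = {}  # database name -> DbVersion

    def list_collections(self, db, received: Optional[DbVersion] = None):
        # Unversioned requests skip the check entirely.
        if received is not None:
            known = self.cached.get(db)
            if known is None:
                raise StaleDbVersion(f"don't know dbVersion for database {db}")
            if known != received:
                raise StaleDbVersion(f"dbVersion mismatch for database {db}")
        return []  # the collection listing itself is irrelevant to the sketch


def run_out_stage(node, db, parent_version, routing_version, apply_fix):
    """Issue the internal listCollections that a $out stage needs.

    parent_version: version on the client's request (None on a direct connection).
    routing_version: version the pre-fix behavior attaches regardless of the parent.
    """
    attached = parent_version if apply_fix else routing_version
    return node.list_collections(db, attached)


if __name__ == "__main__":
    node = ShardNode()  # freshly converted shard: nothing cached for "test"
    version = DbVersion("110b9724-a042-4c7d-85aa-2a9dfcc31c0d", 1)
    try:
        run_out_stage(node, "test", None, version, apply_fix=False)
    except StaleDbVersion as exc:
        print("pre-fix:", exc)
    print("post-fix:", run_out_stage(node, "test", None, version, apply_fix=True))
```

In this model a direct connection carries no version, so with the fix the internal request stays unversioned and the check is skipped, while a request that arrives versioned is still validated.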

Comment by Kaloian Manassiev [ 12/May/20 ]

Just a heads-up that in this case, these are direct writes to a shard, so there should not be StaleDb/ShardVersion being thrown at all (hence nothing to be retried). So likely they are not duplicates.

Comment by Craig Homa [ 12/May/20 ]

Hey Arun, this looks like it is related to SERVER-47420. Please investigate and close if this is a duplicate.
