[SERVER-36403] Cluster aggregation error message should indicate which shard(s) raised an error
Created: 01/Aug/18
Updated: 29/Oct/23
Resolved: 13/Dec/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Diagnostics
Affects Version/s: None
Fix Version/s: 4.1.7

Type: Task
Priority: Major - P3
Reporter: Kyle Suarez
Assignee: Vlad Rachev (Inactive)
Resolution: Fixed
Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Query 2018-12-17, Query 2018-12-31
Participants:

 Description   

When a sharded aggregation throws, the error response does not report where in the cluster the error was generated. To test this, I wrote a simple $assert stage that always throws.

mongos> db.runCommand({aggregate: "coll", cursor: {}, pipeline: [{$assert: 1}, {$match: {x: 1}}, {$group: {_id: "$x"}}]})
{
        "ok" : 0,
        "errmsg" : "throwing from $assert",
        "code" : 50893,
        "codeName" : "Location50893",
        "operationTime" : Timestamp(1533156181, 222),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1533156243, 3),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

The error response has the same format if I force the assertion in the merging part of the pipeline instead:

mongos> db.runCommand({aggregate: "coll", cursor: {}, pipeline: [{$match: {x: 1}}, {$group: {_id: "$x"}}, {$assert: 1}]})
{
        "ok" : 0,
        "errmsg" : "throwing from $assert",
        "code" : 50893,
        "codeName" : "Location50893",
        "operationTime" : Timestamp(1533156181, 222),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1533156243, 3),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

We suspect the AsyncResultsMerger converts the AsyncRequestsSender::Response object from each shard into a status and immediately throws if it is non-OK. This loses important information: the error could indicate which shard it came from. Throwing on the first failure also hides any other errors that might have been collected.
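
To make the suspected behavior concrete, here is a minimal sketch in plain JavaScript. It is not the server's C++ code: the response shape (shardId, host, ok, errmsg, code) and both function names are assumptions made up for illustration. The first function models surfacing the first non-OK status as-is; the second keeps the shard identity and every collected failure.

```javascript
// Hypothetical per-shard responses; field names are illustrative only
// and do not mirror AsyncRequestsSender::Response.
const responses = [
  { shardId: "shardingtest-rs0", host: "kimchi:20000", ok: 1 },
  { shardId: "shardingtest-rs1", host: "kimchi:20001", ok: 0,
    errmsg: "throwing from $assert", code: 50893 },
];

// Models the suspected behavior: the first non-OK status is surfaced
// as-is, so the caller cannot tell which shard produced it.
function firstErrorOnly(responses) {
  for (const r of responses) {
    if (r.ok !== 1) {
      return { ok: 0, errmsg: r.errmsg, code: r.code };
    }
  }
  return { ok: 1 };
}

// Possible alternative: annotate the message with the failing shard and
// keep every failure instead of only the first one.
function errorWithShardContext(responses) {
  const failures = responses.filter((r) => r.ok !== 1);
  if (failures.length === 0) {
    return { ok: 1 };
  }
  const first = failures[0];
  return {
    ok: 0,
    errmsg: `${first.errmsg} (from shard ${first.shardId} at ${first.host})`,
    code: first.code,
    shardErrors: failures.map(({ shardId, host, errmsg, code }) =>
      ({ shardId, host, errmsg, code })),
  };
}

console.log(firstErrorOnly(responses));
console.log(errorWithShardContext(responses));
```

In this sketch the error code is left unchanged, so a client that only inspects code would behave the same under either function.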

This has implications for the improved $out project: a failing sharded $out would not indicate where the failures occurred, making diagnosis harder.



 Comments   
Comment by Githook User [ 13/Dec/18 ]

Author: vrachev <vlad.rachev@mongodb.com>

Message: SERVER-36403 Blacklist agg_error_reports_shard_host_and_port.js on sharding_last_stable_and_mixed_shards
Branch: master
https://github.com/mongodb/mongo/commit/6c6ec3833e4773a95803e9371ced79cfeaa34ea9

Comment by Githook User [ 13/Dec/18 ]

Author: vrachev <vlad.rachev@mongodb.com>

Message: SERVER-36403 Cluster aggregation error message should indicate which shard(s) raised an error
Branch: master
https://github.com/mongodb/mongo/commit/7c59c0287705363f4251d13a9929fe7cc7e1a2d8

Comment by Kyle Suarez [ 03/Aug/18 ]

For reference, commands like createIndexes have responses like

mongos> db.coll.createIndex({x: 1})
{
        "raw" : {
                "shardingtest-rs0/kimchi:20000" : {
                        "createdCollectionAutomatically" : false,
                        "numIndexesBefore" : 3,
                        "numIndexesAfter" : 3,
                        "note" : "all indexes already exist",
                        "ok" : 1
                },
                "shardingtest-rs1/kimchi:20001" : {
                        "createdCollectionAutomatically" : false,
                        "numIndexesBefore" : 3,
                        "numIndexesAfter" : 3,
                        "note" : "all indexes already exist",
                        "ok" : 1
                }
        },
        "ok" : 1,
        "operationTime" : Timestamp(1533156397, 2),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1533156397, 2),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

which are a little more friendly. So we should keep that format in mind when designing this one.
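
One way an aggregation error could borrow that shape is sketched below in plain JavaScript. This is only an illustration and not necessarily the format the fix adopted: the input shape and the function name buildRawErrorResponse are assumptions, and the keys reuse the replSet/host:port style from the createIndexes output above.

```javascript
// Hypothetical helper: folds per-shard results into a createIndexes-style
// response with a "raw" sub-document keyed by "replSet/host:port".
function buildRawErrorResponse(shardResults) {
  const raw = {};
  let firstFailure = null;
  for (const { replSet, host, result } of shardResults) {
    raw[`${replSet}/${host}`] = result;
    if (result.ok !== 1 && firstFailure === null) {
      firstFailure = result;
    }
  }
  if (firstFailure === null) {
    return { raw, ok: 1 };
  }
  // Top-level fields mirror the first failure so a client that only
  // reads errmsg/code sees the same values as before.
  return { raw, ok: 0, errmsg: firstFailure.errmsg, code: firstFailure.code };
}

// Illustrative input: one healthy shard and one that hit the $assert stage.
const response = buildRawErrorResponse([
  { replSet: "shardingtest-rs0", host: "kimchi:20000", result: { ok: 1 } },
  { replSet: "shardingtest-rs1", host: "kimchi:20001",
    result: { ok: 0, errmsg: "throwing from $assert", code: 50893 } },
]);
console.log(JSON.stringify(response, null, 2));
```

With this layout, the failing shard can be read directly from the raw keys while the top-level errmsg and code stay as they are today.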

Comment by Kyle Suarez [ 01/Aug/18 ]

Throwing into Query Team triage queue to debate whether or not we should do this as part of the $out project.
