[SERVER-75240] Emit log on ShardingDDLCoordinator definite failure Created: 24/Mar/23  Updated: 29/Oct/23  Resolved: 15/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Task Priority: Minor - P4
Reporter: Tommaso Tocci Assignee: Antonio Fuschetto
Resolution: Fixed Votes: 0
Labels: neweng, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Sprint: Sharding EMEA 2023-04-17
Participants:

 Description   

Currently we only emit this log, which does not contain information about the abort reason.

Many coordinators implement their own error logging:



 Comments   
Comment by Githook User [ 15/Apr/23 ]

Author: Antonio Fuschetto <antonio.fuschetto@mongodb.com> (afuschetto)

Message: SERVER-75240 Emit log on ShardingDDLCoordinator definite failure
Branch: master
https://github.com/mongodb/mongo/commit/9343018ba0b2ce733b95657bf79d8a419c916d1a

Comment by Antonio Fuschetto [ 12/Apr/23 ]

With the reason added to the DDL coordinator's release log message, the result is:

"msg":"Releasing sharding DDL coordinator",
"attr":{
   "coordinatorId":{
      "namespace":"test_db",
      "operationType":"movePrimary"
   },
   "reason":"NamespaceExists: unsharded collection with same namespace test_db.test_coll_1 already exists."
}
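For illustration only, a minimal Python sketch of how a log consumer might extract failures from this release message. It assumes each log line is one JSON object with top-level "msg" and "attr" keys, as in MongoDB's structured log format, and that a release entry without a "reason" does not describe a failure. The function name failed_ddl_operations and the sample line are hypothetical, not part of the server.

```python
import json

RELEASE_MSG = "Releasing sharding DDL coordinator"


def failed_ddl_operations(log_lines):
    """Yield (namespace, operation_type, reason) for release entries that carry a reason."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict) or entry.get("msg") != RELEASE_MSG:
            continue
        attr = entry.get("attr", {})
        reason = attr.get("reason")
        if not reason:
            continue  # assumption: no reason means no failure to report
        coordinator = attr.get("coordinatorId", {})
        yield coordinator.get("namespace"), coordinator.get("operationType"), reason


# Hypothetical sample line built from the fragment shown in this comment.
sample_line = json.dumps({
    "msg": RELEASE_MSG,
    "attr": {
        "coordinatorId": {"namespace": "test_db", "operationType": "movePrimary"},
        "reason": "NamespaceExists: unsharded collection with same namespace "
                  "test_db.test_coll_1 already exists.",
    },
})

failures = list(failed_ddl_operations([sample_line, "not a json line"]))
```

Each tuple names the namespace, the operation type, and the abort reason, which is the information the release message now exposes.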

The reported information is enough for troubleshooting, since (1) concurrent operations on the same namespace are not possible and (2) the result of each operation can now be determined. This implies that the custom log messages implemented at the coordinator level are now redundant and can be removed. Ideally, the release message could also report the name of the failed phase. An example of an existing custom log message is:

"msg":"Failed movePrimary operation",
"attr":{
   "db":"test_db",
   "to":"move_primary_basic-rs1",
   "phase":"clone",
   "error":"NamespaceExists: unsharded collection with same namespace test_db.test_coll_1 already exists."
}

The same cannot be said for the operation start messages, as they report operation-specific information (e.g., the destination shard for the movePrimary operation). An example follows:

"msg":"Running movePrimary operation",
"attr":{
   "db":"test_db",
   "to":"move_primary_basic-rs1"
}
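As a rough sketch of why the start message remains useful, the Python below pairs each movePrimary start message with the later release message for the same database, recovering the destination shard that the release message does not carry. It relies on the point made above that a namespace cannot have concurrent operations. The function name summarize_move_primary and the one-JSON-object-per-line input are assumptions for illustration.

```python
import json

START_MSG = "Running movePrimary operation"
RELEASE_MSG = "Releasing sharding DDL coordinator"


def summarize_move_primary(log_lines):
    """Pair each movePrimary start entry with the next release entry for the same database."""
    destinations = {}  # db name -> destination shard from the start message
    results = []
    for line in log_lines:
        entry = json.loads(line)
        attr = entry.get("attr", {})
        if entry.get("msg") == START_MSG:
            destinations[attr.get("db")] = attr.get("to")
        elif entry.get("msg") == RELEASE_MSG:
            coordinator = attr.get("coordinatorId", {})
            if coordinator.get("operationType") != "movePrimary":
                continue
            db = coordinator.get("namespace")
            results.append({
                "db": db,
                "to": destinations.pop(db, None),  # None if no start entry was seen
                "reason": attr.get("reason"),
            })
    return results


# Hypothetical log lines assembled from the fragments in this ticket.
lines = [
    json.dumps({"msg": START_MSG,
                "attr": {"db": "test_db", "to": "move_primary_basic-rs1"}}),
    json.dumps({"msg": RELEASE_MSG,
                "attr": {"coordinatorId": {"namespace": "test_db",
                                           "operationType": "movePrimary"},
                         "reason": "NamespaceExists: unsharded collection with same "
                                   "namespace test_db.test_coll_1 already exists."}}),
]

summary = summarize_move_primary(lines)
```

The join key is the database name, which appears as "db" in the start message and as "namespace" in the coordinator id of the release message.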

Generated at Thu Feb 08 06:29:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.