We encountered SERVER-47553 on MongoDB instances hosted on Alibaba Cloud. We see that the mongos crash has been fixed; it was caused by an exception thrown from a destructor when ON_BLOCK_EXIT was used to call appendRequiredFieldsToResponse, which made mongos call std::terminate().
However, the fix for SERVER-47553 only prevents mongos from crashing; the root issue looks unresolved. After applying the patch, mongos still cannot connect to the mongod nodes, so we dug into the problem. There appears to be a bug in the monitoring-keys-for-HMAC thread on the primary node of the config server that prevents signing keys from being generated at the KeysRotationIntervalSec interval. When mongos calls KeysCollectionManager::refreshNow to ask the config server for new signing keys, the call fails with a timeout exception, which causes this problem.
I am sure the root cause is a bug in the "howMuchSleepNeedFor" function, which calculates the wake-up interval for the monitoring-keys-for-HMAC thread on the primary node of the config server:
auto millisBeforeExpire = 1000 * (expiredSecs - currentSecs);
Here expiredSecs and currentSecs are of type unsigned int, and the default wake-up interval is 90 days (7776000 seconds). After conversion to milliseconds this becomes 7776000000, which overflows because the maximum value of a 32-bit unsigned int is 4294967295.
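The arithmetic above can be reproduced in isolation. The following is only a minimal sketch, not the actual MongoDB source: the function names millisBeforeExpire32 and millisBeforeExpire64 are hypothetical, unsigned int is assumed to be 32 bits wide (modelled here as std::uint32_t), and widening to 64 bits is shown as one possible way to avoid the wrap-around, not as the fix that was actually applied.

```cpp
#include <cstdint>

// Hypothetical stand-in for the expression quoted above. The multiplication
// is carried out in 32-bit unsigned arithmetic, so the result wraps
// modulo 2^32 instead of holding the true millisecond count.
std::uint32_t millisBeforeExpire32(std::uint32_t expiredSecs,
                                   std::uint32_t currentSecs) {
    return 1000u * (expiredSecs - currentSecs);
}

// One possible alternative (an assumption, not the confirmed fix):
// widen both operands to 64 bits before subtracting and multiplying.
std::int64_t millisBeforeExpire64(std::uint32_t expiredSecs,
                                  std::uint32_t currentSecs) {
    return std::int64_t{1000} * (static_cast<std::int64_t>(expiredSecs) -
                                 static_cast<std::int64_t>(currentSecs));
}
```

With a difference of 7776000 seconds, millisBeforeExpire32 returns 3481032704 (that is, 7776000000 - 4294967296), while millisBeforeExpire64 returns the intended 7776000000.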
This causes a serious problem, because mongos cannot reconnect to the mongod nodes even after many restarts. A feasible workaround is to restart the config server nodes, which triggers the monitoring-keys-for-HMAC thread to generate new signing keys; mongos can then reconnect successfully.
- is related to
SERVER-52654 new signing keys not generated by the monitoring-keys-for-HMAC thread