[SERVER-52654] new signing keys not generated by the monitoring-keys-for-HMAC thread Created: 06/Nov/20  Updated: 22/Jan/24  Resolved: 10/Dec/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.10
Fix Version/s: 4.0.22, 3.6.22, 4.4.3, 4.2.12

Type: Bug
Priority: Critical - P2
Reporter: Jingcheng Li
Assignee: Jack Mulrow
Resolution: Fixed
Votes: 2
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-53337 Mongos hangs and stop responding Closed
is duplicated by SERVER-53540 DBException handling request, closing... Closed
is duplicated by SERVER-57738 sharding cluster, clients cannot conn... Closed
Related
related to SERVER-48709 signing key generator thread on confi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0, v3.6
Sprint: Sharding 2020-12-14
Participants:
Case:

 Description   
Issue Status as of Jan 7, 2021

ISSUE DESCRIPTION AND IMPACT

The bug causes a failure of the thread that creates new Hash-based Message Authentication Code (HMAC) signing keys every 90 days.

New keys are generated when the Config Server Replica Set (CSRS) fails over. So, if a failover does not happen on the CSRS for 90 days, operations across the sharded cluster will start to fail and will not succeed again until the CSRS fails over.

DIAGNOSIS AND AFFECTED VERSIONS

MongoDB 4.2.2 to 4.2.11 and 4.4.0 to 4.4.2 are affected. The bug may exist in previous versions but mechanisms other than failover cause the CSRS primary to re-generate the HMAC keys successfully in those versions.

To check the expiration date of the HMAC signing keys, connect to a mongos node or the CSRS primary with the mongo shell, authenticate as a user with admin privileges, and run the following command. The cluster will experience this issue when all the HMAC signing keys have expired.

db.getSiblingDB("admin").system.keys.find().map(k => { return { _id: k._id, purpose: k.purpose, expiresAt: new Date(k.expiresAt.getTime()*1000) }})
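As a rough sketch of how the result of this query could be interpreted, the helper below reports each key's expiration date and whether every key has already expired. The function name summarizeKeyExpiry and its output fields are hypothetical and not part of MongoDB; it assumes, as the query above does, that each document's expiresAt exposes getTime() returning seconds.

```javascript
// Hypothetical helper: summarize HMAC key documents from admin.system.keys.
// Assumes k.expiresAt.getTime() returns seconds, as the query above relies on.
function summarizeKeyExpiry(keys, now) {
  const summary = keys.map(k => {
    const expiresAt = new Date(k.expiresAt.getTime() * 1000);
    return {
      _id: k._id,
      purpose: k.purpose,
      expiresAt: expiresAt,
      expired: expiresAt <= now
    };
  });

  // Latest expiration date across all keys (null when there are no keys).
  const latestExpiry = summary.reduce(
    (latest, k) => (latest === null || k.expiresAt > latest ? k.expiresAt : latest),
    null
  );

  return {
    keys: summary,
    latestExpiry: latestExpiry,
    // Only report "all expired" when at least one key document was found.
    allExpired: summary.length > 0 && summary.every(k => k.expired)
  };
}
```

In the mongo shell this could be called as summarizeKeyExpiry(db.getSiblingDB("admin").system.keys.find().toArray(), new Date()). If allExpired is true, the cluster is in the state described above.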

To perform this check, the database user must have permission to query the admin.system.keys collection. To grant this permission, create a new role with the find action on the admin.system.keys collection and grant this role to an admin user with the following commands, replacing ADMIN with the username:

use admin;
 
db.createRole({
  role: "query_keys",
  privileges: [
     { resource: { db: "admin", collection: "system.keys"}, actions: [ "find" ] },
  ],
  roles: [  ]
});
 
db.grantRolesToUser("ADMIN", ["query_keys"])

REMEDIATION AND WORKAROUNDS

The fix is included in the 3.6.22, 4.0.22, 4.2.12 and 4.4.3 production releases and later. To prevent the issue before upgrading to a fixed release, step down the CSRS primary to initiate a failover before the 90-day limit is reached.
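A minimal sketch of the step-down workaround, assuming a mongo shell session connected directly to the current CSRS primary. The helper name stepDownConfigPrimary and the 60-second default are illustrative choices; replSetStepDown itself is the standard replica set command.

```javascript
// Hypothetical helper: ask the current primary to step down so that another
// member is elected. adminDb is a handle to the "admin" database on the
// CSRS primary, e.g. db.getSiblingDB("admin") in the mongo shell.
function stepDownConfigPrimary(adminDb, stepDownSecs = 60) {
  // replSetStepDown takes the number of seconds during which the node that
  // stepped down is ineligible to become primary again.
  return adminDb.runCommand({ replSetStepDown: stepDownSecs });
}
```

In the shell this would be invoked as stepDownConfigPrimary(db.getSiblingDB("admin")). On 4.0 and earlier the step-down closes client connections, so the shell may report a network error even though the step-down succeeded.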

Original Description

I see that the overflow issue SERVER-48709 is fixed, but the problem still occurs after we upgraded the config server to 4.2.10: new signing keys are not generated by the monitoring-keys-for-HMAC thread. After 90 or 180 days, when the signing keys have expired, mongos cannot connect to the mongod server nodes. We have to restart the config server so that new signing keys are generated when the monitoring-keys-for-HMAC thread starts, after which mongos connects to the mongod server nodes successfully again.
I think SERVER-47553 and SERVER-48709 may share the same root cause, which has not yet been dug out. Because this issue can cause unexpected downtime for our service, it is a very serious problem. I hope it can be fixed ASAP. Thanks!



 Comments   
Comment by Ilan M [ 13/Apr/21 ]

"To prevent the issue before upgrading to a fixed release, step down the CSRS primary to initiate a failover before the 90 days limit is reached."

 

Could you clarify whether, as a workaround, we need to restart just the config primary or all config nodes? Thank you.

Comment by jun park [ 06/Apr/21 ]

While using version 4.4.1, reads and writes stopped working and mongos could not connect.

When I checked the expiration date of the key, it matched the time when mongos could no longer connect, which was also just after a wrong query was sent to MongoDB.

Could it be triggered by a wrong query?

Comment by Aayushi Mangal [ 01/Mar/21 ]

Hi Jack Mulrow / jcli.china@gmail.com,

Could you please share the steps to reproduce this issue? I tried setting the system clock ahead, but that did not work. I would like to reproduce it on 4.2.10.

Comment by Githook User [ 10/Dec/20 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-52654 HMAC keys monitoring thread should never sleep longer than 20 days

(cherry picked from commit e804031ae4ea69c2cfbfcca47202fcc468d826b2)
Branch: v4.0
https://github.com/mongodb/mongo/commit/5fbb02045b2775ae69376c8c60bf20df90c99383

Comment by Githook User [ 10/Dec/20 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-52654 HMAC keys monitoring thread should never sleep longer than 20 days

(cherry picked from commit e804031ae4ea69c2cfbfcca47202fcc468d826b2)
Branch: v4.2
https://github.com/mongodb/mongo/commit/a0bd4ff103f31a5f96438d47eddcc915bdf2cdef

Comment by Githook User [ 10/Dec/20 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-52654 HMAC keys monitoring thread should never sleep longer than 20 days

(cherry picked from commit e804031ae4ea69c2cfbfcca47202fcc468d826b2)
Branch: v3.6
https://github.com/mongodb/mongo/commit/907de1c96699460cd5be04ad579d14567c90639f

Comment by Githook User [ 10/Dec/20 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-52654 HMAC keys monitoring thread should never sleep longer than 20 days

(cherry picked from commit e804031ae4ea69c2cfbfcca47202fcc468d826b2)
Branch: v4.4
https://github.com/mongodb/mongo/commit/6dd2aee6bd51adcba6c34520b28d956379bf0a3d

Comment by Githook User [ 10/Dec/20 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-52654 HMAC keys monitoring thread should never sleep longer than 20 days
Branch: master
https://github.com/mongodb/mongo/commit/e804031ae4ea69c2cfbfcca47202fcc468d826b2
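The idea in the commit message, that the thread should never sleep longer than 20 days, can be illustrated with a small sketch. This is not the server's C++ code: planSleeps, MS_PER_DAY and MAX_SLEEP_MS are hypothetical names, and splitting a long wait into chunks is only one way to picture a capped sleep interval.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
// 20 days is 1,728,000,000 ms, which fits in a 32-bit signed int
// (maximum 2,147,483,647).
const MAX_SLEEP_MS = 20 * MS_PER_DAY;

// Hypothetical helper: split a long wait into individual sleeps that never
// exceed MAX_SLEEP_MS.
function planSleeps(totalWaitMs) {
  const sleeps = [];
  let remaining = totalWaitMs;
  while (remaining > 0) {
    const next = Math.min(remaining, MAX_SLEEP_MS);
    sleeps.push(next);
    remaining -= next;
  }
  return sleeps;
}

// A 90-day wait becomes four 20-day sleeps followed by one 10-day sleep.
const ninetyDayPlan = planSleeps(90 * MS_PER_DAY);
```

Every entry in ninetyDayPlan is small enough to be passed as a millisecond timeout in a signed 32-bit integer.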

Comment by DEokhyun Lee [ 09/Dec/20 ]

Hi~

We had the same problem while using 4.2.

It seems that this issue may also occur in 3.6 and 4.0.
Do you plan to apply the same patch to 3.6 and 4.0?

Thank you~

Comment by Jingcheng Li [ 12/Nov/20 ]

Hello,

While trying to reproduce this problem, I used the pstack command to dump the call stack of the monitoring-keys-for-HMAC thread and then did some text processing on the pstack output. I noticed that the monitoring-keys-for-HMAC thread ultimately uses poll to sleep and wait for a wake-up event until a deadline is reached. Unfortunately, the third argument of the 'poll' system call is a signed int, with the time expressed in milliseconds. The howMuchSleepNeedFor function uses a timeout of about 90 days (7776000000 ms), which overflows a signed int: after the type conversion the value becomes negative (-813934592), which causes poll to sleep indefinitely, so the thread is never woken up.

So I think the solution is simple: adjusting the sleep interval to a value less than INT_MAX will fix this issue.

FYI,

Thanks!

cat pstack.log | awk 'BEGIN { s = ""; } /^Thread/ { print s; s = ""; } /^#/ { if (s != "" ) { s = s "," $4} else { s = $4 } } END { print s }'  | sort | uniq -c | sort -r -n -k 1,1 | grep _doPeriodicRefresh
1 poll,mongo::transport::TransportLayerASIO::BatonASIO::run(mongo::ClockSource),mongo::transport::TransportLayerASIO::BatonASIO::run_until(mongo::ClockSource,,mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&,,mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(mongo::stdx::condition_variable&,,mongo::KeysCollectionManager::PeriodicRunner::_doPeriodicRefresh(mongo::ServiceContext*,,std::thread::State_impl<std::thread::Invoker<std::tuple<mongo::KeysCollectionManager::PeriodicRunner::start(mongo::ServiceContext*,,execute_native_thread_routine,start_thread,clone
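The arithmetic in the comment above can be checked with a short sketch. It emulates the conversion to a 32-bit signed int using JavaScript's bitwise OR, which keeps only the low 32 bits of a number; toInt32, NINETY_DAYS_MS and INT_MAX are names chosen for this illustration and do not come from the server code.

```javascript
const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000; // 7776000000
const INT_MAX = 2147483647;                      // largest 32-bit signed int

// Emulate conversion to a 32-bit signed int: "| 0" keeps the low 32 bits
// and interprets them as a two's-complement value.
function toInt32(n) {
  return n | 0;
}

const truncated = toInt32(NINETY_DAYS_MS); // -813934592
const overflows = NINETY_DAYS_MS > INT_MAX; // true
```

A negative timeout passed to poll means an infinite wait, which matches the observation that the thread is never woken up.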

Comment by Kelsey Schubert [ 09/Nov/20 ]

Thanks for the report, jcli.china@gmail.com. We'll investigate.

Generated at Thu Feb 08 05:28:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.