[SERVER-53415] Intent Lock timeout lead to server crash Created: 17/Dec/20  Updated: 21/Jan/21  Resolved: 14/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.1
Fix Version/s: 4.9.0, 4.4.4

Type: Question
Priority: Major - P3
Reporter: Yan Zhou
Assignee: Dianna Hohensee (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: crash.log
Issue Links:
Depends
Related
is related to SERVER-53720 Tag onCommit/onRollback Changes with ... Closed
is related to SERVER-53745 Improve lock identification informati... Closed
Sprint: Execution Team 2021-01-25
Participants:

 Description   

Our MongoDB server has crashed on a few occasions, each time right after logging the following message:

 

{"t":…,"s":"F", "c":"CONTROL", "id":4757800, "ctx":"conn455","msg":"Writing fatal message","attr":{"message":"DBException::toString(): LockTimeout: Unable to acquire IS lock on '{16140901064495857683}' within 5ms.\nActual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)24, mongo::ExceptionForCat<(mongo::ErrorCategory)2> >\n"}}

 
I can understand that a lock timeout causes the write to fail, but why would it be a fatal error? It could be caused entirely by a transient hardware issue, such as slow disk access.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 14/Jan/21 ]

SERVER-48994 has now been backported to the v4.4 branch, so I believe this is resolved.

Comment by Dianna Hohensee (Inactive) [ 13/Jan/21 ]

I see that in v4.4 sharding takes a MODE_IS ResourceMutex in an onCommit handler here – and registered here. In master, unlike v4.4, there's an UninterruptibleLockGuard on the same code. That was put in by SERVER-48994, which appears to have been addressing a 5ms transactions lock timeout. This seems very likely to be the problem – or an identical one. So SERVER-48994 should be backported.
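To make the failure mode above concrete, here is a minimal toy model in Python. It is not MongoDB code: `ToyLocker`, `commit`, `make_handler`, `LockTimeout`, and `FatalError` are all hypothetical names invented for illustration, and a plain exclusive `threading.Lock` stands in for the MODE_IS ResourceMutex. It assumes, as this ticket suggests, that an exception escaping a commit handler cannot be recovered from because the write has already been committed.

```python
import threading
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when a lock cannot be acquired within the locker's deadline."""


class FatalError(Exception):
    """Stands in for the process aborting; a real server would terminate."""


class ToyLocker:
    """Hypothetical locker with a per-operation maximum lock timeout."""

    def __init__(self, max_lock_timeout_s):
        self.max_lock_timeout_s = max_lock_timeout_s
        self._uninterruptible_depth = 0

    @contextmanager
    def uninterruptible(self):
        """While active, lock requests wait indefinitely instead of timing out."""
        self._uninterruptible_depth += 1
        try:
            yield
        finally:
            self._uninterruptible_depth -= 1

    @contextmanager
    def lock(self, mutex):
        # timeout=-1 tells threading.Lock.acquire to block without a deadline.
        timeout = -1 if self._uninterruptible_depth else self.max_lock_timeout_s
        if not mutex.acquire(timeout=timeout):
            raise LockTimeout(
                f"unable to acquire lock within {timeout * 1000:.0f}ms")
        try:
            yield
        finally:
            mutex.release()


def commit(on_commit_handlers):
    """Run handlers after the (imaginary) storage commit has already happened."""
    for handler in on_commit_handlers:
        try:
            handler()
        except Exception as exc:
            # The write is already committed and cannot be rolled back, so a
            # failing handler leaves no safe way to continue.
            raise FatalError(f"exception in onCommit handler: {exc}") from exc


def make_handler(locker, mutex, use_guard):
    """Build a commit handler that takes `mutex`, with or without the guard."""
    def handler():
        if use_guard:
            with locker.uninterruptible(), locker.lock(mutex):
                pass  # handler work would go here
        else:
            with locker.lock(mutex):
                pass  # handler work would go here
    return handler
```

With a 5ms timeout (`ToyLocker(0.005)`) and the mutex held by another party, `commit([make_handler(locker, mutex, use_guard=False)])` raises `FatalError`, which mirrors the crash reported here. With `use_guard=True` the handler simply waits until the mutex is released, which is the behaviour the UninterruptibleLockGuard appears to provide in master.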

Comment by Bruce Lucas (Inactive) [ 17/Dec/20 ]

Thanks yan.zhou@cubistsystematic.com, we will investigate.

Comment by Yan Zhou [ 17/Dec/20 ]

I have attached the tail of the logs, starting from where it still looks normal (transaction successful etc).

I notice that the crash appears to happen when a chunk migration is in progress at the same time a transaction is started. I understand that bulk data insertion without disabling the balancer is not optimal and can impact performance, but I can't disable the balancer for every short burst of write activity. I expected the worst outcome to be either timeout errors or temporarily slow performance.

Just FYI, MongoDB is deployed via Docker.

Comment by Bruce Lucas (Inactive) [ 17/Dec/20 ]

yan.zhou@cubistsystematic.com, can you please attach a complete log file showing such a crash to this ticket? Alternatively, you can upload it to this secure private portal if it's too large to attach or if it contains sensitive information that you can't share on this public ticket.

In addition, please archive and attach the contents of $dbpath/diagnostic.data from a node that has experienced this issue recently (past few days to a week or so), together with the log file(s) for that node.

Generated at Thu Feb 08 05:30:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.