[SERVER-26578] Add startup warning for Intel CPUs which might have TSX bugs Created: 11/Oct/16  Updated: 10/May/23  Resolved: 24/Feb/22

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 3.0.12, 3.2.10, 3.4.0-rc0
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Backlog - Service Architecture
Resolution: Won't Do
Votes: 1
Labels: devtools-to-servicearch, re-triaged-ticket
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-24283 Invariant failure grantedCounts[mode]... Closed
related to SERVER-26018 Inconsistency between the LockManager... Closed
Assigned Teams:
Service Arch
Participants:
Linked BF Score: 5

 Description   

Certain versions of the Intel CPU microcode have TSX bugs which might lead to unexplained concurrency issues. We should emit a server startup warning, or if possible even refuse to start the server, when we detect this situation.
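
A minimal sketch of what such a check might look like on Linux, assuming the CPU flags and microcode revision are read from /proc/cpuinfo (the `hle` and `rtm` flags indicate that TSX is exposed). The function names and the `min_safe_microcode` threshold are hypothetical: this ticket does not identify a known-good microcode revision, so the caller has to supply one. It is written in Python for brevity and is not the server's implementation.

```python
def parse_cpuinfo(text):
    """Return (flags, microcode) for the first processor entry in /proc/cpuinfo text."""
    flags, microcode = set(), None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key == "flags" and not flags:
            # Space-separated feature flags, e.g. "fpu vme ... hle ... rtm"
            flags = set(value.split())
        elif sep and key == "microcode" and microcode is None:
            # Reported as a hex string, e.g. "0xb00001b"
            microcode = int(value.strip(), 16)
    return flags, microcode


def tsx_startup_warning(cpuinfo_text, min_safe_microcode):
    """Return a warning string if TSX is exposed on old microcode, else None.

    min_safe_microcode is an assumption supplied by the caller; it is not
    a value taken from Intel documentation or from this ticket.
    """
    flags, microcode = parse_cpuinfo(cpuinfo_text)
    if not flags & {"hle", "rtm"}:
        return None  # TSX not exposed (absent or disabled by microcode)
    if microcode is None:
        return "TSX is enabled but the microcode revision could not be determined"
    if microcode < min_safe_microcode:
        return ("TSX is enabled and microcode revision %#x is older than %#x; "
                "consider updating the CPU microcode" % (microcode, min_safe_microcode))
    return None
```

At startup, the contents of /proc/cpuinfo would be passed in as `cpuinfo_text`, and any non-None result logged as a startup warning.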

More information on this was provided by user xiaost as part of SERVER-26018:

  • can only be reproduced on servers with the new CPU (E5-2630 v4)
  • can be easily reproduced by modifying the unit tests
  • can only be reproduced under a particular code execution sequence
  • works well if we add some debug code into the lock context

After debugging, we started to focus on hardware issues, including memory and CPU.

With the help of Google, we found that the TSX feature, which speeds up execution of multi-threaded software through lock elision, seems to have been the root of all evil since 2014:
[1] [2] [3]

In August 2014, Intel announced a bug in the TSX implementation on current steppings of Haswell, Haswell-E, Haswell-EP and early Broadwell CPUs, which resulted in disabling the TSX feature on affected CPUs via a microcode update.

We checked our microcode changelog. In the latest release:

+ Likely fixes a recently identified, critical but low-hitting TSX erratum on Broadwell, Broadwell-E and related Xeons (Broadwell-DE/WS/EP: Xeon-D 1500, E3-v4 and E5-v4)



 Comments   
Comment by Blake Oler [ 22/Nov/22 ]

Keeping this closed since the issue seems to have gone away.

Comment by Lauren Lewis (Inactive) [ 24/Feb/22 ]

We haven’t heard back from you for at least one calendar year, so this issue is being closed. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by xihui he [ 04/Dec/17 ]

@Adun To reproduce the issue, we made some code changes in lock_manager_test, but unfortunately, the code is lost...
After we upgraded the microcode, the issue went away, and it has run well ever since.

Comment by Adun [ 03/Dec/17 ]

Our production environment ran into this issue occasionally. I tried `lock_manager_test` from branch r3.4.10 on the following platforms:

1. Debian 8 + gcc 5.3 + libc 2.19-18+deb8u1 + microcode 0xb00001b + E5-2630v4
2. Debian 8 + gcc 5.3 + libc 2.19-18+deb8u1 + microcode 0xb000021 + E5-2630v4
3. Debian 8 + gcc 5.3 + libc 2.19-18+deb8u10 + microcode 0xb00001b + E5-2630v4
4. Debian 8 + gcc 5.3 + libc 2.19-18+deb8u10 + microcode 0xb000021 + E5-2630v4
5. Debian 9 + gcc 6.3 + libc 2.24-11+deb9u1 + microcode 0xb000021 + E5-2630v4

Unfortunately, I can't reproduce this issue.
So, what can I do to confirm that it is not caused by Intel TSX?

Thanks.
