[SERVER-75120] libunwind stacktrace issues with --dbg=on on arm64 Created: 22/Mar/23  Updated: 29/Oct/23  Resolved: 07/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.1

Type: Bug Priority: Major - P3
Reporter: Daniel Moody Assignee: Mark Benvenuto
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-78782 Complete TODO listed in SERVER-75120 Closed
is related to SERVER-78304 Temporarily disable building dbg libu... Closed
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Steps To Reproduce:

get arm64 machine (graviton workstation will allow for icecream)
python buildscripts/scons.py --variables-files=etc/scons/mongodbtoolchain_v4_gcc.vars --link-model=dynamic --ninja ICECC=icecc --dbg=on --opt=off
ninja +stacktrace_test

Sprint: Security 2023-07-10, Service Arch 2023-06-12
Participants:

 Description   

The stacktrace_test fails on arm64 when building with --dbg=on; disabling libunwind makes it pass. This is probably an issue with libunwind on that platform.

The stacks produced with libunwind show generic libunwind errors, so libunwind should be debugged to determine why it fails to generate stacks in this configuration.

 

Part of this ticket should revert SERVER-78304



 Comments   
Comment by Githook User [ 16/Aug/23 ]

Author:

{'name': 'Mark Benvenuto', 'email': 'mark.benvenuto@mongodb.com', 'username': 'markbenvenuto'}

Message: SERVER-78782 Remove TODO for SERVER-75120

(cherry picked from commit 1609d5cf48678e71fa64f1c219739e3791408a85)
Branch: v7.0
https://github.com/mongodb/mongo/commit/2790d2d5f5516e62fe9d78df9ad611d76da71175

Comment by Githook User [ 16/Aug/23 ]

Author:

{'name': 'Mark Benvenuto', 'email': 'mark.benvenuto@mongodb.com', 'username': 'markbenvenuto'}

Message: SERVER-75120 Capture and walk stack in same frame

(cherry picked from commit 8bee9c79d9222d1a0f9106376797da32b5365878)
Branch: v7.0
https://github.com/mongodb/mongo/commit/4125dda8d3394126154ba3872ad741aa45145ee6

Comment by Githook User [ 20/Jul/23 ]

Author:

{'name': 'Mark Benvenuto', 'email': 'mark.benvenuto@mongodb.com', 'username': 'markbenvenuto'}

Message: SERVER-78782 Remove TODO for SERVER-75120
Branch: master
https://github.com/mongodb/mongo/commit/1609d5cf48678e71fa64f1c219739e3791408a85

Comment by Kaloian Manassiev [ 11/Jul/23 ]

I would like to request at least a 7.0 backport of this change as well, please.

Comment by Githook User [ 07/Jul/23 ]

Author:

{'name': 'Mark Benvenuto', 'email': 'mark.benvenuto@mongodb.com', 'username': 'markbenvenuto'}

Message: SERVER-75120 Capture and walk stack in same frame
Branch: master
https://github.com/mongodb/mongo/commit/8bee9c79d9222d1a0f9106376797da32b5365878

Comment by Mark Benvenuto [ 28/Jun/23 ]

A few things to note:

  1. This issue is only on clang, not gcc
  2. This is an issue with libunwind, not with the debuggers (gdb and lldb). Both read the same .debug_frame table without issue. I have not tried llvm's libunwind yet.
  3. Clang generates DW_CFA_def_cfa as the first instruction instead of DW_CFA_def_cfa_offset like ARM64 gcc or x86 clang do. I believe this is what trips up libunwind: it reads an uninitialized register (i.e. 0), applies the offsets to it (32 + -24 = 8), and then segfaults.
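
The arithmetic in point 3 can be sketched as a toy model. This is not libunwind's code: the two offsets are taken from the FDE dumps in this comment, the helper name `return_address_slot` is hypothetical, and the zero value for x29 is the hypothesis stated above.

```python
# Toy model of the CFA arithmetic described above; this is not libunwind code.
CFA_REGISTER_OFFSET = 32  # from "DW_CFA_def_cfa: r29 (x29) ofs 32"
X30_SAVE_OFFSET = -24     # from "DW_CFA_offset: r30 (x30) at cfa-24"


def return_address_slot(x29: int) -> int:
    """Address an unwinder would load the saved x30 (return address) from."""
    cfa = x29 + CFA_REGISTER_OFFSET
    return cfa + X30_SAVE_OFFSET


# Assumption: the unwinder sees x29 as uninitialized, i.e. 0.
bad_slot = return_address_slot(0)
print(hex(bad_slot))  # 0x8
```

Under that assumption the unwinder would dereference address 8, which is not mapped in a normal process, so a segfault is the expected outcome.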

llvm-dwarfdump --regex --name ".getStackTrace." build/install/bin/stacktrace_test
build/install/bin/stacktrace_test: file format elf64-littleaarch64

0x00b8d0e2: DW_TAG_subprogram
              DW_AT_low_pc      (0x0000000000efdc90)
              DW_AT_high_pc     (0x0000000000efddec)
              DW_AT_frame_base  (DW_OP_reg29 W29)
              DW_AT_linkage_name        ("_ZN5mongo18stack_trace_detail12_GLOBAL__N_117getStackTraceImplERKNS1_7OptionsE")
              DW_AT_name        ("getStackTraceImpl")
              DW_AT_decl_file   ("/home/ubuntu/mongo/src/mongo/util/stacktrace_posix.cpp")
              DW_AT_decl_line   (423)
              DW_AT_type        (0x00b8e0b8 "class ")

Note: readelf and llvm-dwarfdump dump the same information, just in different styles.

readelf --debug-dump=frames build/install/bin/stacktrace_test

00165d80 0000000000000024 00026314 FDE cie=0013fa70 pc=0000000000efdc90..0000000000efddec
  Augmentation data:     eb 4a e2 ff ff ff ff ff
  DW_CFA_advance_loc: 16 to 0000000000efdca0
  DW_CFA_def_cfa: r29 (x29) ofs 32
  DW_CFA_offset: r28 (x28) at cfa-16
  DW_CFA_offset: r30 (x30) at cfa-24
  DW_CFA_offset: r29 (x29) at cfa-32
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop

llvm-dwarfdump-14 --debug-frame build/install/bin/stacktrace_test

00165d80 00000024 00026314 FDE cie=0013fa70 pc=00efdc90...00efddec
  Format:       DWARF32
  LSDA Address: 000000000099f944
  DW_CFA_advance_loc: 16
  DW_CFA_def_cfa: W29 +32
  DW_CFA_offset: W28 -16
  DW_CFA_offset: W30 -24
  DW_CFA_offset: W29 -32
  DW_CFA_nop:
  DW_CFA_nop:
  DW_CFA_nop:
  DW_CFA_nop:
  DW_CFA_nop:

0xefdc90: CFA=WSP
0xefdca0: CFA=W29+32: W28=[CFA-16], W29=[CFA-32], W30=[CFA-24]
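
How the FDE instructions produce the two rows above can be sketched with a minimal interpreter. This is a simplified illustration, not how llvm-dwarfdump or libunwind is implemented: it handles only the four opcodes that appear in this FDE, treats the advance value as the byte delta already printed by the tools, and assumes the initial CFA=WSP rule comes from the CIE, which is not shown here. The function names are hypothetical.

```python
# Minimal interpreter for the FDE shown above; a sketch, not a full DWARF CFI parser.
FDE_INSTRUCTIONS = [
    ("advance_loc", 16),
    ("def_cfa", "W29", 32),
    ("offset", "W28", -16),
    ("offset", "W30", -24),
    ("offset", "W29", -32),
    ("nop",),
]


def build_rows(start_pc, initial_cfa, instructions):
    """Return one (pc, cfa, saved_registers) row per distinct code location."""
    rows = []
    pc, cfa, saved = start_pc, initial_cfa, {}
    for op, *args in instructions:
        if op == "advance_loc":
            # Close the current row before moving to the next location.
            rows.append((pc, cfa, dict(saved)))
            pc += args[0]
        elif op == "def_cfa":
            cfa = (args[0], args[1])
        elif op == "offset":
            saved[args[0]] = args[1]
        elif op != "nop":
            raise ValueError(f"unsupported opcode: {op}")
    rows.append((pc, cfa, dict(saved)))
    return rows


def format_row(pc, cfa, saved):
    """Format a row in the style of the llvm-dwarfdump table above."""
    reg, off = cfa
    text = f"0x{pc:x}: CFA={reg}" + (f"+{off}" if off else "")
    if saved:
        text += ": " + ", ".join(f"{r}=[CFA{o:+d}]" for r, o in sorted(saved.items()))
    return text


# Assumption: the CIE sets CFA=WSP+0 before the FDE instructions run.
lines = [format_row(*row) for row in build_rows(0xEFDC90, ("WSP", 0), FDE_INSTRUCTIONS)]
print("\n".join(lines))
```

The second row is the one that matters for this bug: from 0xefdca0 onward every saved register is located relative to W29, so a wrong W29 value poisons the whole frame.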

Comment by Alex Neben [ 21/Jun/23 ]

As part of this work please revert SERVER-78304 to prevent this issue from hitting people on local development.

Comment by Alex Neben [ 21/Jun/23 ]

Hey gregory.wlodarek@mongodb.com I am sorry I missed your last comment. jason.chan@mongodb.com / blake.oler@mongodb.com do you think you could prioritize this fix / give an estimate for when it might be fixed? I think we should make this a P2 since this is affecting a lot of people.

Depending on the estimate given by service arch we can decide next steps. If it is soon then we can skip the PSA. If it is longer maybe we can include something in the build system or send out an email (as you suggested).

I would also like to note that this was something I missed when testing the new OSes so I really appreciate service-arch's help on this

Comment by Gregory Wlodarek [ 09/Jun/23 ]

This is also preventing core dumps from being generated when using --dbg=on.
See: https://mongodb.slack.com/archives/C0V79S1PY/p1686324885797059

alex.neben@mongodb.com can we make a PSA somewhere? I spent several hours trying to figure out what's going on.

Comment by Daniel Moody [ 06/Jun/23 ]

This seems to affect more than just the test failure. We recently switched to recommending the Graviton workstations, which are arm-based, and libunwind stack issues have since been popping up in some dev workflows.

The current workaround is to build with "--use-libunwind=off", probably also adding "LINKFLAGS=-rdynamic" so that the built-in unwinder can see all the symbols. I think there may still be issues seeing hidden-visibility symbols in that case, but we don't use hidden visibility much, I think? What about the hidden inlines option?
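
As a rough sketch, the workaround invocation could be assembled like this. The base arguments are copied from the "Steps To Reproduce" section of this ticket and the two extra arguments from the paragraph above; combining them this way is an assumption, and the helper name `workaround_command` is hypothetical.

```python
import shlex

# Base arguments copied from the "Steps To Reproduce" section of this ticket.
BASE_ARGS = [
    "python",
    "buildscripts/scons.py",
    "--variables-files=etc/scons/mongodbtoolchain_v4_gcc.vars",
    "--link-model=dynamic",
    "--ninja",
    "ICECC=icecc",
    "--dbg=on",
    "--opt=off",
]

# Workaround arguments quoted in the comment above.
WORKAROUND_ARGS = ["--use-libunwind=off", "LINKFLAGS=-rdynamic"]


def workaround_command() -> str:
    """Return the build command line with the workaround arguments appended."""
    return shlex.join(BASE_ARGS + WORKAROUND_ARGS)


print(workaround_command())
```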

Comment by Billy Donahue [ 24/May/23 ]

I'm pretty sure <stacktrace> won't be suitable for use with async signals, as it is built from the GCC backtrace() system, which is AS-Unsafe (async-signal-unsafe). This lack of safety is why we needed libunwind in the first place. So we will have to fix this some other way.

https://www.gnu.org/software/libc/manual/html_node/Backtraces.html#index-backtrace-1


<stacktrace> is not too bad. It seems to work from a synchronous signal handler. But we need more than that. https://gcc.godbolt.org/z/zMaahcKM5

Comment by Alex Neben [ 23/May/23 ]

daniel.moody@mongodb.com brought up that in C++23 we can use <stacktrace> to accomplish this. That might be a little way off, but just an FYI that maybe we can avoid doing this work?

Comment by Kaloian Manassiev [ 17/May/23 ]

It looks like libunwind doesn't work on aarch64 when --dbg=on is specified. Given that it is probably the most common compilation argument that we have for local development, could we make it work out of the box, or at least document in the description how we are supposed to build on that platform?

Comment by Daniel Moody [ 30/Mar/23 ]

It works on arm, just not with --dbg=on. Also, the stacktrace_libunwind_test passes with both --dbg=on and --dbg=off on aarch64.

Comment by Alex Neben [ 30/Mar/23 ]

I tend to agree with Jason here. I think getting libunwind to work on arm64 is out of scope here and we should release what we are sure about. Since this is a major release I think we should roll this back for 7.0. daniel.moody@mongodb.com I would be open to discussing this if you think it would be easy to make this work on arm64.

I am also adding the 7.0 blocking tag since this will be a customer release.

Comment by Daniel Moody [ 29/Mar/23 ]

jason.chan@mongodb.com it was turned on in this commit https://github.com/mongodb/mongo/commit/6dd404e028547a29c21b047c2d91ed90ebb1edfb#diff-d78c5c377c03dd251edc72cdb0ae7d9986fa08361eb392187218f8c670c0bc4e

That said, I do think it's probably an issue with libunwind; I just wanted to let you know we have been running the libunwind unit test on aarch64 for a while.

Comment by Daniel Moody [ 29/Mar/23 ]

jason.chan@mongodb.com The stacktrace_libunwind_test passes with both --dbg=on and --dbg=off on aarch64.

Comment by Jason Chan [ 29/Mar/23 ]

Sending back to SDP since service arch avoided using libunwind on aarch64 due to known issues with the library. We currently use gcc's backtrace function instead of libunwind on that platform (not ideal). If SDP wants to add support for libunwind on aarch64, we should first verify whether the previous issues were fixed. Otherwise, consider leaving it unsupported or exploring other alternatives.

Comment by Daniel Moody [ 24/Mar/23 ]

In the last upgrade of libunwind, we added support for aarch64. The stacktrace_libunwind_test passes with both --dbg=on and --dbg=off on aarch64. The stacktrace_test passes with --dbg=off and fails with --dbg=on on aarch64, unless --use-libunwind=off is set, in which case it passes with both.

Here is an example stack trace: https://parsley.mongodb.com/resmoke/ba725b421944d3348b28b4457ee96872/test/174e8b98e43302d851fad7792814a285?bookmarks=0,104&shareLine=0

Comment by Billy Donahue [ 22/Mar/23 ]

I seem to recall that we didn't try to use libunwind on aarch64 as it was known to have issues. This could be old info on my part.

Can you provide more details about the failure or attach a log?

Comment by Alex Neben [ 22/Mar/23 ]

Sending to service arch since they have recently done work on backtrace.

Generated at Thu Feb 08 06:29:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.