[SERVER-29152] Segfault in multiple shard primaries under regular load Created: 12/May/17  Updated: 30/Oct/23  Resolved: 30/May/17

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.2.13
Fix Version/s: 3.2.14, 3.4.5, 3.5.9

Type: Bug
Priority: Critical - P2
Reporter: Meni Livne
Assignee: Samantha Ritter (Inactive)
Resolution: Fixed
Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: shard0-primary.txt, shard1-primary.txt, shard2-primary.txt
Issue Links:
Backports
Duplicate
is duplicated by SERVER-29510 mongos on all servers crash when addi... Closed
is duplicated by SERVER-29310 server crash during chunk split Closed
Related
related to SERVER-29377 Make the logging subsystem immortal Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4, v3.2
Participants:

 Description   

Our database is divided into 4 shards, each having one primary, secondary and arbiter. Primaries are r4.2xlarge servers on AWS EC2, and secondaries are r4.xlarge.

Our workload is both read- and write-intensive, but these servers usually handle it without a problem. However, during regular operation, the primaries of 3 of the 4 shards suddenly crashed within a very short time of each other. We don't know what could have caused this.

Attached are the logs of the segfaults from the primary servers. The one from shard1 seems different from the other two.



 Comments   
Comment by Githook User [ 30/May/17 ]

Author: samantharitter <samantha.ritter@10gen.com>

Message: SERVER-29152 Do not cache logging ostream in threadlocal when in other thread-specific contexts
Branch: master
https://github.com/mongodb/mongo/commit/fad590916a30ff34dc8c3b37afcfffa2c4e5c8bc

Comment by Samantha Ritter (Inactive) [ 26/May/17 ]

It appears our hook did not catch the 3.2 commit; it's here:

Author: samantharitter
Message: SERVER-29152 Do not cache logging ostream in threadlocal when in other thread-specific contexts
Branch: v3.2
https://github.com/mongodb/mongo/commit/85aa900eae81fdc07d00aa4b1fb782f7ca5b4664

Comment by Githook User [ 26/May/17 ]

Author: samantharitter <samantha.ritter@10gen.com>

Message: SERVER-29152 Do not cache logging ostream in threadlocal when in other thread-specific contexts
Branch: v3.4
https://github.com/mongodb/mongo/commit/36a4a00321bb531190bcd00f523bce95a81b5ab2

Comment by Samantha Ritter (Inactive) [ 22/May/17 ]

Hi Meni,

I wanted to update you on the status of this bug. New logging code that was added by SERVER-28760 tries to log while a thread is exiting, in which case the logging subsystem may already be destroyed. The order in which these objects are destroyed seems quasi-random, depending on the build or on the system's memory allocation. This influences whether these objects are destroyed peacefully or whether they are destroyed in a bad order that leads to a crash. We are investigating exactly what determines the ordering of the destruction of these objects. We are still working to reproduce the crash on our end as we investigate what the best fix will be. Thank you for your patience.

As to what actual event may have triggered the thread to exit here in your case, can you provide complete log files from these crashes? The stack traces you've linked have been very helpful, and it would also help us to see what the system was doing up until things went south.

Thank you,
Samantha
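
For readers following along, here is a minimal C++ sketch of the failure mode described above and of the approach named in the commit message ("Do not cache logging ostream in threadlocal when in other thread-specific contexts"). It is an illustration under stated assumptions, not MongoDB's actual code: `Sink`, `LineBuilder`, `ExitLogger`, `shouldCache`, and `runWorkerAndCollect` are all hypothetical names. The idea is that a log call made from a thread-local object's destructor may run after the thread-local stream cache has already been destroyed, so that call path skips the cache and allocates its own buffer.

```cpp
#include <memory>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical log destination. It is allocated once and never freed, so it
// stays valid no matter how late in thread or process teardown it is used.
struct Sink {
    std::mutex mu;
    std::vector<std::string> lines;
};

Sink& sink() {
    static Sink* s = new Sink;  // intentionally leaked
    return *s;
}

// Hypothetical line builder: buffers one message and emits it on destruction.
class LineBuilder {
public:
    explicit LineBuilder(bool shouldCache) : _shouldCache(shouldCache) {
        if (_shouldCache && cachedStream) {
            _os = std::move(cachedStream);  // reuse this thread's buffer
        } else {
            _os.reset(new std::ostringstream);  // private, uncached buffer
        }
    }

    ~LineBuilder() {
        {
            std::lock_guard<std::mutex> lock(sink().mu);
            sink().lines.push_back(_os->str());
        }
        if (_shouldCache) {
            _os->str("");
            cachedStream = std::move(_os);  // touches thread-local storage
        }
    }

    std::ostream& stream() { return *_os; }

private:
    // Per-thread cache. Its destruction order relative to other thread-local
    // objects is not something callers should rely on.
    static thread_local std::unique_ptr<std::ostringstream> cachedStream;

    bool _shouldCache;
    std::unique_ptr<std::ostringstream> _os;
};

thread_local std::unique_ptr<std::ostringstream> LineBuilder::cachedStream;

// Thread-local object that logs from its destructor, i.e. during thread exit.
// It passes shouldCache = false, so it never reads or writes cachedStream,
// which may already have been destroyed at that point.
struct ExitLogger {
    bool armed = false;
    ~ExitLogger() {
        if (armed) {
            LineBuilder(false).stream() << "thread exiting";
        }
    }
};

thread_local ExitLogger exitLogger;

// Runs one worker thread and returns everything it logged.
std::vector<std::string> runWorkerAndCollect() {
    {
        std::lock_guard<std::mutex> lock(sink().mu);
        sink().lines.clear();
    }
    std::thread worker([] {
        exitLogger.armed = true;                 // instantiate for this thread
        LineBuilder(true).stream() << "step 1";  // normal path, cached
        LineBuilder(true).stream() << "step 2";  // reuses the cached buffer
    });
    worker.join();
    std::lock_guard<std::mutex> lock(sink().mu);
    return sink().lines;
}
```

Calling `runWorkerAndCollect()` yields the two lines logged by the worker followed by the line logged while the thread was exiting. Passing `true` from `ExitLogger`'s destructor instead would make the outcome depend on thread-local destruction order, which is the kind of ordering hazard this ticket describes.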

Comment by Meni Livne [ 13/May/17 ]

We're using the mongodb-org-server packages for Ubuntu from the official MongoDB repositories. As far as we know, these don't add any log rotation settings. We haven't implemented any ourselves, and we have never noticed the log file being rotated.

Comment by Samantha Ritter (Inactive) [ 12/May/17 ]

Hi there,

Thanks for opening this ticket. I'm sorry you experienced these crashes. I'm looking into what might have happened on these servers. Given the stack traces, it's possible that we have a bug in our logging subsystem. Are you running with rotating log files? If so, is there any chance that these servers' log files were being rotated around the time the crash occurred?

Thanks,
Samantha
