[SERVER-30412] mongos Segmentation fault during aggregation workload

| Created: | 28/Jul/17 | Updated: | 27/Oct/23 | Resolved: | 09/Jan/20 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.5.11 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James O'Leary | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | sysperf-36 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Sprint: | Sharding 2017-10-02 |
| Description |

There is a persistent segmentation fault thrown during one of the MMAPv1 aggregation tests. See the following log lines:
| Comments |
| Comment by Kaloian Manassiev [ 10/Oct/17 ] |

jim.oleary, how do you propose that we continue with this ticket?
| Comment by James O'Leary [ 10/Oct/17 ] |

I have gone through all the system failures on sys_perf_linux_3_shard_agg_query_comparison_bestbuy_(WT|MMAPv1) from the end of August. There has been no recurrence of this backtrace since then, BUT:
| Comment by Kevin Duong [ 10/Oct/17 ] |

Pinging jim.oleary to help address this further as needed.
| Comment by Nathan Myers [ 06/Oct/17 ] |

Is it still possible to reproduce this?
| Comment by Kevin Duong [ 06/Oct/17 ] |

Changing this from "debugging with submitter" to "3.6 required".
| Comment by Githook User [ 01/Sep/17 ] |

Author: Bernard Gorman (gormanb) <bernard.gorman@gmail.com>
Message:
| Comment by Bernard Gorman [ 30/Aug/17 ] |

This turned out to be a bit subtle. The bug was exposed by this test in the bestbuy_agg_query_comparison.js workload (note that the find is converted to an equivalent aggregation):

Uniquely among SplittableDocumentSources, when mongoS splits the pipeline at a $limit stage, the $limit returns a pointer to the same object for both the shard and merge pipelines. Previously this didn't matter, since both the shard and merge pipelines were simply serialised to command objects and sent to the relevant shards; but after the change that allows the merge pipeline to run on mongoS itself, it matters a great deal.

When we begin to run the merge pipeline on mongoS, the $limit splitpoint is correctly stitched into it. We retrieve the first batch, register the cursor containing the pipeline with the ClusterCursorManager, and return the results. However, at this point the shard pipeline, which has already been dispatched to the remotes, is destroyed as its unique_ptr goes out of scope; as part of this process, its deleter calls stitch() on the pipeline. The $limit stage, which exists in both the shard pipeline being destroyed and the merge pipeline stored in the ClusterCursorManager, is stitched back into the shard pipeline and redirected to point to the preceding shard stage, which is promptly destroyed. When we next retrieve the merge pipeline from the cursor manager and run it, the $limit is pointing at freed memory, and we segfault.

The reason this bug wasn't picked up by the integration tests previously is that we used a $limit of 50. When we execute this on mongoS, we hit EOF before the first batch of 101 is filled, so we return those results and never register the cursor with the ClusterCursorManager. The merge pipeline therefore doesn't outlive the ClusterAggregate::runAggregate method and is destroyed along with the shard pipeline. The bug only manifests in cases where we split the pipeline at a $limit stage of at least 102 and the resulting merge pipeline is executable on mongoS.
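To make the lifetime problem concrete, here is a minimal, self-contained C++ sketch of the failure mode described above. The Stage and Pipeline types, the stage names, and the stitch-on-destruction behaviour are deliberately simplified stand-ins for the real DocumentSource/Pipeline classes, not actual server code:

```cpp
#include <iostream>
#include <memory>
#include <vector>

// Simplified stand-in for DocumentSource; not actual server code.
struct Stage {
    explicit Stage(const char* n) : name(n) {}
    const char* name;
    Stage* source = nullptr;  // non-owning pointer to the preceding stage
};

// Simplified stand-in for Pipeline.
struct Pipeline {
    std::vector<std::shared_ptr<Stage>> stages;

    // stitch() re-points every stage at its predecessor *in this pipeline*.
    void stitch() {
        for (size_t i = 1; i < stages.size(); ++i)
            stages[i]->source = stages[i - 1].get();
    }

    // Mirrors the deleter behaviour described above: destruction stitches.
    ~Pipeline() { stitch(); }
};

int main() {
    // The $limit stage object is shared by both halves of the split pipeline.
    auto limit = std::make_shared<Stage>("$limit");

    Pipeline merge;
    merge.stages = {std::make_shared<Stage>("$mergeCursors"), limit};
    merge.stitch();  // $limit correctly points at $mergeCursors

    {
        auto shard = std::make_unique<Pipeline>();
        shard->stages = {std::make_shared<Stage>("$shardStage"), limit};
        // shard's unique_ptr goes out of scope here: ~Pipeline stitches
        // $limit back onto $shardStage, whose storage is then freed along
        // with the rest of the shard pipeline.
    }

    // The merge pipeline still holds $limit, but limit->source now dangles.
    // Dereferencing it, as the real merge pipeline does on the next getMore,
    // is a use-after-free and is what produced the observed segfault.
    std::cout << "limit->source is dangling: "
              << static_cast<void*>(limit->source) << "\n";
}
```

In this sketch the dangling pointer is created unconditionally; in the real system it only matters when the merge pipeline outlives ClusterAggregate::runAggregate, which, per the analysis above, requires a $limit of at least 102 so that the cursor is actually registered with the ClusterCursorManager.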
| Comment by James O'Leary [ 30/Aug/17 ] |

david.storch, I agree, we can only focus on the current error for the moment. Let's find and fix the issue with the current backtrace first. Once this is resolved, it will likely be necessary to investigate the original issue (possibly with an older build), even if it appears to be gone from the current version. Unfortunately, since the original stacktrace doesn't appear to be very informative, this may require coordination between us to get whatever information is required (maybe a core dump would be more useful). In the meantime, I'll see if the original backtrace is more useful if line numbers are generated.
| Comment by David Storch [ 29/Aug/17 ] |

jim.oleary, ah, you're right, there do appear to be two issues at play here. I propose that we focus our efforts for now on the issue related to the more recent change.
| Comment by James O'Leary [ 29/Aug/17 ] |

Given the commit date of August 8th for that change, the original trace and date are from before it was committed.
| Comment by David Storch [ 23/Aug/17 ] |

This looks like it was introduced by bernard.gorman's work in the linked ticket.
| Comment by James O'Leary [ 23/Aug/17 ] |

nathan.myers: The test continues to fail persistently; the latest failure is on v3.5.12-20-g1602ed3. This stack trace looks more useful, as it is a full backtrace with line numbers:

The useful links are:

If this isn't sufficient then let me know what you need to progress the issue. The raw trace looks like:

Demangled:
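As an aside on the demangling step mentioned above: mangled C++ frames from a raw mongos backtrace can be demangled from the command line with c++filt, or programmatically with abi::__cxa_demangle. A minimal sketch follows; the mangled name used here is an arbitrary illustrative example, not taken from this ticket's trace:

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>
#include <iostream>
#include <string>

// Demangle a single mangled symbol; return it unchanged on failure.
std::string demangle(const char* mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && out) ? out : mangled;
    std::free(out);  // __cxa_demangle allocates with malloc; free(nullptr) is a no-op
    return result;
}

int main() {
    // Hypothetical example symbol, not from this ticket's trace.
    std::cout << demangle("_ZN5mongo8Pipeline6stitchEv") << "\n";
    // Prints: mongo::Pipeline::stitch()
}
```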
| Comment by Nathan Myers [ 17/Aug/17 ] |

The attached backtrace identifies only addresses in the signal-handling code. Studying the logs shows that the only events logged, other than periodic connections, disconnections, and metadata refreshes, occurred more than an hour before the crash. The events up to that point were a series of autosplit operations not followed by chunk migrations ("but no migrations allowed") and then, 33 seconds later,
| Comment by Daniel Pasette (Inactive) [ 31/Jul/17 ] |

From the mongos log file: