[SERVER-36874] Fatal Assertion 40526 while migrating chunks

| Created: | 26/Aug/18 | Updated: | 20/Mar/19 | Resolved: | 31/Oct/18 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.6.5, 3.6.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Elan Kugelmass | Assignee: | Kelsey Schubert |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1065-aws x86_64) |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
We have a four-shard cluster undergoing balancing. Each shard is a three-member replica set. Right after a migration occurs, the to-shard's primary occasionally crashes. On 3.6.7, we estimate that this happens in 1 out of 5,000 migrations. Once a single crash happens, the next migration to the same shard (with a different replica set member as primary) is likely to lead to a crash as well. We experienced this problem on 3.6.5 as well. Log from the to-shard primary:
|
| Comments |
| Comment by Kelsey Schubert [ 20/Mar/19 ] | |||||||
|
Hi zhicheng, I've looked at the logs you've provided. It appears that you're encountering a separate issue. Going forward, it's generally best to open a new ticket so we can track the issue appropriately. Kind regards, | |||||||
| Comment by Zhicheng Long [ 20/Mar/19 ] | |||||||
|
Hi kelsey.schubert, I appreciate all the efforts on resolving this issue. We are currently using MongoDB 4.0.5, but it seems the same issue still occurs. Below I attach the log message. Is it due to the same cause or another bug?
| |||||||
| Comment by Kelsey Schubert [ 31/Oct/18 ] | |||||||
|
Hi epkugelmass, Thanks for the additional information. I'm reopening this ticket as we've identified the root cause, and the fixes will be in the next releases of MongoDB (3.6.9 and 4.0.4). If you would like to re-enable retryable writes, please be sure to upgrade first to take advantage of these fixes. For others encountering this issue, I would recommend disabling the balancer as a workaround, since this issue requires both retryable writes and active migrations to trigger. Kind regards, | |||||||
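For reference, a minimal sketch of the balancer workaround mentioned above (the exact steps are not part of the original comment; these are the standard shell helpers, run against a mongos):

    // Stop the balancer so no further automatic chunk migrations are scheduled
    sh.stopBalancer()

    // Re-enable it once the cluster has been upgraded to a fixed release
    sh.startBalancer()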
| Comment by Elan Kugelmass [ 26/Oct/18 ] | |||||||
|
Hi Kelsey, We turned off retryable writes and this problem stopped occurring. It's a reasonable workaround for our use case, so we're going to stop digging at the actual problem. Elan | |||||||
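The comment does not say how retryable writes were disabled; with 3.6-era drivers this is typically controlled by the connection string, for example (host names are placeholders):

    mongodb://mongos1.example.net:27017,mongos2.example.net:27017/?retryWrites=false

Since retryable writes are opt-in on 3.6, removing retryWrites=true from the URI has the same effect.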
| Comment by Kelsey Schubert [ 25/Oct/18 ] | |||||||
|
Hi epkugelmass, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, | |||||||
| Comment by Nick Brewer [ 17/Sep/18 ] | |||||||
|
epkugelmass Sounds good, thanks. -Nick | |||||||
| Comment by Elan Kugelmass [ 17/Sep/18 ] | |||||||
|
Hi Nick, I'm taking this ticket back over. I need some time to generate the data you've requested. Should be no more than a couple of days. Elan | |||||||
| Comment by Nick Brewer [ 12/Sep/18 ] | |||||||
|
vkreddy Unfortunately we can't match these oplog entries up with the occurrence that was described, as those oplog entries have since been removed. However if you've experienced the issue on this node since then, we may be able to match it up to a time in the oplog data that we have available; in this case it would also be useful to see the mongod log from that time. Thanks, | |||||||
| Comment by Karthik Reddy Vadde [ 10/Sep/18 ] | |||||||
|
Sorry about the delay. Thanks for the commands to retrieve oplog. I have uploaded the oplog and diagnostics data to the secure portal. Please let me know if you need more information. | |||||||
| Comment by Nick Brewer [ 30/Aug/18 ] | |||||||
|
vkreddy I apologize; exporting a view in this way isn't going to work on a sharded cluster. However, there's another way we can sanitize the oplog information before exporting it:
Move the collection from the "local" database to a new database (note that this needs to be run on the mongod, from the "admin" database):
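The command itself did not survive this export. As a hedged sketch only: the admin command that can move a collection between databases on the same mongod is renameCollection, so the step presumably looked something like the following, assuming the redacted oplog data had first been materialized into an ordinary collection (the names local.redactedOplog and sanitized are hypothetical):

    // Hypothetical sketch; run from the "admin" database on the to-shard mongod
    db.adminCommand({
        renameCollection: "local.redactedOplog",  // assumed name of the materialized, redacted copy
        to: "sanitized.redactedOplog"             // assumed target database and collection
    })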
Then run mongodump against the mongos, specifying something like:
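The exact invocation was also stripped; a sketch consistent with the step above (the database and collection names remain hypothetical):

    # Run against a mongos rather than the individual shard
    mongodump -h <mongos_host>:27017 -d sanitized -c redactedOplog -o oplog_dump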
Thanks, | |||||||
| Comment by Karthik Reddy Vadde [ 28/Aug/18 ] | |||||||
|
Nick, I am Elan's colleague and I am working with him on sharding Mongo. I created the view and tried to get the dump as suggested, using:

    mongodump -h <hostname_of_toshard_primary>:27018 -d local -c redactedView -o oplog_dump --viewsAsCollections

However, I ran into the following error:

    2018-08-28T17:56:27.969+0000 writing local.redactedView to
    2018-08-28T17:56:27.969+0000 Failed: error reading from db: Command on view must be executed by mongos.

We are not sure how to proceed. Appreciate your help on this. | |||||||
| Comment by Nick Brewer [ 27/Aug/18 ] | |||||||
|
epkugelmass You can create a view that redacts the "o" field:
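The view definition itself was not preserved in this export; a minimal sketch of a view over the oplog that strips the "o" field (the view name matches the later comments, the pipeline is assumed):

    // Run in the mongo shell on the to-shard primary
    db.getSiblingDB("local").createView(
        "redactedView",               // view name referenced elsewhere in this ticket
        "oplog.rs",                   // source: the replica set oplog
        [ { $project: { o: 0 } } ]    // drop the "o" (operation document) field
    )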
This creates a new redactedView view with the specified fields removed; you can then perform the mongodump on this view instead. The diagnostic.data directory does not contain database/collection fields or details; it merely collects performance information that can be used by our internal tooling to identify issues. -Nick | |||||||
| Comment by Elan Kugelmass [ 27/Aug/18 ] | |||||||
|
Hi nick.brewer, I'd be happy to get you that data. Do you have any suggestions on a way to redact pieces of the oplog? I see an option in mongodump to add a filter/where clause, but not one to do a projection. I need to remove "o": {my employer's data} before I can share the file. Can you also tell me what's collected in diagnostic.data? Quick update on the issue: this is now happening around once an hour (which might correspond to a cyclical increase in application activity for us). We've turned off balancing as a result. We're considering turning off retryable writes and then seeing what happens. | |||||||
| Comment by Nick Brewer [ 27/Aug/18 ] | |||||||
|
epkugelmass We'd like to see the oplog data for the crashing to-shard. If you could run the following command and supply us with the output file:
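The command was stripped from this export; a sketch of a mongodump invocation consistent with the rest of the thread (host and port are placeholders, mirroring the later comments):

    # Dump the oplog of the crashing to-shard primary
    mongodump -h <hostname_of_toshard_primary>:27018 -d local -c oplog.rs -o oplog_dump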
Additionally, if you could archive (tar or zip) the dbpath/diagnostic.data directory for this mongod, along with the log file from the time this behavior occurs, and upload them as well. You can upload this information to our secure portal. Information shared there is only available to MongoDB employees, and is automatically removed after a period of time. Thanks, |