[SERVER-14389] segmentation fault in RangeDeleter::canEnqueue_inlock Created: 30/Jun/14  Updated: 10/Dec/14  Resolved: 08/Jul/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Ger Hartnett |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
One of our mongod replica set members, which is part of a cluster consisting of 3 shards, went down due to a segmentation fault:
We are running mongodb-linux-x86_64-2.6.1. It might be related to this issue: https://jira.mongodb.org/browse/SERVER-14261
| Comments |
| Comment by Randolph Tan [ 06/Aug/14 ] |
This one has been fixed in
| Comment by Kay Agahd [ 06/Aug/14 ] |
renctan, while we had the balancer disabled and no chunk migrations running, mongo crashed at the very end of compaction again:
This has happened in roughly 1 out of 2 compact runs since we moved to v2.6; earlier mongodb versions didn't have this problem. Instead of using compact in the future, it might be better for us to empty the db folder and do an initial sync. It takes almost the same time as a compact and it does not crash mongodb. Looking forward to seeing a more stable, less quirky, more maintainable mongodb version in the next release.
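For context, a minimal mongo shell sketch of the kind of per-collection compact run discussed in this comment (illustrative only; the hostname is a placeholder and the offerStore.offer namespace is taken from a later comment):

```js
// Hypothetical example: compacting one collection on a secondary.
// Hostname is a placeholder; adjust namespace and port to your deployment.
var conn = new Mongo("shard1-secondary.example.net:27017");
var db = conn.getDB("offerStore");
// The secondary enters RECOVERING and blocks operations on this database
// while the command runs.
printjson(db.runCommand({ compact: "offer" }));
```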
| Comment by Kay Agahd [ 05/Aug/14 ] |
Thanks renctan, we have disabled the balancer because we had too much hassle with it since v2.6. We wrote our own balancer which also takes the amount of RAM into account, because with a growing number of shards the amount of RAM is not always the same on every shard. It would be nice to see a smarter balancer in the upcoming mongodb version, at least one which does not block, get stuck or even crash mongod.
| Comment by Randolph Tan [ 05/Aug/14 ] |
kay.agahd@idealo.de Running compact during migration on v2.6 is not advised due to this bug:
| Comment by Kay Agahd [ 28/Jul/14 ] |
Good to know that data may be corrupt even if mongo starts without any complaint. However, all of the last crashes happened exactly at the very end of a compact. Do you really think that's due to corrupt data which made mongo crash only at the end of the compact process?
On this cluster we are using only one database which has only one collection (offerStore.offer).
Yes, I think the same, but it's very difficult to get all clients to implement this.
| Comment by Asya Kamsky [ 28/Jul/14 ] |
Correct - there is no check of existing data files when starting up - it would be the equivalent of running db.collection.validate(true) on every collection, which would take a very long time. The only indication of a problem may be an invalid BSON error, or, if you are unlucky, some more subtle sign of corruption.
Rather than running validate on each node, why not just resync them (or at least the ones that have crashed and haven't been resynced since)? A full resync is simpler than running compact on multiple collections time after time.
I understand about journaling and needing a separate device to minimize the performance impact, but I think if you fixed your writes not to grow documents on update, you would see much better performance without having to worry about so many other things.
Asya
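For illustration, a sketch of the full validation pass mentioned above (the offerStore.offer namespace is assumed from this ticket; validate(true) scans every document and index entry, which is why it is slow on large collections):

```js
// Hypothetical example: full validation of one collection from the mongo shell.
var res = db.getSiblingDB("offerStore").offer.validate(true);  // full scan, slow
printjson(res);   // inspect the "valid" field and any reported errors
```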
| Comment by Kay Agahd [ 28/Jul/14 ] |
It doesn't show the number of documents in each chunk - perhaps you just made a typo? It shows the number of documents on each shard. It's just a count on the primary of each replica set.
We disabled journaling because if a server fails, which is very rare, it is able to come back in sync quite fast since each server stores only 200-300 GB of data. In another cluster, where each node holds terabytes of data, we have journaling enabled because it would take ages to get back in sync.
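A minimal sketch of the per-shard count described here (hostnames are placeholders, not taken from the ticket; the offerStore.offer namespace is assumed):

```js
// Hypothetical example: count the documents held by the primary of each shard.
var primaries = {
  shard1: "shard1-primary.example.net:27017",   // placeholder hostnames
  shard2: "shard2-primary.example.net:27017",
  shard3: "shard3-primary.example.net:27017"
};
for (var name in primaries) {
  var conn = new Mongo(primaries[name]);
  print(name + ": " + conn.getDB("offerStore").offer.count() + " documents");
}
```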
| Comment by Asya Kamsky [ 28/Jul/14 ] |
Kay, I want to make a comment regarding the earlier crashes. Because you are running with journaling disabled, if mongod ever crashes in any way (i.e. goes down in any way other than a normal shutdown) you must delete the dbpath contents and resync it from another node.
When you disable journaling, you lose single-node durability - on restart, there is no way to guarantee that all the data file contents are in a consistent state. This is particularly likely to be a problem for you since your background flush times (according to MMS) are in the multiple seconds, and as the flush happens every 60 seconds, if mongod goes down while the data files are partially flushed to disk, you will end up with inconsistent (basically unusable/corrupt) data files.
We strongly caution you to *not* disable journaling; however, given that you have, that basically commits you to resyncing any node that goes down abnormally. In addition, if all the nodes in a replica set are in the same physical location and can all be affected by the same adverse event (i.e. a power outage), you could end up with corrupt files on every member of the replica set.
Note, I'm not saying that any of the problems you are seeing now are related to these crashes/nojournal, however, it's something I need to keep in mind if anything starts looking "strange".
Asya
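As an aside, a small illustrative mongo shell sketch (not from the ticket) to confirm whether a running mongod has journaling enabled:

```js
// Hypothetical example: check journaling status on a running mongod.
// The "dur" section of serverStatus is only present when journaling is on (MMAPv1).
print("journaling enabled: " + (db.serverStatus().dur !== undefined));
printjson(db.adminCommand({ getCmdLineOpts: 1 }).parsed);  // shows journal/nojournal startup options
```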
| Comment by Asya Kamsky [ 28/Jul/14 ] |
The screenshot looks like monitoring that you run - how does it determine how many documents are in each chunk?
| Comment by Kay Agahd [ 27/Jul/14 ] |
screenshot of chunk move attached
| Comment by Kay Agahd [ 27/Jul/14 ] |
asya, the balancer is counterproductive for us because we pre-split, so new documents are already inserted on the right shard. Also, for some of our clusters we want to load more documents onto shards which are equipped with more RAM. Moreover, chunks may become empty over time, so it's a bad idea to just keep the number of chunks equal between shards.
Yes, we run neither the balancer nor moveChunks at the same time as the chunk-checker.js script.
I'll attach a screenshot which shows you the number of documents on each shard. Shards 1 and 5 are equipped with more RAM, which is why they have more documents than the other shards. Arrow 1 shows the moment when we moved chunks from shard 2 to shards 1 and 5. You can see that the number of documents increases on shards 1 and 5 but does not decrease on shard 2, even after having waited for 3 days!
Concerning the noTimeout cursors: yes, we still have applications which use them because they iterate over very large record sets and don't know how fast the result can be consumed by the client. We would like to set a cursor timeout on the server instead of the default one of 10 minutes. It's very difficult to know the optimal batch size: if it's too big, the server closes the cursor; if it's too small, the iteration over the large result set takes too much time. If we could set the cursor timeout to, say, one or two hours, our application could stop using noTimeout cursors. Thank you for your help!
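As an illustration of the batch-size trade-off described above, a minimal mongo shell sketch (namespace assumed; not from the ticket) of iterating a large result set with a bounded batch size instead of a noTimeout cursor:

```js
// Hypothetical example: iterate a large result set with a fixed batch size.
// The cursor's idle timer is reset on every getMore, so as long as each batch
// is consumed within the server-side timeout (10 minutes by default) the
// cursor stays alive without needing the noTimeout option.
var cursor = db.getSiblingDB("offerStore").offer.find().batchSize(1000);
while (cursor.hasNext()) {
  var doc = cursor.next();
  // process doc ...
}
```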
| Comment by Asya Kamsky [ 27/Jul/14 ] |
Can you clarify this question:
If the balancer is disabled then there will not be any chunk moves done by the system. If you mean that you are running the moveChunk command manually (or programmatically) and you are seeing the number of documents on the "from" shard stay the same, that indicates that the delete part of moveChunk has yet to run or is still running.
Make sure that you are not using https://github.com/mongodb/mongo-snippets/blob/master/sharding/chunk-checker.js unless you have made sure that your balancer cannot possibly be running during the time you are running the script, and that you yourselves are not running moveChunk commands at that time. You said you disabled the balancer, but in an earlier comment on the ticket it looks like the config settings have it enabled:
As far as cleanupOrphaned - what were you using to determine that it didn't delete any documents? If there are no orphans then of course it would not delete anything; note that the command does not print how many documents it cleaned up, but you can see the number in the logs of the mongod you run the command on. If you are seeing incorrect behavior by cleanupOrphaned, please open a separate bug report for it so we can diagnose it specifically.
Btw, looking in MMS I see recent numbers for open cursors like "noTimeout": 279901, which is a huge number - is your application still opening noTimeout cursors? This may cause the deletes to stall as discussed in
We should probably move this discussion out of the SERVER project as it's not directly related to this bug, although we may determine that some enhancements can be made in the future to make it easier to diagnose the root cause of these situations.
Asya
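For reference, a sketch of running cleanupOrphaned over every range of a collection, following the loop pattern from the MongoDB documentation (namespace assumed from this ticket; run against the primary of the shard):

```js
// Hypothetical example: sweep all ranges of a collection with cleanupOrphaned.
// The command returns stoppedAtKey for the next range to clean; the field is
// absent once the last range has been processed, which ends the loop.
var nextKey = {};
var result;
while (nextKey != null) {
  result = db.adminCommand({
    cleanupOrphaned: "offerStore.offer",
    startingFromKey: nextKey
  });
  if (result.ok != 1) {
    printjson(result);        // failure or timeout; inspect and retry if needed
    break;
  }
  nextKey = result.stoppedAtKey;
}
```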
| Comment by Kay Agahd [ 27/Jul/14 ] |
asya yes, you are right, I didn't explain it correctly. What I meant was that after running cleanupOrphaned, the number of documents was still the same on the shard. However, after having run the chunk-checker.js script, the number of documents dropped significantly on this shard.
| Comment by Asya Kamsky [ 27/Jul/14 ] |
The cleanupOrphaned command cleans up documents (i.e. removes them), but just like any other delete the space goes on the free list to be reused for future inserts. It does not get reclaimed to the OS unless you run repairDatabase - the compact command will place the remaining data closer together but won't return space to the OS either. This is known and expected behavior.
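To make the distinction concrete, a hypothetical mongo shell sketch (database and collection names assumed from this ticket):

```js
// Hypothetical example: the two commands discussed above.
var target = db.getSiblingDB("offerStore");
// compact: defragments a single collection within the existing data files;
// space is not returned to the OS.
printjson(target.runCommand({ compact: "offer" }));
// repairDatabase: rewrites all data files and can shrink them, but blocks the
// node and needs free disk space roughly equal to the current data set size.
printjson(target.repairDatabase());
```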
| Comment by Kay Agahd [ 27/Jul/14 ] |
Hi thomasr, the issue occurred several times again. It happens at the very end of a compact. Here are the logs of the last crashed mongod:
Just in case the info might be helpful: we have disabled the balancer since we started using v2.6. We do pre-splitting to have well balanced shards. It may happen that we need to move chunks to servers that are equipped with better hardware, because the maxSize parameter does not do what it should, see also: https://jira.mongodb.org/browse/SERVER-11441
Btw, the moved chunks are not deleted anymore. Is this a bug in v2.6 or is it because we have disabled the balancer? The suggested cleanupOrphaned command does not seem to work since no space was freed up. However, after executing the script https://github.com/mongodb/mongo-snippets/blob/master/sharding/chunk-checker.js multiple GBs were reclaimed. This is very important for us because our db needs to fit completely in RAM for best performance (see also my comment 10 minutes ago at https://jira.mongodb.org/browse/SERVER-5931). Should I open a new bug report for this?
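A small illustrative sketch (not from the ticket) of disabling the balancer from a mongos and verifying the persisted setting that the comments refer to:

```js
// Hypothetical example: disable the balancer and check the stored setting.
sh.setBalancerState(false);   // sh.stopBalancer() additionally waits for an in-progress round
print("balancer currently enabled: " + sh.getBalancerState());
printjson(db.getSiblingDB("config").settings.findOne({ _id: "balancer" }));
```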
| Comment by Thomas Rueckstiess [ 08/Jul/14 ] |
Hi Kay,
We were not able to reproduce this segmentation fault, and inspection of the code path that the stack trace provides didn't turn up anything useful either. It seems that a null pointer is being accessed, but we can't see where the RangeDeleter would do that. This may be a very rare race condition. We also don't see an obvious link to
At this stage I think we're out of luck trying to get to the bottom of this without further data. I'll close the ticket as "cannot reproduce", but we've modified the title to reflect the issue better, so it will be easier to search for in the future. Please let us know if you run into the same issue again.
Regards,
| Comment by Kay Agahd [ 04/Jul/14 ] |
This has happened only once so far.
| Comment by Randolph Tan [ 03/Jul/14 ] |
Hi, how many times has this occurred? We are still investigating this issue and nothing obvious came out from just the logs. Thanks!
| Comment by Kay Agahd [ 30/Jun/14 ] |
I forgot to mention that we followed Randolph Tan's suggestion made in https://jira.mongodb.org/browse/SERVER-14261 by setting the _waitForDelete field in config.settings:
Nevertheless, the node went down. Perhaps both issues are not related.
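For reference, a sketch of how such a setting is typically applied (illustrative only; this is not the original snippet from the comment above):

```js
// Hypothetical example: set _waitForDelete on the balancer document so that a
// migration waits for the range deletion to finish before the next one starts.
// Run against a mongos.
db.getSiblingDB("config").settings.update(
  { _id: "balancer" },
  { $set: { _waitForDelete: true } },
  { upsert: true }
);
```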