[SERVER-14389] segmentation fault in RangeDeleter::canEnqueue_inlock Created: 30/Jun/14  Updated: 10/Dec/14  Resolved: 08/Jul/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Ger Hartnett
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: JPEG File chunkMove.JPG    
Issue Links:
Related
is related to SERVER-14261 stepdown during migration range delet... Closed
Operating System: ALL
Participants:

 Description   

One of our mongod replica set members, which is part of a cluster consisting of 3 shards, went down due to a segmentation fault:

2014-06-29T22:14:06.767+0200 [conn199059]  ntoskip:0 ntoreturn:1
2014-06-29T22:14:06.767+0200 [conn199059] stale version detected during query over offerStore.$cmd : { $err: "[offerStore.offer] shard version not ok in Client::Context: version mismatch detected for offerStore.offer, stored major version 20016 does not match ...", code: 13388, ns: "offerStore.offer", vReceived: Timestamp 20015000|85, vReceivedEpoch: ObjectId('538f1c07b86632c2d721e203'), vWanted: Timestamp 20016000|0, vWantedEpoch: ObjectId('538f1c07b86632c2d721e203') }
2014-06-29T22:14:06.767+0200 [conn198333] end connection 172.16.65.202:43434 (1166 connections now open)
2014-06-29T22:14:06.767+0200 [conn199059] end connection 172.16.65.202:43728 (1165 connections now open)
2014-06-29T22:14:06.812+0200 [conn199123] moveChunk migrate commit accepted by TO-shard: { active: false, ns: "offerStore.offer", from: "offerStoreUK/s128:27017,s137:27017,s227:27017", min: { _id: 99144222 }, max: { _id: 129281657 }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 2843, clonedBytes: 3969903, catchup: 0, steady: 0 }, ok: 1.0 }
2014-06-29T22:14:06.812+0200 [conn199123] moveChunk updating self version to: 20016|1||538f1c07b86632c2d721e203 through { _id: 129281657 } -> { _id: 131845582 } for collection 'offerStore.offer'
2014-06-29T22:14:06.812+0200 [conn199123] SyncClusterConnection connecting to [sx350:20019]
2014-06-29T22:14:06.814+0200 [conn199123] SyncClusterConnection connecting to [sx351:20019]
2014-06-29T22:14:06.816+0200 [conn199123] SyncClusterConnection connecting to [sx352:20019]
2014-06-29T22:14:07.147+0200 [conn199123] about to log metadata event: { _id: "s128-2014-06-29T20:14:07-53b0738fe4db6482ab714a67", server: "s128", clientAddr: "172.16.65.202:43756", time: new Date(1404072847147), what: "moveChunk.commit", ns: "offerStore.offer", details: { min: { _id: 99144222 }, max: { _id: 129281657 }, from: "offerStoreUK", to: "offerStoreUK3", cloned: 2843, clonedBytes: 3969903, catchup: 0, steady: 0 } }
2014-06-29T22:14:07.337+0200 [conn201544] end connection 172.16.64.98:54303 (1164 connections now open)
2014-06-29T22:14:07.338+0200 [initandlisten] connection accepted from 172.16.64.98:54305 #201548 (1165 connections now open)
2014-06-29T22:14:07.339+0200 [conn201548]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-06-29T22:14:07.350+0200 [conn199123] MigrateFromStatus::done About to acquire global write lock to exit critical section
2014-06-29T22:14:07.350+0200 [conn199123] MigrateFromStatus::done Global lock acquired
2014-06-29T22:14:07.361+0200 [conn199123] doing delete inline for cleanup of chunk data
2014-06-29T22:14:07.361+0200 [conn199123] SEVERE: Invalid access at address: 0
2014-06-29T22:14:07.460+0200 [conn199123] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0x11c0e91 0x11c026e 0x11c035f 0x7f6a68197030 0xdd7996 0xdd96cc 0xdd9c1a 0xfd21a3 0xa1e85a 0xa1f8ce 0xa21086 0xd4dae7 0xb97322 0xb99902 0x76b6af 0x117720b 0x7f6a6818eb50 0x7f6a675320ed
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod() [0x11c026e]
 /usr/bin/mongod() [0x11c035f]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0xf030) [0x7f6a68197030]
 /usr/bin/mongod(_ZNK5mongo12RangeDeleter11NSMinMaxCmpclEPKNS0_8NSMinMaxES4_+0x26) [0xdd7996]
 /usr/bin/mongod(_ZNK5mongo12RangeDeleter17canEnqueue_inlockERKNS_10StringDataERKNS_7BSONObjES6_PSs+0x1fc) [0xdd96cc]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter9deleteNowERKSsRKNS_7BSONObjES5_S5_bPSs+0x22a) [0xdd9c1a]
 /usr/bin/mongod(_ZN5mongo16MoveChunkCommand3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0xea73) [0xfd21a3]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa1e85a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xd5e) [0xa1f8ce]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c6) [0xa21086]
 /usr/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x2307) [0xd4dae7]
 /usr/bin/mongod() [0xb97322]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x442) [0xb99902]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9f) [0x76b6af]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4fb) [0x117720b]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f6a6818eb50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f6a675320ed]


We are running mongodb-linux-x86_64-2.6.1.
It might be related to this issue:
https://jira.mongodb.org/browse/SERVER-14261



 Comments   
Comment by Randolph Tan [ 06/Aug/14 ]

This one has been fixed in SERVER-14261.

Comment by Kay Agahd [ 06/Aug/14 ]

renctan, even with the balancer disabled and no chunk migrations running, mongod crashed at the very end of a compaction again:

2014-08-05T21:49:21.656+0200 [conn109521] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:33179]
2014-08-05T21:49:21.774+0200 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f060291e8ba 0x7f0601cd202d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/libpthread.so.0(+0x68ba) [0x7f060291e8ba]
 /lib/libc.so.6(clone+0x6d) [0x7f0601cd202d]
2014-08-05T21:49:22.247+0200 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform):
0x11c0e91 0x11c0118 0x7f06024c7166 0x7f06024c7193 0x7f06024c728e 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f060291e8ba 0x7f0601cd202d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod() [0x11c0118]
 /usr/lib/libstdc++.so.6(+0xcb166) [0x7f06024c7166]
 /usr/lib/libstdc++.so.6(+0xcb193) [0x7f06024c7193]
 /usr/lib/libstdc++.so.6(+0xcb28e) [0x7f06024c728e]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/libpthread.so.0(+0x68ba) [0x7f060291e8ba]
 /lib/libc.so.6(clone+0x6d) [0x7f0601cd202d]

This happens in roughly 1 out of 2 compact runs since we upgraded to v2.6. Earlier MongoDB versions didn't have this problem.

Instead of using compact in the future, it might be better for us to empty the db folder and do an initial sync. It takes almost the same time as a compact and it does not crash mongod.

Looking forward to seeing a more stable, less quirky, more maintainable MongoDB version in the next release.

Comment by Kay Agahd [ 05/Aug/14 ]

Thanks renctan, we have disabled the balancer for now because we've had too much hassle with it since v2.6. We wrote our own balancer which also takes the amount of RAM into account, because with a growing number of shards the amount of RAM per shard is not always the same.

It would be nice to see a smarter balancer in the upcoming MongoDB version. At least one which does not block, get stuck, or crash mongod.

Comment by Randolph Tan [ 05/Aug/14 ]

kay.agahd@idealo.de Running compact during migration on v2.6 is not advised due to this bug: SERVER-14674

Comment by Kay Agahd [ 28/Jul/14 ]

Good to know that data may be corrupt even if mongod starts without any complaint. However, all of the last crashes happened exactly at the very end of a compact. Do you really think corrupt data would make mongod crash only at the end of the compact process?

Full resync is simpler than running compact on multiple collections time after time.

On this cluster we are using only one database which has only one collection (offerStore.offer).

fixed your writes not to grow documents on update

Yes, I think the same, but it's very difficult to get all clients to implement this.

Comment by Asya Kamsky [ 28/Jul/14 ]

Do you mean that mongod wasn't aware of corrupted data upon restart and went down a few days later because of it?

Correct - there is no check of existing data files at startup - that would be the equivalent of running db.collection.validate(true) on every collection, which would take a very long time.
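For illustration, a minimal mongo-shell sketch of what such a full validation pass would look like (the database name is the one from this ticket; this is not something mongod runs for you at startup):

var dbToCheck = db.getSiblingDB("offerStore");
dbToCheck.getCollectionNames().forEach(function (name) {
    // validate(true) scans every document and index entry, hence the long runtime
    printjson(dbToCheck[name].validate(true));
});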

The only indication of a problem may be an invalid BSON error or, if you are unlucky, some more subtle sign of corruption.

Rather than running validate on each node, why not just resync them (or at least the ones that have crashed and haven't been resynced since)? Full resync is simpler than running compact on multiple collections time after time.

I understand about journaling and needing a separate device to minimize the performance impact, but I think if you fixed your writes not to grow documents on update, you would see much better performance without having to worry about so many other things.

Asya

Comment by Kay Agahd [ 28/Jul/14 ]

how does it determine how many documents are in each chunk?

It doesn't show the number of documents in each chunk - perhaps you made a typo? It shows the number of documents on each shard. It's just a count on the primary of each replica set.
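For context, a rough mongo-shell sketch of how such a per-shard count can be collected (the hostnames below are placeholders, not the ones from this cluster):

// connect directly to the primary of each shard's replica set (not via mongos)
// and count the sharded collection there
["shard1-primary:27017", "shard2-primary:27017", "shard3-primary:27017"].forEach(function (host) {
    var conn = new Mongo(host);
    print(host + ": " + conn.getDB("offerStore").offer.count() + " documents");
});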

We disabled journaling because if a server fails, which is very rare, it can come back in sync quite quickly since each server stores only 200-300 GB of data. In another cluster, where each node holds terabytes of data, we have journaling enabled because it would take ages to get back in sync.
Also, we noticed that most servers restarted without indicating any problem after a crash. Do you mean that mongod wasn't aware of corrupted data upon restart and went down a few days later because of it? If that's true, we would need to add an external device for journaling in order to keep the performance drop as low as possible.

Comment by Asya Kamsky [ 28/Jul/14 ]

Kay,

I want to make a comment regarding the earlier crashes. Because you are running with journaling disabled, if mongod ever crashes in any way (i.e. goes down in any way other than a normal shutdown) you must delete the dbpath contents and resync it from another node.

When you disable journaling, you lose single-node durability - on restart, there is no way to guarantee that all the data file contents are in a consistent state. This is particularly likely to be a problem for you since your background flush times (according to MMS) run into multiple seconds and flushing happens every 60 seconds; if mongod goes down while the data files are partially flushed to disk, you will end up with inconsistent (basically unusable/corrupt) data files.

We strongly caution you to *not* disable journaling, however, given that you have, that basically commits you to resyncing any node that goes down abnormally.

In addition, if all the nodes in a replica set are in the same physical location and can all be affected by the same adverse event (i.e. power outage) you could end up with corrupt files on every member of the replica set.

Note, I'm not saying that any of the problems you are seeing now are related to these crashes/nojournal, however, it's something I need to keep in mind if anything starts looking "strange".

Asya

Comment by Asya Kamsky [ 28/Jul/14 ]

The screenshot looks like monitoring that you run - how does it determine how many documents are in each chunk?

Comment by Kay Agahd [ 27/Jul/14 ]

screenshot of chunk move attached

Comment by Kay Agahd [ 27/Jul/14 ]

asya, the balancer is counterproductive for us because we pre-split, so new documents are already inserted on the right shard. Also, for some of our clusters we want to load more documents onto shards which are equipped with more RAM. Moreover, chunks may become empty over time, so it's a bad idea to just keep the number of chunks equal between shards.
Furthermore, since v2.6 we have encountered more problems than before when we wanted to pre-split, due to locks held forever by the balancer, the range deleter or something else. The workaround "waitForDelete=true" didn't help. We had to kill the mongod eventually. So finally, during the lifetime of this ticket, we decided to shut down the balancer to get around these problems. That's why you see that we had the balancer running at the beginning of this ticket.

Yes, we run neither the balancer nor moveChunk commands at the same time as the chunk-checker.js script.

I'll attach a screenshot which shows the number of documents on each shard. Shards 1 and 5 are equipped with more RAM, which is why they have more documents than the other shards. Arrow 1 shows the moment when we moved chunks from shard 2 to shards 1 and 5. You can see that the number of documents increases on shards 1 and 5 but does not decrease on shard 2, even after having waited for 3 days!
2 days later we issued the cleanupOrphaned command - without any visible result in the document count. 3 days later (see arrow 2), we executed the chunk-checker.js script on shard 2. As you can see, the number of documents on shard 2 dropped immediately by the number of documents that had been moved 3 days before.

Concerning the noTimeout cursors: yes, we still have applications which use them because they iterate over very large result sets and we don't know how fast the results can be consumed by the client. We would like to set a cursor timeout on the server instead of having the default one of 10 minutes. It's very difficult to know the optimal batch size: if it's too big, the server closes the cursor; if it's too small, the iteration over the large result set takes too much time. If we could set the cursor timeout to, say, one or two hours, our applications could stop using noTimeout cursors.
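To make the trade-off concrete, a hedged mongo-shell sketch (the batch size of 1000 is purely illustrative):

// a noTimeout cursor never expires on the server, which is why so many stay open
var noTimeoutCursor = db.offer.find().addOption(DBQuery.Option.noTimeout);

// a plain cursor with a tuned batch size relies on each getMore arriving before
// the server's default 10-minute idle timeout closes the cursor
var tunedCursor = db.offer.find().batchSize(1000);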

Thank you for your help!

Comment by Asya Kamsky [ 27/Jul/14 ]

Can you clarify this question:

Btw. the moved chunks are not deleted anymore. Is this a bug in v2.6 or is it because we have disabled the balancer?

If the balancer is disabled then there will not be any chunk moves done by the system. If you mean that you are running the moveChunk command manually (or programmatically) and you are seeing the number of documents on the "from" shard stay the same, that indicates that the delete part of moveChunk has yet to run or is still running.
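For reference, a hedged sketch of such a manual chunk move issued from a mongos, using the namespace, a chunk bound and the destination shard name that appear in the logs above; _waitForDelete makes the command block until the source shard has deleted the moved range:

// run against a mongos, with the balancer stopped during this window
db.adminCommand({
    moveChunk: "offerStore.offer",
    find: { _id: 99144222 },    // any shard-key value inside the chunk to move
    to: "offerStoreUK3",        // destination shard
    _waitForDelete: true        // wait for the "from" shard to delete the range
});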

Make sure that you are not using https://github.com/mongodb/mongo-snippets/blob/master/sharding/chunk-checker.js unless you have made sure that your balancer cannot possibly be running while the script runs and that you yourselves are not running moveChunk commands at that time.

You said you disabled the balancer but in an earlier comment on the ticket it looks like the config settings have it enabled:

{ "_id" : "balancer", "stopped" : false, "_waitForDelete" : true }

As far as cleanupOrphaned goes - what were you using to determine that it didn't delete any documents? If there are no orphans then of course it would not delete anything, but note that it does not report how many documents it cleaned up; you can see that number in the logs of the mongod you ran the command on. If you are seeing incorrect behavior from cleanupOrphaned, please open a separate bug report for it so we can diagnose it specifically.
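For completeness, a hedged mongo-shell sketch of how cleanupOrphaned is typically run on the primary of a shard (not via mongos), looping until the whole key range has been scanned:

var nextKey = {};
while (nextKey != null) {
    var res = db.adminCommand({ cleanupOrphaned: "offerStore.offer", startingFromKey: nextKey });
    if (res.ok !== 1) {
        print("cleanupOrphaned failed, stopping");
        break;
    }
    printjson(res);              // the number of deleted documents is logged by mongod, not returned here
    nextKey = res.stoppedAtKey;  // null once the entire range has been processed
}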

Btw, looking in MMS I see recent numbers for open cursors like "noTimeout": 279901, which is a huge number - is your application still opening noTimeout cursors? This may cause the deletes to stall, as previously discussed in SERVER-14047.

We should probably move this discussion out of SERVER project as it's not directly related to this bug, although we may determine that some enhancements can be made in the future to make it easier to diagnose what the root cause is for these situations.

Asya

Comment by Kay Agahd [ 27/Jul/14 ]

asya yes, you are right. I didn't explain it correctly. What I meant was that after running cleanupOrphaned, the number of documents was still the same on the shard. However, after having run the chunk-checker.js script, the number of documents dropped significantly on this shard.
My (perhaps wrong) conclusion was that compacting fewer documents would result in less memory consumption. We need to run compact every few weeks because the fragmented mapped data files don't fit in RAM whereas the compacted ones do. We don't really care about disk space (even though we're using SSDs). If we didn't compact, database performance would drop significantly over time.

Comment by Asya Kamsky [ 27/Jul/14 ]

The cleanupOrphaned command cleans up documents (i.e. removes them), but just like any other delete the space goes on the free list to be reused for future inserts. It does not get returned to the OS unless you run repairDatabase - the compact command will move the remaining data closer together but won't return space to the OS either.

This is known and expected behavior.
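For reference, a hedged mongo-shell sketch of the two operations contrasted above:

// compact defragments a collection's extents in place; freed space stays on the
// collection's free list and is not returned to the OS
db.getSiblingDB("offerStore").runCommand({ compact: "offer" });

// repairDatabase rewrites the data files and can shrink them on disk, at the cost
// of a full rewrite and the need for enough free disk space to hold the copy
db.getSiblingDB("offerStore").repairDatabase();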

Comment by Kay Agahd [ 27/Jul/14 ]

Hi thomasr, the issue occurred several times again. It happens at the very end of a compact. Here are the logs of the last crashed mongod:

2014-07-27T01:14:44.042+0200 [conn1330414] end connection 172.16.64.36:56946 (37 connections now open)
2014-07-27T01:14:44.043+0200 [RangeDeleter] Deleter starting delete for: offerStore.offer from { _id: 2863751006 } -> { _id: 2863831101 }, with opId: 7
2014-07-27T01:14:44.043+0200 [conn1331176] end connection 172.16.64.36:47881 (36 connections now open)
2014-07-27T01:14:44.046+0200 [conn1331355] end connection 172.16.64.36:37750 (35 connections now open)
2014-07-27T01:14:44.062+0200 [rsSync] replSet SECONDARY
2014-07-27T01:14:44.062+0200 [RangeDeleter] rangeDeleter deleted 0 documents for offerStore.offer from { _id: 2863751006 } -> { _id: 2863831101 }
2014-07-27T01:14:44.064+0200 [RangeDeleter] Assertion: 10107:replicatedToNum called but not master anymore
2014-07-27T01:14:44.065+0200 [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
2014-07-27T01:14:44.160+0200 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f86c0720b50 0x7f86bfac3a7d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f86c0720b50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f86bfac3a7d]
2014-07-27T01:14:44.185+0200 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform):
0x11c0e91 0x11c0118 0x7f86c026e996 0x7f86c026e9c3 0x7f86c026ebee 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f86c0720b50 0x7f86bfac3a7d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod() [0x11c0118]
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63996) [0x7f86c026e996]
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x639c3) [0x7f86c026e9c3]
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63bee) [0x7f86c026ebee]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f86c0720b50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f86bfac3a7d]

Just in case the info might be helpful: we have disabled the balancer since we started using v2.6. We do pre-splitting to keep the shards well balanced. It may happen that we need to move chunks to servers that are equipped with better hardware, because the maxSize parameter does not do what it is supposed to, see also: https://jira.mongodb.org/browse/SERVER-11441

Btw, the moved chunks are not deleted anymore. Is this a bug in v2.6 or is it because we have disabled the balancer? The suggested cleanupOrphaned command does not seem to work since no space was freed up. However, after executing the script https://github.com/mongodb/mongo-snippets/blob/master/sharding/chunk-checker.js multiple GBs were reclaimed. This is very important for us because our db needs to fit in RAM completely for best performance (see also my comment 10 minutes ago at https://jira.mongodb.org/browse/SERVER-5931). Should I open a new bug report for this?

Comment by Thomas Rueckstiess [ 08/Jul/14 ]

Hi Kay,

We were not able to reproduce this segmentation fault, and inspection of the code path that the stack trace provides didn't turn up anything useful either. It seems that a null pointer is being accessed, but we can't see where the RangeDeleter would do that. This may be a very rare race condition. We also don't see an obvious link to SERVER-14261, which you raised previously.

At this stage I think we're out of luck trying to get to the bottom of this without further data. I'll close the ticket as "cannot reproduce", but we've modified the title to reflect the issue better, so it will be easier to search for in the future.

Please let us know if you run into the same issue again.

Regards,
Thomas

Comment by Kay Agahd [ 04/Jul/14 ]

This has happened only once so far.

Comment by Randolph Tan [ 03/Jul/14 ]

Hi,

How many times has this occurred? We are still investigating this issue and nothing obvious has come out of the logs alone.

Thanks!

Comment by Kay Agahd [ 30/Jun/14 ]

I forgot to mention that we followed Randolph Tan's suggestion made in https://jira.mongodb.org/browse/SERVER-14261 by setting the _waitForDelete field in config.settings:

mongos> use config
switched to db config
mongos> db.settings.find()
{ "_id" : "chunksize", "value" : 64 }
{ "_id" : "balancer", "stopped" : false, "_waitForDelete" : true }

Nevertheless, the node went down. Perhaps both issues are not related.
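For reference, a sketch of how that flag is typically written to config.settings (the update command itself is not shown in the ticket, only its result above):

// run against a mongos; upsert covers the case where the balancer document does not exist yet
db.getSiblingDB("config").settings.update(
    { _id: "balancer" },
    { $set: { _waitForDelete: true } },
    { upsert: true }
);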
