[SERVER-11697] Mongos crash when moveChunk Created: 14/Nov/13  Updated: 10/Dec/14  Resolved: 21/May/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: chensi Assignee: Unassigned
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6


Attachments: Text File core.log    
Issue Links:
Duplicate
is duplicated by SERVER-13089 setShardVersion failed host Closed
Related
related to SERVER-13089 setShardVersion failed host Closed
Operating System: ALL
Participants:

 Description   

Mongos crashes when we run moveChunk. The crash is preceded by the following log entry:
Tue Nov 12 20:01:12 [conn611979] Assertion: 10429:setShardVersion failed host: 10.38.171.25:7111

{ oldVersion: Timestamp 29317000|0, oldVersionEpoch: ObjectId('522ad499e1814e603d11be30'), ns: "appid250528.meta_infos0", version: Timestamp 29320000|0, versionEpoch: ObjectId('522ad499e1814e603d11be30'), globalVersion: Timestamp 29321000|0, globalVersionEpoch: ObjectId('522ad499e1814e603d11be30'), reloadConfig: true, errmsg: "shard global version for collection is higher than trying to set to 'appid250528.meta_infos0'", ok: 0.0 }

When I debug the core dump with gdb, something is clearly wrong. The details are:
1. Chunk X moves from shardA to shardB; chunk X's version updates to 29363|0, and shardB's local chunk Y updates to 29363|1. These updates are sent to the config server when the moveChunk finishes.
2. In gdb, the ChunkManager object for this namespace shows:
_version = 29363|1
_shardVersion[shardA] = 29363|1
_shardVersion[shardB] = 29362|0
3. So when checkShardVersion retries for shardB, the retry's call to 'conf->getChunkManager( ns , true )' skips the reload, because _version is already the newest.

The key reason is:
_version has been updated to the newest value, but _shardVersion[shardB] still holds the old version.
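
To make that state concrete, here is a minimal, self-contained sketch of the stuck cache (my own simplified model of the problem; the type names and comparison logic are assumptions, not the actual mongos source):

#include <iostream>
#include <map>
#include <string>

// Versions are "major|minor" pairs, matching the log output above.
struct ChunkVersion {
    int major = 0;
    int minor = 0;
    bool operator==(const ChunkVersion& o) const {
        return major == o.major && minor == o.minor;
    }
    bool olderThan(const ChunkVersion& o) const {
        return major < o.major || (major == o.major && minor < o.minor);
    }
};

int main() {
    // The cached ChunkManager state seen in gdb (step 2 above).
    ChunkVersion version{29363, 1};  // _version, collection-wide
    std::map<std::string, ChunkVersion> shardVersion{
        {"shardA", {29363, 1}},      // _shardVersion[shardA]
        {"shardB", {29362, 0}},      // _shardVersion[shardB], stale
    };

    ChunkVersion configVersion{29363, 1};  // what a plain reload would fetch
    ChunkVersion shardBGlobal{29363, 1};   // shardB's own global version

    // setShardVersion to shardB is rejected: the version we send is behind
    // the shard's global version, as in the errmsg above.
    bool rejected = shardVersion["shardB"].olderThan(shardBGlobal);

    // The retry's non-forced reload is a no-op because the collection-wide
    // version already matches the config server, so the stale per-shard
    // entry is never repaired and every retry fails the same way.
    bool reloadSkipped = (version == configVersion);

    std::cout << std::boolalpha << "rejected: " << rejected
              << ", reload skipped: " << reloadSkipped << '\n';
    // Prints: rejected: true, reload skipped: true
}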

I suspect the following:
When we update the ChunkManager to version 29363|1, calculateConfigDiff reads the chunk info from the config server. It first reads the old version of chunk X (29362|0), then the updates from step 1 happen, and only then does it read the new version of chunk Y (29363|1).
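
The suspected interleaving can be simulated with a small stand-alone program (again my own model; the real calculateConfigDiff reads from the config server's chunks collection):

#include <iostream>
#include <map>
#include <string>
#include <utility>

using Version = std::pair<int, int>;  // major|minor, as in the logs

int main() {
    // The config server's chunk documents before the moveChunk commit.
    std::map<std::string, Version> configChunks{
        {"chunkX", {29362, 0}},
        {"chunkY", {29362, 1}},
    };

    // Suspected interleaving: the diff reader sees chunk X first ...
    Version seenX = configChunks["chunkX"];  // old: 29362|0

    // ... then the moveChunk commit lands on the config server ...
    configChunks["chunkX"] = {29363, 0};
    configChunks["chunkY"] = {29363, 1};

    // ... and only then is chunk Y read.
    Version seenY = configChunks["chunkY"];  // new: 29363|1

    // The rebuilt ChunkManager takes the highest seen version as _version
    // (29363|1), while the per-shard versions are partly derived from the
    // stale chunk X read (29362|0) -- the inconsistent state from step 2.
    std::cout << "seen chunkX: " << seenX.first << '|' << seenX.second
              << ", seen chunkY: " << seenY.first << '|' << seenY.second << '\n';
}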

How to resolve:
After 'conf->getChunkManager( ns , true )' has been retried 3 times in checkShardVersion, use a forced reload for getChunkManager.
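
A minimal sketch of that escalation strategy (the setShardVersion/getChunkManager stand-ins below are hypothetical stubs, and the forceReload parameter is an assumption based on the suggestion above, not a verified mongos signature):

#include <iostream>

// Hypothetical stand-ins for the real mongos calls.
bool setShardVersion(bool cacheRepaired) {
    // Succeeds only once the stale _shardVersion entry has been repaired.
    return cacheRepaired;
}

bool getChunkManager(bool reload, bool forceReload) {
    // A plain reload is a no-op when the cached collection-wide _version
    // already matches the config server; only a forced reload rebuilds the
    // per-shard version map and repairs the stale entry.
    return reload && forceReload;
}

int main() {
    bool cacheRepaired = false;
    for (int tryNumber = 1; tryNumber <= 4; ++tryNumber) {
        if (setShardVersion(cacheRepaired)) {
            std::cout << "setShardVersion succeeded on try " << tryNumber << '\n';
            return 0;
        }
        // Escalate to a forced reload once the plain retries are exhausted.
        bool forceReload = (tryNumber >= 3);
        cacheRepaired = getChunkManager(/*reload=*/true, forceReload);
    }
    std::cout << "giving up: would assert, as in the attached backtrace\n";
    return 1;
}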

Looking forward to your replies. Thank you!



 Comments   
Comment by Ramon Fernandez Marina [ 21/May/14 ]

hustchensi, in SERVER-13089 an issue that looks very similar to this one is reported as fixed after an upgrade to 2.4.9, so I would recommend you upgrade to at least 2.4.9, but preferably to 2.4.10.

I am going to close this ticket now, but if the issue persists after upgrading to a 2.4 release, please re-open this ticket or open a new one.

Comment by liao [ 03/Mar/14 ]

Hi, I am also facing this issue. It has happened four times this month.
The error msg:
setShardVersion failed host: mongo4.mcloud.139.com:20004

{ oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: "mcloud.m_iosyncdetaillog", version: Timestamp 560000|381, versionEpoch: ObjectId('51d37afa0b8c2dd3569f5ed6'), globalVersion: Timestamp 561000|0, globalVersionEpoch: ObjectId('51d37afa0b8c2dd3569f5ed6'), reloadConfig: true, errmsg: "shard global version for collection is higher than trying to set to 'mcloud.m_iosyncdetaillog'", ok: 0.0 }

cause by "One mongos crashed when moveChunk jobs were running. The moveChunk jobs were created one by one on the another mongos node manually" ??? who can reply it?
thank you...

Comment by Cen Li [ 19/Nov/13 ]

@Eliot, yes, the mongos crashed. Here is some additional info on this issue:

1. the cluster topology for this issue: 2 mongos nodes, several replica sets.

2. One mongos crashed while moveChunk jobs were running. The moveChunk jobs were created one by one, manually, on the other mongos node.

3. It has happened several times and should be reproducible.

4. chensi has attached the mongos log and the gdb bt info. If needed, we can provide the coredump file. Let me know if you want any other details.

We appreciate your reply. Thanks a lot.

Comment by chensi [ 19/Nov/13 ]

It crashed and produced a coredump file.

Comment by Eliot Horowitz (Inactive) [ 18/Nov/13 ]

Did the mongos process actually crash, or just have an error?

Comment by chensi [ 18/Nov/13 ]

I guess one of the reasons is:
While calculateConfigDiff is reading the chunk info from the config server, mongod is updating the chunks (two chunks per moveChunk).

Comment by chensi [ 18/Nov/13 ]

The log around the crash time is attached above.

Comment by chensi [ 18/Nov/13 ]

gdb bt output:
#0 0x000000302c805f4f in ?? () from /lib64/libgcc_s.so.1
#1 0x000000302c806df7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2 0x000000302afd73cf in backtrace () from /lib64/tls/libc.so.6
#3 0x000000000074d290 in formattedBacktrace (signalNum=11) at src/mongo/util/signal_handlers.cpp:93
#4 mongo::printStackAndExit (signalNum=11) at src/mongo/util/signal_handlers.cpp:115
#5 <signal handler called>
#6 0x000000302c805f4f in ?? () from /lib64/libgcc_s.so.1
#7 0x000000302c806df7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#8 0x000000302afd73cf in backtrace () from /lib64/tls/libc.so.6
#9 0x000000000074e663 in mongo::printStackTrace (os=...) at src/mongo/util/stacktrace.cpp:38
#10 0x000000000071af94 in mongo::msgasserted (msgid=10429,
msg=0x7eff4c816a18 "setShardVersion failed host: 10.38.171.25:7111 { oldVersion: Timestamp 29362000|0, oldVersionEpoch: ObjectId('522ad499e1814e603d11be30'), ns: \"appid250528.meta_infos0\", version: Timestamp 29362000|0, "...)
at src/mongo/util/assert_util.cpp:153
#11 0x000000000071b01c in mongo::msgasserted (msgid=<value optimized out>, msg=<value optimized out>)
at src/mongo/util/assert_util.cpp:145
#12 0x00000000006cd579 in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=true, tryNumber=7) at src/mongo/s/shard_version.cpp:285
#13 0x00000000006ccd04 in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=true, tryNumber=6) at src/mongo/s/shard_version.cpp:279
#14 0x00000000006ccd04 in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=true, tryNumber=5) at src/mongo/s/shard_version.cpp:279
#15 0x00000000006ccd04 in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=true, tryNumber=4) at src/mongo/s/shard_version.cpp:279
#16 0x00000000006ccd04 in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=true, tryNumber=3) at src/mongo/s/shard_version.cpp:279
#17 0x00000000006ccd04 in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=true, tryNumber=2) at src/mongo/s/shard_version.cpp:279
#18 0x00000000006ccbdb in mongo::checkShardVersion (conn_in=0x7eff59195b80, ns=..., refManager=...,
authoritative=false, tryNumber=1) at src/mongo/s/shard_version.cpp:253
#19 0x00000000006cd6bc in mongo::VersionManager::checkShardVersionCB (this=<value optimized out>,
conn_in=0x7eff54b81900, authoritative=false, tryNumber=1) at src/mongo/s/shard_version.cpp:294
#20 0x00000000006ced23 in mongo::ShardConnection::_finishInit (this=0x7eff54b81900)
at src/mongo/s/shardconnection.cpp:336
#21 0x000000000058c48b in setVersion (this=0x7eff50acc380, state=..., shard=<value optimized out>,
primary=<value optimized out>, ns=<value optimized out>, vinfo=..., manager=...)
at src/mongo/client/../s/shard.h:266
#22 mongo::ParallelSortClusteredCursor::setupVersionAndHandleSlaveOk (this=0x7eff50acc380, state=...,
shard=<value optimized out>, primary=<value optimized out>, ns=<value optimized out>, vinfo=..., manager=...)
at src/mongo/client/parallel.cpp:736
#23 0x000000000059a15c in mongo::ParallelSortClusteredCursor::startInit (this=0x7eff50acc380)
at src/mongo/client/parallel.cpp:894
#24 0x000000000059e2c9 in mongo::ParallelSortClusteredCursor::fullInit (this=0x7eff50acc380)
at src/mongo/client/parallel.cpp:654
#25 0x00000000006def9c in mongo::ShardStrategy::queryOp(mongo::Request&) ()
#26 0x00000000006c3918 in mongo::Request::process (this=0x674cae20, attempt=0) at src/mongo/s/request.cpp:140
#27 0x000000000053ed92 in mongo::ShardedMessageHandler::process (this=<value optimized out>, m=..., p=0x7eff55e6c700,
le=0x7eff55cc3380) at src/mongo/s/server.cpp:104
#28 0x000000000073cc31 in mongo::pms::threadRun (inPort=0x7eff55e6c700)
at src/mongo/util/net/message_server_port.cpp:85
#29 0x000000302b80610a in start_thread () from /lib64/tls/libpthread.so.0
#30 0x000000302afc6003 in clone () from /lib64/tls/libc.so.6
#31 0x0000000000000000 in ?? ()

Comment by Eliot Horowitz (Inactive) [ 18/Nov/13 ]

Is there a stack trace or something else you can send?
Hard to tell much from this.
The full log would be great.

Comment by chensi [ 18/Nov/13 ]

Can anybody help me?
