Details
-
Bug
-
Resolution: Done
-
Major - P3
-
None
-
2.8.0-rc3, 2.8.0-rc4
-
None
-
Fully Compatible
-
ALL
Description
Related to SERVER-16763.
Found the following entry in the server log during a longevity test; the server eventually crashed.
2015-01-05T17:47:22.602+0000 E SHARDING [conn60] moveChunk cannot enter critical section before all data is cloned, 81584 locs were not transferred but to-shard reported { active: true, ns: "sbtest.sbtest1", from: "rs2/172.31.32.214:27017,ip-172-31-35-229:27017", min: { _id: -7816322693657637576 }, max: { _id: -7672769179660119751 }, shardKeyPattern: { _id: "hashed" }, state: "clone", counts: { cloned: 1480, clonedBytes: 321160, catchup: 0, steady: 0 }, ok: 1.0 }
SERVER-16763 addressed an issue where system clock drift could cause lock timeouts.
The moveChunk message may be a separate issue that needs its own fix.
I looked up the message "moveChunk cannot enter critical section before all data is cloned, 81584 locs were not transferred but to-shard reported", which is the last message logged before the thread's long wait and the eventual crash. It points to https://github.com/mongodb/mongo/blob/master/src/mongo/s/d_migrate.cpp#L1372-L1380, where the comment says:
// Should never happen, but safe to abort before critical section
mongod then crashes some time later while waiting at https://github.com/mongodb/mongo/blob/master/src/mongo/s/d_migrate.cpp#L307 (which should be fixed by SERVER-16763).
It is not clear what condition could cause migrateFromStatus.cloneLocsRemaining() to be non-zero here, given that this case is expected never to happen.
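
For reference, the guard in question can be sketched roughly as follows. This is an illustration only, not the code in d_migrate.cpp: the MigrationSourceSketch type and the canEnterCriticalSection function are hypothetical names, and only cloneLocsRemaining() and the message text are taken from this report.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the donor-side migration state.
// This is NOT the real migrateFromStatus object from d_migrate.cpp.
struct MigrationSourceSketch {
    std::size_t locsRemaining;  // record locations still waiting to be cloned

    std::size_t cloneLocsRemaining() const { return locsRemaining; }
};

// Returns true when the donor may enter the critical section.
// When locations are still pending, fills errmsg with text modeled on the
// log line quoted in this ticket and returns false so the caller can abort.
bool canEnterCriticalSection(const MigrationSourceSketch& src, std::string& errmsg) {
    const std::size_t remaining = src.cloneLocsRemaining();
    if (remaining != 0) {
        // "Should never happen, but safe to abort before critical section"
        errmsg = "moveChunk cannot enter critical section before all data is cloned, " +
                 std::to_string(remaining) + " locs were not transferred";
        return false;
    }
    return true;
}
```

In the logged case the to-shard still reported state: "clone" with 1480 documents cloned while 81584 locs remained on the donor, so a guard of this shape would abort the migration instead of entering the critical section.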