[SERVER-20118] Race condition in AsyncResultsMerger can cause find/getMore commands to hang on mongos Created: 21/Aug/15  Updated: 19/Sep/15  Resolved: 25/Aug/15

Status: Closed
Project: Core Server
Component/s: Querying, Sharding
Affects Version/s: None
Fix Version/s: 3.1.8

Type: Bug Priority: Major - P3
Reporter: Benety Goh Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: log (HTML), mci.log, timeout.log
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QuInt 8 08/28/15
Participants:

 Description   

7d7f4fb3b6 Clang/libc++ Ubuntu 1410 64-bit SSL sharding

https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1410_fast_clang_sharding_7d7f4fb3b6f6a171eacf53384053df0fe728db42_15_08_20_21_16_10

sharding/find_getmore_cmd.js stopped making progress during/after a move chunk:

[js_test:find_getmore_cmd] 2015-08-20T23:40:07.197+0000  m30000| 2015-08-20T23:40:07.197+0000 I SHARDING [conn4] About to enter migrate critical section
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.197+0000  m30000| 2015-08-20T23:40:07.197+0000 I SHARDING [conn4] moveChunk setting version to: 2|0||55d66557641b28d3f3ff1cda
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.197+0000  m30001| 2015-08-20T23:40:07.197+0000 I NETWORK  [initandlisten] connection accepted from 10.63.24.107:36101 #4 (4 connections now open)
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.203+0000  m30001| 2015-08-20T23:40:07.203+0000 I SHARDING [migrateThread] migrate commit succeeded flushing to secondaries for 'test.find_getmore_cmd' { _id: 0.0 } -> { _id: MaxKey }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.203+0000  m30001| 2015-08-20T23:40:07.203+0000 I SHARDING [migrateThread] migrate commit flushed to journal for 'test.find_getmore_cmd' { _id: 0.0 } -> { _id: MaxKey }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.203+0000  m30001| 2015-08-20T23:40:07.203+0000 I SHARDING [migrateThread] about to log metadata event: { _id: "ip-10-63-24-107-2015-08-20T23:40:07.203+0000-55d665570b065ef09d3d3bc9", server: "ip-10-63-24-107", clientAddr: "", time: new Date(1440114007203), what: "moveChunk.to", ns: "test.find_getmore_cmd", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 5: 124, step 2 of 5: 0, step 3 of 5: 0, step 4 of 5: 0, step 5 of 5: 11, note: "success" } }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.204+0000  m29000| 2015-08-20T23:40:07.204+0000 I COMMAND  [conn12] command local.oplog.rs command: getMore { getMore: 22754927504, collection: "oplog.rs", maxTimeMS: 2000 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:1 reslen:513 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_command 140ms
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.205+0000  m30000| 2015-08-20T23:40:07.204+0000 I SHARDING [conn4] moveChunk migrate commit accepted by TO-shard: { active: false, ns: "test.find_getmore_cmd", from: "ip-10-63-24-107:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 3, clonedBytes: 54, catchup: 0, steady: 0 }, ok: 1.0 }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.205+0000  m30000| 2015-08-20T23:40:07.204+0000 I SHARDING [conn4] moveChunk updating self version to: 2|1||55d66557641b28d3f3ff1cda through { _id: MinKey } -> { _id: 0.0 } for collection 'test.find_getmore_cmd'
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.205+0000  m30000| 2015-08-20T23:40:07.205+0000 I SHARDING [conn4] about to log metadata event: { _id: "ip-10-63-24-107-2015-08-20T23:40:07.205+0000-55d6655717612145defcae7f", server: "ip-10-63-24-107", clientAddr: "10.63.24.107:48451", time: new Date(1440114007205), what: "moveChunk.commit", ns: "test.find_getmore_cmd", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, from: "shard0000", to: "shard0001", cloned: 3, clonedBytes: 54, catchup: 0, steady: 0 } }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.207+0000  m30000| 2015-08-20T23:40:07.207+0000 I SHARDING [conn4] MigrateFromStatus::done About to acquire global lock to exit critical section
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.207+0000  m30000| 2015-08-20T23:40:07.207+0000 I SHARDING [conn4] forking for cleanup of chunk data
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.207+0000  m30000| 2015-08-20T23:40:07.207+0000 I SHARDING [conn4] MigrateFromStatus::done About to acquire global lock to exit critical section
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.207+0000  m30000| 2015-08-20T23:40:07.207+0000 I SHARDING [RangeDeleter] Deleter starting delete for: test.find_getmore_cmd from { _id: 0.0 } -> { _id: MaxKey }, with opId: 163
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.208+0000  m30000| 2015-08-20T23:40:07.207+0000 I SHARDING [RangeDeleter] rangeDeleter deleted 3 documents for test.find_getmore_cmd from { _id: 0.0 } -> { _id: MaxKey }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.209+0000  m30000| 2015-08-20T23:40:07.209+0000 I SHARDING [conn4] distributed lock with ts: 55d6655717612145defcae7d' unlocked.
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.209+0000  m30000| 2015-08-20T23:40:07.209+0000 I SHARDING [conn4] about to log metadata event: { _id: "ip-10-63-24-107-2015-08-20T23:40:07.209+0000-55d6655717612145defcae80", server: "ip-10-63-24-107", clientAddr: "10.63.24.107:48451", time: new Date(1440114007209), what: "moveChunk.from", ns: "test.find_getmore_cmd", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 6: 0, step 2 of 6: 3, step 3 of 6: 5, step 4 of 6: 129, step 5 of 6: 10, step 6 of 6: 0, to: "shard0001", from: "shard0000", note: "success" } }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.211+0000  m30000| 2015-08-20T23:40:07.211+0000 I COMMAND  [conn4] command test.find_getmore_cmd command: moveChunk { moveChunk: "test.find_getmore_cmd", from: "ip-10-63-24-107:30000", to: "ip-10-63-24-107:30001", fromShard: "shard0000", toShard: "shard0001", min: { _id: 0.0 }, max: { _id: MaxKey }, maxChunkSizeBytes: 52428800, configdb: "test-configRS/ip-10-63-24-107:29000,ip-10-63-24-107:29001,ip-10-63-24-107:29002", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, epoch: ObjectId('55d66557641b28d3f3ff1cda') } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:0 reslen:22 locks:{ Global: { acquireCount: { r: 9, w: 2, R: 3 } }, MMAPV1Journal: { acquireCount: { r: 5, w: 2 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { R: 2, W: 2 } } } protocol:op_command 153ms
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.212+0000  m30999| 2015-08-20T23:40:07.212+0000 I SHARDING [conn1] ChunkManager: time to load chunks for test.find_getmore_cmd: 0ms sequenceNumber: 5 version: 2|1||55d66557641b28d3f3ff1cda based on: 1|2||55d66557641b28d3f3ff1cda
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.213+0000  m30000| 2015-08-20T23:40:07.212+0000 I SHARDING [conn2] remotely refreshing metadata for test.find_getmore_cmd with requested shard version 2|1||55d66557641b28d3f3ff1cda based on current shard version 2|0||55d66557641b28d3f3ff1cda, current metadata version is 2|0||55d66557641b28d3f3ff1cda
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.213+0000  m30001| 2015-08-20T23:40:07.212+0000 I SHARDING [conn2] remotely refreshing metadata for test.find_getmore_cmd with requested shard version 2|0||55d66557641b28d3f3ff1cda based on current shard version 0|0||55d66557641b28d3f3ff1cda, current metadata version is 1|2||55d66557641b28d3f3ff1cda
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.213+0000  m30000| 2015-08-20T23:40:07.213+0000 I SHARDING [conn2] updating metadata for test.find_getmore_cmd from shard version 2|0||55d66557641b28d3f3ff1cda to shard version 2|1||55d66557641b28d3f3ff1cda
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.213+0000  m30000| 2015-08-20T23:40:07.213+0000 I SHARDING [conn2] collection version was loaded at version 2|1||55d66557641b28d3f3ff1cda, took 0ms
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.213+0000  m30001| 2015-08-20T23:40:07.213+0000 I SHARDING [conn2] updating metadata for test.find_getmore_cmd from shard version 0|0||55d66557641b28d3f3ff1cda to shard version 2|0||55d66557641b28d3f3ff1cda
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.214+0000  m30001| 2015-08-20T23:40:07.213+0000 I SHARDING [conn2] collection version was loaded at version 2|1||55d66557641b28d3f3ff1cda, took 0ms
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.550+0000  m29002| 2015-08-20T23:40:07.550+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-63-24-107:29000
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.550+0000  m29000| 2015-08-20T23:40:07.550+0000 I NETWORK  [initandlisten] connection accepted from 10.63.24.107:52630 #19 (14 connections now open)
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.551+0000  m29000| 2015-08-20T23:40:07.551+0000 I NETWORK  [conn19] end connection 10.63.24.107:52630 (13 connections now open)
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.551+0000  m29000| 2015-08-20T23:40:07.551+0000 I NETWORK  [initandlisten] connection accepted from 10.63.24.107:52631 #20 (14 connections now open)
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.570+0000  m29002| 2015-08-20T23:40:07.569+0000 I INDEX    [repl writer worker 4] build index on: config.chunks properties: { v: 1, unique: true, key: { ns: 1, min: 1 }, name: "ns_1_min_1", ns: "config.chunks" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.570+0000  m29002| 2015-08-20T23:40:07.569+0000 I INDEX    [repl writer worker 4] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.570+0000  m29002| 2015-08-20T23:40:07.570+0000 I INDEX    [repl writer worker 4] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.574+0000  m29002| 2015-08-20T23:40:07.574+0000 I INDEX    [repl writer worker 5] build index on: config.chunks properties: { v: 1, unique: true, key: { ns: 1, shard: 1, min: 1 }, name: "ns_1_shard_1_min_1", ns: "config.chunks" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.574+0000  m29002| 2015-08-20T23:40:07.574+0000 I INDEX    [repl writer worker 5] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.575+0000  m29002| 2015-08-20T23:40:07.575+0000 I INDEX    [repl writer worker 5] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.577+0000  m29002| 2015-08-20T23:40:07.577+0000 I INDEX    [repl writer worker 6] build index on: config.chunks properties: { v: 1, unique: true, key: { ns: 1, lastmod: 1 }, name: "ns_1_lastmod_1", ns: "config.chunks" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.577+0000  m29002| 2015-08-20T23:40:07.577+0000 I INDEX    [repl writer worker 6] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.578+0000  m29002| 2015-08-20T23:40:07.578+0000 I INDEX    [repl writer worker 6] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.585+0000  m29002| 2015-08-20T23:40:07.585+0000 I INDEX    [repl writer worker 8] build index on: config.shards properties: { v: 1, unique: true, key: { host: 1 }, name: "host_1", ns: "config.shards" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.585+0000  m29002| 2015-08-20T23:40:07.585+0000 I INDEX    [repl writer worker 8] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.586+0000  m29002| 2015-08-20T23:40:07.586+0000 I INDEX    [repl writer worker 8] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.593+0000  m29002| 2015-08-20T23:40:07.593+0000 I INDEX    [repl writer worker 10] build index on: config.locks properties: { v: 1, key: { ts: 1 }, name: "ts_1", ns: "config.locks" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.593+0000  m29002| 2015-08-20T23:40:07.593+0000 I INDEX    [repl writer worker 10] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.594+0000  m29002| 2015-08-20T23:40:07.594+0000 I INDEX    [repl writer worker 10] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.596+0000  m29002| 2015-08-20T23:40:07.596+0000 I INDEX    [repl writer worker 11] build index on: config.locks properties: { v: 1, key: { state: 1, process: 1 }, name: "state_1_process_1", ns: "config.locks" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.596+0000  m29002| 2015-08-20T23:40:07.596+0000 I INDEX    [repl writer worker 11] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.597+0000  m29002| 2015-08-20T23:40:07.597+0000 I INDEX    [repl writer worker 11] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.604+0000  m29002| 2015-08-20T23:40:07.604+0000 I INDEX    [repl writer worker 13] build index on: config.lockpings properties: { v: 1, key: { ping: 1 }, name: "ping_1", ns: "config.lockpings" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.604+0000  m29002| 2015-08-20T23:40:07.604+0000 I INDEX    [repl writer worker 13] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.605+0000  m29002| 2015-08-20T23:40:07.605+0000 I INDEX    [repl writer worker 13] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.612+0000  m29002| 2015-08-20T23:40:07.612+0000 I INDEX    [repl writer worker 15] build index on: config.tags properties: { v: 1, unique: true, key: { ns: 1, min: 1 }, name: "ns_1_min_1", ns: "config.tags" }
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.612+0000  m29002| 2015-08-20T23:40:07.612+0000 I INDEX    [repl writer worker 15] 	 building index using bulk method
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.613+0000  m29002| 2015-08-20T23:40:07.613+0000 I INDEX    [repl writer worker 15] build index done.  scanned 0 total records. 0 secs
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.651+0000  m29002| 2015-08-20T23:40:07.651+0000 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to ip-10-63-24-107:29000
[js_test:find_getmore_cmd] 2015-08-20T23:40:07.652+0000  m29000| 2015-08-20T23:40:07.651+0000 I NETWORK  [initandlisten] connection accepted from 10.63.24.107:52632 #21 (15 connections now open)
[js_test:find_getmore_cmd] 2015-08-20T23:40:09.214+0000  m29000| 2015-08-20T23:40:09.213+0000 I COMMAND  [conn12] command local.oplog.rs command: getMore { getMore: 22754927504, collection: "oplog.rs", maxTimeMS: 2000 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:1 reslen:86 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_command 2000ms



 Comments   
Comment by Githook User [ 25/Aug/15 ]

Author:

David Storch (dstorch) <david.storch@10gen.com>

Message: SERVER-20118 fix hang in AsyncResultsMerger
Branch: master
https://github.com/mongodb/mongo/commit/3b22f6d2a55aed5a8a697283827046471354a9a6

Comment by David Storch [ 25/Aug/15 ]

The hang occurs in the following scenario:

  1. AsyncResultsMerger::ready() reports to the caller that it has no ready results and must wait on the network for more results to arrive from the shards.
  2. After AsyncResultsMerger::ready() returns but before the caller calls AsyncResultsMerger::nextEvent(), further results come in from the network and are buffered inside the ARM.
  3. The caller gets an event from AsyncResultsMerger::nextEvent() and blocks on it. However, the results being awaited are already buffered inside the ARM, and nothing further will arrive on the network to signal the event, so we're stuck. (A self-contained model of this interleaving is sketched below.)
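
The interleaving can be modeled with a small self-contained program. This is a sketch only, with simplified stand-ins: Event stands in for the executor event handle, MiniMerger for the ARM, and deliver() for a shard response arriving on a network thread; none of these names come from the real code. The sleep mirrors the repro patch below and widens the race window, so the program usually hangs at wait(), just as mongos does:

#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Manual-reset event; a stand-in for the executor event handle that the
// real nextEvent() returns.
struct Event {
    void signal() {
        std::lock_guard<std::mutex> lk(m);
        signaled = true;
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return signaled; });
    }
    std::mutex m;
    std::condition_variable cv;
    bool signaled = false;
};

// A toy merger with the same ready()/nextEvent() shape as the ARM.
class MiniMerger {
public:
    bool ready() {
        std::lock_guard<std::mutex> lk(_m);
        return !_buffered.empty();
    }
    std::shared_ptr<Event> nextEvent() {
        std::lock_guard<std::mutex> lk(_m);
        _event = std::make_shared<Event>();
        return _event;  // BUG: returned unsignaled even if results are buffered
    }
    void deliver(int doc) {  // runs on the "network" thread
        std::lock_guard<std::mutex> lk(_m);
        _buffered.push(doc);
        if (_event) _event->signal();  // only wakes an already-created event
    }
private:
    std::mutex _m;
    std::queue<int> _buffered;
    std::shared_ptr<Event> _event;
};

int main() {
    MiniMerger arm;
    // The final batch: after this, nothing else will ever arrive.
    std::thread network([&arm] { arm.deliver(42); });
    while (!arm.ready()) {                                     // step 1
        std::this_thread::sleep_for(std::chrono::seconds(1));  // widen the window
        // If deliver() ran during the sleep, the result is buffered but no
        // event exists yet, so the signal is lost...              (step 2)
        auto event = arm.nextEvent();
        event->wait();  // ...and this wait never returns.         (step 3)
    }
    network.join();
    return 0;
}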

Applying the patch below makes the hang more reproducible by widening the window between the call to ready() and the call to nextEvent():

diff --git a/src/mongo/s/query/router_stage_merge.cpp b/src/mongo/s/query/router_stage_merge.cpp
index 69dbefa..48df48a 100644
--- a/src/mongo/s/query/router_stage_merge.cpp
+++ b/src/mongo/s/query/router_stage_merge.cpp
@@ -45,6 +45,9 @@ StatusWith<boost::optional<BSONObj>> RouterStageMerge::next() {
     auto killer = MakeGuard(&RouterStageMerge::kill, this);
 
     while (!_arm.ready()) {
+        // Makes the hang more reproducible.
+        sleepsecs(1);
+
         auto nextEventStatus = _arm.nextEvent();
         if (!nextEventStatus.isOK()) {
             return nextEventStatus.getStatus();

The fix is to signal the event from nextEvent() before returning it, if there are already results available to return to the caller.
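
In terms of the sketch above, the fix changes nextEvent() to check the buffer before handing out the event (again a sketch under the same stand-in names; the actual change is in the commit linked in the Githook comment):

    std::shared_ptr<Event> nextEvent() {
        std::lock_guard<std::mutex> lk(_m);
        _event = std::make_shared<Event>();
        // Fix: results may have been buffered between the caller's ready()
        // check and this call. Signal up front so the caller's wait() returns
        // immediately instead of blocking on network activity that never comes.
        if (!_buffered.empty())
            _event->signal();
        return _event;
    }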

Comment by Randolph Tan [ 24/Aug/15 ]

Notes:

The two shards appear to be idle, and mongos appears to be blocked here:

https://github.com/mongodb/mongo/blob/r3.1.7/src/mongo/s/query/router_stage_merge.cpp#L55

The older log also had a stacktrace of where it got stuck:

[js_test:find_getmore_cmd] 2015-08-21T01:40:01.657+0000 DB.prototype.runCommand@src/mongo/shell/db.js:124:1
[js_test:find_getmore_cmd] 2015-08-21T01:40:01.657+0000 @jstests/sharding/find_getmore_cmd.js:69:14
[js_test:find_getmore_cmd] 2015-08-21T01:40:01.657+0000 @jstests/sharding/find_getmore_cmd.js:4:2
[js_test:find_getmore_cmd] 2015-08-21T01:40:01.657+0000 
[js_test:find_getmore_cmd] 2015-08-21T01:40:01.657+0000 failed to load: jstests/sharding/find_getmore_cmd.js

The stack trace points to this getMore command:

https://github.com/mongodb/mongo/blob/r3.1.7/jstests/sharding/find_getmore_cmd.js#L69

Comment by Benety Goh [ 22/Aug/15 ]

169a7ca486 Enterprise RHEL 5.7 64-bit sharding_auth

https://evergreen.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_57_64_bit_sharding_auth_169a7ca486099a93e8d3ea19cc556dc2c55fcd0e_15_08_22_02_22_16

https://logkeeper.mongodb.org/build/55d8881e90413011a282579a/test/55d88cac90413011a282ba23

https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_enterprise_rhel_57_64_bit_sharding_auth_169a7ca486099a93e8d3ea19cc556dc2c55fcd0e_15_08_22_02_22_16/0?type=T

[js_test:find_getmore_cmd] 2015-08-22T14:52:33.918+0000  m30000| 2015-08-22T14:52:33.913+0000 I SHARDING [conn4] moveChunk migrate commit accepted by TO-shard: { active: false, ns: "test.find_getmore_cmd", from: "ip-10-136-56-106:30000", min: { _id: 0.0 }, max: { _id: MaxKey }, shardKeyPattern: { _id: 1.0 }, state: "done", counts: { cloned: 3, clonedBytes: 54, catchup: 0, steady: 0 }, ok: 1.0 }
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.918+0000  m30000| 2015-08-22T14:52:33.913+0000 I SHARDING [conn4] moveChunk updating self version to: 2|1||55d88cb1272957d2ef80c595 through { _id: MinKey } -> { _id: 0.0 } for collection 'test.find_getmore_cmd'
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.919+0000  m30000| 2015-08-22T14:52:33.914+0000 I SHARDING [conn4] about to log metadata event: { _id: "ip-10-136-56-106-2015-08-22T14:52:33.914+0000-55d88cb1add6143dfe0fc494", server: "ip-10-136-56-106", clientAddr: "10.136.56.106:40826", time: new Date(1440255153914), what: "moveChunk.commit", ns: "test.find_getmore_cmd", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, from: "shard0000", to: "shard0001", cloned: 3, clonedBytes: 54, catchup: 0, steady: 0 } }
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.919+0000  m30000| 2015-08-22T14:52:33.917+0000 I SHARDING [conn4] MigrateFromStatus::done About to acquire global lock to exit critical section
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.920+0000  m30000| 2015-08-22T14:52:33.917+0000 I SHARDING [conn4] forking for cleanup of chunk data
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.921+0000  m30000| 2015-08-22T14:52:33.917+0000 I SHARDING [RangeDeleter] Deleter starting delete for: test.find_getmore_cmd from { _id: 0.0 } -> { _id: MaxKey }, with opId: 57
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.921+0000  m30000| 2015-08-22T14:52:33.918+0000 I SHARDING [RangeDeleter] rangeDeleter deleted 3 documents for test.find_getmore_cmd from { _id: 0.0 } -> { _id: MaxKey }
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.921+0000  m30000| 2015-08-22T14:52:33.918+0000 I SHARDING [conn4] MigrateFromStatus::done About to acquire global lock to exit critical section
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.922+0000  m30000| 2015-08-22T14:52:33.920+0000 I SHARDING [conn4] distributed lock with ts: 55d88cb1add6143dfe0fc492' unlocked.
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.922+0000  m30000| 2015-08-22T14:52:33.920+0000 I SHARDING [conn4] about to log metadata event: { _id: "ip-10-136-56-106-2015-08-22T14:52:33.920+0000-55d88cb1add6143dfe0fc495", server: "ip-10-136-56-106", clientAddr: "10.136.56.106:40826", time: new Date(1440255153920), what: "moveChunk.from", ns: "test.find_getmore_cmd", details: { min: { _id: 0.0 }, max: { _id: MaxKey }, step 1 of 6: 0, step 2 of 6: 6, step 3 of 6: 74, step 4 of 6: 28, step 5 of 6: 25, step 6 of 6: 0, to: "shard0001", from: "shard0000", note: "success" } }
[js_test:find_getmore_cmd] 2015-08-22T14:52:33.927+0000  m30000| 2015-08-22T14:52:33.924+0000 I COMMAND  [conn4] command test.find_getmore_cmd command: moveChunk { moveChunk: "test.find_getmore_cmd", from: "ip-10-136-56-106:30000", to: "ip-10-136-56-106:30001", fromShard: "shard0000", toShard: "shard0001", min: { _id: 0.0 }, max: { _id: MaxKey }, maxChunkSizeBytes: 52428800, configdb: "test-configRS/ip-10-136-56-106:29000,ip-10-136-56-106:29001,ip-10-136-56-106:29002", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, epoch: ObjectId('55d88cb1272957d2ef80c595') } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:0 reslen:22 locks:{ Global: { acquireCount: { r: 9, w: 2, R: 3 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { r: 2, W: 2 } } } protocol:op_command 142ms

Comment by Randolph Tan [ 21/Aug/15 ]

Notes:

  • moveChunk actually completes
  • getMore hung until the test was forcefully terminated. Unfortunately, the hang analyzer was unable to capture stack traces of the mongo processes.