[SERVER-10098] zbigMapReduce.js failure on Windows 64-bit DEBUG Created: 03/Jul/13  Updated: 10/Dec/14  Resolved: 10/Jul/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Eric Milkie
Resolution: Duplicate Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-10116 zBigMapReduce.js failure on Linux 64-... Closed
Operating System: ALL
Participants:

Description

Failure on zbigMapReduce.js:

http://buildbot.mongodb.org/builders/Windows%2064-bit%20DEBUG/builds/884 revision 5741e9a35cfdbf7aa475f4a611f9b57c0c98134d

http://buildlogs.mongodb.org/Windows%2064-bit%20DEBUG/builds/884/test/sharding/zbigMapReduce.js

m31200| Wed Jul 03 17:08:12.237 [conn14] command test.$cmd command: { mapreduce: "foo", map: function () { emit(this.val, 1); }, reduce: function (key, values) { return Array.sum(values); }, query: { i: { $gte: 51200.0 } }, out: "tmp.mrs.foo_1372885543_16", shardedFirstPass: true } ntoreturn:1 keyUpdates:0 numYields: 42192 locks(micros) W:17031 r:118007346 w:86481206 reslen:149 149175ms
 m31201| Wed Jul 03 17:08:12.252 [repl writer worker 1] 
 m31201| debug have W lock but w would suffice for command drop
 m31201| Wed Jul 03 17:08:12.252 [repl writer worker 1] CMD: drop test.tmp.mr.foo_18_inc
Wed Jul 03 17:08:12.252 map reduce failed:{
	"code" : 15988,
	"ok" : 0,
	"errmsg" : "exception: could not run map command on all shards for ns test.foo and query { i: { $gte: 51200.0 } } :: caused by :: error querying server"
} at src/mongo/shell/collection.js:975
failed to load: D:\slave\Windows_64bit_DEBUG\mongo\jstests\sharding\zbigMapReduce.js
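For reference, the shard-side first-pass command in the log above corresponds to a client-side mapReduce call of roughly this shape (a reconstruction from the logged command, not the test's exact source; the output collection name here is illustrative):

 // Reconstructed from the m31200 command log above -- collection, map,
 // reduce, and query match what was logged; the output collection name
 // is illustrative only.
 db.foo.mapReduce(
     function () { emit(this.val, 1); },                   // map
     function (key, values) { return Array.sum(values); }, // reduce
     {
         query: { i: { $gte: 51200 } },  // only documents with i >= 51200
         out: "mr_out"                   // replace into "mr_out" (hypothetical name)
     }
 );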

milkie - please triage? This doesn't seem to resemble the other zbigMapReduce.js failures we've seen lately.



Comments
Comment by Eric Milkie [ 10/Jul/13 ]

I pushed my fix for this issue on Monday, and the failing build in this ticket was for last Wednesday.

Comment by Greg Studer [ 10/Jul/13 ]

This looks similar to other replica set heartbeat issues we've seen, and unrelated to sharding in general:

The secondary can't talk to the primary easily, presumably because M/R is unable to yield:

 m31101| Wed Jul 03 16:19:02.311 [rsBackgroundSync] replSet syncing to: AMAZONA-J7UBCUV:31100
 m31101| Wed Jul 03 16:19:02.311 [rsBackgroundSync] replSet remoteOldestOp:    Jul 03 16:18:41 51d48721:1
 m31101| Wed Jul 03 16:19:02.311 [rsBackgroundSync] replSet lastOpTimeFetched: Jul 03 16:18:41 51d48721:1
 m31101| Wed Jul 03 17:05:39.981 [rsBackgroundSync] Socket recv() timeout  10.28.45.204:31100
 m31101| Wed Jul 03 17:05:39.981 [rsBackgroundSync] SocketException: remote: 10.28.45.204:31100 error: 9001 socket exception [RECV_TIMEOUT] server [10.28.45.204:31100] 
 m31101| Wed Jul 03 17:05:39.981 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: AMAZONA-J7UBCUV:31100
 m31101| Wed Jul 03 17:05:39.981 [rsBackgroundSync] replSet syncing to: AMAZONA-J7UBCUV:31100
 m31101| Wed Jul 03 17:05:42.679 [rsBackgroundSync] replSet remoteOldestOp:    Jul 03 16:50:55 51d48eaf:b
 m31101| Wed Jul 03 17:05:42.679 [rsBackgroundSync] replSet lastOpTimeFetched: Jul 03 17:05:09 51d49205:fb

So the oplog getMore takes a long time:

 m31100| Wed Jul 03 17:05:47.484 [conn5] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1372882721000|1 } } cursorid:70168653289864 ntoreturn:0 keyUpdates:0 numYields: 64 locks(micros) r:3405848 nreturned:25609 reslen:2791409 4808ms

...and then we step down, and the test explodes:

 m31100| Wed Jul 03 17:05:53.709 [rsHealthPoll] DBClientCursor::init call() failed
 m31100| Wed Jul 03 17:05:53.709 [rsHealthPoll] dev: lastError==0 won't report:DBClientBase::findN: transport error: AMAZONA-J7UBCUV:31101 ns: admin.$cmd query: { replSetHeartbeat: "test-rs0", v: 1, pv: 1, checkEmpty: false, from: "AMAZONA-J7UBCUV:31100" }
 m31100| Wed Jul 03 17:05:53.709 [rsHealthPoll] replSet info AMAZONA-J7UBCUV:31101 is down (or slow to respond): 
 m31100| Wed Jul 03 17:05:53.709 [rsHealthPoll] replSet member AMAZONA-J7UBCUV:31101 is now in state DOWN
 m31100| Wed Jul 03 17:05:53.709 [rsMgr] can't see a majority of the set, relinquishing primary
 m31100| Wed Jul 03 17:05:53.740 [rsMgr] replSet relinquishing primary state
 m31100| Wed Jul 03 17:05:53.740 [rsMgr] replSet SECONDARY
 m31100| Wed Jul 03 17:05:53.740 [rsMgr] replSet closing client sockets after relinquishing primary
 m31100| Wed Jul 03 17:05:53.740 [conn1] end connection 127.0.0.1:54582 (19 connections now open)
 m31100| Wed Jul 03 17:05:53.740 [conn121] end connection 10.28.45.204:57190 (19 connections now open)
 m31100| Wed Jul 03 17:05:53.740 [conn13] end connection 10.28.45.204:54641 (19 connections now open)
 m31100| Wed Jul 03 17:05:53.740 [conn1] thread conn1 stack usage was 34472 bytes,  which is the most so far
 m31100| Wed Jul 03 17:05:53.740 [conn13] thread conn13 stack usage was 68312 bytes,  which is the most so far
 m31100| Wed Jul 03 17:05:53.740 [conn10] end connection 10.28.45.204:54622 (19 connections now open)
 m31100| Wed Jul 03 17:05:53.740 [conn12] end connection 10.28.45.204:54639 (19 connections now open)
 m30999| Wed Jul 03 17:05:53.740 [conn1] SocketException: remote: 10.28.45.204:31100 error: 9001 socket exception [CLOSED] server [10.28.45.204:31100] 
 m30999| Wed Jul 03 17:05:53.740 [conn1] DBClientCursor::init lazy say() failed
 m30999| Wed Jul 03 17:05:53.740 [conn1] DBClientCursor::init message from say() was empty
 m30999| Wed Jul 03 17:05:53.740 [conn1] got not master for: AMAZONA-J7UBCUV:31100
 m30999| Wed Jul 03 17:05:53.740 [conn1] User Assertion: 15988:error querying server
 m30999| Wed Jul 03 17:05:53.740 [conn1] warning: exception when receiving data from test-rs0:test-rs0/AMAZONA-J7UBCUV:31100,AMAZONA-J7UBCUV:31101, current connection state is { state: { conn: "test-rs0/AMAZONA-J7UBCUV:31100,AMAZONA-J7UBCUV:31101", vinfo: "test.foo @ 122|3||51d48757ffcb34304e502bbe", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server

The solution may be the same as before: don't hold the lock for heartbeats, or otherwise do something about the recursive locks in M/R. Is it possible to push the heartbeat lock fix ASAP?
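A minimal sketch for watching this from the shell while the M/R runs (assuming a direct connection to the primary, m31100 in the logs above; rs.status() and sleep() are standard shell helpers, and the member fields shown are the usual rs.status() output -- this is an observation aid, not part of the test):

 // Poll replica set heartbeat/ping state once per second while the
 // long-running M/R is in progress, to see whether the members' views
 // of each other go stale.
 for (var i = 0; i < 60; i++) {
     rs.status().members.forEach(function (m) {
         print(m.name + "  " + m.stateStr +
               "  lastHeartbeat: " + m.lastHeartbeat +  // absent for the self member
               "  pingMs: " + m.pingMs);
     });
     sleep(1000);
 }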

Comment by Eric Milkie [ 10/Jul/13 ]

I'm not very familiar with the sharding code, so I can't really triage this beyond bumping it to Greg. It looks like the map/reduce got to the end and then failed, but I don't see why.
