[SERVER-1719] mongos process died with splitIfShould failed: medianKey command failed Created: 01/Sep/10  Updated: 30/Mar/12  Resolved: 13/Sep/10

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.6.1
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Alvin Richards (Inactive) Assignee: Eliot Horowitz (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ec2 / ubuntu


Operating System: Linux
Participants:

 Description   

Problem:
Running a multi-shard system (2 shards, 2 members). After loading data, I wanted to drop the daya in the collection

> use scaleout
switched to db scaleout
> db.blogs.drop()
true

When I tries to re-connect to the mongos I got the following

vero:scripts$ ../../software/mongodb-osx-x86_64-1.6.0/bin/mongo --port 27500 -host ec2-184-73-98-241.compute-1.amazonaws.com
MongoDB shell version: 1.6.0
connecting to: ec2-184-73-98-241.compute-1.amazonaws.com:27500/test
Wed Sep 1 14:10:32 Error: couldn't connect to server ec2-184-73-98-241.compute-1.amazonaws.com:27500} (anon):1139
exception: connect failed

Looking at the node running mongos, the process was no longer running. I saw the following in the logs

ed Sep 1 21:01:10 [WriteBackListener] config change: { _id: "ip-10-196-174-191-2010-09-01T21:01:10-25", server: "ip-10-196-174-191", time: new Date(1283374870915), what: "split", ns: "scaleout.blogs", details: { before: { min:

{ ts: 26140129 }

, max:

{ ts: 30278795 }

}, left: { min:

{ ts: 26140129 }

, max:

{ ts: 27936982 }

}, right: { min:

{ ts: 27936982 }

, max:

{ ts: 30278795 }

} } }
Wed Sep 1 21:04:23 [conn2] DROP: scaleout.blogs
Wed Sep 1 21:04:23 [conn2] config change: { _id: "ip-10-196-174-191-2010-09-01T21:04:23-26", server: "ip-10-196-174-191", time: new Date(1283375063464), what: "dropCollection.start", ns: "scaleout.blogs", details: {} }
Wed Sep 1 21:06:01 [conn2] config change: { _id: "ip-10-196-174-191-2010-09-01T21:06:01-27", server: "ip-10-196-174-191", time: new Date(1283375161378), what: "dropCollection", ns: "scaleout.blogs", details: {} }
Wed Sep 1 21:06:01 [conn8] ChunkManager: couldn't find chunk for:

{ ts: 26631861 }

going to retry
Wed Sep 1 21:06:01 [conn8] DBException in process: couldn't find a chunk aftry retry which should be impossible extracted:

{ ts: 26631861 }

Wed Sep 1 21:06:01 [WriteBackListener] ~ScopedDBConnection: _conn != null
Wed Sep 1 21:06:01 [WriteBackListener] ERROR: splitIfShould failed: medianKey command failed: { errmsg: "no index found for specified keyPattern:

{ ts: 1.0 }

", ok: 0.0 }

Log from shard000
Wed Sep 1 20:57:31 [conn3] query admin.$cmd ntoreturn:1 command:

{ serverStatus: 1 }

reslen:856 109ms
Wed Sep 1 21:07:11 [rs_sync] CMD: drop scaleout.blogs
Wed Sep 1 21:07:11 [rs_sync] building new index on

{ _id: 1 }

for scaleout.blogs
Wed Sep 1 21:07:11 [rs_sync] Buildindex scaleout.blogs idxNo:0 { name: "id", ns: "scaleout.blogs", key:

{ _id: 1 }

}
Wed Sep 1 21:07:11 [rs_sync] done for 0 records 0secs
Wed Sep 1 21:10:41 [conn3] end connection 127.0.0.1:42059

Log from shard001
Wed Sep 1 20:50:17 [conn5] query admin.$cmd ntoreturn:1 command:

{ serverStatus: 1 }

reslen:856 443ms
Wed Sep 1 20:52:17 [conn5] query admin.$cmd ntoreturn:1 command:

{ serverStatus: 1 }

reslen:856 121ms
Wed Sep 1 20:58:35 allocating new datafile /var/lib/mongodb/replset1/scaleout.7, filling with zeroes...
Wed Sep 1 20:58:35 done allocating datafile /var/lib/mongodb/replset1/scaleout.7, size: 2047MB, took 0.01 secs
Wed Sep 1 20:59:17 [conn5] query admin.$cmd ntoreturn:1 command:

{ serverStatus: 1 }

reslen:856 126ms
Wed Sep 1 21:05:59 [rs_sync] CMD: drop scaleout.blogs
Wed Sep 1 21:11:14 [conn5] end connection 127.0.0.1:43086

Reproduce:
This is what I did
– loaded both shards with data
– performed the db.blogs.drop()

Does not look like an out of disk space problem.

Workaround:
Restart mongos

Business Case:
Reliability



 Comments   
Comment by Eliot Horowitz (Inactive) [ 13/Sep/10 ]

The mechanism changed a bit - so finding this exact is going to be impossible.
Added more tests for new mechanism

Comment by Eliot Horowitz (Inactive) [ 02/Sep/10 ]

Its probably a concurrency issue with the order of doing a drop

Comment by Alvin Richards (Inactive) [ 02/Sep/10 ]

I tried a basic repro and it did not repro. Any clues from the code that make indicate what tickled this?

Comment by Eliot Horowitz (Inactive) [ 02/Sep/10 ]

Can you make a js test?

Generated at Thu Feb 08 02:57:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.