[SERVER-1766] ERROR: splitIfShould failed: locking namespace failed Created: 09/Sep/10  Updated: 12/Jul/16  Resolved: 10/Sep/10

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.7.0
Fix Version/s: 1.7.1

Type: Bug Priority: Major - P3
Reporter: Alvin Richards (Inactive) Assignee: Alvin Richards (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

running the nightly

git version: d16ac9d54d9595710ad8288ccdd742d9242a6fc3


Issue Links:
Related
is related to SERVER-1521 yield lock during removeRange Closed
Operating System: ALL
Participants:

 Description   

Problem:

Running a bulk insert via a Java program into a 3-shard system. After about 30 minutes, the following errors appear in the log file for the router node (mongos):

Thu Sep 9 19:09:15 [conn8] autosplitting scaleout.blogs size: 125479578 shard: ns:scaleout.blogs at: replset0:replset0/10.204.33.94:27000 lastmod: 3|25 min: { ts: -539057490 } max: { ts: -2960867 } on: { ts: -271305523 } (splitThreshold 104857600)
Thu Sep 9 19:09:15 [conn8] ERROR: splitIfShould failed: locking namespace failed
Thu Sep 9 19:09:25 [conn6] autosplitting scaleout.blogs size: 125322369 shard: ns:scaleout.blogs at: replset0:replset0/10.204.33.94:27000 lastmod: 3|23 min: { ts: -1610580558 } max: { ts: -1076104534 } on: { ts: -1343209153 } (splitThreshold 104857600)
Thu Sep 9 19:09:25 [conn6] ERROR: splitIfShould failed: locking namespace failed
Thu Sep 9 19:22:21 [conn2] end connection 71.139.0.44:55312
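
For reference, both autosplit entries above report a chunk size larger than the split threshold of 104857600 bytes (100 MiB). The following is a minimal Java sketch of that comparison, assuming the log line format shown above. The class and method names are hypothetical; they are not part of MongoDB or of the original test program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AutosplitLogCheck {
    // Patterns assume the mongos log format seen in this ticket.
    private static final Pattern SIZE = Pattern.compile("size: (\\d+)");
    private static final Pattern THRESHOLD = Pattern.compile("splitThreshold (\\d+)");

    // Extracts the first numeric capture of the given pattern from a log line.
    public static long extract(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("field not found: " + pattern.pattern());
        }
        return Long.parseLong(m.group(1));
    }

    // Returns how many bytes the reported chunk size exceeds the split threshold by.
    public static long bytesOverThreshold(String line) {
        return extract(SIZE, line) - extract(THRESHOLD, line);
    }

    public static void main(String[] args) {
        String line = "Thu Sep 9 19:09:15 [conn8] autosplitting scaleout.blogs size: 125479578 "
                + "shard: ns:scaleout.blogs at: replset0:replset0/10.204.33.94:27000 lastmod: 3|25 "
                + "min: { ts: -539057490 } max: { ts: -2960867 } on: { ts: -271305523 } "
                + "(splitThreshold 104857600)";
        System.out.println("bytes over threshold: " + bytesOverThreshold(line));
    }
}
```

For the first entry the difference is 125479578 - 104857600 = 20621978 bytes.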

At the same time, my Java clients fail with:
Exception in thread "Thread-1" com.mongodb.MongoException$Network: can't call something
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:194)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:192)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:192)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:223)
at com.mongodb.DBCollection.findOne(DBCollection.java:486)
at com.mongodb.DBCollection.findOne(DBCollection.java:475)
at com.mongodb.DB.command(DB.java:137)
at com.mongodb.DB.getLastError(DB.java:283)
at InsertSpeed$Runner.run(InsertSpeed.java:64)
Caused by: java.io.IOException: couldn't connect to [/10.204.69.250:27500] bc:java.net.ConnectException: Connection timed out
at com.mongodb.DBPort._open(DBPort.java:150)
at com.mongodb.DBPort.go(DBPort.java:70)
at com.mongodb.DBPort.call(DBPort.java:56)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:186)
... 8 more

All mongod processes are still running on all machines.

Reproduce:
Not clear; a second run did not hit this problem.

Solution:
Need to understand why this error is occurring and what the user can do about it.

Business Case:
Reliability



 Comments   
Comment by Gilles Gagniard [ 24/Sep/10 ]

Just occurred on my 1.6.2 test shard. The mongos router was completely stuck for several hours after this "splitIfShould failed" error message. Killing and restarting it put the shard back in working order.

Comment by Che-Ching Wu [ 13/Sep/10 ]

We encountered this also in 1.6.1.

Comment by Alvin Richards (Inactive) [ 10/Sep/10 ]

Not seeing the same blocks with the following

db version v1.7.1-pre-, pdfile version 4.5
Fri Sep 10 17:29:41 git version: 6766569f9acdd80e27d957906f61dd7a14425d0d

Comment by Eliot Horowitz (Inactive) [ 09/Sep/10 ]

Did the remove yield lock change, so let's see if it happens again with that in place.

Comment by Alvin Richards (Inactive) [ 09/Sep/10 ]

Matias seems to think it's stuck on the last changelog entry for moveChunk:

> db.changelog.find().sort({time:-1})
{ "_id" : "ip-10-204-33-94-2010-09-09T19:10:36-12", "server" : "ip-10-204-33-94", "time" : "Thu Sep 09 2010 12:10:36 GMT-0700 (PDT)", "what" : "moveChunk", "ns" : "scaleout.blogs", "details" : { "min" : { "ts" : -1610580558 }, "max" : { "ts" : -1076104534 }, "from" : "replset0", "to" : "replset1" } }
{ "_id" : "ip-10-202-47-225-2010-09-09T19:10:36-3", "server" : "ip-10-202-47-225", "time" : "Thu Sep 09 2010 12:10:36 GMT-0700 (PDT)", "what" : "moveChunk.to", "ns" : "scaleout.blogs", "details" : { "step1" : 54, "step2" : 0, "step3" : 594920, "step4" : 16, "step5" : 103 } }

No entries since this point
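
To put the moveChunk.to entry above in perspective, here is a small Java sketch that totals the step values and picks out the largest one. It assumes the step values are durations in milliseconds, which the changelog output itself does not state. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MoveChunkSteps {
    // Step values copied from the moveChunk.to changelog entry in this ticket.
    public static Map<String, Long> ticketSteps() {
        Map<String, Long> steps = new LinkedHashMap<>();
        steps.put("step1", 54L);
        steps.put("step2", 0L);
        steps.put("step3", 594920L);
        steps.put("step4", 16L);
        steps.put("step5", 103L);
        return steps;
    }

    // Sums all step values.
    public static long total(Map<String, Long> steps) {
        long sum = 0;
        for (long value : steps.values()) {
            sum += value;
        }
        return sum;
    }

    // Returns the name of the step with the largest value.
    public static String slowest(Map<String, Long> steps) {
        String name = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : steps.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                name = entry.getKey();
            }
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, Long> steps = ticketSteps();
        System.out.println("total: " + total(steps) + ", slowest: " + slowest(steps));
    }
}
```

The steps sum to 595093, of which step3 accounts for 594920. If the unit is milliseconds, step3 took roughly 9.9 minutes.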

Comment by Alvin Richards (Inactive) [ 09/Sep/10 ]

Looks like the router is not accepting connections

vero:10gen$ ./software/mongodb-osx-x86_64-1.6.0/bin/mongo --port 27500 --host ec2-184-72-193-160.compute-1.amazonaws.com
MongoDB shell version: 1.6.0
connecting to: ec2-184-72-193-160.compute-1.amazonaws.com:27500/test
Thu Sep 9 12:48:21 Error: couldn't connect to server ec2-184-72-193-160.compute-1.amazonaws.com:27500} (anon):1139
exception: connect failed
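
The same check can be made without the mongo shell. Below is a minimal Java sketch of a TCP probe, using only the standard library. The class name, the default port, and the 5-second timeout are illustrative assumptions. A successful connect only shows that something is listening on the port, not that the router is processing requests.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    public static boolean accepts(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolved hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass the router host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 27500;
        System.out.println(host + ":" + port + " accepting connections: "
                + accepts(host, port, 5000));
    }
}
```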
