[SERVER-4568] One shard down at run time then insertion on both shard stop by mongos Created: 28/Dec/11  Updated: 11/Jul/16  Resolved: 10/Feb/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: jitendra Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux debian


Operating System: Linux
Participants:

 Description   

The MongoDB server setup is as follows.
Config server (1):
./mongod --bind_ip 192.168.50.75 --port 30000 --dbpath /sata1/configdb --configsvr --quiet --logpath /usr/local/ct/depend/mongo/logs/mongod_30000.log --logappend --journalCommitInterval 2

Shard servers: there are two shard servers. We generate shard keys 1 to 10; shard keys 1 to 5 go to shard1 and shard keys 6 to 10 go to shard2.

shard1: ./mongod --bind_ip 192.168.50.75 --port 20000 --dbpath /sata1/master --shardsvr --quiet --logpath /usr/local/ct/depend/mongo/logs/mongod_20000.log --logappend --journalCommitInterval 2
shard2: ./mongod --bind_ip 192.168.50.75 --port 25000 --dbpath /sata2/master --shardsvr --quiet --logpath /usr/local/ct/depend/mongo/logs/mongod_25000.log --logappend --journalCommitInterval 2

mongos (1):
./mongos --bind_ip 192.168.50.75 --port 35000 --configdb 192.168.50.75:30000 --quiet --logpath /usr/local/ct/depend/mongo/logs/mongos_35000.log --logappend

Now connect to mongos:
./mongo 192.168.50.75:35000
Then, at run time, I stop shard2 and run:

mongos> db.Database.insert({_sk:2})
Wed Dec 28 20:22:19 uncaught exception: error

{ "$err" : "socket exception", "code" : 11002 }

mongos> db.Database.insert({_sk:7})
socket exception

I got an exception for _sk:7, which is fine, but I also got an exception for _sk:2, which is not OK, because shard key 2 goes to shard1, which is running.

Regards,
Jitendra Verma



 Comments   
Comment by Scott Hernandez (Inactive) [ 10/Feb/12 ]

Please reopen if this happens again with the new builds.

Comment by Scott Hernandez (Inactive) [ 30/Dec/11 ]

The unit tests run against all versions of mongodb built; I'd have to look at when the test was created but probably before 1.8.0

It is possible the getLastError message is something that is fixed in 2.1.X so you may want to test a nightly dev build to check.

Comment by jitendra [ 30/Dec/11 ]

Can you tell me which version you tested on?

Comment by jitendra [ 30/Dec/11 ]

We run MongoDB with 2 shards. When both shards are up and running, our driver, going through mongos,
inserts the objects properly and gets proper codes from getLastError.
When one of the shards (Shard1) is down (the mongod process has crashed), mongos
starts returning a socket exception (with code *) for both shards, even though
it keeps inserting objects on the other shard (shard2).

Comment by jitendra [ 30/Dec/11 ]

Can you reply to my previous comment?

Comment by Scott Hernandez (Inactive) [ 30/Dec/11 ]

So far everything you have posted seems to be consistent with being able to write data to the active shards but getting errors on the ones which are down. This is the expected behavior.

We have tests which verify this and we cannot reproduce any problems.

Please provide more information if you feel something is wrong...
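
The expected behavior described in this comment can be sketched as a toy model. This is not mongos code: the function names (targetShard, simulateInsert) are hypothetical, the key split follows the reporter's description (1 to 5 on shard0000, 6 to 10 on shard0001), and the error text and code are copied from the shell output in the ticket.

```javascript
// Toy model of the expected behavior, not mongos code.
// Assumption: keys below 6 live on shard0000, the rest on shard0001.
function targetShard(sk) {
  return sk < 6 ? "shard0000" : "shard0001";
}

// Hypothetical helper: a write fails only when the shard owning the key is down.
function simulateInsert(doc, stoppedShards) {
  const shard = targetShard(doc._sk);
  if (stoppedShards.has(shard)) {
    return { ok: false, shard: shard, err: "socket exception", code: 11002 };
  }
  return { ok: true, shard: shard };
}

const stoppedShards = new Set(["shard0001"]); // shard2 (port 25000) stopped

console.log(simulateInsert({ _sk: 2 }, stoppedShards)); // succeeds on shard0000
console.log(simulateInsert({ _sk: 7 }, stoppedShards)); // fails: shard0001 is down
```

Under this model, an error on _sk:2 while only shard0001 is down would be the unexpected case.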

Comment by jitendra [ 30/Dec/11 ]

I want to ask one thing: if one shard is down and I insert into the other shard and then call getLastError(), it returns the error "socket exception".

Comment by jitendra [ 29/Dec/11 ]

Hi, please reply.

Comment by jitendra [ 29/Dec/11 ]

Here are the mongos logs from an insert with _sk:2. I hope this makes it clearer.

Thu Dec 29 14:37:42 [conn2] Request::process ns: 00291211.Database msg id:131 attempt: 0
Thu Dec 29 14:37:42 [conn2] write: 00291211.Database
Thu Dec 29 14:37:42 [conn2] server:shard0000:192.168.50.171:20000 { _id: ObjectId('4efc2ddead252b524357758f'), _sk: 2.0 }
Thu Dec 29 14:37:42 [conn2] Request::process ns: 00291211.$cmd msg id:132 attempt: 0
Thu Dec 29 14:37:42 [conn2] single query: 00291211.$cmd { getlasterror: 1.0, w: 1.0 } ntoreturn: -1 options : 0
Thu Dec 29 14:37:42 [conn2] creating new connection to:192.168.50.171:25000
Thu Dec 29 14:37:42 BackgroundJob starting: ConnectBG
Thu Dec 29 14:37:42 [conn2] DBException in process: socket exception
Thu Dec 29 14:37:42 [conn2] Request::process ns: admin.$cmd msg id:133 attempt: 0
Thu Dec 29 14:37:42 [conn2] single query: admin.$cmd { replSetGetStatus: 1, forShell: 1 } ntoreturn: 1 options : 0
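
One way to read the log above is to pull out which hosts mongos touched. The sketch below is only an illustration: hostsTouched is a hypothetical helper, the regular expressions are guesses fitted to the lines in this ticket rather than to any documented log format, and the sample lines are copied from the log (with the document and command text joined back onto their log lines).

```javascript
// Sample lines copied from the mongos log in this ticket.
const logLines = [
  "Thu Dec 29 14:37:42 [conn2] write: 00291211.Database",
  "Thu Dec 29 14:37:42 [conn2] server:shard0000:192.168.50.171:20000 { _id: ObjectId('4efc2ddead252b524357758f'), _sk: 2.0 }",
  "Thu Dec 29 14:37:42 [conn2] single query: 00291211.$cmd { getlasterror: 1.0, w: 1.0 } ntoreturn: -1 options : 0",
  "Thu Dec 29 14:37:42 [conn2] creating new connection to:192.168.50.171:25000",
  "Thu Dec 29 14:37:42 [conn2] DBException in process: socket exception",
];

// Hypothetical helper: collect the shard each write was sent to and the
// hosts that mongos opened new connections to.
function hostsTouched(lines) {
  const writeTargets = [];
  const newConnections = [];
  for (const line of lines) {
    const write = line.match(/server:(\w+):([\d.]+:\d+)/);
    if (write) writeTargets.push({ shard: write[1], host: write[2] });
    const conn = line.match(/creating new connection to:([\d.]+:\d+)/);
    if (conn) newConnections.push(conn[1]);
  }
  return { writeTargets: writeTargets, newConnections: newConnections };
}

console.log(hostsTouched(logLines));
```

On these lines the write target is shard0000 on port 20000, while the new connection made after the getlasterror request goes to port 25000, and the socket exception is logged right after that connection attempt.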

Comment by jitendra [ 29/Dec/11 ]

One shard is down and the other shard is up.

An insert on the other shard gives an error:

DBClientBase::findN: transport error: 192.168.50.171:25000 query: { setShardVersion: "00291211.Database", configdb: "192.168.50.171:30000", version: Timestamp 6000|0, serverID: ObjectId('4ef24915416fb0a9b89716f8'), shard: "shard0001", shardHost: "192.168.50.171:25000" }

Comment by Eliot Horowitz (Inactive) [ 29/Dec/11 ]

I don't understand what you mean.

All shards are healthy?

Comment by jitendra [ 29/Dec/11 ]

The shard is up. Can you verify this on your side?

Comment by Eliot Horowitz (Inactive) [ 29/Dec/11 ]

It means it tried to do a read or a write but failed because of a socket error,
i.e. a shard was down.

Comment by jitendra [ 29/Dec/11 ]

"uncaught exception: error { "$err" : "socket exception", "code" : 11002 }"

What does it mean?

Comment by Eliot Horowitz (Inactive) [ 29/Dec/11 ]

If one shard is down, the writes to that shard will fail.
Writes to other shards will succeed.

Comment by jitendra [ 29/Dec/11 ]

Hi,
does this mean an insert on the other shard should succeed, or fail?

Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ]

If one shard is down - then any query that needs that shard will fail unless you have the partial flag set.
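
The rule in this comment can be sketched as a toy model. This is not mongos code: scatterQuery, shardData and the options.partial field are hypothetical names standing in for the partial flag, and the error text only loosely mirrors the message seen elsewhere in this ticket.

```javascript
// Toy model of the rule above, not mongos code.
// shardData (hypothetical) maps each shard name to the documents it holds.
function scatterQuery(shardData, downShards, options = {}) {
  const results = [];
  for (const [shard, docs] of Object.entries(shardData)) {
    if (downShards.has(shard)) {
      if (!options.partial) {
        // Default: one unreachable shard fails the whole query.
        throw new Error("error querying server: " + shard);
      }
      continue; // partial set: skip the unreachable shard
    }
    results.push(...docs);
  }
  return results;
}

const shardData = {
  shard0000: [{ _sk: 2 }],
  shard0001: [{ _sk: 7 }],
};
const downShards = new Set(["shard0001"]);

console.log(scatterQuery(shardData, downShards, { partial: true })); // [ { _sk: 2 } ]
try {
  scatterQuery(shardData, downShards);
} catch (e) {
  console.log(e.message); // error querying server: shard0001
}
```

The model covers only queries that need every shard; a query targeted at a single healthy shard would not touch the down shard at all.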

Comment by jitendra [ 28/Dec/11 ]

Can you tell me what the mongos behavior is when, at run time, one shard goes down
and the other keeps running?

Comment by jitendra [ 28/Dec/11 ]

One more thing I want to add: if shard0001 is already down when the client connects to mongos, then an insert with _sk:2 works,
but if the shard goes down at run time, it gives an error. Can you try this on your system?

Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

Can you increase the logging level to see if you get more information about the issue on the client connection?

http://www.mongodb.org/display/DOCS/setParameter+Command

Comment by jitendra [ 28/Dec/11 ]

The log looks like this:

[WriteBackListener-192.168.50.75:25000] WriteBackListener exception : socket exception
[WriteBackListener-192.168.50.75:25000] WriteBackListener exception : socket exception
[WriteBackListener-192.168.50.75:25000] WriteBackListener exception : socket exception
[WriteBackListener-192.168.50.75:25000] WriteBackListener exception : socket exception
[conn11] DBException in process: socket exception
[conn11] DBException in process: socket exception

Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

Can you please post the mongos logs when you try to insert with _sk:2

That explain doesn't look like it is being run against a sharded collection; there is no shards node and it is using a BasicCursor which does not exist on mongos.

Comment by jitendra [ 28/Dec/11 ]

Hi, please reply. I am waiting for your reply.

Comment by jitendra [ 28/Dec/11 ]

Hi,
is anything missing?

Comment by jitendra [ 28/Dec/11 ]

shard0000 is running, but when I try to insert with _sk:2, it gives this error:

uncaught exception: error { "$err" : "socket exception", "code" : 11002 }

Is this correct?

db.Database.find({_sk:2}).explain()
{
    "cursor" : "BasicCursor",
    "nscanned" : 0,
    "nscannedObjects" : 0,
    "n" : 0,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : { }
}

Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

Okay, maybe I'm misunderstanding but the explain and chunks shows that shard0001 (port 25000) is being used for _sk:7; the error also indicates this when you try to insert that is the shard used, and you state that you are taking down shard2 (which is on port 25000) so it is completely correct to get that error.

Am I missing something here?

Maybe we should take a look at _sk:2 to see if it is different.

Comment by jitendra [ 28/Dec/11 ]

db.Database.stats()
{
    "sharded" : true,
    "flags" : 1,
    "ns" : "00281211.Database",
    "count" : 6,
    "numExtents" : 36,
    "size" : 216,
    "storageSize" : 1027227648,
    "totalIndexSize" : 32704,
    "indexSizes" : { "_id_" : 16352, "_sk_1" : 16352 },
    "avgObjSize" : 36,
    "nindexes" : 2,
    "nchunks" : 11,
    "shards" : {
        "shard0000" : {
            "ns" : "00281211.Database",
            "count" : 5,
            "size" : 180,
            "avgObjSize" : 36,
            "storageSize" : 779964416,
            "numExtents" : 21,
            "nindexes" : 2,
            "lastExtentSize" : 133492736,
            "paddingFactor" : 1,
            "flags" : 1,
            "totalIndexSize" : 16352,
            "indexSizes" : { "_id_" : 8176, "_sk_1" : 8176 },
            "ok" : 1
        },
        "shard0001" : {
            "ns" : "00281211.Database",
            "count" : 1,
            "size" : 36,
            "avgObjSize" : 36,
            "storageSize" : 247263232,
            "numExtents" : 15,
            "nindexes" : 2,
            "lastExtentSize" : 44699648,
            "paddingFactor" : 1,
            "flags" : 1,
            "totalIndexSize" : 16352,
            "indexSizes" : { "_id_" : 8176, "_sk_1" : 8176 },
            "ok" : 1
        }
    },
    "ok" : 1
}

db.shards.find()
{ "_id" : "shard0000", "host" : "192.168.50.75:20000" }
{ "_id" : "shard0001", "host" : "192.168.50.75:25000" }

Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

Which is shard2? Is that shard0000?

Can you also post the stats for that collection? db.Database.stats()
Also, from the config db can you post db.shards.find() please?

Comment by jitendra [ 28/Dec/11 ]

After taking shard2 down:

db.Database.find({_sk:7}).explain()
Wed Dec 28 21:29:15 uncaught exception: error: {
"$err" : "could not initialize cursor across all shards because : error querying server: 192.168.50.75:25000",
"code" : 14827
}

Comment by jitendra [ 28/Dec/11 ]

db.Database.find({_sk:7}).explain()
{
    "clusteredType" : "ParallelSort",
    "shards" : {
        "192.168.50.75:25000" : [
            {
                "cursor" : "BtreeCursor _sk_1",
                "nscanned" : 1,
                "nscannedObjects" : 1,
                "n" : 1,
                "millis" : 0,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "isMultiKey" : false,
                "indexOnly" : false,
                "indexBounds" : { "_sk" : [ [ 7, 7 ] ] }
            }
        ]
    },
    "n" : 1,
    "nChunkSkips" : 0,
    "nYields" : 0,
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "millisTotal" : 0,
    "millisAvg" : 0,
    "numQueries" : 1,
    "numShards" : 1
}

Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

Please attach the explain output.

Comment by jitendra [ 28/Dec/11 ]

Dec 28 2011 03:45:58 PM is the system local time when the error occurred. The database was correct.

Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

It looks like you may be testing on the wrong database. Please return the explain I mentioned in the previous comment.

Comment by jitendra [ 28/Dec/11 ]

Here is the sharding status, where _sk is the shard key. The chunks containing _sk 1 to 5 were moved to shard0000 and the chunks containing _sk 6 to 10 were moved to shard0001; automatic chunk balancing is off.

{ "_id" : "00281211", "partitioned" : true, "primary" : "shard0000" }
    00281211.Database chunks:
        shard0000 6
        shard0001 5
        { "_sk" : { $minKey : 1 } } -->> { "_sk" : 1 } on : shard0000 { "t" : 6000, "i" : 1 }
        { "_sk" : 1 } -->> { "_sk" : 2 } on : shard0000 { "t" : 1000, "i" : 3 }
        { "_sk" : 2 } -->> { "_sk" : 3 } on : shard0000 { "t" : 1000, "i" : 5 }
        { "_sk" : 3 } -->> { "_sk" : 4 } on : shard0000 { "t" : 1000, "i" : 7 }
        { "_sk" : 4 } -->> { "_sk" : 5 } on : shard0000 { "t" : 1000, "i" : 9 }
        { "_sk" : 5 } -->> { "_sk" : 6 } on : shard0000 { "t" : 1000, "i" : 11 }
        { "_sk" : 6 } -->> { "_sk" : 7 } on : shard0001 { "t" : 2000, "i" : 0 }
        { "_sk" : 7 } -->> { "_sk" : 8 } on : shard0001 { "t" : 3000, "i" : 0 }
        { "_sk" : 8 } -->> { "_sk" : 9 } on : shard0001 { "t" : 4000, "i" : 0 }
        { "_sk" : 9 } -->> { "_sk" : 10 } on : shard0001 { "t" : 5000, "i" : 0 }
        { "_sk" : 10 } -->> { "_sk" : { $maxKey : 1 } } on : shard0001 { "t" : 6000, "i" : 0 }
{ "_id" : "0027711211", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "00301211", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "test", "partitioned" : false, "primary" : "shard0000" }
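
The chunk listing above can be turned into a small lookup to check which shard owns a given _sk. This is a sketch, not how mongos routes internally: shardForKey is a hypothetical helper, each chunk is treated as the half-open range [min, max), and -Infinity and Infinity stand in for $minKey and $maxKey, which only works for numeric keys.

```javascript
// Chunk ranges copied from the sharding status in this comment.
// Each chunk covers [min, max): min is inclusive, max is exclusive.
const chunks = [
  { min: -Infinity, max: 1, shard: "shard0000" },
  { min: 1, max: 2, shard: "shard0000" },
  { min: 2, max: 3, shard: "shard0000" },
  { min: 3, max: 4, shard: "shard0000" },
  { min: 4, max: 5, shard: "shard0000" },
  { min: 5, max: 6, shard: "shard0000" },
  { min: 6, max: 7, shard: "shard0001" },
  { min: 7, max: 8, shard: "shard0001" },
  { min: 8, max: 9, shard: "shard0001" },
  { min: 9, max: 10, shard: "shard0001" },
  { min: 10, max: Infinity, shard: "shard0001" },
];

// Hypothetical helper: return the shard whose chunk contains the given _sk.
function shardForKey(sk) {
  const chunk = chunks.find((c) => sk >= c.min && sk < c.max);
  if (!chunk) throw new Error("no chunk covers _sk " + sk);
  return chunk.shard;
}

console.log(shardForKey(2)); // shard0000
console.log(shardForKey(7)); // shard0001
```

With this table, _sk:2 belongs to shard0000 and _sk:7 to shard0001, which matches the split the reporter describes.
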
Comment by Scott Hernandez (Inactive) [ 28/Dec/11 ]

Please attach the chunk information (best to take a dump of the config db).

You could also provide an explain [find({_sk:7}).explain()] for those _sk values.

Generated at Thu Feb 08 03:06:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.