[SERVER-6639] getLastError to Mongos gives an empty errmsg field if it fails Created: 30/Jul/12  Updated: 10/Dec/14  Resolved: 04/Apr/13

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 2.0.6, 2.2.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: mongos, replicaset, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File empty_error.js    
Issue Links:
Duplicate
duplicates SERVER-6581 mongos GLE needs to report errmsg whe... Closed
Related
related to SERVER-8984 DBClientInterface::getLastError shoul... Closed
Operating System: ALL
Participants:

 Description   

We were performing a series of restarts of config servers and shards. While some of the servers were rebooting, the mongos instances that used them occasionally returned the following error:

{ "ok" : 0.0, "errmsg" : "" }

Our auto-retry framework relies on a set of error codes and/or messages and can't cope with such errors. Emitting a more descriptive error message and assigning this kind of error an error code (even a generic one) would be very helpful, both for diagnosing the cause of the error and for attempting to recover from it.



 Comments   
Comment by Randolph Tan [ 08/Aug/12 ]

Bug reproduced - attaching the test file.

Comment by Aristarkh Zagorodnikov [ 08/Aug/12 ]

Sure. It was inserting an object into a GridFS chunks collection through mongos (via the MongoGridFS.Upload() method of the C# driver), so it was doing something like db.fs.chunks.insert({_id:ObjectId("..."), files_id:NumberLong(...), n:<number>, data:BinData(0,"...")});

Comment by Randolph Tan [ 07/Aug/12 ]

Hi,

I am having trouble reproducing this. Would you be able to give us more details on what the client was running (I assumed it was a command) when it got the empty errmsg?

Thanks!

Comment by Aristarkh Zagorodnikov [ 31/Jul/12 ]

No exact steps, but killing the shard replica set servers leads to this.
This is an excerpt from a mongos log (d2.s2.fs.drive.bru is one of the shard replica set servers, which had just gone down):
Wed Jul 25 10:44:07 [WriteBackListener-d2.s2.fs.drive.bru:27022] DBClientCursor::init call() failed
Wed Jul 25 10:44:07 [WriteBackListener-d2.s2.fs.drive.bru:27022] WriteBackListener exception : DBClientBase::findN: transport error:
d2.s2.fs.drive.bru:27022 query:

{ writebacklisten: ObjectId('500066fab28541bf1d929a63') }

Wed Jul 25 10:44:08 [WriteBackListener-d2.s2.fs.drive.bru:27022] Socket recv() errno:104 Connection reset by peer 10.3.2.13:27022
Wed Jul 25 10:44:08 [WriteBackListener-d2.s2.fs.drive.bru:27022] SocketException: remote: 10.3.2.13:27022 error: 9001 socket exception [1] server [10.3.2.13:27022]
Wed Jul 25 10:44:08 [WriteBackListener-d2.s2.fs.drive.bru:27022] DBClientCursor::init call() failed
Wed Jul 25 10:44:08 [WriteBackListener-d2.s2.fs.drive.bru:27022] WriteBackListener exception : DBClientBase::findN: transport error:
d2.s2.fs.drive.bru:27022 query:

{ writebacklisten: ObjectId('500066fab28541bf1d929a63') }

Wed Jul 25 10:44:10 [Balancer] Socket recv() errno:104 Connection reset by peer 10.3.2.13:27022
Wed Jul 25 10:44:10 [Balancer] SocketException: remote: 10.3.2.13:27022 error: 9001 socket exception [1] server [10.3.2.13:27022]
Wed Jul 25 10:44:10 [Balancer] DBClientCursor::init call() failed
Wed Jul 25 10:44:10 [Balancer] scoped connection to c1.fs.drive.bru:27020,c2.fs.drive.bru:27020,c3.fs.drive.bru:27020 not being returned to the pool
Wed Jul 25 10:44:10 [Balancer] caught exception while doing balance: DBClientBase::findN: transport error: d2.s2.fs.drive.bru:27022 query:

{ features: 1 }

Wed Jul 25 10:44:17 [ReplicaSetMonitorWatcher] Socket recv() errno:104 Connection reset by peer 10.3.2.13:27022
Wed Jul 25 10:44:17 [ReplicaSetMonitorWatcher] SocketException: remote: 10.3.2.13:27022 error: 9001 socket exception [1] server [10.3.2.13:27022]
Wed Jul 25 10:44:17 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Wed Jul 25 10:44:17 [ReplicaSetMonitorWatcher] trying reconnect to d2.s2.fs.drive.bru:27022
Wed Jul 25 10:44:17 [ReplicaSetMonitorWatcher] reconnect d2.s2.fs.drive.bru:27022 ok

The client got the "empty" error around 10:44:10.

Comment by Randolph Tan [ 30/Jul/12 ]

Hi,

Do you have the exact steps for reproducing this blank error message?

Comment by Aristarkh Zagorodnikov [ 30/Jul/12 ]

First, there is a batch of "can't initialize cursor across all shards" errors with socket exceptions, interleaved with the ReplicaSetMonitor complaining about there being no master in the replica set.
Then, there is a chunk of "stale config" errors right after the underlying replica set recovered (due to a large amount of queued writes landing at that very moment).

Comment by Aristarkh Zagorodnikov [ 30/Jul/12 ]

That would be too easy =)
Since there was a restart of config servers and shards, there are about a hundred records around this time in the log (+/- 30 secs, since time isn't synced between mongos and its client).

Comment by Daniel Pasette (Inactive) [ 30/Jul/12 ]

Is there a corresponding error in the mongos log?

Generated at Thu Feb 08 03:12:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.