[SERVER-4532] GetLastError on sharded cluster can report incorrect result Created: 20/Dec/11  Updated: 11/Jul/16  Resolved: 12/Dec/12

Status: Closed
Project: Core Server
Component/s: Write Ops
Affects Version/s: 1.8.3, 2.2.0, 2.2.1
Fix Version/s: 2.2.3, 2.3.2

Type: Bug Priority: Major - P3
Reporter: Philipp Marx Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 3
Labels: triage, update
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

3 x shards: each shard is a 3-node replica set, and the first node of each shard also runs a config server (configd).
4 x application servers, each with a local mongos connected to the 3 config servers above.

Everything is running on AWS EC2 Linux Large instances. The application servers use the Java driver 2.7.2.


Attachments: File noUpdateButN1.js     File noUpdateButN1inAnotherCollection.js     File shTest.js    
Issue Links:
Depends
Duplicate
is duplicated by SERVER-7580 PHP getLastError wrong when Sharding Closed
is duplicated by SERVER-7885 First update on just migrated documen... Closed
is duplicated by SERVER-7109 Errors in Writeback Listener is leaki... Closed
Related
related to PYTHON-442 GridFS object creation causes index e... Closed
related to SERVER-7958 GLE on sharded cluster can return pre... Closed
related to SERVER-8097 Inserts don't increment writebacksSince Closed
is related to SERVER-7888 E11000 duplicate key error index whi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Steps:

1. Insert an object into a shard (succeeds).
2. Update the object by _id: WriteResult writeResult = dbCollection.update(query, update, false, false, WriteConcern.SAFE);
query looks like: {_id : XYZ}
update looks like: {$set : {foo : bar, foo2 : bar2}}
3. The object is updated, but the returned WriteResult reports "n":0 together with "updatedExisting":true.

WriteResult:

{ 
   writeResult =   { 
      "serverUsed":"127.0.0.1:27017/myDB", 
      "singleShard":"ReplSet3/ec2-xxxx.compute-1.amazonaws.com:27018,ec2-xxxx.compute-1.amazonaws.com:27018,ec2-xxxx.compute-1.amazonaws.com:27018", 
=>    "n":0, 
      "lastOp":5686019109100191745, 
      "connectionId":1524, 
      "err":null, 
      "ok":1.0, 
=>    "updatedExisting":true, 
      "wtime":0, 
      "writebackGLE": { 
         "singleShard":"ReplSet3/ec2-xxxx.compute-1.amazonaws.com:27018,ec2-xxxx.compute-1.amazonaws.com:27018,ec2-xxxx.compute-1.amazonaws.com:27018", 
         "n":0, 
         "lastOp":5686019109100191745, 
         "connectionId":1524, 
         "err":null, 
         "ok":1.0 
      }, 
      "initialGLEHost":"ReplSet1/ec2-xxxx.compute-1.amazonaws.com:27018,ec2-xxxx.compute-1.amazonaws.com:27018,ec2-xxxx.compute-1.amazonaws.com:27018" 
   } 
}

This is very hard to reproduce because it occurs only intermittently. Maybe somebody has a vague idea of the exact situation in which this might occur. One thing to note is that steps 1 and 2 are executed in very quick succession.
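
The contradiction marked with => in the WriteResult (n is 0 while updatedExisting is true) can at least be detected on the client side. The following is a minimal sketch, assuming the getLastError reply is available as a plain object. The field names err, n and updatedExisting are taken from the reply shown here; the function name classifyUpdateResult and the labels it returns are hypothetical and not part of any driver.

```javascript
// Hypothetical helper: classifies a getLastError (GLE) reply for an update.
// Only the field names (err, n, updatedExisting) come from the reply in this
// report; the function name and the returned labels are illustrative.
function classifyUpdateResult(gle) {
  // Any non-null err means the write itself failed.
  if (gle.err !== null && gle.err !== undefined) {
    return "error";
  }

  const n = typeof gle.n === "number" ? gle.n : 0;
  const updatedExisting = gle.updatedExisting === true;

  if (n > 0) {
    return "updated";
  }
  if (updatedExisting) {
    // n is 0 although the flag says an existing document was updated.
    return "inconsistent";
  }
  return "no-match";
}

// The reply from this report, reduced to the relevant fields.
const reported = { n: 0, err: null, ok: 1.0, updatedExisting: true };
console.log(classifyUpdateResult(reported)); // prints "inconsistent"
```

Note that a reply with n 0 and no updatedExisting field at all is classified as "no-match", so this check only catches replies in which the two fields disagree. What to do with an "inconsistent" result (log it, re-read the document, retry) is left to the application.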

I created a thread in the forum: http://groups.google.com/group/mongodb-user/browse_thread/thread/d1ca7e977dc7455b/973cab27e21df385?lnk=gst&q=smigfu#973cab27e21df385



 Comments   
Comment by auto [ 14/Dec/12 ]

Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-12T05:18:43Z

Message: SERVER-4532 can't call ClientInfo::addShard on things you don't really use

Conflicts:

src/mongo/s/s_only.cpp
Branch: v2.2
https://github.com/mongodb/mongo/commit/d03939e182aec740fb83a65400ee02ce572752ff

Comment by auto [ 12/Dec/12 ]

Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-12T05:18:43Z

Message: SERVER-4532 can't call ClientInfo::addShard on things you don't really use
Branch: master
https://github.com/mongodb/mongo/commit/587fde4994df19665404160ff1e399dd5af9d6b0

Comment by Asya Kamsky [ 09/Dec/12 ]

Okay, I just ran the same test with 2.3.2-pre (today's nightly) and it also fails.

Ignore the previous comment saying that it's been fixed...

Comment by Asya Kamsky [ 09/Dec/12 ]

SERVER-7885, which I opened, seems like a sub-case of this one. Here's a way to get n=0 on a successful update in the shell with two shards and two mongos processes.

In one mongos process, run moveChunk({_id: 1001}, "othershard");
Then, immediately afterwards, run the following against the other mongos:

$ mongo localhost:57017/test2 --eval 'db.coll.update({_id:1001},{$set:{"a":"999"}});printjson(db.getLastErrorObj())'
MongoDB shell version: 2.2.1
connecting to: localhost:57017/test2
{
   [...]
	"n" : 0,
	"err" : null,
	"ok" : 1
}

n is returned as 0 and updatedExisting is not set at all.
In the logs you can see that the correct result is returned from mongod.
Last entry from the WriteBackListener:
[WriteBackListener-localhost:37017] GLE is { singleShard: "asya/localhost:27017,localhost:27018,localhost:27019", updatedExisting: true, n: 1, lastOp: Timestamp 1355026437000|1, connectionId: 32370, err: null, ok: 1.0 }

Comment by Kay Agahd [ 06/Nov/12 ]

Please reopen this issue since we can reproduce it with version 2.2.0.

The WriteResult obtained by the mongo driver v2.9.3 is as follows:

{ "serverUsed" : "sx176.ipx/x.x.x.x:27018" , 
"singleShard" : "offerStoreDE2/s127:27018,s131:27018,s136:27018" , 
"n" : 0 , 
"lastOp" : { "$ts" : 1352210785 , "$inc" : 182} , 
"connectionId" : 2409830 , 
"err" :  null  , 
"ok" : 1.0 , 
"writeback" : { "$oid" : "50991961000000000005dd5f"} , 
"instanceIdent" : "s136:27018" , 
"updatedExisting" : true , 
"wtime" : 0 , 
"writebackGLE" : {
  "singleShard" : "offerStoreDE2/s127:27018,s131:27018,s136:27018" , 
  "n" : 0 , 
  "lastOp" : { "$ts" : 1352210785 , "$inc" : 182} , 
  "connectionId" : 2409830 , 
  "err" :  null  , 
  "ok" : 1.0} , 
"initialGLEHost" : "offerStoreDE2/s127:27018,s131:27018,s136:27018"}

We are sure that the update was successful because the new values were written to the DB.

We observed that the error occurs when the balancer is turned on. When we turn the balancer off, the errors continue, BUT when we restart the router (mongos) while keeping the balancer off, the errors no longer occur.
However, we can't keep the balancer off since our shards would become unbalanced.

Please advise.

Comment by Philipp Marx [ 20/Dec/11 ]

So is it safe to assume that if "updatedExisting" is "true", an update has been performed?

The reason I am asking is that we want to check whether the update has been applied successfully, and currently we are checking the n-value.

Comment by Eliot Horowitz (Inactive) [ 20/Dec/11 ]

This was fixed in 2.0.
As you said, it was very rare but possible in 1.8.
