[SERVER-8872] error 13388 shard version not ok in Client::Context Created: 06/Mar/13  Updated: 10/Dec/14  Resolved: 15/Aug/13

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 64 Bit


Attachments: File failed_dd_master.log     File ok_master.log     File ver_not_ok.js    
Issue Links:
Duplicate
is duplicated by SERVER-5752 query yielding with frequent migratio... Closed
Related
is related to SERVER-4185 Assertion 13388: shard version not ok... Closed
Operating System: Linux
Steps To Reproduce:

mongos> db.adminCommand( "flushRouterConfig" )
{ "flushed" : true, "ok" : 1 }
mongos> db.offer.count({"smallPicture.0":{$exists:true}})
Wed Mar  6 18:03:31 uncaught exception: count failed: {
    "shards" : {
         
    },
    "cause" : {
        "errmsg" : "13388 [offerStore.offer] shard version not ok in Client::Context: version mismatch detected for offerStore.offer, stored major version 5788 does not match received 5787 ( ns : offerStore.offer, received : 5787|0||000000000000000000000000, wanted : 5788|0||000000000000000000000000, send )",
        "ok" : 0
    },
    "ok" : 0,
    "errmsg" : "failed on : offerStoreDE2"
}

Participants:

 Description   

When we run a slow query, we encounter the "13388 shard version not ok in Client::Context" error even when we flush the router config just before sending the query.

Does this mean that MongoDB can't execute a long-running query because the config changed in the meantime? How should we cope with this?
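One common client-side way to cope (a sketch, not something prescribed in this ticket) is to retry the operation when it fails with the stale-shard-version error, since the router refreshes its chunk metadata on retry. The helper and the `runCount` stand-in below are hypothetical illustrations, not driver API:

```javascript
// Sketch: retry an operation that may fail with a stale-shard-version
// error (code 13388). `op` is any function that may throw; the helper
// retries up to `maxRetries` times before rethrowing.
function retryOnStaleShardVersion(op, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return op();
    } catch (err) {
      const stale = err.code === 13388 ||
        /shard version not ok/.test(String(err.message));
      if (!stale || attempt >= maxRetries) throw err;
      // On retry the router re-fetches chunk metadata, so a transient
      // version mismatch caused by a migration usually clears.
    }
  }
}

// Demo: a stand-in operation that fails once with 13388, then succeeds.
let calls = 0;
function runCount() {
  calls++;
  if (calls === 1) {
    const e = new Error("13388 shard version not ok in Client::Context");
    e.code = 13388;
    throw e;
  }
  return 42;
}

console.log(retryOnStaleShardVersion(runCount)); // prints 42
```

This only papers over transient mismatches; it does not help if migrations keep interrupting a query that runs longer than the retry budget.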



 Comments   
Comment by Kay Agahd [ 14/Jun/14 ]

I already do so, thank you.

Comment by Daniel Pasette (Inactive) [ 14/Jun/14 ]

The issue is only resolved because it is a duplicate of SERVER-5752. You should watch for SERVER-5752 instead for progress on this issue.

Comment by Kay Agahd [ 14/Jun/14 ]

Why is the status of this ticket "resolved" when SERVER-5752, the ticket that would address this problem, is still open and unresolved?
Btw, we are running the latest version (2.6.1) on all our MongoDB servers and are still seeing this error.

Comment by Randolph Tan [ 02/Aug/13 ]

This assert happens when the connection had established the correct shardVersion, but the shardVersion was then bumped by a migration. Slow queries are susceptible to this error because the check is performed every time we reacquire the lock after a yield. I have also linked a related ticket (SERVER-5752) which would address this problem.
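A heavily simplified model of the mechanism described above (illustrative only, not server code): a long scan yields periodically, a migration bumps the shard's stored version while the lock is released, and the re-check on reacquisition fails with 13388.

```javascript
// Simplified model: the shard's stored version vs. the version the
// query was sent with. Numbers mirror the error message in the report.
let storedVersion = 5787;

function migrate() {
  storedVersion++; // a chunk migration bumps the shard version
}

function longQuery(receivedVersion, docs) {
  let scanned = 0;
  for (const _ of docs) {
    scanned++;
    if (scanned % 100 === 0) {  // yield point: the lock is released...
      migrate();                // ...a migration slips in (forced here)...
      if (storedVersion !== receivedVersion) {  // ...and the re-check fails
        throw new Error(
          `13388 shard version not ok: stored ${storedVersion} ` +
          `does not match received ${receivedVersion}`);
      }
    }
  }
  return scanned;
}

try {
  longQuery(5787, new Array(200)); // long enough to hit a yield point
} catch (e) {
  console.log(e.message);          // the mismatch surfaces mid-scan
}
```

A short query that finishes before the first yield point never re-checks the version, which is why only slow queries tend to trip this assert.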

Comment by Randolph Tan [ 02/Apr/13 ]

Attached truncated logs (last 100k lines) from running the test on master branch:

failed_dd_master.log - binaries built with --dd; the error was reproduced after running the script.
ok_master.log - binaries built with normal settings; the error was not reproduced after running the script.

Comment by Randolph Tan [ 01/Apr/13 ]

Hi,

We were able to successfully reproduce the bug so we might not need the logs any more. Attaching test script.

Comment by Kay Agahd [ 01/Apr/13 ]

We are running 3 mongos, 3 config servers, and 3 shards (each consisting of 3 mongods). Hardware and configuration of the mongods are identical. They are running on dedicated servers (no virtualisation). Tomorrow I'll set up a fourth mongos with log level 3 in order to reproduce the issue and send you the logs.

Comment by Randolph Tan [ 01/Apr/13 ]

Thanks for the report. Would you be able to provide a mongos log at log level 3 and mongod logs at log level 1? Could you also share the setup of the environment: how many mongos and how many shards (replica sets?).

Thanks!

Comment by Kay Agahd [ 01/Apr/13 ]

Yes, I'm pretty sure that there were active migrations. Must I stop the balancer when executing long-running queries?
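For reference, the balancer can be paused from the mongos shell with the standard helpers, or restricted to an off-peak window via the config database (the 23:00–06:00 window below is just an example):

```javascript
// Run against mongos. Pause migrations around the long query:
sh.setBalancerState(false);   // stop scheduling new migrations
sh.getBalancerState();        // verify: should report false
// ... run the long query ...
sh.setBalancerState(true);

// Alternative: keep the balancer enabled but confine migrations
// to an off-peak window in the config database:
use config
db.settings.update(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
  { upsert: true }
);
```

Note that disabling the balancer waits for any in-progress migration to finish; it does not abort one already underway.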

Comment by Daniel Pasette (Inactive) [ 01/Apr/13 ]

Sorry for the delayed response. Can you tell me if there are a bunch of migrations active in your cluster when you get this error? There was some work done on how commands are run in a sharded cluster in 2.2, and I'd like to follow up. This error is not expected behavior.

Generated at Thu Feb 08 03:18:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.