[SERVER-3951] setShardVersion failed host errors Created: 26/Sep/11  Updated: 27/Apr/12  Resolved: 27/Apr/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Ajay Batheja Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS release 5.6 (Final) x86_64 x86_64 x86_64 GNU/Linux


Participants:

 Description   

Hello,
I got the following errors, when the application try to insert the data in DB. These error started appearing on initial load but after few minutes everything backs to normal.

I found that there were three 500 application errors ( one on each server except its-d01) I looked at the tomcat logs and saw the same three errors which were complaining about mongos. The time of these error entries matches exactly with the error These issues that I noticed got fixed automatically without any changes/intervention from my side. However, what I am not able to comprehend is that if the errors show up in apache logs. tomcat logs and mongos logs, how did the data pertaining to these usernames(a17828, a052l508) still end up in the DB - in spite of these critical errors. Not sure why it happened and how to prevent it from happening in the future.

Here is the current configuration:-

Shards 2
Configuration Servers :- 3
Mongos :- 4

> db.printShardingStatus()
— Sharding Status —
sharding version:

{ "_id" : 1, "version" : 3 }

shards:
{
"_id" : "shard01",
"host" : "shard01/its-d01.db.aol.com:4308,its-m01.db.aol.com:4308"
}
{
"_id" : "shard02",
"host" : "shard02/its-d02.db.aol.com:4308,its-m02.db.aol.com:4308"
}
databases:

{ "_id" : "admin", "partitioned" : false, "primary" : "config" } { "_id" : "test", "partitioned" : false, "primary" : "shard01" } { "_id" : "historyDb", "partitioned" : true, "primary" : "shard01" }

historyDb.History chunks:
shard02 29
shard01 29
too many chunksn to print, use verbose if you want to force print

{ "_id" : "reputationDb", "partitioned" : true, "primary" : "shard01" }

reputationDb.Karma chunks:
shard02 40
shard01 41
too many chunksn to print, use verbose if you want to force print

shard01:PRIMARY> rs.status()
{
"set" : "shard01",
"date" : ISODate("2011-09-22T17:58:13Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "its-d01.db.aol.com:4308",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" :

{ "t" : 1316714290000, "i" : 2 }

,
"optimeDate" : ISODate("2011-09-22T17:58:10Z"),
"self" : true
},
{
"_id" : 1,
"name" : "its-m01.db.aol.com:4308",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 311026,
"optime" :

{ "t" : 1316714290000, "i" : 2 }

,
"optimeDate" : ISODate("2011-09-22T17:58:10Z"),
"lastHeartbeat" : ISODate("2011-09-22T17:58:12Z")
},
{
"_id" : 2,
"name" : "its-m02.db.aol.com:4309",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 311026,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-09-22T17:58:12Z")
}

> db.stats()
{
"raw" : {
"shard01/its-d01.db.aol.com:4308,its-m01.db.aol.com:4308" :

{ "db" : "test", "collections" : 3, "objects" : 5, "avgObjSize" : 40, "dataSize" : 200, "storageSize" : 7168, "numExtents" : 3, "indexes" : 1, "indexSize" : 8192, "fileSize" : 201326592, "ok" : 1 }

},
"objects" : 5,
"avgObjSize" : 40,
"dataSize" : 200,
"storageSize" : 7168,
"numExtents" : 3,
"indexes" : 1,
"indexSize" : 8192,
"fileSize" : 201326592,
"ok" : 1
}

Here are the logs to support my research.

ITS-M01

-----------

Apache logs:-

reputation.aol.com 172.19.150.23 - - [20/Sep/2011:15:50:39 --0400] "GET /its/api/submit_claim?username=a052l508&user_id=a052l508&context=login&source=imagination&reason=evil&confidence=1&value=fraud&ip=172.19.150.23&locale=en-US HTTP/1.1" 500 2494 "" "" Tnjuj6wVzHQAAGK2M-gAAAIt "-" /20110920/20110920-1550/20110920-155039-Tnjuj6wVzHQAAGK2M-gAAAIt 0 997 md5:0474fc81293193ac2f05ed83a2200993

Tomcat Logs:-

----------------------

2011-09-20 15:50:39,046 [TP-Processor37] DEBUG com.aol.trust.model.dao.KarmaDAO$1 - Executing BasicDBObjectCommand { "username" : { "$in" : [ "a052l508"]}} for reputationDb.Karma cursor

2011-09-20 15:50:39,048 [TP-Processor37] DEBUG com.aol.trust.model.dao.KarmaDAO$1 - MongoDB query plan com.aol.trust.model.Karma$DirtyableDBObject

$1@1cad1378

cursor = "null"

nscanned = "0"

n = "0"

millis = "null"

2011-09-20 15:50:39,050 [TP-Processor37] DEBUG com.aol.trust.model.dao.KarmaDAO$1 - Saving Karma identified by {_id=4e78ee8f6db5deac51214377, user

name=a052l508} with document {createTime=Tue Sep 20 15:50:39 EDT 2011, __olv=1, username=a052l508, _id=4e78ee8f6db5deac51214377, lastUpdate=Tue Sep

20 15:50:39 EDT 2011, userId=a052l508, state=fraud, contexts=

{login=com.aol.trust.model.Context@26d939fd}

, lastContext=login}

2011-09-20 15:50:39,052 [TP-Processor37] INFO com.aol.trust.timer.db.write.karma - ITS_DB_WRITE_KARMA_TIME - [a052l508] 2 millisecs

2011-09-20 15:50:39,052 [TP-Processor37] DEBUG com.aol.trust.model.dao.HistoryDAO$2 - Saving History identified by {_id=4e78ee8f6db5deac51214378,

username=a052l508} with document {__olv=1, username=a052l508, time=Tue Sep 20 15:50:39 EDT 2011, scope=login, _id=4e78ee8f6db5deac51214378, source=

imagination, reason=evil, userId=a052l508, xff=null, value=fraud, confidence=1, ip=172.19.150.23}

2011-09-20 15:50:39,117 [TP-Processor37] ERROR org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/its].[ApiServlet] - Servlet.service

() for servlet ApiServlet threw exception

com.mongodb.MongoException: setShardVersion failed host[shard01/its-d01.db.aol.com:4308,its-m01.db.aol.com:4308]

{ oldVersion: Timestamp 2000|5, as sertion: "assertion s/d_state.cpp:501", errmsg: "db assertion failure", ok: 0.0 }

at com.mongodb.CommandResult.getException(CommandResult.java:82)

at com.mongodb.CommandResult.throwOnError(CommandResult.java:116)

at com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:126)

at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:148)

at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:132)

at com.mongodb.DBApiLayer$MyCollection.update(DBApiLayer.java:343)

at org.iternine.jeppetto.dao.mongodb.MongoDBQueryModelDAO.trueSave(MongoDBQueryModelDAO.java:588)

at org.iternine.jeppetto.dao.mongodb.MongoDBQueryModelDAO.save(MongoDBQueryModelDAO.java:257)

at com.aol.trust.IdentityTrustServiceImplementation.saveToHistory(IdentityTrustServiceImplementation.java:310)

at com.aol.trust.IdentityTrustServiceImplementation.submitClaim(IdentityTrustServiceImplementation.java:133)

at com.aol.trust.SubmitClaimHandler.processRequest(SubmitClaimHandler.java:73)

at com.aol.trust.ApiServlet.processApiCall(ApiServlet.java:131)

at com.aol.trust.ApiServlet.doGet(ApiServlet.java:104)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:627)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)

at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)

at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)

at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:204)

at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)

at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)

at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)

at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)

Mongos

----------

Tue Sep 20 15:50:39 [conn8] setShardVersion failed host[shard01/its-d01.db.aol.com:4308,its-m01.db.aol.com:4308]

{ oldVersion: Timestamp 2000|5, asserti on: "assertion s/d_state.cpp:501", errmsg: "db assertion failure", ok: 0.0 }

Tue Sep 20 15:50:39 [conn8] Assertion: 10429:setShardVersion failed host[shard01/its-d01.db.aol.com:4308,its-m01.db.aol.com:4308]

{ oldVersion: Timestamp 200 0|5, assertion: "assertion s/d_state.cpp:501", errmsg: "db assertion failure", ok: 0.0 }

0x5204fa 0x6a15ed 0x6a1152 0x6a1152 0x6a1152 0x578ce6 0x57715a 0x6311ed 0x6395bc 0x63d433 0x6682f2 0x67d187 0x580b7c 0x6a50a0 0x321420673d 0x32136d44bd

/data/servers/mongodb/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x12a) [0x5204fa]

/data/servers/mongodb/bin/mongos [0x6a15ed]

/data/servers/mongodb/bin/mongos [0x6a1152]

/data/servers/mongodb/bin/mongos [0x6a1152]

/data/servers/mongodb/bin/mongos [0x6a1152]

/data/servers/mongodb/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S

7_bi+0x16) [0x578ce6]

/data/servers/mongodb/bin/mongos(_ZN5mongo15ShardConnection11_finishInitEv+0xba) [0x57715a]

/data/servers/mongodb/bin/mongos(_ZN5mongo8Strategy7doWriteEiRNS_7RequestERKNS_5ShardEb+0xdd) [0x6311ed]

/data/servers/mongodb/bin/mongos(_ZN5mongo13ShardStrategy7_updateERNS_7RequestERNS_9DbMessageEN5boost10shared_ptrINS_12ChunkManagerEEE+0xd1c) [0x6395bc]

/data/servers/mongodb/bin/mongos(_ZN5mongo13ShardStrategy7writeOpEiRNS_7RequestE+0x2b3) [0x63d433]

/data/servers/mongodb/bin/mongos(_ZN5mongo7Request7processEi+0x172) [0x6682f2]

/data/servers/mongodb/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x77) [0x67d187]

/data/servers/mongodb/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x34c) [0x580b7c]

/data/servers/mongodb/bin/mongos(thread_proxy+0x80) [0x6a50a0]

/lib64/libpthread.so.0 [0x321420673d]

/lib64/libc.so.6(clone+0x6d) [0x32136d44bd]

Tue Sep 20 15:50:39 [conn8] ~ScopedDBConnection: _conn != null

Tue Sep 20 15:50:39 [conn8] AssertionException in process: setShardVersion failed host[shard01/its-d01.db.aol.com:4308,its-m01.db.aol.com:4308]

{ oldVersion: Timestamp 2000|5, assertion: "assertion s/d_state.cpp:501", errmsg: "db assertion failure", ok: 0.0 }

Mongod

------------

> db.Karma.find(

{"username" : "a052l508"}

)

{ "id" : ObjectId("4e78ee8f6db5deac51214377"), "createTime" : ISODate("2011-09-20T19:50:39.046Z"), "_olv" : 1, "username" : "a052l508", "lastUpdate" : ISODate("2011-09-20T19:50:39.046Z"), "userId" : "a052l508", "state" : "fraud", "contexts" : { "login" :

{ "time" : ISODate("2011-09-20T19:50:39.046Z"), "scope" : "login", "source" : "imagination", "reason" : "evil", "xff" : null, "value" : "fraud", "confidence" : "1", "ip" : "172.19.150.23" }

}, "lastContext" : "login" }



 Comments   
Comment by Greg Studer [ 02/Jan/12 ]

Any update on whether a collection was dropped/recreated around this time? If running successive jobs where this occurs, you'll need to flushRouterConfig all mongoses afterward in 2.0 branch.

Comment by Greg Studer [ 21/Dec/11 ]

Were collections dropped or created in mongodb before this time? The assertion seems to indicate that the cluster versioning info got reset and was still stale on either the mongods or mongos.

Comment by Eliot Horowitz (Inactive) [ 21/Dec/11 ]

What version was this on?

Comment by Joseph Shraibman [ 21/Dec/11 ]

I tried restarting my chain. It completed that job that failed, but I got the assertion failure on the next one.

Comment by Joseph Shraibman [ 21/Dec/11 ]

I also got this assertion, I have no idea why. This was on stage 2 of a multipart Hadoop MapReduce job. Stage 1 ran fine. Backend is 2.0.1 with a single shard.

11/12/21 00:51:19 INFO mapreduce.Job: Task Id : attempt_201112201847_0050_m_000000_2, Status : FAILED
com.mongodb.CommandResult$CommandFailure: command failed [command failed [count] { "serverUsed" : "gw:30010" , "assertion" : "setShardVersion failed host: gw:20011

{ oldVersion: Timestamp 1000|21174, assertion: \"assertion s/d_state.cpp:513\", errmsg: \"db assertion failure\", ok: 0.0 }

" , "assertionCode" : 10429 , "errmsg" : "db assertion failure" , "ok" : 0}
at com.mongodb.CommandResult.getException(CommandResult.java:75)
at com.mongodb.CommandResult.throwOnError(CommandResult.java:121)
at com.mongodb.DBCollection.getCount(DBCollection.java:863)
at com.mongodb.DBCursor.size(DBCursor.java:646)
at com.mongodb.hadoop.input.MongoRecordReader.initialize(MongoRecordReader.java:60)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:452)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:644)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:320)
at org.apache.hadoop.mapred.Child$4.run(Child.java:223)
at java.security.AccessController.doPrivileged(Native Meth
attempt_201112201847_0050_m_000000_2: should setup context
attempt_201112201847_0050_m_000000_2: should abort task

Generated at Thu Feb 08 03:04:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.