[SERVER-14213] Config Server Corruption - BSONObj size: 1852404841 (0x6974696E) is invalid. Size must be between 0 and 16793600(16MB)
Created: 09/Jun/14  Updated: 10/Dec/14  Resolved: 09/Jun/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Question
Priority: Major - P3
Reporter: Mike
Assignee: Ramon Fernandez Marina
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hi MongoDB,

I'm running a sharded cluster with:

  • 3 MongoS instances (version 2.4.5)
  • 9 MongoD instances (3 nodes per shard: a primary, a secondary, and an arbiter; all version 2.4.5)
  • 3 Config Servers (version 2.4.5)

When I attempt to shard a new collection on an existing DB I get the following:

sh.shardCollection("Customer.CustomerEventVisits",{a:1,b:1},true)
{
        "code" : 8017,
        "ok" : 0,
        "errmsg" : "exception: update not consistent  ns: config.chunks query: { _id: \"Customer.CustomerEventVisits-a_MinKeyb_MinKey\" } update: { _id: \"Customer.CustomerEventVisits-a_MinKeyb_MinKey\", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('5395c5793d5ce34d5ccd6823'), ns: \"Customer.CustomerEventVisits\", min: { a: MinKey, b: MinKey }, max: { a: MaxKey, b: MaxKey }, shard: \"j1shard\" } gle1: { updatedExisting: false, n: 1, lastOp: Timestamp 1402324345000|3, connectionId: 2032481, waited: 27, err: null, ok: 1.0 } gle2: { err: \"BSONObj size: 1852404841 (0x6974696E) is invalid. Size must be between 0 and 16793600(16MB) First element: : ?type=103\", code: 10334, n: 0, connectionId: 2031264, waited: 10, ok: 1.0 
}

I'm also seeing the following logged on each config server:

Mon Jun  9 13:09:03.238 [LockPinger] warning: distributed lock pinger 'bomongodbc1n1:30003,bomongodbc1n2:30003,bomongodbc1n3:30003/bomongos02.csnzoo.com:30004:1396381269:1804289383' detected an exception while pinging. :: caused by :: update not consistent  ns: config.lockpings query: { _id: "bomongos02.csnzoo.com:30004:1396381269:1804289383" } update: { $set: { ping: new Date(1402333743122) } } gle1: { updatedExisting: true, n: 1, lastOp: Timestamp 1402333743000|2, connectionId: 2035785, waited: 36, err: null, ok: 1.0 } gle2: { err: "BSONObj size: 1852404841 (0x6974696E) is invalid. Size must be between 0 and 16793600(16MB) First element: : ?type=103", code: 10334, n: 0, connectionId: 2034560, waited: 4, ok: 1.0 }

as well as entries like:

Jun  9 13:24:42 bomongodbc1n3 mongod.30003[32118]: Mon Jun  9 13:24:42.223 [conn2034890] update config.mongos query: { _id: "bomongos01.csnzoo.com:30004" } update: { $set: { ping: new Date(1402334682197), up: 5953440, waiting: true, mongoVersion: "2.4.5" } } idhack:1 fastmod:1 keyUpdates:0 exception: BSONObj size: 1852404841 (0x6974696E) is invalid. Size must be between 0 and 16793600(16MB) First element: : ?type=103 code:10334 locks(micros) w:25423 12ms
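A note on reading this error: the hex value in the message, 0x6974696E, consists entirely of printable ASCII bytes, and the reported first-element type, 103, is the ASCII code for "g". That pattern would be consistent with a BSON length prefix being read from the middle of string data (for example, the field name "waiting" in the config.mongos update above), although nothing in this ticket confirms that. The following is a minimal sketch in plain JavaScript (Node.js, not the mongo shell) that decodes the reported size; the helper names are hypothetical and are not part of any MongoDB API.

```javascript
// Upper bound quoted in the error message: 16 MB plus 16 KB.
const MAX_BSON_SIZE = 16 * 1024 * 1024 + 16 * 1024; // 16793600

// Hypothetical helper: mirrors the range quoted in the error message.
function isPlausibleBsonSize(size) {
  return size > 0 && size <= MAX_BSON_SIZE;
}

// Hypothetical helper: lay the size out as the four little-endian bytes
// a BSON length prefix occupies, then show them as hex and as ASCII.
function sizeBytesAsAscii(size) {
  const buf = Buffer.alloc(4);
  buf.writeUInt32LE(size, 0);
  return { hex: buf.toString("hex"), ascii: buf.toString("latin1") };
}

const reported = 1852404841; // size taken from the error message

console.log(isPlausibleBsonSize(reported));    // false
console.log(sizeBytesAsAscii(reported).hex);   // 6974696e
console.log(sizeBytesAsAscii(reported).ascii); // itin
console.log(String.fromCharCode(103));         // g  (the "type=103" byte)
```

Read together, the five bytes spell "iting". A size field that decodes to readable text is only a hint that the server read a document header at the wrong offset or from overwritten data; it does not say why that happened.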

I believe my config server collections (lockpings and mongos) have bad data in them. In fact, when I look at the documents in each, there are entries for old mongos instances that no longer exist, and the lock times, as well as which entries are valid, are inconsistent when I compare across the 3 config servers.

Any idea on how to resolve this?

It's a production instance, so I'm hesitant to make a change, and it doesn't sound like my config backups will help, since this has been going on for longer than my backup retention period...

Thanks so much!
Mike



 Comments   
Comment by Ramon Fernandez Marina [ 09/Jun/14 ]

Hi amarettoslim,

As you point out, the issue came from one config server having corrupted data. Corruption can happen for various reasons, and its cause is usually hard to track down, although the more common causes are network problems and failing hard drives. I would recommend that you check the health of the hard drives in the failed config server to be on the safe side.

Regards,
Ramón.

Comment by Mike [ 09/Jun/14 ]

I tracked the issue down to one config server instance out of the 3 and resolved the matter by replacing the bad config server's data directory with a working copy from one of the other two.

http://docs.mongodb.org/manual/tutorial/replace-config-server/
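For anyone repeating this diagnosis, one way to single out the inconsistent config server is a majority comparison of per-collection digests. The sketch below is plain JavaScript (Node.js) with made-up host names and digests; it assumes the digests have already been collected from each config server, for instance with the `dbhash` command against each server's `config` database. The ticket does not say how the comparison was actually done, and `findOutliers` is a hypothetical helper.

```javascript
// Hypothetical input: one { host, hashes } record per config server, where
// hashes maps a collection name to a digest string. All values are made up.
function findOutliers(servers) {
  const collections = new Set();
  for (const s of servers) {
    Object.keys(s.hashes).forEach((name) => collections.add(name));
  }

  const outliers = [];
  for (const coll of collections) {
    // Count how many servers report each digest for this collection.
    const counts = new Map();
    for (const s of servers) {
      const digest = s.hashes[coll];
      counts.set(digest, (counts.get(digest) || 0) + 1);
    }

    // Pick the digest reported by the most servers.
    let majority;
    let best = 0;
    for (const [digest, n] of counts) {
      if (n > best) {
        best = n;
        majority = digest;
      }
    }

    // Without a strict majority there is no way to tell which copy is good.
    if (best * 2 <= servers.length) {
      outliers.push({ collection: coll, host: null });
      continue;
    }

    // Every server that disagrees with the majority is flagged.
    for (const s of servers) {
      if (s.hashes[coll] !== majority) {
        outliers.push({ collection: coll, host: s.host });
      }
    }
  }
  return outliers;
}

// Example with three hypothetical config servers; cfg3 disagrees on two collections.
const sample = [
  { host: "cfg1", hashes: { chunks: "aaa", lockpings: "bbb", mongos: "ccc" } },
  { host: "cfg2", hashes: { chunks: "aaa", lockpings: "bbb", mongos: "ccc" } },
  { host: "cfg3", hashes: { chunks: "aaa", lockpings: "xxx", mongos: "yyy" } },
];

for (const o of findOutliers(sample)) {
  console.log(`${o.collection}: ${o.host}`);
}
```

A `host` of `null` marks a collection on which no strict majority of servers agree, in which case choosing a copy to restore from needs a manual look at the data.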

I hope this helps someone else. It would still be nice to know what causes this in the first place, so if someone from Mongo wants to comment, that'd be excellent.

-Mike
