[SERVER-10560] bad offset:-33553922 Created: 19/Aug/13  Updated: 10/Dec/14  Resolved: 15/Jan/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: huangxing Assignee: David Hows
Resolution: Done Votes: 0
Labels: replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: RHEL 6.4 64 bit
Hardware: Vmware


Participants:

 Description   

Something is wrong with my replica set when I insert data from Java:

sh0:PRIMARY> rs.status()
{
"set" : "sh0",
"date" : ISODate("2013-08-19T05:37:36Z"),
"myState" : 1,
"members" : [
{
"_id" : 1,
"name" : "192.168.69.43:10000",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 256710,
"optime" : { "t" : 1376721151, "i" : 167 },
"optimeDate" : ISODate("2013-08-17T06:32:31Z"),
"self" : true
},
{
"_id" : 2,
"name" : "192.168.69.44:10000",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 252135,
"optime" : { "t" : 1376721151, "i" : 167 },
"optimeDate" : ISODate("2013-08-17T06:32:31Z"),
"lastHeartbeat" : ISODate("2013-08-19T05:37:35Z"),
"lastHeartbeatRecv" : ISODate("2013-08-19T05:37:35Z"),
"pingMs" : 0,
"lastHeartbeatMessage" : "db exception in producer: 10320 BSONElement: bad type -2",
"syncingTo" : "192.168.69.43:10000"
},
{
"_id" : 3,
"name" : "192.168.69.45:10000",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 256659,
"optime" : { "t" : 1376721151, "i" : 167 },
"optimeDate" : ISODate("2013-08-17T06:32:31Z"),
"lastHeartbeat" : ISODate("2013-08-19T05:37:35Z"),
"lastHeartbeatRecv" : ISODate("2013-08-19T05:37:35Z"),
"pingMs" : 0,
"lastHeartbeatMessage" : "db exception in producer: 10320 BSONElement: bad type -2",
"syncingTo" : "192.168.69.43:10000"
}
],
"ok" : 1
}
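
To make the output above easier to read, here is a minimal sketch that condenses an rs.status()-style document into one line per member. The function name summarizeReplStatus is hypothetical (it is not a MongoDB API), and the sketch assumes the optime is available as { t: <seconds since epoch>, i: <counter> }, as printed above. The sample values are copied from the output above, trimmed to two members.

```javascript
// Hypothetical helper, not part of MongoDB: condense an rs.status()-style
// document into one summary object per member.
// Assumption: optime has the shape { t: <seconds since epoch>, i: <counter> }.
function summarizeReplStatus(status) {
  var statusMs = status.date.getTime();
  return status.members.map(function (m) {
    var optimeMs = m.optime.t * 1000;
    return {
      name: m.name,
      state: m.stateStr,
      optimeDate: new Date(optimeMs).toISOString(),
      // How far the last applied operation lags the time rs.status() was run.
      secondsBehindStatusDate: Math.round((statusMs - optimeMs) / 1000),
      heartbeatError: m.lastHeartbeatMessage || null
    };
  });
}

// Values taken from the rs.status() output in this ticket (two members only).
var reportedStatus = {
  set: "sh0",
  date: new Date("2013-08-19T05:37:36Z"),
  members: [
    { name: "192.168.69.43:10000", stateStr: "PRIMARY",
      optime: { t: 1376721151, i: 167 } },
    { name: "192.168.69.44:10000", stateStr: "SECONDARY",
      optime: { t: 1376721151, i: 167 },
      lastHeartbeatMessage: "db exception in producer: 10320 BSONElement: bad type -2" }
  ]
};

var summary = summarizeReplStatus(reportedStatus);
console.log(summary);
```

With these values, every member's optime decodes to 2013-08-17T06:32:31Z, roughly 47 hours before the time rs.status() was run. That fits writes having stopped being recorded in the oplog.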

This is the error log from the primary:

Mon Aug 19 13:26:58.055 [conn18900] assertion 13440 bad offset:-33553922 accessing file: /mongodb/scheme2/sh0/data/local.1 - consider repairing database ns:local.oplog.rs query:{}
Mon Aug 19 13:26:58.063 [conn18900] command local.$cmd command: { repairDatabase: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:357442 reslen:208 357ms
Mon Aug 19 13:27:10.025 [LockPinger] cluster 192.168.69.43:20000,192.168.69.44:20000,192.168.69.45:20000 pinged successfully at Mon Aug 19 13:27:09 2013 by distributed lock pinger '192.168.69.43:20000,192.168.69.44:20000,192.168.69.45:20000/mongo_43:10000:1376661629:1699669244', sleeping for 30000ms

This log is from one of my secondaries:

Mon Aug 19 13:32:07.171 [rsBackgroundSync] replSet db exception in producer: 10320 BSONElement: bad type -2
Mon Aug 19 13:32:17.171 [rsBackgroundSync] replSet syncing to: 192.168.69.43:10000
Mon Aug 19 13:32:17.174 [rsBackgroundSync] Assertion: 10320:BSONElement: bad type -2
0xdd2331 0xd93c6b 0x6eabc9 0xba72d9 0xba7add 0xbac1b9 0xbad31d 0xe1aad9 0x3c3e007851 0x3c3dce890d
mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd93c6b]
mongod(_ZNK5mongo11BSONElement4sizeEv+0x1f9) [0x6eabc9]
mongod(_ZN5mongo7replset14BackgroundSync7isStaleERNS_11OplogReaderERNS_7BSONObjE+0x319) [0xba72d9]
mongod(_ZN5mongo7replset14BackgroundSync14getOplogReaderERNS_11OplogReaderE+0x36d) [0xba7add]
mongod(_ZN5mongo7replset14BackgroundSync7produceEv+0x39) [0xbac1b9]
mongod(_ZN5mongo7replset14BackgroundSync14producerThreadEv+0x2d) [0xbad31d]

So I can't insert data any more. After running the command db.repairDatabase(), the problem still exists.

sh0:PRIMARY> db.oplog.rs.find()
error: {
"$err" : "bad offset:-33553922 accessing file: /mongodb/scheme2/sh0/data/local.1 - consider repairing database",
"code" : 13440
}

I think there is something wrong with the oplog, but I don't know how to deal with it. Could someone give me a hand?



 Comments   
Comment by David Hows [ 21/Oct/13 ]

Hi huangxing,

This sounds like there is some form of corruption in the oplog of the primary. Given this, you have a few options for how to proceed. Please note that you should perform a backup before following any of the advice here to avoid unintended data loss.

  • Restore the primary from the most recent backup
  • If you have an up-to-date secondary, you can stop the primary, have a secondary take over, and then re-sync the corrupt member as detailed in our documentation
  • Lastly, if you are worried that this primary has data that is contained in no other member, you can perform the following process:
    1. Stop all members of the replSet
    2. Remove all the files matching local.* from the primary's dbpath
    3. Start the primary again and run rs.initiate()
    4. Delete the dbpath of one secondary, start it, then run rs.add() on the primary to add this new member
    5. Once this member has fully re-synced, perform the same process for all other secondaries
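
For step 5, one rough way to judge whether a re-added member looks ready is to inspect an rs.status()-style document, as sketched below. The function looksResynced, its maxLagSeconds parameter, and the 10-second default are all hypothetical, and the criterion (healthy, in state SECONDARY, optime within a few seconds of the primary's) is an assumption rather than an official rule. The sample documents are made up for illustration.

```javascript
// Hypothetical helper, not part of MongoDB: decide from an rs.status()-style
// document whether a re-added member appears to have finished its initial sync.
// Assumed criterion: health is 1, state is SECONDARY, and the member's optime
// is at most maxLagSeconds behind the primary's (default 10, an arbitrary choice).
function looksResynced(status, memberName, maxLagSeconds) {
  var allowedLag = (typeof maxLagSeconds === "number") ? maxLagSeconds : 10;
  var primary = status.members.filter(function (m) {
    return m.stateStr === "PRIMARY";
  })[0];
  var member = status.members.filter(function (m) {
    return m.name === memberName;
  })[0];
  if (!primary || !member || !primary.optime || !member.optime) {
    return false;
  }
  var lagSeconds = primary.optime.t - member.optime.t;
  return member.health === 1 &&
         member.stateStr === "SECONDARY" &&
         lagSeconds <= allowedLag;
}

// Made-up status documents: one taken while the member is still syncing,
// one after it has caught up.
var duringSync = { members: [
  { name: "192.168.69.43:10000", health: 1, stateStr: "PRIMARY",
    optime: { t: 1376900000, i: 1 } },
  { name: "192.168.69.44:10000", health: 1, stateStr: "STARTUP2",
    optime: { t: 0, i: 0 } }
] };
var afterSync = { members: [
  { name: "192.168.69.43:10000", health: 1, stateStr: "PRIMARY",
    optime: { t: 1376900000, i: 1 } },
  { name: "192.168.69.44:10000", health: 1, stateStr: "SECONDARY",
    optime: { t: 1376899998, i: 4 } }
] };

console.log(looksResynced(duringSync, "192.168.69.44:10000"));
console.log(looksResynced(afterSync, "192.168.69.44:10000"));
```

The check is deliberately conservative: a member that is missing from the status document, or that has no optime yet, is treated as not ready.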

Regards,
David

Generated at Thu Feb 08 03:23:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.