[SERVER-10939] Unable to start Secondary due to error: Assertion: 10320:BSONElement: bad type 101 Created: 27/Sep/13  Updated: 15/Nov/21  Resolved: 06/Nov/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Appnique OMS Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Participants:

Comments
Comment by Appnique OMS [ 01/Oct/13 ]

Hi Mike,

Is there any way to repair the database without increasing the data size, i.e. without doubling it?

What operations does repair perform internally?

Comment by Appnique OMS [ 30/Sep/13 ]

Yes, Mike: running behind means that the last oplog entry they applied is behind that problematic secondary.
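For reference, a minimal sketch of comparing each member's last applied oplog entry with the standard shell helpers (run from a mongo shell connected to any member):

> rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr + "  " + m.optimeDate); })  // optimeDate = last oplog entry applied
> db.printSlaveReplicationInfo()  // prints each secondary's lag behind the primary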

Comment by Michael Grundy [ 30/Sep/13 ]

When you say "running behind" do you mean that the last oplog entry they applied is behind that secondary, or that you have a chained replication setup where those other secondaries are replicating from the problem secondary?

Comment by Appnique OMS [ 30/Sep/13 ]

But Mike, the other secondaries are running behind the problematic one...

Comment by Michael Grundy [ 30/Sep/13 ]

If the other secondaries are fine, I would lean towards resyncing the problem secondary. No specific bug has been found in 2.2.2 that matches your issue. There have been similar occurrences, caused by a faulty driver or, more frequently, by disk- or network-based corruption.

A driver issue would have affected the other secondaries when they applied the changes that caused the problem, hence my inclination towards resyncing the problematic one.

Mike

Comment by Appnique OMS [ 29/Sep/13 ]

Hi Mike -

I validated the collection as well; it is fine. I'm a bit confused about where the actual problem is, i.e. in the primary or the secondary.

The rest of the secondaries are behind the secondary which is having the problem; they are working fine.

About the application driver: the back end uses Moped as the MongoDB driver (Ruby). The application itself uses Mongoid, which is an ODM layer whose only supported driver is Moped.
Here are the versions we use.
Mongoid -> 3.0.15
Moped -> 1.3.1

Is there any bug in MongoDB version 2.2.2 regarding this?

Ashraf.

Comment by Michael Grundy [ 27/Sep/13 ]

Hi Ashraf -

Good that the oplog validated OK. Did you also run it on the collection? At this stage, the quickest way forward may be to resync the node from the Primary. This is only an option if you have enough other nodes available to keep a majority up. Are all Secondary nodes of the Replica Set impacted in the same way? If you don't have enough nodes for a majority, you might have to add an arbiter, but given that this secondary can't start, this probably doesn't apply.

Directions for resyncing a stale member are at http://docs.mongodb.org/manual/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member ; if you have space, move the database files out of your dbpath instead of deleting them.
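A minimal sketch of that manual resync, run on the problem secondary only (service name and paths are illustrative):

$ sudo service mongod stop        # shut the node down cleanly
$ mv /data/db /data/db.bak        # keep the old files rather than deleting, if space allows
$ mkdir /data/db
$ sudo service mongod start       # the node rejoins and performs a full initial sync from scratch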

If the resync is unable to move forward, then a repair of the database on the primary is your next step, followed by resyncing the secondaries. A database repair will fix most issues, and all of the indexes will be rebuilt. Please make backup copies when possible before proceeding.
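For illustration, the repair can be run either offline against the data files or per database from the shell (dbpath and database name are illustrative):

$ mongod --repair --dbpath /data/db    # offline; mongod must be stopped first
> use database
> db.repairDatabase()                  # online, from the mongo shell, one database at a time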

I'm concerned about where the invalid BSON objects came from. What driver and version is your application using? Prior to this problem, did you have any issues with crashes or network failures? Are you running with journalling turned off?

2.2.2 can be fine for some installations and problematic for others, depending on config and usage patterns. The full list of changes available after 2.2.2 can be reviewed here.

Thanks!
Mike

Comment by Appnique OMS [ 27/Sep/13 ]

Hi Mike,
In order to restart the servers with --objcheck: my secondaries are very far behind my primary (around 10GB of data); if I restart, I will go into a rollback state. I don't think it's a good idea for now.

Second thing: regarding corruption in the oplog, I validated the oplog and it was fine.

Will repairing the database definitely solve this error?

I am running version 2.2.2; is it fine?

Thanks
Ashraf

Comment by Michael Grundy [ 27/Sep/13 ]

Hi Ashraf -

You're getting bad types and a bad size (SERVER-10938) because invalid BSON is being inserted or something is corrupting the BSON objects.

As a first step, I recommend restarting all instances with --objcheck; this will throw an error any time an invalid object is inserted on the primary. SERVER-9018 is an example of where this exposed an application error.
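For example, the flag just gets added to the existing startup line; the other options here are illustrative:

$ mongod --objcheck --replSet rs0 --dbpath /data/db --logpath /var/log/mongod.log --fork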

There may be corruption in your oplog or the collection; validating may help isolate the issue (the validate docs are here). Please be aware that validate does have a performance overhead, so run it during off-peak hours or a maintenance window.
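For example, from the mongo shell (collection name illustrative; passing true requests a full validation):

> db.collection.validate(true)
> use local
> db.oplog.rs.validate(true)    // the replica set oplog lives in the local database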

Repairing the database may get you past this, but be aware that it will require free disk space at least equal to the size of the database being repaired.

Finally, you should upgrade to the latest version in the 2.2 release line, currently 2.2.6. There are significant fixes available since 2.2.2.
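For example, to confirm which version a node is actually running before and after the upgrade:

> db.version()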

Thanks
Mike

Comment by Appnique OMS [ 27/Sep/13 ]

Hi Mike,
Thanks for the reply.

Here is the log entries :-
**************
keyUpdates:0 locks(micros) r:235420 nreturned:76 reslen:148174 235ms
Fri Sep 27 09:26:10 [repl writer worker 2] Assertion: 10320:BSONElement: bad type 101
0xaffd31 0xac5eb9 0x57105b 0x5c10d6 0x84722a 0x840034 0x844937 0x822e2f 0x9a34c2 0x9a2d4f 0xad35cd 0xb45ba9 0x7f4fbe514851 0x7f4fbd8b711d
mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
mongod(_ZN5mongo11msgassertedEiPKc+0x99) [0xac5eb9]
mongod(_ZNK5mongo11BSONElement4sizeEv+0x1cb) [0x57105b]
mongod(_ZN5mongo15BSONObjIterator4nextEv+0x46) [0x5c10d6]
mongod(_ZNK5mongo6ModSet7prepareERKNS_7BSONObjE+0x24a) [0x84722a]
mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x10f4) [0x840034]
mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0x844937]
mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0xabf) [0x822e2f]
mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x292) [0x9a34c2]
mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x4f) [0x9a2d4f]
mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xad35cd]
mongod() [0xb45ba9]
/lib64/libpthread.so.0(+0x7851) [0x7f4fbe514851]
/lib64/libc.so.6(clone+0x6d) [0x7f4fbd8b711d]
Fri Sep 27 09:26:10 [repl writer worker 2] ERROR: writer worker caught exception: BSONElement: bad type 101 on: { ts: Timestamp 1380235717000|6, h: *******************, v: 2, op: "u", ns: "database.collection", o2: { field: { field: { field: "592425919" }, uc: 160 } }, o: { $set: { until: new Date(1380235717000) } } }
Fri Sep 27 09:26:10 [repl writer worker 2] Fatal Assertion 16360
0xaffd31 0xac5323 0x9a2e56 0xad35cd 0xb45ba9 0x7f4fbe514851 0x7f4fbd8b711d
mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xac5323]
mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x156) [0x9a2e56]
mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xad35cd]
mongod() [0xb45ba9]
/lib64/libpthread.so.0(+0x7851) [0x7f4fbe514851]
/lib64/libc.so.6(clone+0x6d) [0x7f4fbd8b711d]
Fri Sep 27 09:26:10 [repl writer worker 2]

***aborting after fassert() failure

Fri Sep 27 09:26:10 Got signal: 6 (Aborted).

Fri Sep 27 09:26:10 Backtrace:
0xaffd31 0x558bb9 0x7f4fbd801920 0x7f4fbd8018a5 0x7f4fbd803085 0xac535e 0x9a2e56 0xad35cd 0xb45ba9 0x7f4fbe514851 0x7f4fbd8b711d
mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
mongod(_ZN5mongo10abruptQuitEi+0x399) [0x558bb9]
/lib64/libc.so.6(+0x32920) [0x7f4fbd801920]
/lib64/libc.so.6(gsignal+0x35) [0x7f4fbd8018a5]
/lib64/libc.so.6(abort+0x175) [0x7f4fbd803085]
mongod(_ZN5mongo13fassertFailedEi+0xde) [0xac535e]
mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x156) [0x9a2e56]
mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xad35cd]
mongod() [0xb45ba9]
/lib64/libpthread.so.0(+0x7851) [0x7f4fbe514851]
/lib64/libc.so.6(clone+0x6d) [0x7f4fbd8b711d]
****************
For now, to start the secondary instance, I am just changing the optype of the record to "n" in the primary's oplog, but by doing this I am losing my data.
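For the record, a hedged sketch of what that workaround looks like, run on the primary and using the ts value from the error above (assuming the logged 1380235717000|6 is seconds*1000 plus an increment); hand-editing the oplog is dangerous and, as noted, loses the write:

> use local
> db.oplog.rs.update({ ts: Timestamp(1380235717, 6) }, { $set: { op: "n" } })   // "n" = no-op; same byte length as "u", so the capped oplog accepts the update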

Thanks,
Ashraf

Comment by Michael Grundy [ 27/Sep/13 ]

Hi Ashraf -

Could you provide the log entries from the secondary that contain the assertion and the actions around it? (complete log is ok).

Thanks
Mike
