[SERVER-7112] Replica Secondaries won't start: ERROR: writer worker caught exception: Invalid BSONObj size Created: 23/Sep/12  Updated: 08/Mar/13  Resolved: 08/Oct/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: m Assignee: Spencer Brody (Inactive)
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

I can't start any replica secondaries:

          • SERVER RESTARTED *****

Sun Sep 23 17:27:05 [initandlisten] MongoDB starting : pid=4177 port=27017 dbpath=/ebs/mongo/db 64-bit host=ge2-mongo-2-b.domain.com
Sun Sep 23 17:27:05 [initandlisten] db version v2.2.0, pdfile version 4.5
Sun Sep 23 17:27:05 [initandlisten] git version: f5e83eae9cfbec7fb7a071321928f00d1b0c5207
Sun Sep 23 17:27:05 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Sun Sep 23 17:27:05 [initandlisten] options:

{ config: "/ebs/mongo/LATEST/bin/mongo.conf", dbpath: "/ebs/mongo/db", directoryperdb: "true", fork: "true", logappend: "true", logpath: "/mnt/mongo/mongo.log", nojournal: "true", port: 27017, quiet: "true", replSet: "ge-2", rest: "true" }

Sun Sep 23 17:27:05 [initandlisten] Unable to check for journal files due to: boost::filesystem::basic_directory_iterator constructor: No such file or directory: "/ebs/mongo/db/journal"
Sun Sep 23 17:27:05 [websvr] admin web console waiting for connections on port 28017
Sun Sep 23 17:27:05 [initandlisten] waiting for connections on port 27017
Sun Sep 23 17:27:05 [rsStart] replSet I am ge2.mongo.2.b.domain.com:27017
Sun Sep 23 17:27:05 [rsStart] replSet STARTUP2
Sun Sep 23 17:27:05 [rsHealthPoll] replSet member ge2.mongo.2.arb.domain.com:27017 is up
Sun Sep 23 17:27:05 [rsHealthPoll] replSet member ge2.mongo.2.arb.domain.com:27017 is now in state ARBITER
Sun Sep 23 17:27:05 [rsHealthPoll] replSet member ge2.mongo.2.a.domain.com:27017 is up
Sun Sep 23 17:27:05 [rsHealthPoll] replSet member ge2.mongo.2.a.domain.com:27017 is now in state PRIMARY
Sun Sep 23 17:27:06 [rsSync] replSet still syncing, not yet to minValid optime 505e88d1:f
Sun Sep 23 17:27:09 [rsBackgroundSync] replSet syncing to: ge2.mongo.2.a.domain.com:27017
Sun Sep 23 17:27:09 [rsSync] replSet still syncing, not yet to minValid optime 505e88d1:f
Sun Sep 23 17:27:15 [rsSyncNotifier] replset setting oplog notifier to ge2.mongo.2.a.domain.com:27017
Sun Sep 23 17:27:15 [conn28] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2502ms
Sun Sep 23 17:27:15 [conn29] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2502ms
Sun Sep 23 17:27:15 [conn30] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2499ms
Sun Sep 23 17:27:15 [conn35] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2497ms
Sun Sep 23 17:27:15 [conn36] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2495ms
Sun Sep 23 17:27:15 [conn38] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2494ms
Sun Sep 23 17:27:15 [conn40] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2487ms
Sun Sep 23 17:27:15 [conn42] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2382ms
Sun Sep 23 17:27:15 [conn37] command admin.$cmd command:

{ replSetGetStatus: 1 }

ntoreturn:1 keyUpdates:0 reslen:732 2495ms
Sun Sep 23 17:27:15 [repl writer worker 1] Assertion: 10334:Invalid BSONObj size: -286331154 (0xEEEEEEEE) first element: EOO
0xade6e1 0x8036eb 0x80386c 0x571bc7 0x571f4c 0x74acfa 0xc06088 0xc0acf2 0xc0cddd 0xb70478 0x64fd61 0x64f687 0x77d3dd 0x7c3659 0x7fd3c8ede9ca 0x7fd3c828570d
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0x8036eb]
/ebs/mongo/LATEST/bin/mongod() [0x80386c]
/ebs/mongo/LATEST/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x497) [0x571bc7]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo7BSONObjC1EPKc+0x3c) [0x571f4c]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo11DataFileMgr12updateRecordEPKcPNS_16NamespaceDetailsEPNS_25NamespaceDetailsTransientEPNS_6RecordERKNS_7DiskLocES2_iRNS_7OpDebugEb+0x6a) [0x74acfa]
/ebs/mongo/LATEST/bin/mongod() [0xc06088]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyE+0x36b2) [0xc0acf2]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0x12d) [0xc0cddd]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x1228) [0xb70478]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x271) [0x64fd61]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x37) [0x64f687]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
/ebs/mongo/LATEST/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x69ca) [0x7fd3c8ede9ca]
/lib/libc.so.6(clone+0x6d) [0x7fd3c828570d]
Sun Sep 23 17:27:15 [repl writer worker 1] ERROR: writer worker caught exception: Invalid BSONObj size: -286331154 (0xEEEEEEEE) first element: EOO on: { ts: Timestamp 1348372689000|15, h: -5104871501619906441, op: "i", ns: "ge_raw.raw_hits", o: { _id: ObjectId('505e88c573f6927b99000001'), _types: [ "Hit" ], tt_convert: {}, ad_id: "814e6e0a-e225-4f5e-b75e-26573a07c908", uid: "502d547342e0a32b4a000817", trackers_used: {}, ip: "10.204.178.129", agent: "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.9 (KHTML, like Gecko) Chrome/20.0.1115.1 Safari/536.9", platform: "windows", version: "20.0.1115.1", referer: "string string string", _cls: "Hit", uidmd5: "c5febb8357e7a0ea3c1ef0cc135c2821", query:

{ 64: "", 1348258871: "" }

, dt: new Date(1348258870132), type: "v", gen: null, browser: "chrome" } }
Sun Sep 23 17:27:15 [repl writer worker 1] Fatal Assertion 16360
0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7fd3c8ede9ca 0x7fd3c828570d
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
/ebs/mongo/LATEST/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x69ca) [0x7fd3c8ede9ca]
/lib/libc.so.6(clone+0x6d) [0x7fd3c828570d]
Sun Sep 23 17:27:15 [repl writer worker 1]

***aborting after fassert() failure

Sun Sep 23 17:27:15 Got signal: 6 (Aborted).

Sun Sep 23 17:27:15 Backtrace:
0xade6e1 0x5582d9 0x7fd3c81d2af0 0x7fd3c81d2a75 0x7fd3c81d65c0 0x802e3e 0x64f77d 0x77d3dd 0x7c3659 0x7fd3c8ede9ca 0x7fd3c828570d
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
/lib/libc.so.6(+0x33af0) [0x7fd3c81d2af0]
/lib/libc.so.6(gsignal+0x35) [0x7fd3c81d2a75]
/lib/libc.so.6(abort+0x180) [0x7fd3c81d65c0]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
/ebs/mongo/LATEST/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
/ebs/mongo/LATEST/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x69ca) [0x7fd3c8ede9ca]
/lib/libc.so.6(clone+0x6d) [0x7fd3c828570d]



 Comments   
Comment by m [ 25/Sep/12 ]

It is not a capped collection. I tried a full resync and it also fails.

I will try the op field workaround. Thanks.

Comment by Spencer Brody (Inactive) [ 25/Sep/12 ]

Is the "ge_raw.raw_hits" collection capped?

Are these servers monitored in MMS anywhere?

If you want to skip the problematic record, you could set the "op" field of the oplog entry on the primary to "n", which will make it a no-op. I would be very cautious about doing this, however, as it can lead you to have different data on the different members of your replica set, possibly causing data loss. It would be much better to restore from a backup or perform a full resync if possible.

Comment by m [ 24/Sep/12 ]

Not that I know of or that the logs show

Comment by Eliot Horowitz (Inactive) [ 23/Sep/12 ]

Looks like you have "nojournal=true" set.
Was there a crash of either the mongod process or the machine?

Comment by m [ 23/Sep/12 ]

Can I skip defective records during replication?

Comment by m [ 23/Sep/12 ]

I have over 1TB of data in this production database. I can't blow it away and try again...

Generated at Thu Feb 08 03:13:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.