[SERVER-6963] : -286331391 (0x01EEEEEE) first element: ^A: ?type=108 Created: 07/Sep/12  Updated: 11/Jul/16  Resolved: 28/Dec/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Somsak Sriprayoonsakul Assignee: Kevin Matulef
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

A mix of CentOS 5.8, 6.2, and 6.3. All MongoDB instances were upgraded from 2.0.6 to 2.2.0 via RPM (yum).
There are 2 sharded replica sets with 3 config servers. Each replica set has 3 servers. mongos runs on each client (web server). All clients connect to localhost to use MongoDB.
(NOTE: when upgrading from 2.0.6, we did not run mongod --upgrade, since the changelog does not instruct it.)


Attachments: File faildata1.tar.gz     File faildata2.tar.gz     Text File java_exception.txt     File mongodb-dbm1.log.gz     File mongodb-dbm2.log.gz    
Issue Links:
Duplicate
duplicates SERVER-7206 potential mongod crash in profiling, 2.2 Closed
Related
Operating System: Linux
Participants:

 Description   

We encountered the following error a couple of times on the PRIMARY of each of our replica sets.

Mon Sep 3 11:02:17 [conn38] Assertion: 10334:Invalid BSONObj size: -286331391 (0x01EEEEEE) first element: ^A: ?type=108

When this happens, the SECONDARY takes over. Starting the former PRIMARY again yields the same error. We have to remove all its data files and have it re-replicate from the current PRIMARY.
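For context on what this assertion checks: every BSON document begins with its own total size, stored as a little-endian signed int32, and mongod rejects any document whose declared size is implausible. A minimal Python sketch (the helper name is ours, not MongoDB's) shows how the negative size in the log decodes; note the repeating 0xEE fill pattern in the header bytes, which looks like overwritten storage rather than a real document:

```python
import struct

def bson_declared_size(buf: bytes) -> int:
    # Per the BSON spec, the first four bytes of a document are its
    # total size, stored as a little-endian signed int32.
    return struct.unpack_from("<i", buf, 0)[0]

# A minimal valid document: 4 size bytes + terminating NUL = 5 bytes.
empty_doc = struct.pack("<i", 5) + b"\x00"

# The size reported in the log (-286331391) is exactly what you get
# when the header bytes are 01 EE EE EE instead of a sane length.
corrupt_header = b"\x01\xee\xee\xee"
```

Decoding `corrupt_header` with this helper reproduces the -286331391 value from the assertion message.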

Note that this problem never occurred before the upgrade.

Attached are log files from 2 of our PRIMARY servers.



 Comments   
Comment by Kevin Matulef [ 28/Dec/12 ]

Somsak, I'm going to close out this ticket since we believe it's been fixed in 2.2.2. If you encounter the problem again though, please let us know.

Comment by Kevin Matulef [ 05/Nov/12 ]

Hi Somsak, as Tad mentioned this seems to be caused by the same bug as SERVER-7206. We believe we've identified the problem there and it should be fixed for 2.2.2. Once we check in a patch, it'd be great if you can test out the nightly build to see if it fixes this issue. Stay tuned.

Comment by Tad Marshall [ 05/Nov/12 ]

This seems to be the same profiling crash as SERVER-7206.

Comment by Somsak Sriprayoonsakul [ 05/Nov/12 ]

The problem still persists; the system still crashes from time to time, so we decided to do a full migration (dump everything with mongoexport, rebuild everything, and re-import the .json files).

What we found is that if we turn on profiling (profile = 2), MongoDB crashes with the same exception during import, even on a single replica set without any sharding. Note that this is a freshly created replica set running the latest stable MongoDB 2.2.1 release.

Comment by Somsak Sriprayoonsakul [ 19/Sep/12 ]

It seems that this problem might be related to the query profiler.

3-4 days ago, after we found out that system.profile was corrupted, we decided to stop query logging (setProfilingLevel(0, 5000)) on all databases. After that, the rate at which the problem arises decreased greatly. We found the problem only twice during these 3 days; one occurrence was on the "local" database, the other on a rarely used database. We did the same as before and resynced the whole database to fix it.
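For reference, the workaround described above (profiler off, a high slowms) can also be made persistent in the old-style /etc/mongod.conf format visible later in this ticket. This is a sketch of that config fragment, not a fix for the underlying bug:

```ini
# /etc/mongod.conf -- disable the query profiler as a workaround;
# slowms still controls which slow operations get logged.
profile = 0
slowms = 5000
```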

Note that the problem only ever occurs on a PRIMARY. It has never occurred on a SECONDARY (not until it became PRIMARY).

Comment by Somsak Sriprayoonsakul [ 15/Sep/12 ]

We found another possibly related error in the log file:

Sat Sep 15 07:33:23 [conn6630] problem detected during query over obecaoc.system.profile :

{ $err: "BSONElement: bad type -114", code: 10320 }

There are quite a few errors like the above in our mongod.log. They didn't crash our system, though.

Comment by Somsak Sriprayoonsakul [ 13/Sep/12 ]

Could you suggest what we should do next?

We have still had no luck reproducing the problem. It occurs from time to time, but we aren't sure of the cause (this is a live system whose client is a web site).

Right now we are waiting for the next time the problem occurs; this time we will keep the corrupted data files and see what we can do with them.

Comment by Somsak Sriprayoonsakul [ 13/Sep/12 ]

Yesterday, for some reason, the bricked database appeared to be "repairable", so we decided to do a rolling repair of all mongod servers, one by one.

However, this morning one of the mongod servers started spitting out the same exception again and crashed. We noticed that there are also type=24 exceptions around the type=108 exceptions.

We are still trying to figure out how to reproduce the bug systematically. In any case, it seems --repair cannot solve our issue here.

Comment by Somsak Sriprayoonsakul [ 11/Sep/12 ]

Sorry, I forgot to mention: yes, these servers have experienced quite a few hard shutdowns (electricity problems). And we are running with journaling on.

The problem appeared again today on one of the servers. Starting the server with the --repair flag yields the same error.

Tue Sep 11 21:42:03 [initandlisten] MongoDB starting : pid=30960 port=27017 dbpath=/data/mongo/mongo2 64-bit host=my.host.name
Tue Sep 11 21:42:03 [initandlisten]
Tue Sep 11 21:42:03 [initandlisten] ** WARNING: You are running on a NUMA machine.
Tue Sep 11 21:42:03 [initandlisten] ** We suggest launching mongod like this to avoid performance problems:
Tue Sep 11 21:42:03 [initandlisten] ** numactl --interleave=all mongod [other options]
Tue Sep 11 21:42:03 [initandlisten]
Tue Sep 11 21:42:03 [initandlisten] db version v2.2.0, pdfile version 4.5
Tue Sep 11 21:42:03 [initandlisten] git version: f5e83eae9cfbec7fb7a071321928f00d1b0c5207
Tue Sep 11 21:42:03 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Tue Sep 11 21:42:03 [initandlisten] options:

{ config: "/etc/mongod.conf", dbpath: "/data/mongo/mongo2", fork: "true", logappend: "true", logpath: "/var/log/mongo/mongod.log", maxConns: 24000, profile: 2, repair: true, replSet: "rs2", slowms: 500, syncdelay: 60.0 }

Tue Sep 11 21:42:03 [initandlisten] Assertion: 10334:Invalid BSONObj size: -286331391 (0x01EEEEEE) first element: ^A: ?type=-24
0xade6e1 0x8036eb 0x80386c 0x571bc7 0xadf73d 0x7474e4 0xa61af4 0x62dc57 0x62dc98 0x73954c 0x70f40f 0xb5db70 0xb5e963 0x6f698d 0xa90c3a 0xb506b6 0x55a4cd 0x55c6c8 0x55d2dd 0x563dbe
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0x8036eb]
/usr/bin/mongod [0x80386c]
/usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x497) [0x571bc7]
/usr/bin/mongod(_ZN5mongo13unindexRecordEPNS_16NamespaceDetailsEPNS_6RecordERKNS_7DiskLocEb+0x11d) [0xadf73d]
/usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPKcPNS_6RecordERKNS_7DiskLocEbbb+0x204) [0x7474e4]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x4b4) [0xa61af4]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x27) [0x62dc57]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x38) [0x62dc98]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x6c) [0x73954c]
/usr/bin/mongod(_ZN5mongo7profileERKNS_6ClientERNS_5CurOpE+0x95f) [0x70f40f]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xec0) [0xb5db70]
/usr/bin/mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x93) [0xb5e963]
/usr/bin/mongod(_ZN5mongo14DBClientCursor4initEv+0xbd) [0x6f698d]
/usr/bin/mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xea) [0xa90c3a]
/usr/bin/mongod(_ZN5mongo14DBDirectClient5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x56) [0xb506b6]
/usr/bin/mongod(_ZN5mongo19clearTmpCollectionsEv+0x23d) [0x55a4cd]
/usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x458) [0x55c6c8]
/usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1d) [0x55d2dd]
/usr/bin/mongod [0x563dbe]
Tue Sep 11 21:31:14 [initandlisten] exception in initAndListen: 10334 Invalid BSONObj size: -286331391 (0x01EEEEEE) first element: : ?type=-24, terminating
Tue Sep 11 21:31:14 dbexit:
Tue Sep 11 21:31:14 [initandlisten] shutdown: going to close listening sockets...
Tue Sep 11 21:31:14 [initandlisten] shutdown: going to flush diaglog...
Tue Sep 11 21:31:14 [initandlisten] shutdown: going to close sockets...
Tue Sep 11 21:31:14 [initandlisten] shutdown: waiting for fs preallocator...
Tue Sep 11 21:31:14 [initandlisten] shutdown: lock for final commit...
Tue Sep 11 21:31:14 [initandlisten] shutdown: final commit...
Tue Sep 11 21:31:14 [initandlisten] shutdown: closing all files...
Tue Sep 11 21:31:14 [initandlisten] closeAllFiles() finished
Tue Sep 11 21:31:14 [initandlisten] journalCleanup...
Tue Sep 11 21:31:14 [initandlisten] removeJournalFiles
Tue Sep 11 21:31:14 [initandlisten] shutdown: removing fs lock...
Tue Sep 11 21:31:14 dbexit: really exiting now

Comment by Somsak Sriprayoonsakul [ 11/Sep/12 ]

You could be right. Maybe these are 2 separate problems.

About repairDatabase(): we did try starting mongod with the --repair option, but it still crashes. We didn't keep the error log, though (I think it crashed with the same error, so we didn't bother keeping the log). Next time we will try again and keep the log.

We are trying to reproduce this in a testbed system; that's how we found out that mongodump & mongorestore also do not work. I will post here again if we find some way to reproduce the problem.

You are right about metadata.json. If I remove the file, the mongorestore command works (sorry, we never realized that when specifying studentfamily.bson it would still look for studentfamily.metadata.json). Could you tell us what's wrong with the metadata.json file?

Comment by Kevin Matulef [ 11/Sep/12 ]

Hi Somsak,

The "invalid BSONObj size" error you experienced earlier looks a lot like data corruption. Did the old primary ever experience a hard shutdown? I assume you are running with journaling on?

Also, before you erased the old primary's data and resynced it, did you try running "repairDatabase()" or starting it up with the "--repair" option?

I've looked at the dumps you've given. The assertion seems to be coming from the metadata.json files, rather than the data itself (mongorestore works for me if I use the "studentfamily.bson" file without the "studentfamily.metadata.json" file). I expect that error is an issue with mongodump.

Best,
-Kevin

Comment by Somsak Sriprayoonsakul [ 10/Sep/12 ]

We managed to find something. If we dump the data to JSON with mongoexport, the restored collection has a different number of documents.

So we ran a map-reduce to find which group of data is missing and found a small set (66 documents) that may be the cause of this problem.

Attached are the mongodump'ed BSON files for these data. Each contains only a single document. Trying to restore them with mongorestore yields the same assertion as in the comment above.
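As an aside, a dump like this can be pre-screened without mongorestore: mongodump's .bson files are just concatenated BSON documents, so walking the int32 size headers finds the first implausible one. A hedged Python sketch (the function name and the 16 MB cutoff are our assumptions, not a MongoDB tool):

```python
import struct

def scan_bson_dump(data: bytes, max_size: int = 16 * 1024 * 1024):
    """Walk concatenated BSON documents (the layout mongodump writes)
    and yield (offset, declared_size, ok) for each document header."""
    off = 0
    while off + 4 <= len(data):
        (size,) = struct.unpack_from("<i", data, off)
        # A plausible document is at least 5 bytes (size + NUL), fits
        # in the remaining data, and stays under the size cutoff.
        ok = 5 <= size <= max_size and off + size <= len(data)
        yield off, size, ok
        if not ok:
            break  # the size field is untrustworthy; stop walking
        off += size

# Example: one valid empty document followed by garbage bytes.
good = struct.pack("<i", 5) + b"\x00"
bad = b"\x01\xee\xee\xee\x00"
report = list(scan_bson_dump(good + bad))
```

Running this over the example data flags the second "document" immediately, with the same bogus size seen in the server log.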

Comment by Somsak Sriprayoonsakul [ 09/Sep/12 ]

We found that one collection might be the cause of this problem. When we dumped the collection with mongodump and tried to restore it on another stand-alone mongod 2.2.0, mongorestore crashed immediately.

[root@sql1-dr obecaoc]# mongorestore -h localhost -d xxx -c studentfamily studentfamily.bson --objcheck --verbose
Sun Sep 9 10:35:56 creating new connection to:localhost:27017
Sun Sep 9 10:35:56 BackgroundJob starting: ConnectBG
Sun Sep 9 10:35:56 connected connection!
connected to: localhost
Sun Sep 9 10:35:56 studentfamily.bson
Sun Sep 9 10:35:56 going into namespace [xxx.studentfamily]
Sun Sep 9 10:35:56 Assertion failure b.empty() src/mongo/db/json.cpp 645
0xad2211 0x67fced 0xa7daa5 0x56aa20 0x56db4a 0xb3b078 0xb41712 0x553f92 0x34e9c1ecdd 0x553e09
mongorestore(_ZN5mongo15printStackTraceERSo+0x21) [0xad2211]
mongorestore(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0x67fced]
mongorestore(_ZN5mongo8fromjsonEPKcPi+0x585) [0xa7daa5]
mongorestore(_ZN7Restore9drillDownEN5boost11filesystem210basic_pathISsNS1_11path_traitsEEEbbb+0xff0) [0x56aa20]
mongorestore(_ZN7Restore5doRunEv+0xdda) [0x56db4a]
mongorestore(_ZN5mongo8BSONTool3runEv+0x448) [0xb3b078]
mongorestore(_ZN5mongo4Tool4mainEiPPc+0x1712) [0xb41712]
mongorestore(main+0x32) [0x553f92]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x34e9c1ecdd]
mongorestore(__gxx_personality_v0+0x421) [0x553e09]
assertion: 0 assertion src/mongo/db/json.cpp:645

Note that exporting the collection to JSON (with mongoexport) and restoring it back causes no problem.

Comment by Somsak Sriprayoonsakul [ 07/Sep/12 ]

I forgot to mention one important thing: when this problem occurs, mongod crashes and can never be started again (it yields the same error every time). That's why we need to purge everything and re-replicate the data.

Comment by Somsak Sriprayoonsakul [ 07/Sep/12 ]

This java_exception.txt occurred in our web client. It might not be related, but I'm posting it just in case.

Generated at Thu Feb 08 03:13:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.