[SERVER-6963] : -286331391 (0x01EEEEEE) first element: ^A: ?type=108 Created: 07/Sep/12 Updated: 11/Jul/16 Resolved: 28/Dec/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Somsak Sriprayoonsakul | Assignee: | Kevin Matulef |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Mix of CentOS 5.8, 6.2, and 6.3. All MongoDB instances upgraded from 2.0.6 to 2.2.0 via RPM (YUM). |
| Operating System: | Linux |
| Description |
|
We have encountered the following error a couple of times on the PRIMARY of each of our replica sets:
Mon Sep 3 11:02:17 [conn38] Assertion: 10334:Invalid BSONObj size: -286331391 (0x01EEEEEE) first element: ^A: ?type=108
When this happens, the SECONDARY takes over. Starting the former PRIMARY again yields the same error, so we have to remove all data files and have the former PRIMARY resync from the current PRIMARY. Note that this problem never occurred before. Attached are log files from 2 of our PRIMARY servers. |
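For reference, the two numbers in the assertion appear to be the same four bytes read two ways. The following is a minimal Python sketch, assuming the hex string in parentheses lists the size field's bytes in on-disk order and that the length prefix is a little-endian signed 32-bit integer, as the BSON specification describes. The helper name `decode_bson_size` is hypothetical.

```python
import struct


def decode_bson_size(hex_bytes: str) -> int:
    """Interpret four hex-encoded bytes as a little-endian signed int32."""
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    (size,) = struct.unpack("<i", raw)
    return size


# Bytes 01 EE EE EE, as printed in the assertion message.
print(decode_bson_size("01EEEEEE"))  # -286331391
```

A negative length can never be a valid document size, which is consistent with the length prefix having been overwritten rather than with a merely oversized document.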
| Comments |
| Comment by Kevin Matulef [ 28/Dec/12 ] |
|
Somsak, I'm going to close out this ticket since we believe it's been fixed in 2.2.2. If you encounter the problem again though, please let us know. |
| Comment by Kevin Matulef [ 05/Nov/12 ] |
|
Hi Somsak, as Tad mentioned this seems to be caused by the same bug as |
| Comment by Tad Marshall [ 05/Nov/12 ] |
|
This seems to be the same profiling crash as |
| Comment by Somsak Sriprayoonsakul [ 05/Nov/12 ] |
|
The problem still persists: the system still crashes from time to time, so we decided to do a full migration (dump everything with mongoexport, rebuild everything, and re-import the .json files). What we found is that if we turn on profiling (profile = 2), even on a single replica set without any sharding, MongoDB crashes with the same exception during import. Note that this is a freshly created replica set running the latest stable MongoDB release, 2.2.1. |
| Comment by Somsak Sriprayoonsakul [ 19/Sep/12 ] |
|
It seems that this problem might be related to the query profiler. Three or four days ago, after we found out that system.profile was corrupted, we decided to stop query profiling (setProfilingLevel(0, 5000)) on all databases. After that, the rate at which the problem arises decreased greatly. We have seen the problem only twice during these 3 days: once on the "local" database and once on a rarely used database. We did the same as before and resynced the whole database to fix it. Note that the problem only occurs on the PRIMARY. It never occurs on a SECONDARY (not until it becomes PRIMARY). |
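The profiler workaround can also be scripted from a driver. This is a minimal Python sketch, assuming `client` behaves like a pymongo `MongoClient` (database lookup by name, plus `Database.command`) and that sending the `profile` command with level 0 and `slowms` matches what the shell helper `setProfilingLevel(0, 5000)` does. The function name and the database list are hypothetical.

```python
def disable_profiling(client, db_names, slow_ms=5000):
    """Send {profile: 0, slowms: slow_ms} to each named database.

    `client` is assumed to be a pymongo MongoClient or a compatible object.
    Returns a dict mapping database name to the server's reply.
    """
    results = {}
    for name in db_names:
        # Level 0 turns the profiler off; slowms is the slow-operation threshold.
        results[name] = client[name].command("profile", 0, slowms=slow_ms)
    return results


# Usage (requires pymongo and a reachable mongod; host and names are placeholders):
#   from pymongo import MongoClient
#   client = MongoClient("localhost", 27017)
#   disable_profiling(client, ["obecaoc"])
```

Profiling is configured per database and per mongod instance, so a helper like this would have to be run against each member that can become PRIMARY.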
| Comment by Somsak Sriprayoonsakul [ 15/Sep/12 ] |
|
We found another possibly related error in the log file:
Sat Sep 15 07:33:23 [conn6630] problem detected during query over obecaoc.system.profile : { $err: "BSONElement: bad type -114", code: 10320 }
There are quite a few errors like the one above in our mongod.log. They did not crash our system, though. |
| Comment by Somsak Sriprayoonsakul [ 13/Sep/12 ] |
|
Could you suggest what we should do next? We still have had no luck reproducing the problem. It occurs from time to time, but we are not sure what the cause is (this is a live system whose client is a website). Right now we are waiting for the next occurrence; this time we will keep the corrupted data files and see what we can do with them. |
| Comment by Somsak Sriprayoonsakul [ 13/Sep/12 ] |
|
Yesterday, for some reason, the bricked database appeared to be "repairable", so we decided to do a rolling repair of all mongod servers, one by one. However, this morning one of the mongod instances started spitting out the same exception again and crashed. We noticed that there are also type=24 exceptions around the type=108 exceptions. We are still trying to figure out how to reproduce the bug systematically. In any case, it seems --repair cannot solve our issue here. |
| Comment by Somsak Sriprayoonsakul [ 11/Sep/12 ] |
|
Sorry, I forgot to mention: yes, these servers have experienced quite a few hard shutdowns (electricity problems), and we are running with journaling on. The problem appeared just today on one of the servers. Starting the server with the --repair flag yields the same error:
Tue Sep 11 21:42:03 [initandlisten] MongoDB starting : pid=30960 port=27017 dbpath=/data/mongo/mongo2 64-bit host=my.host.name
Tue Sep 11 21:42:03 [initandlisten] Assertion: 10334:Invalid BSONObj size: -286331391 (0x01EEEEEE) first element: ^A: ?type=-24 |
| Comment by Somsak Sriprayoonsakul [ 11/Sep/12 ] |
|
You could be right. Maybe these are 2 separate problems. About repairDatabase(): we did try starting mongod with the --repair option, but it still crashed. We didn't keep the error log, though (I think it crashed with the same error, so we didn't bother keeping the log). Next time we will try again and keep the log. We are trying to reproduce this on a testbed system, which is how we found out that mongodump and mongorestore also do not work. I will post here again if we find some way to reproduce the problem. You are right about metadata.json. If I remove the file, the mongorestore command works (sorry, we never thought that specifying studentfamily.bson would still make it look for studentfamily.metadata.json). Could you tell us what's wrong with the metadata.json file? |
| Comment by Kevin Matulef [ 11/Sep/12 ] |
|
Hi Somsak,
The "invalid BSONObj size" error you experienced earlier looks a lot like data corruption. Did the old primary ever experience a hard shutdown? I assume you are running with journaling on? Also, before you erased the old primary's data and resynced it, did you try running "repairDatabase()" or starting it up with the "--repair" option?
I've looked at the dumps you've given. The assertion seems to be coming from the metadata.json files, rather than the data itself (mongorestore works for me if I use the "studentfamily.bson" file without the "studentfamily.metadata.json" file). I expect that error is an issue with mongodump.
Best, |
| Comment by Somsak Sriprayoonsakul [ 10/Sep/12 ] |
|
We managed to find something. If we dump the data to JSON with mongoexport, the restored collection has a different number of documents. So we ran a map-reduce to find which group of data was missing and found a small set (66 documents) that may be the cause of this problem. Attached are the mongodump'ed BSON files of these data. Each contains only a single document. When we tried to restore them with mongorestore, it yielded the same assertion as in the comment above. |
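One way to check whether a dumped .bson file is itself well-framed, independently of mongorestore, is to walk its length prefixes. This is a minimal Python sketch, assuming the file is a plain concatenation of BSON documents and taking 16 MB as an assumed upper bound on document size. The function name and limits are illustrative and are not MongoDB's own check.

```python
import struct

MAX_BSON_SIZE = 16 * 1024 * 1024  # assumed upper bound on one document
MIN_BSON_SIZE = 5  # 4-byte length prefix + terminating NUL byte


def scan_bson_stream(data: bytes):
    """Walk concatenated BSON documents by their length prefixes.

    Returns (document_count, error_offset). error_offset is None when
    every document is well-framed, else the offset of the first bad one.
    """
    offset = 0
    count = 0
    while offset < len(data):
        if len(data) - offset < 4:
            return count, offset  # truncated length prefix
        (size,) = struct.unpack_from("<i", data, offset)
        if size < MIN_BSON_SIZE or size > MAX_BSON_SIZE:
            return count, offset  # implausible size, e.g. negative
        end = offset + size
        if end > len(data) or data[end - 1] != 0:
            return count, offset  # runs past the end or lacks the NUL terminator
        offset = end
        count += 1
    return count, None


empty_doc = b"\x05\x00\x00\x00\x00"  # smallest valid BSON document
print(scan_bson_stream(empty_doc * 2))  # (2, None)
print(scan_bson_stream(empty_doc + bytes.fromhex("01EEEEEE") + b"\x00"))  # (1, 5)
```

This only validates framing, not the element types inside each document, so a file that passes can still contain a bad element.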
| Comment by Somsak Sriprayoonsakul [ 09/Sep/12 ] |
|
We found that one collection might be the cause of this problem. When we dumped the collection with mongodump and tried to restore it on another stand-alone mongod 2.2.0, mongorestore crashed immediately.
[root@sql1-dr obecaoc]# mongorestore -h localhost -d xxx -c studentfamily studentfamily.bson --objcheck --verbose
Note that exporting the collection to JSON (with mongoexport) and restoring it caused no problem. |
| Comment by Somsak Sriprayoonsakul [ 07/Sep/12 ] |
|
I forgot to mention one important thing. When this problem occurs, mongod crashes and can never be started again (it yields the same error every time it starts). That's why we need to purge everything and re-replicate the data. |
| Comment by Somsak Sriprayoonsakul [ 07/Sep/12 ] |
|
This java_exception.txt came from our web client. It might not be related, but I am posting it just in case. |