[SERVER-7078] Data corruption :( Created: 20/Sep/12  Updated: 19/Nov/12  Resolved: 19/Nov/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Hannes Magnusson Assignee: Tad Marshall
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-7206 potential mongod crash in profiling, 2.2 Closed
Operating System: ALL
Participants:

 Description   

After running db.setProfilingLevel(2) and issuing many queries against the system.profile collection, I started getting these errors consistently:

Thu Sep 20 04:50:43 [conn10] Assertion: 10320:BSONElement: bad type 116
0xade6e1 0x8036eb 0x5706bb 0x925fac 0x927a2a 0xa0e287 0xa0ed48 0x5e6081 0x6ac10e 0x6ac1c4 0x6b06e6 0x6b3d38 0xb5ba7d 0xb5d052 0x56fa52 0x5dbf11 0x7f546ea7f9ca 0x7f546de26cdd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0x8036eb]
 /usr/bin/mongod(_ZNK5mongo11BSONElement4sizeEv+0x1cb) [0x5706bb]
 /usr/bin/mongod(_ZNK5mongo7Matcher13matchesDottedEPKcRKNS_11BSONElementERKNS_7BSONObjEiRKNS_14ElementMatcherEbPNS_12MatchDetailsE+0x153c) [0x925fac]
 /usr/bin/mongod(_ZNK5mongo7Matcher7matchesERKNS_7BSONObjEPNS_12MatchDetailsE+0xfa) [0x927a2a]
 /usr/bin/mongod(_ZNK5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0xc7) [0xa0e287]
 /usr/bin/mongod(_ZNK5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0xa0ed48]
 /usr/bin/mongod(_ZN5mongo6Cursor14currentMatchesEPNS_12MatchDetailsE+0x41) [0x5e6081]
 /usr/bin/mongod(_ZN5mongo20QueryResponseBuilder14currentMatchesERNS_12MatchDetailsE+0x1e) [0x6ac10e]
 /usr/bin/mongod(_ZN5mongo20QueryResponseBuilder8addMatchEv+0x44) [0x6ac1c4]
 /usr/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_17ShardChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0x376) [0x6b06e6]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x16b8) [0x6b3d38]
 /usr/bin/mongod() [0xb5ba7d]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x3a2) [0xb5d052]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x82) [0x56fa52]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x5dbf11]
 /lib/libpthread.so.0(+0x69ca) [0x7f546ea7f9ca]
 /lib/libc.so.6(clone+0x6d) [0x7f546de26cdd]
Thu Sep 20 04:50:43 [conn10] assertion 10320 BSONElement: bad type 116 ns:test2.system.profile query:{ ns: "test2.dbref2" }
Thu Sep 20 04:50:43 [conn10]  ntoskip:0 ntoreturn:10
Thu Sep 20 04:50:43 [conn10] problem detected during query over test2.system.profile : { $err: "BSONElement: bad type 116", code: 10320 }
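
A minimal mongo shell sketch of the kind of sequence involved, based on the log above. Only the names test2, test2.dbref2, and the profile query appear in the log; the workload that populated system.profile is assumed:

    use test2
    db.setProfilingLevel(2)                                    // profile all operations
    db.dbref2.find().toArray()                                 // assumed: any queries, so they get recorded in system.profile
    db.system.profile.find({ ns: "test2.dbref2" }).limit(10)   // the query over the profile collection that hit the assertion above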



 Comments   
Comment by Daniel Pasette (Inactive) [ 19/Nov/12 ]

Closing as a duplicate of SERVER-7206. This will be fixed in v2.2.2.

Comment by Peter Bakkum [ 15/Nov/12 ]

I've seen this problem as well on a production cluster, but I haven't yet been able to reproduce it in isolation. It caused significant problems for us, especially because it wasn't initially clear that profiling was to blame.

We have a nine-machine MongoDB 2.2.0 cluster with 3 config servers and 6 data nodes split into 3 shards, plus an arbiter on each of the config nodes. I ran a script to set the profiling level to 2 for a particular database on each of the 6 data nodes (a sketch follows below). After running with this setting for some time, and ONLY when the cluster was under heavy load, these machines would enter an unusable state that appeared to be caused by corruption in one of the collections within this database, possibly the system.profile collection. In this state the mongod process stayed up, but one or more collections in the database became partially or completely unqueryable: sometimes an indexed find would work but a collection scan would not, and sometimes every query failed. The logs during these periods either looked like the one above or contained errors like "invalid BSONObject size".
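
A sketch of the kind of script I mean, written for the mongo shell; the host names, port, and database name here are placeholders, not our actual topology:

    // enable level-2 profiling for one database on every data node
    var dataNodes = ["data1:27017", "data2:27017", "data3:27017",
                     "data4:27017", "data5:27017", "data6:27017"];   // placeholder hosts
    dataNodes.forEach(function (host) {
        var conn = new Mongo(host);                  // connect directly to each mongod
        conn.getDB("mydb").setProfilingLevel(2);     // "mydb" stands in for the real database
    });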

To get out of this state, we could sometimes recover by restarting the daemon, sometimes by repairing the database, and sometimes not at all. Every machine with profiling enabled would eventually enter this state once the cluster came under heavy load, and the problem recurred even after wiping the entire state of the cluster and starting from scratch. It took a while to determine exactly what was causing the corruption, but we have not seen the issue with profiling off, and if I turn it on for a single data node, corruption inevitably appears there within an hour.
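
For reference, the repair attempt was just the standard shell helper; the database name is again a placeholder:

    // attempt to rebuild the corrupted database (sometimes cleared the problem, sometimes not)
    db.getSiblingDB("mydb").repairDatabase()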

The workaround for us is to never turn on profiling. This is annoying because we would like to use it and because we never had any profiling issues on 2.0.4. Please let me know if there is any more information I can provide.
