[SERVER-2508] Database gets corrupted and reports duplicate collection names Created: 09/Feb/11  Updated: 30/Mar/12  Resolved: 20/Jun/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Reinaldo Giudici Assignee: Aaron Staple
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux cloud-mongo22 2.6.32-28-server #55-Ubuntu SMP Mon Jan 10 23:57:16 UTC 2011 x86_64 GNU/Linux
2 shards each one with 3 replica members, 3 config servers, mongos running with tomcat API servers.


Attachments: File data.tgz     File mongod.node1.log.3.gz    
Issue Links:
Depends
Operating System: ALL
Participants:

 Description   

At least three times on different 1.6 versions we have gotten into this situation. Attached are the database files with the corrupt data.

The general flow is that the LEADERBOARD_ collections get created and dropped based on time and other variables, they last for about 24h.
There is a group() that is run on these tables every 5 minutes.

For example:
query ee895ca8156f4c7fa4104b1b4c9a8c38.$cmd ntoreturn:1 command: { group: { ns: "LEADEBOARD_xxxxxxxxxxx", key:

{ value: true }

, cond: null, $reduce: "function(doc,prev)

{ prev.numEntries ++ }

", initial:

{ numEntries: 0 }

} } reslen:26486 1103ms

to generate some data on the rank.

Here is a sample of magnitude of the problem: 49111 collections with only 318 unique names:

echo "show collections" | mongo localhost:27017/ee895ca8156f4c7fa4104b1b4c9a8c38 | sort -u | wc -l
318
echo "show collections" | mongo localhost:27017/ee895ca8156f4c7fa4104b1b4c9a8c38 | wc -l
49111

show collections will show (partial output)
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l
LEADEBOARD_cityofwonder-10-2_l



 Comments   
Comment by Aaron Staple [ 20/Jun/11 ]

Hi Reinaldo - please reopen if you have further problems.

Comment by Aaron Staple [ 14/Jun/11 ]

Hi Reinaldo - could you send the log from 1.7.5?

Comment by Aaron Staple [ 22/Feb/11 ]

Hi Reinaldo - would it be possible to attach the log from 1.7.5? Thanks

Comment by Reinaldo Giudici [ 16/Feb/11 ]

Attached is the full log

Comment by Eliot Horowitz (Inactive) [ 16/Feb/11 ]

Can you send the full logs?

Comment by Reinaldo Giudici [ 16/Feb/11 ]

This did happened again once after the upgrade 1.7.5.
We were dropping some collections before as part of the normal flow of the apps. This was causing this bug to show on our production cluster.
We have changed the code to remove the data instead of dropping the collections, and we have not seen this issue happen again, we are still monitoring.
I think there is definitely some bug on dropping collections in connection to group operations
SERVER-2532 also went away when we stop dropping the collection

Comment by Eliot Horowitz (Inactive) [ 10/Feb/11 ]

There were a number of fixes it could be.
Hard to know without spending a lot of time debugging, so ideally we could try 1.7 first.

Comment by Reinaldo Giudici [ 10/Feb/11 ]

I do not have a standalone test that is able to reproduce this. it has happened on production a few times.
We will be upgrading to 1.7. in the coming days, I'll report if we do see the issue again.
Is there any particular jira entry that would be related to this fix ?

Comment by Eliot Horowitz (Inactive) [ 10/Feb/11 ]

I think this is already fixed in 1.7.x
Can you try 1.7.5 in a staging environment?

Comment by Reinaldo Giudici [ 09/Feb/11 ]

And digging even further on the logs:
Seems like a "duplicate" call to drop a collection might be the trigger for this condition.

Wed Feb 9 01:00:24 [conn9812] CMD: drop ee895ca8156f4c7fa4104b1b4c9a8c38.LEADEBOARD_cityofwonder-22-6-103_l
Wed Feb 9 01:00:24 [conn9850] CMD: drop ee895ca8156f4c7fa4104b1b4c9a8c38.LEADEBOARD_cityofwonder-22-7_l
Wed Feb 9 01:00:24 [conn9816] CMD: drop ee895ca8156f4c7fa4104b1b4c9a8c38.LEADEBOARD_cityofwonder-22-7-2_l
Wed Feb 9 01:00:24 [conn9850] CMD: drop ee895ca8156f4c7fa4104b1b4c9a8c38.LEADEBOARD_cityofwonder-22-7-301_l
Wed Feb 9 01:00:24 [conn9858] CMD: drop ee895ca8156f4c7fa4104b1b4c9a8c38.LEADEBOARD_cityofwonder-22-7-301_l
Wed Feb 9 01:00:24 [conn9858] getFile(): n=-2
Wed Feb 9 01:00:24 [conn9858] Assertion: 10295:getFile(): bad file number value (corrupt db?): run repair
0x540c7e 0x605b8f 0x6ec8de 0x72c9f3 0x72d0d9 0x79c8bf 0x797596 0x798538 0x5fb7e5 0x60029f 0x7074ba 0x70aaf6 0x82691b 0x83a4b0 0x7fca805379ca 0x7fca7fae670d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x1de) [0x540c7e]
/usr/bin/mongod(_ZN5mongo8Database7getFileEiib+0x8f) [0x605b8f]
/usr/bin/mongod(_ZN5mongo11DataFileMgr9getExtentERKNS_7DiskLocE+0x5e) [0x6ec8de]
/usr/bin/mongod(_ZN5mongo6dropNSERKSs+0xe33) [0x72c9f3]
/usr/bin/mongod(_ZN5mongo14dropCollectionERKSsRSsRNS_14BSONObjBuilderE+0x239) [0x72d0d9]
/usr/bin/mongod(_ZN5mongo7CmdDrop3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0x11f) [0x79c8bf]
/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xa16) [0x797596]
/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x798) [0x798538]
/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x35) [0x5fb7e5]
/usr/bin/mongod(ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1+0x1bbf) [0x60029f]
/usr/bin/mongod() [0x7074ba]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x14d6) [0x70aaf6]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x30b) [0x82691b]
/usr/bin/mongod(thread_proxy+0x80) [0x83a4b0]
/lib/libpthread.so.0(+0x69ca) [0x7fca805379ca]
/lib/libc.so.6(clone+0x6d) [0x7fca7fae670d]
Wed Feb 9 01:00:24 [conn9826] CMD: drop ee895ca8156f4c7fa4104b1b4c9a8c38.LEADEBOARD_cityofwonder-22-8_l
Wed Feb 9 01:00:24 [conn9826] ee895ca8156f4c7fa4104b1b4c9a8c38 Assertion failure a.ext()->xprev.isNull() db/pdfile.cpp 720
0x534a81 0x54163f 0x72c9c7 0x773f6b 0x7961d5 0x72d182 0x79c8bf 0x797596 0x798538 0x5fb7e5 0x60029f 0x7074ba 0x70aaf6 0x82691b 0x83a4b0 0x7fca805379ca 0x7fca7fae670d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb1) [0x534a81]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x10f) [0x54163f]
/usr/bin/mongod(_ZN5mongo6dropNSERKSs+0xe07) [0x72c9c7]

Comment by Reinaldo Giudici [ 09/Feb/11 ]

Here are the errors on the logs when this was happening.

Wed Feb 9 11:25:10 [conn10156] getFile(): n=-2
Wed Feb 9 11:25:10 [conn10156] Assertion: 10295:getFile(): bad file number value (corrupt db?): run repair
0x540c7e 0x605b8f 0x727d14 0x728202 0x72f52f 0x621e1e 0x622b6c 0x705ba4 0x70adf2 0x70b494 0x5588a2 0x7a1a95 0x797596 0x798538 0x5fb7e5 0x60029f 0x7074ba 0x70aaf6 0x82691b 0x83a4b0
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x1de) [0x540c7e]
/usr/bin/mongod(_ZN5mongo8Database7getFileEiib+0x8f) [0x605b8f]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17allocFromFreeListEPKcib+0x334) [0x727d14]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibRKNS_11BSONElementEb+0x192) [0x728202]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEb+0x5f) [0x72f52f]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x354e) [0x621e1e]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x11c) [0x622b6c]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4d4) [0x705ba4]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x17d2) [0x70adf2]
/usr/bin/mongod(_ZN5mongo14DBDirectClient3sayERNS_7MessageE+0x64) [0x70b494]
/usr/bin/mongod(_ZN5mongo12DBClientBase6updateERKSsNS_5QueryENS_7BSONObjEbb+0x2a2) [0x5588a2]
/usr/bin/mongod(_ZN5mongo16CmdFindAndModify3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0x625) [0x7a1a95]
/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xa16) [0x797596]
/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x798) [0x798538]
/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x35) [0x5fb7e5]
/usr/bin/mongod(ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1+0x1bbf) [0x60029f]
/usr/bin/mongod() [0x7074ba]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x14d6) [0x70aaf6]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x30b) [0x82691b]
/usr/bin/mongod(thread_proxy+0x80) [0x83a4b0]
Wed Feb 9 11:25:10 [conn10216] getFile(): n=-2
Wed Feb 9 11:25:10 [conn10216] Assertion: 10295:getFile(): bad file number value (corrupt db?): run repair
0x540c7e 0x605b8f 0x727d14 0x728202 0x72f52f 0x621e1e 0x622b6c 0x705ba4 0x70adf2 0x70b494 0x5588a2 0x7a1a95 0x797596 0x798538 0x5fb7e5 0x60029f 0x7074ba 0x70aaf6 0x82691b 0x83a4b0
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x1de) [0x540c7e]
/usr/bin/mongod(_ZN5mongo8Database7getFileEiib+0x8f) [0x605b8f]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17allocFromFreeListEPKcib+0x334) [0x727d14]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibRKNS_11BSONElementEb+0x192) [0x728202]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEb+0x5f) [0x72f52f]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x354e) [0x621e1e]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x11c) [0x622b6c]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4d4) [0x705ba4]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x17d2) [0x70adf2]
/usr/bin/mongod(_ZN5mongo14DBDirectClient3sayERNS_7MessageE+0x64) [0x70b494]
/usr/bin/mongod(_ZN5mongo12DBClientBase6updateERKSsNS_5QueryENS_7BSONObjEbb+0x2a2) [0x5588a2]
/usr/bin/mongod(_ZN5mongo16CmdFindAndModify3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0x625) [0x7a1a95]
/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xa16) [0x797596]
/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x798) [0x798538]
/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x35) [0x5fb7e5]
/usr/bin/mongod(ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1+0x1bbf) [0x60029f]
/usr/bin/mongod() [0x7074ba]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x14d6) [0x70aaf6]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x30b) [0x82691b]
/usr/bin/mongod(thread_proxy+0x80) [0x83a4b0]

Generated at Thu Feb 08 03:00:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.