Core Server / SERVER-9995

corruption on primaries after upgrade to 2.4.4

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.4.4
    • Component/s: None
    • Labels: None
    • Operating System: ALL

      I'm currently running a handful of 2.2.4 clusters. On June 11th, I upgraded one primary and one secondary on each of two clusters to 2.4.4.

      A couple of days later, I started getting reports of data corruption. Looking through my logs, I saw a ton of these on one of the 2.4.4 primaries:

      Wed Jun 12 16:18:51.804 [conn1265710] getFile(): n=-2
      Wed Jun 12 16:18:51.804 [conn1265710] Assertion: 10295:getFile(): bad file number value (corrupt db?): run repair
      0xdd2331 0xd93c6b 0x8cdf45 0xb841e2 0x80996e 0xb3eca1 0xb49632 0xb4e6d1 0xb4ee8a 0xb51604 0xb51e56 0xb580c9 0xb5cd48 0xb5cf4e 0xa7f6da 0xa827d8 0x9f6059 0x9f7572 0x6e7978 0xdbea9e
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
       /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd93c6b]
       /usr/bin/mongod(_ZN5mongo8Database7getFileEiib+0x395) [0x8cdf45]
       /usr/bin/mongod(_ZNK5mongo7DiskLoc3recEv+0x42) [0xb841e2]
       /usr/bin/mongod(_ZNK5mongo12IndexDetails10keyPatternEv+0x1e) [0x80996e]
       /usr/bin/mongod(_ZN5mongo16QueryUtilIndexed11indexUsefulERKNS_17FieldRangeSetPairEPNS_16NamespaceDetailsEiRKNS_7BSONObjE+0x41) [0xb3eca1]
       /usr/bin/mongod(_ZN5mongo18QueryPlanGenerator16addFallbackPlansEv+0x152) [0xb49632]
       /usr/bin/mongod(_ZN5mongo18QueryPlanGenerator15addInitialPlansEv+0x81) [0xb4e6d1]
       /usr/bin/mongod(_ZN5mongo12QueryPlanSet4makeEPKcSt8auto_ptrINS_17FieldRangeSetPairEES5_RKNS_7BSONObjES8_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES8_NS_18QueryPlanGenerator18RecordedPlanPolicyES8_S8_b+0x10a) [0xb4ee8a]
       /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4initERKNS_7BSONObjES3_S3_+0xf4) [0xb51604]
       /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4makeERKNS_10StringDataERKNS_7BSONObjES6_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES6_NS_18QueryPlanGenerator18RecordedPlanPolicyES6_S6_+0x76) [0xb51e56]
       /usr/bin/mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0xe9) [0xb580c9]
       /usr/bin/mongod(_ZN5mongo15CursorGenerator8generateEv+0x98) [0xb5cd48]
       /usr/bin/mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorERKNS_10StringDataERKNS_7BSONObjES6_RKNS_24QueryPlanSelectionPolicyERKN5boost10shared_ptrIKNS_11ParsedQueryEEEbPNS_16QueryPlanSummaryE+0x3e) [0xb5cf4e]
       /usr/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_12ChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0x12a) [0xa7f6da]
       /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1ac8) [0xa827d8]
       /usr/bin/mongod() [0x9f6059]
       /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x382) [0x9f7572]
       /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
       /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
      Wed Jun 12 16:18:51.845 [conn1265710] assertion 10295 getFile(): bad file number value (corrupt db?): run repair ns:appdata84.app_35e3c2d0-297e-4725-8083-4af9a534c2a6:JobEmployee query:{ $query: {}, $orderby: { _created_at: -1 }, $maxScan: 9000000 }
      

      I failed over to another primary (also 2.4.4), and within 4 days it started generating the same assertions.

      Fri Jun 21 08:37:03.362 [conn774102]  appdata92 warning assertion failure n == 1 src/mongo/db/index.cpp 221
      0xdd2331 0xd9217a 0x9ced75 0x9d71df 0x8e9a18 0x8d44fa 0x8d71e3 0x8d7ef2 0xa7d1c0 0xa81a8c 0x9f6059 0x9f7572 0x6e7978 0xdbea9e 0x7fd43601be9a 0x7fd43532ecbd
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
       /usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0x11a) [0xd9217a]
       /usr/bin/mongod(_ZN5mongo12IndexDetails8kill_idxEv+0xe75) [0x9ced75]
       /usr/bin/mongod(_ZN5mongo11dropIndexesEPNS_16NamespaceDetailsEPKcS3_RSsRNS_14BSONObjBuilderEb+0x74f) [0x9d71df]
       /usr/bin/mongod(_ZN5mongo14CmdDropIndexes3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x7c8) [0x8e9a18]
       /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0x8d44fa]
       /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x1023) [0x8d71e3]
       /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x5f2) [0x8d7ef2]
       /usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x40) [0xa7d1c0]
       /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xd7c) [0xa81a8c]
       /usr/bin/mongod() [0x9f6059]
       /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x382) [0x9f7572]
       /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
       /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
       /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fd43601be9a]
       /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fd43532ecbd]
      Fri Jun 21 08:37:03.461 [conn774102] Assertion: 10334:BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO
      0xdd2331 0xd93c6b 0xd941ac 0x6ebf2f 0x8099aa 0x9d7098 0x8e9a18 0x8d44fa 0x8d71e3 0x8d7ef2 0xa7d1c0 0xa81a8c 0x9f6059 0x9f7572 0x6e7978 0xdbea9e 0x7fd43601be9a 0x7fd43532ecbd
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
       /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd93c6b]
       /usr/bin/mongod() [0xd941ac]
       /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ebf2f]
       /usr/bin/mongod(_ZNK5mongo12IndexDetails10keyPatternEv+0x5a) [0x8099aa]
       /usr/bin/mongod(_ZN5mongo11dropIndexesEPNS_16NamespaceDetailsEPKcS3_RSsRNS_14BSONObjBuilderEb+0x608) [0x9d7098]
       /usr/bin/mongod(_ZN5mongo14CmdDropIndexes3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x7c8) [0x8e9a18]
       /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0x8d44fa]
       /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x1023) [0x8d71e3]
       /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x5f2) [0x8d7ef2]
       /usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x40) [0xa7d1c0]
       /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xd7c) [0xa81a8c]
       /usr/bin/mongod() [0x9f6059]
       /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x382) [0x9f7572]
       /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
       /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
       /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fd43601be9a]
       /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fd43532ecbd]
      
      
      Fri Jun 21 08:52:53.727 [conn773922] getFile(): n=-2
      Fri Jun 21 08:52:53.727 [conn773922] Assertion: 10295:getFile(): bad file number value (corrupt db?): run repair
      0xdd2331 0xd93c6b 0x8cdf45 0xb841e2 0x80996e 0xb3eca1 0xb49632 0xb4e6d1 0xb4ee8a 0xb51604 0xb51e56 0xb580c9 0xb5cd48 0xb5cf4e 0xa7f6da 0xa827d8 0x9f6059 0x9f7572 0x6e7978 0xdbea9e
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
       /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd93c6b]
       /usr/bin/mongod(_ZN5mongo8Database7getFileEiib+0x395) [0x8cdf45]
       /usr/bin/mongod(_ZNK5mongo7DiskLoc3recEv+0x42) [0xb841e2]
       /usr/bin/mongod(_ZNK5mongo12IndexDetails10keyPatternEv+0x1e) [0x80996e]
       /usr/bin/mongod(_ZN5mongo16QueryUtilIndexed11indexUsefulERKNS_17FieldRangeSetPairEPNS_16NamespaceDetailsEiRKNS_7BSONObjE+0x41) [0xb3eca1]
       /usr/bin/mongod(_ZN5mongo18QueryPlanGenerator16addFallbackPlansEv+0x152) [0xb49632]
       /usr/bin/mongod(_ZN5mongo18QueryPlanGenerator15addInitialPlansEv+0x81) [0xb4e6d1]
       /usr/bin/mongod(_ZN5mongo12QueryPlanSet4makeEPKcSt8auto_ptrINS_17FieldRangeSetPairEES5_RKNS_7BSONObjES8_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES8_NS_18QueryPlanGenerator18RecordedPlanPolicyES8_S8_b+0x10a) [0xb4ee8a]
       /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4initERKNS_7BSONObjES3_S3_+0xf4) [0xb51604]
       /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4makeERKNS_10StringDataERKNS_7BSONObjES6_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES6_NS_18QueryPlanGenerator18RecordedPlanPolicyES6_S6_+0x76) [0xb51e56]
       /usr/bin/mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0xe9) [0xb580c9]
       /usr/bin/mongod(_ZN5mongo15CursorGenerator8generateEv+0x98) [0xb5cd48]
       /usr/bin/mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorERKNS_10StringDataERKNS_7BSONObjES6_RKNS_24QueryPlanSelectionPolicyERKN5boost10shared_ptrIKNS_11ParsedQueryEEEbPNS_16QueryPlanSummaryE+0x3e) [0xb5cf4e]
       /usr/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_12ChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0x12a) [0xa7f6da]
       /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1ac8) [0xa827d8]
       /usr/bin/mongod() [0x9f6059]
       /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x382) [0x9f7572]
       /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
       /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
      Fri Jun 21 08:52:53.734 [conn773922] assertion 10295 getFile(): bad file number value (corrupt db?): run repair ns:appdata92.app_df7688a2-419b-470e-8adb-f1071e960753:Answer query:{ $query: {}, $orderby: { _created_at: -1 }, $maxScan: 9000000 }
      
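For context on the `BSONObj size: 0 (0x00000000) is invalid` assertion in the Jun 21 `dropIndexes` trace: per the BSON specification, every document starts with a little-endian int32 giving its total length (the 4-byte prefix plus elements plus a trailing 0x00), so the smallest legal document, an empty `{}`, is 5 bytes. A length of 0 suggests that the index-spec record `IndexDetails::keyPattern` tried to read was zeroed or was not a document at all. The sketch below is an illustrative Python check of that header under those assumptions, not mongod's implementation; the helper name `check_bson_length` is hypothetical, and the upper bound 16793600 is taken from the log message itself.

```python
import struct

# Upper bound quoted in the log message: 16 MB user limit + 16 KB headroom.
MAX_INTERNAL_BSON_SIZE = 16 * 1024 * 1024 + 16 * 1024  # 16793600

# Per the BSON spec, the empty document is a 4-byte length (5) plus 0x00.
MIN_BSON_SIZE = 5


def check_bson_length(buf: bytes) -> int:
    """Hypothetical helper: validate a BSON length prefix and return the size.

    Raises ValueError for the kind of header problem that mongod's
    'BSONObj size: ... is invalid' assertion reports.
    """
    if len(buf) < 4:
        raise ValueError("buffer too short to hold a BSON length prefix")
    # BSON stores the total document size as a little-endian signed int32.
    (size,) = struct.unpack_from("<i", buf, 0)
    if size < MIN_BSON_SIZE or size > MAX_INTERNAL_BSON_SIZE:
        raise ValueError(
            f"BSONObj size: {size} (0x{size & 0xFFFFFFFF:08X}) is invalid"
        )
    if len(buf) < size:
        raise ValueError("length prefix exceeds the available bytes")
    # A well-formed document always ends with a 0x00 terminator.
    if buf[size - 1] != 0:
        raise ValueError("document is not terminated by 0x00")
    return size
```

For example, `check_bson_length(b"\x05\x00\x00\x00\x00")` accepts the empty document and returns 5, while a run of zero bytes, such as a record pointer that lands in a zeroed region, fails with a size of 0, matching the shape of the logged error.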

      I just checked on my other 2.4.4 primary on a totally different replica set, and sure enough, it has a shitload of "corrupt db" errors too. Thousands.

      asya suggested this may be due to index corruption, not data corruption, so I'm going to try rebuilding the indexes on these nodes once I can take them offline.
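Before rebuilding, it may help to know which collections are affected. In the 2.4.4 logs above, each failed query is followed by a one-line summary of the form `assertion 10295 ... ns:<namespace> query:...`, so counting those lines by namespace gives a rough per-collection list. A minimal Python sketch, assuming that log format; the regex and `tally_assertions` are hypothetical helpers, not part of any MongoDB tool:

```python
import re
from collections import Counter
from typing import Iterable

# Matches the per-query summary line, e.g.
#   "... assertion 10295 getFile(): bad file number value ... ns:<db>.<coll> query:{...}"
# Assumption: the namespace contains no whitespace and is followed by " query:".
ASSERTION_RE = re.compile(r"assertion (\d+) .*? ns:(\S+) query:")


def tally_assertions(lines: Iterable[str], code: str = "10295") -> Counter:
    """Count summary lines for the given assertion code, keyed by namespace."""
    counts: Counter = Counter()
    for line in lines:
        m = ASSERTION_RE.search(line)
        if m and m.group(1) == code:
            counts[m.group(2)] += 1
    return counts
```

For example, `tally_assertions(open(path)).most_common(20)`, where `path` is whatever log file the node writes to, lists the 20 namespaces with the most failures. Matching only the lowercase summary line keeps each failure from being counted twice, because the `Assertion: 10295:` line logged just before it carries no namespace.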

            Assignee: Unassigned
            Reporter: charity majors (charity@parse.com)
            Votes: 1
            Watchers: 10
