[SERVER-11219] mongod crash during index build leaves orphaned index Created: 16/Oct/13  Updated: 10/Dec/14  Resolved: 28/Oct/13

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 2.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Arkadiy Kukarkin Assignee: J Rassi
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-2771 Background index builds on replica se... Closed
is related to SERVER-11391 repairDatabase should be disallowed o... Closed
Operating System: ALL
Steps To Reproduce:

Index was built in the background and presumably worked before?

 Description   

[ed: original title was "Index shown in getIndexes() does not actually exist"]

I've been trying to track down a mysterious performance issue on one of our shards for a little while, and it turns out the culprit is an index that doesn't actually exist:

events2:PRIMARY> db.listen.getIndexes();
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "site_events.listen",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "uid" : 1
                },
                "ns" : "site_events.listen",
                "name" : "uid_1"
        },
        {
                "v" : 1,
                "key" : {
                        "list.i" : 1 // list.i, okay
                },
                "ns" : "site_events.listen",
                "name" : "list.i_1",
                "background" : true
        }
]
events2:PRIMARY>

events2:PRIMARY> db.listen.stats();
{
        "ns" : "site_events.listen",
        "count" : 616369,
        "size" : 7789669296,
        "avgObjSize" : 12637.996550767479,
        "storageSize" : 9305386992,
        "numExtents" : 24,
        "nindexes" : 2,
        "lastExtentSize" : 2146426864,
        "paddingFactor" : 1.0000000000920757,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 45098816,
        "indexSizes" : { // no list.i here!
                "_id_" : 22026144,
                "uid_1" : 23072672
        },
        "ok" : 1
}

Queries that should use this index do a full scan instead.
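
[ed: as a sketch of how to confirm the scan behavior from the shell – the query value below is a placeholder; on 2.2/2.4, explain() names the cursor type:]

events2:PRIMARY> db.listen.find({ "list.i": "some-value" }).explain()
// "cursor" : "BasicCursor" indicates a full collection scan; a query that
// used the index would report "cursor" : "BtreeCursor list.i_1"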



 Comments   
Comment by J Rassi [ 28/Oct/13 ]

Thanks.

Copying earlier resolve message: "Issue fixed by refactor in SERVER-2771, resolving as dup."

Comment by Arkadiy Kukarkin [ 28/Oct/13 ]

Cool, I just wanted to make sure you didn't need any more data from me. I didn't look at the getIndexes() output after the dropIndex attempt, so it's possible it was partially successful. Building the index normally now.

Comment by J Rassi [ 28/Oct/13 ]

Hmm. I can confirm that repairDatabase on 2.2.3 fetches the index list for the rebuild in the same way that db.collection.getIndexes() does (that is, a query on db.system.indexes), so I'm currently convinced that you got rid of the system.indexes entry somehow (the dropIndex attempt?), sometime between posting the ticket and running repairDatabase. Anyway, you have a fresh database on this member now, so I'd just create the index on it as usual if you haven't already.
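
[ed: for reference, that lookup can be run directly against system.indexes; a sketch, using the namespace from this ticket:]

events2:PRIMARY> db.system.indexes.find({ ns: "site_events.listen" }, { name: 1 })
// getIndexes() and the 2.2-era repairDatabase both read this collection,
// so an index whose entry is gone here is neither listed nor rebuilt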

Comment by Arkadiy Kukarkin [ 28/Oct/13 ]

It finished without errors; however, the index was not rebuilt:

Fri Oct 25 15:44:59.397 [conn17] build index site_events.listen { _id: 1 }
Fri Oct 25 15:45:09.007 [conn17]                Index: (1/3) External Sort Progress: 159900/620987      25%
Fri Oct 25 15:45:17.158 [DataFileSync] flushing mmaps took 26058ms  for 40 files
Fri Oct 25 15:45:19.005 [conn17]                Index: (1/3) External Sort Progress: 347400/620987      55%
Fri Oct 25 15:45:29.015 [conn17]                Index: (1/3) External Sort Progress: 481600/620987      77%
Fri Oct 25 15:45:39.037 [conn17]                Index: (1/3) External Sort Progress: 542700/620987      87%
Fri Oct 25 15:45:49.046 [conn17]                Index: (1/3) External Sort Progress: 568500/620987      91%
Fri Oct 25 15:45:59.028 [conn17]                Index: (1/3) External Sort Progress: 592200/620987      95%
Fri Oct 25 15:46:06.210 [conn17]         external sort used : 0 files  in 66 secs
Fri Oct 25 15:46:06.511 [conn17]         done building bottom layer, going to commit
Fri Oct 25 15:46:06.513 [conn17]         fastBuildIndex dupsToDrop:0
Fri Oct 25 15:46:06.590 [conn17] build index done.  scanned 620987 total records. 67.193 secs
Fri Oct 25 15:46:06.647 [conn17] build index site_events.listen { uid: 1.0 }
Fri Oct 25 15:46:16.004 [conn17]                Index: (1/3) External Sort Progress: 265800/620987      42%
Fri Oct 25 15:46:26.018 [conn17]                Index: (1/3) External Sort Progress: 453700/620987      73%
Fri Oct 25 15:46:36.006 [conn17]                Index: (1/3) External Sort Progress: 526600/620987      84%
Fri Oct 25 15:46:46.041 [conn17]                Index: (1/3) External Sort Progress: 559400/620987      90%
Fri Oct 25 15:46:56.040 [conn17]                Index: (1/3) External Sort Progress: 584200/620987      94%
Fri Oct 25 15:47:06.223 [conn17]         external sort used : 0 files  in 59 secs
Fri Oct 25 15:47:06.937 [conn17]         done building bottom layer, going to commit
Fri Oct 25 15:47:06.939 [conn17]         fastBuildIndex dupsToDrop:0
Fri Oct 25 15:47:07.013 [conn17] build index done.  scanned 620987 total records. 60.365 secs
Fri Oct 25 15:47:07.404 [conn17] removeJournalFiles
Fri Oct 25 15:47:08.073 [conn17] command site_events.$cmd command: { repairDatabase: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:428434157 reslen:37 428434ms

        "indexSizes" : {
                "_id_" : 18142544,
                "uid_1" : 15648864
        }

Comment by J Rassi [ 26/Oct/13 ]

Did repairDatabase finish without errors?

Comment by J Rassi [ 25/Oct/13 ]

Yeah that's a bug – filed SERVER-11391.

Comment by Arkadiy Kukarkin [ 25/Oct/13 ]

On a side note, an error message similar to the other "not master" errors would probably be more helpful.

Comment by Arkadiy Kukarkin [ 25/Oct/13 ]

Restarting as standalone fixed the issue, never mind!

Comment by J Rassi [ 25/Oct/13 ]

Also (perhaps relatedly), I just noticed the SECONDARY prompt from your shell excerpt. You'll need to restart the secondary in standalone mode (without the --replSet flag) to perform write operations on it. You may have done so already – sometimes the shell doesn't update the prompt properly.
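
[ed: a sketch of how to check the member's actual state regardless of what the shell prompt shows:]

events2:SECONDARY> db.isMaster()
// a replica set member reports "setName" plus "ismaster"/"secondary" flags;
// a standalone started without --replSet reports "ismaster" : true and no "setName"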

Comment by J Rassi [ 25/Oct/13 ]

Can you post the log excerpt for the repairDatabase, with at least a few minutes of context? And the member is still running 2.2.2, yes?

Comment by Arkadiy Kukarkin [ 25/Oct/13 ]

OK, attempting repairDatabase yields this:

events2:SECONDARY> db.repairDatabase();
{
        "ok" : 0,
        "errmsg" : "clone failed for site_events with error: namespace query failed site_events.system.namespaces"
}

Comment by J Rassi [ 17/Oct/13 ]

Issue fixed by refactor in SERVER-2771, resolving as dup.

Comment by J Rassi [ 17/Oct/13 ]

I'll go back on my earlier advice: run repairDatabase instead of a reIndex. reIndex only drops the index collections that are registered with the namespace details, so after the reIndex finishes the list.i_1 index collection will still have the old index data around on disk (though the old data will be inaccessible to queries). repairDatabase does a rebuild-everything-from-scratch, which is the more robust fix (also, a reIndex on a collection with half-built indexes doesn't work in 2.4).

I also decided not to file a request for a user mechanism to remove half-built indexes: it wouldn't be backported since it's a feature, and the SERVER-2771 fix for the original problem is already in master (I see much saner behavior when running the index build abort experiments against 2.5.2 than against 2.2 or 2.4). More generally, users with database metadata out of sync should be running repair.

I'll resolve this ticket then, assuming you'll run repairDatabase when you can next take a maintenance window for that host; feel free to re-open the ticket if you run into any problems.
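
[ed: a post-repair consistency check, as a sketch – the two listings that disagreed in the original report should then match:]

events2:PRIMARY> db.listen.getIndexes().map(function (ix) { return ix.name; })
events2:PRIMARY> db.listen.stats().indexSizes
// after a successful repair the same index names should appear in both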

Comment by Arkadiy Kukarkin [ 16/Oct/13 ]

Taking this member down is harder than it sounds, because there is an unrelated issue with the secondary (which does have the index correctly committed, sigh) where it absolutely refuses to catch up at above realtime speed and has been staying in STARTUP2 all day. Disk and network are at under 200k/s max but CPU is pegged and serverStatus takes 2000ms+ to run :\

I'll post this one to the mailing list; I'm guessing it's a locking issue.

Comment by Arkadiy Kukarkin [ 16/Oct/13 ]

Okay, that makes sense. The index collection does indeed have some objects, but fewer than expected:
events2:PRIMARY> db.listen.$list.i_1.stats();
{
        "ns" : "site_events.listen.$list.i_1",
        "count" : 180916,
        "size" : 1479169216,
        "avgObjSize" : 8176,
        "storageSize" : 1777704960,
        "numExtents" : 18,
        "nindexes" : 0,
        "lastExtentSize" : 467083264,
        "paddingFactor" : 1,
        "systemFlags" : 0,
        "userFlags" : 0,
        "totalIndexSize" : 0,
        "indexSizes" : {

        },
        "ok" : 1
}

No idea why that machine would run out of memory – it has plenty of RAM and doesn't really run anything else. I vaguely seem to remember unusually high resource usage at around that time, but no details. I'll dig around a bit to see if I can find anything.

mongod version is 2.2.2 at this time

Comment by J Rassi [ 16/Oct/13 ]

Reproduced on 2.2.3, verified that reIndex fixes the issue. [edit: run repairDatabase instead of reIndex; see comment below]

Comment by J Rassi [ 16/Oct/13 ]

Nope, not an OOM kill, but an assertion failure (13524). 13524 is an assertion in the journaling subsystem during allocation of a journal commit – it's raised when posix_memalign() returns an error (presumably ENOMEM). At the time of the assertion failure, the external sort had finished and the index build was being committed to disk. Fatal assertion failures generate an abort() call, which is why you see signal 6. I'm guessing that the amount of memory requested was less than 256MB, since warnings are printed whenever the journal commit size exceeds that, but I can't tell without seeing whether there are any in the full log. And it's not clear why the system was out of memory at that time either; system memory usage stats from that time would probably help diagnose that. Committing to disk an index build that uses an external sort isn't memory-intensive by itself.

I see that the index has an entry in db.system.indexes (from your collection.getIndexes() output), and I presume that a collection was allocated for the index (site_events.listen.$list.i_1), but I see no entry for the index in the database's namespace details (from your collection.stats() output). The namespace details structure is the "definitive" listing – since the index isn't there, it's not being updated or used for queries. So, because the process aborted during the index build, you have a half-built index of sorts. Unfortunately, in 2.2 (which I presume you were running, since the log says February) mongod doesn't continue aborted index builds on restart (fixed in 2.5.0 as part of SERVER-2771).
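
[ed: in 2.2/2.4 the allocated namespaces can also be listed directly; a sketch, using the names from this ticket:]

events2:PRIMARY> db.system.namespaces.find({ name: /^site_events\.listen\.\$/ })
// lists the index extent namespaces allocated on disk; the half-built
// list.i_1 extent would be expected to appear here even though
// stats().indexSizes omits it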

You can't remove it with dropIndex; you probably want to take the member down for maintenance and run reIndex. I'll file a feature request for a user mechanism to remove half-built indexes and link it here.
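
[ed: for reference, reIndex is a collection-level shell helper; a sketch of the invocation suggested here, to be run while the member is down for maintenance (the 17/Oct comment revises this advice to repairDatabase):]

> db.listen.reIndex()
// rebuilds only the indexes registered in the namespace details, so the
// orphaned list.i_1 data would remain on disk (as noted in the 17/Oct comment)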

Comment by Arkadiy Kukarkin [ 16/Oct/13 ]

Can't drop it, I guess:

events2:PRIMARY> db.listen.dropIndex("list.i_1");
{ "ok" : 0, "errmsg" : "index not found" }

Comment by Arkadiy Kukarkin [ 16/Oct/13 ]

This appears to be the most relevant piece:

Mon Feb  4 14:00:39 [rsSync] info: indexing in foreground on this replica; was a background index build on the primary
Mon Feb  4 14:00:39 [rsSync] build index site_events.listen { list.i: 1.0 }
Mon Feb  4 14:00:49 [rsSync]            10400/542808    1%
Mon Feb  4 14:00:59 [rsSync]            19200/542808    3%
Mon Feb  4 14:01:09 [rsSync]            32200/542808    5%
Mon Feb  4 14:01:19 [rsSync]            40100/542808    7%
Mon Feb  4 14:01:29 [rsSync]            48300/542808    8%
Mon Feb  4 14:01:39 [rsSync]            56900/542808    10%
Mon Feb  4 14:01:49 [rsSync]            66100/542808    12%
Mon Feb  4 14:01:59 [rsSync]            74100/542808    13%
Mon Feb  4 14:02:09 [rsSync]            79100/542808    14%
Mon Feb  4 14:02:19 [rsSync]            85600/542808    15%
Mon Feb  4 14:02:29 [rsSync]            91000/542808    16%
Mon Feb  4 14:02:39 [rsSync]            98800/542808    18%
Mon Feb  4 14:02:49 [rsSync]            103200/542808   19%
...
Mon Feb  4 14:45:05 [rsSync]            541700/542808   99%
Mon Feb  4 14:45:17 [rsSync]            542100/542808   99%
Mon Feb  4 14:45:24 [conn4] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4ebd15aeb711110b3') } ntoreturn:1 keyUpdates:0  reslen:44 300023ms
Mon Feb  4 14:45:24 [conn6] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d45fc10a60703a1624') } ntoreturn:1 keyUpdates:0  reslen:44 300023ms
Mon Feb  4 14:45:24 [conn9] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4dcd61786e232686e') } ntoreturn:1 keyUpdates:0  reslen:44 300023ms
Mon Feb  4 14:45:24 [conn8] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4118adb037b03b00f') } ntoreturn:1 keyUpdates:0  reslen:44 300023ms
Mon Feb  4 14:45:24 [conn5] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4bc11a54c80b014b5') } ntoreturn:1 keyUpdates:0  reslen:44 300023ms
Mon Feb  4 14:45:24 [conn7] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4264e99abafbf248b') } ntoreturn:1 keyUpdates:0  reslen:44 300023ms
Mon Feb  4 14:45:25 [conn13] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4d5f2d867054dc8dc') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:26 [conn15] command admin.$cmd command: { writebacklisten: ObjectId('50eb5ac22a322e1d7f3d5615') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:26 [conn14] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4e0a159b1be288105') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:27 [conn19] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4af5af7b70653e3c4') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:27 [conn22] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d4360dd1b77808555c') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:27 [conn21] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d441ed387bbb58ce0a') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:27 [conn20] command admin.$cmd command: { writebacklisten: ObjectId('50eb53d419e60064e2b0ebe7') } ntoreturn:1 keyUpdates:0  reslen:44 300000ms
Mon Feb  4 14:45:28 [rsSync]            542500/542808   99%
Mon Feb  4 14:45:37 [rsSync]     external sort used : 119 files  in 2697 secs
...
Mon Feb  4 14:45:47 [rsSync]            750800/64526821 1%
Mon Feb  4 14:45:57 [rsSync]            1444000/64526821        2%
Mon Feb  4 14:46:07 [rsSync]            2232000/64526821        3%
Mon Feb  4 14:46:17 [rsSync]            2951300/64526821        4%
Mon Feb  4 14:46:27 [rsSync]            3731500/64526821        5%
Mon Feb  4 14:46:37 [rsSync]            4521100/64526821        7%
Mon Feb  4 14:46:47 [rsSync]            5223600/64526821        8%
Mon Feb  4 14:46:57 [rsSync]            6030000/64526821        9%
Mon Feb  4 14:47:07 [rsSync]            6695600/64526821        10%
Mon Feb  4 14:47:17 [rsSync]            7443800/64526821        11%
Mon Feb  4 14:47:27 [rsSync]            8202200/64526821        12%
Mon Feb  4 14:47:37 [rsSync]            8938700/64526821        13%
Mon Feb  4 14:47:47 [rsSync]            9730700/64526821        15%
Mon Feb  4 14:47:57 [rsSync]            10432700/64526821       16%
Mon Feb  4 14:48:07 [rsSync]            11243500/64526821       17%
Mon Feb  4 14:48:17 [rsSync]            12058100/64526821       18%
Mon Feb  4 14:48:21 [rsSync] old journal file will be removed: /web/HYPE/mongodb/journal/j._0
...
Mon Feb  4 14:59:37 [rsSync]            63125600/64526821       97%
Mon Feb  4 14:59:47 [rsSync]            63796200/64526821       98%
Mon Feb  4 14:59:56 [rsSync]     done building bottom layer, going to commit
Mon Feb  4 15:00:17 [rsSync] Assertion: 13524:out of memory AlignedBuilder
0xaffd31 0xac5eb9 0xac4c2c 0xac4c59 0x73af9e 0x73b362 0x730f6b 0x73149a 0x72eeac 0x72efbc 0x6476d6 0x647b0d 0x79ea2b 0x7996ae 0x79b277 0x86b3e3 0x86c606 0x86df23 0x6736f9 0x6786a0
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x99) [0xac5eb9]
 /usr/bin/mongod() [0xac4c2c]
 /usr/bin/mongod(_ZN5mongo14AlignedBuilder8_reallocEjj+0x29) [0xac4c59]
 /usr/bin/mongod(_ZN5mongo3dur7Journal7journalERKNS0_11JSectHeaderERKNS_14AlignedBuilderE+0x4e) [0x73af9e]
 /usr/bin/mongod(_ZN5mongo3dur14WRITETOJOURNALENS0_11JSectHeaderERNS_14AlignedBuilderE+0x32) [0x73b362]
 /usr/bin/mongod() [0x730f6b]
 /usr/bin/mongod(_ZN5mongo3dur11DurableImpl9commitNowEv+0x1a) [0x73149a]
 /usr/bin/mongod(_ZN5mongo3dur11DurableImpl16_aCommitIsNeededEv+0x14c) [0x72eeac]
 /usr/bin/mongod(_ZN5mongo3dur11DurableImpl14commitIfNeededEb+0x4c) [0x72efbc]
 /usr/bin/mongod(_ZN5mongo12BtreeBuilderINS_12BtreeData_V1EE14buildNextLevelENS_7DiskLocE+0x276) [0x6476d6]
 /usr/bin/mongod(_ZN5mongo12BtreeBuilderINS_12BtreeData_V1EE6commitEv+0x1d) [0x647b0d]
 /usr/bin/mongod(_ZN5mongo24buildBottomUpPhases2And3INS_12BtreeData_V1EEEvbRNS_12IndexDetailsERNS_21BSONObjExternalSorterEbRSt3setINS_7DiskLocESt4lessIS7_ESaIS7_EEPNS_5CurOpEPNS_12SortPhaseOneERNS_19ProgressMeterHolderERNS_5TimerE+0x2ab) [0x79ea2b]
 /usr/bin/mongod(_ZN5mongo14fastBuildIndexEPKcPNS_16NamespaceDetailsERNS_12IndexDetailsEi+0xcbe) [0x7996ae]
 /usr/bin/mongod(_ZN5mongo12buildAnIndexESsPNS_16NamespaceDetailsERNS_12IndexDetailsEib+0x217) [0x79b277]
 /usr/bin/mongod(_ZN5mongo16insert_makeIndexEPNS_16NamespaceDetailsERKSsRKNS_7DiskLocE+0x263) [0x86b3e3]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbPb+0x946) [0x86c606]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEb+0x43) [0x86df23]
 /usr/bin/mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbbbNS_5QueryE+0x3c9) [0x6736f9]
 /usr/bin/mongod(_ZN5mongo6Cloner2goEPKcRKNS_12CloneOptionsERSt3setISsSt4lessISsESaISsEERSsPi+0x2d60) [0x6786a0]
Mon Feb  4 15:00:18 [rsSync] dbexception in groupCommit causing immediate shutdown: 13524 out of memory AlignedBuilder
Mon Feb  4 15:00:18 gc1
Mon Feb  4 15:00:18 Got signal: 6 (Aborted).
Mon Feb  4 15:00:18 Backtrace:
0xaffd31 0x558bb9 0x7f6ded5ddaf0 0x7f6ded5dda75 0x7f6ded5e15c0 0x7a7a57 0x73127a 0x73149a 0x72eeac 0x72efbc 0x6476d6 0x647b0d 0x79ea2b 0x7996ae 0x79b277 0x86b3e3 0x86c606 0x86df23 0x6736f9 0x6786a0
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x558bb9]
 /lib/libc.so.6(+0x33af0) [0x7f6ded5ddaf0]
 /lib/libc.so.6(gsignal+0x35) [0x7f6ded5dda75]
 /lib/libc.so.6(abort+0x180) [0x7f6ded5e15c0]
 /usr/bin/mongod(_ZN5mongo10mongoAbortEPKc+0x47) [0x7a7a57]
 /usr/bin/mongod() [0x73127a]
 /usr/bin/mongod(_ZN5mongo3dur11DurableImpl9commitNowEv+0x1a) [0x73149a]
 /usr/bin/mongod(_ZN5mongo3dur11DurableImpl16_aCommitIsNeededEv+0x14c) [0x72eeac]
 /usr/bin/mongod(_ZN5mongo3dur11DurableImpl14commitIfNeededEb+0x4c) [0x72efbc]
 /usr/bin/mongod(_ZN5mongo12BtreeBuilderINS_12BtreeData_V1EE14buildNextLevelENS_7DiskLocE+0x276) [0x6476d6]
 /usr/bin/mongod(_ZN5mongo12BtreeBuilderINS_12BtreeData_V1EE6commitEv+0x1d) [0x647b0d]
 /usr/bin/mongod(_ZN5mongo24buildBottomUpPhases2And3INS_12BtreeData_V1EEEvbRNS_12IndexDetailsERNS_21BSONObjExternalSorterEbRSt3setINS_7DiskLocESt4lessIS7_ESaIS7_EEPNS_5CurOpEPNS_12SortPhaseOneERNS_19ProgressMeterHolderERNS_5TimerE+0x2ab) [0x79ea2b]
 /usr/bin/mongod(_ZN5mongo14fastBuildIndexEPKcPNS_16NamespaceDetailsERNS_12IndexDetailsEi+0xcbe) [0x7996ae]
 /usr/bin/mongod(_ZN5mongo12buildAnIndexESsPNS_16NamespaceDetailsERNS_12IndexDetailsEib+0x217) [0x79b277]
 /usr/bin/mongod(_ZN5mongo16insert_makeIndexEPNS_16NamespaceDetailsERKSsRKNS_7DiskLocE+0x263) [0x86b3e3]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbPb+0x946) [0x86c606]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEb+0x43) [0x86df23]
 /usr/bin/mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbbbNS_5QueryE+0x3c9) [0x6736f9]
 /usr/bin/mongod(_ZN5mongo6Cloner2goEPKcRKNS_12CloneOptionsERSt3setISsSt4lessISsESaISsEERSsPi+0x2d60) [0x6786a0]
 
 
 
***** SERVER RESTARTED *****
...
Mon Feb  4 15:08:41 [rsBackgroundSync] replSet syncing to: g19i:27018
Mon Feb  4 15:08:41 [rsSync] build index local.replset.minvalid { _id: 1 }
Mon Feb  4 15:08:41 [rsSync] build index done.  scanned 0 total records. 0 secs
Mon Feb  4 15:08:42 [rsSyncNotifier] replset setting oplog notifier to g19i:27018
Mon Feb  4 15:08:46 [conn14] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 keyUpdates:0  reslen:246 101ms
Mon Feb  4 15:08:55 [conn41] command admin.$cmd command: { replSetGetStatus: 1 } ntoreturn:1 keyUpdates:0  reslen:603 299ms
Mon Feb  4 15:08:56 [conn16] command admin.$cmd command: { replSetGetStatus: 1 } ntoreturn:1 keyUpdates:0  reslen:603 165ms
Mon Feb  4 15:08:56 [conn43] command admin.$cmd command: { replSetGetStatus: 1 } ntoreturn:1 keyUpdates:0  reslen:603 152ms
Mon Feb  4 15:09:36 [PeriodicTask::Runner] task: DBConnectionPool-cleaner took: 71ms
Mon Feb  4 15:09:36 [PeriodicTask::Runner] task: WriteBackManager::cleaner took: 19ms
Mon Feb  4 15:09:39 [journal] Assertion: 13524:out of memory AlignedBuilder
Mon Feb  4 15:09:39 [TTLMonitor] query local.system.indexes query: { expireAfterSeconds: { $exists: true } } ntoreturn:0 ntoskip:0 nscanned:2 keyUpdates:0 locks(micros) r:351893 nreturned:0 reslen:20 351ms
0xaffd31 0xac5eb9 0xac4c2c 0xac4c59 0x73dcc5 0x73dfd4 0x7305b0 0x7313d9 0x731a84 0xb45ba9 0x7f70aa1359ca 0x7f70a94dc70d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x99) [0xac5eb9]
 /usr/bin/mongod() [0xac4c2c]
 /usr/bin/mongod(_ZN5mongo14AlignedBuilder8_reallocEjj+0x29) [0xac4c59]
 /usr/bin/mongod() [0x73dcc5]
 /usr/bin/mongod(_ZN5mongo3dur13PREPLOGBUFFERERNS0_11JSectHeaderERNS_14AlignedBuilderE+0x214) [0x73dfd4]
 /usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0xa0) [0x7305b0]
 /usr/bin/mongod() [0x7313d9]
 /usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x364) [0x731a84]
 /usr/bin/mongod() [0xb45ba9]
 /lib/libpthread.so.0(+0x69ca) [0x7f70aa1359ca]
 /lib/libc.so.6(clone+0x6d) [0x7f70a94dc70d]
Mon Feb  4 15:09:39 [journal] dbexception in groupCommitLL causing immediate shutdown: 13524 out of memory AlignedBuilder
Mon Feb  4 15:09:39 dur1
Mon Feb  4 15:09:39 Got signal: 6 (Aborted).
 
Mon Feb  4 15:09:39 Backtrace:
0xaffd31 0x558bb9 0x7f70a9429af0 0x7f70a9429a75 0x7f70a942d5c0 0x7a7a57 0x730a7f 0x7313d9 0x731a84 0xb45ba9 0x7f70aa1359ca 0x7f70a94dc70d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x558bb9]
 /lib/libc.so.6(+0x33af0) [0x7f70a9429af0]
 /lib/libc.so.6(gsignal+0x35) [0x7f70a9429a75]
 /lib/libc.so.6(abort+0x180) [0x7f70a942d5c0]
 /usr/bin/mongod(_ZN5mongo10mongoAbortEPKc+0x47) [0x7a7a57]
 /usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0x56f) [0x730a7f]
 /usr/bin/mongod() [0x7313d9]
 /usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x364) [0x731a84]
 /usr/bin/mongod() [0xb45ba9]
 /lib/libpthread.so.0(+0x69ca) [0x7f70aa1359ca]
 /lib/libc.so.6(clone+0x6d) [0x7f70a94dc70d]
 
 
 
***** SERVER RESTARTED *****

So it looks like the process may have gotten OOM killed (though that should be 9, not 6?) while attempting to commit the index, restarted, tried again (and got killed again) then gave up.

I don't remember how this machine was spun up, but I am guessing initial dataset copy from EC2 snapshot.

Should I drop and rebuild the index now?

Comment by J Rassi [ 16/Oct/13 ]

It'd look something like:

build index on: site_events.listen properties: { "v" : 1, "key" : { "list.i" : 1 }, "ns" : "site_events.listen", "name" : "list.i_1", "background" : true }

And, do you know if this member was brought up from an initial sync?

Comment by Arkadiy Kukarkin [ 16/Oct/13 ]

Hey Rassi! This was indexed a while ago, but maybe the logs are still around. What string should I grep for?

Comment by J Rassi [ 16/Oct/13 ]

Do you have the mongod log excerpt for the list.i_1 index build?
