[SERVER-18005] Unclear why Primary mongod just exited - killed by oom killer Created: 13/Apr/15  Updated: 23/Apr/15  Resolved: 23/Apr/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kam Lall Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

The primary/secondary in one of our shards were running perfectly. Then at 4 am the primary mongod exited but there was no crash at all. Here is the log below. Note that I see a line that says: [ReplicationExecutor] stepping down; another primary was elected more recently

But it is unclear why the process exited.

2015-04-13T03:59:50.981+0000 I COMMAND  [conn24621] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, 633D626977732C723D672B523057643766366C34576F2F594A75347639505765486B5039467867303446486672684258717644726E452B6F4161587A49756E675A2B6E446832...), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:168 locks:{} 3051ms
2015-04-13T03:59:51.346+0000 I QUERY    [conn24614] query cubes_prod.accounts query: { UserKey: ObjectId('5520c855498bf1aa0b38e96e'), _id: ObjectId('5520c855498bf1aa0b38e970') } planSummary: IXSCAN { _id: 1 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:948 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 9948ms
2015-04-13T03:59:50.981+0000 I COMMAND  [conn24619] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, ), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:138 locks:{} 2492ms
2015-04-13T03:59:50.981+0000 I COMMAND  [conn24618] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, ), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:138 locks:{} 2492ms
2015-04-13T03:59:51.152+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.116:25158 #24628 (447 connections now open)
2015-04-13T03:59:50.981+0000 I COMMAND  [conn24622] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, 633D626977732C723D62544543755243674F4D4D75365350417775586642706536514835536A46394A42342F4A4256784835666F433741784D533449457A467543754C6B754D...), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:168 locks:{} 3051ms
2015-04-13T03:59:51.357+0000 I QUERY    [conn24613] query cubes_prod.accounts query: { UserKey: ObjectId('5521de8dbbefd95324a8212d'), _id: ObjectId('5521e974498bf1aa0b38e9ca') } planSummary: IXSCAN { _id: 1 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:769 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 10003ms
2015-04-13T03:59:51.393+0000 I QUERY    [conn2748] getmore local.oplog.rs cursorid:92163486263 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:5 reslen:905 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } 3768ms
2015-04-13T03:59:52.235+0000 I REPL     [ReplicationExecutor] Member ip-10-0-21-50.ec2.internal:27017 is now in state PRIMARY
2015-04-13T03:59:52.235+0000 I COMMAND  [conn10173] command admin.$cmd command: getCmdLineOpts { getCmdLineOpts: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:677 locks:{} 237ms
2015-04-13T03:59:53.012+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.116:25159 #24629 (448 connections now open)
2015-04-13T03:59:53.099+0000 I COMMAND  [conn24625] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D6C5469597A637074596F4D5963746D74586B37477A656E47535337362B303846) } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 1669ms
2015-04-13T03:59:53.256+0000 I ACCESS   [conn24621] Successfully authenticated as principal __system on local
2015-04-13T03:59:53.904+0000 I COMMAND  [conn2859] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 164ms
2015-04-13T03:59:54.143+0000 I COMMAND  [conn24626] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D546F54595378306F74596E6D6F385530745A566E69397435554A633753435371) } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 1944ms
2015-04-13T03:59:54.335+0000 I REPL     [ReplicationExecutor] replset: another primary seen with election time 552b3f43:1 my election time is 5526695a:1
2015-04-13T03:59:54.643+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.116:25160 #24630 (449 connections now open)
2015-04-13T03:59:55.031+0000 I ACCESS   [conn24622] Successfully authenticated as principal __system on local
2015-04-13T03:59:56.316+0000 I REPL     [ReplicationExecutor] stepping down; another primary was elected more recently
2015-04-13T03:59:56.731+0000 I COMMAND  [conn24624] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, 633D626977732C723D5679634B4C4A675141742F43546C324D6E5739683858344A74507452534E7862626B6768455534424953636F48373347646652416836574F78414A4477...), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:168 locks:{} 2910ms
2015-04-13T03:59:56.731+0000 I COMMAND  [conn24623] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, 633D626977732C723D737669324D3244795063542F35454632546E7743667A46586F744F32454356656B304E4E546944785459662F6D526D6D74464E70445455526842474862...), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:168 locks:{} 3213ms
2015-04-13T03:59:56.800+0000 I COMMAND  [conn24627] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D615946656A7A75466F2B426A384B48754F4D675341354E49592B4A4F67507942) } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 1820ms
2015-04-13T03:59:58.087+0000 I COMMAND  [conn24621] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, ), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:138 locks:{} 2468ms
2015-04-13T03:59:58.382+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
2015-04-13T03:59:58.689+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.116:25161 #24631 (450 connections now open)
2015-04-13T04:00:01.536+0000 I COMMAND  [conn24622] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, ), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:138 locks:{} 4243ms
2015-04-13T04:00:01.874+0000 I ACCESS   [conn24624] Successfully authenticated as principal __system on local
2015-04-13T04:00:02.625+0000 I COMMAND  [conn24628] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D4E2B3177683151512B484F39665A2F42794B437279454E422B6A7464792F7153) } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 4630ms
2015-04-13T04:00:02.920+0000 I COMMAND  [conn10173] command admin.$cmd command: logout { logout: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 1067ms
2015-04-13T04:00:02.920+0000 I COMMAND  [conn2859] command admin.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D696E626F786375626564622C723D4E74436937457948) } keyUpdates:0 writeConflicts:0 numYields:0 reslen:215 locks:{} 3210ms
2015-04-13T04:00:03.106+0000 I ACCESS   [conn24623] Successfully authenticated as principal __system on local
2015-04-13T04:00:03.731+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.116:25162 #24632 (451 connections now open)
2015-04-13T04:00:04.129+0000 I COMMAND  [conn2682] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 289ms
2015-04-13T04:00:04.523+0000 I COMMAND  [conn24629] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D776F553972343872324D5334384F41482B6F36334765454642426B69416F5678) } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 4167ms
2015-04-13T04:00:04.869+0000 I COMMAND  [conn24625] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, 633D626977732C723D6C5469597A637074596F4D5963746D74586B37477A656E47535337362B3038462F63714139757356666A2B3156694D4D394946646D6D334653566E5474...), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:168 locks:{} 6077ms
2015-04-13T04:00:07.060+0000 I COMMAND  [conn24626] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, 633D626977732C723D546F54595378306F74596E6D6F385530745A566E69397435554A63375343537139413866773530666732365047667079526D585034306A4F4C4C743132...), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:168 locks:{} 6417ms
2015-04-13T04:00:08.087+0000 I COMMAND  [conn24624] command local.$cmd command: saslContinue { saslContinue: 1, payload: BinData(0, ), conversationId: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:138 locks:{} 4613ms
2015-04-13T04:00:09.507+0000 I COMMAND  [conn22458] command admin.$cmd command: isMaster { ismaster: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:514 locks:{} 15102ms
2015-04-13T04:00:09.730+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.116:25163 #24633 (452 connections now open)
2015-04-13T11:45:45.839+0000 I CONTROL  ***** SERVER RESTARTED *****
2015-04-13T11:45:45.943+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=zlib),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),



 Comments   
Comment by Ramon Fernandez Marina [ 23/Apr/15 ]

klall, please see this response from Asya in mongodb-user.

I'm going to resolve this ticket, since we keep the SERVER project for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group.

Regards,
Ramón.

Comment by Kam Lall [ 13/Apr/15 ]

I am curious why the wired tiger cache size defaults to 1/2 of the available RAM. Is it wise to increase this on our system?

I can stick with the default and just create swap space however of size a 8 Gig I suppose.

Comment by Kam Lall [ 13/Apr/15 ]

Yes, that was it in /var/log/messages at 4 am UTC

Apr 13 04:00:13 ip-10-0-13-119 kernel: [29163857.113398] Out of memory: Kill process 21806 (mongod) score 981 or sacrifice child
Apr 13 04:00:13 ip-10-0-13-119 kernel: [29163857.113447] Killed process 21806 (mongod) total-vm:18587096kB, anon-rss:17194132kB, file-rss:0kB

I will increase swap space now.

Comment by Kam Lall [ 13/Apr/15 ]

Ok, I will look for that now. Thank you.

Comment by Ramon Fernandez Marina [ 13/Apr/15 ]

klall, it seems your mongod process may have been killed by the operating system. Please search your system logs for messages that look like:

Out of memory: Kill process NNN (mongod)

You may be hitting SERVER-17424 where WiredTiger may, in you case, consume more than the 8GB of cache you specified. One possibility to work around the issue would be to add more swap space, so the operating system doesn't run out of memory.

Generated at Thu Feb 08 03:46:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.