[SERVER-24709] mongod crashes on mongodump run (probably due to low memory) Created: 22/Jun/16  Updated: 21/Jun/17  Resolved: 24/May/17

Status: Closed
Project: Core Server
Component/s: Stability, Tools
Affects Version/s: 3.2.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Or Khafi [X] Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Run mongodump and then stress the memory of the OS.
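
A minimal sketch of one way to do this (the hostname, output path, and the stress tool are illustrative assumptions, not taken from the original report):

# dump from the secondary in the background
mongodump --host production-db02 --port 27017 --out /tmp/dump &
# in parallel, put pressure on RAM; the 'stress' utility is one option (assumed installed)
stress --vm 1 --vm-bytes 1536M --vm-keep --timeout 120s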

Participants:

 Description   

Server - AWS EC2 t2.small, CentOS Linux release 7.2.1511 (Core)
DB Version - MongoDB community 3.2.6
Topology: Replica Set

When running the mongodump command on the secondary node, the db crashes roughly 10% of the time.

I stressed the memory to reproduce, and it looks like the crash happens due to low memory (though there is enough free memory when the dump starts and while it runs).

The mongodump output when the dump fails:
2016-06-22T12:19:18.871+0300 Failed: error reading collection: EOF && Mux ending but selectCases still open 2

The last messages in mongod.log:

2016-06-22T12:19:13.162+0300 D STORAGE  [conn10] WT begin_transaction
2016-06-22T12:19:13.187+0300 D QUERY    [conn12] Running getMore, cursorid: 225913259574
2016-06-22T12:19:13.187+0300 D STORAGE  [conn12] WT begin_transaction
2016-06-22T12:19:13.224+0300 D STORAGE  [WTJournalFlusher] flushed journal
2016-06-22T12:19:13.324+0300 D STORAGE  [WTJournalFlusher] flushed journal
2016-06-22T12:19:13.394+0300 D EXECUTOR [ReplicationExecutor] Scheduling remote request: RemoteCommand 115 -- target:production-app03:27017 db:admin cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.394+0300 D ASIO     [ReplicationExecutor] startCommand: RemoteCommand 115 -- target:production-app03:27017 db:admin expDate:2016-06-22T12:19:23.393+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.394+0300 D ASIO     [NetworkInterfaceASIO-Replication-0] Initiating asynchronous command: RemoteCommand 115 -- target:production-app03:27017 db:admin expDate:2016-06-22T12:19:23.393+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.394+0300 D ASIO     [NetworkInterfaceASIO-Replication-0] Starting asynchronous command 115 on host production-app03:27017
2016-06-22T12:19:13.396+0300 D EXECUTOR [NetworkInterfaceASIO-Replication-0] Received remote response: RemoteResponse --  cmd:{ ok: 1.0, state: 7, v: 3, hbmsg: "", set: "prod", term: 5, primaryId: 0, durableOpTime: { ts: Timestamp 1466525772000|2012, t: 5 }, opTime: { ts: Timestamp 1466525772000|2012, t: 5 } }
2016-06-22T12:19:13.396+0300 D ASIO     [NetworkInterfaceASIO-Replication-0] Failed to time operation 115 out: Operation aborted.
2016-06-22T12:19:13.396+0300 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:2, msg:  
2016-06-22T12:19:13.396+0300 D REPL     [ReplicationExecutor] Scheduling heartbeat to production-app03:27017 at 2016-06-22T09:19:18.396Z
2016-06-22T12:19:13.415+0300 D EXECUTOR [ReplicationExecutor] Scheduling remote request: RemoteCommand 117 -- target:production-db01:27017 db:admin cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.415+0300 D ASIO     [ReplicationExecutor] startCommand: RemoteCommand 117 -- target:production-db01:27017 db:admin expDate:2016-06-22T12:19:23.415+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.415+0300 D ASIO     [NetworkInterfaceASIO-Replication-0] Initiating asynchronous command: RemoteCommand 117 -- target:production-db01:27017 db:admin expDate:2016-06-22T12:19:23.415+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.415+0300 D ASIO     [NetworkInterfaceASIO-Replication-0] Starting asynchronous command 117 on host production-db01:27017
2016-06-22T12:19:13.417+0300 D EXECUTOR [NetworkInterfaceASIO-Replication-0] Received remote response: RemoteResponse --  cmd:{ ok: 1.0, electionTime: new Date(6293366302669012993), state: 1, v: 3, hbmsg: "", set: "prod", term: 5, primaryId: 0, durableOpTime: { ts: Timestamp 1466525772000|2012, t: 5 }, opTime: { ts: Timestamp 1466525772000|2012, t: 5 } }
2016-06-22T12:19:13.417+0300 D ASIO     [NetworkInterfaceASIO-Replication-0] Failed to time operation 117 out: Operation aborted.
2016-06-22T12:19:13.417+0300 D REPL     [ReplicationExecutor] Canceling election timeout callback at 2016-06-22T12:19:19.650+0300
2016-06-22T12:19:13.417+0300 D REPL     [ReplicationExecutor] Scheduling election timeout callback at 2016-06-22T12:19:24.755+0300
2016-06-22T12:19:13.833+0300 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg:  
2016-06-22T12:19:13.833+0300 D REPL     [ReplicationExecutor] Scheduling heartbeat to production-db01:27017 at 2016-06-22T09:19:18.417Z
2016-06-22T12:19:13.833+0300 D STORAGE  [WTJournalFlusher] flushed journal
2016-06-22T12:19:13.833+0300 D STORAGE  [conn13] WT rollback_transaction
2016-06-22T12:19:13.833+0300 D STORAGE  [conn13] WT begin_transaction
2016-06-22T12:19:13.846+0300 D STORAGE  [conn10] WT rollback_transaction
2016-06-22T12:19:13.846+0300 D STORAGE  [conn10] WT begin_transaction
2016-06-22T12:19:13.846+0300 D STORAGE  [conn12] WT rollback_transaction
2016-06-22T12:19:13.846+0300 D STORAGE  [conn12] WT begin_transaction
2016-06-22T12:19:13.846+0300 D STORAGE  [conn12] WT rollback_transaction
2016-06-22T12:19:13.846+0300 D STORAGE  [conn12] WT begin_transaction
2016-06-22T12:19:13.859+0300 D STORAGE  [conn12] WT rollback_transaction
2016-06-22T12:19:13.859+0300 D STORAGE  [conn12] WT begin_transaction
2016-06-22T12:19:13.859+0300 D STORAGE  [conn13] WT rollback_transaction
2016-06-22T12:19:13.859+0300 D STORAGE  [conn13] WT begin_transaction



 Comments   
Comment by Kelsey Schubert [ 24/May/17 ]

Hi KafKafOwn,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide the diagnostic.data with heapProfilingEnabled=true and we will reopen the ticket.

Regards,
Thomas

Comment by Kelsey Schubert [ 06/Apr/17 ]

Hi mango,

Thanks for reporting this issue. I've examined the metrics file you've uploaded and determined that this is a separate issue. So we can continue to investigate, would you please open a new ticket?

If you're able to reliably reproduce this issue, would you please enable the heap profiler as described above? After encountering this issue again, would you please upload the complete archive of the diagnostic.data as well as the complete log files?

Thank you for your help,
Thomas

Comment by Jan S. [ 04/Apr/17 ]

I also have memory problems while dumping the database. But unlike Or Khafi, I am already using 3.4, though with the MMAPv1 engine:

db version v3.4.3
git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1604
distarch: x86_64
target_arch: x86_64
The VM has 3.9GB RAM and mongo

While dumping the database, no other database operations were running. The system has 2GB of (nearly unused) swap.
I uploaded the "metrics.2017-04-04T07-56-28Z-00000" using your upload portal.

Comment by Kelsey Schubert [ 13/Mar/17 ]

Hi KafKafOwn,

Unfortunately, we have not yet been able to conclusively determine the root cause of this behavior from the diagnostic.data you provided. However, we have made significant improvements to MongoDB since 3.2.6 that may correct this issue. If this is still a concern for you, would you please upgrade to MongoDB 3.2.12?

MongoDB 3.2.12 also includes a new parameter that will allow us to collect additional diagnostic.data to better understand what is happening. If you encounter this issue on MongoDB 3.2.12, would you please reproduce with the following parameter set?

mongod --setParameter heapProfilingEnabled=true

Then, please upload the complete log files and the diagnostic.data to this new secure upload portal.
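
For reference, a sketch of how this could look on a CentOS 7 install using the RPM packages (the file path and service name are the package defaults and are assumptions here, not from the ticket):

# add the parameter to the config file (or pass it on the command line as above);
# if /etc/mongod.conf already has a setParameter section, put the key there instead
printf 'setParameter:\n  heapProfilingEnabled: true\n' | sudo tee -a /etc/mongod.conf
sudo systemctl restart mongod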

Thank you again for your help,
Thomas

Comment by Kelsey Schubert [ 08/Dec/16 ]

Hi Paolo,

I'd recommend upgrading to MongoDB 3.4 to take advantage of SERVER-23391.

Kind regards,
Thomas

Comment by Paolo [X] [ 08/Dec/16 ]

Hi guys, any update on this? I'm having the same issue on a Digital Ocean instance with SSD and 1GB RAM, running:

db version v3.2.10
git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1404
distarch: x86_64
target_arch: x86_64

Comment by Or Khafi [X] [ 22/Jun/16 ]

Uploaded.

Comment by Kelsey Schubert [ 22/Jun/16 ]

Certainly, I've created this secure portal for you to use.

Kind regards,
Thomas

Comment by Or Khafi [X] [ 22/Jun/16 ]

Hey Thomas,

I know you are not collecting any secret data in your diagnostics, but can you please give me a private link to upload the data?

Thanks,
Or

Comment by Kelsey Schubert [ 22/Jun/16 ]

Hi KafKafOwn,

Thanks for the additional information. So we can continue to investigate, would you please archive (tar or zip) the $dbpath/diagnostic.data directory and attach it to this ticket?

Also, would you please attach the output of top so we can see what other processes are running on your machine?
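
A sketch of commands that would produce both artifacts (the dbPath /var/lib/mongo is the CentOS package default and is an assumption here):

# archive the diagnostic.data directory from the dbPath
tar -czf diagnostic.data.tar.gz -C /var/lib/mongo diagnostic.data
# capture one snapshot of top in batch mode
top -b -n 1 > top-output.txt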

Thank you,
Thomas

Comment by Or Khafi [X] [ 22/Jun/16 ]

Out of memory: Kill process 16402 (mongod) score 765 or sacrifice child
Killed process 16402 (mongod) total-vm:2232172kB, anon-rss:1437888kB, file-rss:0kB

A few more details to help understand the Linux OOM killer's decision (these can be verified as sketched below):
1) We have no swap (EC2 instance default) and 2 GB of memory.
2) We are running only mongod and mongodump on the server.
3) /proc/sys/vm/overcommit_memory = 0.
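
For reference, a sketch of how these can be checked (assuming a standard CentOS 7 / EC2 environment):

free -m                              # total memory and swap (swap column shows 0 here)
swapon -s                            # prints nothing when no swap is configured
cat /proc/sys/vm/overcommit_memory   # 0 = heuristic overcommit (the kernel default)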

Comment by Ramon Fernandez Marina [ 22/Jun/16 ]

KafKafOwn, can you check your system logs for the following string?

Out of memory: Kill process XXXXX (mongod)
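
For example (log locations assume CentOS 7; adjust for your distribution):

grep -i "out of memory" /var/log/messages
# or via the kernel ring buffer
dmesg | grep -i "out of memory"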
