[SERVER-25299] shut down Created: 27/Jul/16  Updated: 24/Aug/16  Resolved: 24/Aug/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Ian Sharp [X] Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

linux ec2


Issue Links:
Duplicate
duplicates SERVER-24536 nodes can run an election while stepp... Closed
Operating System: ALL
Steps To Reproduce:

make 3 mongo instances (each on their own EC2 linux amazon AMI), behind an AWS load-balancer ELB using HTTPS and Amazon's SSL certs

Participants:

 Description   

a mongoDB instance shutdown while running as part of a replica set on an Amazon EC2 AMI instance

uname -a

Linux ip-........amzn1.x86_64 #1 SMP Wed Dec 17 18:36:15 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

df -h

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  3.1G  4.7G  40% /
devtmpfs        491M   56K  491M   1% /dev
tmpfs           499M     0  499M   0% /dev/shm

mongo.log

2016-07-26T01:34:28.320+0000 F -        [replExecDBWorker-1] Got signal: 6 (Aborted).
 
 0x12ff892 0x12fe7b9 0x12fefc2 0x7fd78198b5b0 0x7fd78160abe9 0x7fd78160bfe8 0x12863cb 0xebb750 0xee01a4 0xee72d6 0xee5772 0xeea439 0xe445ae 0xf18caa 0xf1997e 0x1291441 0x1291da9 0x1292900 0x1b2bdb0 0x7fd781983f18 0x7fd7816b9b2d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EFF892","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"EFE7B9"},{"b":"400000","o":"EFEFC2"},{"b":"7FD78197C000","o":"F5B0"},{"b":"7FD7815D7000","o":"33BE9","s":"gsignal"},{"b":"7FD7815D7000","o":"34FE8","s":"abort"},{"b":"400000","o":"E863CB","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"ABB750","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl48_updateMemberStateFromTopologyCoordinator_inlockEv"},{"b":"400000","o":"AE01A4","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl15_stepDownFinishERKNS_8executor12TaskExecutor12CallbackArgsERKNS3_11EventHandleE"},{"b":"400000","o":"AE72D6","s":"_ZN5mongo4repl19ReplicationExecutor12_doOperationEPNS_16OperationContextERKNS_6StatusERKNS_8executor12TaskExecutor14CallbackHandleEPSt4listINS1_8WorkItemESaISD_EEPSt5mutex"},{"b":"400000","o":"AE5772"},{"b":"400000","o":"AEA439"},{"b":"400000","o":"A445AE"},{"b":"400000","o":"B18CAA"},{"b":"400000","o":"B1997E","s":"_ZN5mongo4repl10TaskRunner9_runTasksEv"},{"b":"400000","o":"E91441","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE"},{"b":"400000","o":"E91DA9","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},{"b":"400000","o":"E92900","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKSs"},{"b":"400000","o":"172BDB0","s":"execute_native_thread_routine"},{"b":"7FD78197C000","o":"7F18"},{"b":"7FD7815D7000","o":"E2B2D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.5", "gitVersion" : "34e65e5383f7ea1726332cb175b73077ec4a1b02", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.14.27-25.47.amzn1.x86_64", "version" : "#1 SMP Wed Dec 17 18:36:15 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "528FE51A69E5765F1B0B9D9135E94AACED3A8FA2" }, { "b" : "7FFFCE2DA000", "elfType" : 3, "buildId" : "5B71B0789FD0186B57D4D168F54A1F8DB17B489F" }, { "b" : "7FD78289C000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "A5C6FCE56D21542B9535D5FBCFE74EBA71D24613" }, { "b" : "7FD7824B8000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "75565934551BECB21A027BE0A3AE28A62AEEB7EE" }, { "b" : "7FD7822B0000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "CC3700E5EACA3A520D1FD1BA3B469F2EFB1B8F11" }, { "b" : "7FD7820AC000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "3D5EB706F159BCCB9A3B3E5F528B512E2C500F90" }, { "b" : "7FD781DAE000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "226E54A756FDC288551092B4FDE818C8D6D6FC04" }, { "b" : "7FD781B98000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "00FA2883FB47B1327397BBF167C52F51A723D013" }, { "b" : "7FD78197C000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "A8E8015F84374807F6920DF95D16BBAAAC02AB41" }, { "b" : "7FD7815D7000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "D40AD0538A2FFAB96E5540EF73816C675CC0AF5E" }, { "b" : "7FD782B09000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "37F70DE2ABEA184F24518EB5CC538FF0F54E081F" }, { "b" : "7FD781394000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "72C1DB5E2447A90D1BF34065BCC031B7263FFBAC" }, { "b" : "7FD7810AF000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "2B8787E8E0C317CF820E5D830D923BC744E497F4" }, { "b" : "7FD780EAC000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "7D435F06640E70A09EA7C1A62D25CE3507435263" }, { "b" : "7FD780C81000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B10FBFEC246C4EAD1719D16090D0BE54904BBFC9" }, { "b" : "7FD780A6B000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "89C6AF118B6B4FB6A73AE1813E2C8BDD722956D1" }, { "b" : "7FD780860000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "7292C0673D7C116E3389D3FFA67087A6B9287A71" }, { "b" : "7FD78065D000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "37A58210FA50C91E09387765408A92909468D25B" }, { "b" : "7FD780446000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "F01971089728188EFA4AA473D8B2DE22CA6DE963" }, { "b" : "7FD780225000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "F5054DC94443326819FBF3065CFDF5E4726F57EE" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12ff892]
 mongod(+0xEFE7B9) [0x12fe7b9]
 mongod(+0xEFEFC2) [0x12fefc2]
 libpthread.so.0(+0xF5B0) [0x7fd78198b5b0]
 libc.so.6(gsignal+0x39) [0x7fd78160abe9]
 libc.so.6(abort+0x148) [0x7fd78160bfe8]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x12863cb]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl48_updateMemberStateFromTopologyCoordinator_inlockEv+0x340) [0xebb750]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl15_stepDownFinishERKNS_8executor12TaskExecutor12CallbackArgsERKNS3_11EventHandleE+0x84) [0xee01a4]
 mongod(_ZN5mongo4repl19ReplicationExecutor12_doOperationEPNS_16OperationContextERKNS_6StatusERKNS_8executor12TaskExecutor14CallbackHandleEPSt4listINS1_8WorkItemESaISD_EEPSt5mutex+0x206) [0xee72d6]
 mongod(+0xAE5772) [0xee5772]
 mongod(+0xAEA439) [0xeea439]
 mongod(+0xA445AE) [0xe445ae]
 mongod(+0xB18CAA) [0xf18caa]
 mongod(_ZN5mongo4repl10TaskRunner9_runTasksEv+0x8E) [0xf1997e]
 mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x121) [0x1291441]
 mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xA9) [0x1291da9]
 mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKSs+0x100) [0x1292900]
 mongod(execute_native_thread_routine+0x20) [0x1b2bdb0]
 libpthread.so.0(+0x7F18) [0x7fd781983f18]
 libc.so.6(clone+0x6D) [0x7fd7816b9b2d]
-----  END BACKTRACE  -----



 Comments   
Comment by Eric Milkie [ 08/Aug/16 ]

Hi Ian,
There can be many reasons for why your server is slow. However, keeping database objects alive, as you describe, is unlikely to be a cause. Instead, the culprit most likely lies with one of the usual three things being insufficient: CPU, I/O, or network. To diagnose further, we'd need to look into each of those things.

Comment by Ian Sharp [X] [ 30/Jul/16 ]

Thanks Eric Milkie, any ideas why the queries were taking 17 seconds, instead of 10's of milliseconds ?

i noticed some of the queries with several embedded fields taking a few seconds (but not 17). Or could it possibly be the fact that sometimes i open a database connection, then do a gauntlet of analytics on the data (taking sometimes up to 30 seconds) and then close the database connection? (probably doesn't matter since the database response still comes quickly? or does leaving the database connection open while analytics run, give the risk of replica set failovers and loss of service?) ...

for example, i may have a function that leads to 6 or so analytics callbacks, and i pass the database object onto each successive analytics function, so that i can keep querying the replicaset without 'closing & re-connecting to mongodb replica set again'. I thought this would be efficient, but perhaps this keeps the database object alive too long and i'd be better off getting a new database object by closing the previous connection and connecting to mongo again?

Comment by Eric Milkie [ 29/Jul/16 ]

Hi Ian,
Thank you for reporting this. It is a known bug, SERVER-24536. We are actively working to fix it.
To avoid this issue in the meantime, you can attempt to avoid overloading your instances. I noticed in the server log you provided, certain operations that normally take 10's of millliseconds were taking over 17 seconds (17,000 milliseconds) to complete; this level of responsiveness can result in replica set failovers and loss of service, regardless of the presence of this bug.

Comment by Ian Sharp [X] [ 29/Jul/16 ]

This is how i start each instance

sudo mongod --fork --port 27017 --replSet go --logpath ~/log/mongo.log

Comment by Ian Sharp [X] [ 29/Jul/16 ]

Andy Schwerin, I uploaded:

instance1_rs_conf.txt
instance2_rs_conf.txt
instance3_rs_conf.txt

Comment by Andy Schwerin [ 28/Jul/16 ]

DrIanSharp@gmail.com, can you share the replica set configuration for your replica set? I believe you can use rs.conf() in the shell to get it.

Comment by Ian Sharp [X] [ 28/Jul/16 ]

idk if I got low on RAM?

free -m
total used free shared buffers cached
Mem: 996 832 163 0 133 374
-/+ buffers/cache: 325 671
Swap: 0 0 0

Comment by Ian Sharp [X] [ 28/Jul/16 ]

i'm also pretty sure i followed the directions here correctly: https://docs.mongodb.com/ecosystem/platforms/amazon-ec2/

which helped mongo from shutting down in the past

Comment by Ian Sharp [X] [ 27/Jul/16 ]

Thanks Thomas Schubert. I uploaded mongo.log it's about 155MB

Comment by Kelsey Schubert [ 27/Jul/16 ]

Hi DrIanSharp@gmail.com,

I have created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Would you please upload the complete logs to this portal?

Thank you,
Thomas

Comment by Ian Sharp [X] [ 27/Jul/16 ]

Hi Andy Schwerin, there is no more log after the end of backtrace.

there is more log before it.
could i email the log to someone so the log is not posted publicly? (the log reveals schema definitions, collection names, etc)

Comment by Andy Schwerin [ 27/Jul/16 ]

Could you attach about 50 more log lines above what you posted, or the complete logs if you have them available?

Generated at Thu Feb 08 04:08:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.