[SERVER-19273] Short range scan using YCSB causes mongodb crash Created: 02/Jul/15  Updated: 25/Aug/15  Resolved: 25/Aug/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Milind Shah Assignee: Ramon Fernandez Marina
Resolution: Duplicate Votes: 1
Labels: crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

RAM: 128GB
Disks: 8 disks, 7200 RPM, 1TB - in a RAID 0 array
CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz - 16 physical cores, 32 HT cores
Network: 1x10gbe
OS: CentOS 6.5
JVM: 1.7.0_67


Attachments: File 2.0.9-workloade-mongodb-threads_5-conns_5-run.err     File 2.0.9-workloade-mongodb-threads_5-conns_5-run.out     File 2.0.9-workloade-mongodb-threads_5-conns_5-run.stack     File 2.2.7-workloade-mongodb-threads_5-conns_5-run.err     File 2.2.7-workloade-mongodb-threads_5-conns_5-run.out     File 2.2.7-workloade-mongodb-threads_5-conns_5-run.stack     File 2.4.14-workloade-mongodb-threads_5-conns_5-run.err     File 2.4.14-workloade-mongodb-threads_5-conns_5-run.out     File 3.0.4-workloade-mongodb-threads_5-conns_5-run.err     File 3.0.4-workloade-mongodb-threads_5-conns_5-run.out     File integration-test.sh     File mongod-2.0.9.log     File mongod-2.2.7.log     File mongod-2.4.14.log     File mongod-3.0.4.log    
Issue Links:
Duplicate
duplicates SERVER-17386 Cursor cache causes excessive memory ... Closed
Related
is related to JAVA-1882 Driver can exhaust query pool if limi... Closed
Operating System: Linux
Steps To Reproduce:

1) Insert 30 million rows, each with 1 field of 100 bytes
2) Run the YCSB scan with the following command:
-db com.yahoo.ycsb.db.MongoDbClient -P scan.zipfian.50Row -s -p recordcount=30000000 -p operationcount=30000000 -p maxexecutiontime=1200 -p threadcount=25 -p fieldcount=1 -p fieldlength=100 -p mongodatabase.url=mongodb://<hostname>:27016/ycsb?w=1 -p mongodb.database=ycsb -p mongodb.writeConcern=acknowledged

Participants:

 Description   

I am running YCSB on a 3 node setup.
Node 1) Runs 3 config servers
Node 2) Runs mongod and mongos
Node 3) Runs YCSB clients.

I insert 30 million rows as described in the blog post https://www.mongodb.com/blog/post/performance-testing-mongodb-30-part-1-throughput-improvements-measured-ycsb. Each row has 1 field with 100 bytes of data. After inserting the data, I run the YCSB short range scan with a scan length of 50. During the scan my mongod process went down, and in the mongod log I saw:

2015-07-01T17:32:41.506-0700 E STORAGE  [conn1] WiredTiger (12) [1435797161:506959][20793:0x7f03e896a700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-07-01T17:32:41.507-0700 I -        [conn1] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-07-01T17:32:41.507-0700 E STORAGE  [conn88] WiredTiger (12) [1435797161:507173][20793:0x7f03e1302700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-07-01T17:32:41.507-0700 I -        [conn88] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-07-01T17:32:41.524-0700 I CONTROL  [conn48]
 0xf77369 0xf140f1 0xef99fa 0xd8eae0 0xd8efb6 0xd8a0ce 0xd8a115 0xd78165 0xa83e58 0xa078e2 0xa0801d 0xa1e595 0x9fbb7d 0xbd5714 0xbd5ac4 0xba3a54 0xab4e90 0x7fb81d 0xf2883b 0x7f03f89559d1 0x7f03f74a786d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77369"},{"b":"400000","o":"B140F1"},{"b":"400000","o":"AF99FA"},{"b":"400000","o":"98EAE0"},{"b":"400000","o":"98EFB6"},{"b":"400000","o":"98A0CE"},{"b":"400000","o":"98A115"},{"b":"400000","o":"978165"},{"b":"400000","o":"683E58"},{"b":"400000","o":"6078E2"},{"b":"400000","o":"60801D"},{"b":"400000","o":"61E595"},{"b":"400000","o":"5FBB7D"},{"b":"400000","o":"7D5714"},{"b":"400000","o":"7D5AC4"},{"b":"400000","o":"7A3A54"},{"b":"400000","o":"6B4E90"},{"b":"400000","o":"3FB81D"},{"b":"400000","o":"B2883B"},{"b":"7F03F894E000","o":"79D1"},{"b":"7F03F73BF000","o":"E886D"}],"processInfo":{ "mongodbVersion" : "3.0.4", "gitVersion" : "0481c958daeb2969800511e7475dc66986fa9ed5", "uname" : { "sysname" : "Linux", "release" : "2.6.32-431.29.2.el6.x86_64", "version" : "#1 SMP Tue Sep 9 21:36:05 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "7702B4740E91E1BA6F701DB2A42553968AC09562" }, { "b" : "7FFF97C38000", "elfType" : 3, "buildId" : "5474F0D8DAF3D6177E2C4B06F3892745CB43B4D5" }, { "b" : "7F03F894E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "211321F78CA244BE2B2B1B8584B460F9933BA76B" }, { "b" : "7F03F86E2000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "40BEA6554E64FC0C3D5C7D0CD91362730515102F" }, { "b" : "7F03F82FF000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "FC4EFD7502ACB3B9D213D28272D15A165857AD5A" }, { "b" : "7F03F80F7000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "B26528BF6C0636AC1CAE5AC50BDBC07E60851DF4" }, { "b" : "7F03F7EF3000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "AFC7448F2F2F6ED4E5BC82B1BD8A7320B84A9D48" }, { "b" : "7F03F7BED000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "F07F2E7CF4BFB393CC9BBE8CDC6463652E14DB07" }, { "b" : "7F03F7969000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "98B028A725D6E93253F25DF00B794DFAA66A3145" }, { "b" : "7F03F7753000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "b" : "7F03F73BF000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "59640F8CD5A70CF0391A7C64275D84336935E6AA" }, { "b" : "7F03F8B6B000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "57BF668F99B7F5917B8D55FBB645173C9A644575" }, { "b" : "7F03F717B000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "54BA6B78A9220344E77463947215E42F0EABCC62" }, { "b" : "7F03F6E95000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6797403AA5F8FAD8ADFF683478B45F528CE4FB0E" }, { "b" : "7F03F6C91000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECAC64E4A57AB826" }, { "b" : "7F03F6A65000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "b" : "7F03F684F000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "b" : "7F03F6644000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "E3FA235F3BA3F776A01A18ECA737C9890F445923" }, { "b" : "7F03F6441000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7F03F6227000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "A91A53E16DEABDFE05F28F7D04DAB5FFAA013767" }, { "b" : "7F03F6008000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : 
"E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77369]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf140f1]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xef99fa]
 mongod(_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionii+0xA0) [0xd8eae0]
 mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0x4C6) [0xd8efb6]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit10getSessionEPNS_16OperationContextE+0x3E) [0xd8a0ce]
 mongod(_ZN5mongo16WiredTigerCursorC1ERKSsmbPNS_16OperationContextE+0x35) [0xd8a115]
 mongod(_ZNK5mongo21WiredTigerIndexUnique9newCursorEPNS_16OperationContextEi+0x55) [0xd78165]
 mongod(_ZNK5mongo22BtreeBasedAccessMethod9newCursorEPNS_16OperationContextERKNS_13CursorOptionsEPPNS_11IndexCursorE+0x28) [0xa83e58]
 mongod(_ZN5mongo9IndexScan13initIndexScanEv+0x62) [0xa078e2]
 mongod(_ZN5mongo9IndexScan4workEPm+0x4D) [0xa0801d]
 mongod(_ZN5mongo16ShardFilterStage4workEPm+0x55) [0xa1e595]
 mongod(_ZN5mongo10FetchStage4workEPm+0xCD) [0x9fbb7d]
 mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbd5714]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbd5ac4]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0xA74) [0xba3a54]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab4e90]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x7fb81d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf2883b]
 libpthread.so.0(+0x79D1) [0x7f03f89559d1]
 libc.so.6(clone+0x6D) [0x7f03f74a786d]
-----  END BACKTRACE  -----
2015-07-01T17:32:41.524-0700 I -        [conn48] 



 Comments   
Comment by Ramon Fernandez Marina [ 29/Jul/15 ]

I forgot to add that SERVER-17386 fixes the issue of cursors consuming excessive amounts of memory on WiredTiger. The fix is included in 3.0.5, which was released earlier this week, so rather than trying the workarounds posted above I'd encourage users to try 3.0.5 first.

Cheers,
Ramón.

Comment by Ramon Fernandez Marina [ 29/Jul/15 ]

milindshah, the error message

only configured to support 20010 sessions (including 10 internal): Cannot allocate memory

appeared earlier in SERVER-17364.

You may be able to work around it by increasing session_max, for example:

--wiredTigerEngineConfigString="session_max=40000"

Note that this will increase memory consumption. The second alternative is to set a low timeout for idle cursors; for example, to close cursors after 60 seconds of inactivity:

use admin
db.runCommand({setParameter:1, cursorTimeoutMillis: 60000})

Please note that setting a low value for cursorTimeoutMillis may negatively impact applications that rely on idle cursors remaining open for longer than cursorTimeoutMillis.
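For completeness, the same parameter can also be set from application code. Below is a minimal sketch using the 3.x Java driver (host, port, and the 60-second value are illustrative assumptions, mirroring the shell example above):

import com.mongodb.MongoClient;
import org.bson.Document;

public class SetCursorTimeout {
    public static void main(String[] args) {
        // Connect to the mongod/mongos under test (host and port are assumptions).
        MongoClient client = new MongoClient("localhost", 27016);
        try {
            // Equivalent of the shell command above: time out idle cursors after 60 seconds.
            Document result = client.getDatabase("admin").runCommand(
                    new Document("setParameter", 1).append("cursorTimeoutMillis", 60000));
            System.out.println(result.toJson());
        } finally {
            client.close();
        }
    }
}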

robert.j.moore@allanbank.com, after using 3.0.3-SNAPSHOT, does the server still abort with those Cannot allocate memory errors?

Thanks,
Ramón.

Comment by J Rassi [ 09/Jul/15 ]

martin.bligh: could you weigh in for triage, please? I don't have much context on this issue, but Mathias suggests that this is either a dup of SERVER-17364, or that it is caused by a defect in the WiredTiger session cache that he believes you are working on.

Comment by Jeffrey Yemin [ 08/Jul/15 ]

I reproduced this issue and discovered a defect in the Java driver, which is now linked to this ticket.

Comment by Robert Moore [ 08/Jul/15 ]

I spoke too soon. The stalls are still happening with all of the threads in:

	- parking to wait for  <0x000000076d4fa6d8> (a java.util.concurrent.Semaphore$FairSync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
	at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
	at com.mongodb.internal.connection.ConcurrentPool.acquirePermit(ConcurrentPool.java:180)
	at com.mongodb.internal.connection.ConcurrentPool.get(ConcurrentPool.java:126)

Comment by Robert Moore [ 08/Jul/15 ]

jeff.yemin -

I just ran the same tests with the 3.0.3-SNAPSHOT jar and it did not stall during the scan operations.

Could the extra calls have been causing the connections to not get returned to the connection pool?

Rob.

Comment by Jeffrey Yemin [ 08/Jul/15 ]

robert.j.moore@allanbank.com the "killcursors: found 0 of 1" messages are due to JAVA-1854, which is fixed in the 3.0.3-SNAPSHOT release (https://oss.sonatype.org/content/repositories/snapshots/org/mongodb/mongo-java-driver/3.0.3-SNAPSHOT/). Essentially, when a limit is set on a query, the driver sends OP_KILL_CURSOR twice for the same cursor, once when the limit is reached and once when MongoCursor#close is called.
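For reference, the access pattern described above reduces to something like the following sketch (3.x Java driver; the database, collection, and filter values are illustrative, not taken from the YCSB source). On driver versions affected by JAVA-1854, the cursor below is killed once when the limit is reached and again when close() is called:

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ScanSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27016);  // host/port assumed
        try {
            MongoCollection<Document> coll =
                    client.getDatabase("ycsb").getCollection("usertable");

            // Short range scan with a limit, similar to what the YCSB "scan" operation issues.
            MongoCursor<Document> cursor =
                    coll.find(Filters.gte("_id", "user1000")).limit(50).iterator();
            try {
                while (cursor.hasNext()) {
                    cursor.next();   // limit reached here -> first OP_KILL_CURSOR
                }
            } finally {
                cursor.close();      // second OP_KILL_CURSOR on affected driver versions
            }
        } finally {
            client.close();
        }
    }
}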

Comment by Mark Callaghan [ 07/Jul/15 ]

problem being discussed as next YCSB release is being tested
https://github.com/brianfrankcooper/YCSB/issues/266#issuecomment-112930542

Comment by Asya Kamsky [ 06/Jul/15 ]

Ah, sorry, I just saw 25 threads in the repro section.

Will try that.

Comment by Asya Kamsky [ 06/Jul/15 ]

computingfreak@gmail.com

What's the exact YCSB command - how many threads are you using? I'd like to reproduce this here.

Comment by Robert Moore [ 05/Jul/15 ]

Attaching the script I used to run the tests. It downloads YCSB, builds it, and runs the workloads. Feel free to modify it to run just the mongodb driver with 5 threads and 5 connections, which shows the worst behaviour.

Run with:

sh integration-test.sh 0.2.0-RC3

Comment by Robert Moore [ 05/Jul/15 ]

The tests finally finished and I see the same behaviour for the cursors with 3.0.4:

2015-07-05T11:59:39.256-0400 I COMMAND  [conn647] killcursors: found 0 of 1
2015-07-05T11:59:39.258-0400 I COMMAND  [conn645] killcursors: found 0 of 1
2015-07-05T11:59:39.258-0400 I COMMAND  [conn646] killcursors: found 0 of 1
2015-07-05T11:59:39.258-0400 I COMMAND  [conn646] killcursors: found 0 of 1

Comment by Robert Moore [ 05/Jul/15 ]

milindshah -

As a workaround you can use the `mongodb-async` driver with YCSB. You might also see a performance increase as a result.

Just change

  ycsb run mongodb ...

to

  ycsb run mongodb-async ...

Rob.

Comment by Robert Moore [ 05/Jul/15 ]

Logs for 2.2.7 and 2.4.14. 2.6.10 is showing the same behaviour but is still running. I don't have a stack trace for 2.4.14 since it happened in the middle of the night.

Comment by Robert Moore [ 05/Jul/15 ]

asya -

I am seeing something strange with the out-of-the-box workloade using the MongoDB Inc. driver. After having jeff.yemin look at the code I am pretty sure the cursors are getting closed, or at least the driver is trying to close them, but I am not convinced they are the right cursors.

I am going through all of the production MongoDB versions since 1.8, testing with YCSB 0.2.0-RC3, and for workloade I am getting timeouts trying to get connections. I will attach the logs as I get them, but for the 2.0.9 server I see this from the script:

2015-07-04 20:53:11: 2.0.9    : Load: workloade with mongodb-async thread=  5 conns= 5 -->         27 seconds
2015-07-04 20:53:38: 2.0.9    :  Run: workloade with mongodb-async thread=  5 conns= 5 -->         26 seconds
2015-07-04 20:54:05: 2.0.9    : Load: workloade with mongodb       thread=  5 conns= 5 -->         26 seconds
2015-07-04 20:54:31: 2.0.9    :  Run: workloade with mongodb       thread=  5 conns= 5 -->      14786 seconds
2015-07-05 01:00:57: 2.0.9    : Load: workloade with mongodb-async thread= 10 conns= 5 -->         26 seconds
2015-07-05 01:01:23: 2.0.9    :  Run: workloade with mongodb-async thread= 10 conns= 5 -->         27 seconds
2015-07-05 01:01:50: 2.0.9    : Load: workloade with mongodb       thread= 10 conns= 5 -->         26 seconds
2015-07-05 01:02:16: 2.0.9    :  Run: workloade with mongodb       thread= 10 conns= 5 -->        246 seconds
2015-07-05 01:06:22: 2.0.9    : Load: workloade with mongodb-async thread= 10 conns=10 -->         26 seconds
2015-07-05 01:06:48: 2.0.9    :  Run: workloade with mongodb-async thread= 10 conns=10 -->         26 seconds
2015-07-05 01:07:14: 2.0.9    : Load: workloade with mongodb       thread= 10 conns=10 -->         27 seconds
2015-07-05 01:07:41: 2.0.9    :  Run: workloade with mongodb       thread= 10 conns=10 -->        625 seconds
2015-07-05 01:18:07: 2.0.9    : Load: workloade with mongodb-async thread= 20 conns=10 -->         26 seconds
2015-07-05 01:18:33: 2.0.9    :  Run: workloade with mongodb-async thread= 20 conns=10 -->         26 seconds
2015-07-05 01:18:59: 2.0.9    : Load: workloade with mongodb       thread= 20 conns=10 -->         26 seconds
2015-07-05 01:19:25: 2.0.9    :  Run: workloade with mongodb       thread= 20 conns=10 -->        246 seconds

(Ignore the hour jumping from 20 to 01; I am using the wrong date expression in the script. It should read 20 to 21 hours.)

In the MongoDB logs for the 2015-07-04 20:54:31 period I see lots of:

Sat Jul  4 20:08:38 [conn625] killcursors: found 0 of 1
Sat Jul  4 20:08:38 [conn625] ClientCursor::find(): cursor not found in map 3603954716566437622 (ok after a drop)
killcursors: found 0 of 1

It appears from the YCSB logs (which I will attach) that the driver lost track of the connections completely, which caused all of the threads to eventually time out waiting for a connection. I'll attach a stack trace showing that the threads are all waiting for a connection.

I know that 2.0.9 is ancient, but I am seeing the same thing with 2.2.7, so I think it is something systemic. I will report back once all of the server versions finish, and note whether I see the same thing on more recent versions.

Rob.

Comment by Milind Shah [ 03/Jul/15 ]

Thanks Asya. I am using the latest YCSB from https://github.com/brianfrankcooper/YCSB. I will not use the scan test in that case.

Comment by Asya Kamsky [ 02/Jul/15 ]

milindshah

Based on the Google Groups discussion, if you want to use the "scan" operation, you should NOT be using a hashed shard key; hashing the shard key removes the ability to do efficient range scans over the index.
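To illustrate the difference: with a ranged shard key, adjacent _id values live in the same chunk, so a 50-document scan can be served by a single ordered index scan on one shard, whereas a hashed key scatters adjacent values across shards. A minimal sketch of sharding the YCSB collection on a ranged key, using the same 3.x Java driver assumed elsewhere on this ticket (namespace and key are assumptions):

import com.mongodb.MongoClient;
import org.bson.Document;

public class ShardKeySketch {
    public static void main(String[] args) {
        MongoClient mongos = new MongoClient("localhost", 27016);  // mongos host/port assumed
        try {
            // Enable sharding for the ycsb database.
            mongos.getDatabase("admin").runCommand(
                    new Document("enableSharding", "ycsb"));

            // Ranged shard key on _id: contiguous ids stay together, so short
            // range scans remain efficient.
            mongos.getDatabase("admin").runCommand(
                    new Document("shardCollection", "ycsb.usertable")
                            .append("key", new Document("_id", 1)));

            // A hashed key ({_id: "hashed"}) would instead spread adjacent ids
            // across shards, defeating efficient range scans.
        } finally {
            mongos.close();
        }
    }
}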

In addition, it appears that the official YCSB master may have introduced a cursor leak in the scan method, so I would recommend not including the "scan" operation in your tests (at least for the moment).

Asya

Comment by Asya Kamsky [ 02/Jul/15 ]

Which YCSB version are you using for these tests?
