[SERVER-16998] rc5 mongod stops reading client TCP socket Created: 22/Jan/15  Updated: 03/Apr/15  Resolved: 03/Apr/15

Status: Closed
Project: Core Server
Component/s: Concurrency, Networking, Storage
Affects Version/s: 2.8.0-rc5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Quentin Conner Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Longevity workload (at least 12 hours?) with sysbench or YCSB.
Both of these use the Java 2.12.4 but symptoms also show with mongo.

Participants:

 Description   

Seen with sysbench and YCSB workloads during standalone, single-node longevity testing, the mongod process can enter a persistent error condition where it appears to stop reading sockets from client connections. sysbench and YCSB use the 2.12.4 java driver. Client and server are on the same machine.

Nothing useful logged on server log. Tried db.runCommand(

{ setParameter: 1, logLevel: 5 }

) from mongo client but still nothing.

mongo client can run "show collections" but, after displaying the collection anems, hangs.

netstat shows evidence of waiting data in a socket for mongod:

[qconner@slave-4.perf.ny.cbi ~/apps/mongo]$ netstat -n|grep ESTAB
tcp        0      0 172.30.200.12:60219         184.106.28.85:443           ESTABLISHED
tcp        0      0 172.30.200.12:46284         54.209.155.229:3232         ESTABLISHED
tcp   221981      0 127.0.0.1:27017             127.0.0.1:40262             ESTABLISHED
tcp   221981      0 127.0.0.1:27017             127.0.0.1:40256             ESTABLISHED
tcp        0      0 172.30.200.12:22            172.30.200.8:61008          ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40261             ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40253             ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40257             ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40259             ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40258             ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40255             ESTABLISHED
tcp        0      0 172.30.200.12:56842         68.203.14.106:4506          ESTABLISHED
tcp        0      0 172.30.200.12:45664         68.203.14.106:4505          ESTABLISHED
tcp        0      0 172.30.200.12:57205         54.209.155.229:3233         ESTABLISHED
tcp        0      0 127.0.0.1:27017             127.0.0.1:40254             ESTABLISHED
tcp        0      0 172.30.200.12:58762         68.203.14.106:4506          ESTABLISHED

strace of mongod shows only calls to poll: or select():

[qconner@slave-4.perf.ny.cbi ~/apps/mongo]$ strace -p 26297
Process 26297 attached - interrupt to quit
select(8, [6 7], NULL, NULL, {0, 5891}) = 0 (Timeout)
select(8, [6 7], NULL, NULL, {0, 10000}) = 0 (Timeout)
select(8, [6 7], NULL, NULL, {0, 10000}) = 0 (Timeout)
select(8, [6 7], NULL, NULL, {0, 10000}) = 0 (Timeout)
select(8, [6 7], NULL, NULL, {0, 10000}) = 0 (Timeout)
select(8, [6 7], NULL, NULL, {0, 10000}) = 0 (Timeout)

Bottom of mongod log file is unremarkable:

2015-01-21T22:50:13.239+0000 I COMMAND  [conn5] command sbtest.$cmd command: insert { insert: "sbtest6", ordered: true, documents: 1000 } keyUpdates:0 writeConflicts:0 n
umYields:0 reslen:40 1294ms
2015-01-21T22:50:17.213+0000 I WRITE    [conn5] insert sbtest.sbtest6 query: { _id: 17816202, k: 47394, c: "38386020571-30959444150-14559542433-02413298585-47740660800-2
7829143017-40155805950-73693273374-88438959896-98232277051", pad: "25264851037-22995234305-81217374636-99284513390-23152612121" } ninserted:1 keyUpdates:0 writeConflicts
:0 numYields:0 143ms
2015-01-21T22:50:17.242+0000 I COMMAND  [conn5] command sbtest.$cmd command: insert { insert: "sbtest6", ordered: true, documents: 1000 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:40 179ms
2015-01-22T16:32:57.907+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42321 #10 (10 connections now open)
2015-01-22T16:33:07.237+0000 I NETWORK  [conn10] end connection 127.0.0.1:42321 (9 connections now open)
2015-01-22T16:33:08.497+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42322 #11 (10 connections now open)
2015-01-22T16:34:13.257+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42324 #12 (11 connections now open)
2015-01-22T16:34:13.258+0000 I NETWORK  [conn12] end connection 127.0.0.1:42324 (10 connections now open)
2015-01-22T16:34:49.896+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42326 #13 (11 connections now open)
2015-01-22T16:50:15.766+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42345 #14 (12 connections now open)
2015-01-22T16:50:15.766+0000 I NETWORK  [conn14] end connection 127.0.0.1:42345 (11 connections now open)
2015-01-22T16:52:55.395+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42375 #15 (12 connections now open)
2015-01-22T16:53:10.491+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42376 #16 (13 connections now open)
2015-01-22T16:53:10.492+0000 I NETWORK  [conn16] end connection 127.0.0.1:42376 (12 connections now open)
2015-01-22T16:55:48.084+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:42380 #17 (13 connections now open)

Sorry I don't have more to go on. Running it again with symbols so gdb attach can be leveraged.



 Comments   
Comment by Quentin Conner [ 22/Jan/15 ]

sorry this is RC5. We will be re-testing of course with RC6. Description updated.

Generated at Thu Feb 08 03:42:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.