[SERVER-13495] Concurrent GETMORE and KILLCURSORS operations can cause race condition and server crash Created: 04/Apr/14  Updated: 11/Jul/16  Resolved: 07/Apr/14

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.6.0-rc3
Fix Version/s: 2.6.1, 2.7.0

Type: Bug Priority: Critical - P2
Reporter: A. Jesse Jiryu Davis Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER-13495.py     Text File db27017.log    
Issue Links:
Related
Operating System: ALL
Backport Completed:
Steps To Reproduce:

Running Motor's test_del_on_main_greenlet test. Motor kills a cursor on one connection while, on a second connection, it continuously issues OP_GETMORE with that cursorId to detect when the cursor has died.
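
For reference, a minimal PyMongo sketch of the same access pattern (this is not Motor's actual test; the database and collection names are illustrative, and the availability of MongoClient.kill_cursors and Cursor.cursor_id varies by driver version):

import pymongo
from pymongo.errors import OperationFailure

reader = pymongo.MongoClient()   # connection that keeps issuing OP_GETMORE
killer = pymongo.MongoClient()   # connection that issues OP_KILLCURSORS

coll = reader.motor_test.test_collection    # assumed to already hold several batches of documents
cursor = coll.find().batch_size(2)          # small batches force repeated getmores
next(cursor)                                # first batch: opens the cursor on the server
cursor_id = cursor.cursor_id                # property name may differ by driver version

killer.kill_cursors([cursor_id])            # OP_KILLCURSORS from the other connection

while True:                                 # poll with getmores until the server reports the cursor gone
    try:
        next(cursor)
    except OperationFailure:                # e.g. cursor not found: the kill has taken effect
        break
    except StopIteration:                   # cursor exhausted before the kill landed
        break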

Participants:

 Description   
Issue Status as of April 15, 2014

ISSUE SUMMARY
Issuing an OP_GETMORE at the same time as an OP_KILLCURSORS for the same cursor can, in very rare cases, trigger a race condition that dereferences an invalid pointer in the server and crashes the process.

USER IMPACT
The likelihood of hitting this race is very low, but when it is triggered the bug crashes the running mongod process.

WORKAROUNDS
None

RESOLUTION
In CollectionCursorCache, access to ClientCursor::_pinValue is now guarded by a mutex.

AFFECTED VERSIONS
Version 2.6.0 is affected by this bug.

PATCHES
The patch is included in the 2.6.1 production release.

Original description

Running Motor's test suite, a primary running version v2.6.0-rc4-pre- (hash 8a71a0) crashed while killing a cursor. The log ends:

2014-04-04T22:22:13.936+0000 [conn224] query motor_test.test_collection planSummary: COLLSCAN cursorid:189743471636 ntoreturn:0 ntoskip:0 keyUpdates:0 numYields:0 locks(micros) r:380 nreturned:101 reslen:1434 0ms
2014-04-04T22:22:13.937+0000 [conn224] motor_test.test_collection warning assertion failure false src/mongo/db/clientcursor.cpp 127
2014-04-04T22:22:13.937+0000 [conn34] getmore motor_test.test_collection cursorid:189743471636 ntoreturn:0 keyUpdates:0 numYields:0 locks(micros) r:25 nreturned:0 reslen:20 0ms
2014-04-04T22:22:13.937+0000 [conn34] run command motor_test.$cmd { delete: "test_collection", deletes: [ { q: {}, limit: 0 } ] }
2014-04-04T22:22:13.937+0000 [conn34] parseAndRemoveImpersonatedUserField: command: { delete: "test_collection", deletes: [ { q: {}, limit: 0 } ] }
2014-04-04T22:22:13.943+0000 [conn224] motor_test.test_collection 0xe63563 0xe1de60 0xe067e8 0x9337d8 0x91023c 0x9105b1 0xaa52cf 0xaaa4e0 0x84955b 0xe2a3fc 0x7f8676c52f18 0x7f867591ae0d
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0xe63563]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo10logContextEPKc+0x190) [0xe1de60]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0x118) [0xe067e8]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo12ClientCursorD1Ev+0xf8) [0x9337d8]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo19GlobalCursorIdCache11eraseCursorExb+0x26c) [0x91023c]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo21CollectionCursorCache29eraseCursorGlobalIfAuthorizedEiPx+0x31) [0x9105b1]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xcf) [0xaa52cf]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x5d0) [0xaaa4e0]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x8b) [0x84955b]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x38c) [0xe2a3fc]
 /lib64/libpthread.so.0(+0x7f18) [0x7f8676c52f18]
 /lib64/libc.so.6(clone+0x6d) [0x7f867591ae0d]
2014-04-04T22:22:13.943+0000 [conn224] SEVERE: Invalid access at address: 0x17
2014-04-04T22:22:13.956+0000 [conn224] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0xe63563 0xe630b8 0xe63156 0x7f8676c5a5b0 0x9337e4 0x91023c 0x9105b1 0xaa52cf 0xaaa4e0 0x84955b 0xe2a3fc 0x7f8676c52f18 0x7f867591ae0d
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0xe63563]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod() [0xe630b8]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod() [0xe63156]
 /lib64/libpthread.so.0(+0xf5b0) [0x7f8676c5a5b0]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo12ClientCursorD1Ev+0x104) [0x9337e4]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo19GlobalCursorIdCache11eraseCursorExb+0x26c) [0x91023c]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo21CollectionCursorCache29eraseCursorGlobalIfAuthorizedEiPx+0x31) [0x9105b1]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xcf) [0xaa52cf]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x5d0) [0xaaa4e0]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x8b) [0x84955b]
 /mnt/jenkins/mongodb/unstable/unstable-release/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x38c) [0xe2a3fc]
 /lib64/libpthread.so.0(+0x7f18) [0x7f8676c52f18]
 /lib64/libc.so.6(clone+0x6d) [0x7f867591ae0d]



 Comments   
Comment by Githook User [ 09/Apr/14 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: SERVER-13495: fix ClientCursor::_pinValue concurrency

(cherry picked from commit 97bead396f78b168eae2774af5b784827d8341c6)
Branch: v2.6
https://github.com/mongodb/mongo/commit/1bdf9bb5115bdc993828dc9addc27dd8e4488e1b

Comment by A. Jesse Jiryu Davis [ 07/Apr/14 ]

With the fix in place, my script has run for more than 500 seconds without reproducing the segfault.

Comment by Githook User [ 07/Apr/14 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: SERVER-13495: fix ClientCursor::_pinValue concurrency
Branch: master
https://github.com/mongodb/mongo/commit/97bead396f78b168eae2774af5b784827d8341c6

Comment by A. Jesse Jiryu Davis [ 07/Apr/14 ]

This script reproduces the crash more quickly by making many attempts per second. My first run crashed mongod after 72 seconds on my MacBook Pro, the second after 109 seconds, and the third after 83 seconds.

Run it with Python 2.6 or 2.7 and PyMongo 2.7.

Each attempt creates a cursor, then sends OP_KILLCURSORS from the main thread while a worker thread repeatedly sends OP_GETMORE. Each attempt prints something like:

OP_GETMORE
cursor dead
109.50 sec

"OP_GETMORE" indicates one successful getmore from the worker thread before the cursor died. "cursor dead" indicates a failed getmore from the worker thread; the main thread's OP_KILLCURSORS has completed. The script will end this attempt and begin the next. It prints the running time so far, in seconds.

Some attempts show that one OP_GETMORE finished before the cursor died, in other attempts no OP_GETMOREs complete, the cursor dies too quickly.

When mongod crashes the script exits with "Connection refused".
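
The attached SERVER-13495.py is the authoritative reproduction; below is only a rough sketch of the shape of one attempt, with illustrative names (and PyMongo calls whose availability varies by driver version), to show how the worker thread's getmores race the main thread's OP_KILLCURSORS:

import threading
import time
import pymongo
from pymongo.errors import OperationFailure

client = pymongo.MongoClient()
coll = client.motor_test.test_collection    # assumed to hold several batches of documents
start = time.time()

def attempt():
    cursor = coll.find().batch_size(2)
    next(cursor)                            # open a server-side cursor
    cursor_id = cursor.cursor_id            # property name may differ by driver version

    def getmores():
        # Worker thread: keep issuing getmores until the kill lands.
        try:
            while True:
                next(cursor)
                print('OP_GETMORE')
        except (OperationFailure, StopIteration):
            print('cursor dead')

    worker = threading.Thread(target=getmores)
    worker.start()
    client.kill_cursors([cursor_id])        # main thread: OP_KILLCURSORS races the getmores
    worker.join()

while True:                                 # repeat until mongod crashes ("Connection refused")
    attempt()
    print('%.2f sec' % (time.time() - start))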

Comment by A. Jesse Jiryu Davis [ 04/Apr/14 ]

It's rare: the test has passed a dozen or more times with the same configuration.
