[SERVER-13495] Concurrent GETMORE and KILLCURSORS operations can cause race condition and server crash | Created: 04/Apr/14 | Updated: 11/Jul/16 | Resolved: 07/Apr/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.6.0-rc3 |
| Fix Version/s: | 2.6.1, 2.7.0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | Eliot Horowitz (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||
| Steps To Reproduce: | Run Motor's test_del_on_main_greenlet test. Motor kills a cursor on one connection while, on another connection, it continuously issues OP_GETMORE with that cursorId to detect when the cursor has died. |
||||
| Description |
| Comments |
| Comment by Githook User [ 09/Apr/14 ] | |||
|
Author: Eliot Horowitz (erh) <eliot@10gen.com> Message: (cherry picked from commit 97bead396f78b168eae2774af5b784827d8341c6) | |||
| Comment by A. Jesse Jiryu Davis [ 07/Apr/14 ] | |||
|
With the fix in place, my script has run for more than 500 seconds without reproducing the segfault. | |||
| Comment by Githook User [ 07/Apr/14 ] | |||
|
Author: Eliot Horowitz (erh) <eliot@10gen.com> Message: | |||
| Comment by A. Jesse Jiryu Davis [ 07/Apr/14 ] | |||
|
This script reproduces the crash more quickly by making many attempts per second. On my MacBook Pro, the first run crashed mongod after 72 seconds, the second after 109 seconds, and the third after 83 seconds. Run it with Python 2.6 or 2.7 and PyMongo 2.7. Each attempt creates a cursor, then sends OP_KILLCURSORS on the main thread while a worker thread sends OP_GETMORE repeatedly. The attempt prints something like:
"OP_GETMORE" indicates one successful getmore from the worker thread before the cursor died. "cursor dead" indicates a failed getmore from the worker thread, which means the main thread's OP_KILLCURSORS has completed; the script then ends this attempt and begins the next. It also prints the running time so far, in seconds. In some attempts one OP_GETMORE finishes before the cursor dies; in others the cursor dies too quickly for any OP_GETMORE to complete. When mongod crashes, the script exits with "Connection refused". | |||
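For readers without the attached script, the shape of one attempt can be sketched roughly as follows. This is only an illustration under stated assumptions: FakeServer, open_cursor, get_more, kill_cursor, attempt and run are hypothetical names, not PyMongo APIs, and the in-memory stand-in is lock-protected, so it cannot reproduce the mongod crash. The real script sends OP_KILLCURSORS and OP_GETMORE over two separate connections to a live server. The summary line it prints is also an assumed format, not the original script's output.

```python
import threading
import time


class FakeServer(object):
    """In-memory stand-in for mongod (hypothetical; not a PyMongo API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cursors = set()
        self._next_id = 1

    def open_cursor(self):
        with self._lock:
            cursor_id = self._next_id
            self._next_id += 1
            self._cursors.add(cursor_id)
            return cursor_id

    def get_more(self, cursor_id):
        # True while the cursor is alive, False once it has been killed.
        with self._lock:
            return cursor_id in self._cursors

    def kill_cursor(self, cursor_id):
        with self._lock:
            self._cursors.discard(cursor_id)


def attempt(server):
    """One attempt: kill on the calling thread, getmore on a worker thread."""
    cursor_id = server.open_cursor()
    events = []

    def worker():
        # Poll until a getmore fails, which means the kill has completed.
        while server.get_more(cursor_id):
            events.append("OP_GETMORE")
        events.append("cursor dead")

    thread = threading.Thread(target=worker)
    thread.start()
    server.kill_cursor(cursor_id)
    thread.join()
    return events


def run(attempts):
    """Run several attempts and print a one-line summary for each."""
    server = FakeServer()
    start = time.time()
    results = []
    for _ in range(attempts):
        events = attempt(server)
        results.append(events)
        elapsed = time.time() - start
        print("%d OP_GETMORE, then %s (%.2f s)"
              % (len(events) - 1, events[-1], elapsed))
    return results
```

Calling `run(5)` performs five attempts. As in the description above, some attempts may record one or more successful getmores before "cursor dead", and others none, depending on thread scheduling.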
| Comment by A. Jesse Jiryu Davis [ 04/Apr/14 ] | |||
|
It's rare: the test has passed a dozen or more times with the same configuration. |