[SERVER-42823] Output between print() function and ProgramOutputMultiplexer::appendLine() can be interleaved, leading to a deadlock on Windows DEBUG Created: 14/Aug/19  Updated: 07/Dec/21  Resolved: 26/Mar/20

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Gabriel Russell (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-43997 allow the crtDebugCallback to be call... Closed
Operating System: ALL
Steps To Reproduce:

python buildscripts/resmoke.py --suite=no_server repro_interleaved_output.js | sed 's/x\+/xxx/g'

repro_interleaved_output.js

(function() {
    'use strict';
 
    // _INTERNAL_BUFSIZ=4096 is the default buffer size for streams on Windows.
    TestData.largeStr = 'x'.repeat(4096);
 
    const awaitShell = startParallelShell(function() {
        while (true) {
            print(`inside parallel shell: ${TestData.largeStr}`);
        }
    }, undefined, true);
 
    while (true) {
        print(`main shell: ${TestData.largeStr}`);
    }
})();

Sprint: Dev Tools 2019-08-26, Dev Tools 2019-09-09, Dev Tools 2020-03-23
Participants:

 Description   

The print() function acquires the ShellConsoleAppender::mx mutex when writing to std::cout as part of the LogstreamBuilder going out of scope. The ProgramOutputMultiplexer::appendLine() function is holding the mongoProgramOutputMutex mutex when writing directly to std::cout. This allows the main thread in the mongo shell to be writing text to std::cout in the print() function while a ProgramRunner reader thread is writing subprocess output to std::cout.

Interleaving the writes to std::cout from these two different threads corrupts the underlying output stream. It causes the following assertion to fire on --dbg=on builds on Windows.

fwrite.cpp

// Write blocks of data from the buffer until there is no more data left:
size_t remaining_bytes = total_bytes;
while (remaining_bytes != 0)
{
    // If the buffer is big and is not full, copy data into the buffer:
    if (stream.has_big_buffer() && stream->_cnt != 0)
    {
        if (stream->_cnt < 0)
        {
            _ASSERTE(("Inconsistent Stream Count. Flush between consecutive read and write", stream->_cnt >= 0));
            stream.set_flags(_IOERROR);
            return (total_bytes - remaining_bytes) / element_size;
        }

Since the crtDebugCallback() function also calls log(), we actually end up hanging because the main thread calling print() is holding the ShellConsoleAppender::mx mutex and waiting for the critical section on std::cout, but the ProgramRunner read thread which triggered the callback is holding the critical section on std::cout and is blocked trying to acquire the ShellConsoleAppender::mx mutex.


[js_test:repro_interleaved_output] 2019-08-14T20:56:14.922+0100 main shell: xxx
[js_test:repro_interleaved_output] 2019-08-14T20:56:14.922+0100 sh2676| inside parallel shell: xxxmain shell: xxx
[js_test:repro_interleaved_output] 2019-08-14T20:56:14.923+0100 xxx
[js_test:repro_interleaved_output] 2019-08-14T20:56:14.923+0100 main shell: xxx
[js_test:repro_interleaved_output] 2019-08-14T20:56:14.923+0100 main shell: xxx



 Comments   
Comment by Ratika Gandhi [ 23/Mar/20 ]

gabriel.russell, As discussed in the sprint planning meeting please close this ticket by appropriately linking the tickets that resolved the BFs linked in this ticket. 

Comment by Gabriel Russell (Inactive) [ 14/Oct/19 ]

I think that the solution for the interleaving lines coming out of ProgramOutputMultiplexer::appendLine is that that function should actually call in to the logging system. And I think that the solution for the crtDebugCallback() function deadlocking is that we use recursive locks in the logging system. I don't know how easy it will be to use recursive locks in logv2, but we'll investigate it. In order to not muddy the history, I leave this ticket mostly as is and make it dependent on two new tickets.

Generated at Thu Feb 08 05:01:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.