[SERVER-25475] Crash in single-node CSRS on Windows with auth Created: 08/Aug/16  Updated: 25/Jan/17  Resolved: 23/Aug/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.3.12

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dbpath.tgz     File mongod.2016-08-08T00-47-31.mdmp     Text File mongod.log    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-08-29
Participants:

 Description   

Seen here while Mongo Orchestration is trying to start a sharded cluster with two mongos servers, two standalone shards, and a single-node config server replica set:

https://evergreen.mongodb.com/version/57a7b8373ff12252cd00010a

The config server crashes:

2016-08-08T01:47:30.552+0100 I CONTROL  [rsSync] *** unhandled exception (access violation) at 0x000000013FE6F0A3, terminating
2016-08-08T01:47:30.552+0100 I CONTROL  [rsSync] *** access violation was a read from 0x0000000000000000
2016-08-08T01:47:30.552+0100 I CONTROL  [rsSync] *** stack trace for unhandled exception:
2016-08-08T01:47:31.077+0100 I CONTROL  [rsSync] mongod.exe    ...\src\mongo\db\repl\replication_coordinator_external_state_impl.cpp(499)       mongo::repl::ReplicationCoordinatorExternalStateImpl::shardingOnDrainingStateHook+0xa3

The crash appears not to happen unless auth is enabled. This is a nightly build from git hash 9cf1165c. Log and minidump are attached. I believe I can reproduce it on demand.



 Comments   
Comment by Githook User [ 23/Aug/16 ]

Author:

Spencer T Brody (stbrody) <spencer@mongodb.com>

Message: SERVER-25475 SERVER-25155 Don't perform write for FeatureCompabilityVersion until replication is started up
Branch: master
https://github.com/mongodb/mongo/commit/dd74cff611986d34464fd4476e67dbcd6443d3ca
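
For illustration, a minimal sketch of the ordering that commit message describes (helper names here are placeholders, not the actual patch; see the commit above for the real change):

// Placeholder names; illustrative ordering only.
replicationCoordinator->startup(startupOpCtx);   // bring replication up first

// Only once replication has started is it safe to perform the
// FeatureCompatibilityVersion write on the config server.
writeFeatureCompatibilityVersionIfNeeded(startupOpCtx);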

Comment by Spencer Brody (Inactive) [ 23/Aug/16 ]

jesse, this should be fixed now. Can you confirm that it passes your tests?

Comment by Githook User [ 23/Aug/16 ]

Author:

Spencer T Brody (stbrody) <spencer@mongodb.com>

Message: SERVER-25475 Re-order mongod initialization to avoid segfault.

This ensures that the sharding system is initialized before the ReplicationCoordinator is started,
thus avoiding the possibility of the node becoming primary before the ShardingCatalogManager has been
installed on the Grid object.
Branch: master
https://github.com/mongodb/mongo/commit/c38c8a9ef540b337cce414f8db4a7b93890926b7
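
Roughly, the re-ordering the commit describes looks like this (a hedged sketch with paraphrased names, not the actual db.cpp code):

// Illustrative sketch of the re-ordered _initAndListen startup.
initializeGlobalShardingState(startupOpCtx);   // installs the ShardingCatalogManager
                                               // on the Grid object first

replicationCoordinator->startup(startupOpCtx); // only now may the node win an
                                               // election; the rsSync draining hook
                                               // finds a non-null catalog manager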

Comment by Spencer Brody (Inactive) [ 12/Aug/16 ]

Okay, I was able to reproduce locally using the data files Jesse provided, by adding a sleep in db.cpp:_initAndListen after the replication coordinator is started up and before we call initializeGlobalShardingState. It looks like a race between the initialization of the replication system and the sharding system.
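
In other words, the reproduction looked roughly like this (a hedged paraphrase of the temporary change, not the actual diff):

// Temporary change for reproduction only; names paraphrased from db.cpp.
replicationCoordinator->startup(startupOpCtx); // replication starts; a single-node
                                               // CSRS can win its election here
sleepsecs(5);                                  // artificial delay widens the race
initializeGlobalShardingState(startupOpCtx);   // sharding state arrives too late; the
                                               // draining hook has already dereferenced
                                               // a null catalog manager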

Comment by A. Jesse Jiryu Davis [ 12/Aug/16 ]

Yes, it seems to crash most of the time that I restart mongod with this config file. I've attached a zip of its dbpath.

Comment by A. Jesse Jiryu Davis [ 12/Aug/16 ]

Its config file is:

dbpath=c:\cygwin\tmp\mongo-uovchv
replSet=546cf2af-3138-44c9-ab94-9d2d5bb062de
logpath=c:\cygwin\tmp\mongo-uovchv\mongod.log
enableMajorityReadConcern=true
oplogSize=100
bind_ip=127.0.0.1
configsvr=true
setParameter = enableTestCommands=1
keyFile=c:\cygwin\tmp\tmpgubboq\key
port=1026
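
(The node was restarted with this file via the --config flag; the path below is illustrative, assuming the file is saved as mongod.conf in the dbpath:)

c:\mongodb\bin\mongod.exe --config c:\cygwin\tmp\mongo-uovchv\mongod.conf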

Comment by A. Jesse Jiryu Davis [ 12/Aug/16 ]

Logs:

2016-08-12T18:00:19.692+0100 I REPL     [rsSync] transition to RECOVERING
2016-08-12T18:00:19.695+0100 I REPL     [rsSync] transition to SECONDARY
2016-08-12T18:00:19.695+0100 I REPL     [rsSync] conducting a dry run election to see if we could be elected
2016-08-12T18:00:19.695+0100 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-08-12T18:00:19.695+0100 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 2
2016-08-12T18:00:19.695+0100 I REPL     [ReplicationExecutor] transition to PRIMARY
2016-08-12T18:00:19.696+0100 I REPL     [ReplicationExecutor] Could not access any nodes within timeout when checking for additional ops to apply before finishing transition to primary. Will move forward with becoming primary anyway.
2016-08-12T18:00:20.693+0100 I CONTROL  [rsSync] *** unhandled exception (access violation) at 0x000000014020F0A3, terminating
2016-08-12T18:00:20.693+0100 I CONTROL  [rsSync] *** access violation was a read from 0x0000000000000000
2016-08-12T18:00:20.693+0100 I CONTROL  [rsSync] *** stack trace for unhandled exception:
2016-08-12T18:00:20.763+0100 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'c:/cygwin/tmp/mongo-uovchv/diagnostic.data'
2016-08-12T18:00:20.765+0100 I SHARDING [thread1] creating distributed lock ping thread for process ConfigServer (sleeping for 30000ms)
2016-08-12T18:00:20.765+0100 I SHARDING [shard registry reload] Periodic reload of shard registry failed  :: caused by :: 134 could not get updated shard list from config server due to Read concern majority reads are currently not possible.; will retry after 30s
2016-08-12T18:00:20.765+0100 I NETWORK  [thread2] waiting for connections on port 1026
2016-08-12T18:00:21.087+0100 I NETWORK  [thread2] connection accepted from 127.0.0.1:49333 #1 (1 connection now open)
2016-08-12T18:00:21.187+0100 I NETWORK  [thread2] connection accepted from 127.0.0.1:49331 #2 (2 connections now open)
2016-08-12T18:00:21.187+0100 I -        [conn2] AssertionException handling request, closing client connection: 6 Recv failed
2016-08-12T18:00:21.187+0100 I -        [conn2] end connection 127.0.0.1:49331 (1 connection now open)
2016-08-12T18:00:21.189+0100 I NETWORK  [thread2] connection accepted from 127.0.0.1:49334 #3 (2 connections now open)
2016-08-12T18:00:21.239+0100 I NETWORK  [thread2] connection accepted from 127.0.0.1:49332 #4 (3 connections now open)
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    ...\src\mongo\db\repl\replication_coordinator_external_state_impl.cpp(499)       mongo::repl::ReplicationCoordinatorExternalStateImpl::shardingOnDrainingStateHook+0xa3
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    ...\src\mongo\db\repl\replication_coordinator_impl.cpp(850)                      mongo::repl::ReplicationCoordinatorImpl::signalDrainComplete+0x11e
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    ...\src\mongo\db\repl\sync_tail.cpp(829)                                         mongo::repl::SyncTail::oplogApplication+0x259
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    ...\src\mongo\db\repl\rs_sync.cpp(149)                                           mongo::repl::RSDataSync::_run+0x239
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247)  std::_LaunchPad<std::unique_ptr<std::tuple<void (__cdecl mongo::repl::RSDataSync::*)(void) __ptr64,mongo::repl::RSDataSync * __ptr64>,std::default_delete<std::tuple<void (__cdecl mongo::repl::RSDataSync::*)(void) __ptr64,mongo::repl::RSDataSync * __ptr64> > > >::_Run+0x6f
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210)  std::_Pad::_Call_func+0x9
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] mongod.exe    d:\th\minkernel\crts\ucrt\src\appcrt\startup\thread.cpp(115)                     thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x5d
2016-08-12T18:00:21.274+0100 I CONTROL  [rsSync] kernel32.dll                                                                                   BaseThreadInitThunk+0xd
2016-08-12T18:00:21.274+0100 I -        [rsSync]
2016-08-12T18:00:21.275+0100 I CONTROL  [rsSync] writing minidump diagnostic file c:\mongodb\bin\mongod.2016-08-12T17-00-21.mdmp
2016-08-12T18:00:21.834+0100 I CONTROL  [rsSync] *** immediate exit due to unhandled exception

Comment by A. Jesse Jiryu Davis [ 12/Aug/16 ]

I've just reproduced it on a VS 2015 spawnhost:

git clone https://github.com/mongodb/mongo-c-driver.git
cd mongo-c-driver
 
curl -s http://downloads.mongodb.org/win32/mongodb-win32-x86_64-latest.zip --output mongo-archive.tgz
/cygdrive/c/Progra~2/7-Zip/7z.exe x mongo-archive.tgz
mv mongodb* mongodb
chmod +x ./mongodb/bin/mongod.exe
cp -r mongodb /cygdrive/c/mongodb
export MONGOC_TEST_USER=bob
export MONGOC_TEST_PASSWORD=pwd123
export MONGO_SHELL_CONNECTION_FLAGS="-ubob -ppwd123"
 
export ORCHESTRATION_FILE="orchestration_configs/sharded_clusters/auth.json"
export ORCHESTRATION_URL="http://localhost:8889/v1/sharded_clusters"
export MONGO_ORCHESTRATION_HOME=/tmp/orchestration-home
 
python -m virtualenv venv
cd venv
. Scripts/activate
git clone https://github.com/10gen/mongo-orchestration.git
cd mongo-orchestration
pip install .
cd ../..
 
mkdir -p /cygdrive/c/$MONGO_ORCHESTRATION_HOME
rm -f /cygdrive/c/$MONGO_ORCHESTRATION_HOME/server.pid
echo "{ \"releases\": { \"default\": \"c:\\\\mongodb\\\\bin\" }}" > orchestration.config
mongo-orchestration -f orchestration.config -e default -s wsgiref start --socket-timeout-ms=60000 --bind=127.0.0.1 --enable-majority-read-concern
curl -s http://localhost:8889/
curl -s --data @"$ORCHESTRATION_FILE" "$ORCHESTRATION_URL"

Comment by Spencer Brody (Inactive) [ 11/Aug/16 ]

jesse, I'm having trouble reproducing this. Can you give any more information on the series of events that led to this crash?

Also, once this happens, does it crash in the same way every time you start up the binary on the same data files? If so, would it be possible to share those data files?

Comment by Eric Milkie [ 08/Aug/16 ]

It looks like we need to check the return statuses from these function calls before dereferencing:

status = Grid::get(txn)->catalogManager()->initializeConfigDatabaseIfNeeded(txn);
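
Concretely, the kind of check being suggested might look like this (a hedged sketch; the early return and logging around the grounded Grid::get(txn)->catalogManager() call are illustrative):

// Illustrative sketch: check for a missing catalog manager and a failed Status
// instead of dereferencing unconditionally.
auto catalogManager = Grid::get(txn)->catalogManager();
if (!catalogManager) {
    // Sharding state not yet installed on the Grid; bail out instead of crashing.
    return;
}
Status status = catalogManager->initializeConfigDatabaseIfNeeded(txn);
if (!status.isOK()) {
    warning() << "initializeConfigDatabaseIfNeeded failed: " << status;
    return;
}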
