[SERVER-19854] Mongod failed to open connection, remained in hung state, when running WT, during recovery Created: 10/Aug/15  Updated: 11/Aug/15  Resolved: 11/Aug/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 3.1.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Unassigned
Resolution: Done Votes: 0
Labels: 32powercycle, 32qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-16796 Increase logging activity for journal... Closed
Operating System: ALL
Steps To Reproduce:

Command to run this test:

powertest.sh -s <ipaddr>  -c 20 -v 3 -D -A -N 10 -l 3  -F <mongo root> -R -i 10000000

Participants:

 Description   

The powercycle test was applied to WiredTiger, with a seed of 10M documents. After 1 loop of start/crash/start, the connection was not made available, with mongod still active, during recovery phase.

Server log:

2015-08-10T15:59:41.879-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-08-10T15:59:41.955-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pt-317-wt/data/wiredTiger/mongod.lock is not empty.
2015-08-10T15:59:41.955-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-08-10T15:59:41.955-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),

Mongod threads:

jonathan@CAP-JA-Ubuntu14:~$ sudo gdb -p 1817 -batch -ex "thread apply all bt"
[New LWP 7290]
[New LWP 1819]
[New LWP 1818]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f35c294acc3 in pread64 () at ../sysdeps/unix/syscall-template.S:81
81	../sysdeps/unix/syscall-template.S: No such file or directory.
 
Thread 4 (Thread 0x7f35c22ac700 (LWP 1818)):
#0  0x00007f35c294b0d1 in do_sigwait (sig=0x7f35c22ab8fc, set=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:60
#1  __sigwait (set=0x24bf980 <mongo::(anonymous namespace)::asyncSignals>, sig=0x7f35c22ab8fc) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:97
#2  0x00000000011d86b6 in mongo::(anonymous namespace)::signalProcessingThread() ()
#3  0x00000000019cf600 in execute_native_thread_routine ()
#4  0x00007f35c2943182 in start_thread (arg=0x7f35c22ac700) at pthread_create.c:312
#5  0x00007f35c267047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 3 (Thread 0x7f35c1aab700 (LWP 1819)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000192e4d6 in __wt_cond_wait ()
#2  0x0000000001912fb6 in __evict_server ()
#3  0x00007f35c2943182 in start_thread (arg=0x7f35c1aab700) at pthread_create.c:312
#4  0x00007f35c267047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 2 (Thread 0x7f35c12aa700 (LWP 7290)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000192e4d6 in __wt_cond_wait ()
#2  0x0000000001912d22 in __evict_worker ()
#3  0x00007f35c2943182 in start_thread (arg=0x7f35c12aa700) at pthread_create.c:312
#4  0x00007f35c267047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 1 (Thread 0x7f35c398fcc0 (LWP 1817)):
#0  0x00007f35c294acc3 in pread64 () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000192f886 in __wt_read ()
#2  0x00000000018ad72e in __wt_block_read_off ()
#3  0x00000000018ad9c4 in __wt_bm_read ()
#4  0x00000000018c7f56 in __wt_bt_read ()
#5  0x00000000018cce88 in __wt_cache_read ()
#6  0x00000000018c9f2b in __wt_page_in_func ()
#7  0x00000000018e6209 in __wt_row_search ()
#8  0x00000000018bfb36 in __wt_btcur_remove ()
#9  0x00000000018fd96b in __curfile_remove ()
#10 0x0000000001968a90 in __txn_log_recover ()
#11 0x0000000001918337 in __wt_log_scan ()
#12 0x0000000001969556 in __wt_txn_recover ()
#13 0x00000000018f5896 in __wt_connection_workers ()
#14 0x00000000018ef4ae in wiredtiger_open ()
#15 0x0000000000f79157 in mongo::WiredTigerKVEngine::WiredTigerKVEngine(std::string const&, std::string const&, bool, bool) ()
#16 0x0000000000f7653e in mongo::(anonymous namespace)::WiredTigerFactory::create(mongo::StorageGlobalParams const&, mongo::StorageEngineLockFile const&) const ()
#17 0x0000000000e9f339 in mongo::ServiceContextMongoD::initializeGlobalStorageEngine() ()
#18 0x00000000008d73b0 in mongo::initAndListen(int) ()
#19 0x00000000008dc044 in main ()



 Comments   
Comment by Daniel Pasette (Inactive) [ 11/Aug/15 ]

Logging progress during journal recovery is an existing feature: SERVER-16796 and would have helped here.

Comment by Jonathan Abrahams [ 10/Aug/15 ]

The recovery took a long time (1107 seconds), but eventually completed. The mongod logfile did not indicate that there was ongoing activity, so it seemed the mongod was hung.

Generated at Thu Feb 08 03:52:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.