[SERVER-30939] Multiple index builds on same collection can livelock on WriteConflictExceptions Created: 04/Sep/17  Updated: 07/May/19  Resolved: 07/May/19

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 3.4.7, 3.5.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Benety Goh
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File backtrace-before.txt     Text File backtrace-cannot-shutdown-1.txt     Text File backtrace-cannot-shutdown-2.txt     Text File backtrace-cannot-shutdown-3.txt     Text File backtrace-cannot-shutdown-4.txt     Text File backtrace-during.txt     File mongod-cannot-shutdown-primary.log     File mongod-cannot-shutdown-secondary.log.xz     File mongod-primary.log     File mongod-secondary.log    
Issue Links:
Backports
Depends
Duplicate
duplicates SERVER-34142 Background index builds for RTT on st... Closed
Related
is related to SERVER-34142 Background index builds for RTT on st... Closed
is related to SERVER-39369 Filter index builds requests in the c... Closed
is related to SERVER-19745 Throttle concurrent index builds in W... Closed
Operating System: ALL
Backport Requested:
v4.0, v3.6
Steps To Reproduce:

Setup:

mlaunch init --replicaset --nodes 2 --arbiter --port 12345 --binarypath /sw/mongodb-unstable/3.5.10/bin --bind_ip 0.0.0.0

Test (intermittent, may require several/many attempts):

mongo --port 12345 --eval 'db.test.drop(); db.test.insert({})' ; for i in {1..63}; do mongo --port 12345 --eval "db.test.ensureIndex({a$i:1},{background:true})" & done

Sprint: Storage NYC 2019-04-22, Storage NYC 2019-05-06, Storage NYC 2019-05-20
Participants:
Case:
Linked BF Score: 15

 Description   

When building indexes in the background, the IndexBuilder will restart the entire index build process on catching a WriteConflictException. Depending on where the WriteConflictException is thrown from (MultiIndexBlock::insertAllDocuments(), MultiIndexBlock::commit(), we could restart the part of the index build process that failed.

======================= OLD DESCRIPTION BELOW ===================

Sometimes, when building numerous background indexes (in the same collection) on a secondary, the mongod gets into a state where 1 or more of the "repl index builder" threads loop indefinitely, rebuilding their bg index over and over again.

This does not happen every time, sometimes I need to run the test repeatedly (up to a dozen times) to trigger the problem.

2017-09-04T07:09:04.815+0000 I COMMAND  [repl writer worker 1] CMD: drop test.test
2017-09-04T07:09:04.815+0000 I STORAGE  [repl writer worker 1] dropCollection: test.test - renaming to drop-pending collection: test.system.drop.1504508944i1t1.test with drop optime { ts: Timestamp 1504508944000|1, t: 1 }
2017-09-04T07:09:04.950+0000 I REPL     [replication-1] Completing collection drop for test.system.drop.1504508944i1t1.test with drop optime { ts: Timestamp 1504508944000|1, t: 1 } (notification optime: { ts: Timestamp 1504508944000|3, t: 1 })
2017-09-04T07:09:05.151+0000 I INDEX    [repl index builder 1] build index on: test.test properties: { v: 2, key: { a1: 1.0 }, name: "a1_1", ns: "test.test", background: true }
2017-09-04T07:09:05.152+0000 I INDEX    [repl index builder 1] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.160+0000 I INDEX    [repl index builder 2] build index on: test.test properties: { v: 2, key: { a2: 1.0 }, name: "a2_1", ns: "test.test", background: true }
2017-09-04T07:09:05.160+0000 I INDEX    [repl index builder 2] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.178+0000 I INDEX    [repl index builder 3] build index on: test.test properties: { v: 2, key: { a3: 1.0 }, name: "a3_1", ns: "test.test", background: true }
2017-09-04T07:09:05.179+0000 I INDEX    [repl index builder 3] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.187+0000 I INDEX    [repl index builder 4] build index on: test.test properties: { v: 2, key: { a4: 1.0 }, name: "a4_1", ns: "test.test", background: true }
2017-09-04T07:09:05.188+0000 I INDEX    [repl index builder 4] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.196+0000 I INDEX    [repl index builder 3] build index on: test.test properties: { v: 2, key: { a3: 1.0 }, name: "a3_1", ns: "test.test", background: true }
2017-09-04T07:09:05.196+0000 I INDEX    [repl index builder 3] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.204+0000 I INDEX    [repl index builder 4] build index on: test.test properties: { v: 2, key: { a4: 1.0 }, name: "a4_1", ns: "test.test", background: true }
2017-09-04T07:09:05.204+0000 I INDEX    [repl index builder 4] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.217+0000 I INDEX    [repl index builder 5] build index on: test.test properties: { v: 2, key: { a5: 1.0 }, name: "a5_1", ns: "test.test", background: true }
2017-09-04T07:09:05.227+0000 I INDEX    [repl index builder 3] build index on: test.test properties: { v: 2, key: { a3: 1.0 }, name: "a3_1", ns: "test.test", background: true }
2017-09-04T07:09:05.228+0000 I INDEX    [repl index builder 5] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.228+0000 I INDEX    [repl index builder 3] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.242+0000 I INDEX    [repl index builder 4] build index on: test.test properties: { v: 2, key: { a4: 1.0 }, name: "a4_1", ns: "test.test", background: true }
2017-09-04T07:09:05.250+0000 I INDEX    [repl index builder 5] build index on: test.test properties: { v: 2, key: { a5: 1.0 }, name: "a5_1", ns: "test.test", background: true }
2017-09-04T07:09:05.257+0000 I INDEX    [repl index builder 3] build index on: test.test properties: { v: 2, key: { a3: 1.0 }, name: "a3_1", ns: "test.test", background: true }
2017-09-04T07:09:05.257+0000 I INDEX    [repl index builder 4] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.265+0000 I INDEX    [repl index builder 6] build index on: test.test properties: { v: 2, key: { a6: 1.0 }, name: "a6_1", ns: "test.test", background: true }
2017-09-04T07:09:05.265+0000 I INDEX    [repl index builder 3] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.265+0000 I INDEX    [repl index builder 5] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.266+0000 I INDEX    [repl index builder 6] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.283+0000 I INDEX    [repl index builder 4] build index on: test.test properties: { v: 2, key: { a4: 1.0 }, name: "a4_1", ns: "test.test", background: true }
2017-09-04T07:09:05.293+0000 I INDEX    [repl index builder 5] build index on: test.test properties: { v: 2, key: { a5: 1.0 }, name: "a5_1", ns: "test.test", background: true }
2017-09-04T07:09:05.293+0000 I INDEX    [repl index builder 4] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.300+0000 I INDEX    [repl index builder 6] build index on: test.test properties: { v: 2, key: { a6: 1.0 }, name: "a6_1", ns: "test.test", background: true }
2017-09-04T07:09:05.300+0000 I INDEX    [repl index builder 5] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.308+0000 I INDEX    [repl index builder 7] build index on: test.test properties: { v: 2, key: { a7: 1.0 }, name: "a7_1", ns: "test.test", background: true }
2017-09-04T07:09:05.308+0000 I INDEX    [repl index builder 6] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.316+0000 I INDEX    [repl index builder 4] build index on: test.test properties: { v: 2, key: { a4: 1.0 }, name: "a4_1", ns: "test.test", background: true }
2017-09-04T07:09:05.316+0000 I INDEX    [repl index builder 7] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.329+0000 I INDEX    [repl index builder 5] build index on: test.test properties: { v: 2, key: { a5: 1.0 }, name: "a5_1", ns: "test.test", background: true }
2017-09-04T07:09:05.329+0000 I INDEX    [repl index builder 4] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.345+0000 I INDEX    [repl index builder 6] build index on: test.test properties: { v: 2, key: { a6: 1.0 }, name: "a6_1", ns: "test.test", background: true }
2017-09-04T07:09:05.345+0000 I INDEX    [repl index builder 5] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.352+0000 I INDEX    [repl index builder 7] build index on: test.test properties: { v: 2, key: { a7: 1.0 }, name: "a7_1", ns: "test.test", background: true }
2017-09-04T07:09:05.352+0000 I INDEX    [repl index builder 6] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.361+0000 I INDEX    [repl index builder 4] build index on: test.test properties: { v: 2, key: { a4: 1.0 }, name: "a4_1", ns: "test.test", background: true }
2017-09-04T07:09:05.361+0000 I INDEX    [repl index builder 7] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.368+0000 I INDEX    [repl index builder 8] build index on: test.test properties: { v: 2, key: { a8: 1.0 }, name: "a8_1", ns: "test.test", background: true }
2017-09-04T07:09:05.376+0000 I INDEX    [repl index builder 5] build index on: test.test properties: { v: 2, key: { a5: 1.0 }, name: "a5_1", ns: "test.test", background: true }
2017-09-04T07:09:05.376+0000 I INDEX    [repl index builder 4] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.377+0000 I INDEX    [repl index builder 8] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.392+0000 I INDEX    [repl index builder 6] build index on: test.test properties: { v: 2, key: { a6: 1.0 }, name: "a6_1", ns: "test.test", background: true }
2017-09-04T07:09:05.392+0000 I INDEX    [repl index builder 5] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:05.402+0000 I INDEX    [repl index builder 7] build index on: test.test properties: { v: 2, key: { a7: 1.0 }, name: "a7_1", ns: "test.test", background: true }
...

Full logs are attached, along with a backtrace from the secondary immediately before building the indexes, and while affected by this issue (for these logs). These are from the Ubuntu 14.04 SSL build (community) of 3.5.10. I've also seen it happen in the Ubuntu 16.04 SSL build of 3.4.7.

Usually once this starts happening, it continues ad infinitum (I've waited up to ~15 mins). But I've also seen it spontaneously resolve itself, and the index builds stop (unfortunately I didn't keep the logs from one of these runs).

Usually shutdown will interrupt these index builds:

2017-09-04T07:09:15.662+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-09-04T07:09:15.663+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2017-09-04T07:09:15.663+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2017-09-04T07:09:15.663+0000 I REPL     [signalProcessingThread] shutting down replication subsystems
2017-09-04T07:09:15.663+0000 I REPL     [signalProcessingThread] Stopping replication reporter thread
2017-09-04T07:09:15.663+0000 I INDEX    [repl index builder 8] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:15.663+0000 I INDEX    [repl index builder 12] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:15.663+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to devkev-1:12345: CallbackCanceled: Reporter no longer valid
2017-09-04T07:09:15.663+0000 I REPL     [signalProcessingThread] Stopping replication fetcher thread
2017-09-04T07:09:15.663+0000 I REPL     [signalProcessingThread] Stopping replication applier thread
2017-09-04T07:09:15.663+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host devkev-1:12345 due to bad connection status; 1 connections to that host remain open
2017-09-04T07:09:15.663+0000 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
2017-09-04T07:09:15.663+0000 I INDEX    [repl index builder 24] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:15.676+0000 I INDEX    [repl index builder 25] build index on: test.test properties: { v: 2, key: { a25: 1.0 }, name: "a25_1", ns: "test.test", background: true }
2017-09-04T07:09:15.685+0000 I INDEX    [repl index builder 35] build index on: test.test properties: { v: 2, key: { a35: 1.0 }, name: "a35_1", ns: "test.test", background: true }
2017-09-04T07:09:15.693+0000 I INDEX    [repl index builder 13] build index on: test.test properties: { v: 2, key: { a13: 1.0 }, name: "a13_1", ns: "test.test", background: true }
...
2017-09-04T07:09:17.031+0000 I INDEX    [repl index builder 11] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:17.031+0000 I INDEX    [repl index builder 31] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:17.031+0000 I INDEX    [repl index builder 18] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:17.031+0000 I INDEX    [repl index builder 23] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:17.039+0000 I INDEX    [repl index builder 32] build index on: test.test properties: { v: 2, key: { a32: 1.0 }, name: "a32_1", ns: "test.test", background: true }
2017-09-04T07:09:17.039+0000 I INDEX    [repl index builder 46] build index done.  scanned 1 total records. 0 secs
2017-09-04T07:09:17.042+0000 I REPL     [signalProcessingThread] Stopping replication storage threads
2017-09-04T07:09:17.043+0000 I FTDC     [signalProcessingThread] Shutting down full-time diagnostic data capture
2017-09-04T07:09:17.048+0000 I INDEX    [repl index builder 19] build index on: test.test properties: { v: 2, key: { a19: 1.0 }, name: "a19_1", ns: "test.test", background: true }
2017-09-04T07:09:17.056+0000 I INDEX    [repl index builder 34] build index on: test.test properties: { v: 2, key: { a33: 1.0 }, name: "a33_1", ns: "test.test", background: true }
2017-09-04T07:09:17.063+0000 I INDEX    [repl index builder 8] build index on: test.test properties: { v: 2, key: { a8: 1.0 }, name: "a8_1", ns: "test.test", background: true }
...
2017-09-04T07:09:17.171+0000 I INDEX    [repl index builder 10] build index on: test.test properties: { v: 2, key: { a10: 1.0 }, name: "a10_1", ns: "test.test", background: true }
2017-09-04T07:09:17.180+0000 I INDEX    [repl index builder 25] build index on: test.test properties: { v: 2, key: { a25: 1.0 }, name: "a25_1", ns: "test.test", background: true }
2017-09-04T07:09:17.187+0000 I INDEX    [repl index builder 35] build index on: test.test properties: { v: 2, key: { a35: 1.0 }, name: "a35_1", ns: "test.test", background: true }
2017-09-04T07:09:17.187+0000 E INDEX    [repl index builder 14] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.187+0000 E INDEX    [repl index builder 7] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.187+0000 E INDEX    [repl index builder 36] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.196+0000 I INDEX    [repl index builder 13] build index on: test.test properties: { v: 2, key: { a13: 1.0 }, name: "a13_1", ns: "test.test", background: true }
2017-09-04T07:09:17.196+0000 E INDEX    [repl index builder 26] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.196+0000 E INDEX    [repl index builder 21] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.205+0000 I INDEX    [repl index builder 20] build index on: test.test properties: { v: 2, key: { a20: 1.0 }, name: "a20_1", ns: "test.test", background: true }
2017-09-04T07:09:17.212+0000 I INDEX    [repl index builder 38] build index on: test.test properties: { v: 2, key: { a38: 1.0 }, name: "a38_1", ns: "test.test", background: true }
2017-09-04T07:09:17.212+0000 E INDEX    [repl index builder 5] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.212+0000 E INDEX    [repl index builder 39] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.220+0000 I INDEX    [repl index builder 28] build index on: test.test properties: { v: 2, key: { a28: 1.0 }, name: "a28_1", ns: "test.test", background: true }
2017-09-04T07:09:17.220+0000 E INDEX    [repl index builder 9] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.234+0000 I INDEX    [repl index builder 6] build index on: test.test properties: { v: 2, key: { a6: 1.0 }, name: "a6_1", ns: "test.test", background: true }
2017-09-04T07:09:17.234+0000 E INDEX    [repl index builder 4] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.242+0000 I INDEX    [repl index builder 41] build index on: test.test properties: { v: 2, key: { a41: 1.0 }, name: "a41_1", ns: "test.test", background: true }
2017-09-04T07:09:17.242+0000 E INDEX    [repl index builder 29] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.242+0000 E INDEX    [repl index builder 42] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.242+0000 E INDEX    [repl index builder 16] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.243+0000 E INDEX    [repl index builder 44] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.251+0000 I INDEX    [repl index builder 11] build index on: test.test properties: { v: 2, key: { a11: 1.0 }, name: "a11_1", ns: "test.test", background: true }
2017-09-04T07:09:17.259+0000 I INDEX    [repl index builder 31] build index on: test.test properties: { v: 2, key: { a31: 1.0 }, name: "a31_1", ns: "test.test", background: true }
2017-09-04T07:09:17.268+0000 I INDEX    [repl index builder 18] build index on: test.test properties: { v: 2, key: { a18: 1.0 }, name: "a18_1", ns: "test.test", background: true }
2017-09-04T07:09:17.276+0000 I INDEX    [repl index builder 23] build index on: test.test properties: { v: 2, key: { a22: 1.0 }, name: "a22_1", ns: "test.test", background: true }
2017-09-04T07:09:17.276+0000 E INDEX    [repl index builder 32] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.285+0000 I INDEX    [repl index builder 46] build index on: test.test properties: { v: 2, key: { a46: 1.0 }, name: "a46_1", ns: "test.test", background: true }
2017-09-04T07:09:17.285+0000 E INDEX    [repl index builder 34] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.285+0000 E INDEX    [repl index builder 19] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.285+0000 E INDEX    [repl index builder 8] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.285+0000 E INDEX    [repl index builder 12] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.285+0000 E INDEX    [repl index builder 24] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.285+0000 E INDEX    [repl index builder 37] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.286+0000 E INDEX    [repl index builder 27] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.286+0000 E INDEX    [repl index builder 40] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.286+0000 E INDEX    [repl index builder 17] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.286+0000 E INDEX    [repl index builder 15] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.287+0000 E INDEX    [repl index builder 30] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.287+0000 E INDEX    [repl index builder 43] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.287+0000 E INDEX    [repl index builder 22] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.287+0000 E INDEX    [repl index builder 45] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 33] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 10] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 25] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 35] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 13] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 38] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 20] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 28] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 41] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 6] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 11] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 23] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 31] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.288+0000 E INDEX    [repl index builder 46] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.289+0000 E INDEX    [repl index builder 18] IndexBuilder could not build index: InterruptedAtShutdown: interrupted at shutdown
2017-09-04T07:09:17.289+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
2017-09-04T07:09:21.298+0000 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2017-09-04T07:09:21.298+0000 I CONTROL  [signalProcessingThread] now exiting
2017-09-04T07:09:21.298+0000 I CONTROL  [signalProcessingThread] shutting down with code:0

But I have also seen it interfere with shutdown and prevent the repl writer threads from being stopped (the mongod had already stopped listening on its TCP port):

2017-09-04T00:44:12.690+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-09-04T00:44:12.690+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2017-09-04T00:44:12.690+0000 I NETWORK  [signalProcessingThread] closing listening socket: 7
2017-09-04T00:44:12.692+0000 I INDEX    [repl index builder 57] build index on: test.test properties: { v: 2, key: { a57: 1.0 }, name: "a57_1", ns: "test.test", background: true }
2017-09-04T00:44:12.692+0000 I NETWORK  [signalProcessingThread] closing listening socket: 8
2017-09-04T00:44:12.692+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-12346.sock
2017-09-04T00:44:12.692+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2017-09-04T00:44:12.692+0000 I REPL     [signalProcessingThread] shutting down replication subsystems
2017-09-04T00:44:12.692+0000 I REPL     [signalProcessingThread] Stopping replication reporter thread
2017-09-04T00:44:12.692+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to devkev-1:12345: CallbackCanceled: Reporter no longer valid
2017-09-04T00:44:12.692+0000 I REPL     [signalProcessingThread] Stopping replication fetcher thread
2017-09-04T00:44:12.692+0000 I REPL     [signalProcessingThread] Stopping replication applier thread
2017-09-04T00:44:12.692+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host devkev-1:12345 due to bad connection status; 1 connections to that host remain open
2017-09-04T00:44:12.692+0000 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
2017-09-04T00:44:12.765+0000 I -        [conn12] end connection 172.31.0.248:34953 (2 connections now open)
2017-09-04T00:44:12.846+0000 I INDEX    [repl index builder 48] build index on: test.test properties: { v: 2, key: { a48: 1.0 }, name: "a48_1", ns: "test.test", background: true }
2017-09-04T00:44:13.177+0000 I INDEX    [repl index builder 27] build index on: test.test properties: { v: 2, key: { a27: 1.0 }, name: "a27_1", ns: "test.test", background: true }
...
2017-09-04T01:00:05.331+0000 I INDEX    [repl index builder 57] build index on: test.test properties: { v: 2, key: { a57: 1.0 }, name: "a57_1", ns: "test.test", background: true }
2017-09-04T01:00:05.369+0000 I INDEX    [repl index builder 40] build index on: test.test properties: { v: 2, key: { a40: 1.0 }, name: "a40_1", ns: "test.test", background: true }
2017-09-04T01:00:05.407+0000 I INDEX    [repl index builder 27] build index on: test.test properties: { v: 2, key: { a27: 1.0 }, name: "a27_1", ns: "test.test", background: true }
2017-09-04T01:00:05.407+0000 I INDEX    [repl index builder 42] build index done.  scanned 1 total records. 0 secs
2017-09-04T01:00:05.443+0000 I INDEX    [repl index builder 39] build index on: test.test properties: { v: 2, key: { a39: 1.0 }, name: "a39_1", ns: "test.test", background: true }
2017-09-04T01:00:05.443+0000 I INDEX    [repl index builder 36] build index done.  scanned 1 total records. 0 secs
2017-09-04T01:00:05.443+0000 I INDEX    [repl index builder 31] build index done.  scanned 1 total records. 0 secs
<kill -9 of mongod>

In this case, I had also noticed that the secondary was stuck in the middle of applying the batch — replication was not making progress (repllag of the secondary would not increase if I did ops on the primary).

I've attached logs and a few backtraces from the mongod while in this state (using the 3.4.7 Ubuntu 14.04 SSL build; I also have the whole core if necessary).

Affected versions:

3.2.12 unaffected
3.2.16 unaffected
3.4.0 AFFECTED
3.4.7 AFFECTED
3.4.8-rc1 AFFECTED
3.5.1 AFFECTED
3.5.6 AFFECTED
3.5.9 AFFECTED
3.5.10 AFFECTED
3.5.11 unaffected
3.5.12 unaffected

Looking at the commits between 3.5.10 and 3.5.11, nothing stands out as being relevant. So this may have been fixed tangentially as part of some other work.



 Comments   
Comment by Eric Milkie [ 07/May/19 ]

Fixed by SERVER-34142. Not considering a backport at this time.

Comment by Benety Goh [ 03/May/19 ]

In 4.2, when we moved most of the background index logic out of IndexBuilder into the IndexBuildsCoordinator, we retained the WriteConflict retry logic around the MultiIndexBlock::commit() call.

Comment by Benety Goh [ 03/May/19 ]

We added a write conflict retry loop around the MultiIndexBlock::commit() call in IndexBuilder in SERVER-34142.

https://github.com/mongodb/mongo/blob/a273f6f84cb29e31221f580b00905613d3f31330/src/mongo/db/index_builder.cpp#L205

This should be sufficient to prevent the background index build on the secondary from restarting from the beginning when it has to re-insert the metadata into the on-disk index catalog.

SERVER-34142 landed in 4.0 but was not backported to 3.6 or 3.4. I'm not sure how feasible the back port would be given that the WCE retry loop was not the main goal of SERVER-34142.

Comment by Eric Milkie [ 28/Oct/18 ]

I think we should fix this outside the Simultaneous Index Builds epic, as it ought to be backported.

Comment by Matthew Russotto [ 26/Oct/18 ]

It appears what's going on is that some of the reads in the index build (done under the IX lock) establish the WiredTiger snapshot. When the index build goes to get the X lock, another index build that is starting gets the X lock first and modifies the metadata to mark the index as started. Then the original index build gets its X lock, and gets a write conflict exception. It attempts to re-lock, and the second index build gets in and does the index build (establishing its snapshot). Then the second build tries to get the X lock, the first build (which is restarting) gets its X lock, and modifies the metadata again to mark its index as started. Now the second build WCEs... ad infinitum.

1) Thread 1 acquires X lock (IndexBuilder::run())
2) Thread 1 writes metadata to indicate index A is started
3) Thread 1 releases X lock (relockWithMode(MODE_IX) in IndexBuilder::_build())
4) Thread 1 acquires IX lock (ditto)
5) Thread 1 creates index A and establishes snapshot
6) Thread 1 releases IX lock (the first relockWithMode(MODE_X) in IndexBuilder::_build)
7) Thread 2 acquires X lock
8) Thread 2 writes metadata to indicate index B is started
9) Thread 2 releases X lock
10) Thread 1 acquires X lock (the first relockWithMode(MODE_X) in IndexBuilder::_build)
11) Thread 1 attempts to write metadata indicating index A is done, WCE
12) Thread 1 releases X lock (the second relockWithMode(MODE_X) in IndexBuilder::_build)
13) Thread 2 acquires IX lock
14) Thread 2 builds index B and establishes snapshot
15) Thread 2 releases IX lock
16) Thread 1 acquires X lock (the second relockWithMode(MODE_X) in IndexBuilder::_build)
17) Thread 1 writes metadata to indicate index A is started
18) Thread 1 releases X lock
19) Thread 2 acquires X lock
20) Thread 2 attempts to write metadata indicating index B is done, WCE
21) Thread 2 releases X lock
22) Start over at step 4.

This is a reasonably stable livelock, although it might be possible to break out of it if a context switch occurred in the middle of relockWithMode().

Probably the best fix would be to simply abandon the snapshot after obtaining the X lock to write the "index finished" metadata. Once that is done, no write conflict should be possible, as the exclusive lock will prevent it.

Comment by Eric Milkie [ 28/Nov/17 ]

We can probably make index builds smarter by only retrying the last transaction when getting a WCE, rather than the entire index build.

Comment by Benety Goh [ 27/Nov/17 ]

kevin.pulo, does this issue happen with the more recent 3.4 releases (3.4.9 and 3.4.10)?

michael.cahill, can you help clarify some of the WiredTiger stack traces reported on the server at shutdown? I'm not 100% certain that they could prevent the server from completing its shutdown procedure.

Comment by Benety Goh [ 27/Nov/17 ]

As for the instances where the server experienced difficulty shutting down, these are some of the more interesting stack traces:

backtrace-cannot-shutdown-2.txt

Thread 89 (Thread 0x7f3c891f8700 (LWP 11157)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f3c8edd3649 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f3c8edd3470 in __GI___pthread_mutex_lock (mutex=0x7f3c945e14d8) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f3c920c1751 in __wt_spin_lock_track ()
#4  0x00007f3c920c1bac in __session_drop ()
#5  0x00007f3c913d5bfa in mongo::WiredTigerKVEngine::dropSomeQueuedIdents() ()
#6  0x00007f3c913efb30 in mongo::WiredTigerSessionCache::releaseSession(mongo::WiredTigerSession*) ()
#7  0x00007f3c913f0183 in mongo::WiredTigerSessionCache::waitUntilDurable(bool) ()
#8  0x00007f3c913dba9c in mongo::WiredTigerKVEngine::WiredTigerJournalFlusher::run() ()
 
Thread 35 (Thread 0x7f3c6bdb1700 (LWP 12791)):
#0  0x00007f3c8eaf5e8d in fdatasync () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3c92091040 in __posix_sync ()
#2  0x00007f3c9209124d in __posix_directory_sync ()
#3  0x00007f3c92091bcb in __posix_open_file ()
#4  0x00007f3c9208dd61 in __wt_open ()
#5  0x00007f3c91fec8c6 in __wt_block_manager_create ()
#6  0x00007f3c920ae540 in __wt_schema_create ()
#7  0x00007f3c920af140 in __create_colgroup ()
#8  0x00007f3c920aeb11 in __wt_schema_create ()
#9  0x00007f3c920c62d5 in __wt_session_create ()
#10 0x00007f3c920c66ae in __session_create ()
#11 0x00007f3c913ca340 in mongo::WiredTigerIndex::Create(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#12 0x00007f3c913d6dd4 in mongo::WiredTigerKVEngine::createSortedDataInterface(mongo::OperationContext*, mongo::StringData, mongo::IndexDescriptor const*) ()
#13 0x00007f3c91314835 in mongo::KVCollectionCatalogEntry::prepareForIndexBuild(mongo::OperationContext*, mongo::IndexDescriptor const*) ()
#14 0x00007f3c90b8349c in mongo::IndexCatalog::IndexBuildBlock::init() ()
#15 0x00007f3c90b8ca0a in mongo::MultiIndexBlock::init(std::vector<mongo::BSONObj, std::allocator<mongo::BSONObj> > const&) ()
#16 0x00007f3c90b8d814 in mongo::MultiIndexBlock::init(mongo::BSONObj const&) ()
#17 0x00007f3c90dee872 in mongo::IndexBuilder::_build(mongo::OperationContext*, mongo::Database*, bool, mongo::Lock::DBLock*) const ()
#18 0x00007f3c90defec9 in mongo::IndexBuilder::run() ()
#19 0x00007f3c91654f61 in mongo::BackgroundJob::jobBody() ()
#20 0x00007f3c92159500 in execute_native_thread_routine ()
#21 0x00007f3c8edd1184 in start_thread (arg=0x7f3c6bdb1700) at pthread_create.c:312
#22 0x00007f3c8eafe37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

backtrace-cannot-shutdown-3.txt

Thread 89 (Thread 0x7f3c891f8700 (LWP 11157)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f3c8edd3649 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f3c8edd3470 in __GI___pthread_mutex_lock (mutex=0x7f3c945e14d8) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f3c920c1751 in __wt_spin_lock_track ()
#4  0x00007f3c920c1bac in __session_drop ()
#5  0x00007f3c913d5bfa in mongo::WiredTigerKVEngine::dropSomeQueuedIdents() ()
#6  0x00007f3c913efb30 in mongo::WiredTigerSessionCache::releaseSession(mongo::WiredTigerSession*) ()
#7  0x00007f3c913f0183 in mongo::WiredTigerSessionCache::waitUntilDurable(bool) ()
#8  0x00007f3c913dba9c in mongo::WiredTigerKVEngine::WiredTigerJournalFlusher::run() ()
#9  0x00007f3c91654f61 in mongo::BackgroundJob::jobBody() ()
#10 0x00007f3c92159500 in execute_native_thread_routine ()
#11 0x00007f3c8edd1184 in start_thread (arg=0x7f3c891f8700) at pthread_create.c:312
#12 0x00007f3c8eafe37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 14 (Thread 0x7f3c5ed97700 (LWP 13397)):
#0  0x00007f3c8eaf5e8d in fdatasync () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3c92091040 in __posix_sync ()
#2  0x00007f3c92071876 in __wt_log_release ()
#3  0x00007f3c920737f7 in __wt_log_write ()
#4  0x00007f3c920da0e5 in __wt_txn_checkpoint_log ()
#5  0x00007f3c9208a5ea in __wt_meta_track_off ()
#6  0x00007f3c920ae1cc in __wt_schema_create ()
#7  0x00007f3c920c62d5 in __wt_session_create ()
#8  0x00007f3c920c66ae in __session_create ()
#9  0x00007f3c913ca340 in mongo::WiredTigerIndex::Create(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#10 0x00007f3c913d6dd4 in mongo::WiredTigerKVEngine::createSortedDataInterface(mongo::OperationContext*, mongo::StringData, mongo::IndexDescriptor const*) ()
#11 0x00007f3c91314835 in mongo::KVCollectionCatalogEntry::prepareForIndexBuild(mongo::OperationContext*, mongo::IndexDescriptor const*) ()
#12 0x00007f3c90b8349c in mongo::IndexCatalog::IndexBuildBlock::init() ()
#13 0x00007f3c90b8ca0a in mongo::MultiIndexBlock::init(std::vector<mongo::BSONObj, std::allocator<mongo::BSONObj> > const&) ()
#14 0x00007f3c90b8d814 in mongo::MultiIndexBlock::init(mongo::BSONObj const&) ()
#15 0x00007f3c90dee872 in mongo::IndexBuilder::_build(mongo::OperationContext*, mongo::Database*, bool, mongo::Lock::DBLock*) const ()
#16 0x00007f3c90defec9 in mongo::IndexBuilder::run() ()
#17 0x00007f3c91654f61 in mongo::BackgroundJob::jobBody() ()
#18 0x00007f3c92159500 in execute_native_thread_routine ()
#19 0x00007f3c8edd1184 in start_thread (arg=0x7f3c5ed97700) at pthread_create.c:312
#20 0x00007f3c8eafe37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment by Benety Goh [ 27/Nov/17 ]

The repeated "build index on ..." and "index build done..." messages can be explained by MultiIndexBlock::commit() throwing a WriteConflictException which causes IndexBuilder::_run to restart the index building process.

MultiIndexBlock::commit() internally calls WiredTigerRecordStorage::findRecord() which may throw a WriteConflictException.

Comment by Kevin Pulo [ 08/Nov/17 ]

I haven't been able to repro this with MMAPv1, only with WiredTiger.

When testing MMAPv1, I needed to add a few thousand empty docs (rather than just 1) to get the background builds to be somewhat concurrent. But every time I tried (at least several dozen), the index builds all completed normally, and never got stuck in the loop where they fruitlessly build the index over and over again.

Comment by Benety Goh [ 18/Oct/17 ]

kevin.pulo, do this issue happen with the MMAPv1 storage engine? I'm trying to see if this is specific to WiredTiger. See SERVER-19745.

Generated at Thu Feb 08 04:25:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.