[SERVER-27237] do not pass empty metadata to shouldStopFetching() Created: 30/Nov/16  Updated: 04/Feb/18  Resolved: 18/Apr/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Siyuan Zhou
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-26528 Add additional logging when sync sour... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2017-02-13, Repl 2017-04-17, Repl 2017-05-08
Participants:
Linked BF Score: 0

 Description   

In OplogFetcher::_callback(), we check whether the query response actually has metadata and set a boolean, hasMetadata.
However, later in the function we call shouldStopFetching() and pass along the metadata unconditionally, regardless of whether hasMetadata is true.

This results in erroneous behavior. In particular, a node can get stuck in a tight CPU loop when chaining is turned off: it chooses what it believes to be the primary as its sync source, but immediately afterwards TopologyCoordinatorImpl::shouldChangeSyncSource() returns true, because the metadata's config version (which is null) does not match the current config version. This repeats in a tight loop.

You can see the effects of this on the log here:
https://logkeeper.mongodb.org/build/7419231f517600f1d972ad9bc50cb45b/test/582a071abe07c472fe0b4a36
This tight loop appears to be part of the reason why the test suite failed (a replica set lost quorum due to slow heartbeats).
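For illustration only, here is a minimal, self-contained C++ sketch (build with -std=c++17) of the control flow described above together with the guard this ticket asks for. The names (Metadata, QueryResponse, shouldStopFetching, onBatch) are hypothetical stand-ins, not the actual OplogFetcher/TopologyCoordinatorImpl code:

#include <iostream>
#include <optional>

// Stand-in for the replication metadata attached to a query response.
struct Metadata {
    long long configVersion;  // replica set config version reported by the sync source
};

// Stand-in for Fetcher::QueryResponse; metadata is empty when the response carried none.
struct QueryResponse {
    std::optional<Metadata> metadata;
};

// Stand-in for the "should we change sync source?" decision. With null/empty metadata the
// config version can never match, so an unconditional call always answers "stop fetching",
// which produces the tight re-choose loop described above.
bool shouldStopFetching(const Metadata& metadata, long long currentConfigVersion) {
    return metadata.configVersion != currentConfigVersion;
}

void onBatch(const QueryResponse& response, long long currentConfigVersion) {
    const bool hasMetadata = response.metadata.has_value();

    // The guard this ticket proposes: only re-evaluate the sync source when the response
    // actually carried metadata, and log the unexpected empty case for debugging.
    if (!hasMetadata) {
        std::cout << "response had no replication metadata; not re-evaluating sync source\n";
        return;
    }
    if (shouldStopFetching(*response.metadata, currentConfigVersion)) {
        std::cout << "stopping fetcher; sync source no longer appears valid\n";
    }
}

int main() {
    onBatch(QueryResponse{}, /*currentConfigVersion=*/3);             // empty metadata: logged, not fatal
    onBatch(QueryResponse{Metadata{3}}, /*currentConfigVersion=*/3);  // matching config version: keep fetching
}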



 Comments   
Comment by Siyuan Zhou [ 18/Apr/17 ]

Discussed with judah.schvimer; we believe the CPU loop is a temporary problem. The sync source was the primary and stepped down, so the fetcher realized it was no longer a valid sync source and wanted to change sync sources. However, sync source selection is based on heartbeats, which were stale for several seconds. Until the secondary received new heartbeats, the node kept choosing the same sync source in a loop. This is confirmed by the log: the log messages stopped when the new heartbeats arrived.

[ReplicaSetFixture:job14:primary] 2016-11-14T18:49:24.133+0000 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
...
[ReplicaSetFixture:job14:primary] 2016-11-14T18:49:24.133+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:24.135+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-181-123-13:23500
[ReplicaSetFixture:job14:primary] 2016-11-14T18:49:24.133+0000 I REPL     [ReplicationExecutor] Member ip-10-181-123-13:23502 is now in state SECONDARY
[js_test:queryoptimizer3] 2016-11-14T18:49:24.135+0000 sh14087| 2016-11-14T18:49:24.134+0000 E QUERY    [main] Error: error doing query: failed: network error while attempting to run command 'drop' on host '127.0.0.1:23500'  :
...
[js_test:queryoptimizer3] 2016-11-14T18:49:24.136+0000 sh14087| DB.prototype.runCommand@src/mongo/shell/db.js:132:1
[js_test:queryoptimizer3] 2016-11-14T18:49:24.136+0000 sh14087| DBCollection.prototype.drop@src/mongo/shell/collection.js:748:15
...
[ReplicaSetFixture:job14:primary] 2016-11-14T18:49:24.133+0000 I COMMAND  [conn21] command local.oplog.rs command: getMore { getMore: 26770111653, collection: "oplog.rs", maxTimeMS: 5000, term: 1, lastKnownCommittedOpTime: { ts: Timestamp 1479149343000|147, t: 1 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1479149307000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } planSummary: COLLSCAN cursorid:26770111653 keysExamined:0 docsExamined:0 numYields:1 nreturned:0 reslen:292 locks:{ Global: { acquireCount: { r: 6 } }, MMAPV1Journal: { acquireCount: { r: 3 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { R: 3 } } } protocol:op_command 12029ms
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.137+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[js_test:queryoptimizer3] 2016-11-14T18:49:24.138+0000 2016-11-14T18:49:24.134+0000 E QUERY    [main] Error: Error: error doing query: failed :
[js_test:queryoptimizer3] 2016-11-14T18:49:24.138+0000 DBCollection.prototype.insert@src/mongo/shell/collection.js:363:23
[js_test:queryoptimizer3] 2016-11-14T18:49:24.138+0000 DBCollection.prototype.save@src/mongo/shell/collection.js:619:16
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:24.136+0000 I COMMAND  [repl writer worker 4] CMD: drop test.jstests_queryoptimizer3
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.137+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[js_test:queryoptimizer3] 2016-11-14T18:49:24.139+0000 @jstests/core/queryoptimizer3.js:14:1
[js_test:queryoptimizer3] 2016-11-14T18:49:24.139+0000 failed to load: jstests/core/queryoptimizer3.js
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:24.137+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.138+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:24.137+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[ReplicaSetFixture:job14:primary] 2016-11-14T18:49:24.133+0000 I COMMAND  [conn35] command test.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "jstests_queryoptimizer3", indexes: [ { key: { b: 1.0 }, name: "b_1" } ] } numYields:0 reslen:98 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, MMAPV1Journal: { acquireCount: { w: 4 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 12505043 } }, Database: { acquireCount: { w: 1, W: 1 } }, Collection: { acquireCount: { W: 1 } }, Metadata: { acquireCount: { w: 1, W: 1 } }, oplog: { acquireCount: { W: 1 } } } protocol:op_command 12505ms
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.138+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[ReplicaSetFixture:job14:primary] 2016-11-14T18:49:24.133+0000 I -        [conn21] end connection 10.181.123.13:60479 (10 connections now open)
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.138+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to ip-10-181-123-13:23500
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:24.138+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.139+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:24.138+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:24.139+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
...
...
 
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:26.135+0000 I REPL     [ReplicationExecutor] Member ip-10-181-123-13:23500 is now in state SECONDARY
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.128+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:26.136+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.128+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.129+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.129+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.130+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.130+0000 I REPL     [rsBackgroundSync] chaining not allowed, choosing primary as sync source candidate: ip-10-181-123-13:23500
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.140+0000 I REPL     [ReplicationExecutor] Member ip-10-181-123-13:23500 is now in state SECONDARY
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:26.140+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source ip-10-181-123-13:23500 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
2016-11-14 18:49:31 +0000 
[ReplicaSetFixture:job14:secondary] 2016-11-14T18:49:31.136+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to ip-10-181-123-13:23500: InvalidSyncSource: Sync target is no longer valid
[ReplicaSetFixture:job14:initsync] 2016-11-14T18:49:31.140+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to ip-10-181-123-13:23500: InvalidSyncSource: Sync target is no longer valid

SERVER-27403 added a delay before re-choosing the same sync source, which avoids the CPU spin. The root cause is that the fetcher and the heartbeats disagree on their view of the sync source. We could also update our view of nodes based on request metadata.
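For context, here is a minimal self-contained C++ sketch of that kind of delay. This is not the SERVER-27403 patch; the class and names are invented for illustration. It simply remembers the sync source that was just rejected and declines to re-choose it until a short retry window has passed, so a stale-heartbeat interval cannot turn into a busy loop:

#include <chrono>
#include <iostream>
#include <string>

// Illustrative only: rate-limit re-selecting the sync source that was just rejected.
class SyncSourceChooser {
public:
    // Returns true if 'candidate' may be chosen at time 'now'.
    bool mayChoose(const std::string& candidate,
                   std::chrono::steady_clock::time_point now) const {
        if (candidate == _lastRejected && now - _lastRejectedAt < kRetryDelay) {
            return false;  // wait for fresher heartbeats before retrying the same node
        }
        return true;
    }

    // Record that 'candidate' was just rejected (e.g. as an invalid sync source).
    void noteRejected(const std::string& candidate,
                      std::chrono::steady_clock::time_point now) {
        _lastRejected = candidate;
        _lastRejectedAt = now;
    }

private:
    static constexpr std::chrono::seconds kRetryDelay{1};
    std::string _lastRejected;
    std::chrono::steady_clock::time_point _lastRejectedAt{};
};

int main() {
    SyncSourceChooser chooser;
    const auto now = std::chrono::steady_clock::now();
    chooser.noteRejected("ip-10-181-123-13:23500", now);
    std::cout << std::boolalpha
              << chooser.mayChoose("ip-10-181-123-13:23500", now) << "\n"                             // false
              << chooser.mayChoose("ip-10-181-123-13:23500", now + std::chrono::seconds(2)) << "\n";  // true
}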

We don't think there is a problem in master, so I'm closing it as "Won't Fix."

Comment by Siyuan Zhou [ 18/Apr/17 ]

In the failure of invalid_key_pattern_upgrade.js, a 3.4 node sent a request with $replData to a 3.2 node. However, the 3.2 node doesn't send metadata back if its replication executor is in shutdown. This is a general issue in 3.2, but I don't think it's critical, since the worst case is choosing a new sync source when the old one is shutting down, which would happen anyway.

The problem in the original ticket, BF-4112, seems to be a different one. It's not clear whether the CPU loop in BF-4112 is a cause of the failure or merely a side effect. I'm not convinced that empty metadata is the root cause either. I'd suggest closing the dependent BF as "Cannot reproduce," since we don't have enough evidence to draw a conclusion. When the problem happens again, SERVER-26528 will make the investigation much easier. CC judah.schvimer.

The fix described in this ticket still seems reasonable, though. At the very least, logging the unexpected empty metadata would make the code easier to reason about and debug. This seems like a good "neweng" ticket for interns.

Comment by Judah Schvimer [ 14/Feb/17 ]

I have been unable to reproduce this again. Per our discussion, I'm handing this off to siyuan.zhou to see if he can have more success.

Comment by Judah Schvimer [ 02/Feb/17 ]

The multiversion_WT failure here, in execution 2 of invalid_key_pattern_upgrade.js, is a repro.

In the original BF, the primary was stepping down when this began; in this case it happens during shutdown, so there may be a connection there.

Looking at the core dump, we see that the documents and metadata fields of the queryResponse are both empty:

(gdb) p queryResponse
$1 = (const mongo::Fetcher::QueryResponse &) @0x7f40bf428ed0: {
  cursorId = 20366897929, 
  nss = {
    static kAdminDb = {
      _data = 0x7f40d8c3564e "admin", 
      _size = 5
    }, 
    static kLocalDb = {
      _data = 0x7f40d8c356de "local", 
      _size = 5
    }, 
    static kSystemDotViewsCollectionName = {
      _data = 0x7f40d8c3ae6d "system.views", 
      _size = 12
    }, 
    static kConfigCollectionNamespace = {
      static kAdminDb = <same as static member of an already seen type>, 
      static kLocalDb = <same as static member of an already seen type>, 
      static kSystemDotViewsCollectionName = <same as static member of an already seen type>, 
      static kConfigCollectionNamespace = <same as static member of an already seen type>, 
      _ns = {
        static npos = 18446744073709551615, 
        _M_dataplus = {
          <std::allocator<char>> = {
            <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
          members of std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
          _M_p = 0x7f40da653480 "admin.system.version"
        }, 
        _M_string_length = 20, 
        {
          _M_local_buf = "\024", '\000' <repeats 14 times>, 
          _M_allocated_capacity = 20
        }
      }, 
      _dotIndex = 5
    }, 
    _ns = {
      static npos = 18446744073709551615, 
      _M_dataplus = {
        <std::allocator<char>> = {
          <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
        members of std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
        _M_p = 0x7f40bf428ee8 "local.oplog.rs"
      }, 
      _M_string_length = 14, 
      {
        _M_local_buf = "local.oplog.rs\000", 
        _M_allocated_capacity = 8101745297688063852
      }
    }, 
    _dotIndex = 5
  }, 
  documents = {
    <std::_Vector_base<mongo::BSONObj, std::allocator<mongo::BSONObj> >> = {
      _M_impl = {
        <std::allocator<mongo::BSONObj>> = {
          <__gnu_cxx::new_allocator<mongo::BSONObj>> = {<No data fields>}, <No data fields>}, 
        members of std::_Vector_base<mongo::BSONObj, std::allocator<mongo::BSONObj> >::_Vector_impl: 
        _M_start = 0x0, 
        _M_finish = 0x0, 
        _M_end_of_storage = 0x0
      }
    }, <No data fields>}, 
  otherFields = {
Python Exception <type 'exceptions.TypeError'> unbound method decode() must be called with BSON instance as first argument (got str instance instead): 
    metadata = owned empty BSONObj @ 0x7f40de1e6965
  }, 
  elapsedMillis = {
    _count = 7
  }, 
  first = false
}

Comment by Judah Schvimer [ 16/Dec/16 ]

After looking into this, I think that it is a bug if a PV1 OplogFetcher response does not contain metadata. Rather than paper over the bug with this fix, we need to find the bug and address it directly. I have so far been unable to reproduce this behavior. SERVER-26528 should help clarify when this is occurring.
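As a purely hypothetical sketch of that stance (self-contained C++, not mongod code; ReplMetadata and validatePv1Response are invented names), a PV1 oplog query response without metadata would be surfaced as a protocol error rather than silently tolerated:

#include <iostream>
#include <optional>
#include <string>

// Stand-in for the replication metadata a PV1 response is expected to carry.
struct ReplMetadata {
    long long configVersion;
};

// Returns an error message if the invariant "PV1 responses carry metadata" is violated.
std::optional<std::string> validatePv1Response(const std::optional<ReplMetadata>& metadata) {
    if (!metadata) {
        return std::string("PV1 oplog query response is missing $replData metadata");
    }
    return std::nullopt;  // response is well-formed
}

int main() {
    if (auto err = validatePv1Response(std::nullopt)) {
        std::cerr << "error: " << *err << "\n";  // fail loudly so the underlying bug gets found
    }
}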
