[SERVER-59631] Values of value::TypeTags::ArraySet type might compare incorrectly for equality Created: 26/Aug/21  Updated: 29/Oct/23  Resolved: 15/Sep/21

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Irina Yatsenko (Inactive) Assignee: Mohammad Dashti (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Related
related to SERVER-59930 Complete TODO listed in SERVER-59631 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QE 2021-09-20
Participants:
Linked BF Score: 135

 Description   

Add a unit test similar to this one:

TEST_F(SbeAccumulatorBuilderTest, compareTwoArraySets) {
    using namespace mongo::sbe;
 
    auto bsonObj = BSON("c" << 1);
    auto bsonArr = BSON_ARRAY(1 << 2 << 3);
 
    auto [lhsTag, lhsVal] = value::makeNewArraySet();
    value::ValueGuard lhsGuard{lhsTag, lhsVal};
    auto lhsView = value::getArraySetView(lhsVal);
    auto [lhsItemTag, lhsItemVal] =
        value::makeNewString("a long enough string"_sd); // FAILS
        // value::copyValue(value::TypeTags::bsonArray, value::bitcastFrom<const char*>(bsonArr.objdata())); // FAILS
        // value::makeSmallString("abc"_sd); // OK
        // copyValue(value::TypeTags::bsonObject, value::bitcastFrom<const char*>(bsonObj.objdata())); // FAILS
        lhsView->push_back(lhsItemTag, lhsItemVal);
 
    auto [rhsTag, rhsVal] = value::makeNewArraySet();
    value::ValueGuard rhsGuard{rhsTag, rhsVal};
    auto rhsView = value::getArraySetView(rhsVal);
    auto [rhsItemTag, rhsItemVal] =
        value::makeNewString("a long enough string"_sd); // FAILS
        // value::copyValue(value::TypeTags::bsonArray, value::bitcastFrom<const char*>(bsonArr.objdata())); // FAILS
        // value::makeSmallString("abc"_sd); // OK
        // copyValue(value::TypeTags::bsonObject, value::bitcastFrom<const char*>(bsonObj.objdata())); // FAILS
    rhsView->push_back(rhsItemTag, rhsItemVal);
 
    ASSERT(valueEquals(lhsTag, lhsVal, rhsTag, rhsVal))
        << "lhs set: " << std::make_pair(lhsTag, lhsVal)
        << "rhs set: " << std::make_pair(rhsTag, rhsVal);
}

The test creates two arraySets with a single item in each. The items would compare equal on their own, but when the containing sets are checked for equality, it would fail for pointer-based types of values such as bsonObject or long strings.

The problem seems to be with the implementation of raw_hash_set::has_element() (see https://github.com/10gen/mongo/blob/defbe4582778e3da3abdc23c73ef7639543ab380/src/third_party/abseil-cpp-master/abseil-cpp/absl/container/internal/raw_hash_set.h#L1693 as, after matching the hashes, it compares the stored pairs literally, rather than using value::compareValue().



 Comments   
Comment by Githook User [ 15/Sep/21 ]

Author:

{'name': 'Mohammad Dashti', 'email': 'mdashti@gmail.com', 'username': 'mdashti'}

Message: SERVER-59631 Fixed the `compareValue` implementation for the values of `ValueSetType` and `ValueMapType` type in SBE
Branch: master
https://github.com/mongodb/mongo/commit/3d3d9e51fdda8191ec958a421f390f9e3de613dd

Comment by Githook User [ 14/Sep/21 ]

Author:

{'name': 'Uladzimir Makouski', 'email': 'uladzimir.makouski@mongodb.com', 'username': 'umakouski'}

Message: Revert "SERVER-59631 Fixed the `compareValue` implementation for the values of `ValueSetType` and `ValueMapType` type in SBE"

This reverts commit 013ea42551ce36cc0b134e0ac52e221a681f0ecd.
Branch: master
https://github.com/mongodb/mongo/commit/8d97a7a71bde56723e295286a3b9425f220ebc11

Comment by Githook User [ 13/Sep/21 ]

Author:

{'name': 'Mohammad Dashti', 'email': 'mdashti@gmail.com', 'username': 'mdashti'}

Message: SERVER-59631 Fixed the `compareValue` implementation for the values of `ValueSetType` and `ValueMapType` type in SBE
Branch: master
https://github.com/mongodb/mongo/commit/013ea42551ce36cc0b134e0ac52e221a681f0ecd

Comment by Yoon Soo Kim [ 02/Sep/21 ]

Currently two fixes have been proposed.

1. Yoonsoo's proposal: As a workaround for SBE, we can modify the compareValue() for ArraySet comparison as follows.

        if (lhsTag == TypeTags::ArraySet && rhsTag == TypeTags::ArraySet) {
            const auto& lhsSet = getArraySetView(lhsValue)->values();
            const auto& rhsSet = getArraySetView(rhsValue)->values();
            if (lhsSet.size() == rhsSet.size()) {
                for (const auto& e: rhsSet) {
                    if (!lhsSet.contains(e)) {
                        return {TypeTags::Nothing, 0};
                    }
                }
                return {TypeTags::NumberInt32, bitcastFrom<int32_t>(0)};
            }
            return {TypeTags::Nothing, 0};
        }

 
2. Ian's proposal: I think we should consider investing in a fix that prevents anyone else from spending time on this again.

struct DeepEqualityTagValuePair {
    bool operator==(...); // Uses compareValue.
 
    TypeTags tag; // Or call them 'first' and 'second,' whatever you prefer.
    Value value;
};
using ValueSetType = absl::flat_hash_set<DeepEqualityTagValuePair, ValueHasher>;

 

Comment by Billy Donahue [ 02/Sep/21 ]

FTR: https://developercommunity.visualstudio.com/t/unordered-set-operator-is-incorrect-with-custom-eq/1392782

Comment by Kyle Suarez [ 02/Sep/21 ]

Per our discussion in SBE standup, it sounds like the next steps are to implement a proper value::TypeTags::ArraySet::operator==() that actually does the right thing. Assigning to mohammad.dashti to take a look for the next sprint.

Comment by Billy Donahue [ 01/Sep/21 ]

So MSVC's stdlib is doing something that seems "sensible" IMO by saying that if I use unordered_set's equal_range (NOT unordered_multiset, just regular unordered_set), I KNOW I'm going to get a 1-element range. If I look up the contents of that element in the rhs container with equal_range, I'm going to get another 0-element or 1-element range. I can see why we wouldn't bother to check whether these 1 element ranges are permutations of each other. It looks like kind of a waste of time! It's already specified that key_eq has to have the same behavior on both containers. But MSVC's stdlib is wrong and libstdc++ is right. The equivalence of the keys under key_eq is not enough to determine container equality. The elements have to be equal to each other, not just equivalent. That's the spec. This is the same as the spec for std::set. The elements at corresponding positions have to be ==, not just equivalent under key_comp. Even the std::set relops like < ignore the key_comp and use lexicographical comparison using the key's own < operator.

PS: Also I might be the Abseil contributor Ian was referring to earlier.

PPS: I had the culprits swapped in the first version of this comment. Whoever is doing the == to compare elements is right. That's libstdc++ and I had it backwards at first. I updated the text of this comment.

Comment by Yoon Soo Kim [ 29/Aug/21 ]

Sorry but I could not help myself not write a reply.

Thanks for finding the document. I'm not a language lawyer too but tried my best to interpret the document.

First of all, there's no mentioning that the behavior is undefined if Key is not EqualityComparable.

The second:

Two unordered containers a and b compare equal if a.size() == b.size() and, for every equivalent-key group [Ea1, Ea2) obtained from a.equal_range(Ea1), there exists an equivalent-key group [Eb1, Eb2) obtained from b.equal_range(Ea1), such that is_permutation(Ea1, Ea2, Eb1, Eb2) returns true.

is_permutation() has two different versions.

template<class ForwardIterator1, class ForwardIterator2>
constexpr bool is_permutation(ForwardIterator1 first1, ForwardIterator1 last1,
ForwardIterator2 first2);
template<class ForwardIterator1, class ForwardIterator2, class BinaryPredicate>
constexpr bool is_permutation(ForwardIterator1 first1, ForwardIterator1 last1,
ForwardIterator2 first2, BinaryPredicate pred);

The second version uses pred and we should assume that is_permutation() here means the second version since there's no is_permutation() taking 4 iterators and it certainly makes sense to use key_eq() for pred param for is_permutation() if a user specifies key_eq().

The third:

For unordered_set and unordered_map, the complexity of operator== (i.e., the number of calls to the == operator of the value_type, to the predicate returned by key_eq(), and to the hasher returned by hash_function()) is proportional to N in the average case and to N^2 in the worst case, where N is a.size().

This complexity analysis implies that key_eq() is used along the way, not operator==.

Lastly:

The behavior of a program that uses operator== or operator!= on unordered containers is undefined unless the Pred function object has the same behavior for both containers and the equality comparison function for Key is a refinement(224) of the partition into equivalent-key groups produced by Pred.

224) Equality comparison is a refinement of partitioning if no two objects that compare equal fall into different partitions.

Here I believe that Pred means hash_function() and "the equality comparison function for key" is key_eq(), not operator==.

Again, I could not think of any reason why the C++ standard defines such a confusing behavior that operator== for unordered container uses operator== for key equality comparison though a user specifies key_eq(), which basically overrides user's intention.

If we still can't agree on analysis, we can get a help from language lawyers on Monday.

Comment by Ian Boros [ 28/Aug/21 ]

Well if MSVC disagrees then we probably are in a bit of a murky area...on Monday we'll have to summon the C++ language lawyers to help us out .

In the document you linked to I found the following on page 817:

Two unordered containers a and b compare equal if a.size() == b.size() and, for every equivalent-key group [Ea1, Ea2) obtained from a.equal_range(Ea1), there exists an equivalent-key group [Eb1, Eb2) obtained from b.equal_range(Ea1), such that is_permutation(Ea1, Ea2, Eb1, Eb2) returns true.

std::is_permutation does comparison via operator==, so, while I agree that the behavior is confusing, I believe it is what they intended.

 

Also:

The behavior of a program that uses operator== or operator!= on unordered containers is undefined unless the Pred function object has the same behavior for both containers and the equality comparison function for Key is a refinement of the partition into equivalent-key groups produced by Pred.

If I understand right (an I am no C++ lawyer ) this is saying that it must be the case that any two elements which produce true for a == b must return true for Pred(a,b). However, it is not necessary for the converse to be true. This is exactly our case: when operator== returns true for a pair<Tag,Value>, valueCompare() will also return true. This is not necessarily true the other way around, especially for non-shallow types.

Comment by Yoon Soo Kim [ 28/Aug/21 ]

This is a fun.
I think that you're assuming that the document you pointed out is quoted from C++ standard library standard documentation but I could not find such statements in C++ standard document: http://open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4861.pdf. I would be surprised if the C++ standard document really specified such an undefined behavior. If a unordered_set user ever specified key_eq, key_eq is intended to be used for all key equality comparison and during operator== processing, it should use key_eq. Otherwise, it simply does not make a sense. Why would C++ standard committee define such a confusing behavior?

Also I tested your code and it passed on MSVC compiler and library implementation and TestIntEq is used for ASSERT_TRUE(setA == setB). So, my conclusion is that the document you pointed out is probably about Linux C++ library implementation (g++ or clang++ or both?) and absl hash set's behavior is wrong.

Comment by Ian Boros [ 28/Aug/21 ]

I think that our ValueEq implementation is already correct and is a wrapper for compareValue() but it's an absl hash set issue.

I'm hesitant to call this an absl issue. From what I could tell the documented and intended behavior is for the absl set's equality operator to rely on the equality operator of the underlying type rather than the ValueEq predicate. This is why we see this "shallow comparison" behavior. To me, the problem is our code, as compareValue() does not take this behavior into account.

The solution of simply not using the hash set's equality operator and writing our own logic for set equality is a fine workaround, though it doesn't stop other people from running into this issue if they use operator== on this type. Right now I only see one other use of ValueSetType but there will likely be more. I think we should consider investing in a fix that prevents anyone else from spending time on this again.

For example:

struct DeepEqualityTagValuePair {
    bool operator==(...); // Uses compareValue.
 
    TypeTags tag; // Or call them 'first' and 'second,' whatever you prefer.
    Value value;
};
using ValueSetType = absl::flat_hash_set<DeepEqualityTagValuePair, ValueHasher>; 

But maybe there's a better way?

Also, I believe one of the authors/contributors to abseil actually works here. Maybe we can get their thoughts.

Comment by Yoon Soo Kim [ 27/Aug/21 ]

The absl hash set's has_element() has the following implementation.

  bool has_element(const value_type& elem) const {
    size_t hash = PolicyTraits::apply(HashElement{hash_ref()}, elem);
    auto seq = probe(ctrl_, hash, capacity_);
    while (true) {
      Group g{ctrl_ + seq.offset()};
      for (int i : g.Match(H2(hash))) {
        if (ABSL_PREDICT_TRUE(PolicyTraits::element(slots_ + seq.offset(i)) ==
                              elem)) // [yoonsoo.kim] This is problematic. This does not use key_equal().
                                     // It should use key_equal() instead.
          return true;
      }
      if (ABSL_PREDICT_TRUE(g.MatchEmpty())) return false;
      seq.next();
      assert(seq.index() < capacity_ && "full table!");
    }
    return false;
  }

And hash set's operator== is defined as...

  friend bool operator==(const raw_hash_set& a, const raw_hash_set& b) {
    if (a.size() != b.size()) return false;
    const raw_hash_set* outer = &a;
    const raw_hash_set* inner = &b;
    if (outer->capacity() > inner->capacity()) std::swap(outer, inner);
    for (const value_type& elem : *outer)
      if (!inner->has_element(elem)) return false;
    return true;
  }

It's using has_element(). But find() implementation is using key_equal()

  template <class K = key_type>
  iterator find(const key_arg<K>& key, size_t hash) {
    auto seq = probe(ctrl_, hash, capacity_);
    while (true) {
      Group g{ctrl_ + seq.offset()};
      for (int i : g.Match(H2(hash))) {
        if (ABSL_PREDICT_TRUE(PolicyTraits::apply(
                EqualElement<K>{key, eq_ref()},
                PolicyTraits::element(slots_ + seq.offset(i)))))
          return iterator_at(seq.offset(i));
      }
      if (ABSL_PREDICT_TRUE(g.MatchEmpty())) return end();
      seq.next();
      assert(seq.index() < capacity_ && "full table!");
    }
  }
  template <class K = key_type>
  iterator find(const key_arg<K>& key) {
    return find(key, hash_ref()(key));
  }

I think that our ValueEq implementation is already correct and is a wrapper for compareValue() but it's an absl hash set issue.

As a workaround for SBE, if we modify the compareValue() for ArraySet comparison as follows, Irina's test case passes.

        if (lhsTag == TypeTags::ArraySet && rhsTag == TypeTags::ArraySet) {
            const auto& lhsSet = getArraySetView(lhsValue)->values();
            const auto& rhsSet = getArraySetView(rhsValue)->values();
            if (lhsSet.size() == rhsSet.size()) {
                for (const auto& e: rhsSet) {
                    if (!lhsSet.contains(e)) {
                        return {TypeTags::Nothing, 0};
                    }
                }
                return {TypeTags::NumberInt32, bitcastFrom<int32_t>(0)};
            }
            return {TypeTags::Nothing, 0};
        }

Comment by Ian Boros [ 26/Aug/21 ]

I tried this again with the standard library and got similar results.

#include <unordered_set>
 
// Hashing/equality operators which just dereference the int*.
struct TestIntHash {
    size_t operator()(int* i) const {
        return *i;
    }
};
 
struct TestIntEq {
    bool operator()(int* a, int* b) const {
        return *a == *b;
    }
 
};
 
void ASSERT_TRUE(bool v) {
    if (!v) {
        abort();
    }
}
 
int main() {
 
    // Sorry, these leak.
    int* two = new int(2);
    int* otherTwo = new int(2);
 
    std::unordered_set<int*, TestIntHash, TestIntEq> setA;
    setA.insert(two);
 
    std::unordered_set<int*, TestIntHash, TestIntEq> setB;
    setB.insert(otherTwo);
 
    ASSERT_TRUE(TestIntHash{}(two) == TestIntHash{}(otherTwo));
    ASSERT_TRUE(TestIntEq{}(two, otherTwo));
 
    ASSERT_TRUE(setA.size() == setB.size());
    // These appear to use the provided Predicate.
    for (auto && elem : setA) {
        ASSERT_TRUE(setB.count(elem));
    }
 
    for (auto && elem : setB) {
        ASSERT_TRUE(setA.count(elem));
    }
 
    // Doesn't pass.
    // Looks like the standard library implementation on my machine is using operator== on the
    // int* classes instead of the comparator that I provided.
    ASSERT_TRUE(setA == setB);
 
    return 0;
}
 
 

I thought I was losing my mind but then I read the docs for unordered_set operator==:

The behavior is undefined if Key is not EqualityComparable.

 and:

The behavior is also undefined if hash_function() and key_eq() do (until C+20)key_eq() does (since C+20) not have the same behavior on lhs and rhs or if operator== for Key is not a refinement of the partition into equivalent-key groups introduced by key_eq() (that is, if two elements that compare equal using operator== fall into different partitions)

It seems like unordered_set's equality operator is intentionally built on the underlying type's equality operator and not the key_eq predicate. I assume the abseil implementation just uses the same semantics as they're intended as "near drop-in" replacements? (edited)

 

My inclination is to get around this by redefining ValueSetType and ValueMapType to use a custom "pair" class that has its own operator== which is a wrapper for compareValue().

Comment by Ian Boros [ 26/Aug/21 ]

Here's a repro that doesn't involve SBE and may help diagnosing this:

// Weird equality/hashing operators where all even numbers are considered equal and all
// odd numbers are considered equal.
struct TestIntHash {
    size_t operator()(int i) const {
        return i % 2;
    }
};
 
struct TestIntEq {
    bool operator()(int a, int b) const {
        return (a % 2) == (b % 2);
    }
};
 
TEST_F(SbeAccumulatorBuilderTest, HashSetTestWithCustomComparator) {
    absl::node_hash_set<int, TestIntHash, TestIntEq> setA;
    setA.insert(2);
 
    absl::node_hash_set<int, TestIntHash, TestIntEq> setB;
    setB.insert(4);
 
    ASSERT_TRUE(TestIntHash{}(2) == TestIntHash{}(4));
    ASSERT_TRUE(TestIntEq{}(2, 4));
 
    ASSERT_TRUE(setA.size() == setB.size());
    for (auto && elem : setA) {
        ASSERT_TRUE(setB.count(elem));
    }
 
    for (auto && elem : setB) {
        ASSERT_TRUE(setA.count(elem));
    }
 
    // Doesn't pass.
    ASSERT_TRUE(setA == setB);
}
 

 

Generated at Thu Feb 08 05:47:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.