[SERVER-58263] Fix absl map erase loop in TenantMigrationAccessBlockerRegistry Created: 03/Jul/21  Updated: 29/Oct/23  Resolved: 06/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-51476 Upgrade Abseil to 20210324.1 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Service Arch 2021-07-12
Participants:

 Description   

absl hash maps can't ranged-for while erasing elements.
Blocking abseil upgrade SERVER-51476.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 29/Sep/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-58263 fix hash map erase loop in TenantMigrationAccessBlockerRegistry

(cherry picked from commit efdd9cd309c2e1f6a10762b7dff3913490ea71a3)
Branch: v5.0
https://github.com/mongodb/mongo/commit/bffbdc273f45f79390ed6ee0f67be76e505844b4

Comment by Billy Donahue [ 06/Jul/21 ]

Every container documents the iterator invalidation semantics of erase and they're all a little different.

In this case, erase is said to invaidate any iterators pointing to the erased element. Other iterators are unaffected.

The m.erase(it++) or it=m.erase(it) is old school.

It's tricky but it's been around a long time and it's one of those things you just have to read Scott Meyers (or similar) and know about.
In C++20 we're getting erase_if overloads for all the containers which might help to make the common cases easier to write.

I wouldn't trust sanitizers to find these problems. I'm not sure a sanitizer can tell when it's violated as it's not necessarily a use of invalid memory. We are relying on absl internal asserts.

Maybe we need to invest in C++ training? This isn't a new kind of bug. What's new is Abseil's former tolerance of it, but this would have happened in any container. I'm not sure how to keep it from recurring other than having stronger code reviews.

Comment by Daniel Gottlieb (Inactive) [ 06/Jul/21 ]

Thanks for engaging me on this, Billy.

so the iterator variable it local to the loop has already advanced when the erase happens

Ahh. So would the following be true then:
"It is legal to remove items that an iterator has already visited and is no longer positioned on."
Or is it simply:
"It is legal to remove items that no iterator is positioned on" regardless of whether the iterator is in front of or behind the removed item.

The proper code to perform this relatively common technique seems trickier than I was expecting. I have an interest in this because a previous case of iterating + removing that I debugged resulted in not actually deleting all the things. We solved that issue by making vector of items to delete, then passing over the vector to actually remove from the map.

I'm curious what you think might be a good set of guard rails to more proactively catch these issues in the future? Or do you think that the

{A,UB}

SAN with the newer versions of abseil will be sufficient to identify these bugs before they hit customers?

Comment by Billy Donahue [ 06/Jul/21 ]

The key to this fix is the (it++), so the iterator variable it local to the loop has already advanced when the erase happens. Ranged-for has no way to do this.

Index: src/mongo/db/repl/tenant_migration_access_blocker_registry.cpp
diff --git a/src/mongo/db/repl/tenant_migration_access_blocker_registry.cpp b/src/mongo/db/repl/tenant_migration_access_blocker_registry.cpp
index 27c8ad9c03a2305b467a082fea8c2edf7a7f34b0..0370d7d7fe3f344c2797adc6035d88e05cc51da8 100644
--- a/src/mongo/db/repl/tenant_migration_access_blocker_registry.cpp
+++ b/src/mongo/db/repl/tenant_migration_access_blocker_registry.cpp
@@ -85,8 +85,9 @@ void TenantMigrationAccessBlockerRegistry::remove(StringData tenantId, MtabType
 void TenantMigrationAccessBlockerRegistry::removeAll(MtabType type) {
     stdx::lock_guard<Latch> lg(_mutex);
 
-    for (auto& [tenantId, _] : _tenantMigrationAccessBlockers) {
-        _remove(lg, tenantId, type);
+    for (auto it = _tenantMigrationAccessBlockers.begin();
+         it != _tenantMigrationAccessBlockers.end();) {
+        _remove(lg, (it++)->first, type);
     }
 }

Comment by Daniel Gottlieb (Inactive) [ 06/Jul/21 ]

Thanks for the response.

A ranged-for loop that does an erase is always a bug in just about any container.

I'm surprised by that. From the patch you made, we're still just removing based on the key (as opposed to calling a remove on the iterator/position). This means to me that a ranged-for loop is not using iterators under the hood? Based on your knowledge, it's not obvious to me why the new code is correct.

My experience is with Java where this sort of bug is deterministically caught with a ConcurrentModificationException. Only calling remove on the iterator will succeed, forcing the programmer to forgo the for-each sugar.

Comment by Githook User [ 06/Jul/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-58263 fix hash map erase loop in TenantMigrationAccessBlockerRegistry
Branch: master
https://github.com/mongodb/mongo/commit/efdd9cd309c2e1f6a10762b7dff3913490ea71a3

Comment by Billy Donahue [ 06/Jul/21 ]

Billy Donahue, the change from the wrong code to the right code is subtle. I'm curious, will the new version of abseil deterministically error out at runtime when a programmer makes this mistake or is the behavior undefined?

I wouldn't phrase it as an either/or, as the behavior is technically undefined whether the container can reliably error out or not . I don't think it can, though. I think it's erroring out only because it's in a debug build and the absl::node_hash_map happens to be catching a problem with a slot being in an unexpected state when it's accessed by an invalidated iterator. I'm not sure, but I think even in debug builds it's possible that the node_hash_map could happen to be in a configuration for which this assertion wouldn't triggered.

I haven't looked at the code up close and I don't know what the salient code difference is between the old and new absl code. I should do that one of these days so we understand what we're really dealing with here. So far I'm just patching in the new Abseil and chasing down stacktraces.

A ranged-for loop that does an erase is always a bug in just about any container. I think previous implementations of absl just weren't catching it and were perhaps structured in such a way that it worked anyway.

Comment by Daniel Gottlieb (Inactive) [ 06/Jul/21 ]

billy.donahue, the change from the wrong code to the right code is subtle. I'm curious, will the new version of abseil deterministically error out at runtime when a programmer makes this mistake or is the behavior undefined?

Comment by Billy Donahue [ 03/Jul/21 ]

Code review: https://mongodbcr.appspot.com/814140004

Generated at Thu Feb 08 05:44:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.