[SERVER-4359] Non-atomic reference counting in threaded aggregation. Created: 23/Nov/11  Updated: 11/Jul/16  Resolved: 29/Nov/11

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 2.1.0

Type: Bug Priority: Major - P3
Reporter: Sergey Mitsyn Assignee: Chris Westin
Resolution: Done Votes: 0
Labels: aggregation, concurrency, crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 5; probably any OS


Issue Links:
Depends
is depended on by SERVER-447 new aggregation framework Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

With aggregation command, expressions use shared Value instances across multiple threads. Such shared instances may be got using eg. Value::getTrue().

As mongo::Value is reference counted with mongo::IntrusiveCounter, operations on counter are not atomic. With two or more simultaneous aggregations being run this (may) results in freeing of shared values that are being used and thus strange errors like double free detected by glibc or assertions in Value::coerceToBool() for boolean Value. .

This 'bug' can be fixed by changing IntrusiveCounter's counter variable type to bson::AtomicUInt, which is ugly but works. Maybe the final fix shouldn't be so radical.



 Comments   
Comment by Chris Westin [ 29/Nov/11 ]

Using atomic ints for this would be a heavy hammer. The aggregation pipelines are not normally expected to share any values between threads or different aggregations. In this case, it is the re-usable static values that are the problem.

If we were to use atomic integers, it would cause processor cache lines to be flushed so that these reference counters could be incremented and decremented across processor cores when different aggregation pipelines are being executed by different threads on different cores; I suspect that would decrease overall throughput as these values get shuffled back and forth between cores.

The original implementation used shared_ptr<>, which comes with a recipe for not modifying reference counts for static objects. Unfortunately, that got lost in translation to intrusive_ptr<> (a move which was made because profiling showed memory allocation of the sideband counters for shared_ptr<>). But that's the right principle. So I made the addRef() and release() methods on IntrusiveCounter abstract, and created a concrete implementation in IntrusiveCounterUnsigned. I also created a ValueStatic class that derives from Value, and which overrides addRef() and release() to do nothing. The static values are now initialized from that. This eliminates any need to worry about thread safety for the reference counts on the static objects, which never go away anyway. Future users of IntrusiveCounter may use atomic integers if they expect to share objects between threads.

Before committing this change, I benchmarked it again to make sure the change of the reference counter maintenance from being inlined to being indirect function calls didn't affect things too much, and it barely showed up.

Generated at Thu Feb 08 03:05:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.