[SERVER-69987] Investigate big_collection regressions in SBE Created: 26/Sep/22  Updated: 29/Oct/23  Resolved: 18/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Task Priority: Major - P3
Reporter: Mihai Andrei Assignee: Anna Wawrzyniak
Resolution: Fixed Votes: 0
Labels: pm2697-m3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File screenshot-1.png    
Issue Links:
Depends
is depended on by SERVER-69990 Investigate performance of Where.Comp... Closed
Problem/Incident
Backwards Compatibility: Fully Compatible
Sprint: QE 2022-10-31, QE 2022-11-14, QE 2022-11-28, QE 2022-12-12, QE 2022-12-26, QE 2023-01-09, QE 2023-01-23
Participants:
Linked BF Score: 35
Story Points: 10

 Description   

This task includes, but is not limited to, investigating the following tests: 

BigCollection.Filter (nDocs: 25, docSize: 16777216, batchSize: 0) -25.9781047
BigCollection.Filter (nDocs: 400, docSize: 1048576, batchSize: 1) -23.54608963
BigCollection.Filter (nDocs: 6400, docSize: 65536, batchSize: 16) -10.28602421
BigCollection.Scan (nDocs: 25, docSize: 16777216, batchSize: 0) -27.40575863
BigCollection.Scan (nDocs: 400, docSize: 1048576, batchSize: 1) -22.98373918


 Comments   
Comment by Githook User [ 18/Jan/23 ]

Author: Drew Paroski (drew.paroski@mongodb.com, username: paroski)

Message: SERVER-69987 Avoid copying slot data during saveState at the command boundary in SBE
Branch: master
https://github.com/mongodb/mongo/commit/7280be3b5e1517c767f8b637e28bb23d9da008db

Comment by Anna Wawrzyniak [ 03/Nov/22 ]

PM-2451 (https://jira.mongodb.org/browse/PM-2451) looks like it would solve this problem (option 3).

Comment by Anna Wawrzyniak [ 02/Nov/22 ]

Performance comparison for big collections:

https://docs.google.com/spreadsheets/d/1ZluIWD522RdxJScTxIjZKNNRr-kgKpZMl4pr35-KqEo/edit?usp=sharing

 

|| Test || Classic || SBE || Prototype 1 || Prototype 2 || Prototype 1 vs Classic || Prototype 1 vs SBE || Prototype 2 vs Classic || Prototype 2 vs SBE ||
| BigCollection.Filter (nDocs: 25, docSize: 16777216, batchSize: 0) | 2.612198471 | 2.111650135 | 2.616177834 | 2.616677657 | 1.001523377 | 1.2389258 | 1.001714719 | 1.239162498 |
| BigCollection.Filter (nDocs: 400, docSize: 1048576, batchSize: 1) | 3.224426393 | 2.940898805 | 3.029434149 | 3.223661596 | 0.9395265327 | 1.030104859 | 0.9997628116 | 1.096148426 |
| BigCollection.Filter (nDocs: 6400, docSize: 65536, batchSize: 16) | 3.162875012 | 3.164483595 | 3.211180576 | 3.154049627 | 1.015272676 | 1.014756588 | 0.9972096953 | 0.9967027897 |
| BigCollection.Scan (nDocs: 25, docSize: 16777216, batchSize: 0) | 2.381298787 | 1.908265342 | 2.40463363 | 2.413963529 | 1.009799208 | 1.260114921 | 1.013717196 | 1.265004125 |
| BigCollection.Scan (nDocs: 400, docSize: 1048576, batchSize: 1) | 2.894236321 | 2.679483906 | 2.895949339 | 2.895826246 | 1.000591872 | 1.080786241 | 1.000549342 | 1.080740302 |

Both prototypes fix the regression.
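
The "vs" columns appear to be the prototype's value divided by the baseline's value, so a ratio near 1.0 against Classic means the prototype matches Classic. The short Python check below recomputes the ratios for the first row (BigCollection.Filter, nDocs: 25); the variable and function names are illustrative only and are not part of the benchmark tooling.

```python
# Values copied from the first row of the comparison table.
classic = 2.612198471
sbe = 2.111650135
prototype_1 = 2.616177834
prototype_2 = 2.616677657


def ratio(candidate, baseline):
    """Return candidate / baseline, the assumed definition of the 'vs' columns."""
    return candidate / baseline


ratios = {
    "Prototype 1 vs Classic": ratio(prototype_1, classic),
    "Prototype 1 vs SBE": ratio(prototype_1, sbe),
    "Prototype 2 vs Classic": ratio(prototype_2, classic),
    "Prototype 2 vs SBE": ratio(prototype_2, sbe),
}

for label, value in ratios.items():
    print(f"{label}: {value:.6f}")
```

The recomputed values agree with the table to at least six decimal places, which supports reading the columns as simple quotients.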

Comment by Anna Wawrzyniak [ 02/Nov/22 ]

This issue is caused by unnecessary copying of BSON objects by the save/restore logic that runs at the GetMore command boundary. The larger the document and the smaller the batch size (in number of documents), the larger the per-batch overhead of copying a document each time a GetMore command completes.

Details:
When a GetMore command completes, we need to perform saveState on the PlanStage tree, and when the subsequent GetMore command runs, we perform restoreState. Specifically, in this case SBE's ScanStage::saveState assumes that the BSON record will no longer be available, and also conservatively assumes that the slot where the document was stored may be accessed after restoreState, so it copies the document.

In the case of simple scans, or plans that consist of streaming stages above the scan (makeBson, filter, project, traverse, limit, etc.), the stored document returned by the scan stage will never be accessed, and the copy made in saveState will be thrown away.
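
To see why the overhead grows as the batch size shrinks, here is a rough Python model. It is a sketch under a simplifying assumption, not a measurement: it assumes exactly one document copy per batch (one saveState per command boundary) and only covers an explicit batch size of 1 or more, so the batchSize: 0 configurations are out of scope. The function names are hypothetical.

```python
import math


def bytes_copied_at_batch_boundaries(n_docs, doc_size, batch_size):
    """Bytes copied by saveState, assuming one document copy per batch."""
    if batch_size < 1:
        raise ValueError("this model only covers an explicit batch size >= 1")
    n_batches = math.ceil(n_docs / batch_size)
    return n_batches * doc_size


def copy_overhead_ratio(n_docs, doc_size, batch_size):
    """Copied bytes as a fraction of the bytes actually returned."""
    copied = bytes_copied_at_batch_boundaries(n_docs, doc_size, batch_size)
    return copied / (n_docs * doc_size)


# Two of the benchmark configurations listed in the description.
for n_docs, doc_size, batch_size in [(400, 1048576, 1), (6400, 65536, 16)]:
    print(n_docs, doc_size, batch_size,
          copy_overhead_ratio(n_docs, doc_size, batch_size))
```

Under this model the batchSize: 1 configuration copies as many bytes as it returns (ratio 1.0), while the batchSize: 16 configuration copies only 1/16 of them (ratio 0.0625), which is consistent with the smaller regression reported for the larger batch size.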

Possible solutions:

1) Do nothing and accept the overhead and the regression compared to classic for cases with large documents and small batch sizes.

2) Advise customers to use a larger batch size. This may not be practical when the collection contains large documents. A batch of 100 documents of 16 MB each would result in 1.6 GB per batch, which may not be a good choice.

3) Change the storage API to guarantee some form of "stable pointers" to returned documents that survive context switches and yielding. A storage layer that supports MVCC or copy-on-write might be able to satisfy that requirement even if the page was modified. However, in certain cases a copy might still need to be made (for example, when the old page needs to be collected for some reason). In such a case, QE would still need to be able to switch to the new document pointer and possibly restore all views/subtrees of that document.

4) Modify the save/restoreState logic in SBE to avoid unnecessary copies where it is known that the slots holding the document will not be accessed until the subsequent getNext(). GetMore always performs getNext() as the first operation after restoreState on the root, so that invariant holds for the root and for all its streamed inner-most children. Such an extension of the save/restore logic would need a way for GetMore to notify the stage that its slots will no longer be accessed until the following getNext(); that information would then need to be propagated through the SBE stage tree to identify all stages that can safely discard their state when performing save/restore.

Prototype #1:

https://github.com/10gen/mongo/pull/new/anna.wawrzyniak/save_restore

This extends saveState with a "bool discardPublicState" parameter indicating that the public slots of the stage will not be accessed until the subsequent getNext(). Streaming stages propagate discardPublicState to their children when appropriate. The default implementation conservatively assumes discardPublicState = false.
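
The idea can be illustrated with a small Python model. This is a simplified sketch, not the server's C++ code: the class names mirror the stage names used in this ticket, but the methods, the record_slot attribute, and the copies_made counter are assumptions made for illustration.

```python
class Stage:
    """Base stage: conservative save_state that never lets children discard."""

    def __init__(self, child=None):
        self.child = child

    def save_state(self, discard_public_state=False):
        # Default behaviour: assume the child's slots may still be read.
        if self.child is not None:
            self.child.save_state(discard_public_state=False)


class ScanStage(Stage):
    def __init__(self, records):
        super().__init__()
        self._records = iter(records)
        self.record_slot = None  # models a view into storage-owned memory
        self.copies_made = 0

    def get_next(self):
        self.record_slot = next(self._records, None)
        return self.record_slot

    def save_state(self, discard_public_state=False):
        if self.record_slot is None:
            return
        if discard_public_state:
            # Nobody will read the slot before the next get_next(): drop it.
            self.record_slot = None
        else:
            # Conservative path: take an owned copy of the document.
            self.record_slot = bytearray(self.record_slot)
            self.copies_made += 1


class FilterStage(Stage):
    """Streaming stage: forwards the discard flag to its child."""

    def __init__(self, child, predicate):
        super().__init__(child)
        self._predicate = predicate

    def get_next(self):
        while True:
            doc = self.child.get_next()
            if doc is None or self._predicate(doc):
                return doc

    def save_state(self, discard_public_state=False):
        self.child.save_state(discard_public_state)
```

In this model the GetMore boundary would call save_state(discard_public_state=True) on the root. A filter above a scan then skips the copy, while any stage that keeps the base-class behaviour still forces its child to copy.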

 

Prototype #2:

https://github.com/10gen/mongo/pull/new/anna.wawrzyniak/save_restore2

This uses the existing mechanism, already used by yielding, for marking slots as not needed. Stages already use disableSlotAccess to indicate that slots are no longer needed, in two forms:

a) non-recursive - typically called from the getNext method to indicate that the slots are no longer needed and that they will be recomputed when getNext completes
b) recursive - typically called when it is known that the slots are no longer needed and will not be used before the subsequent call to close/open.

A GetMore command executor could use the disableSlotAccess() method to indicate that the slots are no longer required until the subsequent getNext call. However, the non-recursive version of disableSlotAccess only marks the parent stage and does not propagate that information to children. In the case of streaming stages, that information could be propagated to children when appropriate, preventing unnecessary slot copying when it is safe to do so.

The prototype avoids the potential quadratic complexity that would arise if disableSlotAccess propagated to children in every getNext call, by using lazy evaluation: only when saveState is called does the subtree actually compute whether each stage has slot access enabled or disabled.
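
One way to picture the lazy evaluation is the Python sketch below. It is an interpretation of the description above, not the prototype's code: the Stage class, the streaming flag, and the copies_made counter are hypothetical. Marking a stage stays a constant-time, non-recursive operation, and the effective state of each stage is derived in a single top-down walk when save_state runs.

```python
class Stage:
    def __init__(self, name, child=None, streaming=False):
        self.name = name
        self.child = child
        self.streaming = streaming  # True if the stage streams its child's rows
        self.slot_access_disabled = False
        self.copies_made = 0

    def disable_slot_access(self):
        """Non-recursive: marks only this stage, so it is O(1)."""
        self.slot_access_disabled = True

    def enable_slot_access(self):
        self.slot_access_disabled = False


def save_state(stage, inherited_disabled=False):
    """Walk the tree once, deciding per stage whether a copy is needed."""
    disabled = stage.slot_access_disabled or inherited_disabled
    if not disabled:
        # Slots may still be read after restore: keep an owned copy.
        stage.copies_made += 1
    if stage.child is not None:
        # Only streaming stages pass the "not needed" state down.
        save_state(stage.child, inherited_disabled=disabled and stage.streaming)


# Streaming chain: project -> filter -> scan.
scan = Stage("scan")
filt = Stage("filter", child=scan, streaming=True)
root = Stage("project", child=filt, streaming=True)

root.disable_slot_access()  # what the GetMore executor would do
save_state(root)
print([(s.name, s.copies_made) for s in (root, filt, scan)])
```

With an all-streaming chain no stage makes a copy. If a non-streaming stage sits between the root and the scan, the walk stops propagating at that stage and the scan keeps its conservative copy.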

 

 

Performance:

 

 

Generated at Thu Feb 08 06:14:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.