[SERVER-46406] Evaluate JSON logging performance in test suites Created: 25/Feb/20  Updated: 29/Oct/23  Resolved: 13/Mar/20

Status: Closed
Project: Core Server
Component/s: Logging
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Task Priority: Major - P3
Reporter: Mark Benvenuto Assignee: Henrik Edin
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Log_Overview.png     PNG File json_formatter.png     PNG File libfmt_slow.png     PNG File open_record.png    
Issue Links:
Backports
Duplicate
duplicates SERVER-46467 Test impact of UTF8 validation on tes... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Security 2020-03-09, Security 2020-03-23
Participants:

 Description   

Use evergreen data to compare logging performance and evaluate any opportunities for performance improvements.

One suggestion has been to disable UTF-8 validation in logging. Other possibilities include examining memory and I/O overhead.



 Comments   
Comment by Githook User [ 16/Mar/20 ]

Author:

{'username': 'henrikedin', 'name': 'Henrik Edin', 'email': 'henrik.edin@mongodb.com'}

Message: SERVER-46406 Perf improvements in the logger.

Use fmt::compile and fmt::format_int when possible.

(cherry picked from commit 5094b4f2f2ee9de07dc4c3d3c320e967595496f1)
Branch: v4.4
https://github.com/mongodb/mongo/commit/bc43f398697282a5fedb129d6913ef09244af6fb

Comment by Henrik Edin [ 13/Mar/20 ]

Looked into the difference in run time by comparing the sharding suite in these two runs:

Jan 24:
https://evergreen.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_62_64_bit_display_sharding_38ec223f7478be14fc3bf082643a1109efaeb57c_20_01_25_02_28_44

Feb 25:
https://evergreen.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_62_64_bit_display_sharding_742ac9b37b1d8f489e1b259a0a3575f8811edac4_20_02_25_14_46_48

Execution subtasks went from 26 to 28
Cumulative time went from 1h 54m 55s to 2h 16m 53s
Wall clock time went from 31m 32s to 1h 0m 10s

Before there were 972 jobs that had a total runtime of 3h 37m 15s
After there were 1012 jobs that had a total runtime of 3h 58m 15s

The number of jobs increased and the cumulative time increased by around 20%, but the wall clock time increased by a whopping 100%.

When looking at the logs from the various subtasks you notice:

Before, the gen task finished at 21:33 and the exec tasks started around 21:53 and ran for up to 10 min.
After, the gen task finished at 10:01 and some exec tasks started at 10:20, but most of them didn't start until 10:50. That's 50 min of no work; this is where the wall clock time is going.

So from what I can tell, the problem is a scheduling/resource issue in Evergreen. There aren't enough hosts to schedule the execution, leading to long wall clock times. When the tasks do eventually run, the execution time is not much different, and even though total execution time increased, more tests run now than before.

I will close this ticket with the CPU perf improvements that have been committed.

Comment by Githook User [ 13/Mar/20 ]

Author:

{'name': 'Henrik Edin', 'username': 'henrikedin', 'email': 'henrik.edin@mongodb.com'}

Message: SERVER-46406 Perf improvements in the logger.

Use fmt::compile and fmt::format_int when possible.
Branch: master
https://github.com/mongodb/mongo/commit/5094b4f2f2ee9de07dc4c3d3c320e967595496f1
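The fast path the commit refers to, fmt::format_int, avoids format-string parsing by emitting digits directly into a small stack buffer, two at a time from a lookup table, written back-to-front. A minimal hand-rolled sketch of that technique (illustrative only; `fast_utoa` is not the real fmt API):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of the technique behind fmt::format_int: produce digits two
// at a time from a lookup table, writing right-to-left into a stack
// buffer, with no format-string parsing anywhere in the path.
std::string fast_utoa(uint64_t value) {
    static const char digit_pairs[] =
        "0001020304050607080910111213141516171819"
        "2021222324252627282930313233343536373839"
        "4041424344454647484950515253545556575859"
        "6061626364656667686970717273747576777879"
        "8081828384858687888990919293949596979899";
    char buf[20];                 // max digits for a uint64_t
    char* p = buf + sizeof(buf);  // fill right-to-left
    while (value >= 100) {
        unsigned idx = static_cast<unsigned>(value % 100) * 2;
        value /= 100;
        *--p = digit_pairs[idx + 1];
        *--p = digit_pairs[idx];
    }
    if (value >= 10) {
        unsigned idx = static_cast<unsigned>(value) * 2;
        *--p = digit_pairs[idx + 1];
        *--p = digit_pairs[idx];
    } else {
        *--p = static_cast<char>('0' + value);
    }
    return std::string(p, buf + sizeof(buf));
}
```

The table halves the number of divisions versus one digit per iteration, which is where most of the speedup over sprintf-style formatting comes from.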

Comment by Mark Benvenuto [ 12/Mar/20 ]

Assigning to henrik.edin since he is making the code changes to address some of the issues identified.

Comment by Bruce Lucas (Inactive) [ 11/Mar/20 ]

Thanks Henrik, that's good to know.

Comment by Henrik Edin [ 11/Mar/20 ]

bruce.lucas, mark.benvenuto I ran the same benchmark on 4.2 and it shows a similar picture. LogstreamBuilder::~LogstreamBuilder() + OpDebug::report(), which are used to log in completeAndLogOperation(), take almost as much time as mongo::performUpdates(). That might be the bulk of the logging, but formatting in other log statements using operator<< is not included because it happens outside of the log system in the old logger.

Comment by Henrik Edin [ 10/Mar/20 ]

mark.benvenuto Sent you a CR for low hanging fruit improvements.

V1 escaped in BSONObj::jsonString, but it did not escape all the strings as we do now. And when a BSONObj was logged with the regular toString(), there was no escaping.

Using SSE to further improve escapeForJSON is an interesting idea, but I think that's too large to do as part of this. We should ticket it out separately.

 

Comment by Mark Benvenuto [ 09/Mar/20 ]

bruce.lucas, I did not compare against the original version. I originally thought of comparing a "verbose" jstest, but there is too much variance in any given test for it to be a good comparison. I think it would be a good idea to compare YCSB on, say, 4.2.x, which is still using the original V1 logging framework. It would be interesting to look at the relative cost comparison.

henrik.edin
1. itoa - We have our own optimized version which is even faster - see itoa.h
2. I see how it buffers now. The code was difficult for me to read. I still believe we have an opportunity to optimize this code.

Where does the log V1 code escape? I found some code on Windows for newlines and some UTF-8 handling for the Windows console, but I did not find any code for escaping when writing to a file.

I think the Evergreen perf problems are twofold:
1. Larger I/O - as has been discussed, we are 20-50% larger. This is probably most of the cost.
2. CPU - CPU time matters and is causing some of the slowdown.

In general, benchmarks are designed to not log, which is why they will not show a problem. My test was artificially pushing the logging code harder than a benchmark normally would.

Appendix:
For an example of optimizations, we could focus on just scanning for valid ASCII. We could use various tricks to scan blocks of characters instead of one at a time, either with regular CPU instructions or, in many cases, with SIMD.
There is also the magic _mm_cmpestri SSE 4.2 instruction.

Untested example code:

#include <cstdint>
#include <cstring>

// https://graphics.stanford.edu/~seander/bithacks.html
// (specialized here to 32-bit words; hasless/hasvalue are valid for
// bytes <= 0x7f, which the high-bit mask check below guarantees)
#define haszero(v) (((v) - 0x01010101u) & ~(v) & 0x80808080u)
#define hasvalue(x,n) (haszero((x) ^ (0x01010101u * (n))))
#define hasless(x,n) (((x) - 0x01010101u * (n)) & ~(x) & 0x80808080u)

// Returns true if every byte is plain ASCII needing no JSON escaping:
// no high bit set (non-ASCII), no control character (< 0x20 or 0x7f),
// and no '"' or '\\'.
bool isValidJSONCharacters(const char* str, int len) {
    const uint32_t utf8_mask = 0x80808080u;
    // Scan four bytes at a time.
    while (len >= 4) {
        uint32_t word;
        memcpy(&word, str, sizeof(word));  // avoids unaligned reads
        if ((word & utf8_mask) != 0 || hasless(word, 0x20) ||
            hasvalue(word, '"') || hasvalue(word, '\\') ||
            hasvalue(word, 0x7f)) {
            return false;
        }
        str += 4;
        len -= 4;
    }
    // Handle the remaining 0-3 bytes one at a time.
    for (int i = 0; i < len; i++) {
        unsigned char c = static_cast<unsigned char>(str[i]);
        if (c >= 0x80 || c < 0x20 || c == '"' || c == '\\' || c == 0x7f) {
            return false;
        }
    }
    return true;
}

Comment by Henrik Edin [ 07/Mar/20 ]

Thanks for the excellent analysis Mark! A few comments and clarifications:

1. Using faster formatting for integers than fmt::format is probably a good idea. Perhaps the answer is itoa, but there's also a fast path for integers in libfmt called format_int that's supposed to be very fast. See this benchmark: http://www.zverovich.net/2013/09/07/integer-to-string-conversion-in-cplusplus.html This seems to be low-hanging fruit we should certainly pursue.

2. escapeForJSON does not write bytes one at a time, but it analyzes them one at a time. At a minimum this is something we need to do to escape ASCII control characters (the old logger also did this), but in addition it also validates the UTF-8 encoding by doing bit masking. Blocks of data that can be written unchanged are written as blocks.
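A minimal sketch of the block-writing approach described in point 2 (illustrative only; this is not the actual escapeForJSON implementation, it handles only a few escapes and omits the UTF-8 validation step):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scan for the next byte that needs escaping and append everything
// before it in one call, so clean ASCII runs are copied as blocks
// rather than one character at a time.
std::string escapeBlocks(const std::string& in) {
    std::string out;
    std::size_t start = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x20 || c == '"' || c == '\\') {
            out.append(in, start, i - start);  // flush the clean block
            switch (c) {
                case '"':  out += "\\\""; break;
                case '\\': out += "\\\\"; break;
                case '\n': out += "\\n"; break;
                case '\t': out += "\\t"; break;
                default: {  // remaining control chars as \u00XX
                    const char hex[] = "0123456789abcdef";
                    out += "\\u00";
                    out += hex[c >> 4];
                    out += hex[c & 0xf];
                }
            }
            start = i + 1;
        }
    }
    out.append(in, start, in.size() - start);  // trailing clean block
    return out;
}
```

For log lines that are mostly plain ASCII, the per-byte work reduces to the scan itself; all copying happens in bulk appends.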

The overall time spent in the logger looks worrisome, lots of it is spent in boost::log and there's not much we can do here apart from modify the library or use a different logging backend.

It would be interesting to run the same benchmark on 4.2 and see how much time we spent in the old logger. It was also doing expensive formatting using iostreams and might also show up high up in this benchmark. The perf fallout so far from this project is one test being slower but a couple being faster.

Do you think the CPU perf is what's causing issues on Evergreen? I would think that it would affect many more benchmarks if that was the case.

 

Comment by Bruce Lucas (Inactive) [ 07/Mar/20 ]

Great work Mark. Do you have an idea how the cost of logging compares with v1? Will customers who are experiencing a performance problem causing a lot of slow query logging, or who intentionally set slowms to 0 or set log level to 1, see it more impacted by logging than before?

Comment by Mark Benvenuto [ 06/Mar/20 ]

I did two analyses:

  1. Verified I/O was reasonably efficient.
  2. Did sampling based profile of logging code

 

I/O

We are flushing whole lines. We are not writing a single log line over multiple calls. The write occurs via ostream::flush().

 

Logging Code

Test case

YCSB, 50000 Records, workload a

MongoD args: --logpath=/dev/null --slowms=0 --storageEngine=inMemory --dbpath=/tmp/foo

Profiler: VTune 2020

 

Results:

If we look at the overall most expensive functions in this artificial workload to magnify the cost of logging, we see that logging is almost as expensive as doing updates, and we spend about half the command invocation time in logging.

 

Breaking it down: open_record is unexpectedly expensive; it does a hash table lookup just to determine if we need to log, and takes two different read locks.
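One hypothetical way to cheapen that check (an assumption for illustration, not the actual fix that landed): cache the effective minimum severity per component in an atomic, so the hot "should we log at all?" test is a single lock-free load, with the hash lookup and locks paid only when settings change.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch: the hot-path log-level check is one relaxed
// atomic load; the expensive settings lookup happens only on updates.
enum class Severity : int { Debug = 0, Info = 1, Warning = 2, Error = 3 };

class ComponentSeverityCache {
public:
    // Hot path: no locks, no hash lookup.
    bool shouldLog(Severity s) const {
        return static_cast<int>(s) >= _minSeverity.load(std::memory_order_relaxed);
    }
    // Cold path: called only when log settings change.
    void setMinSeverity(Severity s) {
        _minSeverity.store(static_cast<int>(s), std::memory_order_relaxed);
    }

private:
    std::atomic<int> _minSeverity{static_cast<int>(Severity::Info)};
};
```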

 

Next:

As we move into the meat of the logging, our test case is mostly slow query logs which need to log BSON. As a result, we see how expensive formatting a BSONObj is.

  1. writeString - (i.e. really escapeForJSON) - we need to avoid writing one character at a time. It should be optimized for ASCII, at a small performance penalty for non-ASCII.
  2. format_to - it is formatting a uint64_t integer and yet dominates the cost of formatting over a string.

 

Finally:

In the rest of the formatting code, we see that libfmt dominates the cost excessively. Most of this cost comes from the library having to parse the format string. Also, outputDateAsISOStringLocal is expensive because of a mix of tz_convert and sprintf.

And no, the dominant cost for libfmt is not coming from it appending the values to the output stream. It is from parsing the format specifiers.

Recommendations:

  1. Remove libfmt from the main logging code path. libfmt is just a new version of sprintf and sprintf is slow. We should call functions like itoa instead and do more manual formatting.
  2. Optimize escapeForJSON to stop appending one character at a time. It should be optimized for ASCII over UTF-8 and consider escaping an unlikely event.
  3. Improve outputDateAsISOStringLocal by avoiding sprintf and family.
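Recommendation 3 could look something like the following sketch (hypothetical helper names; the real outputDateAsISOStringLocal also handles time zones via tz_convert, which this omits): write zero-padded digits directly, with no sprintf or iostreams in the path.

```cpp
#include <cassert>
#include <string>

// Append a value in [0, 99] as exactly two zero-padded digits.
static void put2(std::string& s, int v) {
    s += static_cast<char>('0' + v / 10);
    s += static_cast<char>('0' + v % 10);
}

// Format a broken-down time as "YYYY-MM-DDTHH:MM:SS.mmm" by direct
// digit writes; each component is assumed already range-checked.
std::string isoTimestamp(int year, int mon, int day,
                         int hour, int min, int sec, int millis) {
    std::string s;
    s.reserve(23);
    put2(s, year / 100);
    put2(s, year % 100);
    s += '-'; put2(s, mon);
    s += '-'; put2(s, day);
    s += 'T'; put2(s, hour);
    s += ':'; put2(s, min);
    s += ':'; put2(s, sec);
    s += '.';
    put2(s, millis / 10);                      // first two millis digits
    s += static_cast<char>('0' + millis % 10); // last millis digit
    return s;
}
```

Pair this with the tz_convert result cached per log batch and the sprintf cost disappears entirely from the formatting path.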

 

Generated at Thu Feb 08 05:11:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.