[SERVER-2641] Use larger buffer for mongoimport Created: 28/Feb/11 Updated: 12/Oct/11 Resolved: 11/Oct/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Tools |
| Affects Version/s: | 1.6.5 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Roger Binns | Assignee: | Brandon Diamond |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 10.10 AMD64 |
| Participants: |
| Description |
|
Currently mongoimport uses a 4 KB buffer for reading data, which can be confirmed by running strace. I provide the data via standard input. Unfortunately this is a very small buffer size, so there are lots of read() calls, and since I am piping data in from another process, each read also causes context switches. The Linux pipe buffer size is 64 KB and even the stdio default buffer size is 4 or 8 KB, so the 4 KB buffer in mongoimport is really out of sync. Since the code is doing bulk reads, please use a larger size; 64 KB at least would be nice. I'm supplying over 2 GB of data, and the net effect of the current value is that imports are slower than they should be. |
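A minimal sketch of the access pattern being asked for here, assuming a standalone reader that drains standard input in 64 KB chunks (the names such as kBufSize are illustrative; this is not the mongoimport source):

```cpp
// Illustration only: read standard input in 64 KB chunks so each read()
// syscall can transfer up to a full Linux pipe buffer instead of 4 KB.
#include <unistd.h>

#include <cstdio>
#include <vector>

int main() {
    const size_t kBufSize = 64 * 1024;   // at least the Linux pipe buffer size
    std::vector<char> buf(kBufSize);

    size_t total = 0;
    ssize_t n;
    while ((n = ::read(STDIN_FILENO, buf.data(), buf.size())) > 0) {
        total += static_cast<size_t>(n); // process the chunk here
    }
    std::fprintf(stderr, "read %zu bytes\n", total);
    return n < 0 ? 1 : 0;
}
```

With a 64 KB buffer, a pipe that the upstream process keeps full can be drained in one system call per 64 KB rather than sixteen.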
| Comments |
| Comment by Brandon Diamond [ 12/Oct/11 ] |
|
You're definitely right about the internal buffer; the trick seems to be setting the buffer size in such a way that the library actually honors it.

My understanding is that setbuf is not compatible with the C++ stream abstraction (istream); however, you can access the read buffer via rdbuf() and update the underlying byte array with pubsetbuf(). I've tried a few different configurations, but it would seem we're still seeing 4 KB reads from stdin (i.e., piping data into the process) and 8 KB reads from a file.

I must confess that I do not have in-depth knowledge of the C++ stream buffering mechanism. Perhaps I am overlooking something? Please feel free to peruse our code (we're open source) at https://github.com/mongodb/mongo/blob/master/tools/import.cpp – the community would certainly appreciate your help patching the issue. |
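For reference, the rdbuf()/pubsetbuf() approach described in this comment looks roughly like the sketch below. The effect of pubsetbuf() on the standard input stream is implementation-defined, which is consistent with the 4 KB reads still showing up under strace (illustrative code, not the contents of import.cpp):

```cpp
// Illustration only: attempt to give std::cin's streambuf a 64 KB buffer.
// pubsetbuf() must be called before any input is performed; even then, its
// effect on the standard input stream is implementation-defined.
#include <iostream>
#include <string>

int main() {
    static char buf[64 * 1024];

    std::ios::sync_with_stdio(false);              // detach from C stdio first
    std::cin.rdbuf()->pubsetbuf(buf, sizeof(buf)); // request the larger buffer

    std::string line;
    while (std::getline(std::cin, line)) {
        // line-by-line processing, as the import tool does
    }
    return 0;
}
```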
| Comment by Roger Binns [ 11/Oct/11 ] |
|
You are confusing two different things here. In other words, whatever library you are using (C++'s iostream?) has a 4 MB buffer which it is filling via 4 KB read system calls. It is the latter that is causing the performance problems, especially with pipes. That the data is line oriented is immaterial, since you already have to deal with lines as short as one byte or as large as many megabytes.

dump/restore are used in the project, but only for data already in mongodb or being moved between machines. I use mongoimport for the initial data population because the whole toolchain uses JSON. There is effectively a series of programs piped together, starting with the real source data (not in JSON) and doing conversion, normalization, various fix-ups etc., with the final part of the pipeline being mongoimport. This is done a lot during development. The pipes allow concurrency, but performance is ultimately held up by the small reads in mongoimport. mongorestore couldn't be used in that context anyway, as it requires complete files on disk rather than reading data as it is generated.

The fix in this case may be as simple as calling setbuf on an iostream/stdio. |
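The setbuf suggestion above corresponds to setvbuf() on the stdio side; a minimal sketch, assuming the input loop goes through C stdio rather than an istream:

```cpp
// Illustration only: give the C stdio layer a 64 KB fully buffered stdin.
// This only affects reads that go through stdio (fgets/fread); a C++ istream
// with its own streambuf keeps its own, typically 4-8 KB, buffer.
#include <cstdio>

int main() {
    static char iobuf[64 * 1024];
    std::setvbuf(stdin, iobuf, _IOFBF, sizeof(iobuf)); // before any input

    char line[64 * 1024];
    while (std::fgets(line, sizeof(line), stdin) != nullptr) {
        // process one line
    }
    return 0;
}
```

If the tool reads through a C++ istream instead, the stdio buffer is bypassed, which is why the pubsetbuf() route discussed in the 12/Oct/11 comment above comes into play.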
| Comment by Brandon Diamond [ 11/Oct/11 ] |
|
Created |
| Comment by Brandon Diamond [ 11/Oct/11 ] |
|
As above, though the buffer size is large, we process data files line-by-line. Since the lines tend to be just a few bytes long, we're only buffering 4 KB per read. To increase the size of each read, the code will need to be rewritten to load large chunks of data and then process those line-by-line. This change merits a new ticket, since the issue isn't the buffer size so much as the way the tool is written.

Regarding your own project: is there a reason you're not using mongodump/restore? mongoexport/import are not necessarily designed to be fast so much as convenient. You could easily replace these with a custom utility using your favorite language's mongo driver. This might not be a bad idea given that mongoimport doesn't handle GBs of data well at the moment. |
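A rough sketch of the chunk-then-split rewrite described in this comment, assuming a standalone reader; handleLine and kChunk are hypothetical names, and raw read() is used for clarity rather than the stream classes in import.cpp:

```cpp
// Illustration only: read large blocks from stdin and hand complete lines to
// handleLine(), carrying a trailing partial line over to the next block.
#include <unistd.h>

#include <string>
#include <vector>

static void handleLine(const std::string& line) {
    (void)line; // parse one JSON/CSV/TSV record here
}

int main() {
    const size_t kChunk = 4 * 1024 * 1024;  // 4 MB per read() syscall
    std::vector<char> block(kChunk);
    std::string pending;                    // partial line carried across blocks

    ssize_t got;
    while ((got = ::read(STDIN_FILENO, block.data(), block.size())) > 0) {
        pending.append(block.data(), static_cast<size_t>(got));

        size_t start = 0, nl;
        while ((nl = pending.find('\n', start)) != std::string::npos) {
            handleLine(pending.substr(start, nl - start));
            start = nl + 1;
        }
        pending.erase(0, start);            // keep only the unfinished line
    }
    if (!pending.empty())
        handleLine(pending);                // final line without a newline
    return got < 0 ? 1 : 0;
}
```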
| Comment by Roger Binns [ 11/Oct/11 ] |
|
strace is showing what the binary is actually doing. It would be far more efficient to use larger reads (such as 4 MB) rather than lots of little 4 KB ones, just as one 4 KB read is better than lots of even smaller ones. My problem is that 4 KB reads are not efficient. I use gigabytes of data (note the "mongo" in the product name). Each 4 KB read requires a system call, and if using pipes, also two context switches. That makes the import much slower for non-trivial amounts of data. It doesn't matter if there is another internal buffer of a larger size - the read calls that actually get made at the end of the day are the problem. |
| Comment by Brandon Diamond [ 11/Oct/11 ] |
|
If desired, will open a new ticket to improve line-based reading in mongoimport. |
| Comment by Brandon Diamond [ 11/Oct/11 ] |
|
The buffer size (BUF_SIZE) appears to be 4 MB. I think the issue is that we're employing a line-based reading strategy (which makes sense in most CSV/TSV/JSON cases). The library appears to buffer 4 KB to make small reads more efficient; I believe this is what you're seeing with strace. |
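To illustrate the two layers this comment describes, the sketch below uses a large application-level line buffer (a hypothetical stand-in for BUF_SIZE, not the actual import.cpp code); getline() still pulls bytes from the istream's much smaller internal buffer, which is what sets the size of each read() syscall:

```cpp
// Illustration only: two buffers are in play. The application-level line
// buffer (stand-in for BUF_SIZE) can be 4 MB, but getline() copies bytes out
// of the istream's internal streambuf, so each read() syscall is sized by
// that internal buffer (typically 4-8 KB), not by BUF_SIZE.
#include <cstddef>
#include <iostream>

int main() {
    const std::size_t BUF_SIZE = 4 * 1024 * 1024; // hypothetical stand-in
    static char line[BUF_SIZE];

    while (std::cin.getline(line, BUF_SIZE)) {
        // each iteration drains one line from the small internal buffer
    }
    return 0;
}
```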