I am using csv-parse to handle a very large dataset (about 30M rows with about a dozen columns). The processing I previously did was just to stream it, replace some of the values with other values, and write it out. It took about half an hour to process the file.
I have just added a new feature to the pipeline which needs to change the content of some rows (a small minority) based on the content of other rows, which means I have to do it in two passes: read the file once to load the data I need, then read it a second time and write the modified content. I therefore expected it to take about an hour to complete.
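To make the shape of the change concrete, the two passes are laid out roughly like this. The column names (id, group), the rewrite rule, and the use of csv-stringify for the output side are placeholders standing in for the real pipeline:

import { createReadStream, createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { Transform } from 'node:stream';
import { parse } from 'csv-parse';
import { stringify } from 'csv-stringify';

// Pass 1: collect only the small slice of data the second pass needs.
async function collectContext(path: string): Promise<Map<string, string>> {
    const context = new Map<string, string>();
    const parser = createReadStream(path).pipe(parse({ columns: true }));
    for await (const row of parser) {
        if (row.group !== '') context.set(row.id, row.group);
    }
    return context;
}

// Pass 2: re-read the file, rewrite the affected rows, and stream the result out.
async function rewrite(path: string, outPath: string, context: Map<string, string>) {
    await pipeline(
        createReadStream(path),
        parse({ columns: true }),
        new Transform({
            objectMode: true,
            transform(row, _encoding, done) {
                // Only the minority of rows whose id appears in the context map change.
                if (context.has(row.id)) row.group = context.get(row.id);
                done(null, row);
            },
        }),
        stringify({ header: true }),
        createWriteStream(outPath),
    );
}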
As the object model includes numbers which I need to compare, I added a cast function in the options of csv-parse's parse function and used Number() there. As part of the refactoring to eliminate the undefined / '' confusion (the data model considers an empty string a missing value rather than a valid value), I modified the code to add a cast function that changes '' to undefined when streaming the other CSVs as well. The cast function is just below:
function changeEmptyStringToUndefined(x: string) {
    return x === '' ? undefined : x;
}
And the parser is created below:
const parser = inputStream.pipe(parse({ columns: true, cast: changeEmptyStringToUndefined }));
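For the large file, the Number() conversion mentioned earlier lives in the same cast option. Combined, it looks roughly like this, where numericColumns is a placeholder for the real list of numeric column names:

const numericColumns = new Set(['price', 'quantity']); // placeholder names

function castField(value: string, context: { column?: string | number; header?: boolean }) {
    if (context.header) return value;   // leave the header row untouched
    if (value === '') return undefined; // the data model treats '' as missing
    return typeof context.column === 'string' && numericColumns.has(context.column)
        ? Number(value)
        : value;
}

// used the same way: inputStream.pipe(parse({ columns: true, cast: castField }))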
Just by specifying the cast function, the processing time of my other CSVs doubled, even without any other changes!
What’s the reason that using a cast function makes processing the CSV so much slower? In addition, if I wanted to change the program to parse the large file only once, how much memory would I need to store the complete, parsed content of a 3 GB CSV file with 30M rows and about 10 columns of roughly 10 characters each?
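In case it helps frame the second question, this is roughly how I would measure it empirically on a sample before committing to the single-pass design (large.csv stands in for the real file, and the cast is the one defined above):

import { createReadStream } from 'node:fs';
import { parse } from 'csv-parse';

// Load a sample of parsed rows and report roughly how much heap they occupy.
// heapUsed covers the whole heap, so this is only a rough upper bound per row.
async function sampleMemory(path: string, sampleSize: number) {
    const rows: Record<string, string | undefined>[] = [];
    const parser = createReadStream(path).pipe(
        parse({ columns: true, cast: changeEmptyStringToUndefined }), // cast defined earlier
    );
    for await (const row of parser) {
        rows.push(row);
        if (rows.length >= sampleSize) break;
    }
    const heapMiB = process.memoryUsage().heapUsed / (1024 * 1024);
    console.log(`${rows.length} rows -> about ${heapMiB.toFixed(0)} MiB of heap`);
}

sampleMemory('large.csv', 1_000_000);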