Writing Data to a CSV File

Turner Kraus
2 min read · Oct 21, 2020

When going through Hack Reactor, I needed to generate 10 million unique records and load them into a database.

My first, naive approach was to generate data points, push them into an array, and use a method of the DBMS (database management system) to insert the entire array into the database. It's simple, but it isn't sustainable: the problem is the computer's (and the program's) memory.
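A minimal sketch of that first approach. The record shape is made up for illustration, and db.bulkInsert stands in for whatever bulk-insert method your DBMS driver provides:

```js
// Naive approach: build every record in memory, then insert them all at once.
// The record shape is illustrative, and db.bulkInsert is a hypothetical
// stand-in for the DBMS library's bulk-insert method.
const records = [];
for (let i = 0; i < 10_000_000; i++) {
  records.push({ id: i, name: `user_${i}` }); // every record sits in RAM at once
}
await db.bulkInsert(records); // one enormous insert call
```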

When data is pushed into an array, it is saved in the computer’s working memory (Random Access Memory, or RAM). This is a finite resource for the computer, and it can quickly be overloaded if you are interested in storing millions of pieces of data in it. The better approach is to stream the data into a CSV file and handle only small pieces of data at a time.

When you stream data (using something like Node's fs.createWriteStream), the data is passed incrementally into the file you've created. It's kind of like the difference between a bucket and a hose. A bucket holds a lot of water, and when it's time to get rid of the water, you have to dump it all at once; there's no elegant way to slowly transfer water with a bucket. A hose, on the other hand, works incrementally. It passes water from point A to point B continuously (in a stream, one might say), and no dumping is required.
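In Node, opening the hose looks something like this (a minimal sketch; the file name and row format are made up for illustration):

```js
const fs = require('fs');

// Open a write stream to the target CSV file
const stream = fs.createWriteStream('records.csv');

stream.write('id,name\n');  // header row
stream.write('0,user_0\n'); // each call pushes one small chunk into the file
stream.write('1,user_1\n');

stream.end(); // close the hose when you're done
```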

The computer works the same way. Pushing all the information into an array will eventually overload your bucket (er… computer) and you'll have a mess on your hands. If you use a hose instead, you have more control over how much water (or, in the case of computers, information) gets passed along. You could even take a short break and bend the hose if things are moving too quickly! And that's precisely what I ended up doing in my data generation script.

Streams also tell you when they're overloaded. When a writable stream's internal buffer fills up, stream.write() returns false, and once the buffer has flushed, the stream emits a 'drain' event to signal that it's safe to write again. I took advantage of this and instructed the program to pause whenever write() returned false and to resume only after 'drain' fired. That way, the script never produced data faster than the stream could pass along.
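A sketch of that pause-and-resume loop, following the standard Node backpressure pattern (the row format and function name are illustrative):

```js
const fs = require('fs');

const stream = fs.createWriteStream('records.csv');
stream.write('id,name\n'); // header row

function writeRows(total) {
  let i = 0;
  function writeChunk() {
    let ok = true;
    while (i < total && ok) {
      // write() returns false once the internal buffer is full
      ok = stream.write(`${i},user_${i}\n`);
      i++;
    }
    if (i < total) {
      // "bend the hose": wait for 'drain' before writing more
      stream.once('drain', writeChunk);
    } else {
      stream.end(); // all rows written; close the file
    }
  }
  writeChunk();
}

writeRows(10_000_000);
```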

Speaking of the receiver, I used a CSV (comma-separated values) file to store my data. It's a bare-bones type of file that is pretty self-explanatory: basically a table, but with commas instead of vertical lines and line breaks instead of horizontal lines. Because the format is so simple and flexible, it's often used to bulk-load large amounts of data into a database, which made it perfect for my use case.
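For example, the first few rows of a two-column file might look like this (the column names here are just for illustration):

```
id,name
0,user_0
1,user_1
2,user_2
```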

All in all, my data generation script in its simplest form took about two and a half minutes to generate 10 million records and save them to a CSV. Add roughly 10 more minutes to copy the data over to my databases, and I had 10 million records available to be requested in under 15 minutes. Pretty cool!
