Reading a CSV file with a million rows in parallel in C#
I'm not sure it's a good idea. Depending on your hardware, the CPU probably won't be the bottleneck; the disk read speed will.
Another point: if your storage is a magnetic hard disk, the read speed depends strongly on how the file is physically laid out on the disk. If the file is not fragmented (i.e. all its chunks are stored sequentially), you'll get better performance by reading it sequentially, line by line.
One solution would be to read the whole file in one go using File.ReadAllLines (if you have enough memory; for 1 million rows it should be OK), store all the lines in a string array, then process them (i.e. parse using string.Split, etc.) in your Parallel.ForEach, if the row order is not important.
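As a rough sketch of that approach (not a definitive implementation): the file is read sequentially with File.ReadAllLines, and only the parsing runs in parallel. It uses Parallel.For over indices instead of Parallel.ForEach, so each iteration writes to its own array slot and no lock is needed; as a side effect the row order is kept. The class name CsvRowSums, the helper names and the use of the invariant culture for parsing are assumptions made for illustration; the file name and separators are taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper class, named for illustration only.
public static class CsvRowSums
{
    // Same separators as in the question's code.
    private static readonly char[] Separators = { ',', ' ' };

    // Parses one line into doubles, skipping tokens that are not numbers.
    // Assumption: numbers are written in the invariant culture ("1.5", not "1,5").
    public static double[] ParseLine(string line)
    {
        var values = new List<double>();
        foreach (var token in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double value;
            if (double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
            {
                values.Add(value);
            }
        }
        return values.ToArray();
    }

    // Parses all lines in parallel. result[i] holds the values of lines[i];
    // every iteration writes to a distinct slot, so no locking is required.
    public static double[][] ParseAll(string[] lines)
    {
        var result = new double[lines.Length][];
        Parallel.For(0, lines.Length, i =>
        {
            result[i] = ParseLine(lines[i]);
        });
        return result;
    }

    public static void Main(string[] args)
    {
        // Sequential read of the whole file, then parallel parsing.
        string path = args.Length > 0 ? args[0] : "BigData.csv";
        string[] lines = File.ReadAllLines(path);

        double[][] rows = ParseAll(lines);
        double[] rowTotals = rows.Select(r => r.Sum()).ToArray();
        int totalElements = rows.Sum(r => r.Length);

        Console.WriteLine("Rows: " + rowTotals.Length + ", elements: " + totalElements);
    }
}
```

Whether this beats a plain sequential loop depends on how expensive the per-line work is; for simple parsing the disk read is likely to dominate either way, so it is worth measuring both.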
Next Door Engineer
Amateur Engineer and Data Analytics Professional.
Updated on June 21, 2022
Comments
-
Next Door Engineer almost 2 years
I have a CSV file with over 1 million rows of data. I am planning to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient method?
namespace ParallelData
{
    public partial class ParallelData : Form
    {
        public ParallelData()
        {
            InitializeComponent();
        }

        private static readonly char[] Separators = { ',', ' ' };

        private static void ProcessFile()
        {
            var lines = File.ReadLines("BigData.csv");
            var numbers = ProcessRawNumbers(lines);
            var rowTotal = new List<double>();
            var totalElements = 0;
            foreach (var values in numbers)
            {
                var sumOfRow = values.Sum();
                rowTotal.Add(sumOfRow);
                totalElements += values.Count;
            }
            MessageBox.Show(totalElements.ToString());
        }

        private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
        {
            var numbers = new List<List<double>>();
            /*System.Threading.Tasks.*/
            Parallel.ForEach(lines, line =>
            {
                lock (numbers)
                {
                    numbers.Add(ProcessLine(line));
                }
            });
            return numbers;
        }

        private static List<double> ProcessLine(string line)
        {
            var list = new List<double>();
            foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double i;
                if (Double.TryParse(s, out i))
                {
                    list.Add(i);
                }
            }
            return list;
        }

        private void button2_Click(object sender, EventArgs e)
        {
            ProcessFile();
        }
    }
}
-
Magnus over 11 years
@Giedrius No, ReadLines returns an enumerator, not a list.
-
-
Next Door Engineer over 11 years
Can you show an implementation of the above problem using BlockingCollection? I am a beginner and not an expert.
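For reference, here is a minimal sketch of what a BlockingCollection-based version could look like. It is one possible shape under stated assumptions, not the approach anyone in the thread actually posted. A single producer reads lines and puts them into a bounded queue while several consumer tasks parse them. The class name CsvPipeline, the method names, the queue capacity of 10000 and the invariant-culture parsing are all assumptions chosen for illustration. Row order is not preserved.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper class, named for illustration only.
public static class CsvPipeline
{
    private static readonly char[] Separators = { ',', ' ' };

    // Parses one line into doubles, skipping tokens that are not numbers.
    // Assumption: numbers are written in the invariant culture.
    public static List<double> ParseLine(string line)
    {
        var list = new List<double>();
        foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double value;
            if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
            {
                list.Add(value);
            }
        }
        return list;
    }

    // One producer (the calling thread) feeds lines into a bounded queue;
    // consumerCount tasks take lines out and parse them.
    public static List<List<double>> Process(IEnumerable<string> lines, int consumerCount)
    {
        if (consumerCount < 1) throw new ArgumentOutOfRangeException("consumerCount");

        var parsed = new ConcurrentBag<List<double>>();

        // Bounded capacity (arbitrary choice) keeps the reader from running far ahead.
        using (var queue = new BlockingCollection<string>(10000))
        {
            Task[] consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Run(() =>
                {
                    // Blocks until a line is available; ends after CompleteAdding
                    // has been called and the queue has been drained.
                    foreach (var line in queue.GetConsumingEnumerable())
                    {
                        parsed.Add(ParseLine(line));
                    }
                }))
                .ToArray();

            try
            {
                foreach (var line in lines)
                {
                    queue.Add(line); // blocks while the queue is full
                }
            }
            finally
            {
                queue.CompleteAdding(); // lets the consumers finish
            }

            Task.WaitAll(consumers);
        }

        return parsed.ToList();
    }

    public static void Main(string[] args)
    {
        string path = args.Length > 0 ? args[0] : "BigData.csv";
        // File.ReadLines streams the file lazily, so reading overlaps with parsing.
        var rows = Process(File.ReadLines(path), Environment.ProcessorCount);
        Console.WriteLine("Total elements: " + rows.Sum(r => r.Count));
    }
}
```

Compared with locking a shared list inside Parallel.ForEach, the parsing here happens outside any lock, so the consumers can actually run concurrently. For cheap per-line work the queue overhead may still outweigh the gain.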