Reading a CSV file with a million rows in parallel in C#

I'm not sure it's a good idea. Depending on your hardware, the bottleneck is likely to be the disk read speed rather than the CPU, in which case parsing in parallel gains you little.

Another point: if your storage hardware is a magnetic hard disk, then the disk read speed depends strongly on how the file is physically stored on the disk. If the file is not fragmented (i.e. all of its chunks are stored sequentially on the disk), you'll get better performance by reading it sequentially, line by line.

One solution would be to read the whole file in one go using File.ReadAllLines (if you have enough memory; for 1 million rows it should be OK), store all lines in a string array, and then process them (i.e. parse using string.Split, etc.) in your Parallel.ForEach, if the order of the rows is not important.
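A minimal sketch of that approach follows. It is only an illustration under a few assumptions: the class and method names (`CsvRowSums`, `SumLine`, `SumLines`, `SumRows`) are hypothetical, numbers are parsed with the invariant culture, and `Parallel.For` over the array index is used in place of `Parallel.ForEach` so that each iteration writes to its own slot. That choice avoids a lock and happens to keep the results in file order.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Threading.Tasks;

public static class CsvRowSums
{
    private static readonly char[] Separators = { ',', ' ' };

    // Sums the numeric fields of one line; fields that do not parse are skipped.
    public static double SumLine(string line)
    {
        double sum = 0;
        foreach (var field in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: values use '.' as the decimal separator.
            if (double.TryParse(field, NumberStyles.Float, CultureInfo.InvariantCulture, out var value))
            {
                sum += value;
            }
        }
        return sum;
    }

    // Parses the lines in parallel. Every iteration writes only to
    // rowTotals[i], so no synchronization is required.
    public static double[] SumLines(string[] lines)
    {
        var rowTotals = new double[lines.Length];
        Parallel.For(0, lines.Length, i =>
        {
            rowTotals[i] = SumLine(lines[i]);
        });
        return rowTotals;
    }

    // Reads the whole file into memory first (sequential disk access),
    // then hands the lines to the parallel parser.
    public static double[] SumRows(string path)
    {
        return SumLines(File.ReadAllLines(path));
    }
}
```

Calling `CsvRowSums.SumRows("BigData.csv")` would return one total per row. Whether this beats a plain sequential loop depends on how expensive the per-line work is, so it is worth measuring both.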

Author by

Next Door Engineer

Amateur Engineer and Data Analytics Professional.

Updated on June 21, 2022

Comments

  • Next Door Engineer
    Next Door Engineer almost 2 years

    I have a CSV file with over 1 million rows of data. I am planning to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient method?

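    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;
    using System.Windows.Forms;

    namespace ParallelData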
    {
    public partial class ParallelData : Form
    {
        public ParallelData()
        {
            InitializeComponent();
        }
    
        private static readonly char[] Separators = { ',', ' ' };
    
        private static void ProcessFile()
        {
            var lines = File.ReadLines("BigData.csv");
            var numbers = ProcessRawNumbers(lines);
    
            var rowTotal = new List<double>();
            var totalElements = 0;
    
            foreach (var values in numbers)
            {
                var sumOfRow = values.Sum();
                rowTotal.Add(sumOfRow);
                totalElements += values.Count;
            }
            MessageBox.Show(totalElements.ToString());
        }
    
        private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
        {
            var numbers = new List<List<double>>();
            /*System.Threading.Tasks.*/
            Parallel.ForEach(lines, line =>
            {
                lock (numbers)
                {
                    numbers.Add(ProcessLine(line));
                }
            });
            return numbers;
        }
    
        private static List<double> ProcessLine(string line)
        {
            var list = new List<double>();
            foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double i;
                if (Double.TryParse(s, out i))
                {
                    list.Add(i);
                }
            }
            return list;
        }
    
        private void button2_Click(object sender, EventArgs e)
        {
            ProcessFile();
        }
    }
    }
    
    • Magnus
      Magnus over 11 years
      @Giedrius No, ReadLines returns a lazily evaluated IEnumerable<string>, not a list.
  • Next Door Engineer
    Next Door Engineer over 11 years
    Can you show an implementation of the above problem using BlockingCollection? I am a beginner and not an expert.
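
For reference, a producer/consumer version built on BlockingCollection could look roughly like the sketch below. It is an illustration, not code from the thread: the names `CsvPipeline`, `CountNumbers` and `CountInLine`, the bounded capacity of 10000, the default of four consumers and the invariant-culture parsing are all assumptions. One task feeds lines into the collection while several tasks take lines out and parse them.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Threading.Tasks;

public static class CsvPipeline
{
    private static readonly char[] Separators = { ',', ' ' };

    // Counts the fields of one line that parse as numbers.
    public static int CountInLine(string line)
    {
        return line.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
                   .Count(f => double.TryParse(f, NumberStyles.Float,
                                               CultureInfo.InvariantCulture, out _));
    }

    // One producer enqueues lines; consumerCount consumers parse them.
    public static long CountNumbers(IEnumerable<string> lines, int consumerCount = 4)
    {
        // Bounded capacity keeps the reader from running far ahead of the parsers.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var line in lines) queue.Add(line);
                }
                finally
                {
                    // Signals the consumers that no more items will arrive.
                    queue.CompleteAdding();
                }
            });

            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Run(() =>
                {
                    long local = 0; // per-task counter, so no lock is needed
                    foreach (var line in queue.GetConsumingEnumerable())
                    {
                        local += CountInLine(line);
                    }
                    return local;
                }))
                .ToArray();

            Task.WaitAll(consumers);
            producer.Wait(); // surfaces any exception thrown while reading
            return consumers.Sum(t => t.Result);
        }
    }
}
```

A call such as `CsvPipeline.CountNumbers(File.ReadLines("BigData.csv"))` (with `using System.IO;`) would stream the file through the pipeline without loading it all into memory. Each consumer keeps its own running count and the counts are combined at the end, which avoids taking a lock for every line.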