Best way to read csv file in C# to improve time efficiency

Solution 1

You can use the built-in OleDb provider for that:

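// Requires: using System.Collections.Generic; using System.Data;
//           using System.Data.OleDb; using System.IO; using System.Linq;
public void ImportCsvFile(string filename)
{
    FileInfo file = new FileInfo(filename);

    // The data source is the folder; the CSV file name acts as the table name.
    // Note: the Jet 4.0 provider is only available to 32-bit processes.
    string connectionString =
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"" + file.DirectoryName + "\";" +
        "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';";

    using (OleDbConnection con = new OleDbConnection(connectionString))
    using (OleDbCommand cmd = new OleDbCommand(
               string.Format("SELECT * FROM [{0}]", file.Name), con))
    {
        con.Open();

        // Using a DataTable to process the data
        using (OleDbDataAdapter adp = new OleDbDataAdapter(cmd))
        {
            DataTable tbl = new DataTable("MyTable");
            adp.Fill(tbl);

            //foreach (DataRow row in tbl.Rows)

            // Or directly make a list (AsEnumerable needs System.Data.DataSetExtensions)
            List<DataRow> list = tbl.AsEnumerable().ToList();
        }
    }
}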

See this and this for further reference.

Solution 2

Recently I faced the problem of parsing large CSV files as fast as possible for the same purpose: data aggregation and metrics calculation (in my case the final goal was pivot table generation). I tested the most popular CSV readers and found that they are simply not designed for parsing CSV files with millions of rows or more. JoshClose's CsvHelper is fast, but in the end I was able to process CSV as a stream 2x-4x faster.

My approach is based on three principles:

  • Avoid creating strings where possible, since this wastes memory and CPU (it increases the GC load). Instead, the parser result can be represented as a set of 'field value' descriptors that hold only the start and end position in the buffer plus some metadata (a quoted-value flag, the number of double quotes inside the value); the string value is constructed only when needed.
  • Use a circular char[] buffer to read a CSV line, to avoid excessive data copying.
  • No abstractions and minimal method calls: this enables effective JIT optimizations (for example, eliding array bounds checks). No LINQ and no iterators (foreach), since a plain for loop is more efficient.
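The first and third points can be illustrated with a minimal sketch. This is not the code of the library mentioned below; the names `FieldSpan`, `CsvLineSketch`, `Scan` and `GetString` are hypothetical, and the sketch assumes a single complete record is already in the buffer (so it does not show the circular buffer or quoted fields spanning several lines).

```csharp
using System;

// Hypothetical descriptor: records where a field sits in the buffer, not its text.
public struct FieldSpan
{
    public int Start;         // index of the first character of the value
    public int Length;        // character count, excluding enclosing quotes
    public bool Quoted;       // true if the value was enclosed in double quotes
    public int EscapedQuotes; // number of "" pairs inside the value
}

public static class CsvLineSketch
{
    // Scans one CSV record in buffer[0..length) and fills 'fields' with descriptors.
    // Returns the number of fields found (capped at fields.Length).
    // No strings are allocated during the scan.
    public static int Scan(char[] buffer, int length, FieldSpan[] fields, char delimiter = ',')
    {
        int count = 0;
        int pos = 0;
        while (pos <= length && count < fields.Length)
        {
            FieldSpan f = new FieldSpan();
            if (pos < length && buffer[pos] == '"')
            {
                f.Quoted = true;
                pos++;                       // step over the opening quote
                f.Start = pos;
                for (; pos < length; pos++)
                {
                    if (buffer[pos] != '"') continue;
                    if (pos + 1 < length && buffer[pos + 1] == '"')
                    {
                        f.EscapedQuotes++;   // "" is an escaped quote
                        pos++;               // skip the second quote of the pair
                    }
                    else
                    {
                        break;               // closing quote
                    }
                }
                f.Length = pos - f.Start;
                // Skip the closing quote and anything else up to the delimiter.
                while (pos < length && buffer[pos] != delimiter) pos++;
            }
            else
            {
                f.Start = pos;
                while (pos < length && buffer[pos] != delimiter) pos++;
                f.Length = pos - f.Start;
            }
            fields[count++] = f;
            pos++;                           // step over the delimiter (or past the end)
        }
        return count;
    }

    // Builds the string for a field only when the caller actually needs it.
    public static string GetString(char[] buffer, FieldSpan f)
    {
        string raw = new string(buffer, f.Start, f.Length);
        return f.EscapedQuotes > 0 ? raw.Replace("\"\"", "\"") : raw;
    }
}
```

A caller would reuse one `FieldSpan[]` for every record and call `GetString` only for the columns it needs (3 of 17 in the benchmark below), so the remaining fields never become strings.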

Real-life numbers (building a pivot table from a 200 MB CSV file with 17 columns, of which only 3 are used to build the crosstab):

  • my custom CSV reader: ~1.9s
  • CsvHelper: ~6.1s

--- update ---

I've published a library that works as described above on GitHub: https://github.com/nreco/csv

Nuget package: https://www.nuget.org/packages/NReco.Csv/

Author: Next Door Engineer

Updated on June 28, 2022

Comments

  • Next Door Engineer, almost 2 years ago

    I have the following code to read in a large file, say with over a million rows. I am using Parallel and Linq approaches. Is there a better way to do it? If yes, then how?

            private static void ReadFile()
            {
                float floatTester = 0;
                List<float[]> result = File.ReadLines(@"largedata.csv")
                    .Where(l => !string.IsNullOrWhiteSpace(l))
                    .Select(l => new { Line = l, Fields = l.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries) })
                    .Select(x => x.Fields
                                  .Where(f => Single.TryParse(f, out floatTester))
                                  .Select(f => floatTester).ToArray())
                    .ToList();
    
                // now get your totals
                int numberOfLinesWithData = result.Count;
                int numberOfAllFloats = result.Sum(fa => fa.Length);
                MessageBox.Show(numberOfAllFloats.ToString());
            }
    
            private static readonly char[] Separators = { ',', ' ' };
    
            private static void ProcessFile()
            {
                var lines = File.ReadAllLines("largedata.csv");
                var numbers = ProcessRawNumbers(lines);
    
                var rowTotal = new List<double>();
                var totalElements = 0;
    
                foreach (var values in numbers)
                {
                    var sumOfRow = values.Sum();
                    rowTotal.Add(sumOfRow);
                    totalElements += values.Count;
                }
                MessageBox.Show(totalElements.ToString());
            }
    
            private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
            {
                var numbers = new List<List<double>>();
                /*System.Threading.Tasks.*/
                Parallel.ForEach(lines, line =>
                {
                    lock (numbers)
                    {
                        numbers.Add(ProcessLine(line));
                    }
                });
                return numbers;
            }
    
            private static List<double> ProcessLine(string line)
            {
                var list = new List<double>();
                foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    double i;
                    if (Double.TryParse(s, out i))
                    {
                        list.Add(i);
                    }
                }
                return list;
            }
    
            private void button1_Click(object sender, EventArgs e)
            {
                Stopwatch stopWatchParallel = new Stopwatch();
                stopWatchParallel.Start();
                ProcessFile();
                stopWatchParallel.Stop();
                // Get the elapsed time as a TimeSpan value.
                TimeSpan ts = stopWatchParallel.Elapsed;
    
                // Format and display the TimeSpan value.
                string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                    ts.Hours, ts.Minutes, ts.Seconds,
                    ts.Milliseconds / 10);
                MessageBox.Show(elapsedTime);
    
                Stopwatch stopWatchLinQ = new Stopwatch();
                stopWatchLinQ.Start();
                ReadFile();
                stopWatchLinQ.Stop();
                // Get the elapsed time as a TimeSpan value.
                TimeSpan ts2 = stopWatchLinQ.Elapsed;
    
                // Format and display the TimeSpan value.
                string elapsedTimeLinQ = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                    ts2.Hours, ts2.Minutes, ts2.Seconds,
                    ts2.Milliseconds / 10);
                MessageBox.Show(elapsedTimeLinQ);
            }
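
For comparison with the two methods above, the same totals can be computed in a single pass without building intermediate lists. The following is only a sketch under stated assumptions: `StreamingCsvTotals` and `Count` are hypothetical names, the input is taken as a `TextReader` (for a file, `new StreamReader("largedata.csv")`), fields are separated by commas only, and numbers are parsed with the invariant culture rather than the current one.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class StreamingCsvTotals
{
    // Reads comma-separated rows one line at a time and reports
    // the number of non-blank lines, the number of parseable values, and their sum.
    public static void Count(TextReader reader,
                             out int linesWithData, out long values, out double sum)
    {
        linesWithData = 0;
        values = 0;
        sum = 0;

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            linesWithData++;

            int start = 0;
            // Plain for loop over the characters; i == line.Length closes the last field.
            for (int i = 0; i <= line.Length; i++)
            {
                if (i != line.Length && line[i] != ',') continue;

                if (i > start) // skip empty fields, like RemoveEmptyEntries
                {
                    double v;
                    if (double.TryParse(line.Substring(start, i - start),
                                        NumberStyles.Float,
                                        CultureInfo.InvariantCulture, out v))
                    {
                        values++;
                        sum += v;
                    }
                }
                start = i + 1;
            }
        }
    }
}
```

Only one line is held in memory at a time, and no locking is needed because nothing is shared between threads. Whether this is actually faster on a given file should be measured with the same Stopwatch harness.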