Best way to read csv file in C# to improve time efficiency
Solution 1
You can use the built-in OleDb provider for that:
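// Requires: using System.Collections.Generic; using System.Data;
//           using System.Data.OleDb; using System.IO; using System.Linq;
// AsEnumerable() also needs a reference to System.Data.DataSetExtensions.
public void ImportCsvFile(string filename)
{
    FileInfo file = new FileInfo(filename);

    // The text driver treats the directory as the "database" and each file as a table.
    string connectionString =
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"" + file.DirectoryName + "\";" +
        "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';";

    using (OleDbConnection con = new OleDbConnection(connectionString))
    using (OleDbCommand cmd = new OleDbCommand(
        string.Format("SELECT * FROM [{0}]", file.Name), con))
    {
        con.Open();

        // Using a DataTable to process the data
        using (OleDbDataAdapter adp = new OleDbDataAdapter(cmd))
        {
            DataTable tbl = new DataTable("MyTable");
            adp.Fill(tbl);

            //foreach (DataRow row in tbl.Rows)
            //Or directly make a list
            List<DataRow> list = tbl.AsEnumerable().ToList();
        }
    }
}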
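Once the DataTable is filled, the commented-out foreach over tbl.Rows is enough for simple aggregation. The following is only a sketch: SumColumn is a hypothetical helper, not part of OleDb or System.Data, and it assumes the cells of the chosen column can be converted to text and parsed as invariant-culture numbers.

```csharp
using System;
using System.Data;
using System.Globalization;

public static class DataTableSketch
{
    // Hypothetical helper: sums one column of an already filled DataTable,
    // skipping cells that cannot be parsed as a number (including DBNull).
    public static double SumColumn(DataTable table, string columnName)
    {
        double total = 0;
        foreach (DataRow row in table.Rows)
        {
            // Convert the cell to text first so that both string-typed and
            // numeric-typed columns are handled the same way.
            string text = Convert.ToString(row[columnName], CultureInfo.InvariantCulture);
            double value;
            if (double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
            {
                total += value;
            }
        }
        return total;
    }
}
```

Inside ImportCsvFile this would be called right after adp.Fill(tbl), for example as DataTableSketch.SumColumn(tbl, "Amount"), where "Amount" stands in for whatever header the CSV file actually has.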
Solution 2
Recently I faced the problem of parsing large CSV files as fast as possible for the same purpose: data aggregation and metrics calculation (in my case, the final goal was pivot table generation). I tested the most popular CSV readers but found that they are just not designed for parsing CSV files with a million rows or more; JoshClose's CsvHelper is fast, but in the end I was able to process CSV as a stream 2x-4x faster!
My approach is based on 3 principles:
- Avoid creating strings whenever possible, as this wastes memory and CPU (it increases GC pressure). Instead, the parser result can be represented as a set of 'field value' descriptors that hold only the start and end positions in the buffer plus some metadata (a quoted-value flag, the number of double quotes inside the value); the string value is constructed only when needed.
- Use a circular char[] buffer to read each CSV line, to avoid excessive data copying.
- No abstractions and minimal method calls; this enables effective JIT optimizations (say, eliminating array bounds checks). No LINQ and no iterators (foreach), as for is much more efficient.
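To make the 'field value' descriptor idea concrete, here is a minimal sketch. It is not the code of the published library: FieldSpan and CsvSketch are hypothetical names, the buffer is a plain (not circular) char[], and quoted values are deliberately not handled.

```csharp
using System;

// Hypothetical descriptor: records where a field sits in the line buffer
// instead of holding a copy of its text.
public struct FieldSpan
{
    public readonly int Start;   // index of the first character of the field
    public readonly int Length;  // number of characters in the field

    public FieldSpan(int start, int length)
    {
        Start = start;
        Length = length;
    }
}

public static class CsvSketch
{
    // Scans one line held in `line[0..lineLength)` and stores the position of
    // each comma-separated field in `fields`. No string is allocated here.
    // Returns the number of fields found (capped at fields.Length).
    public static int SplitFields(char[] line, int lineLength, FieldSpan[] fields)
    {
        int count = 0;
        int start = 0;
        // Plain for loop, no LINQ and no iterators; i == lineLength closes the last field.
        for (int i = 0; i <= lineLength; i++)
        {
            if (i == lineLength || line[i] == ',')
            {
                if (count == fields.Length)
                {
                    break; // more fields than the caller asked for: ignore the rest
                }
                fields[count++] = new FieldSpan(start, i - start);
                start = i + 1;
            }
        }
        return count;
    }

    // Builds a string only for a field the caller actually needs.
    public static string GetValue(char[] line, FieldSpan field)
    {
        return new string(line, field.Start, field.Length);
    }
}
```

With a file of 17 columns where only 3 are used, SplitFields would run for every line while GetValue would be called for just those 3 fields, so the other 14 never become strings. The FieldSpan[] array can be allocated once and reused for every line.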
Real-life usage numbers (a pivot table built from a 200 MB CSV file with 17 columns, of which only 3 are used to build the crosstab):
- my custom CSV reader: ~1.9s
- CsvHelper: ~6.1s
--- update ---
I've published my library, which works as described above, on GitHub: https://github.com/nreco/csv
NuGet package: https://www.nuget.org/packages/NReco.Csv/
Updated on June 28, 2022
Comments
- Next Door Engineer, almost 2 years ago:
I have the following code to read in a large file, say with over a million rows. I am using Parallel and LINQ approaches. Is there a better way to do it? If yes, then how?
    private static void ReadFile()
    {
        float floatTester = 0;
        List<float[]> result = File.ReadLines(@"largedata.csv")
            .Where(l => !string.IsNullOrWhiteSpace(l))
            .Select(l => new { Line = l, Fields = l.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries) })
            .Select(x => x.Fields
                .Where(f => Single.TryParse(f, out floatTester))
                .Select(f => floatTester).ToArray())
            .ToList();

        // now get your totals
        int numberOfLinesWithData = result.Count;
        int numberOfAllFloats = result.Sum(fa => fa.Length);
        MessageBox.Show(numberOfAllFloats.ToString());
    }

    private static readonly char[] Separators = { ',', ' ' };

    private static void ProcessFile()
    {
        var lines = File.ReadAllLines("largedata.csv");
        var numbers = ProcessRawNumbers(lines);
        var rowTotal = new List<double>();
        var totalElements = 0;
        foreach (var values in numbers)
        {
            var sumOfRow = values.Sum();
            rowTotal.Add(sumOfRow);
            totalElements += values.Count;
        }
        MessageBox.Show(totalElements.ToString());
    }

    private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
    {
        var numbers = new List<List<double>>();
        /*System.Threading.Tasks.*/
        Parallel.ForEach(lines, line =>
        {
            lock (numbers)
            {
                numbers.Add(ProcessLine(line));
            }
        });
        return numbers;
    }

    private static List<double> ProcessLine(string line)
    {
        var list = new List<double>();
        foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double i;
            if (Double.TryParse(s, out i))
            {
                list.Add(i);
            }
        }
        return list;
    }

    private void button1_Click(object sender, EventArgs e)
    {
        Stopwatch stopWatchParallel = new Stopwatch();
        stopWatchParallel.Start();
        ProcessFile();
        stopWatchParallel.Stop();
        // Get the elapsed time as a TimeSpan value.
        TimeSpan ts = stopWatchParallel.Elapsed;
        // Format and display the TimeSpan value.
        string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
            ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
        MessageBox.Show(elapsedTime);

        Stopwatch stopWatchLinQ = new Stopwatch();
        stopWatchLinQ.Start();
        ReadFile();
        stopWatchLinQ.Stop();
        // Get the elapsed time as a TimeSpan value.
        TimeSpan ts2 = stopWatchLinQ.Elapsed;
        // Format and display the TimeSpan value (all parts taken from ts2, not ts).
        string elapsedTimeLinQ = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
            ts2.Hours, ts2.Minutes, ts2.Seconds, ts2.Milliseconds / 10);
        MessageBox.Show(elapsedTimeLinQ);
    }