Excel wrongly converts ranges into dates, how to avoid it?

17,814

Solution 1

My question is in fact a duplicate of at least:

1) Stop Excel from automatically converting certain text values to dates

2) Excel: Default to TEXT rather than GENERAL when opening a .csv file

The possible solutions for Excel are to 1) either writing the fields with special double quotes like "May 16, 2011" as "=""May 16, 2011""" or 2) importing the csv/tsv file with the external data wizard and then selecting manually which columns you want to read as TEXT and not GENERAL (which could convert fields to dates)

As for my use case, I was only using Excel to remove some columns. None of the solutions was appealing to me because I wouldn't like to rewrite the tsv files with special quotes and because I had hundreds of columns and I didn't want to select each manually to be read as TEXT.

Therefore I wrote a scala script to filter tsv files by column names:

package com.jmcejuela.ml

import java.io.InputStream
import java.io.Writer

import scala.io.Codec
import scala.io.Source

import Table._

/**
 * Class to represent tables with a fixed size of columns. All rows have the same columns.
 */
class Table(val rows: Seq[Row]) {
  lazy val numDiffColumns = rows.foldLeft(Set[Int]())((set, row) => set + row.size)

  def toTSV(out: Writer) {
    if (rows.isEmpty) out.write(TableEmpty.toString)
    else {
      out.write(writeLineTSV(rows.head.map(_.name))) //header
      rows.foreach(r => out.write(writeLineTSV(r.map(_.value))))
      out.close
    }
  }

  /**
   * Get a Table with only the given columns.
   */
  def filterColumnsByName(columnNames: Set[String]): Table = {
    val existingNames = rows.head.map(_.name).toSet
    assert(columnNames.forall(n => existingNames.contains(n)), "You want to include column names that do not exist")
    new Table(rows.map { row => row.filter(col => columnNames.contains(col.name)) })
  }

}

object TableEmpty extends Table(Seq.empty) {
  override def toString = "Table(Empty)"
}

object Table {
  def apply(rows: Row*) = new Table(rows)

  type Row = Array[Column]

  /**
   * Column representation. Note that each column has a name and a value. Since the class Table
   * is a sequence of rows which are a size-fixed array of columns, the name field is redundant
   * for Table. However, this column representation could be used in the future to support
   * schemata-less tables.
   */
  case class Column(name: String, value: String)

  private def parseLineTSV(line: String) = line.split("\t")
  private def writeLineTSV(line: Seq[String]) = line.mkString("", "\t", "\n")

  /**
   * It is assumed that the first row gives the names to the columns
   */
  def fromTSV(in: InputStream)(implicit encoding: Codec = Codec.UTF8): Table = {
    val linesIt = Source.fromInputStream(in).getLines
    if (linesIt.isEmpty) TableEmpty
    else {
      val columnNames = parseLineTSV(linesIt.next)
      val padding = {
        //add padding of empty columns-fields to lines that do not include last fields because they are empty
        def infinite[A](x: A): Stream[A] = x #:: infinite(x)
        infinite("")
      }
      val rows = linesIt.map { line =>
        ((0 until columnNames.size).zip(parseLineTSV(line) ++: padding).map { case (index, field) => Column(columnNames(index), field) }).toArray
      }.toStream
      new Table(rows)
    }
  }
}

Solution 2

I think you best use the import facility in excel but you may have to manually change the file extension to a csv.

When importing be sure to select text for all the columns with these values.

Share:
17,814
juanmirocks
Author by

juanmirocks

Senior Developer / PhD in Computer Science and Text Mining

Updated on June 08, 2022

Comments

  • juanmirocks
    juanmirocks almost 2 years

    I have a .tsv file with some fields being ranges like 1 - 4. I want to read these fields as they are textually written. However, upon file opening excel converts automatically those range fields to dates. For instance 1 - 4 is converted to 4-Jan. If I try to format back the cell to another type, the value is already changed and I can only get a useless number (39816). Even if the range fields are within double quotes, the wrong conversion to date still takes place. How to avoid this behavior?