Parsing text files

Solution 1

To add the SQL text, you could try this command prompt one-liner (inside a batch file, write %%i instead of %i):

(for /f %i in (words.txt) do @echo INSERT INTO Words ^(word^) VALUES ^('%i'^);) > words.sql

To filter out lines in a text file longer than 7 characters, you could use another command line tool, findstr:

findstr /v /r "^.........*$" words.txt > shorter-words.txt

The /r option specifies that you want to use regex matching, and the /v option tells it to print only the lines that do not match. The pattern is quoted so that the command prompt does not treat ^ as its escape character. (Since it appears that findstr doesn't allow you to specify a character count range, I faked it with the "8 or more" pattern and the "do not match" option.)
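
If a scripting route is acceptable, the two steps above (drop lines longer than 7 characters, then wrap each remaining word in an INSERT statement) could also be sketched in Python. This is only a sketch under assumptions: the function names short_words, to_insert and convert are made up for illustration, the file names are the ones used above, and doubling single quotes is an assumption about how your database escapes string literals.

```python
def short_words(lines, max_len=7):
    """Yield stripped words that are at most max_len characters long."""
    for line in lines:
        word = line.strip()  # removes the line ending and stray spaces
        if word and len(word) <= max_len:
            yield word


def to_insert(word):
    """Build one INSERT statement, doubling single quotes (assumed SQL escaping)."""
    escaped = word.replace("'", "''")
    return f"INSERT INTO Words (word) VALUES ('{escaped}');"


def convert(src="words.txt", dst="words.sql", max_len=7):
    """Read src, keep the short words, write one INSERT per line to dst."""
    with open(src, encoding="utf-8") as infile:
        statements = [to_insert(w) for w in short_words(infile, max_len)]
    with open(dst, "w", encoding="utf-8") as outfile:
        for statement in statements:
            outfile.write(statement + "\n")
    return len(statements)  # number of statements written
```

Calling convert("words.txt", "words.sql") would then produce the SQL file in one pass, without the intermediate shorter-words.txt.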

Solution 2

Perl for sure. Simply paste this script into a file (words.pl here) and run it in the same directory as the word list. Either rename your word list to words.txt or change the name in the script. You can redirect the output to a new file like so:

perl words.pl > list.txt

Without further ado (whipped it together quickly; it can be chopped down a fair bit):

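use strict;
use warnings;

# Open the word list for reading; stop with the system error if that fails.
open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";

# Print every word of at most 7 characters.
while (my $word = <$fh>) {
    chomp $word;    # drop the newline so it is not counted in the length
    print "$word\n" if length($word) <= 7;
}

close $fh;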

Solution 3

If you are used to Unix scripting, you can get the GnuWin32 port of sed for Windows XP, and likewise AWK and Perl. In that case, also consider Cygwin.

Otherwise there is also PowerShell.

Solution 4

gVim is a worthy editing tool that has its origins in the venerable vi used on Unix systems. You will want to use the substitute command to do global search/replacements for each word.

AWK and Perl are very powerful tools, but overkill for what you need. You'll enjoy gVim since it is an editor first and foremost. The thing that rocks with gVim is that you are only one keystroke away from giving it a search-and-replace command, and the search pattern can be written as a full regular expression.
Good luck.
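
For reference, the same idea of a regex-driven delete followed by a regex-driven substitution can be tried outside an editor. The Python sketch below is not a gVim command; it only illustrates the two patterns involved, and the names LONG_LINE, drop_long_lines and wrap_lines are hypothetical. Note that "longer than 7 characters" means a pattern of 8 or more, i.e. .{8,}. The sketch also assumes the words contain no single quotes.

```python
import re

# A whole line of 8 or more characters, plus its trailing newline if present.
LONG_LINE = re.compile(r"^.{8,}$\n?", re.MULTILINE)

# A whole non-empty line; group 1 captures the word itself.
ANY_LINE = re.compile(r"^(.+)$", re.MULTILINE)


def drop_long_lines(text):
    """Delete every line that is longer than 7 characters."""
    return LONG_LINE.sub("", text)


def wrap_lines(text):
    """Surround each remaining line with the INSERT statement text."""
    return ANY_LINE.sub(r"INSERT INTO Words (word) VALUES ('\1');", text)


if __name__ == "__main__":
    sample = "aardvark\napple\nazolio\n"
    print(wrap_lines(drop_long_lines(sample)), end="")
```

The first pattern does the filtering and the second does the wrapping, which is the same split you would use with two editor commands.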

Solution 5

Massively underestimated as a development tool is Microsoft Excel (or OpenOffice Calc). There is a maximum number of rows (65,536 in Excel 2003 and earlier; Excel 2007 raised the limit considerably), but you might still be able to take advantage of one of these tools.

Then you can just use the LEFT, MID, IF, etc. functions in formulas placed in the columns to the right of your lines. When you fill the formulas down, their relative references adjust automatically for each row.

Many times it's a lot easier than coding, unless you're a coder :) From there you can import, export, and do a lot of cool things even with text.

Author: Joe Phillips

Updated on September 17, 2022

Comments

  • Joe Phillips
    Joe Phillips almost 2 years

    I encountered a situation tonight where I wanted to parse a text file. I had a very, very long word list that contained English words delimited by lines. I wanted to get rid of every word (or line) that was longer than 7 characters. This would be simple in Linux but I can't seem to find a simple solution in Windows XP. I tried using Notepad++ regular expression search, but that was a huge failure. I tried using the expression .{6,} without finding any matches. I'm really at a loss because I thought this sort of thing would be extremely easy and there would be tons of tools to accomplish a task like this. It seems like Notepad++ supports every other feature in the world except the very basic ones that seem the most obvious.

    Another one of my goals was to put some code before and after the word on each line.

    aardvark
    apple
    azolio
    

    would turn into

    INSERT INTO Words (word) VALUES ('aardvark');
    INSERT INTO Words (word) VALUES ('apple');
    INSERT INTO Words (word) VALUES ('azolio');
    

    What suggestions/tools/tips do you have to accomplish tasks similar to this in Windows XP?

  • Joe Phillips
    Joe Phillips almost 15 years
    I know plenty of scripting/programming but I don't really think it's necessary. This is one of those times I'm trying to get used to something that isn't a programming solution.
  • Eli Bendersky
    Eli Bendersky almost 15 years
    why? wouldn't it be easier to just program it? you also get to keep a script that can be just reused later
  • Joe Phillips
    Joe Phillips almost 15 years
    This is somewhat of a theoretical question for future reference. I'd much rather have the option of programming OR using a tool
  • Admin
    Admin almost 15 years
    This file has +150k words. I do not think Excel will even open it.
  • Joe Phillips
    Joe Phillips almost 15 years
    This is actually quite fast and amazing. I never knew you could do this with windows command prompt!
  • Joe Phillips
    Joe Phillips almost 15 years
    It managed to do the findstr command on a 1.66 MB file in just a few seconds. It then did the SQL portion of it in under 1 minute. Very impressive.
  • Admin
    Admin almost 15 years
    Yes, you're right, Excel will only do 65536 rows.
  • Admin
    Admin almost 15 years
    Excel 2003 and before has these limitations, but if you have it available to you, Excel 2007 has greatly increased these limits. See office.microsoft.com/en-us/excel/HP100738491033.aspx .
  • Dan Rosenstark
    Dan Rosenstark almost 15 years
    been there, it's a bummer. have you tried odesk? :)