Lex: a small program to count words in the input

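You need a rule for 'uninteresting' characters: they are not part of any word, but they still have to be counted. Without such a rule, lex's default action copies unmatched input to the output, which is where the stray punctuation comes from.

You don't want to REJECT newlines. REJECT tells the scanner to try the next-best match for the same input, so the same text ends up being processed again; just count the newline and move on.

You don't need the trailing bracket expression on the definition of word, because lex already takes the longest match. You should probably also include capital letters in character.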

This seems to work:

%{
#include <stdio.h>
#include <stdlib.h>
int cno = 0, wno = 0, lno = 0; /*counts of characters, words and lines */
%}

character [a-zA-Z]
digit [0-9]
word ({character}|{digit})+
line \n

%%

{line} { cno++; lno++; }
{word} { wno++; cno += strlen(yytext); }
. { cno++; }

%%

int main(void)
{
    yylex();
    printf("Number of characters: %d; ", cno);
    printf("Number of words:      %d; ", wno);
    printf("Number of lines:      %d\n", lno);
    return 0;
}

When run on its own source code, the output was:

Number of characters: 463; Number of words:      65; Number of lines:      27

The standard wc command (which has a different definition of 'word') yields:

  27      73     463 xyz.l

This agrees on the number of lines and characters.

Author: goldfrapp04

Distributed Systems Engineer @ Scalyr

Updated on August 20, 2020

Comments

  • goldfrapp04
    goldfrapp04 over 3 years

    I'm extremely new to Lex and the complete requirement of this problem is as follows:

    Write a Lex input file that will produce a program that counts characters, words, and lines in a text file and reports the counts. Define a word to be any sequence of letters and/or digits, without punctuation or spaces. Punctuation and white space do not count as words.

    Now I've written down the code:

    %{
    #include <stdio.h>
    #include <stdlib.h>
    int cno = 0, wno = 0, lno = 0; /*counts of characters, words and lines */
    %}
    character [a-z]
    digit [0-9]
    word ({character}|{digit})+[^({character}|{digit})]
    line \n
    %%
    {line} { lno++; REJECT; }
    {word} { wno++; REJECT; }
    {character} { cno++; }
    %%
    void main()
    { yylex();
      fprintf(stderr, "Number of characters: %d; Number of words: %d; Number of lines: %d\n", cno, wno, lno);
      return;
    }
    

    I tested it with the text file:

    this is line #1
    line #2 is here
    !@#$%^&*()
    haha hey hey
    

    And I got the output

       #1
     #2  
    !@#$%^&*()
    
    Number of characters: 30; Number of words: 45; Number of lines: 4
    

    But the correct output should be

    Number of characters: 30; Number of words: 11; Number of lines: 4
    

    I guess the wrong "number of words" is somehow tied to the way every character is counted, so how should I modify my program to fix this?

    Also, some unnecessary output is printed (the punctuation). How should I modify my program to avoid it?

    Thank you very much.

  • goldfrapp04
    goldfrapp04 almost 12 years
    Thank you. There are some small misunderstandings of the requirement in your code but I've figured them out and I think I've come up with the correct code. Thanks a lot.
  • Jonathan Leffler
    Jonathan Leffler almost 12 years
    Hmmm...so you don't count characters that are not in words? The character count is the alphabetic characters in the words, not the punctuation and newlines and so on? But digits, while they're part of a word, do not count as characters? Interesting definitions; not immediately obvious from the question. But with those definitions and your sample data, you can end up with the 'right answer'.
  • neo1691
    neo1691 over 10 years
    You can also use lex's built-in variable yyleng instead of calling strlen().