Creating histograms in bash


Solution 1

Following the same algorithm as in my previous answer, I wrote an awk script that is extremely fast (look at the picture).

[image]

The script is the following:

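#!/usr/bin/awk -f

BEGIN{
    bin_width=0.1;   # width of each channel
}
{
    # The small offset puts a value that sits exactly on a boundary
    # (e.g. 1.0) into the lower bin, so bin h covers roughly
    # (h*bin_width, (h+1)*bin_width].
    bin=int(($1-0.0001)/bin_width);
    # awk treats a missing array element as 0, so no existence check is needed
    hist[bin]++
}
END{
    for (h in hist)
        printf " * > %2.2f  ->  %i \n", h*bin_width, hist[h]
}
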

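bin_width is the width of each channel. To use the script, copy it into a file, make it executable (chmod +x <namefile>), and run it with ./<namefile> <name_of_data_file>. The bins are printed in an unspecified order, because awk's for (h in hist) does not guarantee one; pipe the output through sort -k3,3n if you need it sorted. Empty bins are not printed.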
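If you want the bins in a fixed order, with empty bins shown as 0 and the bin width passed in as a variable, the same idea can be sketched as below. This is only a sketch under assumptions: the wrapper name hist_sorted and the nbins parameter are made up for illustration, the values are assumed to be non-negative, and the %.1f labels assume a bin width with one decimal. Unlike the script above, it truncates x / bin_width directly, so bin b covers [b*bin_width, (b+1)*bin_width).

```shell
# Hypothetical wrapper: histogram of the first column read from stdin.
# Usage: hist_sorted <bin_width> <number_of_bins>
hist_sorted() {
  awk -v bin_width="$1" -v nbins="$2" '
    NF {
      # int() truncates, so a non-negative value x lands in bin floor(x / bin_width).
      # The tiny epsilon is an assumption: it keeps values such as 0.3 from
      # falling into bin 2 because 0.3 / 0.1 is slightly below 3 in floating point.
      bin = int($1 / bin_width + 1e-9)
      # Assumption: the top edge (e.g. 1.0) is clamped into the last bin.
      if (bin >= nbins) bin = nbins - 1
      hist[bin]++
    }
    END {
      # Walk the bins by index: the order is fixed and empty bins print as 0.
      for (b = 0; b < nbins; b++)
        printf "%.1f-%.1f %d\n", b * bin_width, (b + 1) * bin_width, hist[b]
    }'
}

printf '%s\n' 0.34 0.45 0.44 0.12 0.45 0.98 | hist_sorted 0.1 10
```

For the six sample values from the question this prints ten lines, from 0.0-0.1 to 0.9-1.0, with a count of 3 in the 0.4-0.5 line.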

Solution 2

For this specific problem, I would drop the last digit, then count occurrences of sorted data:

cut -b1-3 | sort | uniq -c

which gives, on the specified input set:

  1 0.1
  1 0.3
  3 0.4
  1 0.9

Output formatting can be done by piping through this awk command:

| awk 'BEGIN{i=0}
       {b=int($2*10+0.5)   # integer bin index avoids floating-point drift
        for(;i<b;i++) printf "%1.1f-%1.1f %3d\n",i/10,(i+1)/10,0
        printf "%1.1f-%1.1f %3d\n",b/10,(b+1)/10,$1; i=b+1}
       END{for(;i<10;i++) printf "%1.1f-%1.1f %3d\n",i/10,(i+1)/10,0}'

Solution 3

The only loop in this algorithm is the one over the lines of the file.

This is an example of how to do what you asked in bash. Bash is probably not the best language for this, since it is slow at math. I use bc; you can use awk if you prefer.

How the algorithm works

Imagine you have many bins, each corresponding to an interval. Each bin is characterized by a width (CHANNEL_DIM) and a position. Taken together, the bins must cover the entire interval in which your data fall. Dividing a value by the bin width gives the position of its bin, so you only have to add 1 to the count of that bin. Here is a much more detailed explanation.
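As a small worked example of the division step, here is a sketch in awk. The helper name bin_index is made up for illustration, and it truncates the quotient, so bin i covers [i*width, (i+1)*width); the script in this answer rounds with printf '%.0f' instead, which centers each channel on a multiple of the width.

```shell
# Hypothetical helper: index of the bin that holds <value> for a given <width>.
# Usage: bin_index <value> <width>
bin_index() {
  # The tiny epsilon is an assumption: it guards against quotients such as
  # 0.3 / 0.1, which come out slightly below 3 in floating point.
  awk -v x="$1" -v w="$2" 'BEGIN { print int(x / w + 1e-9) }'
}

bin_index 0.34 0.1   # prints 3
bin_index 0.98 0.1   # prints 9
```

Incrementing the counter stored at that index, once per input line, is all the histogram needs. The full bash script of this answer follows.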

#!/bin/bash

# This is the input: you can use $1 and $2 to read input as cmd line argument
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9  # They are actually 10: 0 is already a channel

# check the max and the min to define the dimension of the channels:
MAX=`sort -n $FILE | tail -n 1`
MIN=`sort -rn $FILE | tail -n 1`

# Define the channel width 
CHANNEL_DIM_LONG=`echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l` 
CHANNEL_DIM=`printf '%2.2f' $CHANNEL_DIM_LONG `
# Probably printf is not the best function in this context because
#+the result could be system dependent.

# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
  NUMBER=$1
  CHANNEL_DIM=$2

  # The channel is found dividing the value for the channel width and 
  #+rounding it.
  RESULT_LONG=`echo $NUMBER/$CHANNEL_DIM | bc -l`
  RESULT=`printf '%.0f' $RESULT_LONG`
  echo $RESULT
}

# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do

  CHANNEL=$(find_channel "$line" "$CHANNEL_DIM")

  # Initialize the counter the first time a channel is seen
  [[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
  let "HIST[$CHANNEL]+=1"
done < "$FILE"

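# HIST is sparse, so iterate over the indices that were actually filled;
# a running counter would attach the counts to the wrong channels.
for c in "${!HIST[@]}"; do
  # find_channel rounds, so channel c is centered on c * CHANNEL_DIM
  CHANNEL_START=$(echo "$CHANNEL_DIM * $c - $CHANNEL_DIM / 2" | bc -l)
  CHANNEL_END=$(echo "$CHANNEL_DIM * $c + $CHANNEL_DIM / 2" | bc -l)
  printf '%+.2f : %.2f => %i\n' "$CHANNEL_START" "$CHANNEL_END" "${HIST[$c]}"
done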
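If you prefer awk to bc, the same min/max-based channel layout can be sketched in a single awk process. This is an assumption-laden sketch, not a drop-in replacement: the function name hist_minmax is made up, it keeps all values in memory, and it truncates (value - min) / width instead of rounding value / width as the script above does.

```shell
# Hypothetical helper: split [min, max] of the data into <nbins> equal channels.
# Usage: hist_minmax <nbins> < data_file
hist_minmax() {
  awk -v nbins="$1" '
    NF {
      # Store every value and track the extremes in the same pass.
      v[n++] = $1 + 0
      if (n == 1 || $1 < min) min = $1 + 0
      if (n == 1 || $1 > max) max = $1 + 0
    }
    END {
      if (n == 0) exit
      width = (max - min) / nbins
      for (i = 0; i < n; i++) {
        # If all values are equal, width is 0 and everything goes to channel 0.
        b = (width > 0) ? int((v[i] - min) / width) : 0
        if (b >= nbins) b = nbins - 1   # the maximum goes into the last channel
        hist[b]++
      }
      # Print lower edge, upper edge and count for every channel, in order.
      for (b = 0; b < nbins; b++)
        printf "%.3f %.3f %d\n", min + b * width, min + (b + 1) * width, hist[b]
    }'
}

printf '%s\n' 0.34 0.45 0.44 0.12 0.45 0.98 | hist_minmax 2
```

With the six sample values and two channels, five values fall into the lower channel and one (0.98) into the upper one.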

Hope this helps. Comment if you have other questions.

Author: Chem-man17

Updated on June 09, 2022

Comments

  • Chem-man17
    Chem-man17 almost 2 years

    EDIT

    I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question, the aim is to get the frequencies of individual numbers in the column. However, if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. That is, if that solution tells me that the frequency of 0.45 is 2 and that of 0.44 is 1 (for my input data), I still have to group those two frequencies into a total of 3 for the range 0.4-0.5.

    END EDIT

    QUESTION-

    I have a long column of data with values between 0 and 1. This will be of the type-

    0.34
    0.45
    0.44
    0.12
    0.45
    0.98
    .
    .
    .
    

    A long column of decimal values with repetitions allowed.

    I'm trying to change it into a histogram sort of output such as (for the input shown above)-

    0.0-0.1  0
    0.1-0.2  1
    0.2-0.3  0
    0.3-0.4  1 
    0.4-0.5  3
    0.5-0.6  0
    0.6-0.7  0
    0.7-0.8  0
    0.8-0.9  0
    0.9-1.0  1
    

    Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.

    I wrote it (badly) as-

    for i in $(seq 0 0.1 0.9)
    do 
        awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l; 
    done
    

    Which basically does a wc -l of the entries it finds in each range.

    Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also, please note that the bin size should be a variable, as in my proposed solution.

    I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?