R counting the occurrences of similar rows of data frame

r count duplicates dataframe

12,792

Solution 1

Ok, let's first do it in the easy case where you just have one column.

> data <- rep(sample(1000, 5),
              sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278

Then you can just use rle to figure out the contiguous sequences:

> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1

Or altogether:

> head(cbind(data, sequence(rle(data)$lengths)))
[1,]  435 1
[2,]  435 2
[3,]  435 3
[4,]  278 1
[5,]  278 2
[6,]  278 3

For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.

Solution 2

Okay I used the answer I had on another question and worked out a loop that I think will work. This is what I'm going to use:

cmpfun2 <- function(r) {
    count <- 0
    if (r[1] > 1)
    {
        for (row in 1:(r[1]-1))
        {
            if(all(r[27:51] == DF[row,27:51,drop=FALSE]))  # compare to row bind
            {
                count <- count + 1
            }
        }
    }
    return (count)
}
brows <- apply(DF[], 1, cmpfun2)
print(brows)

Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!

12,792

Author by

Matt

Updated on June 04, 2022

Comments

Matt about 2 years

I have data in the following format called DF (this is just a made up simplified sample):

eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0 random
1         1           1500     1500          100        120        40       232342
2         2           1000     1250          100        120        40       11843
3         3           1250     1250          100        120        40       981340234
4         4           1000     1187.5        100        120        40       4363453
5         1           2000     2000          200        100        40       345902
6         1           3000     3000          150        90         10       943
7         1           2000     2000          90         90         100      9304358
8         2           1800     1900          90         90         100      284333

However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows.

The example above uses the expected values, but assume they are incorrect.

How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?

I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I supposed I could just write a loop around that, but it seems inefficient to me.

Leo Alekseyev about 14 years

Your solution looks OK but if I were you I'd at least try to understand Jonathan's answer above. RLE is often a quicker and more robust alternative compared to loops and index gymnastics.