Remove columns of dataframe based on conditions in R
Solution 1
I feel like this is all over-complicated. Condition 2 already includes all the rest of the conditions, as if there are at least two non-NA
values in a column, obviously the whole column aren't NA
s. And if there are at least two consecutive values in a column, then obviously this column contains more than one value. So instead of 3 conditions, this all sums up into a single condition (I prefer not to run many functions per column, rather after running diff
per column- vecotrize the whole thing):
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
This works because if there are no consecutive values in a column, the whole column will become NA
s.
Then, just
df[, cond, drop = FALSE]
# A E
# 1 0.018 NA
# 2 0.017 NA
# 3 0.019 NA
# 4 0.018 NA
# 5 0.018 NA
# 6 0.015 0.037
# 7 0.016 0.031
# 8 0.019 0.025
# 9 0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
Per your edit, it seems like you have a data.table
object and you also have a Date
column so the code would need some modifications.
cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1]
df[, c(TRUE, cond), with = FALSE]
Some explanations:
- We want to ignore the first column in our calculations so we specify
.SDcols = -1
when operating on our.SD
(which means Sub Data indata.table
is) .N
is just the rows count (similar tonrow(df)
- Next step is to subset by condition. We need not forget to grab the first column too so we start with
c(TRUE,...
- Finally,
data.table
works with non standard evaluation by default, hence, if you want to select column as if you would in adata.frame
you will need to specifywith = FALSE
A better way though, would be just to remove the column by reference using := NULL
cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
Solution 2
Create logical vectors for each condition:
# condition 1
cond1 <- sapply(df, function(col) sum(!is.na(col)) < 2)
# condition 2
cond2 <- sapply(df, function(col) !any(diff(which(!is.na(col))) == 1))
# condition 3
cond3 <- sapply(df, function(col) all(is.na(col)))
Then combine them into one mask:
mask <- !(cond1 | cond2 | cond3)
> df[,mask,drop=F]
A E
1 0.018 NA
2 0.017 NA
3 0.019 NA
4 0.018 NA
5 0.018 NA
6 0.015 0.037
7 0.016 0.031
8 0.019 0.025
9 0.016 0.035
10 0.018 0.035
11 0.017 0.043
12 0.023 0.040
13 0.022 0.042
Aquarius
Areas of study interest inclusde investment banking, financial analysis and economics. Also, quite interested in behavioural studies and reading body language. Currently completing my higher education. R programming ethusiast. Advance skills in MS Excel.
Updated on June 23, 2022Comments
-
Aquarius almost 2 years
I have to remove columns in my dataframe which has over 4000 columns and 180 rows.The conditions I want to set in to remove the column in the dataframe are: (i) Remove the column if there are less then two values/entries in that column (ii) Remove the column if there are no two consecutive(one after the other) values in the column. (iii) Remove the column having all values as NA. I have provided with conditions on which a column is to be deleted. The aim here is not just to find a column by its name like in "How do you delete a column in data.table?". I Illustrate as follows:
A B C D E 0.018 NA NA NA NA 0.017 NA NA NA NA 0.019 NA NA NA NA 0.018 0.034 NA NA NA 0.018 NA NA NA NA 0.015 NA NA NA 0.037 0.016 NA NA NA 0.031 0.019 NA 0.4 NA 0.025 0.016 0.03 NA NA 0.035 0.018 NA NA NA 0.035 0.017 NA NA NA 0.043 0.023 NA NA NA 0.040 0.022 NA NA NA 0.042
Desired dataframe:
A E 0.018 NA 0.017 NA 0.019 NA 0.018 NA 0.018 NA 0.015 0.037 0.016 0.031 0.019 0.025 0.016 0.035 0.018 0.035 0.017 0.043 0.023 0.040 0.022 0.042
How can I incoporate these three conditions in one code. I would appreciate your help in this regard. Reproducible example
structure(list(Month = c("Jan-2000", "Feb-2000", "Mar-2000", "Apr-2000", "May-2000", "Jun-2000"), A.G.L.SJ.INVS...LON..DEAD...13.08.15 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ABACUS.GROUP.DEAD...18.02.09 = c(0.00829384766220866, 0.00332213653674028, 0, 0, NA, NA), ABB.R..IRS. = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Month", "A.G.L.SJ.INVS...LON..DEAD...13.08.15", "ABACUS.GROUP.DEAD...18.02.09", "ABB.R..IRS."), class = c("data.table", "data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000001c90788>)