How to use scikit-learn PCA for features reduction and know which features are discarded


Solution 1

The directions (principal axes) that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.

Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
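
For instance, here is a minimal sketch (random data and arbitrary shapes are assumed purely for illustration) showing that the transformed data is exactly such a weighted sum of the centred original features:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 10)              # 50 samples, 10 features (made-up data)

pca = PCA(n_components=3).fit(X)

# Each row of pca.components_ holds the weights that mix all 10 original
# features into one principal component.
print(pca.components_.shape)      # (3, 10)

# transform() projects the centred data onto those directions, i.e. it is a
# weighted sum over every original feature -- none of them is dropped.
X_new = pca.transform(X)
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_new, X_manual))   # True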

If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple, general feature selection methods, you can take a look at sklearn.feature_selection.
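
As a hedged sketch of what such feature selection could look like (the data, the classification target y and k=5 are all assumptions made up for this example):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.rand(50, 10)                       # 50 samples, 10 features
y = rng.randint(0, 2, size=50)             # a made-up binary target

# Keep the 5 original columns that score best on an ANOVA F-test:
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)          # shape (50, 5)

# Unlike PCA, you can ask exactly which original features were retained:
print(selector.get_support(indices=True))  # indices of the 5 kept columns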

Solution 2

Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behaviour is akin to compression, not discarding.

X_proj would be a better name than X_new, because it is the projection of X onto the principal components.

You can reconstruct X_rec as

X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new

Here, X_rec is close to X, but PCA has dropped the less important information, so we can say that X_rec is denoised.

In my opinion, it is the noise that is discarded.
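
A minimal sketch of this projection / reconstruction round-trip (the random data and n_components=2 are assumptions chosen only for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                      # 100 samples, 5 features

pca = PCA(n_components=2).fit(X)
X_proj = pca.transform(X)                 # the "compressed" representation
X_rec = pca.inverse_transform(X_proj)     # back in the original feature space

print(X_rec.shape == X.shape)             # True: same shape as X
# X_rec differs from X only along the low-variance directions PCA dropped:
print(np.abs(X - X_rec).max())            # small, but not exactly zero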

Solution 3

The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted, so it cannot be used to identify the important features.

components_ : array, [n_components, n_features] Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
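
A small sketch (on made-up random data) that illustrates the ordering the documentation describes:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 6)                          # 200 samples, 6 features

pca = PCA(n_components=4).fit(X)

print(pca.components_.shape)                  # (4, 6): one row per component
print(pca.explained_variance_ratio_)          # values in decreasing order
# Rows of components_ are ordered by component importance (PC1 first),
# not by any ordering of the input features:
print(np.all(np.diff(pca.explained_variance_) <= 0))   # True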


Author: gc5 (updated on July 09, 2022)

Comments

  • gc5
    gc5 almost 2 years

    I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.

    Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:

    from sklearn.decomposition import PCA
    
    nf = 100
    pca = PCA(n_components=nf)
    # X is the matrix transposed (n samples on the rows, m features on the columns)
    pca.fit(X)
    
    X_new = pca.transform(X)
    

    Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded or which ones have been retained?

    Thanks

    • Tom Ron
      Tom Ron about 10 years
      Features are not discarded; they are projected onto a smaller dimension, which is supposed to reveal interesting connections between the different features.
    • gc5
      gc5 about 10 years
      Thanks Tom, I was thinking PCA could be used for feature selection, but (correct me if I am wrong) it is only used to rescale the data onto the principal components. Reading your comment, I think I'll close the question.
    • eickenberg
      eickenberg about 10 years
      Your output matrix should be of shape (n, nf), not (nf, n).
  • gc5
    gc5 about 10 years
    I finally understood what PCA does (hopefully). Is there any preferred correlation function to compute whether a feature is correlated with a principal component? That way I think I would be able to find the most representative dimensions in my dataset (correct me if I am wrong). May I just use Pearson correlation or cosine similarity?
  • eickenberg
    eickenberg about 10 years
    Thumbs up for understanding PCA ;) -- In order to be able to answer your question, we need to be very clear about what is meant by feature and dimension. There is potential for confusion with both. The features you specified are the columns of your matrix. In order to see whether PCA component 0 makes use of feature i, you can compare pca.components_[0, i] to the rest of pca.components_[0]. So if I understand your question correctly, then the answer is to look at a given PC and see which of your features have the strongest weights.
  • eickenberg
    eickenberg about 10 years
    Disclaimer: If you select features according to weights in your principal components you may or may not obtain something interesting. Once again, PCA is not made for throwing away features as defined by the canonical axes. In order to be sure of what you are doing, try selecting k features using sklearn.feature_selection.SelectKBest with sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression, depending on whether your target is categorical or numerical.
  • gc5
    gc5 about 10 years
    Ok, I'll take a look at those. To answer your previous question, I see the components as pseudo-samples; is that wrong? I use feature and dimension interchangeably. However, in order to get k features (as a kind of feature selection), I think I would have to swap samples and features, to obtain PCs that are pseudo-features (and not pseudo-samples). I do not know if that is clear. In this scenario I could correlate each feature with each PC, to see if it shows the same behaviour across all samples. Thanks anyway for the effort :)
  • gc5
    gc5 about 10 years
    Ok, maybe another step forward: the PCs are not pseudo-samples but arrays of projections of the features onto each principal component. So, if I got this right, if some features have high weights together in one PC (e.g. A = 0.75 and B = 0.9) and are not relevant in the other PCs (say A = 0.1 and B = 0.05), maybe we can say that they can be summarized by B (if our objective is feature selection).
  • Sos
    Sos over 4 years
    Guys, awesome discussion here, this was very interesting. Just to make sure, @eickenberg: if I want to select the top 100 features that show the highest weight on my PC1 (i.e. presumably the 100 most informative features), would you then use pca.components_[0,:100] to select them?
  • Sos
    Sos over 4 years
    The components_ array is sorted according to explained variance, which means that components_[0] is PC1, components_[1] is PC2, etc, from highest to lowest explained variance. If I understood correctly, what the answer above says is that you can use these to then select which input features have the highest weight on each of these PCs
  • eickenberg
    eickenberg over 4 years
    Selecting pca.components_[0, :100] looks at the first 100 entries of the 0th row of that array. The 0th row does mean the first component, yes, but :100 will just select the weights on the first 100 features in the order you input them. If you wanted to assess the weights by their size/magnitude (it is unclear whether this is a good idea), then to identify them you'd want to do np.abs(pca.components_[0]).argsort()[::-1][:100] (sort/argsort start at the smallest, so either use [::-1] or an appropriate keyword to invert, then cut off at 100); see the sketch below. Remove np.abs if you want to keep the sign.
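
Following up on that last comment, here is a sketch of the magnitude-based ranking it suggests (the random data, the number of components and the "top 100" cut-off are assumptions for illustration only):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(500, 300)                    # 500 samples, 300 features

pca = PCA(n_components=10).fit(X)

# Indices of the 100 features with the largest absolute weight on PC1:
top_100 = np.abs(pca.components_[0]).argsort()[::-1][:100]
print(top_100[:10])                       # feature indices, largest weight first

# Note: pca.components_[0, :100] would instead simply take the first 100
# features in input order, which is not the same thing.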