Principal component analysis in Python

Question

Asked: November 13, 2009
Views: 46489
Answers: 5
Status: Solved

I'd like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh?

I don't just want to use singular value decomposition (SVD) because my input data are quite high-dimensional (~460 dimensions), so I think SVD will be slower than computing the eigenvectors of the covariance matrix.

I was hoping to find a ready-made, debugged implementation that already makes the right decisions about when to use which method, and that perhaps applies other optimizations I don't know about.

Asked November 13, 2009 by Vebjorn Ljosa
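
For reference, the covariance-plus-eigh route the question has in mind can be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper name `pca_eigh`, the random data and the choice of 10 components are all made up for illustration, and nothing here decides between eigh and SVD for you.

```python
import numpy as np

def pca_eigh(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                          # center each variable (column)
    cov = np.dot(Xc.T, Xc) / (X.shape[0] - 1)        # features x features covariance
    evals, evecs = np.linalg.eigh(cov)               # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_components]   # indices of the largest ones
    components = evecs[:, order]                     # columns are principal directions
    scores = np.dot(Xc, components)                  # data projected onto those directions
    return scores, evals[order], components

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 460))                     # made-up data, 460 dimensions as in the question
scores, variances, components = pca_eigh(X, n_components=10)
print(scores.shape)
print(variances[:3])
```

The eigenvalues returned are the variances of the projected data along each kept direction, so they can be compared directly with `scores.var(axis=0, ddof=1)`.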

Answer 1

5 Answers

Answer 2

66 votes

denis (7316 points)

Months later, here's a small class PCA, and a picture:

#!/usr/bin/env python
""" a small class for Principal Component Analysis
Usage:
    p = PCA( A, fraction=0.90 )
In:
    A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
    fraction: use principal components that account for e.g.
        90 % of the total variance

Out:
    p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
    p.dinv: 1/d or 0, see NR
    p.eigen: the eigenvalues of A'A (A.T . A), in decreasing order (p.d**2).
        eigen[j] / eigen.sum() is PC j's fraction of the total variance;
        look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
    p.npc: number of principal components,
        e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
        It's ok to change this; methods use the current value.

Methods:
    The methods of class PCA transform vectors or arrays of e.g.
    20 variables, 2 principal components and 1000 observations,
    using partial matrices U' d' Vt', parts of the full U d Vt:
    A ~ U' . d' . Vt' where e.g.
        U' is 1000 x 2
        d' is diag([ d0, d1 ]), the 2 largest singular values
        Vt' is 2 x 20.  Dropping the primes,

    d . Vt      2 principal vars = p.vars_pc( 20 vars )
    U           1000 obs = p.pc_obs( 2 principal vars )
    U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
        fast approximate A . vars, using the `npc` principal components

    Ut              2 pcs = p.obs_pc( 1000 obs )
    V . dinv        20 vars = p.pc_vars( 2 principal vars )
    V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
        fast approximate Ainverse . obs: vars that give ~ those obs.

Notes:
    PCA does not center or scale A; you usually want to first
        A -= A.mean(axis=0)
        A /= A.std(axis=0)
    with the little class Center or the like, below.

See also:
    http://en.wikipedia.org/wiki/Principal_component_analysis
    http://en.wikipedia.org/wiki/Singular_value_decomposition
    Press et al., Numerical Recipes (2 or 3 ed), SVD
    PCA micro-tutorial
    iris-pca .py .png

"""

from __future__ import division
import numpy as np
dot = np.dot
    # import bz.numpyutil as nu
    # dot = nu.pdot

__version__ = "2010-04-14 apr"
__author_email__ = "denis-bz-py at t-online dot de"

#...............................................................................
class PCA:
    def __init__( self, A, fraction=0.90 ):
        assert 0 <= fraction <= 1
            # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
        self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
        assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
        self.eigen = self.d**2
        self.sumvariance = np.cumsum(self.eigen)
        self.sumvariance /= self.sumvariance[-1]
        self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
        self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                for d in self.d ])

    def pc( self ):
        """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
        n = self.npc
        return self.U[:, :n] * self.d[:n]

    # These 1-line methods may not be worth the bother;
    # then use U d Vt directly --

    def vars_pc( self, x ):
        n = self.npc
        return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal

    def pc_vars( self, p ):
        n = self.npc
        return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars

    def pc_obs( self, p ):
        n = self.npc
        return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs

    def obs_pc( self, obs ):
        n = self.npc
        return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal

    def obs( self, x ):
        return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs

    def vars( self, obs ):
        return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars

class Center:
    """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
        uncenter(x) == original A . x
    """
        # mttiw
    def __init__( self, A, axis=0, scale=True, verbose=1 ):
        self.mean = A.mean(axis=axis)
        if verbose:
            print("Center -= A.mean:", self.mean)
        A -= self.mean
        if scale:
            std = A.std(axis=axis)
            self.std = np.where( std, std, 1. )
            if verbose:
                print("Center /= A.std:", self.std)
            A /= self.std
        else:
            self.std = np.ones( A.shape[-1] )
        self.A = A

    def uncenter( self, x ):
        return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )

#...............................................................................
if __name__ == "__main__":
    import sys

    csv = "iris4.csv"  # wikipedia Iris_flower_data_set
        # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
    N = 1000
    K = 20
    fraction = .90
    seed = 1
    exec( "\n".join( sys.argv[1:] ))  # N= ...
    np.random.seed(seed)
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    try:
        A = np.genfromtxt( csv, delimiter="," )
        N, K = A.shape
    except IOError:
        A = np.random.normal( size=(N, K) )  # gen correlated ?

    print("csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction))
    Center(A)
    print("A:", A)

    print("PCA ...")
    p = PCA( A, fraction=fraction )
    print("npc:", p.npc)
    print("% variance:", p.sumvariance * 100)

    print("Vt[0], weights that give PC 0:", p.Vt[0])
    print("A . Vt[0]:", dot( A, p.Vt[0] ))
    print("pc:", p.pc())

    print("\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0")
    x = np.ones(K)
    # x = np.ones(( 3, K ))
    print("x:", x)
    pc = p.vars_pc(x)  # d' Vt' x
    print("vars_pc(x):", pc)
    print("back to ~ x:", p.pc_vars(pc))

    Ax = dot( A, x.T )
    pcx = p.obs(x)  # U' d' Vt' x
    print("Ax:", Ax)
    print("A'x:", pcx)
    print("max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf ))

    b = Ax  # ~ back to original x, Ainv A x
    back = p.vars(b)
    print("~ back again:", back)
    print("max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf ))

# end pca.py

Answered April 13, 2010 by denis (7316 points)
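
The way the class above picks `npc` (cumulative sum of the squared singular values, then `np.searchsorted`) may be easier to follow on a tiny example. The singular values below are made up for illustration; the variable names are not part of the class.

```python
import numpy as np

d = np.array([10.0, 5.0, 2.0, 1.0])         # made-up singular values, sorted descending
eigen = d ** 2                               # 100, 25, 4, 1
cumfrac = np.cumsum(eigen) / eigen.sum()     # cumulative fraction of the total variance
npc = np.searchsorted(cumfrac, 0.90) + 1     # smallest count whose cumulative fraction reaches 0.90
print(cumfrac)
print(npc)                                   # 2: the first two components cover ~96 %
```

Here the first component alone covers about 77 % of the total, which is below 0.90, so two components are kept.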

3 votes

FYI, there's an excellent talk on Robust PCA by C. Caramanis, January 2011.

Commented February 1, 2011 by denis

0 votes

Will this code produce that image (Iris PCA)? If not, can you post an alternative solution whose output would be that image? I'm having some difficulty converting this code to C++ because I'm new to Python :)

Commented February 28, 2014 by Orvyl

Answer 3

45 votes

ali_m (7185 points)

PCA using numpy.linalg.svd is very easy. Here's a simple demo:

import numpy as np
import matplotlib.pyplot as plt
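# scipy.misc.lena was removed from SciPy; ascent is a 512x512 grayscale
# stand-in (SciPy >= 1.10, downloads the image via the pooch package)
from scipy.datasets import ascent

# the underlying signal is a sinusoidally modulated image
img = ascent()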
t = np.arange(100)
time = np.sin(0.1*t)
real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]

# we add some noise
noisy = real + np.random.randn(*real.shape)*255

# (observations, features) matrix
M = noisy.reshape(noisy.shape[0],-1)

# singular value decomposition factorises your data matrix such that:
# 
#   M = U*S*V.T     (where '*' is matrix multiplication)
# 
# * U and V are the singular matrices, containing orthogonal vectors of
#   unit length in their rows and columns respectively.
#
# * S is a diagonal matrix containing the singular values of M - these 
#   values squared divided by the number of observations will give the 
#   variance explained by each PC.
#
# * if M is considered to be an (observations, features) matrix, the PCs
#   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
#   (features, observations) then the PCs would be the columns of
#   U*S^(1/2).
#
# * since U and V both contain orthonormal vectors, U*V.T is equivalent 
#   to a whitened version of M.

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# PCs are already sorted by descending order 
# of the singular values (i.e. by the
# proportion of total variance they explain)

# if we use all of the PCs we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print("Using all PCs, MSE = %.6G" % (np.mean((M - Mhat)**2)))

# if we use only the first 20 PCs the reconstruction is less accurate
Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
print("Using first 20 PCs, MSE = %.6G" % (np.mean((M - Mhat2)**2)))

fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
ax1.imshow(img)
ax1.set_title('true image')
ax2.imshow(noisy.mean(0))
ax2.set_title('mean of noisy images')
ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
ax3.set_title('first spatial PC')
plt.show()

Answered September 5, 2012 by ali_m (7185 points)
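
The loss of accuracy when only some PCs are kept can be read straight off the singular values: the squared error of a rank-k reconstruction equals the sum of the squared singular values that were dropped. A minimal sketch on made-up random data (much smaller than the image example above; `k = 5` is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
M = rng.normal(size=(50, 200))               # made-up (observations, features) matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 5                                        # number of components to keep (arbitrary)
Mk = np.dot(U[:, :k] * s[:k], Vt[:k])        # rank-k reconstruction U' . d' . Vt'
mse = np.mean((M - Mk) ** 2)                 # measured reconstruction error
predicted = np.sum(s[k:] ** 2) / M.size      # error predicted from the dropped singular values
print(mse, predicted)
```

The two printed numbers should agree up to rounding, so the MSE for any k can be estimated without forming the reconstruction.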

2 votes

I realize I'm a little late here, but the OP specifically asked for a solution that avoids singular value decomposition.

Commented February 4, 2015 by Alex A.

1 vote

@Alex I realize that, but I'm convinced that SVD is still the right approach. It should easily be fast enough for the OP's needs (my example above, with 262144 dimensions, takes only ~7.5 sec on a normal laptop), and it is much more numerically stable than the eigendecomposition method (see dwf's comment below). I also note that the accepted answer uses SVD as well!

Commented February 4, 2015 by ali_m
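
One way to see the stability point: forming X'X squares the condition number, so the eigendecomposition route starts from a worse-conditioned matrix than the SVD of X does. A minimal sketch, assuming made-up data whose columns have deliberately different scales:

```python
import numpy as np

rng = np.random.RandomState(0)
# made-up data: four columns with very different scales
X = rng.normal(size=(500, 4)) * np.array([1.0, 1e-1, 1e-2, 1e-3])
X -= X.mean(axis=0)

cond_X = np.linalg.cond(X)                   # conditioning seen by an SVD of X
cond_XtX = np.linalg.cond(np.dot(X.T, X))    # conditioning seen by an eigendecomposition of X'X
print(cond_X, cond_XtX)

# on a problem this mild, both routes still agree on the spectrum
s = np.linalg.svd(X, compute_uv=False)
evals = np.linalg.eigvalsh(np.dot(X.T, X))[::-1]   # eigvalsh is ascending, so reverse
print(s ** 2)
print(evals)
```

With more extreme scale differences the small eigenvalues of X'X lose digits first, which is where the SVD keeps its advantage.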

0 votes

I'm not disputing that SVD is the way to go; I was merely saying that the answer doesn't address the question as it was asked. It's a good answer though, nice work.

Commented February 4, 2015 by Alex A.

Answer 4

34 votes

Noam Peled (501 points)

You can use sklearn:

import sklearn.decomposition as deco
import numpy as np

x = np.random.rand(100, 10)  # placeholder data: 100 observations x 10 features
n_components = 3  # placeholder: number of components to keep after reduction

x = (x - np.mean(x, 0)) / np.std(x, 0)  # you need to normalize your data first
pca = deco.PCA(n_components)
x_r = pca.fit(x).transform(x)
print('explained variance (first %d components): %.2f' % (n_components, sum(pca.explained_variance_ratio_)))

Answered August 22, 2013 by Noam Peled (501 points)

0 votes

I have 460+ dimensions, and even though sklearn uses SVD and the question asked for a non-SVD approach, I think 460 dimensions is probably fine.

Commented October 12, 2013 by Dan S

0 votes

You might also want to remove columns that have a constant value (std=0). For that you should use: remove_cols = np.where(np.all(x == np.mean(x, 0), 0))[0] and then x = np.delete(x, remove_cols, 1)

Commented September 3, 2015 by Noam Peled
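
The constant-column issue from the last comment can be sketched on a tiny array. This is an illustrative variant that filters on the standard deviation with a boolean mask instead of np.where / np.delete; the data are made up.

```python
import numpy as np

x = np.array([[1.0, 5.0, 2.0],
              [2.0, 5.0, 4.0],
              [3.0, 5.0, 9.0]])              # made-up data; the middle column is constant

keep = np.std(x, 0) > 0                      # True for columns that actually vary
x = x[:, keep]                               # drop constant columns before scaling
x = (x - np.mean(x, 0)) / np.std(x, 0)       # safe now: no division by zero
print(x.shape)                               # (3, 2)
```

Without the filtering step the normalization would divide by zero for the constant column and fill it with NaN.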

Answer 5

31 votes

tom10 (19886 points)

matplotlib.mlab has a PCA implementation.

Answered November 13, 2009 by tom10 (19886 points)

5 votes

The link to matplotlib's PCA has been updated.

Commented January 18, 2012 by Developer

3 votes

The matplotlib.mlab implementation of PCA uses SVD.

Commented March 29, 2012 by Aman

3 votes

Here is a more detailed description of its functions and how to use it.

Commented March 4, 2013 by Dolan Antenucci

Answer 6

28 votes

ChristopheD (38217 points)

You could have a look at MDP.

I haven't had the chance to test it myself, but I bookmarked it exactly for the PCA functionality.

Answered November 13, 2009 by ChristopheD (38217 points)

8 votes

MDP has not been maintained since 2012, so it doesn't seem to be the best solution.

Commented January 9, 2015 by Marc Garcia

0 votes

The latest update is from 09.03.2016, but note that it is only a bug-fix release: "Note that from this release MDP is in maintenance mode. 13 years after its first public release, MDP has reached full maturity and no new features are planned in the future."

Commented October 5, 2016 by Gabriel
