TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document

Question

Asked on September 3, 2016
15557 views
1 answer
Solved

I am using TfidfVectorizer from scikit-learn to do some feature extraction from text data. I have a CSV file with a Score (either +1 or -1) and a Review (text). I pulled this data into a DataFrame so I can run the Vectorizer on it.

This is my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train_new.csv",
             names = ['Score', 'Review'], sep=',')

# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])

This is the traceback for the error I get:

Traceback (most recent call last):
  File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
    x = v.fit_transform(df['Review'])
  File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

I checked the CSV file and the DataFrame for anything that is being read as NaN, but I can't find anything. There are 18000 rows, and none of them return True for isnan.

This is what df['Review'].head() looks like:

  0    This book is such a life saver.  It has been s...
  1    I bought this a few times for my older son and...
  2    This is great for basics, but I wish the space...
  3    This book is perfect!  I'm a first time new mo...
  4    During your postpartum stay at the hospital th...
  Name: Review, dtype: object

Asked on September 3, 2016 by boltthrower

1 Answer

133 votes

Nickil Maveli (16776 points)

You need to convert the object dtype to unicode strings, as the traceback clearly states.

 x = v.fit_transform(df['Review'].values.astype('U'))  ## Even astype(str) would work
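To see what that conversion does, here is a minimal sketch on a tiny in-memory Series instead of the asker's CSV (the sample reviews are made up for illustration). It suggests that astype('U') does not remove missing values: a NaN becomes the literal string 'nan', which the vectorizer then treats as an ordinary token.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data: the second review is missing (NaN).
reviews = pd.Series(["good book", np.nan, "bad book"], name="Review")

# Cast the object array to unicode strings; NaN turns into the text 'nan'.
docs = reviews.values.astype('U')

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
X = v.fit_transform(docs)  # no ValueError, every document is now a string

print(docs.tolist())
print(X.shape)
print(sorted(v.vocabulary_))
```

If a 'nan' token in the vocabulary is unwanted, dropping or blanking the missing rows before vectorizing is an alternative to the cast.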

From the documentation page of TfidfVectorizer:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
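As a complement, the sketch below shows one way to locate the offending rows before vectorizing. It assumes a small inline CSV (hypothetical data standing in for train_new.csv) and relies on two pandas behaviours: read_csv parses empty fields and strings such as NA as NaN by default, and a comparison with == np.nan never matches because NaN is not equal to itself, which may be why a check of that form finds nothing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for train_new.csv: one empty review, one literal "NA".
csv_text = "1,good book\n-1,\n1,NA\n-1,bad book\n"
df = pd.read_csv(io.StringIO(csv_text), names=['Score', 'Review'], sep=',')

# Equality against np.nan is always False, so this check finds nothing.
eq_hits = int((df['Review'] == np.nan).sum())

# isnull() is the reliable way to flag missing values.
missing = df['Review'].isnull()
missing_rows = df[missing].index.tolist()

# Two possible clean-ups: drop the rows, or replace NaN with an empty string.
cleaned = df.dropna(subset=['Review'])
filled = df['Review'].fillna('')

print(eq_hits, missing_rows, len(cleaned))
```

Which clean-up fits depends on whether the Score values for empty reviews should still count as training rows.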

Answered on September 3, 2016 by Nickil Maveli (16776 points)
