Implémentation simple de la similarité N-Gram, tf-idf et Cosine en Python

Question

Implémentation simple de la similarité N-Gram, tf-idf et Cosine en Python

Demandé el 4 de Mars, 2010: Quand la question a-t-elle été
14258 affichage: Nombre de visites la question a
5 Réponses: Nombre de réponses aux questions
Résolu: Situation réelle de la question

Je dois comparer les documents stockés dans une base de données et obtenir un score de similarité compris entre 0 et 1.

La méthode que je dois utiliser doit être très simple. Implémentation d'une version vanille de n-grammes (où il est possible de définir combien de grammes utiliser), ainsi qu'une implémentation simple des similitudes tf-idf et Cosine.

Y at-il un programme qui peut faire cela? Ou devrais-je commencer à écrire cela à partir de zéro?

Demandé el 4 de Mars, 2010 par seanieb

Answer 1

5 Réponses

Answer 2

52voto

roman Points 474

Découvrez NLTK paquet: http://www.nltk.org il a tout ce dont vous avez besoin

Pour le cosine_similarity:


def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

Pour ngrams:


def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """

    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]

pour le tf-idf, vous aurez pour calculer la distribution de la première, je suis à l'aide de Lucene pour le faire, mais vous pouvez très bien faire quelque chose de similaire avec NLTK, utilisez FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term

si vous aimez pylucene, cela va vous dire comment comute tf.idf

    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t,f,x) in zip(terms,frequencies,xrange(maxtokensperdoc)):
                    df= searcher.docFreq(Term(fieldname, t)) # number of docs with the given term
                        tmap.setdefault(t, len(tmap))
                        rec[t] = sim.tf(f) * sim.idf(df, max_doc)  #compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k,v in rec.items():
                    rec[k] = v / denom

Répondu el 2 de Mai, 2010 par roman (474 Points )

Answer 3

27voto

Tarantula Points 4231

Si cela vous intéresse, j'ai participé à une série de didacticiels ( parties I et II ) sur tf-idf et sur l'utilisation du module Python Scikits.learn (sklearn) .

Répondu el 22 de Octobre, 2011 par Tarantula (4231 Points )

Answer 4

9voto

alvas Points 4333

Voici une réponse avec seulement python + numpy , en bref:

Cosinus :

 def cosine_sim(u,v):
    return np.dot(u,v) / (sqrt(np.dot(u,u)) * sqrt(np.dot(v,v)))

Ngrams :

 def ngrams(sentence, n):
  return zip(*[sentence.split()[i:] for i in range(n)])

TF-IDF (c'est un peu bizarre mais ça marche):

 def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), 
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), 
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 
    'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300], 
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0], 
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try: 
            return float(len(matrix)) /sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix,doc,term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]

Voici la longue réponse aux tests:

 import numpy as np
from math import sqrt, log
from itertools import chain, product
from collections import defaultdict

def cosine_sim(u,v):
    return np.dot(u,v) / (sqrt(np.dot(u,u)) * sqrt(np.dot(v,v)))

def ngrams(sentence, n):
  return zip(*[sentence.split()[i:] for i in range(n)])

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), 
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), 
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 
    'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300], 
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0], 
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try: 
            return float(len(matrix)) /sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix,doc,term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]


def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        return [sentence.split().count(i) for i in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus():
    sent1 = "this is a foo bar"
    sent2 = "foo bar bar black sheep"
    sent3 = "this is a sentence"

    all_sents = [sent1,sent2,sent3]
    corpus, vocab = corpus2vectors(all_sents)
    return corpus, vocab

def test_cosine():
    corpus, vocab = create_test_corpus()

    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0],2)
        print ngrams(sentx[0],3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..."
test_cosine()
print
print "Testing ngrams..."
test_ngrams()
print
print "Testing tfidf..."
test_tfidf()
print

[en dehors]:

 Testing cosine...
this is a foo bar
this is a foo bar
cosine = 1.0

this is a foo bar
foo bar bar black sheep
cosine = 0.507092552837

this is a foo bar
this is a sentence
cosine = 0.67082039325

foo bar bar black sheep
this is a foo bar
cosine = 0.507092552837

foo bar bar black sheep
foo bar bar black sheep
cosine = 1.0

foo bar bar black sheep
this is a sentence
cosine = 0.0

this is a sentence
this is a foo bar
cosine = 0.67082039325

this is a sentence
foo bar bar black sheep
cosine = 0.0

this is a sentence
this is a sentence
cosine = 1.0


Testing ngrams...
this is a foo bar
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]

foo bar bar black sheep
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')]
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')]

this is a sentence
[('this', 'is'), ('is', 'a'), ('a', 'sentence')]
[('this', 'is', 'a'), ('is', 'a', 'sentence')]


Testing tfidf...
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

Répondu el 22 de Mars, 2014 par alvas (4333 Points )

Answer 5

4voto

larsmans Points 167484

Dans le cas où vous êtes toujours intéressé à ce problème, j'ai fait quelque chose de très similaire à l'aide de Lucene, Java et Python. Voici quelques extraits de mon code.

Lucene prétraite les documents et les requêtes à l'aide de soi-disant analyseurs. Celui-ci utilise Lucene intégré de n-grammes de filtre:

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

Pour activer une liste d' ngrams en Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

Pour stocker un document dans un index:

wr = IndexWriter(index_dir, NGramAnalyzer(), True,
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)

Création de requêtes est un peu plus difficile que de Lucene QueryParser s'attend à un langage de requêtes avec les opérateurs spéciaux, des citations, etc., mais il peut être contourné (comme en partie expliqué ici).

Répondu el 29 de Juillet, 2010 par larsmans (167484 Points )

Answer 6

3voto

Penang Points 772

Pour notre recherche d'Information Sûr, nous utilisons un code qui est écrit par notre professeur en Java. Désolé, aucun python port. "Il est publié à des fins éducatives et à des fins de recherche uniquement en vertu de la Licence Publique Générale de GNU."

Vous pouvez consulter la documentation http://userweb.cs.utexas.edu/~mooney/ir-cours/doc/

Mais plus précisément de départ: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

Vous pouvez le télécharger http://userweb.cs.utexas.edu/users/mooney/ir-course/

Répondu el 4 de Mars, 2010 par Penang (772 Points )

Implémentation simple de la similarité N-Gram, tf-idf et Cosine en Python

Réponses

Questions en vedette

Top Tags

Prograide.com

Powered by:

Implémentation simple de la similarité N-Gram, tf-idf et Cosine en Python

Réponses

Questions en vedette

Top Tags

Dans notre réseau

Prograide.com

Powered by: