Un meilleur algorithme de classement de similitude pour les chaînes de longueur variable

Question

Un meilleur algorithme de classement de similitude pour les chaînes de longueur variable

Demandé el 17 de Mars, 2009: Quand la question a-t-elle été
26144 affichage: Nombre de visites la question a
5 Réponses: Nombre de réponses aux questions
Résolu: Situation réelle de la question

Je cherche un algorithme de similitude de chaîne qui donne de meilleurs résultats sur longueur variable cordes que celles qui sont habituellement suggèrent (distance de levenshtein, soundex, etc.).

Par exemple,

Chaîne donnée A: « Robert »,

Puis chaîne B: « Amy Robertson »

serait un meilleur match que

String c : « Richard »

En outre, de préférence, cet algorithme doit être indépendant du langage (fonctionne aussi en langues autres que l’anglais).

Demandé el 17 de Mars, 2009 par marzagao

Answer 1

5 Réponses

Answer 2

159voto

marzagao Points 1701

Simon Blanc de Catalysoft a écrit un article sur un très habile algorithme qui compare adjacentes paires de caractères qui fonctionne vraiment bien pour mes besoins:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon a une version Java de l'algorithme et ci-dessous, j'ai écrit un PL/Ruby version (prises à partir de la plaine de version de ruby fait dans le forum connexes entrée de commentaire par Mark Wong-VanHaren), de sorte que je peux l'utiliser dans mes requêtes PostgreSQL:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

Fonctionne comme un charme!

Répondu el 17 de Mars, 2009 par marzagao (1701 Points )

Answer 3

77voto

Michael La Voie Points 12445

Le service de marzagao est génial. Je l'ai converti en C #, alors j'ai pensé le poster ici:

Lien Pastebin

 /// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

Répondu el 2 de Novembre, 2009 par Michael La Voie (12445 Points )

Answer 4

44voto

John Rutledge Points 188

Voici une autre version de la réponse de marzagao , celle-ci écrite en Python:

 def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return [s[i:i+2] for i in xrange(len(s) - 1)]

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    '''
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    '''
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print

Répondu el 28 de Juillet, 2011 par John Rutledge (188 Points )

Answer 5

21voto

xiaomao Points 1576

Une version plus courte de la réponse de @John Rutledge:

 def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    intersection = pairs1 & pairs2
    return (2.0 * len(intersection)) / (len(pairs1) + len(pairs2))

Répondu el 31 de Janvier, 2013 par xiaomao (1576 Points )

Answer 6

17voto

Igal Alkon Points 81

Voici mon PHP de mise en œuvre de propositions de StrikeAMatch algorithme, par Simon Blanc. les avantages (comme il est dit dans le lien):

Une véritable réflexion lexicale de similarité de chaînes de caractères avec de légères différences doivent être reconnues comme étant similaire. En particulier, une importante chaîne de chevauchement doit pointer vers un niveau élevé de similitude entre les cordes.
Une robustesse aux changements de l'ordre des mots - deux chaînes de caractères qui contiennent les mêmes mots, mais dans un ordre différent, devrait être reconnu comme étant similaire. Sur l'autre main, si une chaîne est juste un hasard anagramme de caractères contenues dans les autres, alors il devrait (normalement) être reconnu comme dissemblables.

L'Indépendance de langue - l'algorithme devrait fonctionner non seulement en anglais, mais dans de nombreuses langues différentes.

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

Répondu el 4 de Septembre, 2012 par Igal Alkon (81 Points )

Un meilleur algorithme de classement de similitude pour les chaînes de longueur variable

Réponses

Questions en vedette

Top Tags

Prograide.com

Powered by:

Un meilleur algorithme de classement de similitude pour les chaînes de longueur variable

Réponses

Questions en vedette

Top Tags

Dans notre réseau

Prograide.com

Powered by: