6/14/2006

NOTE: String distance metrics

for
William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg (2003). A comparison of string metrics for matching names and records. KDD Workshop on Data Cleaning and Object Consolidation.

and
http://secondstring.sourceforge.net

1. Edit distance (differences in character position matter)

  • Levenshtein distance
  • Monge-Elkan distance
  • Smith-Waterman distance
  • Jaro similarity
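
As a concrete instance of the first group, here is a minimal sketch of the Levenshtein distance with unit costs for insertion, deletion, and substitution. The function name `levenshtein` is my own choice, not something taken from the paper or from SecondString.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Because the cost depends on where the characters sit, swapping two words in a name changes the score, which is what separates this group from the token-based metrics.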

2. Token-based distance (strings are treated as multisets of words)
  • Jaccard similarity
  • TFIDF (cosine similarity)
  • Jensen-Shannon distance
  • FS (Fellegi and Sunter) distance
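
The first two token-based metrics can be sketched as follows. This is only an illustration under my own assumptions: tokens are lowercase whitespace-separated words, Jaccard is computed on word sets, and the TFIDF weight is log(TF + 1) * log(N / DF) followed by length normalization, which is one common weighting scheme. The helper names `tokenize`, `jaccard`, `tfidf_vectors`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tokenize(s: str) -> list[str]:
    # Assumption: lowercase, whitespace-separated words.
    return s.lower().split()


def jaccard(s: str, t: str) -> float:
    """|S intersect T| / |S union T| over the two word sets."""
    a, b = set(tokenize(s)), set(tokenize(t))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tfidf_vectors(corpus: list[str]) -> list[dict[str, float]]:
    """One unit-length TFIDF vector per string in the corpus."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    # Document frequency: number of strings that contain each word.
    df = Counter(w for d in docs for w in d)
    vectors = []
    for d in docs:
        raw = {w: math.log(tf + 1) * math.log(n / df[w]) for w, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({w: v / norm for w, v in raw.items()} if norm else {})
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(weight * v.get(w, 0.0) for w, weight in u.items())


names = ["william cohen", "william w cohen", "pradeep ravikumar"]
vecs = tfidf_vectors(names)
print(jaccard(names[0], names[1]))
print(cosine(vecs[0], vecs[1]))
```

Word order plays no role here: "william cohen" and "cohen william" get a Jaccard similarity of 1.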

3. Hybrid distance (a combination of token-based and string-based metrics)
  • a variant of Monge-Elkan distance
  • softTFIDF
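
A rough sketch of the hybrid idea, in the Monge-Elkan style: split both strings into words, match each word of the first string to its most similar word in the second using a character-level metric, and average those best scores. The inner metric here is a normalized Levenshtein similarity, which is my assumption for illustration only; any of the metrics from group 1 could be plugged in through the `sim` parameter. The names `token_sim` and `monge_elkan` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    # Unit-cost edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def token_sim(a: str, b: str) -> float:
    """Assumed inner metric: 1 - distance / length of the longer token."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def monge_elkan(s: str, t: str, sim=token_sim) -> float:
    """Average, over the words of s, of the best inner similarity
    to any word of t."""
    a, b = s.lower().split(), t.lower().split()
    if not a or not b:
        return 0.0
    return sum(max(sim(x, y) for y in b) for x in a) / len(a)


print(monge_elkan("william cohen", "cohen william"))  # 1.0
print(monge_elkan("wiliam cohen", "william cohen"))
```

This tolerates both reordered words and small typos inside a word. Note that the score is not symmetric: `monge_elkan(s, t)` and `monge_elkan(t, s)` can differ when the strings have different numbers of words.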
