NOTE: String distance metrics
for
William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg (2003). A comparison of string metrics for matching names and records. KDD Workshop on Data Cleaning and Object Consolidation.
and
http://secondstring.sourceforge.net
1. edit distance (differences in character position matter)
- Levenshtein distance
- Monge-Elkan distance
- Smith-Waterman distance
- Jaro similarity
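As a rough illustration of the edit-distance family, here is a minimal Python sketch of plain Levenshtein distance with unit costs for insertion, deletion, and substitution. The function name `levenshtein` and the unit costs are my own choices for the example; this is not the SecondString implementation, and the paper also discusses variants with other cost schemes.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Because the recurrence walks both strings character by character, swapping the order of two words changes the result a lot, which is the sense in which position matters for this family.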
2. token-based distance (strings are treated as multisets of words)
- Jaccard similarity
- TFIDF (cosine similarity)
- Jensen-Shannon distance
- Fellegi-Sunter (FS) distance
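To make the token-based idea concrete, here is a small Python sketch of Jaccard similarity over word tokens. The lowercasing, the whitespace tokenization, and the use of plain sets instead of multisets are assumptions made for the example, not details taken from the paper.

```python
def jaccard(s: str, t: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over word tokens.

    Assumptions for this sketch: tokens are lowercased,
    whitespace-separated words, and duplicates are ignored.
    """
    a = set(s.lower().split())
    b = set(t.lower().split())
    if not a and not b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(a & b) / len(a | b)


# Word order is ignored, unlike with edit distance.
print(jaccard("Cohen William", "William Cohen"))  # 1.0
```

Note that a single misspelled character makes two tokens count as completely different here, which is the weakness the hybrid metrics try to address.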
3. hybrid distance (a combination of token-based and string-based metrics)
- a variant of Monge-Elkan distance
- softTFIDF
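A hybrid scheme in the Monge-Elkan style can be sketched as follows: split both strings into tokens, match each token of the first string to its most similar token in the second under a character-level similarity, and average those best scores. In this sketch the secondary similarity is `difflib.SequenceMatcher.ratio()` from the Python standard library, which is only a stand-in chosen for convenience; the names `token_sim` and `monge_elkan_tokens` are hypothetical, and this is not the exact variant evaluated in the paper.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in secondary metric)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan_tokens(s: str, t: str, sim=token_sim) -> float:
    """Average, over tokens of s, of the best match among tokens of t.

    Assumes lowercased, whitespace-separated tokens. The score is
    asymmetric: swapping s and t can change the result.
    """
    s_tokens = s.lower().split()
    t_tokens = t.lower().split()
    if not s_tokens or not t_tokens:
        return 0.0
    best = [max(sim(a, b) for b in t_tokens) for a in s_tokens]
    return sum(best) / len(best)


# Reordered tokens still match perfectly; a misspelled token is
# penalized only partially instead of counting as a total mismatch.
print(monge_elkan_tokens("paul johnson", "johnson paul"))  # 1.0
print(monge_elkan_tokens("paul johnson", "johson paule"))
```

The design point is that token order is ignored, as in the token-based metrics, while small spelling differences inside a token are tolerated, as in the edit-distance metrics.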