6/13/2006

NOTE: Blocking methods for record linkage

1. Record linkage:
http://en.wikipedia.org/wiki/Record_linkage
Record linkage, also known as deduplication, refers to the task of finding entries that refer to the same entity in two or more files. It is an appropriate technique when you have to join data sets that do not have a unique database key in common. A data set that has been through record linkage is said to be linked.

2. Blocking methods:
Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy.
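
A rough count shows why this reduction matters. This is only a sketch: the file sizes, the number of blocks, and the function names are made up for illustration, and it assumes every block has the same size, which real data rarely does.

```python
def pairs_without_blocking(size_a, size_b):
    """Every record in file A is compared with every record in file B."""
    return size_a * size_b


def pairs_with_blocking(block_sizes):
    """Only records that fall into the same block are compared.

    block_sizes: list of (records_from_a, records_from_b), one tuple per block.
    """
    return sum(a * b for a, b in block_sizes)


# Assumed sizes: two files of 10,000 records each.
full = pairs_without_blocking(10_000, 10_000)

# Assumed blocking result: 100 blocks, each holding 100 records from A and 100 from B.
blocked = pairs_with_blocking([(100, 100)] * 100)

# Fraction of comparisons avoided (often called the reduction ratio).
reduction_ratio = 1 - blocked / full
print(full, blocked, reduction_ratio)
```

With these assumed numbers the comparison count drops from 100 million pairs to 1 million, i.e. about 99% of the comparisons are avoided. Whether accuracy is maintained depends on how many true matches end up in different blocks, which this count does not measure.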

Blocking methods partition the data sets into blocks or clusters of records which share a blocking attribute or are otherwise similar with respect to a defined criterion.
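
A minimal sketch of this partitioning, assuming records are plain dicts and the blocking attribute is the first letter of the surname plus the postcode. The field names, the key choice, and the helper names are hypothetical, not taken from ref. 2.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    # Assumed blocking attribute: surname initial plus postcode.
    return (record["surname"][:1].upper(), record["postcode"])


def build_blocks(records):
    """Group records by their blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def candidate_pairs(file_a, file_b):
    """Yield only the pairs whose records share a blocking key."""
    blocks_a = build_blocks(file_a)
    blocks_b = build_blocks(file_b)
    for key in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[key], blocks_b[key])


file_a = [
    {"id": "a1", "surname": "Smith", "postcode": "2600"},
    {"id": "a2", "surname": "Jones", "postcode": "2601"},
]
file_b = [
    {"id": "b1", "surname": "Smyth", "postcode": "2600"},
    {"id": "b2", "surname": "Brown", "postcode": "2601"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(file_a, file_b)]
print(pairs)  # [('a1', 'b1')]
```

Of the 4 possible pairs only 1 is left for detailed comparison. The price is that a typo in the blocking attribute itself (e.g. a wrong postcode) puts a true match into a different block, so it is never compared.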

e.g. (from ref. 2):
standard (traditional) blocking
sorted neighbourhood blocking
bigram indexing
canopy clustering with TFIDF
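
Of the methods listed above, sorted neighbourhood is the easiest to sketch: sort the records on a key, slide a fixed-size window over the sorted list, and compare only records that fall inside the same window. The version below is a simplified illustration on a single list of records (deduplication style). The field names, the sorting key, and the window size are assumptions, not the setup used in ref. 2.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Yield id pairs of records that lie within `window` positions of each
    other after sorting on `key`."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        # Pair rec with the next (window - 1) records in sorted order.
        for other in ordered[i + 1 : i + window]:
            yield rec["id"], other["id"]


records = [
    {"id": 1, "surname": "smith"},
    {"id": 2, "surname": "smyth"},
    {"id": 3, "surname": "jones"},
    {"id": 4, "surname": "smithe"},
    {"id": 5, "surname": "brown"},
]

pairs = list(
    sorted_neighbourhood_pairs(records, key=lambda r: r["surname"], window=2)
)
print(pairs)  # [(5, 3), (3, 1), (1, 4), (4, 2)]
```

With a window of 2 this gives 4 candidate pairs instead of the 10 possible ones. A larger window catches more true matches but adds more comparisons. Records whose sorting key differs at the first character (e.g. "smith" vs "xmith") end up far apart in the sorted list and are missed.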



----------------------------
ref.
----------------------------
1) Fellegi, I. P. and Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association. 64: 1183-1210.

2) Baxter, R., Christen, P. and Churches, T. (2003). A comparison of fast blocking methods for record linkage. ACM Workshop on Data Cleaning, Record Linkage, and Object Identification.
