EDT Similarity Join

• Tokenize:
  • Each record is a set of tokens from a finite universe.
  • Suppose each record is a single text document:
    • x = "yes as soon as possible"
    • y = "as soon as possible please"
  • x = {A, B, C, D, E}
  • y = {B, C, D, E, F}

word    yes   as   soon   as1   possible   please
token   A     B    C      D     E          F

(as1 is the second occurrence of "as" within a document; it is treated as a token distinct from the first.)

References

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2020.
Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. PassJoin: A Partition based Method for Similarity Joins. VLDB 2020.
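The tokenization step on the slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are split on whitespace, a repeated word gets a numeric suffix (as, as1, as2, ...) so that a record stays a set, and letters are assigned in order of first appearance. The function names `tokenize` and `label_universe` are hypothetical and do not come from the cited papers.

```python
from collections import Counter
from string import ascii_uppercase


def tokenize(text):
    """Split on whitespace; the n-th repeat of a word becomes word + n.

    Assumption: a literal word such as "as1" in the input would collide
    with the suffix scheme; this sketch does not handle that case.
    """
    seen = Counter()
    tokens = []
    for word in text.lower().split():
        n = seen[word]
        tokens.append(word if n == 0 else f"{word}{n}")
        seen[word] += 1
    return tokens


def label_universe(token_lists):
    """Map each distinct token to a label, in order of first appearance."""
    labels = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in labels:
                i = len(labels)
                # Letters for the first 26 tokens, generic labels afterwards.
                labels[tok] = ascii_uppercase[i] if i < 26 else f"T{i}"
    return labels


docs = {"x": "yes as soon as possible", "y": "as soon as possible please"}
token_lists = {name: tokenize(text) for name, text in docs.items()}
labels = label_universe(token_lists.values())

# Each record is the set of labels of its tokens.
records = {name: {labels[t] for t in toks} for name, toks in token_lists.items()}

for name in docs:
    print(name, "=", sorted(records[name]))
```

With the two example documents, `records["x"]` holds the labels A to E and `records["y"]` holds B to F, matching the word/token table on the slide.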