Comparing document similarity with docsim, doc2vec, and LSH

http://blog.csdn.net/vs412237401/article/details/52238248

When doing text processing, we often need to decide whether two documents are similar, or to find the documents most similar to a given input document.

If you repost this article, please credit the source.

Fortunately, gensim provides tools for exactly this. The overall approach: for Chinese text, first run word segmentation, build a dictionary from the segmented results, convert the original documents into vectors using that dictionary, and then build a similarity index over those vectors. The official documentation describes it as follows:

The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like “Tell me how similar is this query document to each document in the index?”. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.

Method 1: docsim (recommended; the results are stable)

Sample code (the training documents are numbered so the results are easy to match up):

# Training documents; numbered for readability of the results.
# Imports added for completeness; util_words_cut is the author's segmentation helper (not shown in the post).
from gensim import corpora
from gensim.similarities import Similarity

import util_words_cut

raw_documents = [
    '0无偿居间介绍买卖毒品的行为应如何定性',
    '1吸毒男动态持有大量毒品的行为该如何认定',
    '2如何区分是非法种植毒品原植物罪还是非法制造毒品罪',
    '3为毒贩贩卖毒品提供帮助构成贩卖毒品罪',
    '4将自己吸食的毒品原价转让给朋友吸食的行为该如何认定',
    '5为获报酬帮人购买毒品的行为该如何认定',
    '6毒贩出狱后再次够买毒品途中被抓的行为认定',
    '7虚夸毒品功效劝人吸食毒品的行为该如何认定',
    '8妻子下落不明丈夫又与他人登记结婚是否为无效婚姻',
    '9一方未签字办理的结婚登记是否有效',
    '10夫妻双方1990年按农村习俗举办婚礼没有结婚证 一方可否起诉离婚',
    '11结婚前对方父母出资购买的住房写我们二人的名字有效吗',
    '12身份证被别人冒用无法登记结婚怎么办?',
    '13同居后又与他人登记结婚是否构成重婚罪',
    '14未办登记只举办结婚仪式可起诉离婚吗',
    '15同居多年未办理结婚登记,是否可以向法院起诉要求离婚'
]

corpora_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_list(item_text)
    corpora_documents.append(item_str)

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpora_documents)
corpus = [dictionary.doc2bow(text) for text in corpora_documents]
# num_features must be at least the dictionary size; len(dictionary) would be the tighter choice
similarity = Similarity('-Similarity-index', corpus, num_features=400)

test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
test_corpus_1 = dictionary.doc2bow(test_cut_raw_1)
similarity.num_best = 5
print(similarity[test_corpus_1])  # the most similar documents, as (index_of_document, similarity) tuples
print('################################')

test_data_2 = '家人因涉嫌运输毒品被抓,她只是去朋友家探望朋友的,结果就被抓了,还在朋友家收出毒品,可家人的身上和行李中都没有。现在已经拘留10多天了,请问会被判刑吗'
test_cut_raw_2 = util_words_cut.get_class_words_list(test_data_2)
test_corpus_2 = dictionary.doc2bow(test_cut_raw_2)
similarity.num_best = 5
print(similarity[test_corpus_2])  # the most similar documents, as (index_of_document, similarity) tuples

The output:

/usr/bin/python3.4 /data/work/python-workspace/test_doc_similarity.py
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
Loading model cost 0.521 seconds.
Loading model cost 0.521 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(61 unique tokens: ['丈夫', '法院', '结婚', '住房', '出资']...) from 16 documents (total 89 corpus positions)
starting similarity index under -Similarity-index
[(14, 0.[**************]75), (15, 0.[**************]75), (10, 0.[**************]21)]
################################
creating sparse index
creating sparse matrix from corpus
PROGRESS: at document #0/16
created ' with 86 stored elements in Compressed Sparse Row format>
creating sparse shard #0
saving index shard to -Similarity-index.0
saving SparseMatrixSimilarity object under -Similarity-index.0, separately None
loading SparseMatrixSimilarity object from -Similarity-index.0
[(6, 0.[**************]25), (2, 0.[**************]41), (4, 0.[**************]18), (1, 0.[**************]52), (5, 0.[**************]52)]

Process finished with exit code 0

For the first test question, documents 14, 15, and 10 in the corpus are the most similar; the second element of each tuple is the similarity score.

For the second test question, documents 6, 2, 4, 1, and 5 are the most similar, again with the corresponding similarity scores.
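A note on the helper used throughout: util_words_cut is never shown in the original post. The jieba messages in the console output confirm that jieba performs the segmentation, so a minimal reconstruction might look like the sketch below; the stopword filtering is my assumption, not the author's code.

import jieba

# Hypothetical reconstruction of util_words_cut; the original module is not shown.
# jieba is confirmed by the log output, but the stopword list is purely illustrative.
STOP_WORDS = {'的', '了', '是', '我', '他', '你', '吗', '该', '如何'}

def get_class_words_list(text):
    """Segment Chinese text with jieba and drop stopwords; returns a token list."""
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

def get_class_words_with_space(text):
    """The same tokens joined by spaces, as scikit-learn's TfidfVectorizer expects."""
    return ' '.join(get_class_words_list(text))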

Method 2: doc2vec

I read gensim's official documentation for doc2vec and found it poorly written. Testing with the same data as above, the code and results are:

# Use doc2vec to judge similarity.
# Imports added for completeness; note this is the pre-1.0 gensim API
# (size/iter; gensim 1.0+ renames them and requires total_examples/epochs in train()).
import multiprocessing

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import util_words_cut

cores = multiprocessing.cpu_count()
print(cores)

corpora_documents = []
for i, item_text in enumerate(raw_documents):
    words_list = util_words_cut.get_class_words_list(item_text)
    document = TaggedDocument(words=words_list, tags=[i])
    corpora_documents.append(document)
print(corpora_documents[:2])

model = Doc2Vec(size=89, min_count=1, iter=10)
model.build_vocab(corpora_documents)
model.train(corpora_documents)
print('#########', model.vector_size)

test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
print(test_cut_raw_1)
inferred_vector = model.infer_vector(test_cut_raw_1)
print(inferred_vector)
sims = model.docvecs.most_similar([inferred_vector], topn=3)
print(sims)

The console output:

Pattern library is not installed, lemmatization won't be available.
'pattern' package not found; tag filters are not available for English
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
4
Loading model cost 0.513 seconds.
Loading model cost 0.513 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
collected 61 word types and 16 unique tags from a corpus of 16 examples and 89 words
min_count=1 retains 61 unique words (drops 0)
min_count leaves 89 word corpus (100% of original 89)
deleting the raw counts dictionary of 61 items
sample=0 downsamples 0 most-common words
downsampling leaves estimated 89 word corpus (100.0% of prior 89)
estimated required memory for 61 words and 89 dimensions: 91828 bytes
constructing a huffman tree from 61 words
built huffman tree with maximum node depth 7
resetting layer weights
training model with 1 workers on 61 vocabulary and 89 features, using sg=0 hs=1 sample=0 negative=0
expecting 16 sentences, matching count from corpus used for vocabulary survey
[TaggedDocument(words=['无偿', '居间', '介绍', '买卖', '毒品', '定性'], tags=[0]), TaggedDocument(words=['吸毒', '动态', '持有', '毒品', '认定'], tags=[1])]
worker thread finished; awaiting finish of 0 more threads
training on 890 raw words (1050 effective words) took 0.0s, 506992 effective words/s
under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
######### 89
['离婚', '孩子', '自动', '生效', '离婚']
[  2.54629389e-03   1.87756249e-03  -9.76708368e-04  -5.15014399e-03
  -7.54948880e-04  -3.74549557e-03   5.37392031e-03   3.35739669e-03
  -3.50345811e-03   2.63415743e-03  -1.32059853e-03  -4.15759953e-03
  -2.39425618e-03  -6.20105816e-03  -1.42006821e-03  -4.64246795e-03
   3.78829846e-03   1.47493952e-03   4.49652784e-03  -5.57655795e-03
  -1.40081509e-04  -7.10823014e-03  -5.34327468e-04  -4.21888893e-03
  -2.96280603e-03   6.52066898e-04   5.98943839e-03  -4.01164964e-03
   2.49637989e-03  -9.08742077e-04   4.65002051e-03   9.24886088e-04
   1.67128560e-03  -1.93383044e-03  -4.58135502e-03   1.78024184e-03
  -9.60796722e-04   7.26479106e-04   4.50814469e-03   2.58095766e-04
  -4.53767460e-03  -1.72883295e-03  -3.89566552e-03   4.85864235e-03
   5.90517826e-04   4.30173194e-03   3.37816169e-03  -1.08716707e-03
   1.85196218e-03   1.94042712e-03   1.20989932e-03  -4.69703926e-03
  -5.35873650e-03  -1.35291950e-03  -4.62053996e-03   2.15436472e-03
   4.05823253e-03   8.01778078e-05  -3.84314684e-03   1.11574796e-03
  -4.36050585e-03  -3.31182266e-03  -2.15692003e-03  -2.09038518e-03
   4.50274721e-03  -1.85286190e-04  -5.09306230e-03  -1.12043330e-04
   8.25022871e-04   2.60405545e-03  -1.73542544e-03   5.14509249e-03
  -9.16058663e-04   1.01291772e-03  -7.90049613e-04   4.20650374e-03
  -3.00139328e-03   3.34924040e-03  -2.11520446e-03   4.79168072e-03
   2.11459701e-03  -3.07943812e-03  -5.09956060e-03  -2.34926818e-03
   7.30032055e-03  -5.31428820e-03  -2.96888268e-03   4.95154131e-03
   3.09590902e-03]
[(15, 0.[**************]4), (14, 0.[**************]95), (10, 0.[**************]11)]
precomputing L2-norms of doc weight vectors

The doc2vec results are not very stable. Perhaps I am not using it correctly, but I did not find much useful guidance in the official documentation either.
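The instability is not surprising: with 16 tiny documents, randomly initialized weights, only 10 training passes, and a stochastic infer_vector, repeated runs can rank documents differently. Below is a sketch of settings that usually steady Doc2Vec on small corpora, written against the current gensim 4.x API (vector_size/epochs instead of the old size/iter); the parameter values and toy documents are illustrative, not from the original post.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-ins for the post's corpus; replace with the real tagged documents.
docs = [TaggedDocument(words=['毒品', '认定'], tags=[0]),
        TaggedDocument(words=['结婚', '登记', '离婚'], tags=[1])]

# seed plus workers=1 makes training repeatable (full determinism also needs a
# fixed PYTHONHASHSEED); on a tiny corpus, many epochs matter more than vector size.
model = Doc2Vec(vector_size=50, min_count=1, epochs=100, seed=42, workers=1)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# infer_vector is itself stochastic; giving it more epochs makes repeated
# inferences of the same text agree much more closely.
vec = model.infer_vector(['离婚', '生效'], epochs=200)
print(model.dv.most_similar([vec], topn=2))  # model.dv was model.docvecs before 4.0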

The relevant documentation is here: https://radimrehurek.com/gensim/models/doc2vec.html

Method 3: LSH (for the theory behind locality-sensitive hashing, a Baidu search will turn up plenty)

scikit-learn provides an LSH implementation (there are also LSH implementations on GitHub); what scikit-learn offers is an LSH forest:

LSH Forest: Locality Sensitive Hashing forest [1] is an alternative method for vanilla approximate nearest neighbor search methods. LSH forest data structure has been implemented using sorted arrays and binary search and 32 bit fixed-length hashes. Random projection is used as the hash family which approximates cosine distance.

Using the same test data again, the code is:

# Use LSH for the comparison.
# Imports added for completeness; LSHForest lived in sklearn.neighbors in the
# scikit-learn versions of that era.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

import util_words_cut

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())
distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)

The console output, which largely agrees with the docsim results:

[[ 0.42264973  0.42264973  0.48875208]]
[[10 15 14]]

Those are the implementations I have found for comparing text similarity. Note that LSH is generally best suited to comparing short texts.
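One caveat for readers on a current stack: LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21, so the code above only runs on old versions. On a corpus this small there is no need for approximate search at all; exact cosine k-NN returns equivalent neighbors. A minimal sketch (the toy documents are stand-ins for the segmented corpus, not the post's data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Space-separated tokens, as produced by get_class_words_with_space().
train_documents = ['毒品 认定 行为', '结婚 登记 离婚', '离婚 孩子 生效']

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_documents)

# Exact (brute-force) cosine nearest neighbors; sparse input is fine, so no
# toarray() is needed. For large corpora, annoy/faiss/nmslib fill LSHForest's role.
nn = NearestNeighbors(n_neighbors=2, metric='cosine')
nn.fit(x_train)

x_test = vectorizer.transform(['离婚 生效'])
distances, indices = nn.kneighbors(x_test)
print(distances)  # cosine distances, the same scale LSHForest reported
print(indices)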
