Comparing document similarity with docsim, doc2vec, and LSH

http://blog.csdn.net/vs412237401/article/details/52238248

When doing text processing, we often need to decide whether two documents are similar, or to find the documents most similar to a given input document.

If you repost this article, please credit the source.

Fortunately, gensim provides tools for exactly this. The overall approach: for Chinese text, first run word segmentation, build a dictionary from the segmented results, convert the original documents into vectors using that dictionary, and then build a similarity index over those vectors. The official documentation describes it as follows:

The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like “Tell me how similar is this query document to each document in the index?”. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.

Method 1: docsim (recommended; the results are stable)

Sample code (the training documents are numbered so the results are easy to match up):

# Training documents; numbered for readability of the results.
# Imports added for completeness; util_words_cut is the author's segmentation helper (not shown in the post).
from gensim import corpora
from gensim.similarities import Similarity

import util_words_cut

raw_documents = [
    '0无偿居间介绍买卖毒品的行为应如何定性',
    '1吸毒男动态持有大量毒品的行为该如何认定',
    '2如何区分是非法种植毒品原植物罪还是非法制造毒品罪',
    '3为毒贩贩卖毒品提供帮助构成贩卖毒品罪',
    '4将自己吸食的毒品原价转让给朋友吸食的行为该如何认定',
    '5为获报酬帮人购买毒品的行为该如何认定',
    '6毒贩出狱后再次够买毒品途中被抓的行为认定',
    '7虚夸毒品功效劝人吸食毒品的行为该如何认定',
    '8妻子下落不明丈夫又与他人登记结婚是否为无效婚姻',
    '9一方未签字办理的结婚登记是否有效',
    '10夫妻双方1990年按农村习俗举办婚礼没有结婚证 一方可否起诉离婚',
    '11结婚前对方父母出资购买的住房写我们二人的名字有效吗',
    '12身份证被别人冒用无法登记结婚怎么办?',
    '13同居后又与他人登记结婚是否构成重婚罪',
    '14未办登记只举办结婚仪式可起诉离婚吗',
    '15同居多年未办理结婚登记,是否可以向法院起诉要求离婚'
]

corpora_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_list(item_text)
    corpora_documents.append(item_str)

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpora_documents)
corpus = [dictionary.doc2bow(text) for text in corpora_documents]
# num_features must be at least the dictionary size; len(dictionary) would be the tighter choice
similarity = Similarity('-Similarity-index', corpus, num_features=400)

test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
test_corpus_1 = dictionary.doc2bow(test_cut_raw_1)
similarity.num_best = 5
print(similarity[test_corpus_1])  # the most similar documents, as (index_of_document, similarity) tuples
print('################################')

test_data_2 = '家人因涉嫌运输毒品被抓,她只是去朋友家探望朋友的,结果就被抓了,还在朋友家收出毒品,可家人的身上和行李中都没有。现在已经拘留10多天了,请问会被判刑吗'
test_cut_raw_2 = util_words_cut.get_class_words_list(test_data_2)
test_corpus_2 = dictionary.doc2bow(test_cut_raw_2)
similarity.num_best = 5
print(similarity[test_corpus_2])  # the most similar documents, as (index_of_document, similarity) tuples

The output:

/usr/bin/python3.4 /data/work/python-workspace/test_doc_similarity.py
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
Loading model cost 0.521 seconds.
Loading model cost 0.521 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(61 unique tokens: ['丈夫', '法院', '结婚', '住房', '出资']...) from 16 documents (total 89 corpus positions)
starting similarity index under -Similarity-index
[(14, 0.[**************]75), (15, 0.[**************]75), (10, 0.[**************]21)]
################################
creating sparse index
creating sparse matrix from corpus
PROGRESS: at document #0/16
created ' with 86 stored elements in Compressed Sparse Row format>
creating sparse shard #0
saving index shard to -Similarity-index.0
saving SparseMatrixSimilarity object under -Similarity-index.0, separately None
loading SparseMatrixSimilarity object from -Similarity-index.0
[(6, 0.[**************]25), (2, 0.[**************]41), (4, 0.[**************]18), (1, 0.[**************]52), (5, 0.[**************]52)]

Process finished with exit code 0

For the first test question, documents 14, 15, and 10 in the corpus are the most similar; the second element of each tuple is the similarity score.

For the second test question, documents 6, 2, 4, 1, and 5 are the most similar, again with the corresponding similarity scores.
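A note on the helper used throughout: util_words_cut is never shown in the original post. The jieba messages in the console output confirm that jieba performs the segmentation, so a minimal reconstruction might look like the sketch below; the stopword filtering is my assumption, not the author's code.

import jieba

# Hypothetical reconstruction of util_words_cut; the original module is not shown.
# jieba is confirmed by the log output, but the stopword list is purely illustrative.
STOP_WORDS = {'的', '了', '是', '我', '他', '你', '吗', '该', '如何'}

def get_class_words_list(text):
    """Segment Chinese text with jieba and drop stopwords; returns a token list."""
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

def get_class_words_with_space(text):
    """The same tokens joined by spaces, as scikit-learn's TfidfVectorizer expects."""
    return ' '.join(get_class_words_list(text))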

Method 2: doc2vec

I read gensim's official documentation for doc2vec and found it poorly written. Testing with the same data as above, the code and results are:

# Use doc2vec to judge similarity.
# Imports added for completeness; note this is the pre-1.0 gensim API
# (size/iter; gensim 1.0+ renames them and requires total_examples/epochs in train()).
import multiprocessing

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import util_words_cut

cores = multiprocessing.cpu_count()
print(cores)

corpora_documents = []
for i, item_text in enumerate(raw_documents):
    words_list = util_words_cut.get_class_words_list(item_text)
    document = TaggedDocument(words=words_list, tags=[i])
    corpora_documents.append(document)
print(corpora_documents[:2])

model = Doc2Vec(size=89, min_count=1, iter=10)
model.build_vocab(corpora_documents)
model.train(corpora_documents)
print('#########', model.vector_size)

test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
print(test_cut_raw_1)
inferred_vector = model.infer_vector(test_cut_raw_1)
print(inferred_vector)
sims = model.docvecs.most_similar([inferred_vector], topn=3)
print(sims)

The console output:

Pattern library is not installed, lemmatization won't be available.
'pattern' package not found; tag filters are not available for English
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
4
Loading model cost 0.513 seconds.
Loading model cost 0.513 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
collected 61 word types and 16 unique tags from a corpus of 16 examples and 89 words
min_count=1 retains 61 unique words (drops 0)
min_count leaves 89 word corpus (100% of original 89)
deleting the raw counts dictionary of 61 items
sample=0 downsamples 0 most-common words
downsampling leaves estimated 89 word corpus (100.0% of prior 89)
estimated required memory for 61 words and 89 dimensions: 91828 bytes
constructing a huffman tree from 61 words
built huffman tree with maximum node depth 7
resetting layer weights
training model with 1 workers on 61 vocabulary and 89 features, using sg=0 hs=1 sample=0 negative=0
expecting 16 sentences, matching count from corpus used for vocabulary survey
[TaggedDocument(words=['无偿', '居间', '介绍', '买卖', '毒品', '定性'], tags=[0]), TaggedDocument(words=['吸毒', '动态', '持有', '毒品', '认定'], tags=[1])]
worker thread finished; awaiting finish of 0 more threads
training on 890 raw words (1050 effective words) took 0.0s, 506992 effective words/s
under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
######### 89
['离婚', '孩子', '自动', '生效', '离婚']
[  2.54629389e-03   1.87756249e-03  -9.76708368e-04  -5.15014399e-03
  -7.54948880e-04  -3.74549557e-03   5.37392031e-03   3.35739669e-03
  -3.50345811e-03   2.63415743e-03  -1.32059853e-03  -4.15759953e-03
  -2.39425618e-03  -6.20105816e-03  -1.42006821e-03  -4.64246795e-03
   3.78829846e-03   1.47493952e-03   4.49652784e-03  -5.57655795e-03
  -1.40081509e-04  -7.10823014e-03  -5.34327468e-04  -4.21888893e-03
  -2.96280603e-03   6.52066898e-04   5.98943839e-03  -4.01164964e-03
   2.49637989e-03  -9.08742077e-04   4.65002051e-03   9.24886088e-04
   1.67128560e-03  -1.93383044e-03  -4.58135502e-03   1.78024184e-03
  -9.60796722e-04   7.26479106e-04   4.50814469e-03   2.58095766e-04
  -4.53767460e-03  -1.72883295e-03  -3.89566552e-03   4.85864235e-03
   5.90517826e-04   4.30173194e-03   3.37816169e-03  -1.08716707e-03
   1.85196218e-03   1.94042712e-03   1.20989932e-03  -4.69703926e-03
  -5.35873650e-03  -1.35291950e-03  -4.62053996e-03   2.15436472e-03
   4.05823253e-03   8.01778078e-05  -3.84314684e-03   1.11574796e-03
  -4.36050585e-03  -3.31182266e-03  -2.15692003e-03  -2.09038518e-03
   4.50274721e-03  -1.85286190e-04  -5.09306230e-03  -1.12043330e-04
   8.25022871e-04   2.60405545e-03  -1.73542544e-03   5.14509249e-03
  -9.16058663e-04   1.01291772e-03  -7.90049613e-04   4.20650374e-03
  -3.00139328e-03   3.34924040e-03  -2.11520446e-03   4.79168072e-03
   2.11459701e-03  -3.07943812e-03  -5.09956060e-03  -2.34926818e-03
   7.30032055e-03  -5.31428820e-03  -2.96888268e-03   4.95154131e-03
   3.09590902e-03]
[(15, 0.[**************]4), (14, 0.[**************]95), (10, 0.[**************]11)]
precomputing L2-norms of doc weight vectors

The doc2vec results are not very stable. Perhaps I am not using it correctly, but I did not find much useful guidance in the official documentation either.
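The instability is not surprising: with 16 tiny documents, randomly initialized weights, only 10 training passes, and a stochastic infer_vector, repeated runs can rank documents differently. Below is a sketch of settings that usually steady Doc2Vec on small corpora, written against the current gensim 4.x API (vector_size/epochs instead of the old size/iter); the parameter values and toy documents are illustrative, not from the original post.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-ins for the post's corpus; replace with the real tagged documents.
docs = [TaggedDocument(words=['毒品', '认定'], tags=[0]),
        TaggedDocument(words=['结婚', '登记', '离婚'], tags=[1])]

# seed plus workers=1 makes training repeatable (full determinism also needs a
# fixed PYTHONHASHSEED); on a tiny corpus, many epochs matter more than vector size.
model = Doc2Vec(vector_size=50, min_count=1, epochs=100, seed=42, workers=1)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# infer_vector is itself stochastic; giving it more epochs makes repeated
# inferences of the same text agree much more closely.
vec = model.infer_vector(['离婚', '生效'], epochs=200)
print(model.dv.most_similar([vec], topn=2))  # model.dv was model.docvecs before 4.0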

The relevant documentation is here: https://radimrehurek.com/gensim/models/doc2vec.html

Method 3: LSH (for the theory behind locality-sensitive hashing, a Baidu search will turn up plenty)

scikit-learn provides an LSH implementation (there are also LSH implementations on GitHub); what scikit-learn offers is an LSH forest:

LSH Forest: Locality Sensitive Hashing forest [1] is an alternative method for vanilla approximate nearest neighbor search methods. LSH forest data structure has been implemented using sorted arrays and binary search and 32 bit fixed-length hashes. Random projection is used as the hash family which approximates cosine distance.

Using the same test data again, the code is:

# Use LSH for the comparison.
# Imports added for completeness; LSHForest lived in sklearn.neighbors in the
# scikit-learn versions of that era.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

import util_words_cut

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())
distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)

The console output, which largely agrees with the docsim results:

[[ 0.42264973  0.42264973  0.48875208]]
[[10 15 14]]

Those are the implementations I have found for comparing text similarity. Note that LSH is generally best suited to comparing short texts.
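One caveat for readers on a current stack: LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21, so the code above only runs on old versions. On a corpus this small there is no need for approximate search at all; exact cosine k-NN returns equivalent neighbors. A minimal sketch (the toy documents are stand-ins for the segmented corpus, not the post's data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Space-separated tokens, as produced by get_class_words_with_space().
train_documents = ['毒品 认定 行为', '结婚 登记 离婚', '离婚 孩子 生效']

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_documents)

# Exact (brute-force) cosine nearest neighbors; sparse input is fine, so no
# toarray() is needed. For large corpora, annoy/faiss/nmslib fill LSHForest's role.
nn = NearestNeighbors(n_neighbors=2, metric='cosine')
nn.fit(x_train)

x_test = vectorizer.transform(['离婚 生效'])
distances, indices = nn.kneighbors(x_test)
print(distances)  # cosine distances, the same scale LSHForest reported
print(indices)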
