Problem Description
I have a corpus of around 8 million news articles, and I need their TF-IDF representation as a sparse matrix. I have been able to do this with scikit-learn for a relatively small number of samples, but I believe it can't be used for such a huge dataset, because it loads the input matrix into memory first and that's an expensive process.
Does anyone know what the best way would be to extract TF-IDF vectors for large datasets?
Recommended Answer
Gensim has an efficient tf-idf model and does not need to have everything in memory at once.
Your corpus simply needs to be an iterable, so the whole corpus never has to be held in memory at once.
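As a concrete illustration, here is a minimal sketch of such a streaming setup. It assumes the articles live in a plain-text file `articles.txt` with one article per line and uses naive whitespace tokenization; the file name, the `stream_tokens` helper, and the `BowCorpus` class are all placeholders for your own corpus reader, not part of gensim itself:

```python
from gensim import corpora, matutils, models

ARTICLES_PATH = "articles.txt"  # hypothetical: one news article per line


def stream_tokens(path):
    """Yield one tokenized article at a time; nothing is held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()  # naive whitespace tokenization


# First streaming pass over the file: build the word <-> id mapping.
dictionary = corpora.Dictionary(stream_tokens(ARTICLES_PATH))


class BowCorpus:
    """A restartable iterable corpus: each iteration re-reads the file."""

    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        for tokens in stream_tokens(self.path):
            yield self.dictionary.doc2bow(tokens)


corpus = BowCorpus(ARTICLES_PATH, dictionary)

# Second streaming pass: collect document frequencies for the tf-idf weights.
tfidf = models.TfidfModel(corpus)

# Lazy wrapper: tf-idf vectors are computed document by document on demand.
tfidf_corpus = tfidf[corpus]

# If you really need a scipy sparse matrix (columns are documents), gensim can
# materialize one -- but only do this if the sparse matrix itself fits in RAM.
sparse_matrix = matutils.corpus2csc(tfidf_corpus, num_terms=len(dictionary))
```

The key design point is that `BowCorpus` defines `__iter__` instead of holding a list, so gensim can re-stream the file from disk as many times as it needs; its models are written to accept exactly this kind of restartable iterable.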
According to the comments, the make_wiki script runs over Wikipedia in about 50 minutes on a laptop.