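I have indexed about a thousand documents with Lucene, and I want to retrieve the per-document term frequency of every term across all documents. This is how I index them: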
HashMap<Integer, String> documentList = getEachDocumentSeparated();
Analyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(Paths.get(RESULT_ADDRESS));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
// CREATE overwrites any index that already exists at RESULT_ADDRESS
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter w = new IndexWriter(index, config);
// Stored text field that also records frequencies, positions and offsets
FieldType fieldType = new FieldType(TextField.TYPE_STORED);
IndexOptions indexOptions = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
fieldType.setIndexOptions(indexOptions);
for (Map.Entry<Integer, String> pair : documentList.entrySet())
{
    Document doc = new Document();
    Field bodyField = new Field("body", pair.getValue(), fieldType);
    // StringField takes a String value, so the Integer key is converted
    doc.add(new StringField("id", String.valueOf(pair.getKey()), Field.Store.YES));
    doc.add(bodyField);
    w.addDocument(doc);
}
// Commit the added documents and release the write lock
w.close();
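For example, I want to get a vector like the following for each term:
term, 1(5), 2(10), 330(2), 500(1), 1001(3)
which means that the term occurs 5 times in document 1, 10 times in document 2, and so on...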
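To make the desired output concrete, here is a minimal sketch in plain Java that builds the same "term, docId(freq), ..." vectors from a map shaped like documentList. It deliberately makes no Lucene calls: the class name TermFrequencyVectors, the methods build and format, and the naive lowercase-and-split tokenization are all assumptions made for illustration, and the tokenization is only a rough stand-in for what StandardAnalyzer does. It shows the result I am after, not how to read it back from the index.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermFrequencyVectors {

    // Builds term -> (docId -> frequency). TreeMaps keep terms and doc ids sorted.
    public static Map<String, Map<Integer, Integer>> build(Map<Integer, String> documents) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : documents.entrySet()) {
            // Hypothetical tokenization: lowercase, then split on runs of
            // characters that are neither letters nor digits.
            String[] tokens = doc.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty first token when the text starts with a delimiter
                }
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return postings;
    }

    // Renders one vector in the form "term, 1(5), 2(10), ...".
    public static String format(String term, Map<Integer, Integer> freqs) {
        StringBuilder sb = new StringBuilder(term);
        for (Map.Entry<Integer, Integer> e : freqs.entrySet()) {
            sb.append(", ").append(e.getKey()).append('(').append(e.getValue()).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "term term other");
        docs.put(2, "Term, again: term term");
        // Prints:
        // again, 2(1)
        // other, 1(1)
        // term, 1(2), 2(3)
        build(docs).forEach((term, freqs) -> System.out.println(format(term, freqs)));
    }
}
```

What I am looking for is a way to get the same per-term, per-document counts out of the Lucene index built above, rather than recomputing them from the raw text like this.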