Question
I'm building a Lucene index and adding documents to it.
I have a multi-valued field; for this example I'll use categories.
An item can have many categories. For example, jeans can fall under Clothing, Pants, Men's, Women's, etc.
When adding the field to a document, do commas make a difference? Will Lucene simply ignore them? If I change the commas to spaces, will there be a difference? Does this automatically make the field multi-valued?
String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
categoriesForItem = categoriesForItem.replaceAll(",", " ").trim(); // not sure if to remove comma
doc.add(new StringField("categories", categoriesForItem , Field.Store.YES)); // doc is a Document
Am I doing this correctly, or is there another way to create multi-valued fields?
Any help or advice is appreciated.
Recommended answer
A better way to index a multi-valued field is to add one field instance per value:
String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
// Split on commas, dropping the whitespace around each value
String[] categories = categoriesForItem.trim().split("\\s*,\\s*");
for (String cat : categories) {
    // Each call adds another value to the same "categories" field
    doc.add(new StringField("categories", cat, Field.Store.YES)); // doc is a Document
}
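A StringField indexes each value verbatim as a single term, so a stray space left over from "category1, category2" would produce the term " category2" rather than "category2". The following is a minimal sketch of the splitting step on its own, using only the JDK. The class name CategoryValues and its split method are hypothetical helpers for illustration, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns a comma-separated string into clean values.
class CategoryValues {
    // Splits on commas, trims each piece, and skips empty pieces.
    static List<String> split(String raw) {
        List<String> values = new ArrayList<>();
        if (raw == null) {
            return values;
        }
        for (String part : raw.split(",")) {
            String value = part.trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Brackets make any leftover whitespace visible.
        for (String cat : split("category1, category2, cat3")) {
            System.out.println("[" + cat + "]");
        }
    }
}
```

Each returned value could then be passed to its own doc.add(new StringField("categories", value, Field.Store.YES)) call, as in the loop above.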
Whenever multiple fields with the same name appear in one document, both the inverted index and the term vectors logically append the tokens of each field to one another, in the order the fields were added.
During the analysis phase, two different values of the same field can also be separated by a position increment gap, which the analyzer supplies through getPositionIncrementGap(). This applies to analyzed fields such as TextField, and the base Analyzer class returns a gap of 0, so override the method if you need a larger one. Let me explain why the gap is needed.
Suppose the "categories" field in document D1 has two values, "foo bar" and "foo baz". A phrase query for "bar foo" should not match D1. This is ensured by adding an extra position increment between two values of the same field.
If you concatenate the field values yourself and rely on the analyzer to split them into tokens, the phrase query "bar foo" would match D1, which would be incorrect.
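To make the effect of the gap concrete, here is a toy model in plain Java. It is not Lucene code: the class PositionGapDemo and its method are hypothetical, tokens are split on whitespace only, and a two-word phrase is assumed to match when the second word sits exactly one position after the first.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of token positions across the values of a multi-valued field.
class PositionGapDemo {
    // Returns true if "first second" occurs as an exact two-word phrase.
    static boolean phraseMatches(String[] values, int gap, String first, String second) {
        List<String> tokens = new ArrayList<>();
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                position += gap; // extra increment between two values
            }
            for (String token : values[v].trim().split("\\s+")) {
                position++;
                tokens.add(token);
                positions.add(position);
            }
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && positions.get(j) == positions.get(i) + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] values = {"foo bar", "foo baz"};
        System.out.println(phraseMatches(values, 0, "bar", "foo"));   // prints true
        System.out.println(phraseMatches(values, 100, "bar", "foo")); // prints false
    }
}
```

With a gap of 0 the tokens sit at positions 0, 1, 2, 3, so "bar foo" spans the boundary between the two values and matches. With a gap of 100 the second value starts at position 102, so only phrases inside a single value match.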