regain search tool: the two configuration files, annotated

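news/2024/7/21 7:30:41 Tags: lucene, Excel, PHP, XML, Apache
I have been working for almost two years now. Today my manager dug out last year's regain search project again and asked me to get it cleaned up and running as fast as possible. When I first touched it I was fresh out of school; I tinkered with it for a month without getting it to work and ended up handing it over to my manager. Now that it is back in my hands and I have some time, I went through the configuration files and annotated them:
There are two main configuration files: CrawlerConfiguration.xml (used when building the index) and SearchConfiguration.xml (used when searching the index).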

Download: http://regain.sourceforge.net/download.php

CrawlerConfiguration.xml
<?xml version="1.0" encoding="GBK"?>

<!DOCTYPE configuration [
<!ENTITY amp "&#38;#38;">
<!ENTITY lt "&#38;#60;">
<!ENTITY minus "-">
]>

<!--
| Configuration for the regain crawler (for creating a search index)
| Note: this is the configuration file for the regain crawler, which builds the search index.
| You can find a detailed description of all configuration tags here:
| http://regain.murfman.de/wiki/en/index.php/CrawlerConfiguration.xml
| You can find more configuration examples in the CrawlerConfiguration_examples.xml.
+-->
<configuration>

<!--
| Enter your HTTP proxy settings here (Look at the preferences of your browser)
| Note: enter your HTTP proxy here; the values can be copied from your browser's preferences.
+-->
<proxy>
<!--
<host>proxy</host>
<port>3128</port>
<user>HansWurst</user>
<password>gkxy23</password>
-->
</proxy>


<!--
| The list of URLs where the spidering will start.
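| Note: these are the URLs the crawler starts spidering from.
| Enter the start page of your web site resp. a file system folder here.
| Note: the entry can be either a web address or a folder in the file system.
| NOTE: The examples are in a comment. Thus, if you add your path in one of
| them, then don't forget to uncomment them.
| Note: the examples are commented out, so an example you edit only takes effect once you remove the comment markers around it.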
+-->
<startlist>
<!-- Directory parsing -->
<!--
<start parse="true" index="false">file://c:/Eigene Dateien</start>
Set the place where the documents are located.
Note: this is the folder that holds the stored files to be indexed.
file://E:/eclipse 3.2/workspace/SIS/WebRoot/FileDepository ${SEARCHDIR}
-->
<start index="false" parse="true">file://${WORKDIR}FileDepository</start>
<!-- HTML parsing -->
<!--
<start parse="true" index="true">http://www.mydomain.de/some/path/</start>
-->
</startlist>


<!--
| The whitelist containing prefixes an URL must have to be processed
| Note: a URL is processed only if it starts with one of these prefixes.
| Enter the domain of your web site here.
+-->
<whitelist>
<prefix>file://</prefix>
</whitelist>


<!--
| The blacklist containing prefixes an URL must NOT have to be processed
| Note: a URL that starts with one of these prefixes is skipped.
| Enter sub directories you don't want to be indexed here.
+-->
<blacklist>
<!--
<prefix>http://www.mydomain.de/some/dynamic/content/</prefix>
<regex>/backup/[^/]*$</regex>
-->
</blacklist>


<!--
| ==================================================================================
| That's all you have to configure! The rest of this file is advanced configuration.
| Note: the settings above are enough for a basic setup; everything below is optional.
| ==================================================================================
+-->

<!--
| The preferences for the search index.
| Note: settings for the search index.
+-->
<searchIndex>
<!--
The directory where the index should be located ${SEARCHDIR}
Note: the index is written to this directory.
-->
<dir>${SEARCHDIR}searchindex</dir>
<!--
| Specifies the analyzer type to use.
| Note: selects which analyzer to use.
| You may specify the class name of the analyzer or you use one of the
| following aliases:
| * english: For the english language
| (alias for org.apache.lucene.analysis.standard.StandardAnalyzer)
| * german: For the german language
| (alias for org.apache.lucene.analysis.de.GermanAnalyzer)
| Note: give either the analyzer's class name or one of the aliases above:
| english: for English text, alias for org.apache.lucene.analysis.standard.StandardAnalyzer
| german: for German text, alias for org.apache.lucene.analysis.de.GermanAnalyzer
+-->
<analyzerType>english</analyzerType>
<!--
<analyzerType>german</analyzerType>
<analyzerType>chinese</analyzerType>
<analyzerType>paoding</analyzerType>

-->

<!--
| Contains all words that should not be indexed.
| Separate the words by a blank.
| Note: words listed here are not indexed; separate them with blanks.
+-->
<stopwordList>
einer eine eines einem einen der die das dass daß du er sie es was wer wie
wir und oder ohne mit am im in aus auf ist sein war wird ihr ihre ihres als
für von mit dich dir mich mir mein sein kein durch wegen wird
</stopwordList>
<!-- italian:
<stopwordList>
di a da in con su per tra fra io tu egli ella essa noi voi essi loro che cui
se e né anche inoltre neanche o ovvero oppure ma però eppure anzi invece
bensì tuttavia quindi dunque perciò pertanto cioè infatti ossia non come
mentre perché quando mio mia miei mie tuo tua tuoi tue suo sua suoi sue
nostro nostre nostri nostre vostro vostre vostri vostre il lo la i gli le un
uno una degli delle alcuno alcuna alcune qualcuno qualcuna nessuno nessuna
molto molte molti molte poco parecchio assai
</stopwordList>
-->

<!--
| Contains all words that should not be changed by an analyser when indexed.
| Separate the words by a blank.
| Note: words listed here are indexed exactly as written; the analyzer does not alter them. Separate them with blanks.
+-->
<exclusionList></exclusionList>

<!--
| The names of the fields of which to prefetch the distinct values.
| Separate the field names by a blank.
| Put in the names of the fields you use a search:input_fieldlist tag for.
| The values shown in the list will then be extracted by the crawler and not
| by the search mask, which prevents a slow first loading of a page for huge
| indexes.
| Note: list the fields for which you use a search:input_fieldlist tag. Their distinct values are then
| collected by the crawler at index time instead of by the search mask, so the first page load stays
| fast even for huge indexes.
+-->
<valuePrefetchFields>mimetype</valuePrefetchFields>

<!--
| Specifies whether the whole content should be stored in the index for the
| purpose of a content preview
| Note: set this to true to keep the full content in the index so that a preview can be shown.
+-->
<storeContentForPreview>true</storeContentForPreview>

</searchIndex>


<!--
| The preparators in the order they should be applied. Preparators that aren't listed
| here will be applied after the listed ones.
| Note: preparators are applied in the order listed here; any preparator not listed runs after the listed ones.
| You can use this list...
| ... to define the priority (= order) of the preparators
| ... to disable preparators
| ... to configure preparators
| Note: this list serves three purposes:
| ... setting the priority (= order) of the preparators
| ... disabling preparators
| ... configuring preparators
+-->
<preparatorList>
<!--
| Enable this preparator if you want to use the text extractor of
| Microsoft Windows. This preparator is able to read tons of file formats.
| Note: enable this preparator to use the text extractor built into Microsoft Windows; it can read a large number of file formats.
| NOTE: Under Windows 2000 you have to make sure that reg.exe is installed
| (It's part of the "Support Tools").
| For details see: http://support.microsoft.com/kb/301423
| Note: on Windows 2000, make sure reg.exe is installed (it is part of the "Support Tools").
| Details: http://support.microsoft.com/kb/301423
+-->
<preparator enabled="false">
<class>.IfilterPreparator</class>
</preparator>

<!--
| Enable this preparator if you want to use MS Excel for indexing your Excel
| documents.
| Note: enable this preparator to index Excel documents through MS Excel itself.
+-->
<preparator enabled="false">
<class>.JacobMsExcelPreparator</class>
</preparator>

<!--
| Enable this preparator if you want to use MS Word for indexing your Word
| documents.
| Note: enable this preparator to index Word documents through MS Word itself.
+-->
<preparator enabled="false">
<class>.JacobMsWordPreparator</class>
</preparator>

<!--
| Enable this preparator if you want to use MS Powerpoint for indexing your
| Powerpoint documents.
| Note: enable this preparator to index Powerpoint documents through MS Powerpoint itself.
+-->
<preparator enabled="false">
<class>.JacobMsPowerPointPreparator</class>
</preparator>

<!--
| This tells regain that it should first try the SimpleRtfPreparator for RTF
| files. Only if this one fails the SwingRtfPreparator is used
| (which is much slower).
| Note: regain tries SimpleRtfPreparator first for RTF files and falls back to SwingRtfPreparator,
| which is much slower, only if the first one fails.
+-->
<preparator>
<class>.SimpleRtfPreparator</class>
</preparator>
<preparator>
<class>.SwingRtfPreparator</class>
</preparator>

<!--
| This preparator may be used if you have an external program that can
| extract text. It's disabled by default.
| Note: use this preparator when an external program can extract the text; it is disabled by default.
+-->
<preparator enabled="false">
<class>.ExternalPreparator</class>
<config>
<section name="command">
<param name="urlPattern">\.ps$</param>
<param name="commandLine">ps2ascii ${filename}</param>
<param name="checkExitCode">false</param>
</section>
</config>
</preparator>

<!--
CatchAll-preparator on basis of EmptyPreparator
Note: a catch-all preparator built on EmptyPreparator; its urlPattern .* matches every URL.
-->
<preparator priority="-10">
<class>.EmptyPreparator</class>
<urlPattern>.*</urlPattern>
</preparator>
</preparatorList>


<!--
| The index may be extended with auxiliary fields. These are fields that have
| been generated from the URL of a document.
| Note: auxiliary fields extend the index with values derived from a document's URL.
| Example: If you have a directory with a sub directory for every project,
| then you may create a field with the project's name.
| Note: with one sub directory per project, the field can carry the project's name.
| The following tag will create a field "project" with the value "otto23"
| from the URL "file://c:/projects/otto23/docs/Spez.doc":
| Note: the regex below captures the path segment after "file://c:/projects/",
| and regexGroup="1" stores that capture ("otto23") in the field "project".
| <auxiliaryField name="project" regexGroup="1">
| <regex>^file://c:/projects/([^/]*)</regex>
| </auxiliaryField>
|
| URLs that don't match will get no "project" field.
| Note: documents whose URL does not match simply have no "project" field.
| Having done this you may search for "Offer project:otto23" and you will get
| only hits from this project directory.
| Note: the query "Offer project:otto23" then returns only hits from that project directory.
+-->
<auxiliaryFieldList>
<!--
Don't change these two fields. But you may add your own.
Note: keep the predefined fields unchanged and append your own after them.
-->
<auxiliaryField name="extension" regexGroup="1" toLowercase="true">
<regex>\.([^\.]*)$</regex>
</auxiliaryField>
<auxiliaryField name="location" regexGroup="1" store="false" tokenize="true">
<regex>^(.*)$</regex>
</auxiliaryField>
<auxiliaryField name="mimetype" regexGroup="1" >
<regex>^()$</regex>
</auxiliaryField>
</auxiliaryFieldList>


<!-- The regular expressions that identify URLs in HTML. -->
<!-- This configuration part is no longer necessary -->
<!--htmlParserPatternList>
<pattern parse="true" index="true" regexGroup="1">="([^"]*(/|htm|html|jsp|php\d?|asp))"</pattern>
<pattern parse="false" index="false" regexGroup="1">="([^"]*\.(js|css|jpg|gif|png))"</pattern>
<pattern parse="false" index="true" regexGroup="1">="([^"]*\.[^\."]{3})"</pattern>
</htmlParserPatternList-->
</configuration>
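The auxiliaryField entries above come down to matching a regular expression against the document URL and keeping one capture group. Here is a minimal Python sketch of that idea, just to make the regexes easier to try out. regain itself is written in Java, the function name `auxiliary_field` is made up for illustration, and I am assuming the regex is searched anywhere in the URL, so treat it as a sketch rather than regain's actual behaviour.

```python
import re

def auxiliary_field(url, regex, regex_group=1, to_lowercase=False):
    """Return the captured group for url, or None when the regex does not match.

    Hypothetical helper mirroring the regexGroup / toLowercase attributes
    of <auxiliaryField>; not part of regain.
    """
    match = re.search(regex, url)
    if match is None:
        return None  # URLs that don't match get no field at all
    value = match.group(regex_group)
    return value.lower() if to_lowercase else value

url = "file://c:/projects/otto23/docs/Spez.doc"

# The "project" example from the comment in the config file
project = auxiliary_field(url, r"^file://c:/projects/([^/]*)")

# The predefined "extension" field: everything after the last dot, lowercased
extension = auxiliary_field(url, r"\.([^\.]*)$", to_lowercase=True)

# A URL outside the projects directory yields no "project" field
no_project = auxiliary_field("http://www.mydomain.de/some/path/",
                             r"^file://c:/projects/([^/]*)")

print(project, extension, no_project)
```

Trying a new regex this way before putting it into the config saves a full crawler run just to find out that the capture group is off by one.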



Below is SearchConfiguration.xml
<?xml version="1.0" encoding="GBK"?>

<!DOCTYPE configuration [
<!ENTITY amp "&#38;#38;">
<!ENTITY lt "&#38;#60;">
]>

<!--
| Configuration for the regain search mask.
| Note: this is the configuration file for the regain search mask.
|
| Normally you only have to specify the directory where the search index is
| located. You do this in the <dir> tag of the <index name="main"> (line 74).
| Note: usually the only thing to set is the directory of the search index, in the <dir> tag
| inside <index name="main">.

| You can find a detailed description of all configuration tags here:
| Note: the wiki page below documents every configuration tag.
| http://regain.murfman.de/wiki/en/index.php/SearchConfiguration.xml

+-->
<configuration>

<!-- The search indexes -->
<indexList>
<!--
| All settings defined in this section are applied to all indexes unless
| they redefine the setting.
| Note: settings in this section apply to every index unless an index redefines them.
+-->
<defaultSettings>
<!--
1 <defaultSettings>: The cascaded default settings
2 <index>: The settings for one index.

-->
<!--
| The regular expression that identifies URLs that should be opened in
| a new window.
| Note: URLs matching this regular expression are opened in a new window.
+-->
<openInNewWindowRegex>.(pdf|rtf|doc|xls|ppt)$</openInNewWindowRegex>

<!--
| Specifies whether the file-to-http-bridge should be used for file-URLs.
| Note: controls whether the file-to-http-bridge is used for file URLs.
| Mozilla browsers have a security mechanism that blocks loading file-URLs
| from pages loaded via http. To be able to load files from the search
| results, regain offers the file-to-http-bridge that provides all files that
| are listed in the index via http.
| Note: Mozilla browsers refuse to open file URLs from a page that was loaded over http, so regain
| can serve every indexed file over http instead, which lets users open files straight from the
| search results.
+-->
<useFileToHttpBridge>true</useFileToHttpBridge>

<!--
| The index fields to search by default.
| Note: these are the index fields that are searched when the query names no field.
| NOTE: The user may search in other fields also using the
| "field:"-operator. Read the lucene query syntax for details:
| http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
| Note: users can still target other fields with the "field:" operator, as described in the
| Lucene query syntax linked above.
+-->
<searchFieldList>content title headlines location filename</searchFieldList>
<!--
| The SearchAccessController to use.
| Note: selects the SearchAccessController.
| This is a part of the access control system that ensures that only those
| documents are shown in the search results that the user is allowed to
| read.
| Note: it keeps documents the user may not read out of the search results.
| If you specify a SearchAccessController, don't forget to specify the
| CrawlerAccessController counterpart in the CrawlerConfiguration.xml!
| Note: when you set a SearchAccessController, also configure the matching CrawlerAccessController
| in CrawlerConfiguration.xml.
+-->
<!--
<searchAccessController>
<class jar="myAccess.jar">mypackage.MySearchAccessController</class>
<config>
<param name="bla">blubb</param>
</config>
</searchAccessController>
-->
<!--
|
| Specifies whether the search terms should be highlighted within the
| search results (summary, title)
| Note: when true, the search terms are highlighted in the summary and title of each result.
+-->
<Highlighting>true</Highlighting>

</defaultSettings>

<!-- The search index 'main' -->
<index name="main" default="true" isparent="true">
<!--
The directory where the index is located
Note: the directory that holds the index.
-->
<dir>${SEARCHDIR}searchindex</dir>
</index>
<!--
| A child index of 'main'
| Note: this is a child index whose parent is 'main'.
+-->
<!--
<index name="main1" default="true" isparent="false" parent="main">
<dir>searchindex_1</dir>
</index>
-->

<!-- The search index 'example' -->
<index name="example">
<!-- The directory where the index is located -->
<dir>c:\Temp\searchindex_example</dir>

<rewriteRules>
<rule prefix="file://c:/example/www-data" replacement="http://www.mydomain.de"/>
</rewriteRules>
</index>
</indexList>

</configuration>
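Two settings in this file are easy to check by hand: the rewriteRules entry of the 'example' index, which swaps a file URL prefix for an http prefix, and openInNewWindowRegex. The Python sketch below shows how I read them. The function names are hypothetical and the semantics (plain prefix replacement, regex searched anywhere in the URL) are my assumptions from the config, not taken from regain's source.

```python
import re

# Values copied from the configuration above
REWRITE_RULES = [("file://c:/example/www-data", "http://www.mydomain.de")]
OPEN_IN_NEW_WINDOW = re.compile(r".(pdf|rtf|doc|xls|ppt)$")

def rewrite_url(url, rules=REWRITE_RULES):
    """Replace the first matching prefix; URLs without a matching rule pass through."""
    for prefix, replacement in rules:
        if url.startswith(prefix):
            return replacement + url[len(prefix):]
    return url

def opens_in_new_window(url):
    """True when the URL matches the openInNewWindowRegex pattern (assumed search semantics)."""
    return OPEN_IN_NEW_WINDOW.search(url) is not None

rewritten = rewrite_url("file://c:/example/www-data/docs/manual.pdf")
print(rewritten)
print(opens_in_new_window(rewritten))
print(opens_in_new_window("http://www.mydomain.de/index.html"))

# The leading "." in the pattern is not escaped, so any character may precede
# the extension: a URL ending in "notapdf" matches as well.
print(opens_in_new_window("http://www.mydomain.de/notapdf"))
```

If only real file extensions should trigger a new window, writing the pattern with an escaped dot (`\.(pdf|rtf|doc|xls|ppt)$`) would probably be the stricter choice, but I have left the shipped value untouched in the file above.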


