<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"
	>
<channel>
	<title>Comments on: Index Microsoft Office Files with Lucene</title>
	<atom:link href="http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/</link>
	<description></description>
	<lastBuildDate>Thu, 07 Jan 2010 10:37:41 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Christoph Hartmann</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-7260</link>
		<dc:creator>Christoph Hartmann</dc:creator>
		<pubDate>Sat, 24 Oct 2009 09:34:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-7260</guid>
		<description>Hey Teddy,

Sorry for my late reply; I was quite busy over the last few weeks. I would use Tika and write a custom indexer. To extract content from zip files, use the java.util.zip package. Take a look at the &lt;a href=&quot;http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/package-tree.html&quot; rel=&quot;nofollow&quot;&gt;package description&lt;/a&gt; and Sun's article about &lt;a href=&quot;http://java.sun.com/developer/technicalArticles/Programming/compression/&quot; rel=&quot;nofollow&quot;&gt;compressing and decompressing data using the Java APIs&lt;/a&gt;.

Hope that helps
Christoph</description>
		<content:encoded><![CDATA[<p>Hey Teddy,</p>
<p>Sorry for my late reply; I was quite busy over the last few weeks. I would use Tika and write a custom indexer. To extract content from zip files, use the java.util.zip package. Take a look at the <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/package-tree.html" rel="nofollow">package description</a> and Sun's article about <a href="http://java.sun.com/developer/technicalArticles/Programming/compression/" rel="nofollow">compressing and decompressing data using the Java APIs</a>.</p>
<p>Hope that helps<br />
Christoph</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Teddy Bear</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-7178</link>
		<dc:creator>Teddy Bear</dc:creator>
		<pubDate>Thu, 15 Oct 2009 03:15:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-7178</guid>
		<description>Hi Christoph,
I have a question and hope you can help me.
I want to extract content from a file inside a zip archive. How can I do that? Thanks.
Regards,
Teddy</description>
		<content:encoded><![CDATA[<p>Hi Christoph,<br />
I have a question and hope you can help me.<br />
I want to extract content from a file inside a zip archive. How can I do that? Thanks.<br />
Regards,<br />
Teddy</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christoph Hartmann</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-6137</link>
		<dc:creator>Christoph Hartmann</dc:creator>
		<pubDate>Wed, 22 Jul 2009 18:17:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-6137</guid>
		<description>Dear Selvam,

As far as I know, Apache POI (which Tika uses) only understands .msg files. Take a look at the &lt;a href=&quot;http://lucene.apache.org/tika/formats.html&quot; rel=&quot;nofollow&quot;&gt;Tika website&lt;/a&gt; and the &lt;a href=&quot;http://poi.apache.org/hsmf/index.html&quot; rel=&quot;nofollow&quot;&gt;POI website&lt;/a&gt;.

If your task is to index all mails, reading the Outlook mail storage from Java will be quite challenging. It would be easier to use IMAP or POP3 instead.

Regards,
Christoph</description>
		<content:encoded><![CDATA[<p>Dear Selvam,</p>
<p>As far as I know, Apache POI (which Tika uses) only understands .msg files. Take a look at the <a href="http://lucene.apache.org/tika/formats.html" rel="nofollow">Tika website</a> and the <a href="http://poi.apache.org/hsmf/index.html" rel="nofollow">POI website</a>.</p>
<p>If your task is to index all mails, reading the Outlook mail storage from Java will be quite challenging. It would be easier to use IMAP or POP3 instead.</p>
<p>Regards,<br />
Christoph</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Selvam</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-6113</link>
		<dc:creator>Selvam</dc:creator>
		<pubDate>Tue, 21 Jul 2009 12:02:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-6113</guid>
		<description>Hi Christoph,

Excellent post. I want to index Outlook files using Tika. I am able to index other file formats with Tika. Do I need to supply .pst files or .msg files? I have no clue about Outlook file indexing. I hope you can help me out.</description>
		<content:encoded><![CDATA[<p>Hi Christoph,</p>
<p>Excellent post. I want to index Outlook files using Tika. I am able to index other file formats with Tika. Do I need to supply .pst files or .msg files? I have no clue about Outlook file indexing. I hope you can help me out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christoph Hartmann</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-3292</link>
		<dc:creator>Christoph Hartmann</dc:creator>
		<pubDate>Thu, 16 Apr 2009 15:12:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-3292</guid>
		<description>Hi Brindha,

Take a look at the Readme.txt, which explains how to build the project. As described there, you need Maven 2. Once you have added the dependencies to your local Maven repository as described, building the project via mvn install should not be a problem. You will find the dependencies in the pom.xml.

To register a new Tika parser, take a look in the directory src/main/resources/tika-config. Just add a new parser entry, as I did for ImageParser:
&lt;code&gt;
&lt;parser name=&quot;parse-image&quot; class=&quot;de.acidum.tika.ImageParser&quot;&gt;
&lt;mime&gt;image/bmp&lt;/mime&gt;
&lt;mime&gt;image/gif&lt;/mime&gt;
&lt;mime&gt;image/jpeg&lt;/mime&gt;
&lt;mime&gt;image/png&lt;/mime&gt;
&lt;mime&gt;image/tiff&lt;/mime&gt;
&lt;mime&gt;image/vnd.wap.wbmp&lt;/mime&gt;
&lt;mime&gt;image/x-icon&lt;/mime&gt;
&lt;mime&gt;image/x-psd&lt;/mime&gt;
&lt;mime&gt;image/x-xcf&lt;/mime&gt;
&lt;/parser&gt; 
&lt;/code&gt;

Hope that helps.</description>
		<content:encoded><![CDATA[<p>Hi Brindha,</p>
<p>Take a look at the Readme.txt, which explains how to build the project. As described there, you need Maven 2. Once you have added the dependencies to your local Maven repository as described, building the project via mvn install should not be a problem. You will find the dependencies in the pom.xml.</p>
<p>To register a new Tika parser, take a look in the directory src/main/resources/tika-config. Just add a new parser entry, as I did for ImageParser:<br />
<code><br />
&lt;parser name="parse-image" class="de.acidum.tika.ImageParser"&gt;<br />
&lt;mime&gt;image/bmp&lt;/mime&gt;<br />
&lt;mime&gt;image/gif&lt;/mime&gt;<br />
&lt;mime&gt;image/jpeg&lt;/mime&gt;<br />
&lt;mime&gt;image/png&lt;/mime&gt;<br />
&lt;mime&gt;image/tiff&lt;/mime&gt;<br />
&lt;mime&gt;image/vnd.wap.wbmp&lt;/mime&gt;<br />
&lt;mime&gt;image/x-icon&lt;/mime&gt;<br />
&lt;mime&gt;image/x-psd&lt;/mime&gt;<br />
&lt;mime&gt;image/x-xcf&lt;/mime&gt;<br />
&lt;/parser&gt;<br />
</code></p>
<p>Hope that helps.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brindha</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-3290</link>
		<dc:creator>Brindha</dc:creator>
		<pubDate>Thu, 16 Apr 2009 15:04:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-3290</guid>
		<description>Hi Christoph,
I have tried to run this application, but I couldn't. Can't we put all 4 Java files in a single folder and run them in the normal Java manner? If so, please guide me on how to do it.

If I run this one, I get an error about missing jar files. Would you please guide me in completing my project? Tomorrow is my deadline.

Please tell me how to register a parser within tika-config.xml.

Also, where should I place all the dependencies?

Thanks,
Brindha</description>
		<content:encoded><![CDATA[<p>Hi Christoph,<br />
I have tried to run this application, but I couldn't. Can't we put all 4 Java files in a single folder and run them in the normal Java manner? If so, please guide me on how to do it.</p>
<p>If I run this one, I get an error about missing jar files. Would you please guide me in completing my project? Tomorrow is my deadline.</p>
<p>Please tell me how to register a parser within tika-config.xml.</p>
<p>Also, where should I place all the dependencies?</p>
<p>Thanks,<br />
Brindha</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christoph Hartmann</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-3008</link>
		<dc:creator>Christoph Hartmann</dc:creator>
		<pubDate>Tue, 07 Apr 2009 14:29:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-3008</guid>
		<description>Hey Simon, 
Sorry for the delay; I was quite busy over the last two weeks and had no time to answer your questions. Regarding your questions:

1. Tika returns the raw content of the Word document, which may contain your keywords. As far as I can tell, the string does contain your keyword... If you need a fix, just write a new Tika parser. You may want to look at org.apache.tika.parser.microsoft.OfficeParser first. This class uses Apache POI to extract data from Word, Excel, etc. I wrote a custom image parser (de.acidum.tika.ImageParser) that is used by Tika. Do not forget to register your parser in tika-config.xml.

2. The de.acidum.indexer.core.Indexer class contains a main method. It should be easy to start if you&#039;ve already downloaded all dependencies.

Hope that helps
Christoph</description>
		<content:encoded><![CDATA[<p>Hey Simon,<br />
Sorry for the delay; I was quite busy over the last two weeks and had no time to answer your questions. Regarding your questions:</p>
<p>1. Tika returns the raw content of the Word document, which may contain your keywords. As far as I can tell, the string does contain your keyword&#8230; If you need a fix, just write a new Tika parser. You may want to look at org.apache.tika.parser.microsoft.OfficeParser first. This class uses Apache POI to extract data from Word, Excel, etc. I wrote a custom image parser (de.acidum.tika.ImageParser) that is used by Tika. Do not forget to register your parser in tika-config.xml.</p>
<p>2. The de.acidum.indexer.core.Indexer class contains a main method. It should be easy to start if you&#8217;ve already downloaded all dependencies.</p>
<p>Hope that helps<br />
Christoph</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Simon</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-2965</link>
		<dc:creator>Simon</dc:creator>
		<pubDate>Mon, 06 Apr 2009 06:21:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-2965</guid>
		<description>Is this website still active?</description>
		<content:encoded><![CDATA[<p>Is this website still active?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Simon</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-2836</link>
		<dc:creator>Simon</dc:creator>
		<pubDate>Mon, 30 Mar 2009 06:14:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-2836</guid>
		<description>Hello again,

I have a new question: I downloaded and installed the Lucene+Tika example. How do I test it?

With regards
Simon</description>
		<content:encoded><![CDATA[<p>Hello again,</p>
<p>I have a new question: I downloaded and installed the Lucene+Tika example. How do I test it?</p>
<p>With regards<br />
Simon</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Simon</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/comment-page-1/#comment-2801</link>
		<dc:creator>Simon</dc:creator>
		<pubDate>Fri, 27 Mar 2009 13:34:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.acidum.de/?p=178#comment-2801</guid>
		<description>Hello Christoph,

I am currently trying to build a similar application, but I have the following problem: if the Office document has a table of contents, the handler.toString() method returns some confusing strings like

HYPERLINK \l &quot;_Toc198978976&quot;   Αξιόπιστη Διαδικασία Έκδοσης          PAGEREF _Toc198978976 \h   7

This messes up the indexing completely: if someone searches for a date like 1989, the document is returned even if that year does not appear in its contents. Do you know if there is a way around this problem?

With regards
Simon</description>
		<content:encoded><![CDATA[<p>Hello Christoph,</p>
<p>I am currently trying to build a similar application, but I have the following problem: if the Office document has a table of contents, the handler.toString() method returns some confusing strings like</p>
<p>HYPERLINK \l "_Toc198978976"   Αξιόπιστη Διαδικασία Έκδοσης          PAGEREF _Toc198978976 \h   7</p>
<p>This messes up the indexing completely: if someone searches for a date like 1989, the document is returned even if that year does not appear in its contents. Do you know if there is a way around this problem?</p>
<p>With regards<br />
Simon</p>
]]></content:encoded>
	</item>
</channel>
</rss>

