<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"
>

<channel>
	<title>acidum.de &#187; Index</title>
	<atom:link href="http://www.acidum.de/tag/index/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.acidum.de</link>
	<description></description>
	<lastBuildDate>Sun, 08 Nov 2009 20:12:26 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Index Microsoft Office Files with Lucene</title>
		<link>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/</link>
		<comments>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 10:34:29 +0000</pubDate>
		<dc:creator>Christoph Hartmann</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Index]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search Engine]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://www.acidum.de/?p=178</guid>
		<description><![CDATA[In my current research project I faced the challenge of indexing a whole bunch of files. To stay platform independent, the Java programming language was the first choice. Then I came across the Lucene project. 
 Lucene is an open-source project that &#8220;provides Java-based indexing and search technology&#8221;. I have to mention that Lucene is [...]]]></description>
			<content:encoded><![CDATA[<p>In my current research project I faced the challenge of indexing a whole bunch of files. To stay platform independent, the Java programming language was the first choice. Then I came across the <a href="http://lucene.apache.org/">Lucene</a> project. </p>
<p> Lucene is an open-source project that &#8220;provides Java-based indexing and search technology&#8221;. I should mention that Lucene is a library rather than an out-of-the-box application. When you think of indexing your files, you may have Microsoft Office files, Adobe PDFs or OpenOffice documents in mind. None of these file formats can be indexed by Lucene in its standard configuration. However, Lucene provides a great API that lets other code do the parsing of the files. </p>
<p>I looked at two projects:</p>
<ul>
<li><a href="http://lucene.apache.org/tika/">Tika</a></li>
<li><a href="http://aperture.sourceforge.net/">Aperture</a></li>
</ul>
<p>While Tika is not available as a binary download, Aperture is. I chose Tika because of its Maven support and clean source code. Aperture comes with a whole bunch of dependencies, which makes it quite hard to figure out what is really required. Although Tika is only available as source code, I had the Lucene-Tika integration done within half an hour.</p>
<p>Just check out the Tika source code via<br />
<code>svn checkout http://svn.apache.org/repos/asf/lucene/tika/trunk tika</code> and use Maven to install the binary into your local Maven repository. </p>
<p>The following snippet is the core binding between Tika and Lucene. It calls our own ContentParser, which returns a Lucene <a href="http://lucene.apache.org/java/2_4_0/api/core/org/apache/lucene/document/Document.html">Document</a>. </p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">logger.<span style="color: #006633;">debug</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Indexing &quot;</span> <span style="color: #339933;">+</span> file<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
	<span style="color: #666666; font-style: italic;">// parse the document</span>
	<span style="color: #000000; font-weight: bold;">synchronized</span> <span style="color: #009900;">&#40;</span>contentParserAccess<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		doc <span style="color: #339933;">=</span> contentParser.<span style="color: #006633;">getDocument</span><span style="color: #009900;">&#40;</span>file<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #666666; font-style: italic;">// put it into Lucene</span>
	<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>doc <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		writer.<span style="color: #006633;">addDocument</span><span style="color: #009900;">&#40;</span>doc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
		logger.<span style="color: #006633;">error</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Cannot handle &quot;</span>
				<span style="color: #339933;">+</span> file.<span style="color: #006633;">getAbsolutePath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;; skipping&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	logger.<span style="color: #006633;">error</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Cannot index &quot;</span> <span style="color: #339933;">+</span> file.<span style="color: #006633;">getAbsolutePath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
			<span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;; skipping (&quot;</span> <span style="color: #339933;">+</span> e.<span style="color: #006633;">getMessage</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;)&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The ContentParser calls the Tika parser for each file and puts the metadata it returns into a Lucene document. The most difficult part is determining the MIME type. Unfortunately, Tika does not determine it by default. Therefore we have to call the appropriate methods ourselves: <code>MimeTypes repo = config.getMimeRepository()</code> and <code>repo.getMimeType(bufIn)</code>. Afterwards we have to reset the stream to the start; otherwise Tika cannot read the data properly. </p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">de.acidum.indexer.tika.parser</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> ...
&nbsp;
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
*
* This class is the bridge between Lucene and Tika. It uses Tika to
* retrieve the file content and metadata and generates a Lucene document.
*
* @author Christoph Hartmann
*
*/</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> TikaDocumentParser <span style="color: #000000; font-weight: bold;">implements</span> ContentParser <span style="color: #009900;">&#123;</span>
&nbsp;
Logger logger <span style="color: #339933;">=</span> Logger.<span style="color: #006633;">getLogger</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
AutoDetectParser autoDetectParser<span style="color: #339933;">;</span>
TikaConfig config<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> TikaDocumentParser<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// load tika config to replace the image parser with our own</span>
		<span style="color: #003399;">InputStream</span> is <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">getClassLoader</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
				.<span style="color: #006633;">getResourceAsStream</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;tika-config.xml&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		config <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> TikaConfig<span style="color: #009900;">&#40;</span>is<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// use tika's auto detect parser</span>
		autoDetectParser <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> AutoDetectParser<span style="color: #009900;">&#40;</span>config<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">Exception</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		logger.<span style="color: #006633;">error</span><span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #003399;">Document</span> getDocument<span style="color: #009900;">&#40;</span><span style="color: #003399;">InputStream</span> input, MimeType mimeType<span style="color: #009900;">&#41;</span>
		<span style="color: #000000; font-weight: bold;">throws</span> ContentParserException <span style="color: #009900;">&#123;</span>
	<span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
	<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
		Metadata metadata <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Metadata<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>mimeType <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			metadata.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>Metadata.<span style="color: #006633;">CONTENT_TYPE</span>, mimeType.<span style="color: #006633;">getName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #003399;">ContentHandler</span> handler <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> BodyContentHandler<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			autoDetectParser.<span style="color: #006633;">parse</span><span style="color: #009900;">&#40;</span>input, handler, metadata<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">Exception</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">throw</span> <span style="color: #000000; font-weight: bold;">new</span> ContentParserException<span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		doc <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Document</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #666666; font-style: italic;">// add the content to lucene index document</span>
		doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;body&quot;</span>, handler.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">NO</span>,
				<span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">ANALYZED</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// add meta data</span>
		<span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> names <span style="color: #339933;">=</span> metadata.<span style="color: #006633;">names</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> name <span style="color: #339933;">:</span> names<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">String</span> value <span style="color: #339933;">=</span> metadata.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>name<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span>name, value, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">YES</span>,
					<span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">ANALYZED</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">finally</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			input.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">throw</span> <span style="color: #000000; font-weight: bold;">new</span> ContentParserException<span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #000000; font-weight: bold;">return</span> doc<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">Document</span> getDocument<span style="color: #009900;">&#40;</span><span style="color: #003399;">File</span> file<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> ContentParserException <span style="color: #009900;">&#123;</span>
&nbsp;
	<span style="color: #003399;">InputStream</span> input<span style="color: #339933;">;</span>
	<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
		input <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">FileInputStream</span><span style="color: #009900;">&#40;</span>file<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>input <span style="color: #339933;">==</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>
					.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Could not open stream from specified resource: &quot;</span>
							<span style="color: #339933;">+</span> file.<span style="color: #006633;">getName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> getDocument<span style="color: #009900;">&#40;</span>input<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// add the file name to the meta data</span>
		<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>doc <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
				doc.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Field</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;filename&quot;</span>, file.<span style="color: #006633;">getCanonicalPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>,
						<span style="color: #003399;">Field</span>.<span style="color: #006633;">Store</span>.<span style="color: #006633;">YES</span>, <span style="color: #003399;">Field</span>.<span style="color: #006633;">Index</span>.<span style="color: #006633;">NO</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
				<span style="color: #666666; font-style: italic;">// TODO Auto-generated catch block</span>
				e.<span style="color: #006633;">printStackTrace</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #000000; font-weight: bold;">return</span> doc<span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">FileNotFoundException</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #000000; font-weight: bold;">throw</span> <span style="color: #000000; font-weight: bold;">new</span> ContentParserException<span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">Document</span> getDocument<span style="color: #009900;">&#40;</span><span style="color: #003399;">InputStream</span> input<span style="color: #009900;">&#41;</span>
		<span style="color: #000000; font-weight: bold;">throws</span> ContentParserException <span style="color: #009900;">&#123;</span>
&nbsp;
	<span style="color: #666666; font-style: italic;">// try to retrieve the mime type... unfortunately the Tika parser doesn't</span>
	<span style="color: #666666; font-style: italic;">// handle this automatically</span>
&nbsp;
	<span style="color: #003399;">BufferedInputStream</span> bufIn <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">BufferedInputStream</span><span style="color: #009900;">&#40;</span>input<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	MimeType mimeType <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>bufIn.<span style="color: #006633;">markSupported</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// TODO this may be dangerous...</span>
		bufIn.<span style="color: #006633;">mark</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">2048</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		MimeTypes repo <span style="color: #339933;">=</span> config.<span style="color: #006633;">getMimeRepository</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			mimeType <span style="color: #339933;">=</span> repo.<span style="color: #006633;">getMimeType</span><span style="color: #009900;">&#40;</span>bufIn<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">throw</span> <span style="color: #000000; font-weight: bold;">new</span> ContentParserException<span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			bufIn.<span style="color: #006633;">reset</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			logger.<span style="color: #006633;">error</span><span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #003399;">Document</span> doc <span style="color: #339933;">=</span> getDocument<span style="color: #009900;">&#40;</span>bufIn, mimeType<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000000; font-weight: bold;">return</span> doc<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Additionally, I wrote a custom Tika parser that extracts the Exif data from JPEG files. To achieve this I used the <a href="http://www.drewnoakes.com/code/exif/releases/">metadata-extractor</a> library. Unfortunately, the library is not available via Maven, so we have to install it into the local repository manually.</p>
<p><code><br />
mvn install:install-file -Dfile=metadata-extractor-2.3.1.jar -DgroupId=com.drew -DartifactId=metadata-extractor -Dversion=2.3.1 -Dpackaging=jar<br />
</code></p>
<p>A custom Tika parser may look like this:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">de.acidum.tika</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.InputStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Iterator</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">javax.imageio.ImageIO</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">javax.imageio.ImageReader</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.commons.io.input.CloseShieldInputStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.log4j.Logger</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.exception.TikaException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.metadata.Metadata</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.parser.Parser</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.tika.sax.XHTMLContentHandler</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.xml.sax.ContentHandler</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.xml.sax.SAXException</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.drew.imaging.jpeg.JpegMetadataReader</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.drew.imaging.jpeg.JpegProcessingException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.drew.metadata.Directory</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.drew.metadata.MetadataException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.drew.metadata.Tag</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
* This class implements a tika parser. To activate the parser we have to change
* the tika-config.xml.
*
* Compared to the default Tika Image handling we read the Jpeg Exif data and
* return these values as metadata for Lucene
*
* @author Christoph Hartmann
*
*/</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> ImageParser <span style="color: #000000; font-weight: bold;">implements</span> <span style="color: #003399;">Parser</span> <span style="color: #009900;">&#123;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">final</span> Logger logger <span style="color: #339933;">=</span> Logger.<span style="color: #006633;">getLogger</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> parse<span style="color: #009900;">&#40;</span><span style="color: #003399;">InputStream</span> stream, <span style="color: #003399;">ContentHandler</span> handler,
		Metadata metadata<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span>, SAXException, TikaException <span style="color: #009900;">&#123;</span>
&nbsp;
	<span style="color: #003399;">String</span> type <span style="color: #339933;">=</span> metadata.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>Metadata.<span style="color: #006633;">CONTENT_TYPE</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>type <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// JPEG file: read its EXIF data</span>
		<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>type.<span style="color: #006633;">equals</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;image/jpeg&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			extractJPEGMetaData<span style="color: #009900;">&#40;</span>stream, metadata<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #666666; font-style: italic;">// any other image type: fall back to Tika's default handling (width and height only)</span>
		<span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span>
&nbsp;
			Iterator<span style="color: #339933;">&lt;</span>ImageReader<span style="color: #339933;">&gt;</span> iterator <span style="color: #339933;">=</span> ImageIO
					.<span style="color: #006633;">getImageReadersByMIMEType</span><span style="color: #009900;">&#40;</span>type<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>iterator.<span style="color: #006633;">hasNext</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
				ImageReader reader <span style="color: #339933;">=</span> iterator.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				reader.<span style="color: #006633;">setInput</span><span style="color: #009900;">&#40;</span>ImageIO
						.<span style="color: #006633;">createImageInputStream</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> CloseShieldInputStream<span style="color: #009900;">&#40;</span>
								stream<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				metadata.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;height&quot;</span>, <span style="color: #003399;">Integer</span>
						.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span>reader.<span style="color: #006633;">getHeight</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				metadata.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;width&quot;</span>, <span style="color: #003399;">Integer</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span>reader.<span style="color: #006633;">getWidth</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				reader.<span style="color: #006633;">dispose</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	XHTMLContentHandler xhtml <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> XHTMLContentHandler<span style="color: #009900;">&#40;</span>handler, metadata<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	xhtml.<span style="color: #006633;">startDocument</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	xhtml.<span style="color: #006633;">endDocument</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
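<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * Reads the EXIF tags of a JPEG file and copies them into the Tika metadata.
 *
 * @param inputStream the JPEG data to read
 * @param tikaMetaData the Tika metadata that receives one entry per EXIF tag
 */</span>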
<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000066; font-weight: bold;">void</span> extractJPEGMetaData<span style="color: #009900;">&#40;</span><span style="color: #003399;">InputStream</span> inputStream,
		Metadata tikaMetaData<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
&nbsp;
	<span style="color: #666666; font-style: italic;">// read the exif meta data</span>
	com.<span style="color: #006633;">drew</span>.<span style="color: #006633;">metadata</span>.<span style="color: #006633;">Metadata</span> jpegMetaData<span style="color: #339933;">;</span>
	<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
		jpegMetaData <span style="color: #339933;">=</span> JpegMetadataReader.<span style="color: #006633;">readMetadata</span><span style="color: #009900;">&#40;</span>inputStream<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// iterate through metadata directories</span>
		Iterator<span style="color: #339933;">&lt;?&gt;</span> directories <span style="color: #339933;">=</span> jpegMetaData.<span style="color: #006633;">getDirectoryIterator</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>directories.<span style="color: #006633;">hasNext</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			Directory directory <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span>Directory<span style="color: #009900;">&#41;</span> directories.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #666666; font-style: italic;">// iterate through the tags and copy each one into the Tika metadata</span>
			Iterator<span style="color: #339933;">&lt;?&gt;</span> tags <span style="color: #339933;">=</span> directory.<span style="color: #006633;">getTagIterator</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>tags.<span style="color: #006633;">hasNext</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
				Tag tag <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span>Tag<span style="color: #009900;">&#41;</span> tags.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
				<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
					tikaMetaData.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>tag.<span style="color: #006633;">getDirectoryName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;.&quot;</span>
							<span style="color: #339933;">+</span> tag.<span style="color: #006633;">getTagName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, tag.<span style="color: #006633;">getDescription</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
					logger.<span style="color: #006633;">debug</span><span style="color: #009900;">&#40;</span>tag.<span style="color: #006633;">getDirectoryName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;.&quot;</span>
							<span style="color: #339933;">+</span> tag.<span style="color: #006633;">getTagName</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; -&gt; &quot;</span>
							<span style="color: #339933;">+</span> tag.<span style="color: #006633;">getDescription</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span>MetadataException e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
					logger.<span style="color: #006633;">error</span><span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<span style="color: #009900;">&#125;</span>
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span>JpegProcessingException e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		logger.<span style="color: #006633;">error</span><span style="color: #009900;">&#40;</span>e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
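
<p>The non-JPEG branch of <code>parse</code> relies only on the JDK's <code>javax.imageio</code> API, so it can be tried without Tika on the classpath. The following is a minimal sketch under that assumption. The class name <code>ImageDimensions</code> and the helpers <code>samplePng</code> and <code>readDimensions</code> are made up for this illustration and are not part of the download.</p>

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper class, used only to illustrate the fallback branch.
public class ImageDimensions {

    /** Builds a blank PNG of the given size in memory (test data only). */
    public static byte[] samplePng(int width, int height) throws IOException {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "png", out);
        return out.toByteArray();
    }

    /**
     * Returns {width, height} of the first image in the stream, or null
     * if the JDK has no ImageReader registered for the MIME type.
     */
    public static int[] readDimensions(InputStream stream, String mimeType) throws IOException {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByMIMEType(mimeType);
        if (!readers.hasNext()) {
            return null;
        }
        ImageReader reader = readers.next();
        try (ImageInputStream input = ImageIO.createImageInputStream(stream)) {
            reader.setInput(input);
            // Index 0 addresses the first image in the file.
            return new int[] { reader.getWidth(0), reader.getHeight(0) };
        } finally {
            // Release the native resources held by the reader.
            reader.dispose();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] png = samplePng(40, 30);
        int[] size = readDimensions(new ByteArrayInputStream(png), "image/png");
        System.out.println("width=" + size[0] + ", height=" + size[1]);
        // prints: width=40, height=30
    }
}
```

<p>Returning <code>null</code> for an unsupported MIME type mirrors the parser above, which simply sets no <code>width</code> or <code>height</code> when <code>ImageIO</code> offers no reader.</p>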

<p>You can download the whole <a href='http://www.acidum.de/wp-content/uploads/2008/12/lucenetika.zip'>Lucene + Tika example</a> and try it yourself.</p>
<p><a href="http://www.acidum.de/wp-content/uploads/2008/12/lucenetika.zip"><img title="download logo" src="http://www.acidum.de/wp-content/themes/acidum/images/www.gif" alt="zip" width="31" height="31" /> </a> </p>
<p><strong>UPDATE:</strong> The project uses <a href="http://maven.apache.org/">Maven 2</a> to build the jar. The official page offers <a href="http://maven.apache.org/download.html">installation instructions</a> and <a href="http://maven.apache.org/run-maven/index.html#Quick_Start">guidelines</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.acidum.de/2009/01/07/index-microsoft-office-files-with-lucene/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>

