Christoph Hartmann on January 7th, 2009

Within my current research project I faced the challenge of indexing a whole bunch of files. To stay platform independent, Java was the first choice of programming language. That is how I came across the Lucene project.

Lucene is an open-source project that “provides Java-based indexing and search technology”. I have to mention that Lucene is a framework library rather than an out-of-the-box application. If you think of indexing your files, you probably have Microsoft Office files, Adobe PDFs or OpenOffice documents in mind. None of these file formats can be indexed by Lucene in its standard configuration. But Lucene provides a great API for plugging in file parsing done by other code.

I looked at two projects:

  * Apache Tika
  * Aperture

While Aperture is available as a binary download, Tika is not. I decided on Tika because of its Maven support and clean source code. Aperture comes along with a whole bunch of dependencies, which makes it quite complex to figure out what is really required. Although Tika is only available as source code, I got the integration of Tika into Lucene done within half an hour.

Just download the Tika source code and use Maven to install the binary into your local Maven repository:

svn checkout http://svn.apache.org/repos/asf/lucene/tika/trunk tika
cd tika
mvn install

The following part does the core binding between Tika and Lucene. It calls our self-written ContentParser, which returns a Lucene Document.

logger.debug("Indexing " + file);
try {
	Document doc = null;
	// parse the document
	synchronized (contentParserAccess) {
		doc = contentParser.getDocument(file);
	}
 
	// put it into Lucene
	if (doc != null) {
		writer.addDocument(doc);
	} else {
		logger.error("Cannot handle "
				+ file.getAbsolutePath() + "; skipping");
	}
} catch (IOException e) {
	logger.error("Cannot index " + file.getAbsolutePath()
			+ "; skipping (" + e.getMessage() + ")");
}
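
The ContentParser interface itself is not part of the listings here; a minimal sketch of the contract the snippet assumes could look like this (the package name is a guess, the method signatures follow the TikaDocumentParser shown below):

package de.acidum.indexer;

import java.io.File;
import java.io.InputStream;

import org.apache.lucene.document.Document;

/**
 * Assumed contract for the indexing loop above: turn a file or stream
 * into a Lucene Document. ContentParserException is assumed to be a
 * simple checked exception wrapping the underlying cause.
 */
public interface ContentParser {

	/** Returns a Lucene document for the file, or null if it cannot be handled. */
	Document getDocument(File file) throws ContentParserException;

	/** Returns a Lucene document built from the raw stream. */
	Document getDocument(InputStream input) throws ContentParserException;
}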

The ContentParser calls the Tika parser for each file and puts the metadata it returns into a Lucene document. The most difficult part is determining the MIME type. Unfortunately Tika does not determine it by default. Therefore we have to call MimeTypes repo = config.getMimeRepository() followed by repo.getMimeType(bufIn). Afterwards we have to reset the stream to the start; otherwise Tika cannot read the data properly.

package de.acidum.indexer.tika.parser;
 
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
 
import org.apache.log4j.Logger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
 
/**
*
* This class is the bridge between Lucene and Tika. It uses Tika to
* retrieve the file content and metadata and generates a Lucene document.
*
* @author Christoph Hartmann
*
*/
public class TikaDocumentParser implements ContentParser {
 
Logger logger = Logger.getLogger(this.getClass());
 
AutoDetectParser autoDetectParser;
TikaConfig config;
 
public TikaDocumentParser() {
	try {
		// load tika config to replace the image parser with our own
		InputStream is = this.getClass().getClassLoader()
				.getResourceAsStream("tika-config.xml");
		config = new TikaConfig(is);
 
		// use tika's auto detect parser
		autoDetectParser = new AutoDetectParser(config);
	} catch (Exception e) {
		logger.error(e);
	}
}
 
private Document getDocument(InputStream input, MimeType mimeType)
		throws ContentParserException {
	Document doc = null;
	try {
		Metadata metadata = new Metadata();
 
		if (mimeType != null) {
			metadata.set(Metadata.CONTENT_TYPE, mimeType.getName());
		}
		ContentHandler handler = new BodyContentHandler();
		try {
			autoDetectParser.parse(input, handler, metadata);
		} catch (Exception e) {
			throw new ContentParserException(e);
		}
 
		doc = new Document();
		// add the content to lucene index document
		doc.add(new Field("body", handler.toString(), Field.Store.NO,
				Field.Index.ANALYZED));
 
		// add meta data
		String[] names = metadata.names();
		for (String name : names) {
			String value = metadata.get(name);
			doc.add(new Field(name, value, Field.Store.YES,
					Field.Index.ANALYZED));
		}
 
	} finally {
		try {
			input.close();
		} catch (IOException e) {
			throw new ContentParserException(e);
		}
	}
 
	return doc;
}
 
public Document getDocument(File file) throws ContentParserException {
 
	InputStream input;
	try {
		// FileInputStream never returns null; a missing file throws
		// FileNotFoundException, which is caught below
		input = new FileInputStream(file);
 
		Document doc = getDocument(input);
 
		// add the file name to the meta data
		if (doc != null) {
			try {
				doc.add(new Field("filename", file.getCanonicalPath(),
						Field.Store.YES, Field.Index.NO));
			} catch (IOException e) {
				// getCanonicalPath() failed: wrap and rethrow instead
				// of swallowing the error with a stack trace
				throw new ContentParserException(e);
			}
		}
 
		return doc;
 
	} catch (FileNotFoundException e) {
		throw new ContentParserException(e);
	}
 
}
 
public Document getDocument(InputStream input)
		throws ContentParserException {
 
	// try to retrieve the MIME type... unfortunately the Tika parser doesn't
	// handle this automatically
 
	BufferedInputStream bufIn = new BufferedInputStream(input);
 
	MimeType mimeType = null;
 
	if (bufIn.markSupported()) {
		// mark 2048 bytes for MIME magic detection; if detection reads
		// past this limit, the reset() below will fail
		bufIn.mark(2048);
		MimeTypes repo = config.getMimeRepository();
		try {
			mimeType = repo.getMimeType(bufIn);
		} catch (IOException e) {
			throw new ContentParserException(e);
		}
		try {
			bufIn.reset();
		} catch (IOException e) {
			logger.error(e);
		}
	}
 
	Document doc = getDocument(bufIn, mimeType);
 
	return doc;
}
 
}
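
For completeness, here is a minimal sketch of how the class can be wired into a Lucene IndexWriter. It assumes the Lucene 2.4 API that is current at the time of writing; the class name IndexerSketch, the index path and the documents directory are placeholders (the project's real entry point is de.acidum.indexer.core.Indexer):

package de.acidum.indexer;

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

import de.acidum.indexer.tika.parser.TikaDocumentParser;

public class IndexerSketch {

	public static void main(String[] args) throws Exception {
		ContentParser contentParser = new TikaDocumentParser();

		// create a fresh index in ./index (Lucene 2.4 API)
		IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("index"),
				new StandardAnalyzer(), true,
				IndexWriter.MaxFieldLength.UNLIMITED);

		// "documents" is a placeholder for the directory to index
		File[] files = new File("documents").listFiles();
		if (files != null) {
			for (File file : files) {
				Document doc = contentParser.getDocument(file);
				if (doc != null) {
					writer.addDocument(doc);
				}
			}
		}

		writer.optimize();
		writer.close();
	}
}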

Additionally I wrote a custom Tika parser that extracts the Exif data from JPEG files. To achieve this I used the metadata-extractor library. Unfortunately the lib is not available via Maven, therefore we have to add it to the local repository manually:


mvn install:install-file -Dfile=metadata-extractor-2.3.1.jar -DgroupId=com.drew -DartifactId=metadata-extractor -Dversion=2.3.1 -Dpackaging=jar
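
Once installed, the dependency entry in the project's pom.xml would use the same coordinates:

<dependency>
	<groupId>com.drew</groupId>
	<artifactId>metadata-extractor</artifactId>
	<version>2.3.1</version>
</dependency>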

A custom Tika parser may look like this:

package de.acidum.tika;
 
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
 
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
 
import org.apache.commons.io.input.CloseShieldInputStream;
import org.apache.log4j.Logger;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
 
import com.drew.imaging.jpeg.JpegMetadataReader;
import com.drew.imaging.jpeg.JpegProcessingException;
import com.drew.metadata.Directory;
import com.drew.metadata.MetadataException;
import com.drew.metadata.Tag;
 
/**
* This class implements a Tika parser. To activate the parser we have to
* register it in tika-config.xml.
*
* Compared to the default Tika image handling, we read the JPEG Exif data
* and return these values as metadata for Lucene.
*
* @author Christoph Hartmann
*
*/
public class ImageParser implements Parser {
 
Logger logger = Logger.getLogger(this.getClass());
 
public void parse(InputStream stream, ContentHandler handler,
		Metadata metadata) throws IOException, SAXException, TikaException {
 
	String type = metadata.get(Metadata.CONTENT_TYPE);
 
	if (type != null) {
		// hey, we got a JPEG, let's read the Exif data
		if (type.equals("image/jpeg")) {
			extractJPEGMetaData(stream, metadata);
		}
		// if the picture type is unknown to us, do the default Tika handling
		else {
 
			Iterator<ImageReader> iterator = ImageIO
					.getImageReadersByMIMEType(type);
			if (iterator.hasNext()) {
				ImageReader reader = iterator.next();
				reader.setInput(ImageIO
						.createImageInputStream(new CloseShieldInputStream(
								stream)));
				metadata.set("height", Integer
						.toString(reader.getHeight(0)));
				metadata.set("width", Integer.toString(reader.getWidth(0)));
				reader.dispose();
			}
		}
	}
 
	XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
	xhtml.startDocument();
	xhtml.endDocument();
}
 
/**
 * Get additional metadata for JPEG files.
 *
 * @param inputStream
 *            the JPEG stream to read the Exif data from
 * @param tikaMetaData
 *            the Tika metadata object the Exif values are added to
 */
private void extractJPEGMetaData(InputStream inputStream,
		Metadata tikaMetaData) {
 
	// read the exif meta data
	com.drew.metadata.Metadata jpegMetaData;
	try {
		jpegMetaData = JpegMetadataReader.readMetadata(inputStream);
 
		// iterate through metadata directories
		Iterator<?> directories = jpegMetaData.getDirectoryIterator();
		while (directories.hasNext()) {
			Directory directory = (Directory) directories.next();
			// iterate through tags and copy them into the Tika metadata
			Iterator<?> tags = directory.getTagIterator();
			while (tags.hasNext()) {
				Tag tag = (Tag) tags.next();
 
				try {
					tikaMetaData.set(tag.getDirectoryName() + "."
							+ tag.getTagName(), tag.getDescription());
					logger.debug(tag.getDirectoryName() + "."
							+ tag.getTagName() + " -> "
							+ tag.getDescription());
				} catch (MetadataException e) {
					logger.error(e);
				}
			}
		}
 
	} catch (JpegProcessingException e) {
		logger.error(e);
	}
}
}

You can download the whole Lucene + Tika example and try it yourself.

zip

UPDATE: The project uses Maven 2 to build the jar. The official page offers installation instructions and guidelines.
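
If you want to verify the resulting index, here is a minimal search sketch against the body field defined above (again assuming the Lucene 2.4 API; the class name SearchSketch, the index path and the query string are placeholders):

package de.acidum.indexer;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {

	public static void main(String[] args) throws Exception {
		IndexSearcher searcher = new IndexSearcher(
				FSDirectory.getDirectory("index"));

		// query the "body" field that TikaDocumentParser fills
		QueryParser parser = new QueryParser("body", new StandardAnalyzer());
		Query query = parser.parse("lucene");

		TopDocs hits = searcher.search(query, 10);
		for (ScoreDoc scoreDoc : hits.scoreDocs) {
			Document doc = searcher.doc(scoreDoc.doc);
			// the "filename" field was stored during indexing
			System.out.println(doc.get("filename"));
		}

		searcher.close();
	}
}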


12 Responses to “Index Microsoft Office Files with Lucene”

  1. I cannot use the library; Maven does not work, so I downloaded the libraries one by one, and I'm in trouble.

  2. Hey Thiago, which library does not work properly? Lucene core is available at http://repo2.maven.org/maven2/. For Tika the source code is required; once you have it, just build the Tika lib. Unfortunately the metadata extractor is not available via Maven, but I described above how to add that lib as well.

  3. Hello Cristoph,

    I am currently trying to build a similar application, but I have the following problem: if there is a table of contents in the Office document, the handler.toString() method returns some confusing strings like

    HYPERLINK \l “_Toc198978976″ Αξιόπιστη Διαδικασία Έκδοσης PAGEREF _Toc198978976 \h 7

    This is messing up the indexing completely, because if someone wants to find a date like 1989, the document will be returned even though that time period does not exist in its contents. Do you know if there is a way around this problem?

    With regards
    Simon

  4. Hello again,

    I have a new question: I downloaded and installed the Lucene + Tika example. How do I test it?

    With regards
    Simon

  5. Is this website still active?

  6. Hey Simon,
    sorry for the delay, but I was quite busy over the last two weeks and had no time to answer. Regarding your questions:

    1. Tika returns the raw Word document, which may contain such strings alongside your keywords. As far as I can tell, the string contains your keyword… If you require a fix, just write a new Tika parser. You may want to take a look at org.apache.tika.parser.microsoft.OfficeParser first. This class uses Apache POI to extract data from Word, Excel etc. I wrote a custom image parser that is used by Tika (de.acidum.tika.ImageParser). Do not forget to register your parser within tika-config.xml.

    2. The de.acidum.indexer.core.Indexer class contains a main method. It should be easy to start if you've already downloaded all dependencies.

    Hope that helps
    Christoph

  7. Hi Christoph,
    I have tried to execute this application, but I couldn't. Can't we put all those 4 Java files in a single folder and run them in the normal Java manner? If so, please guide me on how to do it.

    If I run it, I get an error about missing jar files. Will you please guide me to complete my project? Tomorrow is my deadline.

    Please tell me how to register a parser within tika-config.xml

    and also where to place all the dependencies.

    Thanks,
    Brindha

  8. Hi Brindha,

    take a look into the Readme.txt, which explains how to build the project. As described there, you need Maven 2 to build the project. After you have added the dependencies to your local Maven repo as described, it should not be a problem to build the project via mvn install. You will find the dependencies within the pom.xml.

    To register a new Tika parser, take a look at the directory src/main/resources/tika-config. Just add a new parser entry as I did for the ImageParser:

    <parser name="parse-image" class="de.acidum.tika.ImageParser">
    <mime>image/bmp</mime>
    <mime>image/gif</mime>
    <mime>image/jpeg</mime>
    <mime>image/png</mime>
    <mime>image/tiff</mime>
    <mime>image/vnd.wap.wbmp</mime>
    <mime>image/x-icon</mime>
    <mime>image/x-psd</mime>
    <mime>image/x-xcf</mime>
    </parser>

    Hope that helps.

  9. Hi Christoph,

    Excellent post. I want to index Outlook files using Tika. I am able to index other file formats using Tika. Do I need to parse .pst files or .msg files? I have no clue regarding Outlook file indexing. I hope you can help me out.

  10. Dear Selvam,

    as far as I know, Apache POI (which is used by Tika) only understands .msg files. Take a look at the Tika website and the POI website.

    If your task is to index all mails, it will be quite challenging to use the outlook mail storage from Java. It would be easier to use IMAP or POP3 instead.

    Regards,
    Christoph

  11. Hi Christoph,
    Here are some questions; I hope you can help me.
    I want to extract content from the files inside a zip. How can I do that? Thanks
    Regards,
    Teddy

  12. Hey Teddy,

    sorry for my late reply, but I was quite busy over the last weeks. I would use Tika and write a custom indexer. To extract content from zip files you should use the java.util.zip package. Take a look at the package description and Sun's article about compressing and decompressing data using the Java APIs.
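
    A minimal sketch of what I mean (the class name ZipSketch and the archive path are placeholders; each entry is buffered in memory so the indexer can close its stream without closing the whole zip):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipSketch {

        public static void main(String[] args) throws IOException {
            ZipInputStream zip = new ZipInputStream(
                    new FileInputStream("archive.zip"));
            try {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;
                    }
                    // copy the entry into memory so a parser can close
                    // its stream without closing the whole zip
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int read;
                    while ((read = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, read);
                    }
                    InputStream content = new ByteArrayInputStream(
                            buffer.toByteArray());
                    // hand "content" to e.g. TikaDocumentParser.getDocument(...)
                    System.out.println(entry.getName() + ": "
                            + buffer.size() + " bytes");
                }
            } finally {
                zip.close();
            }
        }
    }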

    Hope that helps
    Christoph

Leave a Reply