Ingesting Documents (PDF, Word, TXT, etc.) Into ElasticSearch

ElasticSearch is a great tool for full-text search over billions of records. But what if you want to search through files with the help of ElasticSearch? How should you extract and index their content? After googling for "ElasticSearch searching PDFs" and "ElasticSearch index binary files" I didn't find any suitable overview, so I decided to write this post about the available options.

Ingest Attachment Plugin

Official site

The simplest and easiest-to-use solution is Ingest Attachment. It's a plugin for ElasticSearch that extracts content from almost all document types (thanks, Tika). It's a good choice for a quick start. However, Ingest Attachment can't be fine-tuned, and that's why it can't handle large files. We posted about the pitfalls of Ingest Attachment before; read it here.
The installation process is straightforward; check out the official ElasticSearch site for details.
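
For reference, here is a minimal sketch of using the plugin: it assumes a local ElasticSearch node on port 9200 with the plugin already installed, and the pipeline, index, and field names are invented for the example.

```python
import base64
import requests

ES = "http://localhost:9200"  # assumed local ElasticSearch node

# Create an ingest pipeline that runs the attachment processor on the "data" field
requests.put(f"{ES}/_ingest/pipeline/attachment", json={
    "description": "Extract document content with the attachment processor",
    "processors": [{"attachment": {"field": "data"}}],
})

# The plugin expects the file's binary content as a base64 string
with open("report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Index the document through the pipeline
requests.put(f"{ES}/docs/_doc/1?pipeline=attachment", json={"data": encoded})

# The extracted text becomes searchable under attachment.content
resp = requests.get(f"{ES}/docs/_search", json={
    "query": {"match": {"attachment.content": "quarterly revenue"}}
})
print(resp.json())
```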

Apache Tika

Official site

Apache Tika is the de-facto standard for extracting content from files. Roughly speaking, Tika is a collection of open-source content-extraction libraries joined into a single one. It's open source and it has a REST API. You need some experience to set it up and configure it on your server; for example, I had issues setting up Tesseract to do OCR inside Tika. Also note that Tika doesn't work well with some kinds of PDFs (the ones with images inside), and its REST API is much slower than direct Java calls, even on localhost.
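
As an illustration, a running Tika server can be queried over its REST API roughly like this; the port (9998) and the plain-text Accept header are assumptions about a default installation.

```python
import requests

TIKA = "http://localhost:9998"  # assumed default Tika server port

with open("report.pdf", "rb") as f:
    # PUT the raw file to /tika; the Accept header selects plain-text output
    resp = requests.put(
        f"{TIKA}/tika",
        data=f,
        headers={"Accept": "text/plain"},
    )

print(resp.text)  # extracted document content
```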

So, you've installed Tika; what's next? You need to create some kind of wrapper (a minimal sketch follows the list) that:

  1. Downloads a file
  2. Calls Tika to extract file content
  3. Submits parsed content to ElasticSearch
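
A minimal sketch of such a wrapper could look like the following. It assumes a Tika server on port 9998 and an ElasticSearch node on port 9200, and the index and field names are invented for the example.

```python
import requests

TIKA = "http://localhost:9998"   # assumed Tika server
ES = "http://localhost:9200"     # assumed ElasticSearch node

def ingest(file_url: str, doc_id: str) -> None:
    # 1. Download the file
    binary = requests.get(file_url).content

    # 2. Call Tika to extract the content as plain text
    text = requests.put(
        f"{TIKA}/tika",
        data=binary,
        headers={"Accept": "text/plain"},
    ).text

    # 3. Submit the parsed content to ElasticSearch
    requests.put(
        f"{ES}/docs/_doc/{doc_id}",
        json={"source_url": file_url, "content": text},
    )

ingest("https://example.com/report.pdf", "report-1")
```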

To make ElasticSearch search quickly through large files, you have to tune it yourself. Details are in this and this post.
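
One common piece of such tuning (an assumption here; the linked posts cover the specifics) is storing term vectors with positions and offsets, so that highlighting matches in very large documents doesn't re-analyze the whole text at query time:

```python
import requests

ES = "http://localhost:9200"  # assumed local ElasticSearch node

# Mapping sketch: term vectors with positions and offsets let ElasticSearch
# use the fast vector highlighter instead of re-analyzing huge documents.
requests.put(f"{ES}/docs", json={
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "term_vector": "with_positions_offsets",
            }
        }
    }
})
```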

To sum up, Tika is a great solution, but it requires a lot of code writing and fine-tuning, especially for the edge cases: in Tika's case, weird PDFs and OCR.

FsCrawler

GitHub

FsCrawler is a "quick and dirty" open-source solution for those who want to index documents from their local filesystem or over SSH. It crawls your filesystem, indexes new files, updates existing ones, and removes old ones. FsCrawler is written in Java and requires some additional work to install and configure. It supports scheduled crawling (e.g. every 15 minutes), and it also has a basic API for submitting files and managing the schedule. FsCrawler uses Tika under the hood, so generally speaking you can use it as glue between Tika and ElasticSearch.
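
For example, when FsCrawler's REST service is enabled, a file can be pushed to it over HTTP roughly like this; the default port (8080) and the _upload endpoint are assumptions about the FsCrawler version in use.

```python
import requests

FSCRAWLER = "http://localhost:8080/fscrawler"  # assumed REST service address

# Submit a file; FsCrawler extracts it with Tika and indexes it into ElasticSearch
with open("report.pdf", "rb") as f:
    resp = requests.post(f"{FSCRAWLER}/_upload", files={"file": f})

print(resp.json())
```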

Ambar

Official site, GitHub

After dealing with every solution described above, we decided to create our own enterprise-ready one. Ambar combines the best of the existing solutions and adds some cool new features:

  • It handles large files (>100 MB) well
  • It extracts content from PDFs (even poorly formatted ones with embedded images) and does OCR on images
  • It provides a simple, easy-to-use REST API and web UI
  • It is extremely easy to deploy (thanks, Docker)
  • It is open-sourced under the Fair Source 1 v0.9 license
  • It provides parsing and instant search out of the box

The details on how Ambar works are here.

That's it! I hope you can pick the option that suits you best.

Ilya P
