We are constantly asked by enthusiasts why they can't simply use the Ingest Attachment plugin with ElasticSearch to extract their files' content and perform full-text search over it. What is so valuable in Ambar that Ingest Attachment + ElasticSearch can't provide?
The simplest answer would be:
Just try to process, index, and search through a bunch of 500 MB files with Ingest Attachment + ES and you'll know why :)
But, I'll explain.
Ingest Attachment is a plugin that simply wraps Apache Tika into an ES pipeline module and does content and metadata extraction out of the box. At first glance it might seem like a silver bullet: the simplest and most efficient extract-index-search solution. But it's not as great as it looks. Let me explain why you should not use it if you're after good-quality parsing and are going to index and store large amounts of data.
Storing Sources and Custom Mappings
First of all, Ingest Attachment receives the source data as a BASE64 string and, after parsing, sends it (along with the metadata and parsed text) to ES.
The basic issues are:
- Converting a binary stream to BASE64 adds about 33% of overhead
- Storing source file data in ES is way too expensive and useless, even if you store it separately and exclude it from the index (especially as a BASE64 string)
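The BASE64 overhead is easy to demonstrate with a quick sketch on a synthetic 1 MB payload:

```python
import base64

# Simulate a binary source file of exactly 1 MB.
raw = bytes(range(256)) * 4096            # 1,048,576 bytes
encoded = base64.b64encode(raw)           # what Ingest Attachment receives

overhead = len(encoded) / len(raw) - 1
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes, "
      f"overhead: {overhead:.1%}")        # overhead: 33.3%
```

BASE64 encodes every 3 bytes as 4 characters, so the encoded payload is always a third bigger — and that's before ES stores and replicates it.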
A bit deeper:
- If you're going to do it right and you want ES to perform really well with your files, you have to fine-tune it: use custom mappings and analyzers for your index (see Making ElasticSearch Perform Well with Large Text Fields and Highlighting Large Documents in ElasticSearch). You might also want to adjust the refresh behavior and make ES calls more synchronous (for example, when you need to know exactly when the added text becomes searchable). You can't do any of this fine-tuning with Ingest Attachment, at least not out of the box. You'll have to modify it or add another module to the pipeline to do all that.
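For illustration, here's the kind of index setup you end up writing by hand (a sketch only; the field and analyzer names are made up, and the mapping is in ES 7.x style):

```python
import json

# Illustrative index body you'd PUT yourself before indexing anything.
# Ingest Attachment gives you no hook to apply a mapping like this for you.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "content_analyzer": {            # custom analyzer for parsed text
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "content_analyzer",
                # store term vectors with offsets so highlighting large
                # documents doesn't re-analyze the whole text per query
                "term_vector": "with_positions_offsets",
            },
            # metadata stored as-is, not indexed
            "meta": {"type": "object", "enabled": False},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

And when you need to know the moment a document becomes searchable, you index it with the real ES request parameter `refresh=wait_for` — again, something you wire up yourself, not something the plugin does.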
- You often need to store metadata and parsed content separately, and you can't do that with Ingest Attachment since it has a fixed output data model. You'll have to add another module to your pipeline to split and reshape the data.
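That extra splitting step is trivial but unavoidable. A minimal sketch (the input field names follow Ingest Attachment's output model, which nests everything under a single `attachment` field; `file_id` is a made-up identifier):

```python
def split_attachment(doc):
    """Split one Ingest-Attachment-style document into a metadata
    document and a content document, to be stored separately."""
    attachment = doc.pop("attachment", {})
    content_doc = {
        "file_id": doc.get("file_id"),
        "content": attachment.get("content", ""),
    }
    meta_doc = {
        "file_id": doc.get("file_id"),
        "content_type": attachment.get("content_type"),
        "language": attachment.get("language"),
        "content_length": attachment.get("content_length"),
    }
    return meta_doc, content_doc

meta, content = split_attachment({
    "file_id": "report-42",
    "attachment": {
        "content": "quarterly report text ...",
        "content_type": "application/pdf",
        "language": "en",
        "content_length": 25,
    },
})
print(meta["content_type"], len(content["content"]))
```

Nothing hard — but it's one more moving part you have to write, deploy, and maintain alongside the "out-of-the-box" plugin.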
As I've already mentioned, Ingest Attachment is basically a Tika app wrapped as an ES plugin. Tika is awesome, but again, when you need to get it right, you have to tune it properly and do some customization.
For instance: parsing PDFs. It's actually quite difficult to extract text from a PDF properly; often you have to extract inline images, or render the whole page and OCR it, depending on the text extracted from the page and its content (for example, you have to analyze whether the encoding is right or not). You simply cannot tune Tika to run custom logic inside the parsing process, nor can you do so with Ingest Attachment. If you're aiming at good-quality PDF parsing, Ingest Attachment is not what you're looking for; you'll have to do it yourself.
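The kind of per-page decision logic meant here can be sketched as follows — a toy heuristic with made-up thresholds, not Ambar's actual implementation: if the extracted text layer looks broken (too short, or full of replacement/control characters from a bad encoding), fall back to rendering and OCRing the page.

```python
def needs_ocr(page_text, min_chars=20, max_garbage_ratio=0.2):
    """Decide whether a PDF page's extracted text layer is usable,
    or whether the page should be rendered and OCRed instead.
    Thresholds are illustrative, not tuned values."""
    if len(page_text.strip()) < min_chars:
        return True          # likely a scanned page with no real text layer
    garbage = sum(
        1 for ch in page_text
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\r\t")
    )
    return garbage / len(page_text) > max_garbage_ratio

print(needs_ocr("   "))                                          # True
print(needs_ocr("A normal paragraph of extracted text, long enough."))  # False
```

This is exactly the hook Tika's pipeline doesn't expose: you can't say "if this page's text fails my quality check, re-route it through OCR" from inside Ingest Attachment.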
Parsing PDFs is a really huge topic and we're going to post about it on our blog soon.
Doing OCR Right
Ingest Attachment can be set up to do OCR with its bundled Tika; it's quite tricky, but possible.
But in my experience you should never do it, unless you want Ingest Attachment to waste all your CPU time on useless format conversions that often end up in an endless loop, eventually allocating all of the JVM's memory and crashing. What a finale...
You can't tune Ingest Attachment and its Tika to run any custom logic inside the OCR process. Tika often tries to convert and OCR broken images without proper exception handling or integrity checks, or tries to convert images in exotic formats its internal modules simply cannot handle. TIFF with JPEG encoding is a special pain.
So, if you need good-quality OCR, you have to do it yourself.
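As an illustration of the missing integrity checking: before spending CPU on OCR, you can at least verify the image starts with the signature of a format your OCR engine actually handles. The magic bytes below are the real, standard ones; the gate itself is a hypothetical sketch, not Ambar's code.

```python
# Well-known image format signatures (magic bytes).
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff-le",     # little-endian TIFF
    b"MM\x00*": "tiff-be",     # big-endian TIFF
}

def detect_format(data):
    """Return the detected image format, or None for unknown/broken
    input that should never be handed to an OCR engine."""
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return None

print(detect_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))   # png
print(detect_format(b"not an image at all"))               # None
```

A check this cheap is exactly the kind of guard that keeps a broken or exotic file from sending the whole JVM into a conversion spiral — and exactly the kind of logic you can't inject into Ingest Attachment.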
To Sum Up
Ingest Attachment is a great plugin if you need the easiest parsing-and-indexing solution that works out of the box. But if you're looking for a solid product, you'll have to find something more reliable, even if more complex.