Ambar Use Case: Integrated "Parse and Search" Solution

Over the last few years we have kept seeing developers and enthusiasts looking for a solution to parse (extract content from) and index their documents, such as PDFs: something they can send or upload documents to and then query by phrases from the documents' content. Since no integrated solution existed, they had to improvise. Most ended up using Tika or the Ingest Attachment plugin along with ElasticSearch; some used Solr or even Lucene with custom parsers. But none of these solutions is easy to set up: it usually takes weeks or even months to combine all the modules, fine-tune them, fix the bugs, and get them working together properly (check out this post: Ingest Attachment Plugin for ElasticSearch: Should You Use It?).

At some point during Ambar development we realized that, along with its main goals, it also neatly solves this problem by giving developers a solid out-of-the-box "parse and search" solution.

Let's go through a basic scenario: using Ambar to upload files and search through their contents.

Using Ambar REST API

First of all, if you haven't done so yet, install Ambar CE. It's easy: just follow the steps in Ambar Installation: Step-by-step guide.

Let's assume your Ambar API is running at http://ambar; I'll use this address throughout.

Uploading Files

To upload a file into Ambar, use this method:

POST http://ambar/api/files/:sourceId/:filename  

sourceId is a parameter that groups files inside Ambar by their source. For example, for files coming from Dropbox the source id can be 'Dropbox', and for files uploaded from the UI you can set 'Default' as the sourceId.

filename is the name of the file being uploaded.

The request body must be multipart form data with a single field containing the file.

Let's upload the 1984-george_orwell.rtf file with Books as the source id.

curl -X POST \
  http://ambar/api/files/Books/1984-george_orwell.rtf \
  -F 1984-george_orwell.rtf=@1984-george_orwell.rtf

An HTTP 200 OK response means the file has been successfully uploaded into Ambar and enqueued for processing.
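
The same upload can be sketched in Python using only the standard library. This is a minimal illustration rather than an official Ambar client: the helper names build_upload_url, encode_multipart and upload_file are hypothetical, http://ambar is the assumed address used throughout this post, and the form field name simply mirrors the curl example by reusing the file name.

```python
import urllib.parse
import urllib.request
import uuid

AMBAR_URL = "http://ambar"  # assumed address, as in the rest of this post


def build_upload_url(source_id, filename, base_url=AMBAR_URL):
    """Return the POST /api/files/:sourceId/:filename URL with both parts escaped."""
    def escape(part):
        return urllib.parse.quote(part, safe="")
    return f"{base_url}/api/files/{escape(source_id)}/{escape(filename)}"


def encode_multipart(field_name, filename, data):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex  # fresh boundary for every request
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_file(path, source_id, filename, base_url=AMBAR_URL):
    """POST the file at `path` to Ambar and return the HTTP status code."""
    with open(path, "rb") as f:
        # Field name is an assumption: the curl example uses the file name.
        body, content_type = encode_multipart(filename, filename, f.read())
    request = urllib.request.Request(
        build_upload_url(source_id, filename, base_url),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a running instance, upload_file("1984-george_orwell.rtf", "Books", "1984-george_orwell.rtf") would be the equivalent of the curl call above. Note that the Content-Type header and the body share one generated boundary, so there is nothing to hard-code.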


Searching Files

Let's search for the 1984-george_orwell.rtf file by a rather popular phrase from it: "Big Brother Watch"~3 (check Mastering Ambar Search Queries to understand the search syntax).

The search method is:

GET http://ambar/api/search  

The parameters are:

  • query - search query
  • size - number of results to retrieve (default is '10')
  • page - page of results to retrieve (default is '0'; for reference, page=0&size=10 returns the first 10 hits, and page=1&size=10 returns hits 11 through 20)

curl -i 'http://ambar/api/search?query=%22Big%20Brother%20Watch%22~3'
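
Percent-encoding the query by hand is error-prone, so here is a small Python sketch that builds the search URL from the three parameters listed above and shows how page and size map to hit positions. The helper names build_search_url and hit_positions are hypothetical, and http://ambar is again the assumed address.

```python
import urllib.parse

AMBAR_URL = "http://ambar"  # assumed address, as in the rest of this post


def build_search_url(query, page=0, size=10, base_url=AMBAR_URL):
    """Return the GET /api/search URL with the query percent-encoded."""
    params = urllib.parse.urlencode(
        {"query": query, "page": page, "size": size},
        quote_via=urllib.parse.quote,  # encode spaces as %20 rather than '+'
    )
    return f"{base_url}/api/search?{params}"


def hit_positions(page, size=10):
    """Return the 1-based (first, last) positions of the hits on a page."""
    first = page * size + 1
    return first, first + size - 1
```

For example, build_search_url('"Big Brother Watch"~3') yields a URL that can be passed to curl or opened with urllib.request.urlopen, and hit_positions(1) gives the range covered by page=1&size=10.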

Here is an excerpt from the response:

            "updated_datetime":"2017-02-05 08:55:12.606",
            "indexed_datetime":"2017-04-26 19:30:28.375",

            "created_datetime":"2017-01-13 14:12:51.814",
            "indexed_datetime":"2017-04-26 19:30:28.681",
            "author":"George Orwell",
            "processed_datetime":"2017-04-26 19:30:28.681",
                  " CHAPTER ONE<br/><em>Big Brother</em> Is <em>Watching</em> You<br/>It was a bright, cold day in April and the clocks were striking thirteen. Winston Smith hurried home to Victory Mansions with his head down to escape the terrible wind. A cloud of dust blew inside with him, and the hall smelled of dust and yesterday's food.<br/> At the end of the hall, a poster covered one wall. It showed an enormous face, more than a metre wide: the face of a handsome man of about forty-five, with a large, black moustache. The man's",
                  "man's eyes seemed to follow Winston as he moved. Below the face were the words <em>BIG BROTHER</em> IS <em>WATCHING</em> YOU.<br/> Winston went up the stairs. He did not even try the lift. It rarely worked and at the moment the electricity was switched off during the day to save money for Hate Week. The flat was on the seventh floor and Winston, who was thirty-nine and had a bad knee, went slowly, resting several times on the way. Winston was a small man and looked even smaller in the blue overalls of the Party. His hair",
                  "blew dust and bits of paper around in the street and there seemed to be no colour in anything, except in the posters that were everywhere. The face with the black moustache looked down from every corner. There was one on the house opposite. <em>BIG BROTHER</em> IS <em>WATCHING</em> YOU, it said, and the eyes looked into Winston's.<br/> Behind him the voice from the telescreen was still talking about iron. There was now even more iron in Oceania than the Ninth Three-Year Plan had demanded. The telescreen had a microphone"

Let's go through its fields:

  • total - total number of hits
  • hits - hits array
    • score - the score of the hit, float number from 0 to 1
    • sha256 - sha256 of the content found
    • file_id - unique id of the file
    • meta - metadata object with file's metadata and highlights
      • full_name - full name of the file inside Ambar; the format is //{source id}/{original path to file in case it was crawled}/{file name}.{file extension}
      • source_id - id of the source the metadata belongs to; in our case it's 'Books'
      • download_uri - unique id used to download the original file with the corresponding metadata; the next section describes how to use it
      • the other fields are self-explanatory
    • content - object with parsed content, its metadata and highlights
      • highlight - object with the retrieved highlights for the corresponding fields. Each highlighted field contains an array of highlighted snippets; for reference, the text field holds the parsed content of the file, HTML-formatted with highlights marked by <em>
  • took - time in milliseconds that the search took
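
To make the field list concrete, here is a hedged Python sketch that walks a response of that shape and pulls out the parts most callers need. The sample response is hypothetical: its structure follows the field list above (highlights under content.highlight.text), but the values are made up for illustration, and the helper names plain_text and summarize are not part of Ambar.

```python
import html
import re

# Hypothetical response shaped after the field list above; values are made up.
SAMPLE_RESPONSE = {
    "total": 1,
    "took": 12,
    "hits": [
        {
            "score": 0.87,
            "file_id": "example-file-id",
            "meta": {
                "full_name": "//Books/1984-george_orwell.rtf",
                "source_id": "Books",
                "download_uri": "example-download-uri",
            },
            "content": {
                "highlight": {
                    "text": [
                        "<em>BIG BROTHER</em> IS <em>WATCHING</em> YOU."
                        "<br/> Winston went up the stairs."
                    ]
                }
            },
        }
    ],
}


def plain_text(snippet):
    """Turn an HTML-formatted highlight into plain text."""
    snippet = re.sub(r"<br\s*/?>", "\n", snippet)  # <br/> becomes a newline
    snippet = re.sub(r"</?em>", "", snippet)       # drop highlight markers
    return html.unescape(snippet).strip()


def summarize(response):
    """Return one dict per hit: name, download id, score and plain snippets."""
    summary = []
    for hit in response.get("hits", []):
        highlight = hit.get("content", {}).get("highlight", {})
        summary.append({
            "full_name": hit["meta"]["full_name"],
            "download_uri": hit["meta"]["download_uri"],
            "score": hit["score"],
            "snippets": [plain_text(s) for s in highlight.get("text", [])],
        })
    return summary
```

In real use the dictionary would come from json.loads on the search response body; the download_uri collected here is what the download method expects.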

Downloading Files

The method for downloading a file is:

GET http://ambar/api/files/:download_uri  

You can get download_uri from the meta field in the search response (see the previous section).

curl -X GET http://ambar/api/files/b41c4aaa2999ce42957f087db8e7608970efcedb1eaa40c28336390ecb5373849c955f395258f3dfd7482d4b84d543cdcc27104c934cd4efdb0ba8c54e6e8e5f266991ac51a4bea8765af633d0074cfd9000b7e3bff65d6cf3cb00bd1ee7cb3163379c0317767c9def3605c32ffc23a3d2a7ecd234b8edd575dffcf4d7106bd6caf7573447c6e991258d2e2091a8a389  

Also, you can just open this URL in your browser and it'll download the file.
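
For scripted downloads, a minimal Python sketch could look like this. The helper names build_download_url and download_file are hypothetical, http://ambar is the assumed address, and the code assumes download_uri is URL-safe (like the hex string in the curl example), so it is appended to the path unchanged.

```python
import shutil
import urllib.request

AMBAR_URL = "http://ambar"  # assumed address, as in the rest of this post


def build_download_url(download_uri, base_url=AMBAR_URL):
    """Return the GET /api/files/:download_uri URL."""
    # Assumption: download_uri is already URL-safe, as in the example above.
    return f"{base_url}/api/files/{download_uri}"


def download_file(download_uri, destination, base_url=AMBAR_URL):
    """Stream the original file from Ambar into `destination` and return its path."""
    url = build_download_url(download_uri, base_url)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return destination
```

Streaming with shutil.copyfileobj keeps memory use flat even for large documents.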

To Sum Up

I described how to use Ambar as a ready-to-go parse and search solution for your projects. It's much easier than setting up your own Tika + ElasticSearch or any other self-built infrastructure. It also gives you much better result quality, both in content extraction and in search optimization.

Here is the complete Ambar Web API Documentation; please refer to it if you need to clarify anything about interacting with the API.

Stay tuned and subscribe to our blog!

Igor S
