Sunday, August 28, 2016

Loading data into Elasticsearch

I was playing around with Elasticsearch and tried to load some data into it.  One way to do it is to write a script that parses the JSON file and uses an Elasticsearch client to index each document, e.g. using elasticsearch-py:

from elasticsearch import Elasticsearch
import argparse
import json
import sys


parser = argparse.ArgumentParser(description='Import JSON files into Elasticsearch')
parser.add_argument('-f', '--file', help='file to import', required=True)
parser.add_argument('-i', '--index', help='Elasticsearch index name', required=True)
parser.add_argument('-t', '--type', help='Elasticsearch type name', required=True)
parser.add_argument('--id', help='id field of each document')
parser.add_argument('--empty_as_null', help='Convert empty objects to null')
args = parser.parse_args()

es = Elasticsearch()

# The input file is expected to contain one JSON document per line.
with open(args.file, 'r') as json_file:
    for line in json_file:
        doc = json.loads(line)
        if args.id is not None:
            # Use the named field as the document id; otherwise let
            # Elasticsearch auto-generate one.
            doc_id = doc[args.id]
            # doc.pop(args.id)  # uncomment to drop the id field from the body
        else:
            doc_id = None

        try:
            es.index(index=args.index, doc_type=args.type, id=doc_id, body=doc)
        except Exception:
            print('Problem processing:')
            print(doc)
            print(sys.exc_info()[0])
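
For example, to index the Yelp business data with business_id as the document id (assuming the script is saved as import_json.py):

python import_json.py -f business.json -i yelp -t business --id business_id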


But this is slow.  It took almost an hour to index 61k documents.  A much faster way is to use the Bulk API.  But first we need to modify the JSON file.  The Yelp sample business data comes in this format:

...
{"business_id": "UsFtqoBl7naz8AVUBZMjQQ", "full_address": "202 McClure St\nDravo
sburg, PA 15034", "hours": {}, "open": true, "categories": ["Nightlife"], "city"
: "Dravosburg", "review_count": 4, "name": "Clancy's Pub", "neighborhoods": [],
"longitude": -79.886930000000007, "state": "PA", "stars": 3.5, "latitude": 40.35
0518999999998, "attributes": {"Happy Hour": true, "Accepts Credit Cards": true,
"Good For Groups": true, "Outdoor Seating": false, "Price Range": 1}, "type": "b
usiness"}

...

We will need to insert an action line before each record.  A simple sed command does the trick:

sed -i.bak 's/^/{ "index": { "_index": "yelp", "_type": "business" } }\n/' business.json
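
After the substitution, each document is preceded by its action line, which is exactly the pairing the Bulk API expects:

{ "index": { "_index": "yelp", "_type": "business" } }
{"business_id": "UsFtqoBl7naz8AVUBZMjQQ", "full_address": "202 McClure St\nDravosburg, PA 15034", ...}

(Note that \n in the replacement is a GNU sed extension; with BSD/macOS sed you would need a literal newline instead.)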

Then we can load the file directly into Elasticsearch:

curl -s -XPOST localhost:9200/_bulk --data-binary "@business.json"; echo


And this takes only 30 seconds for the same 61k documents.
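
If you would rather stay in Python, elasticsearch-py also ships a bulk helper that builds the same action/document pairs for you, so no sed preprocessing is needed.  Here is a minimal sketch, assuming the same business.json file and using business_id as the document id:

from elasticsearch import Elasticsearch, helpers
import json

es = Elasticsearch()

def actions(path):
    # Generate one bulk action per line of the file, mirroring the
    # { "index": { "_index": "yelp", "_type": "business" } } action above.
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                '_index': 'yelp',
                '_type': 'business',
                '_id': doc.get('business_id'),
                '_source': doc,
            }

helpers.bulk(es, actions('business.json'))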
