from elasticsearch import Elasticsearch
import argparse
import json
import sys

parser = argparse.ArgumentParser(description='Import JSON files into Elasticsearch')
parser.add_argument('-f', '--file', help='file to import', required=True)
parser.add_argument('-i', '--index', help='Elasticsearch index name', required=True)
parser.add_argument('-t', '--type', help='Elasticsearch type name', required=True)
parser.add_argument('--id', help='id field of each document')
parser.add_argument('--empty_as_null', help='convert empty objects to null')
args = parser.parse_args()

es = Elasticsearch()
with open(args.file, 'r') as json_file:
    for line in json_file:
        doc = json.loads(line)
        if args.id is not None:
            doc_id = doc[args.id]
            #doc.pop(args.id)
        else:
            doc_id = None
        try:
            es.index(index=args.index, doc_type=args.type, id=doc_id, body=doc)
        except Exception:
            print('Problem processing:')
            print(doc)
            print(sys.exc_info()[0])
But this is slow: it took almost an hour to index 61k documents. A much faster way is to use the Bulk API, but first we need to modify the JSON file. The Yelp sample business data comes in this format:
...
{"business_id": "UsFtqoBl7naz8AVUBZMjQQ", "full_address": "202 McClure St\nDravosburg, PA 15034", "hours": {}, "open": true, "categories": ["Nightlife"], "city": "Dravosburg", "review_count": 4, "name": "Clancy's Pub", "neighborhoods": [], "longitude": -79.886930000000007, "state": "PA", "stars": 3.5, "latitude": 40.350518999999998, "attributes": {"Happy Hour": true, "Accepts Credit Cards": true, "Good For Groups": true, "Outdoor Seating": false, "Price Range": 1}, "type": "business"}
...
We will need to insert an action line before each record. A simple sed command will do the trick:
sed -i.bak 's/^/{ "index": { "_index": "yelp", "_type": "business" } }\n/' business.json
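After the substitution, the file alternates action lines and document lines, which is the format the Bulk API expects:

```
{ "index": { "_index": "yelp", "_type": "business" } }
{"business_id": "UsFtqoBl7naz8AVUBZMjQQ", "full_address": "202 McClure St\nDravosburg, PA 15034", "hours": {}, ...}
```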
Then we can load the file directly into Elasticsearch:
curl -s -XPOST localhost:9200/_bulk --data-binary "@business.json"; echo
And this takes only 30 seconds for the same 61k documents.
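The same bulk load can also be driven from Python, using the helpers module that ships with elasticsearch-py. This is a minimal sketch, assuming the same file name, index, type, and id field as the Yelp example above:

```python
# Bulk indexing sketch using elasticsearch-py's helpers module.
# The file name, index, type, and id field are assumptions matching
# the Yelp business example in this post.
import json


def generate_actions(lines, index, doc_type, id_field=None):
    """Turn newline-delimited JSON into bulk index actions."""
    for line in lines:
        doc = json.loads(line)
        action = {"_index": index, "_type": doc_type, "_source": doc}
        if id_field is not None:
            # Reuse a document field as the Elasticsearch _id
            action["_id"] = doc[id_field]
        yield action


if __name__ == "__main__":
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()
    with open("business.json") as f:
        # helpers.bulk batches the generated actions into _bulk requests
        helpers.bulk(es, generate_actions(f, "yelp", "business", "business_id"))
```

Because the actions are produced by a generator, the whole file is never held in memory, which matters once the data set grows beyond a few hundred thousand documents.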