from elasticsearch import Elasticsearch
import argparse
import json
import sys

parser = argparse.ArgumentParser(description='Import JSON files into Elasticsearch')
parser.add_argument('-f', '--file', help='file to import', required=True)
parser.add_argument('-i', '--index', help='Elasticsearch index name', required=True)
parser.add_argument('-t', '--type', help='Elasticsearch type name', required=True)
parser.add_argument('--id', help='id field of each document')
parser.add_argument('--empty_as_null', help="Convert empty objects to null")
args = parser.parse_args()

es = Elasticsearch()

with open(args.file, 'r') as json_file:
    # The input file holds one JSON document per line.
    for line in json_file:
        doc = json.loads(line)
        # Use the named field as the document id, or let Elasticsearch generate one.
        if args.id is not None:
            doc_id = doc[args.id]
            #doc.pop(args.id)
        else:
            doc_id = None
        try:
            # One request per document.
            es.index(index=args.index, doc_type=args.type, id=doc_id, body=doc)
        except Exception:
            print('Problem processing ')
            print(doc)
            print(sys.exc_info()[0])
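But this is slow, because every document is sent in its own request. It took almost an hour to index 61k documents. A much faster way is to use the Bulk API. But first we need to modify the JSON file. The Yelp sample business data comes in this format, one JSON document per line (wrapped here for display):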
...
{"business_id": "UsFtqoBl7naz8AVUBZMjQQ", "full_address": "202 McClure St\nDravo
sburg, PA 15034", "hours": {}, "open": true, "categories": ["Nightlife"], "city"
: "Dravosburg", "review_count": 4, "name": "Clancy's Pub", "neighborhoods": [],
"longitude": -79.886930000000007, "state": "PA", "stars": 3.5, "latitude": 40.35
0518999999998, "attributes": {"Happy Hour": true, "Accepts Credit Cards": true,
"Good For Groups": true, "Outdoor Seating": false, "Price Range": 1}, "type": "b
usiness"}
...
We will need to insert an action line before each record. A simple sed command will do the trick:
sed -i.bak 's/^/{ "index": { "_index": "yelp", "_type": "business" } }\n/' business.json
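If sed is not at hand, or your sed does not turn \n in the replacement into a newline (as far as I know that is GNU sed behaviour, and BSD/macOS sed differs), the same transformation can be sketched in Python. The function names to_bulk_lines and convert are made up for this illustration, and the index and type names are the ones from the sed command above.

```python
import json


def to_bulk_lines(lines, index, doc_type=None):
    """Yield an 'index' action line followed by each non-empty source line."""
    meta = {"_index": index}
    if doc_type is not None:
        # Mapping types only exist in older Elasticsearch versions.
        meta["_type"] = doc_type
    action = json.dumps({"index": meta})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, they are not documents
        yield action
        yield line


def convert(src_path, dst_path, index, doc_type=None):
    """Write a Bulk API body: action line, document line, repeated."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for out in to_bulk_lines(src, index, doc_type):
            # Every line, including the last one, must end with a newline.
            dst.write(out + "\n")
```

For example, convert("business.json", "business_bulk.json", "yelp", "business") writes a new file instead of editing business.json in place, so the curl command would then point at business_bulk.json.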
Then we can load the file directly into Elasticsearch:
curl -s -XPOST localhost:9200/_bulk --data-binary "@business.json"; echo
And this takes only 30 seconds for the same 61k documents.
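The Bulk API can also be used from Python without touching the file, through the bulk helper that ships with the elasticsearch client. What follows is only a sketch, not something I timed: generate_actions and bulk_import are names invented here, the _type key assumes an older Elasticsearch version that still has mapping types, and Elasticsearch() with no arguments assumes a client version that defaults to localhost:9200.

```python
import json


def generate_actions(lines, index, doc_type=None, id_field=None):
    """Turn JSON lines into action dicts understood by helpers.bulk."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        action = {"_index": index, "_source": doc}
        if doc_type is not None:
            action["_type"] = doc_type  # assumption: pre-7.x style mapping type
        if id_field is not None:
            action["_id"] = doc[id_field]
        yield action


def bulk_import(path, index, doc_type=None, id_field=None):
    """Stream a file of JSON lines into Elasticsearch in batches."""
    # Imported here so generate_actions works without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()
    with open(path) as json_file:
        # helpers.bulk groups the actions into chunks and sends each chunk
        # as a single Bulk API request.
        return helpers.bulk(es, generate_actions(json_file, index, doc_type, id_field))
```

Calling bulk_import("business.json", "yelp", "business", id_field="business_id") would index the original, unmodified file and use business_id as the document id.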