Elasticsearch is a powerful search engine based on Lucene. It can index huge amounts of data and answer keyword queries quickly. Unlike a conventional database, Elasticsearch builds an inverted index, which lets it search for keywords across all documents. Wikipedia's search is now powered by Elasticsearch. In this post, we show how to set up a local Elasticsearch instance and import a Wikipedia dump into it. We also introduce some basic usage of Elasticsearch.
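To make the idea of an inverted index concrete, here is a minimal Python sketch. It is only an illustration of the concept, not how Lucene or Elasticsearch actually implement it: tokenization is a plain whitespace split, and the function names and sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Start from the postings of the first token, then intersect with the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical toy documents, keyed by document id.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Wikipedia search uses Elasticsearch",
    3: "Lucene builds an inverted index",
}
index = build_inverted_index(docs)
print(sorted(search(index, "elasticsearch")))  # [1, 2]
print(sorted(search(index, "lucene index")))   # [3]
```

A keyword lookup only touches the entries for the query tokens instead of scanning every document, which is why this structure suits full-text search.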
We run Elasticsearch on a Linux system. Firstly, we download Elasticsearch from the web.
We then install some plugins provided by Wikimedia. Afterwards, we start Elasticsearch with the launcher script shipped in the archive (bin/elasticsearch in a standard tar.gz install).
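Once the node is running, a quick way to confirm it is up is to query its root HTTP endpoint, which returns a small JSON document that includes the cluster name and version number. The sketch below assumes the default address http://localhost:9200 with no authentication or TLS; recent releases enable security by default, so the URL and credentials may need adjusting. The function names and the sample payload are hypothetical.

```python
import json
import urllib.request


def parse_node_info(body):
    """Extract (cluster_name, version_number) from the root-endpoint JSON."""
    info = json.loads(body)
    return info.get("cluster_name"), info.get("version", {}).get("number")


def fetch_node_info(base_url="http://localhost:9200", timeout=5):
    """Query a node's root endpoint; base_url is an assumed default."""
    with urllib.request.urlopen(base_url, timeout=timeout) as resp:
        return parse_node_info(resp.read().decode("utf-8"))


# Made-up payload with the same shape as a real response, so the parsing
# can be tried without a running node.
sample_body = '{"cluster_name": "demo", "version": {"number": "0.0.0"}}'
print(parse_node_info(sample_body))
```

With a node running locally, calling fetch_node_info() should return the real cluster name and version; if the node is not reachable, urllib raises an error instead.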
Do not follow the old official post
This well-known post from Elasticsearch, https://www.elastic.co/blog/loading-wikipedia, is out of date because Elasticsearch has changed a lot since it was written. I personally ran into many problems while trying to reproduce the tutorial, and I could not find solutions to them.
Logstash is a log processing tool. However, it can also be used to process our own data. Since a Wikipedia dump is a text stream, Logstash can process it with its rich set of plugins. The Logstash framework is a pipeline of three stages: Input, Filter and Output. We can define each stage ourselves to turn the text stream into the data we want.
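The three-stage idea can be sketched in plain Python. This is an analogy, not Logstash configuration or its API: each stage is a generator that consumes the previous one, and the function names and sample lines are invented for illustration. The raw line is stored under a "message" key, mirroring how Logstash events keep their original text.

```python
import json
import sys


def read_input(lines):
    """Input stage: yield raw lines with the trailing newline removed."""
    for line in lines:
        yield line.rstrip("\n")


def filter_events(raw_lines):
    """Filter stage: drop blank lines and turn each line into a dict event."""
    for line in raw_lines:
        if not line.strip():
            continue
        yield {"message": line, "length": len(line)}


def write_output(events, out=sys.stdout):
    """Output stage: write one JSON object per line; return the event count."""
    count = 0
    for event in events:
        out.write(json.dumps(event) + "\n")
        count += 1
    return count


# Hypothetical sample stream; in practice this could be an open file.
sample = ["first line of text\n", "\n", "second line\n"]
written = write_output(filter_events(read_input(sample)))
```

Because the stages are chained lazily, the stream is processed one line at a time, which matters for an input as large as a Wikipedia dump.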