Elasticsearch is a powerful search engine built on Lucene. It can index huge amounts of data and answer keyword queries quickly. Unlike a typical database, Elasticsearch builds an inverted index, which lets it search for keywords across all documents. Wikipedia's search tool now runs on Elasticsearch. In this post, we show how to set up a local Elasticsearch instance and import a Wikipedia dump into it. We also introduce some basic Elasticsearch usage.
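To make the idea of an inverted index concrete, here is a toy sketch in Python. It is only an illustration of the data structure, not how Lucene actually stores its index, and the three sample documents are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, keyword):
    """Return the sorted ids of the documents that contain the keyword."""
    return sorted(index.get(keyword.lower(), set()))

# Made-up sample documents, keyed by document id.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene builds an inverted index",
    3: "Wikipedia search uses Elasticsearch",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))         # [1, 2]
print(search(index, "elasticsearch"))  # [1, 3]
```

A lookup only touches the entry for the keyword instead of scanning every document, which is why keyword search stays fast as the data grows.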
We run Elasticsearch on a Linux system. First, we download Elasticsearch from the web.
Next, we install some plugins provided by Wikimedia. Afterwards, we start Elasticsearch with:
Do not follow the old official post
This well-known post published by Elasticsearch, https://www.elastic.co/blog/loading-wikipedia, is out of date, because Elasticsearch has changed a lot since it was written. I personally ran into many problems while trying to reproduce the tutorial, and I could not find solutions to them.
Logstash is a log processing tool, but it can also be used to process our data. Since the Wikipedia dump is a text stream, Logstash can process it with its rich set of plugins. A Logstash pipeline has three parts: Input, Filter and Output. We can define each part ourselves to turn the text stream into the data we want.
First, we download Logstash and unzip it.
Then, we write a config file that defines the three parts of the Logstash pipeline. The config file wikipedia.conf is as follows:
In the input part, we read the text stream from stdin and use the multiline codec to split the text at the XML tag <page>. In the filter part, we use two filters. The first is the xml filter, which uses XPath to extract the information we need and passes it to the next filter, mutate. Mutate is a filter for modifying data. Here, we remove some fields and convert the title and id from arrays to strings. Afterwards, we use its gsub option to apply some regular-expression replacements that preprocess the raw text. Finally, we define the output part. We set the output to Elasticsearch (Logstash also supports many other output destinations), set the index for our data, and set the id of each document to the id we extracted.
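To see what the filter stage does to a single page, the following Python sketch mimics it outside Logstash. It is not the wikipedia.conf from this post. The sample page, the field names in the resulting document and the two regular expressions are assumptions chosen for illustration, and the real gsub rules may differ.

```python
import re
import xml.etree.ElementTree as ET

# Made-up page in the shape of one <page> element cut out of the dump.
SAMPLE_PAGE = """<page>
  <title>Example</title>
  <id>12</id>
  <revision>
    <id>345</id>
    <text>'''Example''' is a [[sample|demo]] page.</text>
  </revision>
</page>"""

def page_to_doc(page_xml):
    """Extract title, id and text from one page, then clean the text."""
    page = ET.fromstring(page_xml)
    # Path lookups play the role of the xml filter's XPath expressions.
    # findtext returns a single string, so no array-to-string step is needed.
    doc = {
        "title": page.findtext("title"),
        "id": page.findtext("id"),  # direct child only, not the revision id
        "text": page.findtext("revision/text") or "",
    }
    # gsub-like cleanup; both patterns are hypothetical examples.
    text = doc["text"]
    # [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop runs of two or more apostrophes (wiki bold/italic markup).
    text = re.sub(r"'{2,}", "", text)
    doc["text"] = text
    return doc

doc = page_to_doc(SAMPLE_PAGE)
print(doc)
```

The returned dictionary corresponds to the event that Logstash would send to the output stage, with the extracted id available as the document id.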
We import our data with:
bunzip2 -c writes the decompressed file to stdout, and the pipe sends the text stream to Logstash.
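The same streaming idea can be sketched in Python, which may help when debugging the dump without Logstash. The sketch decompresses the file lazily and groups lines into pages, roughly what bunzip2 -c plus the multiline codec do together. It assumes that <page> and </page> each sit on their own line, and the function names are made up for this example.

```python
import bz2
import io

def iter_pages(stream):
    """Yield one '<page>...</page>' string at a time from a stream of lines."""
    buffer = None  # None means we are currently outside a page
    for line in stream:
        stripped = line.strip()
        if stripped.startswith("<page>"):
            buffer = []
        if buffer is not None:
            buffer.append(line)
            if stripped.endswith("</page>"):
                yield "".join(buffer)
                buffer = None

def iter_pages_from_bz2(fileobj):
    """Decompress a .bz2 file (path or file object) lazily and yield its pages."""
    with bz2.open(fileobj, mode="rt", encoding="utf-8") as stream:
        yield from iter_pages(stream)

# Tiny in-memory stand-in for a real dump file.
data = (
    "<mediawiki>\n"
    "<page>\n<title>A</title>\n</page>\n"
    "<page>\n<title>B</title>\n</page>\n"
    "</mediawiki>\n"
)
compressed = io.BytesIO(bz2.compress(data.encode("utf-8")))
pages = list(iter_pages_from_bz2(compressed))
print(len(pages))  # 2
```

Because both functions are generators, only one page is held in memory at a time, so the approach also works on the full dump.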
stream2es (https://github.com/elastic/stream2es) is an older tool provided by the Elasticsearch team. If you find that Logstash takes too much time, you can consider using this tool instead. stream2es can import several data formats into Elasticsearch. Note that Java imposes some size limits, so we have to raise them in the import command. The command is as follows:
During the import, stream2es gets stuck and cannot import any more data.
Some basic operations of Elasticsearch
View Index Information
After importing the data, we can view the index information with:
We can see that we have 17959833 documents, which occupy 57 GB of disk space.
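The same information can be read programmatically through the _cat/indices API. The sketch below is a minimal example that assumes Elasticsearch listens on http://localhost:9200 and that the index is called wikipedia. Both are assumptions, so adjust them to your setup.

```python
import json
import urllib.request

def cat_indices_url(base_url, index):
    """Build the _cat/indices URL for one index, asking for JSON output."""
    return f"{base_url.rstrip('/')}/_cat/indices/{index}?format=json"

def fetch_index_info(base_url, index):
    """Query a running node; call this only when Elasticsearch is up."""
    with urllib.request.urlopen(cat_indices_url(base_url, index)) as resp:
        return json.load(resp)

def summarize(rows):
    """Reduce each row to (index name, document count, store size)."""
    return [(r["index"], int(r["docs.count"]), r["store.size"]) for r in rows]

# Example row using the numbers reported above; the index name is assumed.
rows = [{"index": "wikipedia", "docs.count": "17959833", "store.size": "57gb"}]
print(summarize(rows))
```

With a running node, summarize(fetch_index_info("http://localhost:9200", "wikipedia")) would return the live values.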
Search by content
We use the following command to search for the keyword “”:
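As a rough Python equivalent, a keyword search can be sent to the _search endpoint with a match query. The host, the index name wikipedia, the field name text and the keyword lucene are all placeholders picked for this sketch, not values taken from the original command.

```python
import json
import urllib.request

def build_search_request(base_url, index, field, keyword, size=10):
    """Build a POST request that runs a match query on one field."""
    body = {"query": {"match": {field: keyword}}, "size": size}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run(request):
    """Send the request to a running node and decode the JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)

# Placeholder values; nothing is sent until run(req) is called.
req = build_search_request("http://localhost:9200", "wikipedia", "text", "lucene")
print(req.full_url)
print(req.data.decode("utf-8"))
```

Calling run(req) against a live node returns the matching documents under the hits key of the response.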
We can list some sample data with:
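One simple way to peek at a few documents from Python is a _search request without a query, which matches everything, limited by the size parameter. The index name and the shape of the stored documents are assumptions here, and the fake response only imitates the structure that Elasticsearch returns.

```python
import json
import urllib.request

def sample_url(base_url, index, size=3):
    """URL for a query-less search, which returns the first `size` documents."""
    return f"{base_url.rstrip('/')}/{index}/_search?size={size}"

def extract_sources(response):
    """Pull (document id, stored fields) pairs out of a search response."""
    return [(hit["_id"], hit["_source"]) for hit in response["hits"]["hits"]]

def fetch_samples(base_url, index, size=3):
    """Query a running node; call this only when Elasticsearch is up."""
    with urllib.request.urlopen(sample_url(base_url, index, size)) as resp:
        return extract_sources(json.load(resp))

# Hand-written response in the usual shape, used instead of a live call.
fake_response = {"hits": {"hits": [{"_id": "12", "_source": {"title": "Example"}}]}}
print(extract_sources(fake_response))
```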
We can remove the index with:
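Deleting an index is a single HTTP DELETE on the index name. The sketch below only builds the request, so nothing is removed until it is passed to urllib.request.urlopen. The host and index name are assumptions, and the guard against wildcards is an extra safety check added for this example.

```python
import urllib.request

def build_delete_index_request(base_url, index):
    """Build a DELETE request for exactly one explicitly named index."""
    # Refuse patterns that could match several indices at once.
    if not index or index == "_all" or any(ch in index for ch in "*,"):
        raise ValueError("refusing to delete without one explicit index name")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}", method="DELETE"
    )

# Placeholder values; the request is built but not sent.
req = build_delete_index_request("http://localhost:9200", "wikipedia")
print(req.get_method(), req.full_url)
```

Deleting an index cannot be undone, so it is worth printing the request and checking the index name before sending it.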