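Using Elasticsearch

1 Introduction to Elasticsearch

2 Installing Elasticsearch

3 Using Elasticsearch

3.1 Creating an index

We want to store some English news articles with four fields: title, content, date, and author.

First we create an index and specify the type of each field.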

PUT mydocs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "date": {
        "type": "date"
      },
      "author": {
        "type": "keyword"
      }
    }
  }
}
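
The same request body can also be assembled in Python, which may be convenient when the field list lives in code. This is only a minimal sketch: `build_index_body` is a hypothetical helper written for this example, not part of any Elasticsearch library.

```python
def build_index_body(field_types):
    """Build an index-creation body from a {field name: field type} dict."""
    properties = {name: {"type": ftype} for name, ftype in field_types.items()}
    return {"mappings": {"properties": properties}}


# Same fields and types as the mydocs index above.
body = build_index_body(
    {"title": "text", "content": "text", "date": "date", "author": "keyword"}
)
print(body)
```

The resulting dict has the same shape as the JSON body of the PUT request above.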

3.2 Storing documents

We picked a few news articles from CNN (https://edition.cnn.com/) and create the documents in a single bulk request.

POST _bulk
{"index":{"_index":"mydocs","_id":1}}
{"title":"A woman’s body was found in a bag at an abandoned bus stop. Malaysian police are investigating","content":"Police in Malaysia say they are investigating the death of a woman whose decomposing body was discovered in a travel bag at an abandoned bus station. A passerby found the bag near a building belonging to the state electricity company Tenaga Nasional Berhad earlier this week in Kulai, a district in the southern state of Johor, state news agency Bernama reported. Kulai district police chief Tok Beng Yeow said the highly decomposed state of the body – which he estimated at more than 50 per cent – had hampered initial identification efforts, according to Bernama. However, a preliminary post-mortem report by Sultanah Aminah Hospital suggested the body belonged to a woman over 25 years of age, who had sustained a head injury and may have died around two weeks ago. The district police chief said an investigation is ongoing as he appealed to local residents to come forward with information.","date":"2023-04-16","author":"Chris Lau"}
{"index":{"_index":"mydocs","_id":2}}
{"title":"Multiple people are injured after a shooting in Dadeville, Alabama","content":"A shooting Saturday night in Dadeville, Alabama, has left people injured, Dadeville Council Member Teneeshia Goodman-Johnson told CNN. Goodman-Johnson said there was a shooting last night at a gathering in the downtown area. The council member was unable to provide specifics on the number of injuries. CNN has reached out to multiple state and local officials for more information. Dadeville is about 45 miles northeast of Montgomery.","date":"2023-04-16","author":"Joe Sutton"}
{"index":{"_index":"mydocs","_id":3}}
{"title":"Gunmen kill 7 at public swimming pool in Mexico","content":"At least seven people were killed, including a child, when gunmen opened fire at a public swimming pool in Mexico on Saturday, according to local authorities. An eyewitness to the attacks told local authorities the armed men had arrived at the pool and opened fire around 4:30 p.m. local time on Saturday, then damaged a shop, security cameras and a monitor before leaving. Social media videos showed people in swimsuits screaming and hugging their children. The Mexican army and security forces have been deployed to search for the gunmen behind the attack, which took place in the city of Cortazar in the central state of Guanajuato. When local security forces arrived at the site they found dead bodies, including one child younger than seven, and shell casings, the municipal government said in a statement. In addition to the seven dead, one person was seriously injured and taken to hospital, it added.","date":"2023-04-16","author":"Marlon Sorto"}
{"index":{"_index":"mydocs","_id":4}}
{"title":"Kylian Mbappé becomes Paris Saint-Germain’s all-time top scorer in Ligue 1","content":"Kylian Mbappé has already achieved much in his young career. The 24-year-old has won a World Cup, scored a hattrick in a World Cup final and is captain of France. On Saturday, he added more to his résumé by becoming Paris Saint-Germain’s all-time leading scorer in Ligue 1. Mbappé was the star of the show in a crucial 3-1 victory against Lens, scoring the opener for his 139th league goal for the club. He also beautifully set up Lionel Messi – who scored three minutes after Vitinha had put PSG 2-0 up – in a brilliant team goal. The Frenchman has achieved his feat in 169 Ligue 1 games, overtaking Edison Cavani who netted 138 times in Ligue 1 for the club in 200 league games. Second-placed Lens is challenging PSG for the Ligue 1 title but now nine points adrift of the Parisians it is looking likely that PSG will win an 11th title. Salis Adbul Samed’s red card in the 19th minute didn’t help Lens. PSG had been going through an indifferent period, losing two home games on the bounce to give title rivals Lens and Marseille hope. PSG coach Christophe Galtier told PSG TV after the match: “If there was a match we had to win, it was this one, after the two straight losses at the Parc. Lens are one of our rivals and obviously it was important to win. “There are seven games left. I know that Lens and Marseille will not give up. We must continue to be focused. I just saw that we have been in top spot since the beginning of the season. “We have to continue like that and prepare well for the Angers game, which comes early, on Friday. Our fixture list looks favorable, but it is only favorable if we invest ourselves fully and show a great determination to win.” PSG next plays Angers at the Stade Raymond-Kopa on April 21.","date":"2023-04-16","author":"Aimee Lewis"}
{"index":{"_index":"mydocs","_id":5}}
{"title":"Elon Musk says he’s cut about 80% of Twitter’s staff","content":"Elon Musk has laid off more than 6,000 people at Twitter since taking over the company, he told the BBC in a rare interview late Tuesday. Musk was quoted as saying in the interview that the social media platform now has only 1,500 employees, down from under 8,000 who were employed at the time of his acquisition. The reduction equates to roughly 80% of the company’s staff. It’s “not fun at all” and can sometimes be “painful,” the billionaire CEO told the British broadcaster at Twitter’s head office in San Francisco. The world’s second richest man said that “drastic action” was needed when he came on board, because the company was facing “a $3 billion negative cash flow situation.” That left Twitter (TWTR) with only “four months to live,” he estimated. “This is not a caring [or] uncaring situation. It’s like, if the whole ship sinks, then nobody’s got a job,” Musk said. Musk purchased Twitter for $44 billion last October.","date":"2023-04-12","author":"Michelle Toh"}
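
The "index" action adds the document, or replaces it if a document with the same _id already exists.
The "create" action only adds new documents and fails if the _id already exists; use the "update" action for a partial update.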
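
The bulk body above is newline-delimited JSON: one action line followed by one source line per document. The sketch below shows one way such a body might be generated in Python; `build_bulk_body` and the two tiny sample documents are made up for illustration.

```python
import json


def build_bulk_body(index, docs, action="index"):
    """Return an NDJSON bulk body: one action line plus one source line per document.

    `docs` maps a document id to its source dict.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({action: {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, much shorter than the news articles above.
docs = {
    1: {"title": "first", "author": "a"},
    2: {"title": "second", "author": "b"},
}
bulk_body = build_bulk_body("mydocs", docs, action="create")
print(bulk_body)
```

Switching `action` between "index" and "create" changes only the action lines, which is where the overwrite-or-fail behaviour is decided.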

3.3 Creating, deleting, reading, and updating documents

3.3.1 Create

POST mydocs/_doc/6
{
  "title": "My Sunday",
  "content": "I eat an apple",
  "date": "2023-04-11",
  "author": "Jincheng Guan"
}

3.3.4 Delete

DELETE mydocs/_doc/6

3.3.5 Read

GET mydocs/_doc/3
GET mydocs/_source/3

3.3.6 Update

POST mydocs/_update/6
{
  "doc": {
    "title": "My Saturday",
    "date": "2023-04-15"
  }
}

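3.4 Keyword search

Search for keywords in the title and in the content.

match is a full-text query: the query string is analyzed before it is matched. term is a term-level query: it looks up the exact term without analyzing it.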

GET mydocs/_search
{
  "query": {
    "match": {
      "title": "Elon Musk"
    }
  }
}

The same query against the content field:

GET mydocs/_search
{
  "query": {
    "match": {
      "content": "Elon Musk"
    }
  }
}
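
The difference between match and term can be illustrated without a cluster. The sketch below uses a deliberately rough stand-in for the standard analyzer (lowercase, then split on anything that is not an ASCII letter or digit); the real analyzer tokenizes differently in details such as apostrophes and non-ASCII letters, and all three function names are hypothetical.

```python
import re


def simple_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def match_hits(field_text, query_text):
    """match: analyze the query too; hit if any query token is indexed (default OR)."""
    indexed = set(simple_analyze(field_text))
    return any(tok in indexed for tok in simple_analyze(query_text))


def term_hits(field_text, term):
    """term: compare the query string as-is against the indexed tokens."""
    return term in set(simple_analyze(field_text))


title = "Elon Musk says he's cut about 80% of Twitter's staff"
print(match_hits(title, "Elon Musk"))  # query is analyzed into "elon", "musk"
print(term_hits(title, "Elon Musk"))   # no single indexed token equals "Elon Musk"
print(term_hits(title, "musk"))        # the lowercased token is indexed
```

In this model the match query finds the title, the term query for "Elon Musk" does not, and the term query for "musk" does. This is why term queries are usually aimed at keyword fields such as author rather than at text fields.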

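A Painless script can also be used, but script queries are evaluated per document and can be slow on large data sets, so avoid them where possible. Note that the script below reads doc['title.keyword'], which requires a keyword sub-field on title; the mapping in 3.1 defines title as text only, so that sub-field would have to be added first. Also, contains here checks whether the field's values include the exact string, not whether the title contains it as a substring.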

GET mydocs/_search
{
  "query": {
    "script": {
      "script": {
        "source": "doc['title.keyword'].contains(params.name)",
        "lang": "painless",
        "params": {
          "name": "Elon Musk"
        }
      }
    }
  }
}

3.5 Keyword search with constraints

For example, search for the keyword Elon Musk within a one-week date range.

GET mydocs/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2023-04-10",
              "lte": "2023-04-17"
            }
          }
        },
        {
          "match": {
            "title": "Elon Musk"
          }
        }
      ]
    }
  }
}
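
The date bounds in the query above are hard-coded. As a sketch of how they might be computed instead, the hypothetical helper below builds the same bool query for the `days` days ending at a given date; the function name and its parameters are assumptions made for this example.

```python
from datetime import date, timedelta


def build_recent_keyword_query(keyword, today, days=7, field="title"):
    """Build a bool query: `field` matches `keyword` AND date is within the last `days` days."""
    start = today - timedelta(days=days)
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"date": {"gte": start.isoformat(), "lte": today.isoformat()}}},
                    {"match": {field: keyword}},
                ]
            }
        }
    }


# Reproduces the 2023-04-10 to 2023-04-17 window used above.
query = build_recent_keyword_query("Elon Musk", today=date(2023, 4, 17))
print(query)
```

Passing `date.today()` as `today` would give a rolling one-week window.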

4 Using the Python client

Connect to Elasticsearch with the client:

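from elasticsearch import Elasticsearch

ELASTICSEARCH_ENDPOINT = "https://localhost:9200"
USERNAME = "elastic"
PASSWORD = "your password"
ELASTICSEARCH_CERT_PATH = "/your path/elasticsearch-8.7.0/config/certs/http_ca.crt"
API_KEY = "your api key"

# Authenticate with either basic_auth or api_key, not both
# (recent 8.x clients raise an error when more than one is set).
es = Elasticsearch(
    ELASTICSEARCH_ENDPOINT,
    ca_certs=ELASTICSEARCH_CERT_PATH,
    basic_auth=(USERNAME, PASSWORD),
    # api_key=API_KEY,  # alternative to basic_auth
    verify_certs=False,  # Python spells it False; this skips certificate verification
)

print(es.info())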

Define the index name, a query, and the sample documents used in the sections that follow:

INDEX = "mydocs"
QUERY = {
    "query": {
        "match": {
            "title": "Saturday"
        }
    }
}
doc = {
    "title": "my Saturday",
    "content": "I eat two apples",
    "date": "2023-04-14",
    "author": "jcg"
}
update_doc = {
    "doc": {
        "title": "My sunday"
    }
}

4.1 index

# index the sample document under id 6
res = es.index(index=INDEX, id=6, body=doc)
print(res)

4.2 search

# search with the match query defined above
res = es.search(index=INDEX, body=QUERY)
print(res)

4.3 update

# partially update document 6, then fetch it to check the result
res = es.update(index=INDEX, id=6, body=update_doc)
res = es.get(index=INDEX, id=6)
print(res)

4.4 delete

# delete document 6
res = es.delete(index=INDEX, id=6)

4.5 helpers.bulk

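One problem with the native bulk API is that all the data has to be loaded into memory before it can be indexed. With a large data set this can cause problems and is inefficient. To address this, we can use the bulk helper, which indexes Elasticsearch documents from iterators or generators. It therefore does not need to load all the data into memory first, which makes it very memory-efficient.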

# test bulk: create a small index to load documents into

mappings = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text"
            }
        }
    }
}

res = es.indices.create(index="mytest", body=mappings)


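from elasticsearch import helpers

# each action names its target index and carries the document in _source
bulk_data = [
    {"_index": "mytest", "_source": {"title": "go home", "content": "I go home"}},
    {"_index": "mytest", "_source": {"title": "go to sleep", "content": "I go to sleep"}}
]
res = helpers.bulk(es, bulk_data)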
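
The list above is still built entirely in memory. To get the memory benefit described earlier, the actions can come from a generator instead. A minimal sketch, assuming the documents arrive as one JSON object per line; `generate_actions` and the file name `docs.jsonl` in the closing comment are hypothetical.

```python
import json


def generate_actions(lines, index):
    """Yield one bulk action per JSON line, without building a list first."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield {"_index": index, "_source": json.loads(line)}


# Any iterable of JSON strings works here, for example an open file object.
sample_lines = [
    '{"title": "go home", "content": "I go home"}',
    "",
    '{"title": "go to sleep", "content": "I go to sleep"}',
]
actions = generate_actions(sample_lines, "mytest")
first = next(actions)
print(first)

# With a connected client `es`, the generator could be passed directly, e.g.:
# helpers.bulk(es, generate_actions(open("docs.jsonl"), "mytest"))
```

Because the generator yields one action at a time, only the current line needs to be held in memory on the client side.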

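5 Chinese analyzer

5.1 Installation

https://github.com/medcl/elasticsearch-analysis-ik

Download the release that matches your Elasticsearch version, unzip it into the plugins folder, and restart Elasticsearch.

5.2 Usage

Specify the analyzer when defining the mappings: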

PUT mydocs_chinese
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

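analyzer is the analyzer applied when documents are indexed; the default is standard.

search_analyzer is the analyzer applied to the query string at search time.

ik_max_word splits text at the finest granularity; ik_smart splits it at the coarsest granularity.

Searching works the same way as for English.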
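
To compare ik_max_word and ik_smart on the same text, one option is the _analyze API. The sketch below only builds the two request bodies; actually sending them requires a running cluster with the IK plugin installed. `build_analyze_bodies` is a hypothetical helper, and the sample text is an arbitrary choice.

```python
def build_analyze_bodies(text, analyzers=("ik_max_word", "ik_smart")):
    """Return one _analyze request body per analyzer for the same text."""
    return {name: {"analyzer": name, "text": text} for name in analyzers}


bodies = build_analyze_bodies("中华人民共和国国歌")
for name, body in bodies.items():
    print(name, body)

# With a connected client `es`, each body could be sent with, for example:
# es.indices.analyze(body=body)
```

Comparing the token lists returned for the two bodies shows how much finer ik_max_word splits than ik_smart.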


Source: http://gjc2.github.io/2023/04/15/elasticsearch-使用/
Author: gjc
Published: April 15, 2023