Integrating Elasticsearch with Python

Published: 2025-04-06 15:45:07

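What is Elasticsearch

Finding anything in data means searching, and searching means a search engine. Baidu and Google are enormous, highly complex search engines that index nearly every publicly accessible page and piece of data on the internet. For our own business data we do not need anything that elaborate. If we want our own search engine that makes data easy to store and search, Elasticsearch is a natural choice: it is a full-text search engine that can store, search, and analyze large volumes of data quickly.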

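Why use Elasticsearch

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library.

So what is Lucene? Lucene is arguably the most advanced, high-performance, full-featured search engine library available today, open source or proprietary. But it is only a library. To use it we have to write Java and include the Lucene package, and we need a fair understanding of information retrieval to grasp how it works. In short, it is not easy to use.

Elasticsearch was created to solve this problem. It is also written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text search simple: it wraps Lucene and exposes a simple, consistent RESTful API for storing and retrieving data.

So is Elasticsearch just a simplified wrapper around Lucene? Not at all. Elasticsearch is more than Lucene and more than a full-text search engine. It can be accurately described as:

·A distributed real-time document store in which every field can be indexed and searched

·A distributed search engine with real-time analytics

·Able to scale to hundreds of server nodes and to handle petabytes of structured or unstructured data

In short, it is an excellent search engine; Wikipedia, Stack Overflow, and GitHub all use it for search.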

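Installing Elasticsearch

Elasticsearch can be downloaded from the official website: https://www.elastic.co/downloads/elasticsearch. Installation instructions are available there as well.

Download and unpack the package, then run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows) to start Elasticsearch.

I am on a Mac, where I recommend installing with Homebrew:

brew install elasticsearch

Elasticsearch listens on port 9200 by default. Open http://localhost:9200/ in a browser and you should see something like this: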

{
  "name": "atntrTf",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "g2h1xdv5，e64hkjgttp6g",
  "version": {
    "number": "6.2.4",
    "build_hash": "ccec39f",
    "build_date": "2018-04-12T20:37:28.497551Z",
    "build_snapshot": false,
    "lucene_version": "7.2.1",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

If you see this output, Elasticsearch is installed and running. It shows that my Elasticsearch version is 6.2.4. The version matters, because any plugins installed later have to match it.
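Since plugin versions have to match the server version, it can help to read the version programmatically. The sketch below parses the version.number field of the root-endpoint response shown above. The names parse_version and fetch_root are hypothetical helpers, and the URL assumes a default local install; fetch_root is defined but not called, so the example runs without a server.

```python
import json
from urllib.request import urlopen


def parse_version(info):
    """Return the server version as a tuple of ints, e.g. (6, 2, 4)."""
    return tuple(int(part) for part in info["version"]["number"].split("."))


def fetch_root(url="http://localhost:9200/"):
    """Fetch and decode the JSON document served at the root endpoint."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# A trimmed copy of the response above, so the example runs without a server.
sample = {"version": {"number": "6.2.4", "lucene_version": "7.2.1"}}
print(parse_version(sample))
```

With a node running, parse_version(fetch_root()) would give the live version instead of the sample.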

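Next, let's look at the basic concepts of Elasticsearch and how to use it from Python.

Elasticsearch concepts

Elasticsearch has a few basic concepts, such as nodes, indexes, and documents. We explain each below; understanding them makes working with Elasticsearch much easier.

Node and Cluster

Elasticsearch is essentially a distributed database that lets multiple servers work together, and each server can run multiple Elasticsearch instances.

A single Elasticsearch instance is called a node. A group of nodes forms a cluster.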

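Index

Elasticsearch indexes every field and, after processing, writes it to an inverted index. Searches look up this index directly.

So the top-level unit of data management in Elasticsearch is called an Index. It corresponds to the notion of a database in MySQL, MongoDB, and similar systems. Note also that the name of every Index (i.e., database) must be lowercase.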
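The lowercase rule is easy to check before calling the API. This is a minimal sketch that covers only the lowercase requirement stated above; Elasticsearch enforces further naming restrictions that this hypothetical helper, is_lowercase_index_name, does not check.

```python
def is_lowercase_index_name(name):
    """Return True if the name is non-empty and has no uppercase characters."""
    return bool(name) and name == name.lower()


for candidate in ["news", "News", "weather_2018"]:
    print(candidate, is_lowercase_index_name(candidate))
```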

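Document

A single record inside an Index is called a Document. Many Documents make up an Index.

Documents are represented in JSON.

Documents in the same Index are not required to share the same structure (schema), but keeping it consistent is recommended because it helps search efficiency.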
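As an illustration, a Document can be pictured as a Python dict serialized to JSON. The field names and values below are made up for this sketch and are not taken from a real index.

```python
import json

# A hypothetical document: a flat set of field/value pairs.
document = {
    "title": "Example headline",
    "url": "http://example.com/news/1",
    "date": "2011-12-16",
}

# Serialize the dict to the JSON text that would be sent as the document body.
document_json = json.dumps(document, indent=2)
print(document_json)
```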

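Type

Documents can be grouped. In a weather Index, for example, they could be grouped by city (Beijing, Shanghai) or by climate (sunny, rainy). Such a group is called a Type: a virtual, logical grouping used to filter Documents, similar to a table in MySQL or a Collection in MongoDB.

Different Types should have similar schemas. For example, the id field cannot be a string in one Type and a number in another; this is one way Types differ from tables in a relational database. Data of entirely different kinds (such as products and logs) should be stored in two Indexes rather than as two Types in one Index (although that is possible).

According to the roadmap, Elastic 6.x allows only one Type per Index, and 7.x was planned to remove Types entirely (in the event, 7.x deprecated them and 8.x removed them).

Fields

Each Document is a JSON-like structure containing a number of fields, each with a corresponding value; together the fields make up the Document. Fields are comparable to the columns of a MySQL table.

In Elasticsearch, documents belong to a type, and types live in an index. A simple side-by-side comparison with a traditional relational database:

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields

Those are the basic concepts of Elasticsearch; comparing them with a relational database makes them easier to grasp.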

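Connecting Python to Elasticsearch

Elasticsearch exposes a set of RESTful APIs for access and querying, and these can be driven with commands such as curl. The command line is not very convenient, though, so here we go straight to using Elasticsearch from Python.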
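To show what a client library wraps, the sketch below builds the HTTP request for a search by hand using only the standard library. build_search_request is a hypothetical helper, news is just an example index name, and the base URL assumes a default local install. The request is constructed but not sent.

```python
import json
from urllib.request import Request


def build_search_request(index, query, base_url="http://localhost:9200"):
    """Build (but do not send) a search request for the given index."""
    body = json.dumps(query).encode("utf-8")
    return Request(
        url="{}/{}/_search".format(base_url, index),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("news", {"query": {"match_all": {}}})
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it to a running node; the Python client described next takes care of this plumbing.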

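The Python library for Elasticsearch has the same name, and installing it is simple:

pip3 install elasticsearch

The client's major version should match the server's. For the 6.2.4 server used here that means a 6.x client, for example by pinning the requirement as "elasticsearch>=6.0.0,<7.0.0".

The official documentation is at https://elasticsearch-py.readthedocs.io/. All usage is covered there, and the rest of this article is based on it.

Creating an Index

Let's see how to create an Index. Here we create one named news:

from elasticsearch import Elasticsearch

# Connect to the local node (localhost:9200 by default)
es = Elasticsearch()
# ignore=400 suppresses the "index already exists" error
result = es.indices.create(index='news', ignore=400)
print(result)

If creation succeeds, the following is returned:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

The result is the parsed JSON response, in which the acknowledged field indicates that the create operation succeeded.

If we run the code again, however, the following is returned: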

{'error': {'root_cause': [{'type': 'resource_already_exists_exception',
                           'reason': 'index [news/QM6yz2W8QE-bflkhc5oThw] already exists',
                           'index_uuid': 'QM6yz2W8QE-bflkhc5oThw',
                           'index': 'news'}],
           'type': 'resource_already_exists_exception',
           'reason': 'index [news/QM6yz2W8QE-bflkhc5oThw] already exists',
           'index_uuid': 'QM6yz2W8QE-bflkhc5oThw',
           'index': 'news'},
 'status': 400}

This indicates that creation failed: the status code is 400, and the reason is that the Index already exists.
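Because the call above passes ignore=400, the error arrives as an ordinary return value rather than an exception, so the caller has to inspect it. A small sketch, assuming the response shape shown above; already_exists is a hypothetical helper name.

```python
def already_exists(result):
    """Return True if the response reports that the index already exists."""
    error = result.get("error")
    if not isinstance(error, dict):
        return False
    return (
        result.get("status") == 400
        and error.get("type") == "resource_already_exists_exception"
    )


# Trimmed copies of the two responses shown above.
created = {"acknowledged": True, "shards_acknowledged": True, "index": "news"}
failed = {
    "error": {"type": "resource_already_exists_exception", "index": "news"},
    "status": 400,
}
print(already_exists(created), already_exists(failed))
```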

Note that the code passes the ignore parameter with the value 400. If the response status is 400, the error is ignored, so the program does not raise an exception and keeps running.

If we leave out the ignore parameter:

es = Elasticsearch()
result = es.indices.create(index='news')
print(result)

then running it again raises an error:

raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'resource_already_exists_exception', 'index [news/QM6yz2W8QE-bflkhc5oThw] already exists')

That interrupts the program, so it is worth using the ignore parameter to suppress expected errors and keep the program running.
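An alternative to ignoring the error is to check for the index first. The sketch below assumes a client object exposing indices.exists() and indices.create(), as the elasticsearch library does. ensure_index is a hypothetical helper, and FakeClient is a tiny in-memory stand-in so the example runs without a server. Note that check-then-create is not atomic, so concurrent callers can still hit the 400 error.

```python
class FakeIndices:
    """Stand-in for es.indices that keeps index names in memory."""

    def __init__(self):
        self.names = set()

    def exists(self, index):
        return index in self.names

    def create(self, index):
        self.names.add(index)
        return {"acknowledged": True, "index": index}


class FakeClient:
    """Stand-in for an Elasticsearch client (an assumption for this demo)."""

    def __init__(self):
        self.indices = FakeIndices()


def ensure_index(es, name):
    """Create the index only if it is missing; return True if it was created."""
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name)
    return True


client = FakeClient()
print(ensure_index(client, "news"))  # first call creates the index
print(ensure_index(client, "news"))  # second call finds it and does nothing
```

With a real cluster, an Elasticsearch() instance would be passed in place of FakeClient().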

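Deleting an Index

Deleting an Index is similar:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Ignore 400 and 404 so a missing index does not raise an exception
result = es.indices.delete(index='news', ignore=[400, 404])
print(result)

The ignore parameter is used here as well, so that the program is not interrupted when deletion fails because the Index does not exist.

If deletion succeeds, the output is:

{'acknowledged': True}

If the Index has already been deleted, running the delete again outputs: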

{'error': {'root_cause': [{'type': 'index_not_found_exception',
                           'reason': 'no such index',
                           'resource.type': 'index_or_alias',
                           'resource.id': 'news',
                           'index_uuid': '_na_',
                           'index': 'news'}],
           'type': 'index_not_found_exception',
           'reason': 'no such index',
           'resource.type': 'index_or_alias',
           'resource.id': 'news',
           'index_uuid': '_na_',
           'index': 'news'},
 'status': 404}

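This result shows that the Index does not exist and the deletion failed. The response is again JSON, with status code 404, but because 404 is in the list passed to ignore, the program prints the JSON result normally instead of raising an exception.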
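Since both outcomes come back as plain dicts, a short summary function makes logs easier to read. A sketch assuming the two response shapes shown above; describe is a hypothetical helper.

```python
def describe(result):
    """Summarize an index API response as a short string."""
    if "error" not in result:
        return "ok"
    error = result["error"]
    return "{} {}: {}".format(
        result.get("status"), error.get("type"), error.get("reason")
    )


# Trimmed copies of the two responses shown above.
deleted = {"acknowledged": True}
missing = {
    "error": {
        "type": "index_not_found_exception",
        "reason": "no such index",
        "index": "news",
    },
    "status": 404,
}
print(describe(deleted))
print(describe(missing))
```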

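Inserting data

Like MongoDB, Elasticsearch lets you insert structured dictionary data directly. Inserting is done with the create() method. For example, here we insert a news item:

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index='news', ignore=400)
# The title is a Chinese news headline ("Did the US leave Iraq a mess?")
data = {'title': '美国留给伊拉克的是烂摊子吗', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}
result = es.create(index='news', doc_type='politics', id=1, body=data)
print(result)

Here we first define a news item with a title and a link, then call create() to insert it. We pass four arguments to create(): index is the index name, doc_type is the document type, body is the document content, and id is the unique identifier of the document.

The output is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

The result field is created, which means the document was inserted successfully.

We can also insert data with index(). The difference is that create() requires an id to uniquely identify the document, while index() does not; if no id is given, one is generated automatically. A call to index() looks like this:

es.index(index='news', doc_type='politics', body=data)

create() is effectively a wrapper around index(): it performs the same indexing operation but fails if a document with that id already exists.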
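The choice between the two methods can be wrapped in one helper. This is a sketch, assuming a client with the create() and index() methods used above. insert_document and RecordingClient are hypothetical names, and the stand-in client only records calls so the example runs without a server.

```python
class RecordingClient:
    """Stand-in client that records which method was called."""

    def __init__(self):
        self.calls = []

    def create(self, index, doc_type, id, body):
        self.calls.append(("create", id))
        return {"_id": str(id), "result": "created"}

    def index(self, index, doc_type, body, id=None):
        self.calls.append(("index", id))
        # A real server would generate a random id here.
        return {"_id": "generated-id", "result": "created"}


def insert_document(es, index, doc_type, body, doc_id=None):
    """Use create() when an id is supplied, otherwise let index() assign one."""
    if doc_id is not None:
        result = es.create(index=index, doc_type=doc_type, id=doc_id, body=body)
    else:
        result = es.index(index=index, doc_type=doc_type, body=body)
    return result["_id"]


client = RecordingClient()
data = {"title": "Example headline", "url": "http://example.com/news/1"}
print(insert_document(client, "news", "politics", data, doc_id=1))
print(insert_document(client, "news", "politics", data))
print(client.calls)
```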

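Updating data

Updating data is also simple: we specify the document's id and the new content and call update(). The changed fields are passed under a doc key in the body:

from elasticsearch import Elasticsearch

es = Elasticsearch()
data = {
    'title': '美国留给伊拉克的是烂摊子吗',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
    'date': '2011-12-16'
}
# update() expects the fields to change wrapped under a 'doc' key
result = es.update(index='news', doc_type='politics', body={'doc': data}, id=1)
print(result)

Here we add a date field to the document and then call update(). The result is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}

In the response, result is updated, meaning the update succeeded. Note the _version field, which is the version number after the update: 2 means this is the second version. The document was inserted once before, so that first insert was version 1 (see the output of the previous example). After this update the version becomes 2, and every later update increments it by 1.

The same change can also be made with index():

es.index(index='news', doc_type='politics', body=data, id=1)

As you can see, index() covers both operations: if the document does not exist it is inserted, and if it already exists it is replaced with the new content, which is very convenient.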
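The difference between the two calls shows up when only some fields are sent. The sketch below simulates it with plain dicts: update() merges the fields given under doc into the stored document, while index() replaces the stored document. simulate_update and simulate_index are hypothetical names, and the merge is shallow, which is enough for flat documents like these.

```python
def simulate_update(stored, body):
    """Mimic update(): merge the fields under 'doc' into the stored document."""
    merged = dict(stored)
    merged.update(body["doc"])
    return merged


def simulate_index(stored, body):
    """Mimic index() on an existing id: the new body replaces the old one."""
    return dict(body)


stored = {"title": "Example headline", "url": "http://example.com/news/1"}
print(simulate_update(stored, {"doc": {"date": "2011-12-16"}}))
print(simulate_index(stored, {"date": "2011-12-16"}))
```

The first call keeps title and url and adds date; the second leaves only date, which is why index() should be given the complete document.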

Deleting data

To delete a document, call delete() with the id of the document to remove:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.delete(index='news', doc_type='politics', id=1)
print(result)

The output is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

In the output, result is deleted, meaning the deletion succeeded, and _version has become 3, incremented by 1 again.

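Querying data

The operations above are all simple and could be done with an ordinary database such as MongoDB, so they are nothing special. What sets Elasticsearch apart is its extremely powerful search capability.

For Chinese text we need to install a word-segmentation plugin, elasticsearch-analysis-ik (GitHub: https://github.com/medcl/elasticsearch-analysis-ik). We install it with elasticsearch-plugin, another command-line tool that ships with Elasticsearch. The version installed here is 6.2.4; make sure it matches your Elasticsearch version. The command is:

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

Replace the version number here with the version of your own Elasticsearch.

After installation, restart Elasticsearch and it will load the installed plugin automatically.

First, we create a new index and specify which fields need word segmentation: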

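from elasticsearch import Elasticsearch

es = Elasticsearch()
# Analyze the title field with the IK analyzer at index time and search time
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
# Start from a clean index, then attach the mapping to the politics type
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

Here we first delete the previous index, then create a new one, then update its mapping. The mapping names the field to be analyzed: title has type text, and both the analyzer and the search analyzer (search_analyzer) are ik_max_word, i.e., the Chinese analysis plugin we just installed. If no analyzer is specified, the default standard analyzer is used, which splits Chinese text into single characters.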
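When several fields need the same analyzer, the mapping can be generated instead of written out by hand. A sketch assuming the mapping layout shown above; build_text_mapping is a hypothetical helper, and the content field is made up for the example.

```python
def build_text_mapping(fields, analyzer="ik_max_word"):
    """Return a mapping body that analyzes each named field with `analyzer`."""
    return {
        "properties": {
            field: {
                "type": "text",
                "analyzer": analyzer,
                "search_analyzer": analyzer,
            }
            for field in fields
        }
    }


mapping = build_text_mapping(["title", "content"])
print(mapping["properties"]["title"])
```

The returned dict can be passed as the body argument of put_mapping() in the same way as the hand-written mapping.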

Next, we insert a few new documents:

datas = [
    {
        'title': '美国留给伊拉克的是烂摊子吗',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '公安部：各地校车将享受最高路权',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韩渔警冲突调查:韩警平均每天扣1艘中国渔船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中国驻洛杉矶领事馆被亚洲男子枪击，嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]
# Index each document; Elasticsearch generates the ids
for data in datas:
    es.index(index='news', doc_type='politics', body=data)

Here we define four documents, each with title, url, and date fields, and insert them into Elasticsearch with index(). The index name is news and the type is politics.
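Indexing one call at a time is fine for four documents. For larger batches, the library's helpers.bulk() function accepts an iterable of action dicts. The sketch below only builds those actions, so it runs without a server. bulk_actions is a hypothetical helper, and the _type key reflects the 6.x API used in this article.

```python
def bulk_actions(docs, index, doc_type):
    """Yield one bulk action per document, in the shape helpers.bulk() accepts."""
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}


docs = [
    {"title": "First example", "date": "2011-12-16"},
    {"title": "Second example", "date": "2011-12-17"},
]
actions = list(bulk_actions(docs, "news", "politics"))
print(len(actions), actions[0]["_index"])
```

With a running node, the generator would be handed to the client library, along the lines of helpers.bulk(es, bulk_actions(datas, 'news', 'politics')) after from elasticsearch import helpers.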

Next, we query the data. Calling search() without a query body returns every document:

result = es.search(index='news', doc_type='politics')
print(result)

All four inserted documents come back:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05g9mQBD9BUE5fdHOT",
        "_score": 1.0,
        "_source": {
          "title": "美国留给伊拉克的是烂摊子吗",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5g9mQBD9BUE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "中国驻洛杉矶领事馆被亚洲男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "du5g9mQBD9BUE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "de5g9mQBD9BUE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "公安部：各地校车将享受最高路权",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}

As shown, the results appear in the hits field, where total gives the number of matching documents and max_score gives the highest relevance score.
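Pulling the documents out of the response is a matter of walking the hits field. A sketch under the response shape shown above; total_hits and extract_sources are hypothetical helpers. total_hits also accepts the object form ({"value": ...}) that Elasticsearch 7 and later return for hits.total.

```python
def total_hits(result):
    """Return the hit count, handling both the 6.x int and 7.x+ object forms."""
    total = result["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def extract_sources(result):
    """Return the _source dict of every hit, in response order."""
    return [hit["_source"] for hit in result["hits"]["hits"]]


# A trimmed response with the same structure as the output above.
response = {
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_id": "a", "_score": 1.0, "_source": {"title": "First example"}},
            {"_id": "b", "_score": 1.0, "_source": {"title": "Second example"}},
        ],
    }
}
print(total_hits(response), [doc["title"] for doc in extract_sources(response)])
```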

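We can also run a full-text search, which is where Elasticsearch shows its search-engine nature:

import json

from elasticsearch import Elasticsearch

# Full-text match on the title field
dsl = {
    'query': {
        'match': {
            'title': '中国领事馆'
        }
    }
}
es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

Here we use the query DSL supported by Elasticsearch: match specifies a full-text search on the title field for the text "中国领事馆" ("Chinese consulate"). The result is: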

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5g9mQBD9BUE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "中国驻洛杉矶领事馆被亚洲男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "du5g9mQBD9BUE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

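There are two matches. The first scores 2.54 and the second 0.28. The first document contains both "中国" (China) and "领事馆" (consulate). The second does not contain "领事馆" but does contain "中国", so it is also retrieved, with a lower score.

So the specified field is searched as full text and the results are ranked by their relevance to the search keywords, which is the basic shape of a search engine.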
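The ranking can be read straight from the _score of each hit. A sketch, assuming the response shape shown above; ranked_titles is a hypothetical helper, and the sample reuses the two scores from the output with placeholder titles.

```python
def ranked_titles(result):
    """Return (score, title) pairs sorted from most to least relevant."""
    pairs = [
        (hit["_score"], hit["_source"]["title"])
        for hit in result["hits"]["hits"]
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


response = {
    "hits": {
        "hits": [
            {"_score": 0.2876821, "_source": {"title": "contains only one term"}},
            {"_score": 2.546152, "_source": {"title": "contains both terms"}},
        ]
    }
}
for score, title in ranked_titles(response):
    print(score, title)
```

Elasticsearch already returns hits in descending score order by default, so the explicit sort mainly matters when hits from several responses are combined.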

Elasticsearch also supports many other kinds of queries; see the official documentation for details: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html
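As one example of combining query types, the sketch below builds a bool query that pairs a match clause on title with a range filter on date. match_in_date_range is a hypothetical helper. It assumes date is mapped as a date field, which Elasticsearch's dynamic mapping normally infers for values such as 2011-12-16.

```python
def match_in_date_range(text, start, end):
    """Build a query DSL body: full-text match on title, filtered by date."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"range": {"date": {"gte": start, "lte": end}}}],
            }
        }
    }


dsl = match_in_date_range("中国领事馆", "2011-12-17", "2011-12-18")
print(dsl["query"]["bool"]["filter"])
```

The resulting dict would be passed as the body argument of search(), in the same way as the match query earlier.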

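That covers a basic introduction to Elasticsearch and the basics of using it from Python. These are only its basic features; there is much more to explore, and more updates will follow.

Code for this section: https://github.com/Germey/ElasticSearch.

Recommended resources

A few good sites for further study:

Elasticsearch: The Definitive Guide (Chinese edition): https://es.xiaoleilu.com/index.html

Full-text search engine Elasticsearch, a getting-started tutorial: http://www.ruanyifeng.com/blog/2017/08/elasticsearch.html

Elastic Chinese community: https://www.elasticsearch.cn/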
