lxml python
lxml is a feature-rich and easy-to-use Python library for processing XML and HTML data. Python scripts are often written for tasks such as web scraping and parsing XML. In this lesson, we will study the Python lxml library and how we can use it to parse XML data and perform web scraping.
We can start using lxml by installing it as a Python package with the pip tool:
pip install lxml
Once the library is installed, we can get started with simple examples.
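To confirm that the installation worked, a quick check along the following lines can help. This is a minimal sketch that assumes only that lxml installed successfully; LXML_VERSION is the version tuple exposed by lxml.etree.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml version.
lxml_version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in lxml_version))
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.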
With lxml, we can create HTML elements as well. Elements are also called nodes. Let’s create the basic structure of an HTML page using just the library:
from lxml import etree

root_elem = etree.Element('html')
etree.SubElement(root_elem, 'head')
etree.SubElement(root_elem, 'title')
etree.SubElement(root_elem, 'body')
print(etree.tostring(root_elem, pretty_print=True).decode("utf-8"))
When we run this script, we can see the HTML elements, or nodes, being formed: an html root containing empty head, title, and body elements. The pretty_print parameter prints an indented version of the HTML document.
These HTML elements behave like a list, so we can access the children by index:
html = root_elem[0]
print(html.tag)
This prints just head, as that is the first tag directly inside the html tag. We can also print all elements inside the root tag:
for element in root_elem:
    print(element.tag)
This prints the tag of every child element: head, title, and body.
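Because an element behaves like a list of its children, the usual sequence operations apply to it. The sketch below rebuilds the same small tree and tries len(), indexing, slicing, and iteration; the variable names are only illustrative.

```python
from lxml import etree

# Rebuild the same small tree used above.
root_elem = etree.Element("html")
etree.SubElement(root_elem, "head")
etree.SubElement(root_elem, "title")
etree.SubElement(root_elem, "body")

child_count = len(root_elem)                       # number of direct children
first_tag = root_elem[0].tag                       # positive index
last_tag = root_elem[-1].tag                       # negative index
later_tags = [el.tag for el in root_elem[1:]]      # a slice is a list of elements
all_tags = [el.tag for el in root_elem]            # plain iteration

print(child_count, first_tag, last_tag, later_tags, all_tags)
```

Only direct children are counted and iterated this way; deeper descendants are not included.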
With the iselement() function, we can check whether a given object is a valid element object:
print(etree.iselement(root_elem))
Here we reuse root_elem from the last script we wrote. This prints True.
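To see what iselement() actually tests, it helps to pass it things that are not elements. The following sketch compares a real element with a plain string and with None; note that the function checks the type of the object, not whether the tag name is legal HTML.

```python
from lxml import etree

root_elem = etree.Element("html")

checks = {
    "element": etree.iselement(root_elem),   # a real Element object
    "string": etree.iselement("<html/>"),    # markup in a string is not an Element
    "none": etree.iselement(None),           # neither is None
}
print(checks)
```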
We can attach metadata to each HTML element we construct by adding attributes to it:
from lxml import etree

html_elem = etree.Element("html", lang="en_GB")
print(etree.tostring(html_elem))
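When we run this, we see the element serialized together with its lang attribute. Because tostring() returns bytes by default, the result is printed as a bytes literal.
Now we can access these attributes as follows:
print(html_elem.get("lang"))
The value is printed to the console. Note that if an attribute does not exist on the given HTML element, we get None as output.
We can also set attributes for an HTML element as: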
html_elem.set("best", "JournalDev")
print(html_elem.get("best"))
When we print the value, we get the expected result, JournalDev.
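The attribute calls above can be combined into one short sketch. It also shows two things not used so far: a default value passed to get(), and the attrib mapping that holds all attributes of an element. The attribute name "dir" is only an illustration of a missing attribute.

```python
from lxml import etree

html_elem = etree.Element("html", lang="en_GB")
html_elem.set("best", "JournalDev")

missing = html_elem.get("dir")            # attribute not set, so this is None
fallback = html_elem.get("dir", "ltr")    # second argument is returned instead of None
attributes = dict(html_elem.attrib)       # copy all attributes into a plain dict
attribute_names = sorted(html_elem.keys())

print(missing, fallback, attributes, attribute_names)
```

Passing a default to get() avoids a separate None check when an attribute is optional.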
The sub-elements we constructed above were empty, and that is no fun! Let’s make some sub-elements and put some values in them using the lxml library.
from lxml import etree

html = etree.Element("html")
etree.SubElement(html, "head").text = "Head of HTML"
etree.SubElement(html, "title").text = "I am the title!"
etree.SubElement(html, "body").text = "Here is the body"
print(etree.tostring(html, pretty_print=True).decode('utf-8'))
This looks like some healthy data. In the output, each sub-element now wraps the text we assigned to it.
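Text can be read back as easily as it is written. The sketch below is a variation on the script above, not the original: it nests title inside head, as a real HTML page would, and then reads the values with find() and findtext().

```python
from lxml import etree

html = etree.Element("html")
head = etree.SubElement(html, "head")
# Variation on the example above: title is placed inside head here.
etree.SubElement(head, "title").text = "I am the title!"
etree.SubElement(html, "body").text = "Here is the body"

title_text = html.find("head/title").text   # follow a path, then read .text
body_text = html.findtext("body")           # shortcut for find(...).text
serialized = etree.tostring(html).decode("utf-8")

print(title_text)
print(body_text)
print(serialized)
```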
We can also provide raw XML data directly to etree, which parses it into the same element structure:
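from lxml import etree

html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Here is the body</body></html>')
print(etree.tostring(html, pretty_print=True).decode('utf-8'))
The output is an indented html document containing the head, title, and body elements with their text.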
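If you want the output to include the XML declaration, that is possible too:
from lxml import etree

html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Here is the body</body></html>')
print(etree.tostring(html, xml_declaration=True).decode('utf-8'))
The output now starts with an XML declaration line, followed by the document.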
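The tostring() function takes a few keyword arguments that control what comes back. The sketch below compares three common combinations on a small document of our own; it is meant as an illustration of the options rather than a complete list.

```python
from lxml import etree

html = etree.XML("<html><body>Here is the body</body></html>")

as_bytes = etree.tostring(html)                      # bytes by default
as_text = etree.tostring(html, encoding="unicode")   # a str, no decode() needed
with_declaration = etree.tostring(html, xml_declaration=True, encoding="UTF-8")

print(type(as_bytes).__name__)
print(as_text)
print(with_declaration.decode("utf-8"))
```

Using encoding="unicode" is a convenient alternative to calling decode() on the result, as the scripts above do.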
The parse() function can be used to parse from files and file-like objects:
from lxml import etree
from io import StringIO

title = StringIO("<title>Title Here</title>")
tree = etree.parse(title)
print(etree.tostring(tree))
This prints the parsed document back as a bytes object.
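Unlike the other parsing functions shown here, parse() returns a tree object rather than an element. A minimal sketch of getting from that tree to its root element follows; it uses BytesIO instead of StringIO only to show that byte streams work as well.

```python
from io import BytesIO
from lxml import etree

source = BytesIO(b"<title>Title Here</title>")
tree = etree.parse(source)   # an ElementTree wrapping the whole document
root = tree.getroot()        # the top-level element of that document

print(root.tag)
print(root.text)
print(etree.tostring(tree))
```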
The fromstring() function can be used to parse strings:
from lxml import etree

title = "<title>Title Here</title>"
root = etree.fromstring(title)
print(root.tag)
This prints the tag of the root element, title.
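Strings that come from outside a program are not always well-formed, and fromstring() raises XMLSyntaxError when they are not. The sketch below wraps the call in a small helper; parse_or_none is a hypothetical name introduced here for illustration, not part of lxml.

```python
from lxml import etree


def parse_or_none(xml_text):
    """Return the root element, or None if xml_text is not well-formed XML.

    Hypothetical helper for illustration only.
    """
    try:
        return etree.fromstring(xml_text)
    except etree.XMLSyntaxError:
        return None


good = parse_or_none("<title>Title Here</title>")
bad = parse_or_none("<title>Title Here")   # closing tag is missing

print(good.tag)
print(bad)
```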
The XML() function can be used to write XML literals directly into the source:
from lxml import etree

title = etree.XML("<title>Title Here</title>")
print(title.tag)
print(etree.tostring(title))
This prints the tag name, title, followed by the serialized element as a bytes object.
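The introduction mentions web scraping, and the element API shown so far carries over to HTML pages. As a rough sketch, the lxml.html module can parse a page and xpath() can pull values out of it. The page below is an inline sample invented for this example; in real scraping the markup would come from an HTTP response.

```python
from lxml import html

# Invented sample markup standing in for a downloaded page.
page_source = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

document = html.fromstring(page_source)
page_title = document.findtext(".//title")
# Collect the text and target of every anchor element in the page.
links = [(a.text, a.get("href")) for a in document.xpath("//a")]

print(page_title)
print(links)
```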
Reposted from: http://zymzd.baihongyu.com/