[Scrapy] How to Use XPath - yukking3’s blog

Chapters covered in this post

Section 4: XPath Syntax

　11. Using XPath with Scrapy

　12. Tools to Easily Get XPath

Goal

Learn how to use XPath.

Preparation

Copy the following:

html_doc = '''
<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with <a href="#">link</a></h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
'''

In [1]: from scrapy.selector import Selector

In [2]: %paste
html_doc = '''
<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with <a href="#">link</a></h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
'''

## -- End pasted text --

In [3]: sel = Selector(text=html_doc)

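① Get the title. Note that this XPath returns the whole title element, tags included; append /text() to the path to get only the text.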

In [4]: sel.xpath('/html/head/title').extract()
Out[4]: ['<title>Title of the page</title>']

② Get only the First Paragraph. Starting the XPath with // lets you omit the full path from the root.

In [13]: sel.xpath('//p[1]').extract()
Out[13]: ['<p>First Paragraph</p>']

③ You can also use Chrome to help write the XPath.

Open the web page in Google Chrome.
Select the text portion you want to extract.
Right-click, and select "Inspect".
Right-click the HTML code you need, then select "Copy" and then "Copy XPath".
Paste the XPath into your code, test it, and edit it if necessary.
Note that this method copies an XPath based on the "id", but you can change it to use the "class" of the same portion if that works better.