[Scrapy] How to Use XPath

Chapters covered in this post

Section 4: XPath Syntax

 11. Using XPath with Scrapy

 12. Tools to Easily Get XPath

 

Goal of this post

Learn how to use XPath.

 

Setup

Copy the following:

html_doc = '''
<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with <a href="#">link</a></h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
'''

 

In [1]: from scrapy.selector import Selector

 

In [2]: %paste

html_doc = '''
<html>
  <head>
    <title>Title of the page</title>
  </head>
  <body>
    <h1>H1 Tag</h1>
    <h2>H2 Tag with <a href="#">link</a></h2>
    <p>First Paragraph</p>
    <p>Second Paragraph</p>
  </body>
</html>
'''

 

## -- End pasted text --

 

In [3]: sel = Selector(text=html_doc) 

 
① Get the title. The path below returns the whole <title> element; append /text() to the path to get only its text.

In [4]: sel.xpath('/html/head/title').extract()

Out[4]: ['<title>Title of the page</title>']

 

② Get only the First Paragraph. Using // lets you skip writing the full path from the root.

In [13]: sel.xpath('//p[1]').extract()
Out[13]: ['<p>First Paragraph</p>']

 

You can also use Chrome to help write an XPath.
  1. Open the web page in Google Chrome.
  2. Select the text portion you want to extract.
  3. Right-click, and select "Inspect".
  4. Select the HTML code you need, and select "Copy" and then "Copy XPath".
  5. Paste the XPath to your code, test, and edit it, if necessary.
  6. Note that this method copies an XPath based on the "id", but you can change it to use the "class" of the same element if that works better.