Using XPath with Scrapy Section 4

How to use XPath

Getting started

$ scrapy shell
In [2]: from scrapy.selector import Selector

Load the following file.

<html>
  <head>
    <title>Title of the page</title>
  </head>
  <body>
    <h1>H1 Tag</h1>
    <h2>H2 Tag with <a href="#">link</a></h2>
    <p>First Paragraph</p>
    <p>Second Paragraph</p>
  </body>
</html>

Try actually running it.

In [10]: sel = Selector(text=html_doc)

In [11]: sel.xpath('/html/head/title').extract()
Out[11]: ['<title>Title of the page</title>']

Other information can be retrieved as follows.

In [13]: sel.xpath('//p[1]').extract()
Out[13]: ['<p>First Paragraph</p>']

Many other parts of the page can be retrieved in the same way.

Using Google Chrome

  1. Open the web page in Google Chrome.
  2. Select the text portion you want to extract.
  3. Right-click, and select "Inspect".
  4. Right-click the HTML element you need, and select "Copy" and then "Copy XPath".
  5. Paste the XPath into your code, test it, and edit it if necessary.
  6. Note that this method copies the "id", but you can change it to the "class" of the same portion if that works better.