[Scrapy] Building a simple Scrapy spider, 1/2: collecting the target text in the shell
Chapters covered in this post
Section 3: Building Basic Spider with Scrapy
8. Scrapy Simple Spider - Part 1
9. Scrapy Simple Spider - Part 2
Goals for this post
Build a spider for quotes.toscrape.com
① Get only the text of the H1
② Get every element whose class is tag
③ Get only the tags in the right-hand sidebar (class tag-item)
Preparation
Check that Scrapy is installed
(venv_20201101) MacY:venv_20201101 macbookproy$ scrapy
Scrapy 2.4.0 - no active project

Usage:
scrapy <command> [options] [args]

Available commands:
bench Run quick benchmark test
commands
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory
Let's build a spider. First, create a project named quotes_spider.
(venv_20201101) MacY:venv_20201101 macbookproy$ scrapy startproject quotes_spider
New Scrapy project 'quotes_spider', using template directory '/Users/macbookproy/dev/venv_20201101/lib/python3.6/site-packages/scrapy/templates/project', created in:
/Users/macbookproy/dev/venv_20201101/quotes_spider

You can start your first spider with:
cd quotes_spider
scrapy genspider example example.com
(venv_20201101) MacY:venv_20201101 macbookproy$ cd quotes_spider
Next, generate a spider for quotes.toscrape.com.
(venv_20201101) MacY:quotes_spider macbookproy$ scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
{spiders_module.__name__}.{module}
This creates the spider script quotes.py in the project's spiders directory.
Start the Scrapy shell and check that quotes.toscrape.com can be fetched.
$ scrapy shell
In [1]: fetch('http://quotes.toscrape.com/')
2018-05-12 14:45:16 [scrapy.core.engine] INFO: Spider opened
2018-05-12 14:45:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)

In [2]: response
Out[2]: <200 http://quotes.toscrape.com/>
① Get only the text of the H1
In [5]: response.xpath('//h1/a/text()').extract()
Out[5]: ['Quotes to Scrape']
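The basic form is response.xpath('...'). Start the expression with // so that it matches the element anywhere in the document. Appending /text() selects the text node, and .extract() turns the selectors into a list of strings.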
In [16]: response.xpath('//h1')
Out[16]: [<Selector xpath='//h1' data='<h1>\n <a href="/" ...'>]
In [17]: response.xpath('//h1/a/text()')
Out[17]: [<Selector xpath='//h1/a/text()' data='Quotes to Scrape'>]
In [18]: response.xpath('//h1/a/text()').extract()
Out[18]: ['Quotes to Scrape']
② Get every element whose class is tag
In [15]: response.xpath('//*[@class="tag"]')
Out[15]:
[<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/change/page...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/deep-though...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/thinking/pa...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/world/page/...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/abilities/p...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/choices/pag...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/inspiration...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/life/page/1...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/live/page/1...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/miracle/pag...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/miracles/pa...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/aliteracy/p...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/books/page/...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/classic/pag...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/humor/page/...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/be-yourself...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/inspiration...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/adulthood/p...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/success/pag...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/value/page/...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/life/page/1...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/love/page/1...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/edison/page...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/failure/pag...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/inspiration...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/paraphrased...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/misattribut...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/humor/page/...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/obvious/pag...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" href="/tag/simile/page...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 28px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 26px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 26px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 24px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 22px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 14px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 10px...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 8px"...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 8px"...'>,
<Selector xpath='//*[@class="tag"]' data='<a class="tag" style="font-size: 6px"...'>]
③ Get only the tags in the right-hand sidebar
Changing the class to tags returns only 10 elements: the div that wraps the tags of each quote.
In [14]: response.xpath('//*[@class="tags"]')
Out[14]:
[<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>,
<Selector xpath='//*[@class="tags"]' data='<div class="tags">\n Tags:\n...'>]
To get only the sidebar tags, select the link text inside the elements whose class is tag-item.
In [20]: response.xpath('//*[@class="tag-item"]/a/text()').extract()
Out[20]:
['love',
'inspirational',
'life',
'humor',
'books',
'reading',
'friendship',
'friends',
'truth',
'simile']