Saturday, February 28, 2015

How to download a file with Scrapy?

1. Way 1

Save the pdf in the spider callback:

def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
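The snippet above leaves `get_path` undefined. A minimal sketch of such a helper, assuming you simply want the last segment of the URL as the file name under a local `downloads` directory (both the function and the directory name are illustrative, not part of Scrapy):

```python
import os
import re
from urllib.parse import urlparse

def get_path(url, download_dir="downloads"):
    """Map a URL to a local file path (hypothetical helper, not Scrapy API)."""
    # take the last path segment of the URL as the file name
    name = os.path.basename(urlparse(url).path) or "index"
    # replace characters that are unsafe in file names
    name = re.sub(r"[^\w.\-]", "_", name)
    return os.path.join(download_dir, name)
```

Note this naive scheme will silently overwrite files when two URLs end in the same name; hashing the URL (as Scrapy's own FilesPipeline does) avoids that.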

2. Way 2

Do it in a pipeline:

# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove the body and add the path as a reference
    del item['body']
    item['path'] = path
    # let the item be processed by other pipelines, e.g. a DB store
    return item
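Putting the pipeline half of that together, a minimal self-contained sketch might look like the class below. The class name `SavePdfPipeline`, the `store_dir` argument, and the `get_path` file-naming scheme are all assumptions for illustration; only `process_item(self, item, spider)` is the interface Scrapy actually calls:

```python
import os
from urllib.parse import urlparse

class SavePdfPipeline:
    """Hypothetical pipeline that writes item['body'] to disk."""

    def __init__(self, store_dir="downloads"):
        self.store_dir = store_dir
        # make sure the target directory exists before the crawl starts
        os.makedirs(store_dir, exist_ok=True)

    def get_path(self, url):
        # derive a file name from the last URL path segment (naive scheme)
        name = os.path.basename(urlparse(url).path) or "index.pdf"
        return os.path.join(self.store_dir, name)

    def process_item(self, item, spider):
        path = self.get_path(item['url'])
        with open(path, "wb") as f:
            f.write(item['body'])
        # drop the raw bytes, keep the path as a reference for later pipelines
        del item['body']
        item['path'] = path
        return item
```

To use it, enable the class in `ITEM_PIPELINES` in settings.py, like any other pipeline.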

3. Way 3

Use the FilesPipeline (https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ)

1) download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
and save it somewhere in your Scrapy project,
let's say at the root of your project (but that's not the best location…)

yourproject/files.py

2) then, enable this pipeline by adding this to your settings.py

ITEM_PIPELINES = [
    'yourproject.files.FilesPipeline',
]

FILES_STORE = '/path/to/yourproject/downloads'

FILES_STORE needs to point to a location where Scrapy can write (create it beforehand)

3) add 2 special fields to your item definition

    file_urls = Field()
    files = Field()

4) in your spider, when you have a URL for a file to download,
add it to your Item instance before returning it

...
myitem = YourProjectItem()
...
myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
yield myitem

5) run your spider and you should see files in the FILES_STORE folder
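By default the FilesPipeline stores each file under a `full/` subdirectory of FILES_STORE, named after the SHA-1 hash of the file URL plus the original extension. A small sketch to predict where a given URL will land (this mirrors the default naming scheme but is not Scrapy's own code):

```python
import hashlib
import os

def expected_file_name(url):
    """Predict the default FilesPipeline location: full/<sha1-of-url><ext>.

    Sketch of the default naming scheme, not Scrapy's actual implementation.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    ext = os.path.splitext(url)[1]  # keep the original file extension
    return "full/%s%s" % (digest, ext)
```

This is handy when you later need to map an item's original URL back to the file on disk.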

Here's an example that downloads a few files from the IETF website.

the scrapy project is called "filedownload"

items.py looks like this:

from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()

this is the code for the spider:

from scrapy.spider import BaseSpider
from filedownload.items import FiledownloadItem

class IetfSpider(BaseSpider):
    name = "ietf"
    allowed_domains = ["ietf.org"]
    start_urls = (
        'http://www.ietf.org/',
    )

    def parse(self, response):
        yield FiledownloadItem(
            file_urls=[
                'http://www.ietf.org/images/ietflogotrans.gif',
                'http://www.ietf.org/rfc/rfc2616.txt',
                'http://www.rfc-editor.org/rfc/rfc2616.ps',
                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
                'http://tools.ietf.org/html/rfc2616.html',
            ]
        )
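To try the spider, run it from the project root with the standard `scrapy crawl` command (the FILES_STORE path below is the example value from the settings above):

```shell
# from the project root (where scrapy.cfg lives)
scrapy crawl ietf
# downloaded files appear under FILES_STORE, in the full/ subdirectory
ls /path/to/yourproject/downloads/full
```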

Reference:

https://stackoverflow.com/questions/7123387/scrapy-define-a-pipleine-to-save-files

https://groups.google.com/forum/#!msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
