1. Way 1
Save the PDF in the spider callback:
def parse_listing(self, response):
    # ... extract PDF URLs
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
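The get_path helper is referenced above but never defined. A minimal sketch of what it could look like (the signature and the "downloads" directory are assumptions, not part of the original answer):

```python
import os
from urllib.parse import urlparse

def get_path(url, store_dir="downloads"):
    # Hypothetical helper: build a local file path from the last
    # segment of the URL, falling back to a fixed name when the URL
    # has no usable path component.
    name = os.path.basename(urlparse(url).path) or "unnamed.pdf"
    return os.path.join(store_dir, name)
```

Note that this keeps only the last URL segment, so two URLs ending in the same filename would collide; hashing the full URL (as Scrapy's own FilesPipeline does) avoids that.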
2. Way 2
Do it in a pipeline:
# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove the body and keep the path as a reference
    del item['body']
    item['path'] = path
    # let the item be processed by other pipelines, e.g. a db store
    return item
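For Way 2 to take effect, the pipeline class also has to be registered in settings.py. A sketch, assuming the pipeline lives in yourproject/pipelines.py under the name SavePdfPipeline (both the module path and the class name here are made-up examples):

```python
# settings.py -- register the hypothetical SavePdfPipeline
# (the module path and class name are assumptions for illustration)
ITEM_PIPELINES = ['yourproject.pipelines.SavePdfPipeline']
```

Without this setting Scrapy never calls process_item, and the spider's items are simply dropped after scraping.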
3. Way 3
Use the FilesPipeline (https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ):
1) Download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
and save it somewhere in your Scrapy project,
say at the root of your project (though that's not the best location…):
yourproject/files.py
2) Then enable this pipeline by adding the following to your settings.py:

ITEM_PIPELINES = ['yourproject.files.FilesPipeline']
FILES_STORE = '/path/to/yourproject/downloads'

FILES_STORE needs to point to a location where Scrapy can write (create the directory beforehand).
3) Add 2 special fields to your item definition:

file_urls = Field()
files = Field()
4) In your spider, when you have a URL for a file to download,
add it to your Item instance before returning it:

...
myitem = YourProjectItem()
...
myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
yield myitem
5) Run your spider, and you should see the files appear in the FILES_STORE folder.
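Note that the FilesPipeline does not keep the original file names: by default each file is stored under FILES_STORE in a full/ subfolder, named after the SHA-1 hash of its URL plus the original extension. A small sketch to predict where a given URL will land (the helper name is an assumption, and the naming scheme is Scrapy's default behaviour, which could change between versions):

```python
import hashlib
import os

def expected_file_path(url):
    # Mirror FilesPipeline's default naming: full/<sha1(url)><ext>
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    media_ext = os.path.splitext(url)[1]
    return "full/%s%s" % (media_guid, media_ext)

print(expected_file_path("http://www.ietf.org/rfc/rfc2616.txt"))
```

This hash-based naming is what lets the pipeline deduplicate downloads of the same URL across runs.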
Here's an example that downloads a few files from the IETF website;
the Scrapy project is called "filedownload".
items.py looks like this:
from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()
This is the code for the spider:
from scrapy.spider import BaseSpider
from filedownload.items import FiledownloadItem

class IetfSpider(BaseSpider):
    name = "ietf"
    allowed_domains = ["ietf.org"]
    start_urls = (
        'http://www.ietf.org/',
    )

    def parse(self, response):
        yield FiledownloadItem(
            file_urls=[
                'http://www.ietf.org/images/ietflogotrans.gif',
                'http://www.ietf.org/rfc/rfc2616.txt',
                'http://www.rfc-editor.org/rfc/rfc2616.ps',
                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
                'http://tools.ietf.org/html/rfc2616.html',
            ]
        )
Reference:
https://stackoverflow.com/questions/7123387/scrapy-define-a-pipleine-to-save-files
https://groups.google.com/forum/#!msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ