[Never Stop Learning][Python] Crawling Behind a CDN with cfscrape & Directions for Improvement

GDST

7 years ago


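With the crawler for site one finished, the next target was site two. Site two sits behind Cloudflare's CDN, however, so a crawler cannot fetch its pages directly. A Google search turned up cfscrape, a package that gets a crawler past Cloudflare's anti-bot check, and better still, it is written in Python, the language I know best.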

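As its GitHub page explains, after installing cfscrape with pip you also need to install Node.js for it to run correctly. Below is the code for this post; note that cfscrape.create_scraper().get takes the place of requests.get.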


import csv

from bs4 import BeautifulSoup
import cfscrape  # gets past Cloudflare's anti-bot check

scraper = cfscrape.create_scraper()  # used in place of requests

### Open each post and collect its URLs
DL = []
def get_dl_link(link):
    dl = []
    response = scraper.get(link)
    soup = BeautifulSoup(response.text, 'lxml')
    entry = soup.find('div', 'entry')
    if entry is None:  # unexpected page layout, skip this post
        return
    articles = entry.find_all('p')
    if not articles:
        return

    title = articles[0].getText().split("\n")[0]  # Title
    dl.append(title)
    for article in articles:
        img = article.find('img')
        if img is not None:  # Poster
            dl.append(img.get("src"))
        for a in article.find_all('a'):  # DL links
            dl.append(a.get('href'))
    meta1 = entry.find('div', 'sh-content pressrelease-content sh-hide')
    if meta1 is not None:  # Screenshots
        for a in meta1.find_all('a'):
            dl.append(a.get('href'))
    for item in dl:
        print(item)
    print()
    DL.append(dl)

### Main
page = 1
keyword = "高橋しょう子"

while True:
    url = "http://maxjav.com/page/" + str(page) + "/?s=" + keyword
    response = scraper.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # Stop when the site reports that there is no such page
    if any(h2.getText() == "Error 404 - Not Found" for h2 in soup.find_all('h2')):
        break

    for article in soup.find_all('p'):
        meta = article.find('a')
        if meta is not None and keyword in str(article):
            link = meta.get("href")  # URL of each post
            get_dl_link(link)
    page += 1  # go to the next page

### Export
filename = "maxjav_" + keyword + ".csv"
with open(filename, "w", encoding="utf8", newline="") as data:
    writer = csv.writer(data)  # quotes fields that contain commas, e.g. titles
    writer.writerows(DL)  # one post per row: title, poster, links
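
Since the only difference from a plain requests-based crawler is the object that performs the GET, the fetch step can be isolated in one small helper. The sketch below is a hypothetical refactoring, not part of the original script: `fetch_html` and its `scraper` parameter are made-up names, and cfscrape is imported lazily so that any object with a requests-style `get` method can be passed in instead.

```python
def fetch_html(url, scraper=None):
    """Return the body of `url` as text.

    `scraper` is any object with a requests-style get() method. When it is
    omitted, a cfscrape scraper is created (needs cfscrape and Node.js).
    """
    if scraper is None:
        import cfscrape  # lazy import: only needed for the default scraper
        scraper = cfscrape.create_scraper()
    response = scraper.get(url)
    # The status code is deliberately not checked here: the script detects
    # the last page by looking for the "Error 404 - Not Found" heading.
    return response.text
```

In the script, a call such as `fetch_html(url, scraper)` would stand in for each `scraper.get(...)` followed by `response.text`, and a stub scraper can be passed in to try the parsing code without touching the network.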

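Changes

1. Fixed the bug where only the first URL was recorded when the screenshot or dl-link field contained several links (each field is now walked with a for loop).

2. Changed the export format so the output is easy to save manually as a download list.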

Export output after the changes

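Directions for improvement

1. Switch to an interactive interface that asks for the keyword, the number of pages to crawl, the export format (txt or csv), and whether to print results while running.

2. Merge the crawlers for the two sites, and let the interactive interface choose whether to crawl one of them or both.

3. When crawling both sites, merge the download URLs of entries that have the same title.

4. Filter by file host. Neither AllDebrid nor Real-Debrid seems to support UploadGiG, so download URLs that point to UploadGiG (UG) should be ignored.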
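
Items 1, 3, and 4 above can be sketched with the standard library alone. This is only an illustration under assumptions: command-line flags stand in for the interactive prompts, the function names, the flag names, and the `uploadgig.com` host name are hypothetical choices, and each row is assumed to have the same shape as the entries of `DL` in the script (title first, then URLs).

```python
import argparse
from urllib.parse import urlparse

# Hosts to skip; "uploadgig.com" is an assumed domain for UploadGiG.
BLOCKED_HOSTS = {"uploadgig.com"}


def parse_args(argv=None):
    """Item 1: read keyword, page limit, format and verbosity from the command line."""
    parser = argparse.ArgumentParser(description="Search crawler (sketch)")
    parser.add_argument("keyword")
    parser.add_argument("--pages", type=int, default=0,
                        help="number of pages to crawl; 0 means until the last page")
    parser.add_argument("--format", choices=["txt", "csv"], default="csv")
    parser.add_argument("--quiet", action="store_true",
                        help="do not print results while crawling")
    return parser.parse_args(argv)


def is_blocked(url):
    """Item 4: True when the URL's host is a blocked host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in BLOCKED_HOSTS)


def merge_by_title(rows):
    """Item 3: merge rows that share a title, dropping duplicate and blocked URLs."""
    merged = {}  # title -> list of URLs, in first-seen order
    for title, *urls in rows:
        bucket = merged.setdefault(title, [])
        for url in urls:
            if url and not is_blocked(url) and url not in bucket:
                bucket.append(url)
    return [[title, *urls] for title, urls in merged.items()]
```

Rows collected from both sites could be concatenated and passed through `merge_by_title` once before export, so the filter and the merge happen in a single pass.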


Categories: Python, Never Stop Learning

Tags: cfscrape, crawler, Python, 爬蟲 (web crawler)
