智慧生活科技專業社群 (Smart Living Technology Professional Community): Web Scraping with Python

Saturday, October 6, 2018

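In this digital age, the internet, websites, and web pages have become indispensable parts of daily life. We go online all the time to look things up, so it would be great if we could write a program that filters information on the web for us. This article uses a Python web crawler to collect data.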

To fetch data from a web page, install the requests package.

To parse a page and extract information from it, install the beautifulsoup4 package.
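Before running the examples, it can help to confirm that both packages are actually importable. The following is a minimal sketch that uses only the standard library; the helper name missing_packages is hypothetical and not part of either package. Note that beautifulsoup4 is imported under the module name bs4.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# beautifulsoup4 is installed as "beautifulsoup4" but imported as "bs4".
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("requests and bs4 are both available")
```

If anything is reported as not installed, install it with pip before continuing.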

import requests

# Fetch the page and print the raw HTML returned by the server.
r = requests.get('http://www.google.com')
print(r.text)

Output:

How to read the content of the Nan Kai University of Technology (南開科技大學) website

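import requests

url = "http://www.nkut.edu.tw"
# Named "response" so it is not confused with Python's built-in re module.
response = requests.get(url)
# Decode the body as UTF-8 so the Chinese text is not garbled.
response.encoding = 'utf8'
print(response.text)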
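To see why setting the encoding matters, here is a small sketch that needs no network access. In requests, the text attribute is the raw response bytes decoded with whatever the encoding attribute says, so a wrong encoding produces garbled characters. The sample string is only an illustration, and the ISO-8859-1 branch imitates what I understand to be the fallback requests uses when a text response does not declare a charset.

```python
# Bytes as they would travel over the network for a UTF-8 page.
raw = "南開科技大學".encode("utf-8")

# Decoding with the matching codec restores the original text.
correct = raw.decode("utf-8")

# Decoding with ISO-8859-1 never raises an error, but every byte
# becomes its own character, so the result is unreadable.
garbled = raw.decode("iso-8859-1")

print(correct)
print(repr(garbled))
print(len(correct), len(garbled))
```

The two lengths differ because each of these Chinese characters takes three bytes in UTF-8.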

Checking the returned status code

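import requests

r = requests.get('http://www.nkut.edu.tw')
# status_code is the numeric HTTP status code of the response.
print(r.status_code)

# requests.codes.ok is 200, the code for a successful request.
if r.status_code == requests.codes.ok:
    print("OK")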

Output:
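The status code is a plain integer, so it can be turned into a readable label without requests at all. The sketch below uses the standard library's http.HTTPStatus; the helper describe_status is a hypothetical name chosen for this example, and the codes in the loop are just common samples.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (unrecognized status code)"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a scraper, the same helper could be called as describe_status(r.status_code) to log what the server answered.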

Now let's try scraping the Nan Kai University of Technology home page to pull out the important information on it.

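import requests
from bs4 import BeautifulSoup

url = "http://www.nkut.edu.tw"
response = requests.get(url)
response.encoding = 'utf8'

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

# find() returns only the first matching element.
print("First <p> element:")
print(soup.find('p'))

print("\n\nElement whose id is counter:")
print(soup.find(id='counter'))

# find_all() returns every matching element as a list.
print("\n\nAll <p> elements:")
print(soup.find_all('p'))

print("\n\nFirst <h1> element:")
print(soup.find('h1'))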
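The calls above print whole tags, markup included. To pull out just the text or an attribute, something like the following sketch may help. It parses a small made-up HTML string rather than the real NKUT page, so the tag contents, the link, and the counter value are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page used only for illustration; it is not the NKUT home page.
html = """
<html><body>
  <h1>Campus News</h1>
  <p>First paragraph.</p>
  <p>Second paragraph with a <a href="/news/1">link</a>.</p>
  <div id="counter">12345</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() drops the markup.
heading = soup.find("h1").get_text()

# find_all() returns every match, so we can loop over the paragraphs.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# get() reads a tag attribute and returns None if it is absent.
links = [a.get("href") for a in soup.find_all("a")]

# Look up a single element by its id attribute.
counter = soup.find(id="counter").get_text()

print(heading)
print(paragraphs)
print(links)
print(counter)
```

On a real page, find() returns None when nothing matches, so it is worth checking the result before calling get_text() on it.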
