Wednesday, January 9, 2019

Collecting a Professor's Journal Papers, Conference Papers, and Patents with Python Web Scraping

Data collection often takes considerable effort, and Python web scraping can offload some of that manual work. This post uses web scraping to help collect one professor's research output.
If you have no web-scraping experience, you may want to start with the earlier post, Python Web Scraping.
Next, right-click on the page and choose "View Page Source", as shown in the figure below.


The screenshot above shows that the journal section is laid out as tables, so we can use for row in soup.find_all('table') to find every table and a state variable to track which table we are on. The journal papers sit in table 6, hence elif state == 6. We then use for column in row.find_all('tr') to walk through each journal paper, and finally for paper in column.find_all('td') to pull out that paper's details.
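Before tackling the real page, the table-counting idea can be tried offline on a tiny stand-in page. This is only a sketch (the markup below is invented, not the actual site's); note that enumerate() gives the table index directly, without a hand-maintained counter:

```python
from bs4 import BeautifulSoup

# a tiny stand-in page: three tables, the last one playing the journal table's role
html = """
<html><body>
<table><tr><td>basic info</td></tr></table>
<table><tr><td>education</td></tr></table>
<table><tr><td>journal papers</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

results = []
# enumerate() yields the table index, so no separate state counter is needed
for state, table in enumerate(soup.find_all('table')):
    if state == 2:  # pick out the third table, as the real script does with state == 6
        for cell in table.find_all('td'):
            results.append(str(cell.string))
print(results)
```

The same pattern scales to the real page by changing the index checks.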


The program is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://dli.nkut.edu.tw/people/bio.php?PID=7"
resp = requests.get(url)   # named resp rather than re, to avoid shadowing the re module
resp.encoding = 'utf8'

soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title.string)

state = 0
for row in soup.find_all('table'):
    if state == 0:
        # first table: basic information
        print(soup.find('td').string)
    elif state == 1:
        # second table: skip the header row, print the first data cell
        substate = 0
        for column in row.find_all('tr'):
            if substate == 1:
                print(column.find('td').string)
            substate += 1
    elif state in (6, 9, 12):
        # tables 6, 9 and 12 hold the journal papers, conference papers and patents;
        # they share the same layout, so one branch handles all three
        print(row.find('div').string)
        substate = 0
        for column in row.find_all('tr'):
            if substate > 0:
                print(substate)
            for paper in column.find_all('td'):
                print(paper.string)
            substate += 1
    state += 1

Wednesday, January 2, 2019

Plotting a Stock-Trading Curve with Python and matplotlib

import requests
import json
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

url = 'http://www.twse.com.tw/exchangeReport/STOCK_DAY?date=%s&stockNo=%s' % ('20181201', '2892')
r = requests.get(url)
data = json.loads(r.text)

dates = []
y = []
for row in data['data']:
    # dates come back in the ROC calendar, e.g. '107/12/03'; add 1911 to get 2018
    dateArr = row[0].split("/")
    date4 = dateArr[1] + "/" + dateArr[2] + "/" + str(int(dateArr[0]) + 1911)
    dates.append(date4)
    y.append(float(row[6]))  # closing price; convert to float so the y-axis is numeric

x = [dt.datetime.strptime(d, '%m/%d/%Y').date() for d in dates]

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(x, y)
plt.gcf().autofmt_xdate()
plt.show()
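The Republic-of-China calendar conversion buried in the loop above can be pulled into a small standalone helper. roc_to_date is a hypothetical name introduced here for illustration; the ROC calendar starts in 1912, so year 107 corresponds to 2018:

```python
import datetime as dt

def roc_to_date(roc_str):
    """Convert a ROC-era date string such as '107/12/03' to a datetime.date.

    Hypothetical helper: the ROC year plus 1911 gives the Western year.
    """
    year, month, day = roc_str.split('/')
    return dt.date(int(year) + 1911, int(month), int(day))

print(roc_to_date('107/12/03'))  # 2018-12-03
```

Using datetime.date objects directly also lets matplotlib place the points on a proper date axis without any strptime round-trip.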

Execution result:

Querying a Stock's Monthly Trading Data with Python

Python really shows its ease of use when it comes to fetching and manipulating data. The Taiwan Stock Exchange endpoint is http://www.twse.com.tw/exchangeReport/STOCK_DAY.
It is accessed with a GET request: supply a date and a stock number (date=%s&stockNo=%s) and it returns that month's trading data.
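Rather than splicing the query string together with %s placeholders, the standard library can build it. A minimal sketch using urllib.parse.urlencode, with the same date and stock number as below:

```python
from urllib.parse import urlencode

base = 'http://www.twse.com.tw/exchangeReport/STOCK_DAY'
params = {'date': '20181201', 'stockNo': '2892'}  # month to query and stock number
url = base + '?' + urlencode(params)              # percent-encodes values if needed
print(url)
```

requests can also do this for you via its params= keyword argument, which is handy once the query grows.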

The complete program is as follows:
import requests
import json

url = 'http://www.twse.com.tw/exchangeReport/STOCK_DAY?date=%s&stockNo=%s' % ('20181201', '2892')
r = requests.get(url)
print(r.text)
data = json.loads(r.text)

print(data['title'])
print(data['fields'])

for row in data['data']:
    print(row)

Execution result:
107年12月 2892 第一金           各日成交資訊
['日期', '成交股數', '成交金額', '開盤價', '最高價', '最低價', '收盤價', '漲跌價差', '成交筆數']
['107/12/03', '23,373,526', '469,242,264', '20.15', '20.15', '20.00', '20.05', '+0.10', '4,980']
['107/12/04', '15,121,883', '302,647,738', '20.00', '20.10', '19.95', '20.00', '-0.05', '4,175']
['107/12/05', '12,664,832', '252,585,050', '19.95', '20.00', '19.90', '19.95', '-0.05', '5,155']
['107/12/06', '16,763,587', '333,492,740', '19.95', '20.00', '19.85', '19.90', '-0.05', '4,907']
['107/12/07', '9,717,030', '193,156,769', '19.85', '19.90', '19.80', '19.85', '-0.05', '3,994']
['107/12/10', '10,270,980', '203,219,118', '19.85', '19.85', '19.75', '19.75', '-0.10', '4,094']
['107/12/11', '12,001,409', '237,158,584', '19.75', '19.85', '19.70', '19.75', ' 0.00', '4,196']
['107/12/12', '13,184,739', '260,919,247', '19.85', '19.85', '19.75', '19.75', ' 0.00', '3,970']
['107/12/13', '10,739,661', '213,827,720', '19.80', '20.00', '19.75', '19.90', '+0.15', '4,031']
['107/12/14', '10,221,597', '203,150,390', '19.85', '19.95', '19.75', '19.95', '+0.05', '3,833']
['107/12/17', '13,792,165', '273,836,050', '19.90', '20.00', '19.75', '19.85', '-0.10', '3,586']
['107/12/18', '13,223,459', '261,472,914', '19.90', '19.90', '19.70', '19.80', '-0.05', '5,231']
['107/12/19', '19,141,904', '378,807,037', '19.75', '19.90', '19.65', '19.90', '+0.10', '5,032']
['107/12/20', '8,733,645', '173,279,787', '19.80', '19.95', '19.75', '19.90', ' 0.00', '2,918']
['107/12/21', '12,112,746', '239,466,603', '19.90', '19.90', '19.75', '19.75', '-0.15', '2,876']
['107/12/22', '3,215,217', '63,488,569', '19.75', '19.80', '19.70', '19.70', '-0.05', '1,387']
['107/12/24', '10,574,031', '208,425,067', '19.70', '19.80', '19.65', '19.80', '+0.10', '4,389']
['107/12/25', '5,641,494', '110,725,582', '19.55', '19.70', '19.55', '19.65', '-0.15', '2,285']
['107/12/26', '5,396,256', '106,223,589', '19.65', '19.75', '19.65', '19.70', '+0.05', '3,127']
['107/12/27', '10,050,848', '199,616,315', '19.90', '19.95', '19.75', '19.85', '+0.15', '3,892']
['107/12/28', '12,457,018', '248,623,710', '19.85', '20.00', '19.80', '20.00', '+0.15', '3,035']
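Note that every numeric field above comes back as a string, often with thousands separators or a sign prefix. Before doing any arithmetic on volumes or prices, they need cleaning; parse_number is a small hypothetical helper for that:

```python
def parse_number(s):
    """Parse a TWSE numeric field such as '23,373,526', '+0.10' or ' 0.00'.

    Hypothetical helper: strips thousands separators and whitespace,
    then lets float() handle the optional sign.
    """
    return float(s.replace(',', '').strip())

print(parse_number('23,373,526'))  # 23373526.0
print(parse_number('+0.10'))       # 0.1
```

With this in place, columns like trading volume or closing price can be summed or plotted directly.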