1. How to export queried website back-end data in the most convenient way
Downloading the pages
To download the data and search the database contents quickly in bulk, I wrote a piece of Python code that automatically downloads the web pages and exports the required data to Excel. On inspection, each page URL consists of a fixed prefix followed by a number, in the form https://xxxx./xxxxx.php?id=num, so the download can be done with a loop.
import urllib.request  # standard-library HTTP client


# Fetch a URL and return the raw response body as bytes
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    return html


# Save the page content to "<file_name>.html"
def saveHtml(file_name, file_content):
    # "/" is not allowed in Windows file names, so replace it
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        # file_content is already bytes, so the file is opened in binary mode and written as-is
        f.write(file_content)


# Loop over page ids 1 through 706 (range() excludes its upper bound)
for i in range(1, 707):
    aurl = "https://xxxx/xxxx.php?id=" + str(i)  # build the page URL
    html = getHtml(aurl)  # download the page as bytes
    name = "文件" + str(i)  # build the file name ("文件" means "file")
    saveHtml(name, html)  # write the page to disk

print('Download complete')
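The loop above stops at the first page that fails to load. As a minimal sketch of one way to make it more tolerant, the version below skips failed requests and pauses between them. The names fetch and download_range, the timeout, and the delay parameter are my own assumptions, not part of the original script.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print(f"skipped {url}: {exc}")
        return None


def download_range(prefix, first_id, last_id, fetcher=fetch, delay=0.0):
    """Fetch prefix + id for every id in [first_id, last_id]; return {id: bytes}."""
    pages = {}
    for page_id in range(first_id, last_id + 1):
        body = fetcher(prefix + str(page_id))
        if body is not None:  # keep only the pages that were fetched
            pages[page_id] = body
        time.sleep(delay)  # pause between requests to go easy on the server
    return pages
```

Calling download_range("https://xxxx/xxxx.php?id=", 1, 706, delay=0.5) would return only the pages that loaded, and each one can then be passed to saveHtml. The fetcher argument exists so the loop can be tried out with a stand-in function instead of real network access.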
Analyzing the page content
There are 700 pages in total. Once they are downloaded, the data has to be parsed and extracted. To make filtering and comparison easier, I decided to export it to Excel.
Inspecting the pages shows that everything to be extracted is wrapped in td tags, and that each value always sits in the cell immediately after its parameter name.
So I decided to use the bs4 and pandas libraries for this job.
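To make that layout concrete, here is a minimal sketch on a made-up fragment. The labels are taken from the script that follows; the cell values are placeholders rather than data from the site, and CellCollector and value_after are illustrative names. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages; the pairing rule is the same.

```python
from html.parser import HTMLParser

# Hypothetical fragment: real labels, placeholder values.
SAMPLE = """
<table>
  <tr><td>類型</td><td>sample-type</td></tr>
  <tr><td>矯頑力</td><td>1.5</td></tr>
  <tr><td>出處</td><td></td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, initially empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data.strip()


def value_after(cells, label, default="無"):
    """Return the text of the cell that follows `label`, or `default`."""
    for i, text in enumerate(cells[:-1]):
        if text == label and cells[i + 1]:
            return cells[i + 1]
    return default


collector = CellCollector()
collector.feed(SAMPLE)
print(collector.cells)
print(value_after(collector.cells, "矯頑力"))
```

A label whose value cell is empty, or a label that does not occur at all, falls back to the default, so every page yields the same number of fields.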

from bs4 import BeautifulSoup  # HTML parsing; the 'lxml' parser used below needs the lxml package installed
import pandas as pd  # tabular data and Excel export

# Labels to look up, in output order, each with the unit appended to its value.
# The labels must match the text of the page's td cells exactly.
FIELDS = [
    ('類型', ''),
    ('成份配比', ''),
    ('飽和磁感應強度(T)', 'T'),
    ('矯頑力', 'A/m'),
    ('有效磁導率', ''),
    ('熱處理溫度', '℃'),
    ('熱處理時間', 'min'),
    ('出處', ''),
]


# Read one saved HTML file and return the extracted values as a DataFrame
def read_html(path):
    with open(path, 'r', encoding='utf-8') as htmlfile:  # open the saved page
        htmlhandle = htmlfile.read()  # read the whole file
    soup = BeautifulSoup(htmlhandle, 'lxml')  # parse the markup
    td_list = soup.find_all('td')  # find every td tag
    # result holds the text of every td cell; final is the list that gets exported
    result = [d.string for d in td_list]  # d.string is None for empty or nested cells
    print(len(result))  # number of cells found

    final = []
    for label, unit in FIELDS:
        temporary = []  # values found for this label
        for t in range(len(result) - 1):  # stop one short so result[t + 1] exists
            # the value is the cell right after the label; skip it if it is empty
            if result[t] == label and result[t + 1] is not None:
                print(result[t + 1] + unit)
                temporary.append(result[t + 1] + unit)
        # append '無' ("none") when a label is missing so the field order stays fixed
        if temporary:
            final.extend(temporary)
        else:
            final.append('無')

    df = pd.DataFrame(final)  # convert the list to a pandas DataFrame
    return df


path = './文件1.html'  # path of the first file
df1 = read_html(path)  # parse the first file

number = 700  # number of files to read

# Parse the remaining files and keep the DataFrames in a dict keyed by file number
frames = {}
for i in range(2, number + 1):  # + 1 so that file number 700 is included
    path = "./文件" + str(i) + ".html"
    frames[i] = read_html(path)

# Create the Excel file; the with block saves it on exit
# (writer.save() was removed in pandas 2.0).
# Note that the path uses /, unlike the \ used by Windows.
with pd.ExcelWriter('./stat.xlsx', engine='xlsxwriter') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')  # written starting at cell A1
    for i in range(2, number + 1):  # one column per file, at column offset i
        frames[i].to_excel(writer, sheet_name='Sheet1', startcol=i, index=False, header=True)

print("All pages have been processed; see the Excel file")
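The script above places each page in its own column. If one row per page with one column per parameter turns out to be easier to filter, the records can be collected as dictionaries instead. The sketch below is only an illustration of that layout: it uses the standard csv module as a stand-in for the Excel export, write_rows is a made-up helper, and the values are placeholders.

```python
import csv
import io

# Column names reuse labels from the script; the rows are placeholders.
COLUMNS = ["類型", "成份配比", "矯頑力"]


def write_rows(records, columns, out):
    """Write one CSV row per record; missing fields become '無'."""
    writer = csv.DictWriter(out, fieldnames=columns, restval="無")
    writer.writeheader()
    writer.writerows(records)


records = [
    {"類型": "type-a", "成份配比": "ratio-a", "矯頑力": "1.5A/m"},
    {"類型": "type-b"},  # fields not found on the page are simply absent
]

buffer = io.StringIO()
write_rows(records, COLUMNS, buffer)
print(buffer.getvalue())
```

The same list of dictionaries can also be handed to pd.DataFrame(records) and written with to_excel, which keeps the Excel output while giving each parameter its own named column.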
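Further reading:
2. What is application architecture?
Application architecture describes the functions of IT systems and how they are technically implemented. It is divided into two levels:
Enterprise-level application architecture: At the enterprise level, application architecture provides unified planning and links the layers above and below it. Upward, it carries the company's strategic direction and business model; downward, it plans and guides the positioning and functions of each IT system. Within enterprise architecture, application architecture is the most important part and the one that involves the most work. It covers the enterprise application architecture blueprint, architecture standards and principles, system boundaries and definitions, and the relationships between systems.
Single-system application architecture: When developing or designing a single IT system, this means designing the system's main modules and function points and deciding how the technical implementation is structured, from front-end presentation through business logic to back-end data. This work usually belongs to the project team rather than to enterprise architecture, but each system's architecture design must follow the enterprise's overall application architecture principles.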