I recently wanted to check the registration status of domains built from five-character idioms (the pinyin initials of the five characters; for example 民以食为本 gives myswb.com), so first I needed a list of five-character idioms. I wrote a simple crawler and, while I was at it, crawled every idiom from 3 to 12 characters.
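For the domain idea itself, turning an idiom into its pinyin initials is easy with the third-party pypinyin package (an assumption on my part; it is not part of the crawler below). A minimal sketch:
# Minimal sketch (assumes the third-party pypinyin package): map an idiom
# to the first letter of each syllable, e.g. 民以食为本 -> myswb.com
from pypinyin import lazy_pinyin, Style

def idiom_to_domain(idiom, tld='.com'):
    # Style.FIRST_LETTER returns only the initial letter of each syllable
    initials = lazy_pinyin(idiom, style=Style.FIRST_LETTER)
    return ''.join(initials) + tld

print(idiom_to_domain('民以食为本'))    # myswb.com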
1. Overview
Fortunately, the 911cha site lists idioms of 3 to 12 characters (probably not exhaustive), for example the five-character idioms. I have shared the crawled results on GitHub, here; or click one of the links below. Each file holds one idiom per line, tab-separated from its detail-page URL (a small loading sketch follows the list).
- chinese_idioms_3.dat
- chinese_idioms_4.dat
- chinese_idioms_5.dat
- chinese_idioms_6.dat
- chinese_idioms_7.dat
- chinese_idioms_8.dat
- chinese_idioms_9.dat
- chinese_idioms_10.dat
- chinese_idioms_11.dat
- chinese_idioms_12.dat
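A minimal sketch for reading one of these files, assuming they are UTF-8 encoded (the format itself, idiom then URL separated by a tab, comes from format_output in the code below):
# Minimal sketch: load a .dat file of tab-separated (idiom, detail URL) pairs
def load_idioms(filename):
    idioms = []
    with open(filename, encoding='utf-8') as fp:
        for line in fp:
            idiom, url = line.rstrip('\n').split('\t')
            idioms.append((idiom, url))
    return idioms

print(load_idioms('chinese_idioms_5.dat')[:3])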
For example, here are some of the five-character idioms; click a link to see its explanation (crawling the explanations would not be hard either; see the sketch after the list):
- 骑牛读汉书 http://chengyu.911cha.com/NmVmYQ==.html
- 夹板医驼子 http://chengyu.911cha.com/MWR1Ng==.html
- 无风三尺浪 http://chengyu.911cha.com/MmRneQ==.html
- 藕断丝不断 http://chengyu.911cha.com/NnF4aw==.html
- 恶事行千里 http://chengyu.911cha.com/ejE0.html
- 二桃殺三士 http://chengyu.911cha.com/NnN2YQ==.html
- 碰一鼻子灰 http://chengyu.911cha.com/NnF6OA==.html
- 骤雨不终日 http://chengyu.911cha.com/OTdnaQ==.html
- 穿一条裤子 http://chengyu.911cha.com/OGZjYQ==.html
- 辄作数日恶 http://chengyu.911cha.com/OG5wNA==.html
- 无何有之乡 http://chengyu.911cha.com/MmQ5cQ==.html
- 艺多不压身 http://chengyu.911cha.com/Nnp6NA==.html
- 武人不惜死 http://chengyu.911cha.com/N3l5aw==.html
- 好心办坏事 http://chengyu.911cha.com/N2x1dw==.html
- 胸中百万兵 http://chengyu.911cha.com/OGhpYw==.html
- ……
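A hedged sketch of how one might start crawling the explanations. The exact element that holds the explanation text is not verified here; you would pin it down with Chrome's Inspect Element, just as for the list pages:
# Sketch: fetch an idiom's detail page and dump its visible text.
# The precise container holding the explanation is an assumption to be
# confirmed with Inspect Element before narrowing the extraction down.
import requests
from bs4 import BeautifulSoup

def fetch_explanation_text(idiom_url):
    response = requests.get(idiom_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # crude fallback: return all visible text on the page
    return soup.get_text(separator='\n', strip=True)

print(fetch_explanation_text('http://chengyu.911cha.com/NmVmYQ==.html')[:200])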
2. Crawling
The approach is much the same as in my earlier post "My First Crawler: Building a Contact Table" (《第一个爬虫程序:建立联系方式表格》). The complete source code is shared on GitHub, here.
2.1 Collecting the URLs
Note that the idioms for a given character count span several pages, so the first step is to collect the links to all of those pages. Viewing the HTML with Chrome's Inspect Element, the page-navigation markup looks like this:
<div class="gclear pp bt center f14"><span class="gray">首页</span> <a href="zishu_5_p4.html">末页</a> <span class="gray">|</span> <a href="zishu_5.html" class="red noline">1</a> <a href="zishu_5_p2.html">2</a> <a href="zishu_5_p3.html">3</a> <a href="zishu_5_p4.html">4</a> <span class="gray">|</span> <span class="gray">上一页</span> <a href="zishu_5_p2.html">下一页</a></div>
The navigation block links to every page for the given character count, so collecting the page URLs is straightforward; the code is as follows:
## Step 1: get all page urls
urls_set = set()
url = format_url.format(word_counts=word_counts)
parsed_uri = urlparse(url)
base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
    continue    # skip this character count if the request fails
soup = BeautifulSoup(response.text, 'html.parser')
for anchor in soup.find_all('div', {'class': 'gclear pp bt center f14'}):    # page navigation
    for item in anchor.find_all('a'):
        page_url = urljoin(base_url, item.attrs['href'])
        urls_set.add(page_url)
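The hrefs in the navigation block are relative (zishu_5_p2.html and so on), which is why each one is resolved against the scheme-plus-host base URL. For example:
# urljoin resolves the relative hrefs against the site root:
from urllib.parse import urljoin
print(urljoin('http://chengyu.911cha.com/', 'zishu_5_p2.html'))
# -> http://chengyu.911cha.com/zishu_5_p2.html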
2.2 Scraping the idioms
With the non-essential code stripped out, it looks like this:
def crawler_chinese_idiom(self, url):
    idioms = list()
    parsed_uri = urlparse(url)
    base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    #for result_set in soup.find_all("ul", {"class": re.compile(r"l[345]\s+center")}):    # buggy, see the note below
    for result_set in soup.find_all("ul", {"class": ['l3', 'l4', 'l5', 'center']}):
        for idiom in result_set.find_all('li'):
            sub_url = idiom.find_all('a')[0].attrs['href']
            idiom_url = urljoin(base_url, sub_url)
            t = (idiom.get_text(), idiom_url)
            idioms.append(t)
    return idioms
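A note on the commented-out regex marked as buggy: BeautifulSoup treats class as a multi-valued attribute and matches a regex against each class token ('l4', 'center') separately, so a pattern like l4\s+center never fires. The class-list filter used above does work, but it matches any ul carrying any one of those classes. A stricter alternative (a sketch, not the code used for the shared data) is a CSS selector that requires both classes on the ul:
# Sketch: require both classes (e.g. "l4 center") on the <ul> via a CSS selector.
from urllib.parse import urljoin

def extract_idioms_css(soup, base_url):
    idioms = []
    for result_set in soup.select('ul.l3.center, ul.l4.center, ul.l5.center'):
        for idiom in result_set.find_all('li'):
            sub_url = idiom.find_all('a')[0].attrs['href']
            idioms.append((idiom.get_text(), urljoin(base_url, sub_url)))
    return idioms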
2.3 Complete source code
#!/usr/bin/env python3
# This program crawls Chinese idioms of 3 to 12 characters.
# By SparkandShine, sparkandshine.net
# July 21st, 2015

from bs4 import BeautifulSoup
import bs4
import requests
import requests.exceptions
import re
#from urlparse import urlparse    # Python 2.x
from urllib.parse import urlparse
from urllib.parse import urljoin
import os


class Crawler_Chinese_Idioms:
    def __init__(self):
        pass

    ### write (idiom, url) pairs to a tab-separated file ###
    def format_output(self, filename, chinese_idioms):
        fp = open(filename, 'w', encoding='utf-8')    # Chinese text, write as UTF-8
        for item in chinese_idioms:
            s = '\t'.join(item)
            fp.write(s + '\n')
            #print(s)
        fp.close()

    ### crawl Chinese idioms, word counts in [3, 12] ###
    def crawler_chinese_idioms(self):
        out_dir = 'dataset_chinese_idioms/'
        format_filename = 'chinese_idioms_{word_counts}.dat'
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        format_url = 'http://chengyu.911cha.com/zishu_{word_counts}.html'
        for word_counts in range(3, 13):
        #for word_counts in [8]:
            chinese_idioms = set()

            ## Step 1: get all page urls
            urls_set = set()
            url = format_url.format(word_counts=word_counts)
            parsed_uri = urlparse(url)
            base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            try:
                response = requests.get(url)
            except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
                continue    # skip this character count if the request fails
            soup = BeautifulSoup(response.text, 'html.parser')
            for anchor in soup.find_all('div', {'class': 'gclear pp bt center f14'}):    # page navigation
                for item in anchor.find_all('a'):
                    page_url = urljoin(base_url, item.attrs['href'])
                    urls_set.add(page_url)
            #print(urls_set)

            ## Step 2: crawl chinese idioms from every page
            for url in urls_set:
                idioms = self.crawler_chinese_idiom(url)
                chinese_idioms.update(idioms)

            ## Step 3: write to file
            filename = out_dir + format_filename.format(word_counts=word_counts)
            self.format_output(filename, chinese_idioms)

    ### crawl chinese idioms from a given page url ###
    def crawler_chinese_idiom(self, url):
        idioms = list()
        parsed_uri = urlparse(url)
        base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        #print('base_url', base_url)
        try:
            response = requests.get(url)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # ignore pages with errors
            return idioms

        soup = BeautifulSoup(response.text, 'html.parser')
        ## !!! there might be a bug !!!
        #for result_set in soup.find_all("ul", {"class": re.compile(r"l[45]\s+center")}):    # l4 center or l5 center -- never matches (see 2.2)
        #for result_set in soup.find_all("ul", {"class": "l4 center"}):
        for result_set in soup.find_all("ul", {"class": ['l3', 'l4', 'l5', 'center']}):
        #for result_set in soup.find_all("ul", {"class": ["l4 center", "l5 center"]}):
            for idiom in result_set.find_all('li'):
                sub_url = idiom.find_all('a')[0].attrs['href']
                idiom_url = urljoin(base_url, sub_url)
                t = (idiom.get_text(), idiom_url)
                #print(t)
                idioms.append(t)
        return idioms

### END OF CLASS ###


def main():
    crawler = Crawler_Chinese_Idioms()
    crawler.crawler_chinese_idioms()


if __name__ == '__main__':
    main()
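To run it, save the script under any filename (crawler_chinese_idioms.py, say) and execute it with Python 3. It creates dataset_chinese_idioms/ and writes chinese_idioms_3.dat through chinese_idioms_12.dat, one tab-separated (idiom, detail URL) pair per line.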
3. Getting a more complete idiom list
The idioms crawled this way are surely incomplete. Another approach that comes to mind: crawl a larger set of pages (for example idiom-lookup sites), run Chinese word segmentation over the text, and filter the idioms out of the HTML.
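A hedged sketch of that idea, assuming the third-party jieba segmentation package. Here the "candidates" are simply tokens of three or more characters, which would still need validation, for instance against the lists crawled above; note also that jieba rarely emits tokens longer than four characters by default, so longer idioms would need a custom dictionary (jieba.load_userdict).
# Sketch (assumes the third-party jieba package): segment page text and keep
# multi-character tokens as idiom *candidates*, to be validated separately,
# e.g. against the chinese_idioms_*.dat lists crawled above.
import jieba
import requests
from bs4 import BeautifulSoup

def idiom_candidates(url, min_len=3, max_len=12):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator=' ', strip=True)
    return {token for token in jieba.cut(text)
            if min_len <= len(token) <= max_len and token.isalpha()}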