I recently wanted to check the registration status of domains built from five-character idioms (the pinyin initials of the five characters; for example 民以食为本 gives myswb.com), so first I needed a list of five-character idioms. I wrote a simple crawler and, while I was at it, crawled every idiom from 3 to 12 characters.
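For the domain idea itself, turning an idiom into its pinyin initials is easy with the third-party pypinyin package (an assumption on my part; it is not part of the crawler below). A minimal sketch:
# Minimal sketch (assumes the third-party pypinyin package): map an idiom
# to the first letter of each syllable, e.g. 民以食为本 -> myswb.com
from pypinyin import lazy_pinyin, Style

def idiom_to_domain(idiom, tld='.com'):
    # Style.FIRST_LETTER returns only the initial letter of each syllable
    initials = lazy_pinyin(idiom, style=Style.FIRST_LETTER)
    return ''.join(initials) + tld

print(idiom_to_domain('民以食为本'))    # myswb.com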
1. Overview
Fortunately, the 911cha site lists idioms of 3 to 12 characters (probably not exhaustive), for example the five-character idioms. I have shared the crawled results on GitHub, here; or click one of the links below. Each file holds one idiom per line, tab-separated from its detail-page URL (a small loading sketch follows the list).
- chinese_idioms_3.dat
- chinese_idioms_4.dat
- chinese_idioms_5.dat
- chinese_idioms_6.dat
- chinese_idioms_7.dat
- chinese_idioms_8.dat
- chinese_idioms_9.dat
- chinese_idioms_10.dat
- chinese_idioms_11.dat
- chinese_idioms_12.dat
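A minimal sketch for reading one of these files, assuming they are UTF-8 encoded (the format itself, idiom then URL separated by a tab, comes from format_output in the code below):
# Minimal sketch: load a .dat file of tab-separated (idiom, detail URL) pairs
def load_idioms(filename):
    idioms = []
    with open(filename, encoding='utf-8') as fp:
        for line in fp:
            idiom, url = line.rstrip('\n').split('\t')
            idioms.append((idiom, url))
    return idioms

print(load_idioms('chinese_idioms_5.dat')[:3])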
For example, here are some of the five-character idioms; click a link to see its explanation (crawling the explanations would not be hard either; see the sketch after the list):
- 骑牛读汉书 http://chengyu.911cha.com/NmVmYQ==.html
- 夹板医驼子 http://chengyu.911cha.com/MWR1Ng==.html
- 无风三尺浪 http://chengyu.911cha.com/MmRneQ==.html
- 藕断丝不断 http://chengyu.911cha.com/NnF4aw==.html
- 恶事行千里 http://chengyu.911cha.com/ejE0.html
- 二桃殺三士 http://chengyu.911cha.com/NnN2YQ==.html
- 碰一鼻子灰 http://chengyu.911cha.com/NnF6OA==.html
- 骤雨不终日 http://chengyu.911cha.com/OTdnaQ==.html
- 穿一条裤子 http://chengyu.911cha.com/OGZjYQ==.html
- 辄作数日恶 http://chengyu.911cha.com/OG5wNA==.html
- 无何有之乡 http://chengyu.911cha.com/MmQ5cQ==.html
- 艺多不压身 http://chengyu.911cha.com/Nnp6NA==.html
- 武人不惜死 http://chengyu.911cha.com/N3l5aw==.html
- 好心办坏事 http://chengyu.911cha.com/N2x1dw==.html
- 胸中百万兵 http://chengyu.911cha.com/OGhpYw==.html
- ……
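A hedged sketch of how one might start crawling the explanations. The exact element that holds the explanation text is not verified here; you would pin it down with Chrome's Inspect Element, just as for the list pages:
# Sketch: fetch an idiom's detail page and dump its visible text.
# The precise container holding the explanation is an assumption to be
# confirmed with Inspect Element before narrowing the extraction down.
import requests
from bs4 import BeautifulSoup

def fetch_explanation_text(idiom_url):
    response = requests.get(idiom_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # crude fallback: return all visible text on the page
    return soup.get_text(separator='\n', strip=True)

print(fetch_explanation_text('http://chengyu.911cha.com/NmVmYQ==.html')[:200])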
2. Crawling
The approach is much the same as in my earlier post "My First Crawler: Building a Contact Table" (《第一个爬虫程序:建立联系方式表格》). The complete source code is shared on GitHub, here.
2.1 Collecting the URLs
Note that the idioms for a given character count span several pages, so the first step is to collect the links to all of those pages. Viewing the HTML with Chrome's Inspect Element, the page-navigation markup looks like this:
<div class="gclear pp bt center f14"><span class="gray">首页</span> <a href="zishu_5_p4.html">末页</a> <span class="gray">|</span> <a href="zishu_5.html" class="red noline">1</a> <a href="zishu_5_p2.html">2</a> <a href="zishu_5_p3.html">3</a> <a href="zishu_5_p4.html">4</a> <span class="gray">|</span> <span class="gray">上一页</span> <a href="zishu_5_p2.html">下一页</a></div>
The navigation block links to every page for the given character count, so collecting the page URLs is straightforward; the code is as follows:
## Step 1: get all page urls
urls_set = set()
url = format_url.format(word_counts=word_counts)
parsed_uri = urlparse(url)
base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
    continue    # skip this character count if the request fails
soup = BeautifulSoup(response.text, 'html.parser')
for anchor in soup.find_all('div', {'class': 'gclear pp bt center f14'}):    # page navigation
    for item in anchor.find_all('a'):
        page_url = urljoin(base_url, item.attrs['href'])
        urls_set.add(page_url)
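The hrefs in the navigation block are relative (zishu_5_p2.html and so on), which is why each one is resolved against the scheme-plus-host base URL. For example:
# urljoin resolves the relative hrefs against the site root:
from urllib.parse import urljoin
print(urljoin('http://chengyu.911cha.com/', 'zishu_5_p2.html'))
# -> http://chengyu.911cha.com/zishu_5_p2.html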
2.2 Scraping the idioms
With the non-essential code stripped out, it looks like this:
def crawler_chinese_idiom(self, url):
    idioms = list()
    parsed_uri = urlparse(url)
    base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    #for result_set in soup.find_all("ul", {"class": re.compile(r"l[345]\s+center")}):    # buggy, see the note below
    for result_set in soup.find_all("ul", {"class": ['l3', 'l4', 'l5', 'center']}):
        for idiom in result_set.find_all('li'):
            sub_url = idiom.find_all('a')[0].attrs['href']
            idiom_url = urljoin(base_url, sub_url)
            t = (idiom.get_text(), idiom_url)
            idioms.append(t)
    return idioms
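A note on the commented-out regex marked as buggy: BeautifulSoup treats class as a multi-valued attribute and matches a regex against each class token ('l4', 'center') separately, so a pattern like l4\s+center never fires. The class-list filter used above does work, but it matches any ul carrying any one of those classes. A stricter alternative (a sketch, not the code used for the shared data) is a CSS selector that requires both classes on the ul:
# Sketch: require both classes (e.g. "l4 center") on the <ul> via a CSS selector.
from urllib.parse import urljoin

def extract_idioms_css(soup, base_url):
    idioms = []
    for result_set in soup.select('ul.l3.center, ul.l4.center, ul.l5.center'):
        for idiom in result_set.find_all('li'):
            sub_url = idiom.find_all('a')[0].attrs['href']
            idioms.append((idiom.get_text(), urljoin(base_url, sub_url)))
    return idioms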
2.3 Complete source code
#!/usr/bin/env python3
# This program crawls Chinese idioms of 3 to 12 characters.
# By SparkandShine, sparkandshine.net
# July 21st, 2015

from bs4 import BeautifulSoup
import bs4
import requests
import requests.exceptions
import re
#from urlparse import urlparse    # Python 2.x
from urllib.parse import urlparse
from urllib.parse import urljoin
import os


class Crawler_Chinese_Idioms:
    def __init__(self):
        pass

    ### write (idiom, url) pairs to a tab-separated file ###
    def format_output(self, filename, chinese_idioms):
        fp = open(filename, 'w', encoding='utf-8')    # Chinese text, write as UTF-8
        for item in chinese_idioms:
            s = '\t'.join(item)
            fp.write(s + '\n')
            #print(s)
        fp.close()

    ### crawl Chinese idioms, word counts in [3, 12] ###
    def crawler_chinese_idioms(self):
        out_dir = 'dataset_chinese_idioms/'
        format_filename = 'chinese_idioms_{word_counts}.dat'
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        format_url = 'http://chengyu.911cha.com/zishu_{word_counts}.html'
        for word_counts in range(3, 13):
        #for word_counts in [8]:
            chinese_idioms = set()

            ## Step 1: get all page urls
            urls_set = set()
            url = format_url.format(word_counts=word_counts)
            parsed_uri = urlparse(url)
            base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            try:
                response = requests.get(url)
            except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
                continue    # skip this character count if the request fails
            soup = BeautifulSoup(response.text, 'html.parser')
            for anchor in soup.find_all('div', {'class': 'gclear pp bt center f14'}):    # page navigation
                for item in anchor.find_all('a'):
                    page_url = urljoin(base_url, item.attrs['href'])
                    urls_set.add(page_url)
            #print(urls_set)

            ## Step 2: crawl chinese idioms from every page
            for url in urls_set:
                idioms = self.crawler_chinese_idiom(url)
                chinese_idioms.update(idioms)

            ## Step 3: write to file
            filename = out_dir + format_filename.format(word_counts=word_counts)
            self.format_output(filename, chinese_idioms)

    ### crawl chinese idioms from a given page url ###
    def crawler_chinese_idiom(self, url):
        idioms = list()
        parsed_uri = urlparse(url)
        base_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        #print('base_url', base_url)
        try:
            response = requests.get(url)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # ignore pages with errors
            return idioms

        soup = BeautifulSoup(response.text, 'html.parser')
        ## !!! there might be a bug !!!
        #for result_set in soup.find_all("ul", {"class": re.compile(r"l[45]\s+center")}):    # l4 center or l5 center -- never matches (see 2.2)
        #for result_set in soup.find_all("ul", {"class": "l4 center"}):
        for result_set in soup.find_all("ul", {"class": ['l3', 'l4', 'l5', 'center']}):
        #for result_set in soup.find_all("ul", {"class": ["l4 center", "l5 center"]}):
            for idiom in result_set.find_all('li'):
                sub_url = idiom.find_all('a')[0].attrs['href']
                idiom_url = urljoin(base_url, sub_url)
                t = (idiom.get_text(), idiom_url)
                #print(t)
                idioms.append(t)
        return idioms

### END OF CLASS ###


def main():
    crawler = Crawler_Chinese_Idioms()
    crawler.crawler_chinese_idioms()


if __name__ == '__main__':
    main()
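To run it, save the script under any filename (crawler_chinese_idioms.py, say) and execute it with Python 3. It creates dataset_chinese_idioms/ and writes chinese_idioms_3.dat through chinese_idioms_12.dat, one tab-separated (idiom, detail URL) pair per line.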
3. Getting a more complete idiom list
The idioms crawled this way are surely incomplete. Another approach that comes to mind: crawl a larger set of pages (for example idiom-lookup sites), run Chinese word segmentation over the text, and filter the idioms out of the HTML.
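A hedged sketch of that idea, assuming the third-party jieba segmentation package. Here the "candidates" are simply tokens of three or more characters, which would still need validation, for instance against the lists crawled above; note also that jieba rarely emits tokens longer than four characters by default, so longer idioms would need a custom dictionary (jieba.load_userdict).
# Sketch (assumes the third-party jieba package): segment page text and keep
# multi-character tokens as idiom *candidates*, to be validated separately,
# e.g. against the chinese_idioms_*.dat lists crawled above.
import jieba
import requests
from bs4 import BeautifulSoup

def idiom_candidates(url, min_len=3, max_len=12):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator=' ', strip=True)
    return {token for token in jieba.cut(text)
            if min_len <= len(token) <= max_len and token.isalpha()}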