mzh/blog

A data structure for China's regions at county level and above [Python]

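The National Bureau of Statistics updates its region data fairly often, but the format is dreadful: plain HTML, would you believe it! No SQL, no XML, nothing! So you have to write a parser just to load the thing. I'll skip the LXML parsing step here; with all the tables stripped away, the data looks like this: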

    110000 北京市
    110100 市辖区
    110101 东城区
    110102 西城区
    110105 朝阳区
    110106 丰台区
    110107 石景山区
    110108 海淀区
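
Each entry pairs a six-digit code with a name, and the sample above suggests that the level can be read off the trailing zeros. Here is a minimal sketch of that reading, assuming one entry per line with a single space between code and name. `parse_entry` and the level labels are names made up for this illustration; they are not part of the original program.

```python
def parse_entry(line):
    """Split a 'code name' line and guess its level from the code.

    Assumption (inferred from the sample rows, not from an official spec):
    codes ending in '0000' are provinces, codes ending in '00' are cities,
    and everything else is a county-level unit.
    """
    code, name = line.strip().split(" ", 1)
    if code.endswith("0000"):
        level = "province"
    elif code.endswith("00"):
        level = "city"
    else:
        level = "county"
    return code, name.strip(), level


# Rows taken from the sample data above.
for raw in ["110000 北京市", "110100 市辖区", "110101 东城区"]:
    print(parse_entry(raw))
```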

After conversion:

省:北京
    ├─ 市辖区
    │  ├─ 东城区
    │  ├─ 西城区
    │  ├─ 朝阳区
    │  ├─ 石景山区
    │  ├─ 海淀区
    │  ╰─ 平谷区
    ╰─ 县
       ├─ 密云县
       ╰─ 延庆县

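The key is the six-digit code in front of each name (strictly speaking an administrative division code rather than a postal code, although the program calls it a zipcode): the first two digits identify the province and the first four the city. The rest is basic Python stuff, groupby and defaultdict. The program is below; to use it, just type the name of a province. The tree-building function stands on its own, of course, so you can use it elsewhere too.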
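
Before the full listing, here is a minimal sketch of how `groupby` and `defaultdict` can fit together on in-memory rows. The listing imports `ScaleTree` from the author's own `utils` module, which isn't shown, so a plain `defaultdict(dict)` stands in for it here. That substitution is an assumption, as are the name `build_tree` and the two 县 rows added to the sample for illustration.

```python
from collections import defaultdict
from itertools import groupby


def build_tree(lines):
    """Group ordered 'code name' rows into {province: {city: [counties]}}."""
    tree = defaultdict(dict)  # stand-in for the author's ScaleTree (assumption)
    # groupby only merges adjacent items, so the rows must already be
    # ordered by code, as they are in the sample data.
    for _, province_rows in groupby(lines, key=lambda row: row[:2]):
        province_rows = list(province_rows)
        # The first row of each group names the province itself.
        province = province_rows[0][7:].strip()
        for _, city_rows in groupby(province_rows[1:], key=lambda row: row[:4]):
            city_rows = list(city_rows)
            city = city_rows[0][7:].strip()
            tree[province][city] = [row[7:].strip() for row in city_rows[1:]]
    return tree


sample = [
    "110000 北京市",
    "110100 市辖区",
    "110101 东城区",
    "110102 西城区",
    "110200 县",
    "110228 密云县",
    "110229 延庆县",
]
tree = build_tree(sample)
print(dict(tree))
```

Grouping on the two-digit prefix first and the four-digit prefix second means that no row ever has to be compared against a lookup table; the ordering of the file does all the work.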

    
    #!/usr/bin/env python3
    # encoding: utf-8


    from itertools import groupby

    # ScaleTree comes from the author's own utils module (not shown here)
    from utils import ScaleTree

    def make_city_tree(path):
        """
        Make a tree of province, city, county of China

        :path: path to database
        :returns: tree

        """
        def strip_zipcode(item):
            """
            Strip the leading six-digit code from a stats.gov.cn entry
            :returns: city_name
            """
            return item[7:].strip()

        with open(path, 'r', encoding='utf8') as db:

            tree = ScaleTree()

            # The first two digits of the code identify the province
            provinces = groupby(db, key=lambda x: x[:2])

            for pid, province_data in provinces:

                province_data = list(province_data)
                # The first line of each group is the province itself
                province_name, cities = strip_zipcode(province_data[0]), province_data[1:]

                # The first four digits of the code identify the city
                for cid, city_data in groupby(cities, key=lambda x: x[:4]):

                    city_data = list(city_data)
                    city_name, counties = strip_zipcode(city_data[0]), city_data[1:]
                    tree[province_name][city_name] = [strip_zipcode(c) for c in counties]

            return tree


    if __name__ == '__main__':

        tree = make_city_tree('db.txt')
        # The prompt reads "Province:"
        pro = input('省:')
        for x in tree.keys():
            if pro in x:
                cities = tree.get(x)
                for t, city in enumerate(cities):
                    # The last city gets a corner, the others a tee
                    if t+1 != len(cities):
                        st = "├"
                    else:
                        st = '╰'
                    print("\u2004\u2004\u2004%s─\u2004%s" % (st, city))
                    counties = cities.get(city)
                    for d, county in enumerate(counties):
                        if d+1 != len(counties):
                            st = "├"
                        else:
                            st = '╰'
                        # Continue the vertical bar only if more cities follow
                        if t+1 != len(cities):
                            ct = "│"
                        else:
                            ct = "\u2004"
                        print("\u2004\u2004\u2004%s\u2004\u2004%s─\u2004%s" % (ct, st, county))
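
The printing logic in the `__main__` block could be pulled out into a function as well, in the same spirit as `make_city_tree`. This is only a sketch, assuming one province's data is a plain mapping of city name to a list of county names. `render_province` is a hypothetical name, and ordinary spaces replace the U+2004 spaces used in the listing.

```python
def render_province(cities):
    """Return the box-drawing lines for one province's {city: [counties]}."""
    lines = []
    city_names = list(cities)
    for t, city in enumerate(city_names):
        last_city = t + 1 == len(city_names)
        # The last city gets a corner, the others a tee.
        lines.append("   %s─ %s" % ("╰" if last_city else "├", city))
        counties = cities[city]
        # Keep the vertical bar only while more cities follow.
        bar = " " if last_city else "│"
        for d, county in enumerate(counties):
            last_county = d + 1 == len(counties)
            branch = "╰" if last_county else "├"
            lines.append("   %s  %s─ %s" % (bar, branch, county))
    return lines


demo = {"市辖区": ["东城区", "西城区"], "县": ["密云县", "延庆县"]}
print("\n".join(render_province(demo)))
```

Returning a list of lines instead of printing directly makes the output easy to test or to write somewhere other than the terminal.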