[Dev Tips] How to Get the Stroke Count of a Chinese Character?

Posted on 2024-12-22 | Categories: Dev Tips, Python

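Notes on the problem of getting Chinese character stroke counts, which came up while developing a divination script. After ruling out a mistaken approach based on the pypinyin library, the post walks through parsing the official Unicode Unihan database and extracting the kTotalStrokes field to look up stroke counts accurately.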

Background

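This interesting problem came up while I was developing a simple divination script. If only a few specific characters matter, we can simply hard-code a dictionary in the script. But what if we want the stroke count of any Chinese character?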
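
For a handful of known characters, the hard-coded approach might look like the minimal sketch below. The names `STROKES` and `lookup_hardcoded` are made up for illustration, and the table covers only a few simple characters.

```python
from typing import Optional

# Hypothetical hard-coded table: it only knows the characters listed here.
STROKES = {"一": 1, "二": 2, "三": 3, "十": 2, "人": 2, "大": 3}


def lookup_hardcoded(char: str) -> Optional[int]:
    """Return the stroke count if the character is in the table, else None."""
    return STROKES.get(char)
```

Any character outside the table returns `None`, which is exactly why this does not scale to arbitrary input.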

The pypinyin library

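from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    # pinyin() returns a list with one inner list of readings per character
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    # Caution: this counts the readings returned for the first character,
    # not its strokes
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Enter a Chinese character: ")
strokes = get_strokes_count(character)
print("The stroke count of '{}' is: {}".format(character, strokes))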

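Trying it out showed that the result is actually the number of pinyin readings returned for the character in the NORMAL style, not its stroke count. pypinyin is a pronunciation library, so this approach is a dead end.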

The Unihan database

The Unihan database is a database of Han characters maintained by the Unicode Consortium. It looks reliable, and it also comes with an online tool.

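Searching for a character in its online tool, Unihan Database Lookup, shows that the results include a kTotalStrokes field, which is exactly the stroke count we need. As Unicode's official database, the current version fully covers basic Chinese character lookups.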

Nice! One step closer to success!

Getting stroke data from the Unihan database

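My first idea was to send queries straight to the lookup tool, but, hmm, that was too slow because the server is hosted overseas. The database files themselves turned out to be fairly small, so I simply downloaded them.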

Unihan download link

Opening the archive reveals several files.
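
Inspecting the archive can also be done from Python. The sketch below uses the standard `zipfile` module and assumes the download was saved locally (for example as `Unihan.zip`, a name chosen here for illustration); the helper names `list_archive_members` and `extract_member` are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def list_archive_members(zip_path: Path) -> List[str]:
    """Return the names of all files stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def extract_member(zip_path: Path, member: str, dest_dir: Path) -> Path:
    """Extract a single file from the archive and return its path on disk."""
    with zipfile.ZipFile(zip_path) as archive:
        return Path(archive.extract(member, path=dest_dir))
```

A call such as `extract_member(Path("Unihan.zip"), "Unihan_IRGSources.txt", Path("Stroke"))` would place the file where the extraction code in this post expects it, assuming the archive stores it under that name.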

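According to the lookup results, the kTotalStrokes field we need lives in the IRG Sources file (Unihan_IRGSources.txt), so pull that file out of the archive.
Then test a regular expression in regex101 to capture the Unicode code point and the stroke count, and save those two parts to a separate file for later lookups.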
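
The same regex check can be reproduced locally. The sketch below runs a pattern against a few hand-written sample lines that mimic the tab-separated layout of the data file; `SAMPLE` and `PATTERN` are illustrative names, and this pattern is one possible choice rather than the only workable one.

```python
import re

# Hand-written sample in the tab-separated layout of the Unihan data files.
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+4E00\tkRSUnicode\t1.0\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+4E8C\tkTotalStrokes\t2\n"
    "U+4E09\tkTotalStrokes\t3\n"
)

# Group 1: code point such as U+4E00; group 2: the first stroke-count value.
PATTERN = re.compile(r"^(U\+[0-9A-F]+)\tkTotalStrokes\t(\d+)", re.MULTILINE)

matches = PATTERN.findall(SAMPLE)
print(matches)  # [('U+4E00', '1'), ('U+4E8C', '2'), ('U+4E09', '3')]
```

Anchoring on the field name skips comment lines and other fields, so only kTotalStrokes rows are captured.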

Implementation

Extracting the stroke data

import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")
# Data lines are tab-separated: code point, field name, value(s).
# When a character lists several counts, only the first one is captured.
pattern = re.compile(r"^(U\+[0-9A-F]+)\tkTotalStrokes\t(\d+)")
stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        result = pattern.match(line)
        if result is None:
            continue  # comment line or a different field
        unicode_key, unicode_stroke = result.groups()
        stroke_dict[unicode_key] = unicode_stroke

# Keys look like "U+4E00"; stroke counts are stored as strings
with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)

Exporting the result to a JSON file makes later lookups convenient.

Writing the lookup function

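import json

# `output` is the JSON path written in the previous step
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str) -> int:
    # Build the key format used in the data file, e.g. "阿" -> "U+963F"
    unicode = f"U+{ord(char):04X}"
    # Raises KeyError if the character has no kTotalStrokes entry
    return int(unicode2stroke[unicode])

test_char = "阿"
print(get_character_stroke_count(char=test_char))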

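When looking up a character, note that it must first be converted to its hexadecimal Unicode code point (for example, 阿 becomes U+963F), because that is how the keys are written in the data.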
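
The conversion step can be isolated and checked on its own. The sketch below is one way to do it; `to_unihan_key` and `safe_stroke_count` are hypothetical helper names, and the table passed in is assumed to map keys like "U+4E00" to stroke counts stored as strings or integers.

```python
from typing import Mapping, Optional


def to_unihan_key(char: str) -> str:
    """Format a single character as a Unihan-style key, e.g. 'U+963F'."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # Uppercase hex, padded to at least four digits
    return f"U+{ord(char):04X}"


def safe_stroke_count(char: str, table: Mapping[str, object]) -> Optional[int]:
    """Return the stroke count from the table, or None if the key is absent."""
    value = table.get(to_unihan_key(char))
    return None if value is None else int(value)
```

Returning `None` instead of raising `KeyError` is a design choice for scripts that take arbitrary user input, where punctuation or Latin letters have no entry.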

Success! The result is as expected!