
mingtzge / 2019-ccf-bdci-ocr-mczj-ocr-identificationidelement


Source code from team 天晨破晓 (Tianchen Daybreak): winner of the 2019 CCF-BDCI Best Innovation Exploration Award and champion of the OCR-based ID-card element extraction track.

License: MIT License

Python 98.56% Dockerfile 0.17% Shell 0.68% MATLAB 0.59%

2019-ccf-bdci-ocr-mczj-ocr-identificationidelement's Introduction

Overview

The competition

Our team name was 鹏脱单攻略队, later renamed "天晨破晓". We finished first on both the A and B leaderboards of the final round, with a recognition accuracy of 0.996952.

Official site screenshot

Team results: 2019 CCF-BDCI Best Innovation Exploration Award, and champion of the "OCR-based ID-card element extraction" track.

System processing pipeline

Flowchart

Highlights

We use a conditional generative adversarial network (CGAN) to remove the watermark interference in the competition data, with good results. Sample output: watermark-removal examples

Source code for generating simulated data; the simulated training data is used to train both the watermark-removal model and the text-recognition model.
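The pair-generation idea can be sketched as follows: stamp a synthetic "watermark" patch onto a clean image so that (watermarked, clean) pairs are available for supervised training. This is a minimal illustration, not the team's actual generator; the function name, shapes, and blend constant are assumptions.

```python
import numpy as np

def add_fake_watermark(clean, mark, top_left, alpha=0.35):
    """Alpha-blend a small watermark patch onto a clean image,
    producing a (watermarked, clean) training pair."""
    out = clean.astype(np.float32).copy()
    y, x = top_left
    h, w = mark.shape[:2]
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = (1 - alpha) * region + alpha * mark
    return out.astype(np.uint8), clean

# Toy example: white background, dark rectangular "watermark" patch.
clean = np.full((64, 128), 255, dtype=np.uint8)
mark = np.zeros((16, 48), dtype=np.uint8)
pair = add_fake_watermark(clean, mark, top_left=(8, 10))
```

A real generator would render the competition's watermark text with the correct font and position; the principle of keeping the clean original as the supervision target is the same.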

Solution slides · Solution paper

How to run

A complete example (CPU, single process):

```shell
python main_process.py --test_experiment_name test_example --test_data_dir ./test_data --gan_ids -1 --pool_num 0
```

Arguments:

  • --test_experiment_name: experiment name; determines where intermediate results are stored
  • --test_data_dir: data directory
  • --gan_ids: device for the watermark-removal model; -1 runs on CPU, otherwise a GPU is used
  • --pool_num: 0 for a single process, greater than 0 for multiple processes

For the remaining arguments, see the help text in main_process.py.

Project file structure:

(ordered by processing stage; see the readme inside each directory for details)

ID-card region extraction module cut_twist_process

Cropping and rotation code; splits the front and back of the ID card out of the original image.

Watermark removal / key-text localization module watermask_remover_and_split_data

Removes the watermark, crops the ID card, extracts the text regions, and applies filtering.

Watermark-removal model pytorch-CycleGAN-and-pix2pix

Address of our trained watermark-removal model:

References

The watermark-removal model uses a conditional GAN. Paper link

We referred to the pix2pix GAN project on GitHub (link) and made some changes on top of it.

Text recognition module recognize_process

Recognizes the text in the images.

References

The recognition model is CRNN. Paper link

We referred to two TensorFlow implementations on GitHub:

Project 1

Project 2

Text correction module data_correction_and_generate_csv_file

Corrects the recognition results and generates the final csv file.
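The correction idea (snapping a noisy OCR string to the closest entry in a reference database, such as the issuing-authority or address lists) can be sketched with stdlib fuzzy matching. This is an illustrative sketch, not the project's actual correction logic; the function name, cutoff, and sample entries are assumptions.

```python
import difflib

def correct_field(ocr_text, valid_values, cutoff=0.6):
    """Replace an OCR result with the closest entry in a reference
    database, or keep it unchanged if nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_text, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text

# Hypothetical issuing-authority database entries.
units = ["南京市公安局栖霞分局", "南京市公安局鼓楼分局"]
# One misrecognized character gets corrected to the nearest valid value.
fixed = correct_field("南京市公安局柄霞分局", units)
```

The real module presumably also uses field-specific rules (e.g. ID-number checksum validation), which simple string similarity cannot capture.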

data_temp

中间数据存放目录,目录名称:实验名+日期

CCFTestResultFixValidData_release.csv

The generated result file.

main_process.py

The entry script.

Requirement.txt

Environment requirements.

Notes

Watermark-removal model location

This is a submodule of the project, so add the "--recursive" flag when cloning. The watermark-removal model is large and is stored with git-lfs; install git-lfs first, otherwise the clone may fail or be very slow.

Because the git-lfs quota has been exceeded, lfs no longer works, so cloning the model files (the text-recognition and watermark-removal models) may fail. We have uploaded the model files to Baidu Cloud. Since the data files contain ID-card images and are therefore sensitive, Baidu Cloud links expire easily; if you need the competition data or the model files, add me as a Baidu Netdisk friend, see "About data download".

!!! Note: the test data must keep the same format as the preliminary- and final-round data. The top-left corner of each ID-card face must carry the text "仅限DBCI比赛(复赛)使用", with the same font size, style, and position as in the preliminary and final rounds; otherwise recognition accuracy will suffer severely, or the code may fail outright. Reason: each ID-card element is cropped out first and then recognized, and the cropping uses "限DBCI" as the reference. Before every crop, a template is matched against the image to locate this text, which yields a reference coordinate, and each element is then cropped relative to that coordinate.
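The template-matching step described above can be illustrated with a tiny sliding-window matcher that minimizes the sum of squared differences; the project presumably uses an OpenCV-style matcher, and every name in this sketch is hypothetical.

```python
import numpy as np

def match_template(img, tpl):
    """Return the (row, col) where tpl best matches inside img,
    by minimizing the sum of squared differences (SSD)."""
    H, W = img.shape
    h, w = tpl.shape
    best, best_pos = None, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            ssd = np.sum((img[y:y + h, x:x + w].astype(np.int64) - tpl) ** 2)
            if best is None or ssd < best:
                best, best_pos = ssd, (y, x)
    return best_pos

# Toy image with the "watermark text" pattern embedded at (5, 7).
img = np.zeros((20, 30), dtype=np.int64)
tpl = np.arange(12).reshape(3, 4)
img[5:8, 7:11] = tpl
ref = match_template(img, tpl)  # reference coordinate; elements are cropped relative to it
```

This is exactly why changing the watermark's font, size, or position breaks the pipeline: the match either lands elsewhere or fails, and every subsequent crop is computed from a wrong reference coordinate.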

2019-ccf-bdci-ocr-mczj-ocr-identificationidelement's People

Contributors

light201212, mingtzge


2019-ccf-bdci-ocr-mczj-ocr-identificationidelement's Issues

How is char_map.json generated?

Hello,

I would like to fine-tune this CRNN model on my own dataset. Is the char_map.json used in training a dictionary file that I generate myself, simply listing the characters that appear in my own training data?
Also, is the annotation file just lines of the form path/xxx.jpg followed by the label text, where the label text has not been converted to indices via the dictionary file?
I just want to confirm that my data format is correct. Thanks!
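One plausible way to build such a character map, assuming the annotation format described in the question (path, a space, then the label text), is to index every character that appears in the training labels. This is a guess at the scheme; the actual keys, values, and reserved indices used by the project may differ.

```python
import json

def build_char_map(label_lines):
    """Assign a consecutive index to each character that appears in
    the annotation texts (assumed format: 'path/xxx.jpg <label>')."""
    chars = sorted({ch for line in label_lines
                    for ch in line.split(maxsplit=1)[1]})
    return {ch: i for i, ch in enumerate(chars)}

labels = ["imgs/0.jpg 张三", "imgs/1.jpg 李四"]
char_map = build_char_map(labels)
serialized = json.dumps(char_map, ensure_ascii=False)
```

A CTC-based CRNN additionally reserves an index for the blank symbol, so a real char map may offset the indices or add an explicit blank entry.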

How is the template in cut_twist_process generated?

Hello, when testing with my own ID-card images, I found that the twist module flipped images that were already upright. I suspect the template images used for comparison are not general enough, so I would like to generate my own.

May I ask how the template file is generated? Is the gaussian_blur function in twist_part enough, or are grayscale sharpening and other operations also needed?

Thanks!

git-clone error with git-lfs

Hi, I tried to run the code, but this lfs file cannot be downloaded. Is there any workaround? Thanks!

```shell
git clone https://github.com/Mingtzge/2019-CCF-BDCI-OCR-MCZJ-OCR-IdentificationIDElement.git
Cloning into '2019-CCF-BDCI-OCR-MCZJ-OCR-IdentificationIDElement'...
remote: Enumerating objects: 150, done.
remote: Counting objects: 100% (150/150), done.
remote: Compressing objects: 100% (135/135), done.
remote: Total 305 (delta 38), reused 97 (delta 11), pack-reused 155
Receiving objects: 100% (305/305), 1.82 MiB | 32.72 MiB/s, done.
Resolving deltas: 100% (56/56), done.
Downloading data_correction_and_generate_csv_file/data/repitle_address_extract.json (27 MB)
Error downloading object: data_correction_and_generate_csv_file/data/repitle_address_extract.json (f6021cc): Smudge error: Error downloading data_correction_and_generate_csv_file/data/repitle_address_extract.json (f6021cc1b6e451a4572698eedf9a48cfad2c14fdd392e6d9fcc8644541816f00): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /home/ec2-user/SageMaker/2019-CCF-BDCI-OCR-MCZJ-OCR-IdentificationIDElement/.git/lfs/logs/20200218T054034.97028677.log
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: data_correction_and_generate_csv_file/data/repitle_address_extract.json: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'
```

Error reading a json file

Hi, when running the test code I hit the error below. The json file was downloaded from Baidu Netdisk and also looks wrong in the preview; the other two json files load fine. What could be the cause? Thanks!

```
UnicodeDecodeError                        Traceback (most recent call last)
in ()
      2 unit_json = "./data_correction_and_generate_csv_file/data/unit.json"  # issuing-authority database
      3 id_json = "./data_correction_and_generate_csv_file/data/repitle_address_extract.json"  # address database
----> 4 unit_id_json = json.load(open(id_json, "r", encoding="utf-8"))

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    294
    295     """
--> 296     return loads(fp.read(),
    297         cls=cls, object_hook=object_hook,
    298         parse_float=parse_float, parse_int=parse_int,

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 21908442: invalid start byte
```

repitle_address_extract.json

Hi,
It shows 25.6 MB, but all I can see are these 3 lines. Could you please add a download link for this file?

```
version https://git-lfs.github.com/spec/v1
oid sha256:f6021cc1b6e451a4572698eedf9a48cfad2c14fdd392e6d9fcc8644541816f00
size 26857155
```
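The three lines shown are a git-lfs pointer file, not the real JSON payload: the lfs download failed, so only the pointer was checked out. A quick stdlib check for this failure mode is sketched below (the marker string is the LFS pointer's documented version line; the function name is illustrative).

```python
def is_lfs_pointer(first_bytes: bytes) -> bool:
    """A payload replaced by an un-downloaded git-lfs pointer starts
    with the LFS spec version line instead of real file content."""
    return first_bytes.startswith(b"version https://git-lfs.github.com/spec/")

pointer = b"version https://git-lfs.github.com/spec/v1\noid sha256:f6021cc...\n"
broken = is_lfs_pointer(pointer)        # pointer file: fetch the real content
real = is_lfs_pointer(b'{"city": []}')  # ordinary JSON content
```

If the check is positive, the fix is to install git-lfs and run `git lfs pull` (or, given the quota problem above, download the file from the Baidu Cloud share instead).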

CGAN data

Hi:
What dataset did you train the CGAN on to erase the watermark? I tested it on a full image, but the data you generate is not a full image; it is just a small crop that contains only the watermark.

About the tensorflow-gpu version

My CUDA is version 10, which does not support tensorflow-gpu 1.12.0, so I installed tensorflow-gpu 1.13.1. When running main_process.py, the line `saver = tf.train.Saver()` in recognize_process/test_crnn_jmz.py fails with "OutOfRangeError: Read less bytes than requested". How can I fix this?

ID-card element extraction templates

Hello~
The template images for ID-card element extraction are missing from template_imgs. Could you provide them? I would like to study the template-matching part of the code.

About issuing-authority correction

The issuing-authority entries in the project's unit.json look wrong. Taking '南京市栖霞区公安局' as an example, the issuing authority actually printed on the back of the card is '南京市公安局栖霞分局'. Is there additional conversion logic?

Test program error

```
NameError                                 Traceback (most recent call last)
~/code/OCR/main_process.py in ()
     77     # remove the watermark and crop/process the images
     78     watermask_handler = WatermarkRemover(args, header_dir, cut_twisted_save_path)
---> 79     watermask_handler.watermask_remover_run()
     80     recognize_image_path = os.path.join(header_dir, "test_data_preprocessed")
     81     recognize_txt_path = os.path.join(header_dir, "test_data_txts")

~/code/OCR/watermask_remover_and_split_data/watermask_process.py in watermask_remover_run(self)
    222         self.gan_gen_result()
    223         if not args.no_rec_img:  # restore the de-watermarked images into the original image
--> 224             print("running rec_img ....")
    225             self.recover_origin_img()
    226             self.ori_img_path = self.recover_image_dir  # point the original image path at the de-watermarked images

~/code/OCR/watermask_remover_and_split_data/watermask_process.py in recover_origin_img(self)
    168         result_dir = os.path.join(self.gan_result_dir, self.pixel_mode, "test_latest", "images")
    169         if not os.path.exists(result_dir):
--> 170             print("not exists gan result dir")
    171             exit(0)
    172         result_img_names = os.listdir(result_dir)

NameError: name 'exit' is not defined
```

Some questions on Pix2Pix

Hey, you mentioned that you made some modifications to the Pix2Pix model in this repo.

Roughly comparing your repo with the original Pix2Pix model, I found these modifications:

  • Added --add_contrast in base_options.py
  • Commented out the Visualizer part
  • Changed a function into a lambda in get_norm_layer in networks.py

Are there more modifications to the Pix2Pix part of your repo? You also mentioned data augmentation; could you please add more detail about that?

test model

When I run test.py, the result is as follows:
(image)

It did not erase the watermark.

About data download

Baidu Cloud links expire very easily, so if you need the data, please add me as a Baidu Cloud friend. My account is 1713983016@@qq.com; I will share the data through friend sharing. Use the note: "2019 CCF BDCI OCR 赛题数据".

test error

When I run test.py in pytorch-CycleGAN-and-pix2pix to predict and load the model from latest_net_G.pth, it raises AttributeError: 'Sequential' object has no attribute 'model'.

```python
def __patch_instance_norm_state_dict(self, state_dict, module, keys, i=0):
    """Fix InstanceNorm checkpoints incompatibility (prior to 0.4)"""
    key = keys[i]
    if i + 1 == len(keys):  # at the end, pointing to a parameter/buffer
        if module.__class__.__name__.startswith('InstanceNorm') and \
                (key == 'running_mean' or key == 'running_var'):
            if getattr(module, key) is None:
                state_dict.pop('.'.join(keys))
        if module.__class__.__name__.startswith('InstanceNorm') and \
           (key == 'num_batches_tracked'):
            state_dict.pop('.'.join(keys))
    else:
        self.__patch_instance_norm_state_dict(state_dict, getattr(module, key), keys, i + 1)
```

recognition model

Is something wrong with the recognition model under model_save? Running mytest_crnn.py always fails to load the model:

```
2021-02-01 09:46:44.007098: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_tensor.cc:175 : Data loss: Unable to open table file /recognize_process/model_save: Failed precondition: ./recognize_process/model_save; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File ".local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File ".local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File ".local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file r/recognize_process/model_save: Failed precondition: /recognize_process/model_save; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?
  [[{{node save/RestoreV2}}]]
```
