Body text extraction and table text extraction #112

Open
kendrickliu opened this issue Sep 4, 2024 · 1 comment

kendrickliu commented Sep 4, 2024

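Hello, I have two questions after trying out the PDF parsing:
1. Going by the category_id values, "category_id":1 is plain_text (body paragraph text), and the latex in "category_id":5 is the table text. In the parsed JSON output, however, "category_id":1 has no text; only the ocr_text entries with "category_id":15 carry a text field. Can ocr_text be understood as pure body text, with table content and the like excluded?
2. The order of the ocr_text entries in the parsed JSON differs somewhat from the original document (which is not multi-column). Is any coordinate-based sorting applied?
Thanks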

Collaborator

wufan-tb commented Sep 19, 2024

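  1. Your understanding is correct. id 1 is the paragraph box from layout detection, but it carries no paragraph text. id 15 is the text box from OCR detection; images, tables, and formulas are masked out during that detection, so it does not contain table content.
  2. The entries in the JSON are expected to be unordered, since they come from several separate detectors and no coordinate sorting is applied.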
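
Since the output is not sorted and the id 1 boxes carry no text, one possible workaround is to do both steps yourself: assign each id 15 OCR box to the id 1 paragraph box that contains its center, then sort by top-left coordinate. The following is only a minimal sketch, not the project's own method. It assumes each detection is a dict with `category_id`, a `poly` list of eight numbers (x, y corner pairs), and, for OCR boxes, a `text` field; these field names and the helper `paragraphs_from_page` are assumptions for illustration, so adjust them to the JSON you actually get.

```python
from typing import Dict, List, Tuple

PLAIN_TEXT_ID = 1   # layout paragraph box, no text (per the reply above)
OCR_TEXT_ID = 15    # OCR text box that carries the text

Box = Tuple[float, float, float, float]


def bbox(det: Dict) -> Box:
    """Axis-aligned bounding box from an assumed 8-number 'poly' field."""
    xs = det["poly"][0::2]
    ys = det["poly"][1::2]
    return min(xs), min(ys), max(xs), max(ys)


def center(det: Dict) -> Tuple[float, float]:
    x0, y0, x1, y1 = bbox(det)
    return (x0 + x1) / 2, (y0 + y1) / 2


def contains(box: Box, point: Tuple[float, float]) -> bool:
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def reading_order_key(det: Dict) -> Tuple[float, float]:
    """Top-to-bottom, then left-to-right; only suitable for single-column pages."""
    x0, y0, _, _ = bbox(det)
    return y0, x0


def paragraphs_from_page(dets: List[Dict]) -> List[str]:
    """Return one string per paragraph box, in single-column reading order."""
    paragraphs = sorted(
        (d for d in dets if d["category_id"] == PLAIN_TEXT_ID),
        key=reading_order_key,
    )
    spans = [d for d in dets if d["category_id"] == OCR_TEXT_ID]
    result = []
    for para in paragraphs:
        box = bbox(para)
        # Keep the OCR boxes whose center falls inside this paragraph box.
        inside = sorted(
            (s for s in spans if contains(box, center(s))),
            key=reading_order_key,
        )
        result.append(" ".join(s["text"] for s in inside))
    return result
```

Sorting on the raw top coordinate is fragile when boxes on the same text line differ by a pixel or two, and joining with a space is wrong for Chinese text, so treat both as placeholders. OCR boxes that fall outside every paragraph box (titles, captions, and so on) are dropped by this sketch.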
