
Table text, LaTeX, and paragraph order #111

Open
kendrickliu opened this issue Sep 4, 2024 · 1 comment

@kendrickliu

kendrickliu commented Sep 4, 2024

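Hello, I have a couple of questions.
1. By category, plain_text with category_id=1 is the text of ordinary body paragraphs, and latex is the text of tables with category_id=5. However, in the parsed JSON file, entries with "category_id": 1 have no text value; only the 'ocr_text' entries with category_id=15 carry text. Can the "text" field of 'ocr_text' be understood as the body text with the table content excluded?
2. The reading order does not match up in some places. Is there any coordinate-based sorting?
Screenshot attached. Thanks.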

Uploading 1725428833581.png…

@ouyanglinke
Collaborator

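  1. category_id=1 is the output of the layout model: a paragraph-level text box that carries only bbox information. 15 is the output of the OCR model and contains both the bbox and the text. If you want to use ocr_text to assemble the full paragraph inside a plain_text box, you need to run MinerU's post-processing;
  2. The code for reading-order sorting is also in MinerU;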
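
For readers who only want a rough idea of what such post-processing involves, the two points above can be sketched as follows. This is not MinerU's implementation, only a minimal illustration under assumptions: each detection is taken to be a dict with `category_id`, a `bbox` of the form `[x0, y0, x1, y1]`, and (for OCR spans) a `text` key. Those field names and the helper names `center_inside` and `assemble_paragraphs` are hypothetical, so adapt them to the actual JSON. The ordering used here is a naive top-to-bottom, left-to-right sort, which is much simpler than a real reading-order algorithm and will fail on multi-column pages.

```python
from typing import Dict, List

PLAIN_TEXT = 1  # layout paragraph box (bbox only), as described in the reply
OCR_TEXT = 15   # OCR span (bbox plus text), as described in the reply


def center_inside(inner: List[float], outer: List[float]) -> bool:
    """Return True if the center point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def assemble_paragraphs(dets: List[Dict]) -> List[str]:
    """Group OCR spans into layout paragraph boxes and join their text.

    Assumed (hypothetical) schema: each item has "category_id" and
    "bbox" = [x0, y0, x1, y1]; OCR items also have "text".
    """
    paragraphs = [d for d in dets if d["category_id"] == PLAIN_TEXT]
    spans = [d for d in dets if d["category_id"] == OCR_TEXT]

    # Naive ordering: sort by top edge, then by left edge.
    paragraphs.sort(key=lambda d: (d["bbox"][1], d["bbox"][0]))

    result = []
    for para in paragraphs:
        # A span belongs to a paragraph if its center falls inside the box.
        inside = [s for s in spans if center_inside(s["bbox"], para["bbox"])]
        inside.sort(key=lambda s: (s["bbox"][1], s["bbox"][0]))
        result.append(" ".join(s["text"] for s in inside))
    return result
```

Because OCR spans that fall inside a table box (category_id=5) have their centers outside every plain_text box, they are left out of the assembled paragraphs, which matches the "body text without table content" reading in the question. For real documents, the post-processing and reading-order code in MinerU is the place to look.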
