AutoText

English | 简体中文

An intelligent, automatic text processing tool.

AutoText's main features are text correction, image OCR, and table structure recognition.

Guide

Text correction
Image OCR
Layout detection
Table structure recognition
Todo
Contact
Citation
License
Contribute

Text correction

For details on text correction, see jcorrector.

This project currently includes:

N-gram-based correction

1. Correction

  Corrector corrector = new Corrector();
  String sentence = "少先队员因该为老人让坐";
  System.out.println(corrector.correct(sentence));

2. Detection

  Detector detector = new Detector();
  String sentence = "我的喉咙发炎了要买点阿莫细林吉";
  System.out.println(detector.detect(sentence));
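
To make the idea behind n-gram detection concrete, here is a minimal sketch. It is not jcorrector's implementation: it counts character bigrams in a tiny reference corpus and flags positions in a sentence whose bigram never occurs there. The class name `BigramDetectorSketch`, the two-sentence corpus, and the "unseen bigram" rule are all assumptions made for illustration; a real detector would rely on a large corpus and smoothed probabilities.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bigram detector (hypothetical): flags bigrams never seen in a reference corpus. */
final class BigramDetectorSketch {
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    /** Counts every adjacent character pair in the reference sentences. */
    BigramDetectorSketch(List<String> corpus) {
        for (String sentence : corpus) {
            for (int i = 0; i + 1 < sentence.length(); i++) {
                bigramCounts.merge(sentence.substring(i, i + 2), 1, Integer::sum);
            }
        }
    }

    /** Returns the start indices of bigrams that do not occur in the corpus. */
    List<Integer> detect(String sentence) {
        List<Integer> suspicious = new ArrayList<>();
        for (int i = 0; i + 1 < sentence.length(); i++) {
            if (!bigramCounts.containsKey(sentence.substring(i, i + 2))) {
                suspicious.add(i);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // Assumed reference corpus; far too small for real use
        BigramDetectorSketch detector = new BigramDetectorSketch(
                List.of("少先队员应该为老人让座", "我们应该帮助老人"));
        // Prints [3, 4, 9]: the bigrams around the two misspelled characters
        System.out.println(detector.detect("少先队员因该为老人让坐"));
    }
}
```

The flagged indices surround the misspelled characters, which is where a corrector would then try candidate replacements.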

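Deep-learning-based correction

Correction

  // Load the ONNX model before running inference
  LoadModel.loadOnnxModel();
  // Sample inputs: each assignment overwrites the previous one, so only the last is corrected
  String text = "今天新情很好。";
  text = "你找到你最喜欢的工作，我也很高心。";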
  text = "是的，当线程不再使用时，该缓冲区将被清理（我昨天实际上对此进行了测试，我可以每5ms发送一个新线程，而不会产生净内存累积，并确认它的rng内存已在gc上清理）。编号7788";
  text = text.toLowerCase();
  BertTokenizer tokenizer = new BertTokenizer();
  MacBert macBert = new MacBert(tokenizer);
  Map<String, OnnxTensor> inputTensor = null;
  try {
      inputTensor = macBert.parseInputText(text);
  } catch (Exception e) {
      e.printStackTrace();
  }
  List<String> predTokenList = macBert.predCSC(inputTensor);
  predTokenList = predTokenList.stream().map(token -> token.replace("##", "")).collect(Collectors.toList());
  String predString = String.join("", predTokenList);
  System.out.println(predString);
  List<Pair<String, String>> resultList = macBert.getErrors(predString, text);
  for(Pair<String, String> result : resultList) {
      System.out.println(text + " => " + result.getLeft() + " " + result.getRight());
  }
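
The post-processing in the snippet above (joining tokens, stripping `##`, then listing errors) can be sketched in isolation. This is a hedged illustration only: `CscDiffSketch`, `joinTokens`, and `diff` are hypothetical names, the predicted token list is made up, and the character-by-character comparison is an assumption about what an error listing could look like, not the behavior of `macBert.getErrors`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: joins WordPiece tokens and diffs two equal-length strings. */
final class CscDiffSketch {
    /** Joins tokens, dropping the "##" marker that WordPiece puts on continuation pieces. */
    static String joinTokens(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token.replace("##", ""));
        }
        return sb.toString();
    }

    /** Lists substitutions as "index:original->corrected" for equal-length strings. */
    static List<String> diff(String original, String corrected) {
        if (original.length() != corrected.length()) {
            throw new IllegalArgumentException("sketch only handles equal-length strings");
        }
        List<String> edits = new ArrayList<>();
        for (int i = 0; i < original.length(); i++) {
            if (original.charAt(i) != corrected.charAt(i)) {
                edits.add(i + ":" + original.charAt(i) + "->" + corrected.charAt(i));
            }
        }
        return edits;
    }

    public static void main(String[] args) {
        String original = "今天新情很好。";
        // Made-up model output for illustration
        String corrected = joinTokens(List.of("今", "天", "心", "情", "很", "好", "。"));
        System.out.println(diff(original, corrected)); // prints [2:新->心]
    }
}
```

Spelling correction models of this kind typically substitute characters one for one, which is why the sketch only handles equal-length strings.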

Template-based Chinese grammatical error correction

Correction

 String templatePath = GecDemo.class.getClassLoader().getResource(PropertiesReader.get("template")).getPath().replaceFirst("/", "");
 GecCheck gecRun = new GecCheck();
 gecRun.init(templatePath);
 String sentence = "爸爸看完小品后忍俊不禁笑了起来。";
 String infoStr = gecRun.checkCorrect(sentence);
 if(StringUtils.isNotBlank(infoStr)) {
     System.out.println(infoStr);
 }

Idiom and proper-noun correction

Correction

 String properNamePath = ProperDemo.class.getClassLoader().getResource(PropertiesReader.get("proper_name_path")).getPath().replaceFirst("/", "");
 String strokePath = ProperDemo.class.getClassLoader().getResource(PropertiesReader.get("stroke_path")).getPath().replaceFirst("/", "");
 ProperCorrector properCorrector = new ProperCorrector(properNamePath, strokePath);
 List<String> testLine = List.of(
         "报应接中迩来",
         "这块名表带带相传",
         "这块名表代代相传",
         "他贰话不说把牛奶喝完了",
         "这场比赛我甘败下风",
         "这场比赛我甘拜下封",
         "这家伙还蛮格尽职守的",
         "报应接中迩来",  // 接踵而来
         "人群穿流不息",
         "这个消息不径而走",
         "这个消息不胫儿走",
         "眼前的场景美仑美幻简直超出了人类的想象",
         "看着这两个人谈笑风声我心理不由有些忌妒",
         "有了这一番旁证博引",
         "有了这一番旁针博引",
         "这群鸟儿迁洗到远方去了",
         "这群鸟儿千禧到远方去了",
         "美国前总统特琅普给普京点了一个赞，特朗普称普金做了一个果断的决定"
 );
 for(String line : testLine) {
     System.out.println(properCorrector.correct(line));
 }

For usage details, see examples/correct in this project, or jcorrector.
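
As a rough illustration of dictionary-based idiom correction, the sketch below slides a window over the sentence and replaces any window that differs from a known idiom by exactly one character. Everything here is an assumption: `IdiomCorrectorSketch`, the three-idiom dictionary, and the one-character rule are hypothetical. `ProperCorrector` also takes a stroke file, which presumably feeds a glyph-similarity measure; that is not modeled here, so misspellings with two wrong characters are left untouched.

```java
import java.util.List;

/** Hypothetical idiom corrector: fixes windows one character away from a known idiom. */
final class IdiomCorrectorSketch {
    private final List<String> idioms;

    IdiomCorrectorSketch(List<String> idioms) {
        this.idioms = idioms;
    }

    /** Counts positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Replaces each window that differs from an idiom by exactly one character. */
    String correct(String sentence) {
        StringBuilder out = new StringBuilder(sentence);
        for (String idiom : idioms) {
            int n = idiom.length();
            for (int i = 0; i + n <= out.length(); i++) {
                if (hamming(out.substring(i, i + n), idiom) == 1) {
                    out.replace(i, i + n, idiom);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed mini dictionary; a real one would hold thousands of entries
        IdiomCorrectorSketch corrector = new IdiomCorrectorSketch(
                List.of("代代相传", "不胫而走", "甘拜下风"));
        System.out.println(corrector.correct("这个消息不径而走")); // 这个消息不胫而走
        System.out.println(corrector.correct("这场比赛我甘拜下封")); // 这场比赛我甘拜下风
    }
}
```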

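Image OCR

This part mainly uses the detection and recognition components of PaddleOCR, with the models converted to ONNX format for inference. The project preprocesses each image before recognition, so that processing takes about 10 seconds per image on average in a CPU environment.
For usage details, see examples/ocr/text/OcrDemo in this project.
PS
- Model download (net disk)
- Extraction code: b5vq
- After downloading, the models can be placed under text_recgo in resources, or in any other location
Usage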

    // read image file
    String imagePath = "examples\\ocr\\img_test\\text_example.png";
    var imageFile = Paths.get(imagePath);
    var image = ImageFactory.getInstance().fromFile(imageFile);
    
    // init model
    String detectionModelFile = OcrDemo.class.getClassLoader().getResource(PropertiesReader.get("text_recog_det_model_path")).getPath().replaceFirst("/", "");
    String recognitionModelFile = OcrDemo.class.getClassLoader().getResource(PropertiesReader.get("text_recog_rec_model_path")).getPath().replaceFirst("/", "");
    Path detectionModelPath = Paths.get(detectionModelFile);
    Path recognitionModelPath = Paths.get(recognitionModelFile);
    OcrApp ocrApp = new OcrApp(detectionModelPath, recognitionModelPath);
    ocrApp.init();
    
    // predict result and consume time
    var timeInferStart = System.currentTimeMillis();
    Pair<List<TextListBox>, Image> imagePair = ocrApp.ocrImage(image, 960);
    System.out.println("consume time: " + (System.currentTimeMillis() - timeInferStart)/1000.0 + "s");
    for (var result : imagePair.getLeft()) {
        System.out.println(result);
    }
    // save ocr result image
    ocrApp.saveImageOcrResult(imagePair, "ocr_result.png", "examples\\ocr\\output");
    ocrApp.closeAllModel();

The result is the recognized text and its coordinates:

    position: [800.0, 609.0, 877.0, 609.0, 877.0, 645.0, 800.0, 645.0], text: 8.23%
    position: [433.0, 607.0, 494.0, 607.0, 494.0, 649.0, 433.0, 649.0], text: 68.4
    position: [96.0, 610.0, 316.0, 611.0, 316.0, 641.0, 96.0, 640.0], text: 股东权益比率（%）
    position: [624.0, 605.0, 688.0, 605.0, 688.0, 650.0, 624.0, 650.0], text: 63.2
    position: [791.0, 570.0, 887.0, 570.0, 887.0, 600.0, 791.0, 600.0], text: -39.64%
    position: [625.0, 564.0, 687.0, 564.0, 687.0, 606.0, 625.0, 606.0], text: 49.7
    position: [134.0, 568.0, 279.0, 568.0, 279.0, 598.0, 134.0, 598.0], text: 毛利率（%）
    ......
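
Each `position` above holds eight numbers, which appear to be the four corner points of a text box listed as x1, y1, ..., x4, y4. Under that assumption, a small helper can reduce a quad to an axis-aligned bounding box, which is often easier to crop or compare. `QuadBoundsSketch` and `boundingBox` are hypothetical names, not part of the project.

```java
import java.util.Arrays;

/** Hypothetical helper: reduces an 8-value quad to {xMin, yMin, xMax, yMax}. */
final class QuadBoundsSketch {
    /** Expects x1, y1, x2, y2, x3, y3, x4, y4 (assumed layout of "position"). */
    static float[] boundingBox(float[] quad) {
        if (quad.length != 8) {
            throw new IllegalArgumentException("expected 8 coordinates");
        }
        float xMin = quad[0], xMax = quad[0];
        float yMin = quad[1], yMax = quad[1];
        for (int i = 2; i < 8; i += 2) {
            xMin = Math.min(xMin, quad[i]);
            xMax = Math.max(xMax, quad[i]);
            yMin = Math.min(yMin, quad[i + 1]);
            yMax = Math.max(yMax, quad[i + 1]);
        }
        return new float[] {xMin, yMin, xMax, yMax};
    }

    public static void main(String[] args) {
        // Slightly skewed quad taken from the sample output above
        float[] quad = {96.0f, 610.0f, 316.0f, 611.0f, 316.0f, 641.0f, 96.0f, 640.0f};
        System.out.println(Arrays.toString(boundingBox(quad))); // [96.0, 610.0, 316.0, 641.0]
    }
}
```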

Result image: image OCR

Layout detection

Layout detection uses YOLOv8; see layout_analysis4j.
For usage details, see examples/ocr/layout_detection/LayoutDetection in this project.
Usage

    public static void main(String... args) {
        String modelPath = LayoutDetection.class.getClassLoader().getResource(PropertiesReader.get("model_path")).getPath().replaceFirst("/", "");
        String labelPath = LayoutDetection.class.getClassLoader().getResource(PropertiesReader.get("table_det_labels_path")).getPath().replaceFirst("/", "");
        String imgPath = "D:\\project\\idea_workspace\\layout_analysis4j\\img\\test.webp";

        try {
            LayoutDet modelDet = new LayoutDet(modelPath, labelPath);
            Mat img = Imgcodecs.imread(imgPath);
            if (img.dataAddr() == 0) {
                System.out.println("Could not open image: " + imgPath);
                System.exit(1);
            }
            // run detection
            try {
                List<Detection> detectionList = modelDet.detectObjects(img);

                LayoutDetectionUtil.drawPredictions(img, detectionList);
                System.out.println(JSON.toJSONString(detectionList));
                Imgcodecs.imwrite("D:\\project\\idea_workspace\\layout_analysis4j\\img\\prediction.jpg", img);
            } catch (OrtException ortException) {
                ortException.printStackTrace();
            }
        } catch (OrtException e) {
            e.printStackTrace();
        }
    }

The result is as follows:

    [{"bbox":[137.88228,40.05045,352.5302,60.206684],"confidence":0.9228547,"label":"Header","labelIndex":0},
    {"bbox":[25.661982,52.15992,80.54977,60.164627],"confidence":0.8484751,"label":"Header","labelIndex":0},
    {"bbox":[400.68176,50.069782,462.38123,58.523815],"confidence":0.83252084,"label":"Header","labelIndex":0},
    {"bbox":[27.056168,478.51273,205.72672,656.39886],"confidence":0.9614719,"label":"Text","labelIndex":1},
    {"bbox":[25.820251,304.84778,463.41486,386.2965],"confidence":0.89359975,"label":"Text","labelIndex":1},
    {"bbox":[21.327255,190.66518,463.8985,257.07446],"confidence":0.8879021,"label":"Text","labelIndex":1},
    {"bbox":[182.88458,142.3864,308.64737,156.58653],"confidence":0.79081506,"label":"Text","labelIndex":1},
    {"bbox":[38.471603,435.21515,463.1955,474.5235],"confidence":0.77674204,"label":"Text","labelIndex":1},
    {"bbox":[153.92957,160.85332,338.4781,168.90303],"confidence":0.764402,"label":"Text","labelIndex":1},
    {"bbox":[27.318249,661.32355,151.53812,670.04987],"confidence":0.3412643,"label":"Text","labelIndex":1},
    {"bbox":[306.27896,667.1539,362.94162,674.0262],"confidence":0.8710417,"label":"Figure caption","labelIndex":3},
    {"bbox":[213.9415,479.61642,468.25687,661.9558],"confidence":0.9372132,"label":"Figure","labelIndex":4},
    {"bbox":[26.771957,405.50818,94.59786,416.935],"confidence":0.91822684,"label":"Title","labelIndex":7},
    {"bbox":[131.77039,103.47922,359.43063,120.83272],"confidence":0.88686645,"label":"Title","labelIndex":7},
    {"bbox":[26.655102,661.2926,151.92046,670.0917],"confidence":0.87808716,"label":"Title","labelIndex":7},
    {"bbox":[27.927279,275.91486,68.040955,287.49615],"confidence":0.8072859,"label":"Title","labelIndex":7},
    {"bbox":[27.4192,661.2635,151.4754,670.21075],"confidence":0.49235547,"label":"Footer","labelIndex":8}]
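
In the output above, nearly the same region near y = 661 is reported three times, as Text, Title, and Footer. One common way to tidy such results is class-agnostic non-maximum suppression: keep the most confident box and drop any box that overlaps a kept one too much. The sketch below assumes `bbox` is `[x1, y1, x2, y2]` and uses an IoU threshold of 0.5; `LayoutNmsSketch`, `Box`, and the threshold are hypothetical, and the project itself may already filter differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical class-agnostic NMS over layout boxes given as [x1, y1, x2, y2]. */
final class LayoutNmsSketch {
    static final class Box {
        final String label;
        final double confidence;
        final double[] bbox;

        Box(String label, double confidence, double... bbox) {
            this.label = label;
            this.confidence = confidence;
            this.bbox = bbox;
        }
    }

    /** Intersection over union of two corner-format boxes. */
    static double iou(double[] a, double[] b) {
        double w = Math.min(a[2], b[2]) - Math.max(a[0], b[0]);
        double h = Math.min(a[3], b[3]) - Math.max(a[1], b[1]);
        if (w <= 0 || h <= 0) {
            return 0.0;
        }
        double inter = w * h;
        double areaA = (a[2] - a[0]) * (a[3] - a[1]);
        double areaB = (b[2] - b[0]) * (b[3] - b[1]);
        return inter / (areaA + areaB - inter);
    }

    /** Keeps boxes in descending confidence, skipping those that overlap a kept box. */
    static List<Box> suppress(List<Box> boxes, double iouThreshold) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((Box b) -> b.confidence).reversed());
        List<Box> kept = new ArrayList<>();
        for (Box candidate : sorted) {
            boolean overlaps = false;
            for (Box k : kept) {
                if (iou(candidate.bbox, k.bbox) > iouThreshold) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Four boxes copied from the sample output above
        List<Box> boxes = List.of(
                new Box("Text", 0.3412643, 27.318249, 661.32355, 151.53812, 670.04987),
                new Box("Figure", 0.9372132, 213.9415, 479.61642, 468.25687, 661.9558),
                new Box("Title", 0.87808716, 26.655102, 661.2926, 151.92046, 670.0917),
                new Box("Footer", 0.49235547, 27.4192, 661.2635, 151.4754, 670.21075));
        for (Box b : suppress(boxes, 0.5)) {
            System.out.println(b.label + " " + b.confidence);
        }
    }
}
```

With these four boxes, only the Figure and the Title survive. Whether the Title label is actually the right one for that region is a separate question that suppression cannot answer.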

Result image: layout detection

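Table structure recognition

Rule-based and built with OpenCV. The supported table types are bordered tables, unbordered tables, and partially bordered tables.
For usage details, see examples/ocr/table/TableDemo in this project.
Usage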

    public static void borderedRecog() {
        String imagePath = "D:\\project\\idea_workspace\\AutoText\\src\\main\\java\\examples\\ocr\\img_test\\bordered_example.png";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        List<List<List<Integer>>> resultList = BorderedRecog.recognizeStructure(imageMat);
        System.out.println(resultList);
        // ImageUtils.imshow("Image", pair.getRight());
    }

    public static void unBorderedRecog() {
        String imagePath = "D:\\project\\idea_workspace\\AutoText\\src\\main\\java\\examples\\ocr\\img_test\\unbordered_example.jpg";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        List<List<List<Integer>>> resultList = UnBorderedRecog.recognizeStructure(imageMat);
        System.out.println(resultList);
        // ImageUtils.imshow("Image", pair.getRight());
    }

    public static void partiallyBorderedRecog() {
        String imagePath = "D:\\project\\idea_workspace\\AutoText\\src\\main\\java\\examples\\ocr\\img_test\\partially_example.jpg";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        List<List<List<Integer>>> resultList = PartiallyBorderedRecog.recognizeStructure(imageMat);
        System.out.println(resultList);
        // ImageUtils.imshow("Image", pair.getRight());
    }

The result is the table cell coordinates:

    [[[58, 48, 247, 182], [560, 48, 247, 182], [811, 48, 246, 182], [309, 48, 247, 182], [1312, 48, 247, 182], 
    [1061, 48, 247, 182]], [[58, 234, 247, 118], [309, 234, 247, 118], [1061, 234, 247, 118], [560, 234, 247, 118], 
    [811, 234, 246, 118], [1312, 234, 247, 118]], [[58, 356, 247, 118], [309, 356, 247, 118], [560, 356, 247, 118], 
    [811, 356, 246, 118], [1061, 356, 247, 118], [1312, 356, 247, 118]], [[58, 478, 247, 118], [309, 478, 247, 118],
    [560, 478, 247, 118], [811, 478, 246, 118], [1061, 478, 247, 118], [1312, 478, 247, 118]], [[58, 600, 247, 119],
    [309, 600, 247, 119], [560, 600, 247, 119], [811, 600, 246, 119], [1061, 600, 247, 119], [1312, 600, 247, 119]], 
    [[58, 723, 247, 118], [309, 723, 247, 118], [560, 723, 247, 118], [1061, 723, 247, 118], [1312, 723, 247, 118], 
    [811, 723, 246, 118]], [[58, 845, 247, 118], [309, 845, 247, 118], [560, 845, 247, 118], [811, 845, 246, 118], 
    [1312, 845, 247, 118], [1061, 845, 247, 118]]]
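
Within a row, the cells above are not always listed from left to right (the first row starts 58, 560, 811, 309, ...). If a consumer needs reading order, the nested list can be sorted afterwards. The sketch below assumes each cell is `[x, y, width, height]` and that the outer list groups cells by row; `TableCellSortSketch` and `sortCells` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: orders rows top to bottom and cells left to right. */
final class TableCellSortSketch {
    /** Each cell is assumed to be [x, y, width, height]; empty rows are dropped. */
    static List<List<List<Integer>>> sortCells(List<List<List<Integer>>> rows) {
        List<List<List<Integer>>> sorted = new ArrayList<>();
        for (List<List<Integer>> row : rows) {
            if (row.isEmpty()) {
                continue;
            }
            // Copy first so immutable input lists are not modified
            List<List<Integer>> copy = new ArrayList<>(row);
            copy.sort(Comparator.comparingInt((List<Integer> cell) -> cell.get(0)));
            sorted.add(copy);
        }
        // Order rows by the y coordinate of their first cell
        sorted.sort(Comparator.comparingInt((List<List<Integer>> row) -> row.get(0).get(1)));
        return sorted;
    }

    public static void main(String[] args) {
        // First row of the sample output above, in its original order
        List<List<List<Integer>>> rows = List.of(List.of(
                List.of(58, 48, 247, 182), List.of(560, 48, 247, 182),
                List.of(811, 48, 246, 182), List.of(309, 48, 247, 182),
                List.of(1312, 48, 247, 182), List.of(1061, 48, 247, 182)));
        System.out.println(sortCells(rows));
    }
}
```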

Result image: table structure

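Table structure and OCR

This part combines table structure recognition with OCR, recognizing both the table cells and the text inside them.
For usage details, see examples/ocr/table_text/TableTextDemo in this project.
Usage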

    public static void main(String... args) throws IOException, TranslateException {
        String imagePath = "D:\\project\\idea_workspace\\AutoText\\src\\main\\java\\examples\\ocr\\img_test\\bordered_example.png";
        TableText tableText = new TableText();
        /*
         * maxSideLen: image resize
         *
         * borderType: {0: all, 1: bordered (default), 2: unbordered, 3: partially bordered}
         */
        int maxSideLen = -1; // default, no resize
        int borderType = 1; // default, bordered
        // The single-argument call below does not pass these two values
        List<TextListBox> listBoxes = tableText.tableTextRecog(imagePath);
        for (TextListBox textListBox : listBoxes) {
            System.out.print(textListBox);
        }
    }

The result is the table cell coordinates and the text inside each cell:

    position: [58.0, 48.0, 305.0, 48.0, 305.0, 230.0, 58.0, 230.0], text: 节次 星期
    position: [309.0, 48.0, 556.0, 48.0, 556.0, 230.0, 309.0, 230.0], text: 周一
    position: [811.0, 48.0, 1057.0, 48.0, 1057.0, 230.0, 811.0, 230.0], text: 周三
    position: [1061.0, 48.0, 1308.0, 48.0, 1308.0, 230.0, 1061.0, 230.0], text: 周四
    position: [560.0, 48.0, 807.0, 48.0, 807.0, 230.0, 560.0, 230.0], text: 周二
    position: [1312.0, 48.0, 1559.0, 48.0, 1559.0, 230.0, 1312.0, 230.0], text: 周五
    position: [58.0, 234.0, 305.0, 234.0, 305.0, 352.0, 58.0, 352.0], text: 
    position: [309.0, 234.0, 556.0, 234.0, 556.0, 352.0, 309.0, 352.0], text: 语文
    position: [811.0, 234.0, 1057.0, 234.0, 1057.0, 352.0, 811.0, 352.0], text: 英语
    position: [560.0, 234.0, 807.0, 234.0, 807.0, 352.0, 560.0, 352.0], text: 英语
    position: [1061.0, 234.0, 1308.0, 234.0, 1308.0, 352.0, 1061.0, 352.0], text: 自然
    position: [1312.0, 234.0, 1559.0, 234.0, 1559.0, 352.0, 1312.0, 352.0], text: 数学
    position: [58.0, 356.0, 305.0, 356.0, 305.0, 474.0, 58.0, 474.0], text: 3
    position: [560.0, 356.0, 807.0, 356.0, 807.0, 474.0, 560.0, 474.0], text: 英语
    position: [309.0, 356.0, 556.0, 356.0, 556.0, 474.0, 309.0, 474.0], text: 语文
    position: [811.0, 356.0, 1057.0, 356.0, 1057.0, 474.0, 811.0, 474.0], text: 英语
    position: [1312.0, 356.0, 1559.0, 356.0, 1559.0, 474.0, 1312.0, 474.0], text: 数学
    position: [1061.0, 356.0, 1308.0, 356.0, 1308.0, 474.0, 1061.0, 474.0], text: 语文
    position: [58.0, 478.0, 305.0, 478.0, 305.0, 596.0, 58.0, 596.0], text: 三
    position: [309.0, 478.0, 556.0, 478.0, 556.0, 596.0, 309.0, 596.0], text: 数学
    position: [560.0, 478.0, 807.0, 478.0, 807.0, 596.0, 560.0, 596.0], text: 语文
    position: [811.0, 478.0, 1057.0, 478.0, 1057.0, 596.0, 811.0, 596.0], text: 数学
    position: [1312.0, 478.0, 1559.0, 478.0, 1559.0, 596.0, 1312.0, 596.0], text: 英语
    position: [1061.0, 478.0, 1308.0, 478.0, 1308.0, 596.0, 1061.0, 596.0], text: 语文
    position: [58.0, 600.0, 305.0, 600.0, 305.0, 719.0, 58.0, 719.0], text: 四
    position: [309.0, 600.0, 556.0, 600.0, 556.0, 719.0, 309.0, 719.0], text: 数学
    position: [811.0, 600.0, 1057.0, 600.0, 1057.0, 719.0, 811.0, 719.0], text: 数学
    position: [560.0, 600.0, 807.0, 600.0, 807.0, 719.0, 560.0, 719.0], text: 语文
    position: [1061.0, 600.0, 1308.0, 600.0, 1308.0, 719.0, 1061.0, 719.0], text: 体育
    position: [1312.0, 600.0, 1559.0, 600.0, 1559.0, 719.0, 1312.0, 719.0], text: 英语
    position: [58.0, 723.0, 305.0, 723.0, 305.0, 841.0, 58.0, 841.0], text: 五
    position: [560.0, 723.0, 807.0, 723.0, 807.0, 841.0, 560.0, 841.0], text: 思想品德
    position: [309.0, 723.0, 556.0, 723.0, 556.0, 841.0, 309.0, 841.0], text: 体育
    position: [1061.0, 723.0, 1308.0, 723.0, 1308.0, 841.0, 1061.0, 841.0], text: 数学
    position: [1312.0, 723.0, 1559.0, 723.0, 1559.0, 841.0, 1312.0, 841.0], text: 手工
    position: [811.0, 723.0, 1057.0, 723.0, 1057.0, 841.0, 811.0, 841.0], text: 语文
    position: [58.0, 845.0, 305.0, 845.0, 305.0, 963.0, 58.0, 963.0], text: 六
    position: [309.0, 845.0, 556.0, 845.0, 556.0, 963.0, 309.0, 963.0], text: 美术
    position: [560.0, 845.0, 807.0, 845.0, 807.0, 963.0, 560.0, 963.0], text: 音乐
    position: [1061.0, 845.0, 1308.0, 845.0, 1308.0, 963.0, 1061.0, 963.0], text: 数学
    position: [811.0, 845.0, 1057.0, 845.0, 1057.0, 963.0, 811.0, 963.0], text: 语文
    position: [1312.0, 845.0, 1559.0, 845.0, 1559.0, 963.0, 1312.0, 963.0], text: 写字
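
The `position` values above line up with the cell rectangles from the table structure section: a cell `[58, 48, 247, 182]` becomes the four corners (58, 48), (305, 48), (305, 230), (58, 230), that is, x + width and y + height. A minimal sketch of that conversion follows. The `[x, y, width, height]` reading is inferred from the sample numbers, and `CellQuadSketch` and `toQuad` are hypothetical names rather than the project's API.

```java
import java.util.Arrays;

/** Hypothetical helper: converts a cell rectangle to a clockwise 4-point quad. */
final class CellQuadSketch {
    /** Returns top-left, top-right, bottom-right, bottom-left as x, y pairs. */
    static float[] toQuad(int x, int y, int width, int height) {
        float left = x;
        float top = y;
        float right = x + width;
        float bottom = y + height;
        return new float[] {left, top, right, top, right, bottom, left, bottom};
    }

    public static void main(String[] args) {
        // Prints [58.0, 48.0, 305.0, 48.0, 305.0, 230.0, 58.0, 230.0]
        System.out.println(Arrays.toString(toQuad(58, 48, 247, 182)));
    }
}
```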

Todo

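Add jcorrector text correction and modify parts of the program
Implement image OCR in Java based on PaddleOCR models
Recognize table structure with rule-based OpenCV processing
Integrate rule-based table recognition with OCR
Add layout detection and analysis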

...

Contact

1. GitHub: https://github.com/jiangnanboy

2. Blog: https://www.cnblogs.com/little-horse/

3. Email: 2229029156@qq.com

Citation

If you use AutoText in your research, please cite it as follows:

@misc{AutoText,
  author = {Shi Yan},
  title = {AutoText: Text automatic processing tool},
  year = {2023},
  url = {https://github.com/jiangnanboy/AutoText},
}

License

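AutoText is licensed under the Apache License 2.0 and is free for commercial use. Please include a link to AutoText and its license in your product documentation. AutoText is protected by copyright law, and infringement will be pursued.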

Contribute

Anyone interested is welcome to fork, star, and submit PRs.
