diff --git a/FlagOpen.png b/FlagOpen.png
new file mode 100644
index 0000000..d713966
Binary files /dev/null and b/FlagOpen.png differ
diff --git a/README.md b/README.md
index 323a60a..b7412e6 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,12 @@
-----------------------------------------------------------------------
-Data is one of the basic elements in the development of artificial intelligence. With the continuous breakthrough of large-scale pre-training model and related technology, it is becoming more and more important to use efficient data processing tools to improve data quality in the corresponding research. So we launched FlagData, an easy-to-use and easy-to-extend data processing toolkit. FlagData integrates several data processing tools and algorithms including data acquisition, data preparation, data preprocessing and data analysis, which provides a strong data level support for model training and deployment in natural language processing, computer vision and other fields.
+Data is one of the basic elements in the development of artificial intelligence. With continued breakthroughs in
+large-scale pre-trained models and related technologies, it is increasingly important to use efficient data
+processing tools to improve data quality in the corresponding research. We therefore launched FlagData, an easy-to-use
+and easy-to-extend data processing toolkit. FlagData integrates tools and algorithms for data
+acquisition, data preparation, data preprocessing and data analysis, providing strong data-level support for
+model training and deployment in natural language processing, computer vision and other fields.
FlagData supports the following features:
@@ -24,14 +29,13 @@ FlagData supports the following features:
The complete pipeline process and features such as
![pipeline](pipeline.png)
-## News
+## News
- [Dec 31st, 2023] FlagData v2.0.0 has been upgraded
- [Jan 31st, 2023] FlagData v1.0.0 is online!
--------------------------------------------------------------------------------
-
- [Installation](#Installation)
- [Quick Start](#Quick-Start)
- [Data acquisition phase](#Data-acquisition-phase)
@@ -57,7 +61,9 @@ The complete pipeline process and features such as
pip install -r requirements.txt
```
-Optionally install the `cleaner` module required in FlagData. You will only install the dependency packages for the corresponding modules, which is suitable for users who only want to use the `cleaner` module and do not want to install other module dependency packages.
+Optionally, install only the `cleaner` module of FlagData. This installs just the dependency packages for that
+module, which suits users who only want to use the `cleaner` module and do not want to install the
+dependency packages of other modules.
```bash
pip install flagdata[cleaner]
@@ -65,7 +71,8 @@ pip install flagdata[cleaner]
**Install the latest version of the main branch**
-The main branch is officially released by FlagData. If you want to install / update to the latest version of the main branch, use the following command:
+The main branch is officially released by FlagData. If you want to install / update to the latest version of the main
+branch, use the following command:
```
git clone https://github.com/FlagOpen/FlagData.git
@@ -83,22 +90,29 @@ pip install -r requirements.txt
### Data acquisition phase
-The LLM interface is utilized to construct a series of single rounds of SFT data for different abilities with three different strategies. The strategies include:
-
-+ ImitateGenerator: augment data using several case samples as templates. Supports simultaneous generation of data in multiple languages.
-+ AbilityExtractionGenerator: using the LLM interface, generalize the abilities contained in several case samples. Generate new samples and answers based on this collection of capabilities.
-+ AbilityDirectGenerator: Generate new samples directly related to a specified ability type or task type. For example, if you specify the ability as "Logical Reasoning", you can generate a series of logical reasoning questions and answers. In order to increase the diversity of generated samples, it is supported to exclude already generated samples.
+The LLM interface is used to construct single-turn SFT data for different abilities, using three
+different strategies:
++ ImitateGenerator: augments data using several case samples as templates. Supports simultaneous generation of data in
+ multiple languages.
++ AbilityExtractionGenerator: uses the LLM interface to generalize the abilities contained in several case samples, then
+ generates new samples and answers based on this collection of abilities.
++ AbilityDirectGenerator: generates new samples directly related to a specified ability type or task type. For example,
+ if you specify the ability as "Logical Reasoning", you can generate a series of logical reasoning questions and
+ answers. To increase the diversity of generated samples, already generated samples can be
+ excluded.
- See [ReadMe under data_gen Module](flagdata/data_gen/README.md) for an example.
-
+See [ReadMe under data_gen Module](flagdata/data_gen/README.md) for an example.
### Data preparation phase
-Under the all2txt module, the unstructured / semi-structured files such as pdf2txt and epub2txt can be converted into txt, and it can well solve the problem of incoherent text content caused by single column, double column, and the order of Chinese text interspersed with charts.
+The all2txt module (pdf2txt, epub2txt, etc.) converts unstructured / semi-structured files into
+txt, and it largely resolves the incoherent text caused by single-column and double-column layouts and by the order
+of text interspersed with figures and tables.
At the same time, the types of elements after parsing are "Table", "FigureCaption", "NarrativeText", "ListItem", "
-Title [Chapter Title]", "Address [E-mail]","PageBreak", "Header [Header]", "Footer [Footer]", "UncategorizedText [arxiv vertical number]", "
+Title [Chapter Title]", "Address [E-mail]","PageBreak", "Header [Header]", "Footer [Footer]", "
+UncategorizedText [arxiv vertical number]", "
Image, Formula, etc. Tool scripts provide two forms: keeping full text and saving by category resolution.
See [ReadMe under all2txt Module](flagdata/all2txt/README.md) for an example.
@@ -107,18 +121,22 @@ See [ReadMe under all2txt Module](flagdata/all2txt/README.md) for an example.
#### Language recognition
-Under the language_identification module, the language classifier of fastText is used to classify. The language classifier of fastText is based on Wikipedia, Tatoeba and SETimes.
-The above training uses n-grams as a feature and uses a hierarchical softmax. Supports 176 language classifications, and finally outputs a score of 0: 1.
+Under the language_identification module, the fastText language classifier is used for classification. The fastText
+language classifier is trained on Wikipedia, Tatoeba and SETimes.
+It uses n-grams as features and a hierarchical softmax. It supports classification of 176 languages and
+finally outputs a score between 0 and 1.
+ Each CPU core can process a thousand documents per second.
+ Do a language classification for each web page and get the score of the classification.
-+ For general cleaning rules, if it is greater than 0.5, it is classified as a specific language, otherwise it indicates that the page is not sure what language it is and discards the page.
++ Under the general cleaning rules, a page whose score is greater than 0.5 is classified as that language; otherwise
+ the language of the page is considered uncertain and the page is discarded.
See [ReadMe under language_identification Module](flagdata/language_identification/README.md) for an example.
#### Data cleaning
-The cleaner module uses multi-process pool mp.Pool to process data in parallel in a multi-process manner. Use SharedMemoryManager to create shareable data structures, and multiple processes share data in data processing.
+The cleaner module uses the multi-process pool mp.Pool to process data in parallel. It uses
+SharedMemoryManager to create shareable data structures, so that multiple processes share data during processing.
Efficient data cleaning is achieved through multi-processes and shared memory:
@@ -133,7 +151,9 @@ Currently, the following cleaning rules are included:
It takes only two steps to use the data cleaning feature of FlagData:
-1. Modify the data path and format in the YAML configuration file. We give detailed comments on each parameter in the configuration file template to explain its meaning. At the same time, you can refer to [Configuration](#Configuration) Chapter.
+1. Modify the data path and format in the YAML configuration file. Each parameter in the
+ configuration file template has a detailed comment explaining its meaning. You can also refer
+ to the [Configuration](#Configuration) chapter.
2. Specify the configuration file path in the following code and run it
```python
@@ -143,64 +163,89 @@ It takes only two steps to use the data cleaning feature of FlagData:
cleaner.clean()
```
-The cleaned file will be saved in the format `jsonl` to the path corresponding to the `output` parameter specified in the configuration file.
+The cleaned file will be saved in `jsonl` format to the path given by the `output` parameter in
+the configuration file.
-See [Tutorial 1: Clean the original text obtained from the Internet](/flagdata/cleaner/tutorial_01_cleaner.md) for an example.
+See [Tutorial 1: Clean the original text obtained from the Internet](/flagdata/cleaner/tutorial_01_cleaner.md) for an
+example.
#### Quality assessment
BERT and fasttext were selected as the evaluation model because they have the following advantages:
-1. BERT model performs well in text classification and understanding tasks, has strong language comprehension and representation ability, and can effectively evaluate text quality.
-2. FastText model has efficient training and reasoning speed, while maintaining the classification performance, which can significantly reduce the training and reasoning time.
+1. The BERT model performs well in text classification and understanding tasks, has strong language comprehension and
+ representation ability, and can effectively evaluate text quality.
+2. The FastText model trains and runs inference efficiently while maintaining classification performance, which
+ can significantly reduce training and inference time.
-This paper compares different text classification models, including logical regression, BERT and FastText, to evaluate their performance. In the experiment, BERTEval and FastText models perform well in text classification tasks, and FastText model performs best in terms of accuracy and recall rate. [experimental results are from ChineseWebText]
+Different text classification models, including logistic regression, BERT and FastText, are compared to evaluate
+their performance. In the experiment, the BERTEval and FastText models perform well in text classification tasks, and
+the FastText model performs best in terms of accuracy and recall. [experimental results are from ChineseWebText]
See [ReadMe under quality_assessment Module](flagdata/quality_assessment/README.md) for an example.
#### Data deduplication
-deduplication Module provides the ability to deduplicate large amounts of text data, using MinHashLSH (Least Hash Locally Sensitive Hash) by converting text into a series of hash values in order to compare similarities between texts.
+The deduplication module deduplicates large amounts of text data using MinHashLSH (MinHash with
+Locality-Sensitive Hashing), which converts text into a series of hash values in order to compare similarities between texts.
-We can control the parameter threshold, which represents the threshold of similarity, with values ranging from 0 to 1. A setting of 1 means that there is an exact match and no text is filtered out. On the contrary, if a lower similarity value is set, texts with slightly higher similarity will also be retained. We can set a higher threshold value as needed to retain only those texts that are very similar, while discard those texts with slightly less similarity. The empirical default value is 0.87. At the same time, we use the distributed computing power of Spark to deal with large-scale data, the idea of MapReduce is used to remove duplicates, and tuned by spark to deal with large-scale text data sets efficiently.
-The following is the similar text iterated in the process of data deduplication, which has slight differences in line wrapping and name editing, but the deduplication algorithm can identify two paragraphs of text that are highly similar.
+The `threshold` parameter sets the similarity threshold, with values ranging from 0 to 1. A
+setting of 1 means that only exact matches count as duplicates, so almost no text is filtered out. Conversely, with a
+lower threshold, texts that are only moderately similar are also treated as duplicates. Set a higher threshold as needed
+so that only very similar texts are matched, while texts with slightly less similarity are left alone. The empirical
+default value is 0.87. We also use the distributed computing power of Spark to handle large-scale data:
+duplicates are removed following the MapReduce idea, and the job is tuned in Spark to process large-scale text data sets
+efficiently.
+The following are similar texts found during data deduplication. They differ slightly in line
+wrapping and name editing, but the deduplication algorithm identifies the two paragraphs as highly similar.
```json lines
{
- "__id__":3023656977259,
- "content":"\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)\n(责任编辑:单晓冰)"
+ "__id__": 3023656977259,
+ "content": "\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)\n(责任编辑:单晓冰)"
}
{
- "__id__":3934190045072,
- "content":"记者 潘世鹏\n\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆。......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)"
+ "__id__": 3934190045072,
+ "content": "记者 潘世鹏\n\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆。......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)"
}
```
Integration of a single capability of spark:
-Most of the time, we want to use the distributed data processing capability of spark. Here is a way to transform an ordinary function into a spark udf function, and then use the spark capability.
+Most of the time, we want to use the distributed data processing capability of spark. Here is a way to transform an
+ordinary function into a spark udf function, and then use the spark capability.
But for functions that want to be transformed into spark tasks, you need to meet:
1. Data parallelism: the input data of a function can be divided into multiple parts and processed in parallel.
-2. Serializable and immutable: Functions in Spark must be serializable in order to be transmitted across different nodes.
-3. Does not depend on a specific compute node: the execution of the function does not depend on the computing resources or data storage location of a particular node so that it can be executed on any node in the cluster.
-4. Stateless or shareable state: functions do not depend on external states or only on shareable states. This ensures that there are no conflicts or race conditions when executing functions in parallel on different compute nodes.
-
-When using UDF, you should consider performance and optimization. Some functions may work well in a local Python environment, but may not be efficient in a distributed Spark environment.
-For complex logic or functions that require a lot of memory, further optimization and consideration may be required. UDF is designed for simple logic and data processing, and for more complex calculations, you may need to use the
+2. Serializable and immutable: Functions in Spark must be serializable in order to be transmitted across different
+ nodes.
+3. Does not depend on a specific compute node: the execution of the function does not depend on the computing resources
+ or data storage location of a particular node so that it can be executed on any node in the cluster.
+4. Stateless or shareable state: functions do not depend on external states or only on shareable states. This ensures
+ that there are no conflicts or race conditions when executing functions in parallel on different compute nodes.
+
+When using UDFs, you should consider performance and optimization. Some functions may work well in a local Python
+environment, but may not be efficient in a distributed Spark environment.
+Complex logic or functions that require a lot of memory may need further optimization. UDFs
+are designed for simple logic and data processing; for more complex calculations, you may need to use
Spark's native operator for processing.
-The deduplication module provides a common Python function (to determine whether it is a substring of other strings) to use spark udf rewriting, which makes it easy to use spark distributed capabilities. For more information, please see `stringMatching.py` and `stringMatching.py`.
+The deduplication module provides an ordinary Python function (which checks whether a string is a substring of other
+strings) rewritten as a Spark UDF, which makes it easy to use Spark's distributed capabilities. For more information,
+please compare `stringMatching.py` and `udf_spark_stringMatching.py`.
-If the user simply changes the python function to a spark task, it will not work without a spark cluster. Here a detailed document of building a cluster is written in detail, which is convenient for novice users.
+Simply changing a Python function into a Spark task will not work without a Spark cluster. A
+step-by-step document on building a cluster is provided, which is convenient for novice users.
See [Spark cluster building](flagdata/deduplication/README.md) for an example.
### Data analysis phase
The analysis data analysis module provides the following functions:
-+ the average turn analysis code of the text, and calculate the average number of rounds (take the newline character as an example)
+
++ average turn analysis of the text: calculates the average number of turns (using the newline character as the
+ separator, for example)
+ the domain distribution of the text
@@ -208,14 +253,15 @@ The analysis data analysis module provides the following functions:
+ length analysis of the text.
-
See [ReadMe under analysis Module](flagdata/analysis/README.md) for an example.
## Configuration
For the `data cleansing` and `data quality assessment` modules,
-We provide a profile template:[cleaner_config.yaml](https://dorc.baai.ac.cn/resources/projects/FlagData/cleaner_config.yaml), [bert_config.yaml](flagdata/quality_assessment/Bert/bert_config.yaml)。
-The configuration file is readable [YAML](https://yaml.org) format , provides detailed comments. Please make sure that the parameters have been modified in the configuration file before using these modules.
+We provide configuration
+templates: [cleaner_config.yaml](https://dorc.baai.ac.cn/resources/projects/FlagData/cleaner_config.yaml), [bert_config.yaml](flagdata/quality_assessment/Bert/bert_config.yaml).
+The configuration files are in readable [YAML](https://yaml.org) format with detailed comments. Please make sure that
+the parameters have been modified in the configuration file before using these modules.
Here are some important parameters you need to pay attention to:
@@ -240,10 +286,16 @@ Here are some important parameters you need to pay attention to:
## Contact us
-If you have any questions about the use and code of this project, you can submit issue. At the same time, you can contact us directly through data@baai.ac.cn.
+If you have any questions about the use and code of this project, you can submit an issue. You can also
+contact us directly through data@baai.ac.cn.
+
+An active community is inseparable from your contributions. If you have a new idea, you are welcome to join our
+community and contribute to open source together with us!
+
+
-An active community is inseparable from your contribution, if you have a new idea, welcome to join our community, let us become a part of open source, together to contribute our own efforts for open source!
-![contact_me](contact_me.png)
+You can also follow the Zhiyuan FlagOpen open source system on the FlagOpen official website: https://flagopen.baai.ac.cn/
+![contact_me](FlagOpen.png)
## Reference project
diff --git a/README_zh.md b/README_zh.md
index 1b06a68..ae2ba9d 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -31,7 +31,6 @@ FlagData支持以下特性:
--------------------------------------------------------------------------------
-
- [安装](#1安装)
- [快速上手](#2快速上手)
- [数据获取阶段](#21数据获取阶段)
@@ -43,8 +42,8 @@ FlagData支持以下特性:
- [数据去重](#234数据去重)
- [数据分析阶段](#24数据分析阶段)
- [配置](#3配置)
- - [数据清洗](#31数据清洗)
- - [数据质量评估](#32数据质量评估)
+ - [数据清洗](#31数据清洗)
+ - [数据质量评估](#32数据质量评估)
- [联系我们](#4联系我们)
- [参考项目](#5参考项目)
- [许可证](#6许可证)
@@ -96,7 +95,8 @@ pip install -r requirements.txt
all2txt模块下,将pdf2txt、epub2txt等非结构化/半结构化的文件转成txt,并且可以很好的解决单栏、双栏,以及图表穿插中文本的顺序等导致问题文本内容不连贯的问题。
- 同时解析后的元素种类有"Table(表格)", "FigureCaption(图片标题)", "NarrativeText【正文】", "ListItem【参考文献】", "
+ 同时解析后的元素种类有"Table(表格)", "FigureCaption(图片标题)", "NarrativeText【正文】", "
+ListItem【参考文献】", "
Title【章节标题】", "Address【邮箱地址】","PageBreak", "Header【页眉】", "Footer【页脚】", "UncategorizedText【arxiv竖排编号】", "
Image(图)", "Formula(公式)" 等,工具脚本提供保留全文,以及按照类别解析保存两种形式。
@@ -106,7 +106,8 @@ Image(图)", "Formula(公式)" 等,工具脚本提供保留全文,以及
#### 2.3.1、语言识别
- language_identification模块下,使用 fastText 的语言分类器来做分类,fastText 的语言分类器是在 Wikipedia、Tatoeba、SETimes
+ language_identification模块下,使用 fastText 的语言分类器来做分类,fastText 的语言分类器是在
+Wikipedia、Tatoeba、SETimes
上面训练的,使用了 n-grams 来作为特征,使用了层级的 softmax。支持 176 种语言的分类,并且最后会输出一个 0~1 的分数。
+ 每个 CPU 核心上,每秒可以处理一千个文档。
@@ -117,7 +118,8 @@ Image(图)", "Formula(公式)" 等,工具脚本提供保留全文,以及
#### 2.3.2、数据清洗
- cleaner模块,使用多进程池 mp.Pool,通过多进程方式并行处理数据。使用 SharedMemoryManager 创建可共享的数据结构,在数据处理中多进程共享数据。
+ cleaner模块,使用多进程池 mp.Pool,通过多进程方式并行处理数据。使用 SharedMemoryManager
+创建可共享的数据结构,在数据处理中多进程共享数据。
通过多进程和共享内存的方式实现了高效的数据清洗:
@@ -167,12 +169,12 @@ Image(图)", "Formula(公式)" 等,工具脚本提供保留全文,以及
```json lines
{
- "__id__":3023656977259,
- "content":"\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)\n(责任编辑:单晓冰)"
+ "__id__": 3023656977259,
+ "content": "\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)\n(责任编辑:单晓冰)"
}
{
- "__id__":3934190045072,
- "content":"记者 潘世鹏\n\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆。......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)"
+ "__id__": 3934190045072,
+ "content": "记者 潘世鹏\n\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆。......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)"
}
```
@@ -191,7 +193,8 @@ spark单一能力的集成:
对于复杂的逻辑或需要大量内存的函数,可能需要进一步的优化和考虑。UDF 是为了简单的逻辑和数据处理而设计的,对于更复杂的计算,可能需要使用
Spark 的原生算子来进行处理。
-deduplication模块下提供了普通Python函数(判断是否是其他字符串的子字符串)使用spark udf的改写,可以方便的使用spark分布式能力,详细请见`stringMatching.py`和`udf_spark_stringMatching.py`的对比
+deduplication模块下提供了普通Python函数(判断是否是其他字符串的子字符串)使用spark
+udf的改写,可以方便的使用spark分布式能力,详细请见`stringMatching.py`和`udf_spark_stringMatching.py`的对比
如果用户只是单单将python函数改成spark任务,如果没有spark集群是不行的。这里详细的写了傻瓜式搭建集群的文档,方便小白用户使用。具体示例见[spark集群搭建](flagdata/deduplication/README_zh.md)
@@ -206,7 +209,6 @@ analysis数据分析模块提供如下功能:
(4)文本的长度分析
-
具体详细示例见[analysis模块下的readMe](flagdata/analysis/README_zh.md)
## 3、配置
@@ -241,7 +243,10 @@ analysis数据分析模块提供如下功能:
如果你对本项目的使用和代码有任何问题,可以提交issue。同时你也可以通过邮箱 data@baai.ac.cn 直接联系我们;
一个活跃的社区离不开你的贡献,如果你有新的idea,欢迎加入我们的社区,让我们成为开源的一部分,一起为开源贡献自己的力量!!!
-![contact_me](contact_me.png)
+
+
+或者关注智源 FlagOpen 开源体系,FlagOpen 官网 https://flagopen.baai.ac.cn/
+![contact_me](FlagOpen.png)
## 5、参考项目