Create dataset wikihow_vietnamese_human_instructions #358

Open
albertvillanova opened this issue Jan 19, 2022 · 2 comments
Labels: data catalog (Gathering data from data sources)

Comments

@albertvillanova
Member

  • uid: wikihow_vietnamese_human_instructions
  • type: processed
  • description:
    • name: wikiHow Vietnamese Human Instructions
    • description: Step-by-step instructions in Vietnamese extracted from wikiHow and decomposed into a formal graph representation in RDF. For any queries and requests contact: Paolo Pareti
      To cite this dataset use:
      Paula Chocron, Paolo Pareti. Vocabulary Alignment for Collaborative Agents: a Study with Real-World Multilingual How-to Instructions.
    • homepage: https://www.kaggle.com/paolop/human-instructions-vietnamese-wikihow
    • validated: True
  • languages:
    • language_names:
      • Vietnamese
    • language_comments:
    • language_locations:
      • Asia
      • Vietnam
    • validated: False
  • custodian:
  • availability:
    • procurement:
    • licensing:
      • has_licenses: Yes
      • license_text: CC BY-NC-SA 4.0
      • license_properties:
      • license_list:
        • cc-by-nc-4.0: Creative Commons Attribution Non Commercial 4.0 International
    • pii:
      • has_pii: Yes
      • generic_pii_likely: somewhat likely
      • generic_pii_list:
        • names
        • website account name or handle
        • URLs
      • numeric_pii_likely: somewhat likely
      • numeric_pii_list:
      • sensitive_pii_likely: somewhat likely
      • sensitive_pii_list:
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • processed_from_primary:
    • from_primary: Taken from primary source
    • primary_availability: Yes - their documentation/homepage/description is available
    • primary_license: Yes - the dataset has the same license as the source material
    • primary_types:
      • web | wiki
    • validated: False
    • from_primary_entries:
  • media:
    • category:
      • text
    • text_format:
      • other
      • RDF
    • audiovisual_format:
    • image_format:
    • database_format:
      • .ZIP
    • text_is_transcribed: No
    • instance_type: Sentences / instructions
    • instance_count: 1K<n<10K
    • instance_size: 10<n<100
    • validated: False
  • fname: wikihow_vietnamese_human_instructions.json
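
As a rough illustration of how an entry like the one above might be stored under the file name given in `fname`, here is a minimal sketch that copies a subset of the fields into a nested dictionary and serializes it with the standard `json` module. The nesting simply mirrors the bullet list; the actual on-disk layout used by the catalog is not shown in this issue, so the structure and the helper `unvalidated_sections` are assumptions made for illustration only.

```python
import json

# Partial, hand-copied subset of the catalog entry above.
# Assumption: the JSON nesting mirrors the bullet list; the real schema
# used by the catalog is not shown in this issue.
entry = {
    "uid": "wikihow_vietnamese_human_instructions",
    "type": "processed",
    "description": {
        "name": "wikiHow Vietnamese Human Instructions",
        "homepage": "https://www.kaggle.com/paolop/human-instructions-vietnamese-wikihow",
        "validated": True,
    },
    "languages": {
        "language_names": ["Vietnamese"],
        "language_locations": ["Asia", "Vietnam"],
        "validated": False,
    },
    "media": {
        "category": ["text"],
        "text_format": ["other", "RDF"],
        "instance_count": "1K<n<10K",
        "validated": False,
    },
    "fname": "wikihow_vietnamese_human_instructions.json",
}


def unvalidated_sections(record):
    """Hypothetical helper: names of top-level sections whose `validated` flag is False."""
    return [
        key
        for key, value in record.items()
        if isinstance(value, dict) and value.get("validated") is False
    ]


# Serialize without escaping non-ASCII characters, which matters for
# Vietnamese text if it ever appears in the entry.
serialized = json.dumps(entry, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    print(serialized)
    print("Sections still to validate:", unvalidated_sections(entry))
```

A helper of this kind could be used to find which sections of an entry still need a reviewer's pass before the issue is closed.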
@albertvillanova albertvillanova added the data catalog Gathering data from data sources label Jan 19, 2022
@cakiki
Member

cakiki commented Jan 23, 2022

#self-assign

@cakiki
Member

cakiki commented Jan 23, 2022

Note: This is part of a multilingual resource: https://www.kaggle.com/paolop/human-instructions-multilingual-wikihow

Parent project: http://paolopareti.uk/homepage/prohow/index.htm

For the multilingual dataset, this is the list of the available languages and number of articles in each:

English: 133,842
German: 57,533
Hindi: 6,519
Russian: 127,738
Korean: 7,606
Portuguese: 92,520
Italian: 79,656
French: 60,105
Spanish: 120,507
Chinese: 82,558
Czech: 10,619
Arabic: 15,589
Thai: 10,213
Vietnamese: 8,670
Indonesian: 39,246
Dutch: 19,318
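
To put the Vietnamese subset in context, the per-language figures above can be totalled and the Vietnamese share computed. The sketch below assumes each figure is a whole number of articles (that is, the separator in the list is a thousands separator rather than a decimal point); the variable names are illustrative.

```python
# Article counts per language, copied from the list above.
# Assumption: each figure is a whole number of articles.
counts = {
    "English": 133_842,
    "German": 57_533,
    "Hindi": 6_519,
    "Russian": 127_738,
    "Korean": 7_606,
    "Portuguese": 92_520,
    "Italian": 79_656,
    "French": 60_105,
    "Spanish": 120_507,
    "Chinese": 82_558,
    "Czech": 10_619,
    "Arabic": 15_589,
    "Thai": 10_213,
    "Vietnamese": 8_670,
    "Indonesian": 39_246,
    "Dutch": 19_318,
}

# Total number of articles across all listed languages.
total = sum(counts.values())

# Fraction of the multilingual resource that is Vietnamese.
vietnamese_share = counts["Vietnamese"] / total

if __name__ == "__main__":
    print(f"Total articles: {total}")
    print(f"Vietnamese share: {vietnamese_share:.2%}")
    # Languages ordered from most to fewest articles.
    for language, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print(f"{language}: {n}")
```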

The dataset is in RDF format.
