What is the true size of dataset? #7

Open
jay2012-lin opened this issue May 30, 2019 · 3 comments

jay2012-lin commented May 30, 2019

The paper "Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop" in KDD said you used dataset with 70,258 documents from 12,798 authors. But the dataset in this project has 203,078 documnents. It is bigger than what said in paper. However, you said in project NOTE:

"Training data in this demo are smaller than what we used in the paper, so the performance (F1-score) will be a little bit lower than reported scores."

These two statements are contradictory. Could you tell me the true size of the dataset used in the paper? Can you help me?

Best regards!

@kourenmu

I think the sentence "We sampled 100 author names from a well-labeled subset of AMiner database. The benchmark consists of 70,258 documents from 12,798 authors." refers only to the test set.
In the test set of this project, there are 100 names but only 6,399 authors.

@shivashankarrs

Yeah, it is not clear which subset of the dataset (out of the 600 author groups, covering both train and test) was manually annotated. I am not able to find the subset with 100 names, 70,258 documents, and 12,798 authors.

@sanlunainiu

I think the sentence "We sampled 100 author names from a well-labeled subset of AMiner database. The benchmark consists of 70,258 documents from 12,798 authors." refers only to the test set. In the test set of this project, there are 100 names but only 6,399 authors.

In fact, the test set "name_to_pubs_test_100.json" contains only 6,399 authors and 35,129 documents, which is roughly half of the numbers reported by the authors ("The benchmark consists of 70,258 documents from 12,798 authors.").
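
For reference, a minimal sketch of how these counts can be reproduced from the file, assuming name_to_pubs_test_100.json maps each author name to a dict of author IDs, each holding a list of that author's paper entries (this layout is an assumption about the file, not something stated in the paper):

```python
import json

# Count names, candidate authors, and documents in the test split.
# Assumption: name_to_pubs_test_100.json has the layout
#   name -> author_id -> list of paper entries
with open("name_to_pubs_test_100.json", encoding="utf-8") as f:
    name_to_pubs = json.load(f)

n_names = len(name_to_pubs)
n_authors = sum(len(authors) for authors in name_to_pubs.values())
n_docs = sum(len(pubs)
             for authors in name_to_pubs.values()
             for pubs in authors.values())

print(f"names: {n_names}, authors: {n_authors}, documents: {n_docs}")
# If the counts above are right, this should print
# 100 names, 6,399 authors, 35,129 documents.
```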
