Update README.md #8

Open
wants to merge 2 commits into
base: main
2 changes: 2 additions & 0 deletions README.md
@@ -25,3 +25,5 @@ The goal of this project is to use AI to evaluate the accuracy of academic paper
**Kushal Chhetri**, CS student at Virginia Tech - I like bouldering and watching video essays.

**Vivian Zheng**, Sophomore studying Computer Science at Stony Brook University

**Mohani Adem**, Biomedical Science student at Sophie Davis/CUNY School of Medicine
287 changes: 287 additions & 0 deletions parsing.ipynb
@@ -0,0 +1,287 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import json"
]
},
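{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"def parse_json_file(input_file):\n",
"    # Index the JSON Lines snapshot by arXiv ID and write the result to the\n",
"    # notebook-level `output_file` path.\n",
"    arXiv_parsed = {}\n",
"    with open(input_file, 'r') as file:\n",
"        # Keep the line number and the line text in separate variables so the\n",
"        # error message reports where the bad record is, not its full text.\n",
"        for line_number, line in enumerate(file, start=1):\n",
"            try:\n",
"                record = json.loads(line)\n",
"            except json.JSONDecodeError as e:\n",
"                print(f\"Skipping invalid JSON on line {line_number}: {e}\")\n",
"                continue\n",
"\n",
"            arxiv_id = record.get('id')\n",
"            if arxiv_id:\n",
"                arXiv_parsed[arxiv_id] = record\n",
"\n",
"    with open(output_file, 'w') as outfile:\n",
"        json.dump(arXiv_parsed, outfile)"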
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"input_file = '/home/jovyan/work/academic-rag/arxiv-metadata-oai-snapshot.json'\n",
"output_file= '/home/jovyan/work/academic-rag/arxiv_parsed.json'"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skipping invalid JSON on line {\"id\":\"1411.5022\",\"submitter\":\"James Guillochon\",\"authors\":\"James Guillochon (1) and Abraham Loeb (1) ((1) Harvard ITC)\",\"title\":\"The Fastest Unbound Stars in the Universe\",\"comments\":\"20 pages, 18 figures. Accepted by ApJ\",\"journal-ref\":null,\"doi\":null,\"report-no\":null,\"categories\":\"astro-ph.GA astro-ph.CO\",\"license\":\"http://arxiv.org/licenses/nonexclusive-distrib/1.0/\",\"abstract\":\" The discovery of hypervelocity stars (HVS) leaving our galaxy with speeds of\\nnearly $10^{3}$ km s$^{-1}$ has provided strong evidence towards the existence\\nof a massive compact object at the galaxy's center. HVS ejected via the\\ndisruption of stellar binaries can occasionally yield a star with $v_{\\\\infty}\\n\\\\lesssim 10^4$ km s$^{-1}$, here we show that this mechanism can be extended to\\nmassive black hole (MBH) mergers, where the secondary star is replaced by a MBH\\nwith mass $M_2 \\\\gtrsim 10^5 M_{\\\\odot}$. We find that stars that are originally\\nbound to the secondary MBH are frequently ejected with $v_{\\\\infty} > 10^4$ km\\ns$^{-1}$, and occasionally with velocities $\\\\sim 10^5$ km s$^{-1}$ (one third\\nthe speed of light), for this reason we refer to stars ejected from these\\nsystems as \\\"semi-relativistic\\\" hypervelocity stars (SHS). Bound to no galaxy,\\nthe velocities of these stars are so great that they can cross a significant\\nfraction of the observable universe in the time since their ejection (several\\nGpc). We demonstrate that if a significant fraction of MBH mergers undergo a\\nphase in which their orbital eccentricity is $\\\\gtrsim 0.5$ and their periapse\\ndistance is tens of the primary's Schwarzschild radius, the space density of\\nfast-moving ($v_{\\\\infty} > 10^{4}$ km s$^{-1}$) SHS may be as large as $10^{3}$\\nMpc$^{-3}$. Hundreds of the SHS will be giant stars that c: Unterminated string starting at: line 1 column 386 (char 385)\n"
]
}
],
"source": [
"parse_json_file(input_file)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def arxiv_record(arxiv_id):\n",
"    # Look up one record in the parsed index; the whole file is reloaded on each call.\n",
"    with open(output_file, 'r') as file:\n",
"        data = json.load(file)\n",
"    return data.get(arxiv_id, f\"No record found for arXiv ID: {arxiv_id}\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'id': '0812.0001',\n",
" 'submitter': 'Frank Tackmann',\n",
" 'authors': 'Keith S. M. Lee and Frank J. Tackmann',\n",
" 'title': 'Nonperturbative m_X cut effects in B -> Xs l+ l- observables',\n",
" 'comments': '11 pages, 4 figures, v2: corrected typos and Eq. (25), v3: journal\\n version',\n",
" 'journal-ref': 'Phys.Rev.D79:114021,2009',\n",
" 'doi': '10.1103/PhysRevD.79.114021',\n",
" 'report-no': 'CALT-68-2710, MIT-CTP 3999',\n",
" 'categories': 'hep-ph',\n",
" 'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',\n",
" 'abstract': ' Recently, it was shown that in inclusive B -> Xs l+ l- decay, an angular\\ndecomposition provides three independent (q^2 dependent) observables. A\\nstrategy was formulated to extract all measurable Wilson coefficients in B ->\\nXs l+ l- from a few simple integrals of these observables in the low q^2\\nregion. The experimental measurements in the low q^2 region require a cut on\\nthe hadronic invariant mass, which introduces a dependence on nonperturbative b\\nquark distribution functions. The associated hadronic uncertainties could\\npotentially limit the sensitivity of these decays to new physics. We compute\\nthe nonperturbative corrections to all three observables at leading and\\nsubleading order in the power expansion in \\\\Lambda_QCD/m_b. We find that the\\nsubleading power corrections give sizeable corrections, of order -5% to -10%\\ndepending on the observable and the precise value of the hadronic mass cut.\\nThey cause a shift of order -0.05 GeV^2 to -0.1 GeV^2 in the zero of the\\nforward-backward asymmetry.\\n',\n",
" 'versions': [{'version': 'v1', 'created': 'Mon, 1 Dec 2008 20:55:40 GMT'},\n",
" {'version': 'v2', 'created': 'Tue, 10 Feb 2009 21:19:07 GMT'},\n",
" {'version': 'v3', 'created': 'Thu, 25 Jun 2009 20:35:31 GMT'}],\n",
" 'update_date': '2009-08-03',\n",
" 'authors_parsed': [['Lee', 'Keith S. M.', ''], ['Tackmann', 'Frank J.', '']]}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arxiv_record('0812.0001')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def select_random_arxiv_ids(year=\"2024\", count=100):\n",
"    # IDs of records whose update_date falls in the requested year.\n",
"    matching_ids = []\n",
"\n",
"    with open(output_file, 'r') as file:\n",
"        data = json.load(file)\n",
"\n",
"    for arxiv_id, record in data.items():\n",
"        # Treat a missing or null update_date as an empty string so that\n",
"        # startswith() does not raise AttributeError.\n",
"        date_str = record.get('update_date') or ''\n",
"        if date_str.startswith(year):\n",
"            matching_ids.append(arxiv_id)\n",
"\n",
"    # Sample without replacement; return every match when there are fewer than count.\n",
"    if len(matching_ids) >= count:\n",
"        selected_ids = random.sample(matching_ids, count)\n",
"    else:\n",
"        selected_ids = matching_ids\n",
"\n",
"    return selected_ids"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['1410.8506',\n",
" '1108.5475',\n",
" '1203.1691',\n",
" '1307.1460',\n",
" '1202.4131',\n",
" '0903.3835',\n",
" '1204.3714',\n",
" '1209.1867',\n",
" '1109.6622',\n",
" '1008.1736',\n",
" '1301.6714',\n",
" '1110.4034',\n",
" '1304.1345',\n",
" '1107.0237',\n",
" '0811.1449',\n",
" '1411.3218',\n",
" '1306.4806',\n",
" '0911.0334',\n",
" '1211.5661',\n",
" '1304.3343',\n",
" '1203.5760',\n",
" '1103.3966',\n",
" '1103.2273',\n",
" '1212.0494',\n",
" '1405.0993',\n",
" '1106.3959',\n",
" '1104.2444',\n",
" '1304.3389',\n",
" '1308.0859',\n",
" '1104.3534',\n",
" '1110.3465',\n",
" '1304.0228',\n",
" '1309.7583',\n",
" '1012.3122',\n",
" '0911.5246',\n",
" '1405.3364',\n",
" '1312.0281',\n",
" '1005.0555',\n",
" '1408.5513',\n",
" '1306.1138',\n",
" '1310.8464',\n",
" '0810.0725',\n",
" '1404.2211',\n",
" '1311.3276',\n",
" '1212.1081',\n",
" '1304.5228',\n",
" '0808.0163',\n",
" '0911.0105',\n",
" '1404.5928',\n",
" '1406.5504',\n",
" '1304.5774',\n",
" '1303.0327',\n",
" '1304.0174',\n",
" '1005.2358',\n",
" '1107.4431',\n",
" '1105.2854',\n",
" '1404.7424',\n",
" '1406.0886',\n",
" '1105.3528',\n",
" '1305.2669',\n",
" '1306.5469',\n",
" '1210.2050',\n",
" '0809.0182',\n",
" '0808.2600',\n",
" '1405.3626',\n",
" '1003.1120',\n",
" '1112.6226',\n",
" '1311.4997',\n",
" '1409.1509',\n",
" '1402.5070',\n",
" '1407.8006',\n",
" '1211.5273',\n",
" '1304.0180',\n",
" '1304.7349',\n",
" '1311.7424',\n",
" '1408.5859',\n",
" '1210.8198',\n",
" '0704.0495',\n",
" '0909.4342',\n",
" '1309.4142',\n",
" '0905.1813',\n",
" '0901.1988',\n",
" '1308.0497',\n",
" '1402.2221',\n",
" '0811.2745',\n",
" '1206.1978',\n",
" '1209.4970',\n",
" '1002.1270',\n",
" '1111.4181',\n",
" '1402.6938',\n",
" '1007.5234',\n",
" '1207.4317',\n",
" '0707.4539',\n",
" '1108.1840',\n",
" '1406.1444',\n",
" '1311.2191',\n",
" '1408.5972',\n",
" '1305.5617',\n",
" '1211.5656',\n",
" '1404.2214']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"select_random_arxiv_ids()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
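
For reference, the notebook's two main steps (indexing a JSON Lines snapshot by `id`, then sampling IDs by `update_date` year) can be sketched as a standalone script. This is a minimal sketch and not the PR author's code: the names `index_records` and `sample_ids_by_year`, the `seed` parameter, and the choice to collect skipped line numbers instead of printing them are all assumptions made for illustration. It works on any iterable of lines, so it can be tried without the full arXiv snapshot.

```python
import json
import random


def index_records(lines):
    """Build an {id: record} dict from an iterable of JSON Lines strings.

    Hypothetical helper: returns the index plus the 1-based numbers of
    lines that could not be parsed as a JSON object.
    """
    index = {}
    skipped = []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A truncated download typically fails here on its last line.
            skipped.append(line_number)
            continue
        if not isinstance(record, dict):
            skipped.append(line_number)
            continue
        arxiv_id = record.get("id")
        if arxiv_id:
            index[arxiv_id] = record
    return index, skipped


def sample_ids_by_year(index, year, count, seed=None):
    """Return up to `count` IDs whose update_date starts with `year`."""
    matching = [
        arxiv_id
        for arxiv_id, record in index.items()
        # A missing or null update_date is treated as an empty string.
        if (record.get("update_date") or "").startswith(year)
    ]
    rng = random.Random(seed)  # seed is optional, for repeatable samples
    return rng.sample(matching, min(count, len(matching)))
```

Passing an open file object as `lines` streams the snapshot one record at a time, and keeping the index in memory avoids reloading the parsed file for every lookup.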