Downloaders data storage organization #286

Open
carlosparadis opened this issue Mar 11, 2024 · 22 comments
@carlosparadis
Member

The issues #275 #282 #284 #285 are affected by this issue.

@Ssunoo2 @ian-lastname @anthonyjlau to centralize discussion, please use this issue to reach consensus on how you plan to handle the storage organization, file names, etc. of your own refreshers + the JIRA refresher. Once we are clear on this here, you can move the final discussion to the first comment of your respective issues.

@Ssunoo2
Collaborator

Ssunoo2 commented Mar 22, 2024

I'll start by just posting what the current storage organization is:

Jira downloader:

../../rawdata/issue_tracker/geronimo/issue_comments/
../../rawdata/issue_tracker/geronimo/issues/

Changed from

../../rawdata/issue_tracker/

I just make new directories for project name and issues or issue_comments respectively

Github Downloader:

Unchanged:

../../rawdata/github/kaiaulu/issue/
../../rawdata/github/kaiaulu/pull_request/
../../rawdata/github/kaiaulu/issue_or_pr_comment/
../../rawdata/github/kaiaulu/commit/

@anthonyjlau
Collaborator

Bugzilla Downloader

Currently, the bugzilla_showcase notebook uses 3 different methods to download data: Traditional Perceval, Perceval's REST API, and Bugzilla's REST API. The one that I will be using for the refresher is Bugzilla's REST API.

Bugzilla's REST API Downloader Storage Organization

I will be using the current storage organization used for bugzilla issues as it is the same format as the GitHub version above.

../../rawdata/bugzilla/redhat/issues
../../rawdata/bugzilla/redhat/issues_comments

@carlosparadis
Member Author

Specification change

@Ssunoo2 there is something wrong with your filepath. I remember we agreed we should include jira in the path for consistency with bugzilla. In that sense, your issue_tracker folder should be called jira instead, since both of them are issue trackers.

In addition to that, and the primary reason why I wanted to create this issue to compare side by side, is that the project organization is counter-intuitive as it is on Kaiaulu (and I believe there was even some confusion in your group early on about why the files were organized in this manner).

We should organize the information at project level, i.e.:

Bugzilla

Instead of:

../../rawdata/bugzilla/redhat/issues
../../rawdata/bugzilla/redhat/issues_comments

We would have:

../../rawdata/redhat/bugzilla/issues
../../rawdata/redhat/bugzilla/issues_comments

Jira

And instead of:

../../rawdata/jira/geronimo/issue_comments/
../../rawdata/jira/geronimo/issues/

We would have:

../../rawdata/geronimo/jira/issue_comments/
../../rawdata/geronimo/jira/issues/

Motivation

The reason is that someone running a multi-project analysis generally thinks of the data "per project" rather than "per data source". In addition, if we are discussing a particular project and I would like to reproduce your analysis, I may need to ask you to send me "the data of the project". In the current organization, you would need to check every folder to fish for the data the project has, whereas in the new organization you simply zip the folder with said project name and send it over. Lastly, in the project folder organization, you can very quickly assess what data you have by opening the project folder. As it is, you also have to go check each folder.
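A minimal sketch of the project-first layout described above, written in Python purely for illustration. The helper name `raw_path` and the constant `RAWDATA_ROOT` are hypothetical and not part of Kaiaulu; only the path shape comes from this thread.

```python
from pathlib import PurePosixPath

# Root folder used by the paths quoted in this thread.
RAWDATA_ROOT = PurePosixPath("../../rawdata")


def raw_path(project: str, source: str, endpoint: str) -> str:
    """Build a project-first path: <root>/<project>/<source>/<endpoint>."""
    return str(RAWDATA_ROOT / project / source / endpoint)


# Everything for one project lives under a single folder,
# so "the data of the project" is one directory to zip.
print(raw_path("redhat", "bugzilla", "issues"))
print(raw_path("geronimo", "jira", "issue_comments"))
```

With this shape, listing the children of `../../rawdata/<project>` is enough to see which data sources have been downloaded for that project.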

Anomaly Case 1

There are some strange cases out there, that I want to make sure you give proper consideration as you write the refresher of these downloaders. The first one is the HADOOP project. @Ssunoo2 this affects you the most since this is a JIRA project.

If you look at Hadoop on GitHub (https://github.com/apache/hadoop), particularly the commits, you will see it can have multiple JIRA IDs. You can imagine the mess it turned out to be trying to manage that in the current folder organization.

Let's assume the proposed new organization with the downloader logic you currently have for JIRA. I will focus on the issue folder since what works for issues would work for comments folder.

You would then have:

../../rawdata/hadoop/jira/issues/HDFS_...
../../rawdata/hadoop/jira/issues/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/YARN_...

All in one folder. Would your refresher function work in this case? Or would it break because it assumes all files in there are from a single issue key? If it will break, then we need to add some logic to discern based on your issue key. That being the case, notice how much saner it is that this data is contained inside the hadoop folder.

We could technically make a sub-folder for every issue key, something like:

../../rawdata/hadoop/jira/issues/hdfs/HDFS_...
../../rawdata/hadoop/jira/issues/mapreduce/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/yarn/YARN_...

However I worry this may complicate the folder hierarchy too much due to its depth.
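One way the "logic to discern based on your issue key" could look when all files share a single folder is sketched below in Python. It assumes, as in the `HDFS_...` examples above, that each file name starts with the issue key followed by an underscore; the concrete file names are hypothetical because the thread elides them.

```python
from collections import defaultdict


def group_by_issue_key(filenames):
    """Group file names by the prefix before the first underscore."""
    groups = defaultdict(list)
    for name in filenames:
        key = name.split("_", 1)[0]
        groups[key].append(name)
    return dict(groups)


# Hypothetical file names; only the KEY_ prefix convention is from the thread.
files = ["HDFS_a.json", "MAPREDUCE_b.json", "YARN_c.json", "HDFS_d.json"]
for key, names in group_by_issue_key(files).items():
    print(key, names)
```

A refresher could then look for the latest file within each group instead of across the whole folder. The sub-folder alternative avoids this grouping step at the cost of one more directory level.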

Anomaly Case 2

The other anomaly case is the Spring Framework. You can read about it here: https://spring.io/blog/2019/01/15/spring-framework-s-migration-from-jira-to-github-issues

Here's Spring GitHub: https://github.com/spring-projects/spring-framework/commits/main/

Basically, Spring used to have JIRA, and moved on to managing issues on GitHub (e.g. https://github.com/spring-projects/spring-framework/issues/16906). Now, what would this migration look like using your downloaders and this new proposed file organization? This one is likely harmless. It would be:

../../rawdata/spring/jira/issues/SPR_...
../../rawdata/spring/github/issues/spring-projects_spring-framework_...

Please give some thought to the above in one of your internal meetings. This is why I created a separate issue, as it affects all of you. I'd also recommend you (@Ssunoo2) edit your post with how GitHub saves, and that @ian-lastname make a post here on how the mailing list downloader saves. You want to have them all side by side to make sure the organization is consistent.

@ian-lastname
Collaborator

Mbox uses the helix.yml config. Going by how the Jira save file path is now done, I'll make the storage organization for mbox as follows:
../../rawdata/helix/mbox
There aren't separate kinds of mail to look out for, so there is no need to create separate folders.

@carlosparadis
Member Author

@ian-lastname your mbox architecture will likely be a bit more complex than that. I want you to take a look at OpenSSL as a reference point: https://www.openssl.org/community/mailinglists.html

As you can see, OpenSSL (and in general any open source project) generally has multiple mailing lists: one for users, another for developers, and so on. In addition to that, a single mailing list may have multiple archives. See for example:

openssl-dev archives 	

https://marc.info/?l=openssl-dev
https://www.mail-archive.com/openssl-dev@openssl.org/
https://groups.google.com/groups?group=mailing.openssl.dev

Has 3 archives. Now you may wonder why someone would download data from 3 archives for the same mailing list. This is because sometimes the archives cover different periods of a mailing list's existence. E.g. Google Groups could be from 2009-2013, MARC from 2008-2016, and the third archive some overlap of both.

Your folder organization has to accommodate this. I'd argue your situation is a bit similar to the case of HADOOP, which has multiple JIRA issue keys in a single project. So please discuss this with your group too and afterwards edit your proposal to show what OpenSSL would look like as a folder organization.

@anthonyjlau
Collaborator

../../rawdata/helix/mbox/openssl-dev/marc
../../rawdata/helix/mbox/openssl-dev/googlegroups
../../rawdata/helix/mbox/openssl-users/googlegroups

@anthonyjlau
Collaborator

As we discussed on call, for projects that have multiple project keys (Anomaly Case 1), we will be using this format to organize the folder structure:

../../rawdata/hadoop/jira/issues/hdfs/HDFS_...
../../rawdata/hadoop/jira/issues/mapreduce/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/yarn/YARN_...

We are using this structure because we don't have to make edits to our existing functions that look for files.

For Anomaly Case 2, we decided that we do not need to worry about it because it should not affect the current structure.

For the Mbox folder structure, we will use this structure:

../../rawdata/helix/mbox/openssl-dev/marc
../../rawdata/helix/mbox/openssl-dev/googlegroups
../../rawdata/helix/mbox/openssl-users/googlegroups

This structure separates each list and further separates the archive in each list.
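As a rough illustration of the list-then-archive nesting, the following Python sketch expands a mapping of mailing lists to archives into folder paths. The function name `mbox_dirs` and its arguments are hypothetical; the resulting paths mirror the three listed above.

```python
def mbox_dirs(project, archives_by_list, root="../../rawdata"):
    """Return one folder per (mailing list, archive) pair."""
    return [
        f"{root}/{project}/mbox/{mailing_list}/{archive}"
        for mailing_list, archives in archives_by_list.items()
        for archive in archives
    ]


# Lists and archives taken from the example paths in this comment.
layout = {
    "openssl-dev": ["marc", "googlegroups"],
    "openssl-users": ["googlegroups"],
}
for path in mbox_dirs("helix", layout):
    print(path)
```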

@anthonyjlau
Collaborator

Here is my suggested change in the config file format.

Multiple project keys

# example for cases that have multiple project keys
issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://issues.apache.org/jira
    project_key:
    - hdfs
    - mapreduce
    - yarn
    
    # Download using `download_jira_data.Rmd`
    issues:
      - ../../rawdata/hadoop/jira/issues/hdfs
      - ../../rawdata/hadoop/jira/issues/mapreduce
      - ../../rawdata/hadoop/jira/issues/yarn
    issue_comments: 
      - ../../rawdata/hadoop/jira/issues_comments/hdfs
      - ../../rawdata/hadoop/jira/issues_comments/mapreduce
      - ../../rawdata/hadoop/jira/issues_comments/yarn

Mbox changes

mailing_list:
  # Where is the mbox located locally?
  mbox:
    - ../../rawdata/helix/mbox/openssl-dev/marc
    - ../../rawdata/helix/mbox/openssl-dev/googlegroups
    - ../../rawdata/helix/mbox/openssl-users/googlegroups

@carlosparadis
Member Author

@anthonyjlau @ian-lastname

Mail

Concerning the mbox, there is more than just the paths that needs to be changed. This is the full extent of the mailing list information:

kaiaulu/conf/apr.yml

Lines 47 to 54 in 2bc8d14

mailing_list:
  # Where is the mbox located locally?
  mbox: ../../rawdata/mbox/apr-dev_2012_2019.mbox
  # What is the domain of the chosen mailing list archive?
  domain: http://mail-archives.apache.org/mod_mbox
  # Which lists of the domain will be used?
  list_key:
    - apr-dev

Contrast to openssl:

kaiaulu/conf/openssl.yml

Lines 47 to 55 in 2bc8d14

mailing_list:
  # Where is the mbox located locally?
  #mbox: ../../rawdata/mbox/openssl_dev_mbox # 2004-2008 fields are complete
  mbox: ../../rawdata/mbox/openssl-dev.mbx # 2002-2019 gmail field is redacted due to google groups
  # What is the domain of the chosen mailing list archive?
  #domain: http://mail-archives.apache.org/mod_mbox
  # Which lists of the domain will be used?
  #list_key:
  #  - apr-dev

Minimally, you may have an mbox file that you acquired from another project. But alternatively you may need to use one of Kaiaulu's downloaders to get the data. Check what the Kaiaulu functions need in order to execute (that's @ian-lastname's current task, to modify it into a refresher), and try to update the specification above before proceeding.

Issue Tracker

In the off chance the project migrated the domain of their JIRA issue tracker, your config file proposal will break, since it assumes one domain for all the issue keys. Another concern I have is that if you mimic the enumeration you have done on project_key, issues, and issue_comments, there is an implicit assumption of order across them. Could you propose a different template where, under jira, the user specifies (domain, project_key, issues, issue_comments) per issue key? This would make the information of each group more explicit. Do this in a separate comment, so we can consider the pros and cons of what you have vs. what the other would look like side by side.
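The ordering concern can be made concrete with a small Python sketch that converts the parallel lists into one record per key and refuses to proceed when the lists disagree in length. The function name and the `project_key_<n>` labels are hypothetical choices for illustration, not an agreed format.

```python
def to_per_key_records(domain, project_keys, issues, issue_comments):
    """Turn parallel lists into one explicit record per project key."""
    if not (len(project_keys) == len(issues) == len(issue_comments)):
        raise ValueError(
            "project_key, issues and issue_comments must have equal length"
        )
    records = {}
    triples = zip(project_keys, issues, issue_comments)
    for i, (key, issue_dir, comment_dir) in enumerate(triples, start=1):
        records[f"project_key_{i}"] = {
            "domain": domain,  # could differ per key after a migration
            "project_key": key,
            "issues": issue_dir,
            "issue_comments": comment_dir,
        }
    return records


records = to_per_key_records(
    "https://issues.apache.org/jira",
    ["hdfs", "mapreduce", "yarn"],
    [f"../../rawdata/hadoop/jira/issues/{k}" for k in ("hdfs", "mapreduce", "yarn")],
    [f"../../rawdata/hadoop/jira/issues_comments/{k}" for k in ("hdfs", "mapreduce", "yarn")],
)
print(records["project_key_2"])
```

Note that the length check only catches missing entries. Two entries listed in the wrong order would still pair silently, which is the argument for writing the per-key grouping directly in the config.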

@carlosparadis
Member Author

@Ssunoo2 You will face the same consideration for your GitHub config file:

kaiaulu/conf/openssl.yml

Lines 65 to 70 in 2bc8d14

#github:
# Obtained from the project's GitHub URL
#owner: apache
#repo: apr
# Download using `download_github_comments.Rmd`
#replies: ../../rawdata/github/apr/

The anomaly case you are most likely to experience on GitHub would be project issues scattered across different GitHub projects. I have not encountered that yet, but I would not be surprised if it existed. Regardless, the solution would mimic what is decided for the JIRA config file.

@anthonyjlau
Collaborator

anthonyjlau commented Apr 6, 2024

Here is the updated version of the jira data storage:

issue_tracker:

  # each field in Jira will be a project key
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/hdfs
      project_key: HDFS
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/hdfs
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/hdfs

    project_key_2:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/mapreduce
      project_key: MAPREDUCE
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/mapreduce
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/mapreduce

    project_key_3:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/yarn
      project_key: YARN
      # local folder path
      issues: ../../rawdata/hadoop/jira/issues/yarn
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/yarn

@anthonyjlau
Collaborator

anthonyjlau commented Apr 6, 2024

Not sure if I followed correctly but the mailing list config part should look something like this:

Carlos Edit: I modified the config below.

mailing_list:
  mod_mbox: 
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/

@carlosparadis
Member Author

@anthonyjlau @ian-lastname

I modified the config above so it tries to stay consistent with the folder depth of the other downloaders and accounts for the information needed by the functions. I also changed project_key_1 to mail_key_1 since they are all from the same project; each is just a mailing list that serves a different purpose.

@ian-lastname try to work with this and post here if for some reason it doesn't work with the functions you are using to refresh.

@ian-lastname
Collaborator

ian-lastname commented Apr 8, 2024

mailing_list:
  mod_mbox:
    domain: http://mail-archives.apache.org/mod_mbox/geronimo-user
    mail_key_1:
      key: geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      key: geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/

So, I modified the mod_mbox config. The reason why I changed it to this is because the downloader function was already made to put together the full url for the download using a base domain (domain) and a mailing list (key). Plus, with the way I changed it, I can easily attain the name of the mailing list so that I can put it into the file name of the downloaded mbox file.

Also, I don't think there is a notebook on the pipermail download function. Correct me if I'm wrong please.

@carlosparadis
Member Author

@ian-lastname "because the function already does it" is not a good rationale: I modified the config so both pipermail and mod_mbox are consistent in the way the user provides the information. It is also clearer for someone to see a URL that they can paste into the browser than to figure out what a key is. Your config also seems to be duplicating the key in the domain URL.

The other point of concern is domain. I am not sure if there will be a case where a project's mailing lists end up in two domains for mod_mbox. So it is better to keep it flexible per project_key so we do not have to modify it in the future.

Unless you made any other change, stick to #286 (comment).

You can modify to be a url in this line:

full_month_url <- stringi::stri_c(base_url, mailing_list, destination[[counter]], sep = "/")

Just replace base_url and mailing_list with a url parameter you take as input to the function.
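The suggested change is to an R function, but the idea can be sketched in Python for illustration: take one archive URL as input instead of joining a base domain and a list key. The function name `month_url` and the month file name `201801.mbox` are assumptions made for this example, not values confirmed by the thread.

```python
def month_url(archive_url: str, month_file: str) -> str:
    """Join a full archive URL and a month file with exactly one slash."""
    return archive_url.rstrip("/") + "/" + month_file


# archive_url as it appears in the config; the month file is hypothetical.
print(month_url("http://mail-archives.apache.org/mod_mbox/geronimo-dev", "201801.mbox"))
```

Stripping a trailing slash first means the config can hold the URL with or without one, which is how a user would likely paste it from the browser.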

@ian-lastname wrote: "Also, I don't think there is a notebook on the pipermail download function. Correct me if I'm wrong please."

Seems not. Please add it to:

https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_mod_mbox.Rmd

When you are done with the changes!

@carlosparadis
Member Author

As far as the key is concerned: Before you worry about that in mod_mbox, try to find an example on pipermail and run the function.

https://mail.python.org/pipermail/mailman-users/

I believe Python can be used as an example. In fact, that's where the pipermail code originated in 2021:

https://mail.python.org/pipermail/mailman-users/2012-October/074208.html

Let me know how running this goes. Note you will need to modify the pipermail function to also allow control of the from_year and to_year parameters. Make sure to find a few other pipermail mailing lists you can try the function on.

See #92 for context.
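A minimal sketch of what bounding a pipermail download by from_year and to_year could involve, in Python for illustration. It assumes month folders are named `<year>-<MonthName>`, as in the `2012-October` segment of the URL above; the function name is hypothetical and this is not the Kaiaulu implementation.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def pipermail_months(from_year: int, to_year: int):
    """List '<year>-<MonthName>' labels for an inclusive year range."""
    if from_year > to_year:
        raise ValueError("from_year must not be after to_year")
    return [
        f"{year}-{month}"
        for year in range(from_year, to_year + 1)
        for month in MONTH_NAMES
    ]


months = pipermail_months(2012, 2013)
print(months[0], months[-1], len(months))
```

A downloader would still need to skip months that the archive does not actually contain.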

@Ssunoo2
Collaborator

Ssunoo2 commented Apr 9, 2024

Here is the format for the jira and github config files:

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://github.com/sailuh/kaiaulu
      project_key: KAIAULU
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/geronimo/jira/issues/
      issue_comments: ../../rawdata/geronimo/jira/issue_comments/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/
      issue: ../../rawdata/kaiaulu/github/issue/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/
      commit: ../../rawdata/kaiaulu/github/commit/

Please feel free to comment on anything that is formatted incorrectly

@carlosparadis
Member Author

@Ssunoo2

Just post a new comment below with the corrected version instead of editing your existing one so it is not confusing to follow-up later:

The domain information for Kaiaulu's JIRA is wrong:

issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://sailuh.atlassian.net
    project_key: SAILUH

This should be it instead. Try your downloader against it to see if it works. Note Kaiaulu's domain is different from the other JIRAs, which use Apache's.

Also, did you modify the existing end points in GitHub (commit, pr, etc) so they are folders and can refresh? I don't remember.

Could you add another project to github for Kaiaulu, including your fork information, to see what it looks like?

Also I think the endpoints on your config do not agree with what Anthony put here: #286 (comment)

There should be another folder at the end of the endpoints. For JIRA that is named after the JIRA project key. For GitHub, the equivalent is the owner_repo combination. So in Kaiaulu config you would have:

issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu

for the main repo,

but if I was also downloading and tracking a fork, then that would be:

issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu
issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu
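The owner_repo suffix can be sketched as follows, in Python for illustration. The function name `github_endpoint_path` is hypothetical; the path shape follows the examples in this comment.

```python
def github_endpoint_path(project, endpoint, owner, repo, root="../../rawdata"):
    """Build <root>/<project>/github/<endpoint>/<owner>_<repo>."""
    return f"{root}/{project}/github/{endpoint}/{owner}_{repo}"


# The main repo and a fork land in sibling folders under the same endpoint.
print(github_endpoint_path("kaiaulu", "issue_or_pr_comment", "sailuh", "kaiaulu"))
print(github_endpoint_path("kaiaulu", "issue_or_pr_comment", "ssunoo2", "kaiaulu"))
```

Because the owner is part of the folder name, a fork that shares the repo name with the main repository cannot overwrite its data.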

You can include your fork as an example of project_key_2 here so we can discuss, but don't include in your actual commit since we do not need to download anything from there. So we have a realistic example, please create a codeface.conf

And edit it so it includes, on project_key_1: https://github.com/siemens/codeface

And on project_key_2 Nicole's fork: https://github.com/lfd/codeface/tree/nicole-updates

Note on the Codeface config file, under the branch region:

kaiaulu/conf/kaiaulu.yml

Lines 43 to 44 in 2bc8d14

branch:
- master

You will include an additional line below master called - nicole-updates

@Ssunoo2
Collaborator

Ssunoo2 commented Apr 9, 2024

Is this looking right?

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh
    # project_key_2:
      # Obtained from the project's JIRA URL
      # domain: https://sailuh.atlassian.net
      # project_key: ssunoo2
      # Download using `download_jira_data.Rmd`
      # issues: ../../rawdata/kaiaulu/jira/issues/ssunoo2
      # issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/ssunoo2
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
      # # Obtained from the project's GitHub URL
      # owner: sailuh
      # repo: kaiaulu
      # # Download using `download_github_comments.Rmd`
      # issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      # issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      # pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      # commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/

For JIRA, I appended project_key to the end of the file path. For Github, I appended owner_repo to the end of the file path. I'll work on testing and make the codeface config file. Regarding the refresh for the pull requests and commits, I had originally thought I was supposed to but you corrected me and specified issues and comments only during week 11

@carlosparadis
Member Author

No. There is no ssunoo2 project key in Kaiaulu JIRA. We should not include fictitious examples even if commented on the config file. It will confuse users. Remove project_key_2 from the jira portion.

project_key_2 on GitHub is also wrong... the fork is not owned by sailuh; rather, the owner is ssunoo2 and the repo is kaiaulu. I am a bit worried the config file may not be making any sense to you at this point. Should we go over this briefly on call if it helps?

@Ssunoo2
Collaborator

Ssunoo2 commented Apr 16, 2024

Here is the updated config format for the issue_trackers:

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh/
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/sailuh_kaiaulu/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
      # # Obtained from the project's GitHub URL
      # owner: ssunoo2
      # repo: kaiaulu
      # # Download using `download_github_comments.Rmd`
      # issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu/
      # issue: ../../rawdata/kaiaulu/github/issue/ssunoo2_kaiaulu/
      # refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/ssunoo2_kaiaulu/
      # pull_request: ../../rawdata/kaiaulu/github/pull_request/ssunoo2_kaiaulu/
      # commit: ../../rawdata/kaiaulu/github/commit/ssunoo2_kaiaulu/

Note that a new folder 'refresh_issues' is created as a result of #282
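Since every endpoint path is expected to follow the same `../../rawdata/<project>/<source>/...` shape, a small check can flag entries that drift from it. The Python sketch below is illustrative only: `misplaced_paths` is a hypothetical helper, and the two faulty entries are deliberately malformed examples.

```python
def misplaced_paths(paths, project, source, root="../../rawdata"):
    """Return the entries that do not start with <root>/<project>/<source>/."""
    expected_prefix = f"{root}/{project}/{source}/"
    return {
        name: path
        for name, path in paths.items()
        if not path.startswith(expected_prefix)
    }


github_paths = {
    "issue": "../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/",
    "commit": "../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/",
    # Deliberately malformed: missing `rawdata` segment, then a missing slash.
    "pull_request": "../../kaiaulu/github/pull_request/sailuh_kaiaulu/",
    "refresh_issues": "../..rawdata/kaiaulu/github/refresh_issues/sailuh_kaiaulu/",
}
print(misplaced_paths(github_paths, "kaiaulu", "github"))
```

Running such a check over each conf file before committing would catch path typos that are easy to miss by eye.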

@carlosparadis
Member Author

@beydlern @crepesAlot

For discussion regarding the specification of the configuration file, let's use this issue. For discussion of conf.R, let's use #230. It suffices, however, for your task specification to just live on #230.

The specification of the config file is indeed on this issue. It is extremely long at this point, but I'd like you both to skim through it and find the comments that answer "why do we have to make the specification this way?" Examples of relevant comments:

#286 (comment)

see also "anomaly case 1" and "anomaly case 2" sections in: #286 (comment)

Mbox

We went over an example of why mailing list specification has to convey multiple mail archives. @daomcgill I believe this is the information you need to know for your function input for mod_mbox and pipermail:

#286 (comment)

Contrast how much more comprehensive and realistic, relative to what we spoke about today, this is versus the existing one:

kaiaulu/conf/openssl.yml

Lines 47 to 55 in c781106

mailing_list:
  # Where is the mbox located locally?
  #mbox: ../../rawdata/mbox/openssl_dev_mbox # 2004-2008 fields are complete
  mbox: ../../rawdata/mbox/openssl-dev.mbx # 2002-2019 gmail field is redacted due to google groups
  # What is the domain of the chosen mailing list archive?
  #domain: http://mail-archives.apache.org/mod_mbox
  # Which lists of the domain will be used?
  #list_key:
  #  - apr-dev

Other Downloaders

The most current specification for the other downloaders, which are jira, bugzilla and github, is on this comment: #286 (comment)

With that being said, there are more than just downloaders with file paths being specified. For example, you also need to tell Kaiaulu where your .git files are. Your exercise is therefore as follows:

  1. Browse some of Kaiaulu's conf files to understand their structure: https://github.com/sailuh/kaiaulu/tree/master/conf
  2. Refer to both comments above to build upon Spring'24 group efforts to come up with a new one, and post a comment with an updated version of the existing conf file + the two comments above replacing the old format in that particular section.
  3. Look at the bottom left of the refresher cheatsheet to be reminded of the philosophy of the file organization: https://github.com/sailuh/kaiaulu_cheatsheet/blob/main/cheatsheets/refresher-cheatsheet.pdf
  4. Check all notebook files in kaiaulu (/vignettes/) that do not have an _ in front of them (those with an _ do not appear in the docs, so we can disregard them) to ensure all paths are accounted for in the config specification.
  5. Note you need to understand the config file specification if you are to write get functions to access it! So post a new config specification here first before jumping into coding get functions.
  6. Once we are in agreement with the specification, then you can PR the new specification against all conf files (if you skip us being on the same page, you risk wasting time having to edit all confs again)
  7. Once you get there, then the conf.R can be worked on in parallel to updating the notebooks.
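Step 4 above can be partly automated. The Python sketch below lists the notebooks that do not start with an underscore; the function name is hypothetical, and the demo runs against a temporary folder with made-up file names rather than the real vignettes directory.

```python
import tempfile
from pathlib import Path


def documented_notebooks(vignettes_dir):
    """Return sorted .Rmd file names that do not start with an underscore."""
    return sorted(
        p.name
        for p in Path(vignettes_dir).glob("*.Rmd")
        if not p.name.startswith("_")
    )


# Demo on a throwaway folder; point the function at vignettes/ in practice.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("download_mod_mbox.Rmd", "_draft_notebook.Rmd", "notes.txt"):
        (Path(demo_dir) / name).touch()
    print(documented_notebooks(demo_dir))
```

Each notebook returned would then be searched for config fields to confirm that the specification covers every path it uses.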

I suggest this workflow be pasted into your task specification on #230 as a checklist, with the small difference that you should indicate who is working on what.

Remember: The specification I need to see here is not just combining the two comments above, but one that includes all information available across all existing confs, with the parts the two comments above refer to being updated. Not every conf contains the full specification, so we need to derive the master specification after looking at all of them (this will in turn be future documentation on the project wiki too).

I hope by the time this milestone is over, you will have a better understanding of all types of data Kaiaulu can interface with, and have a better appreciation of using these configuration files to document information about a project, so all one needs to do to re-analyze a project is share a config file to understand assumptions, and to re-run an analysis.

One last time: For specification questions, follow up here, for questions on conf.R, do so on #230 : )
