Gremlin crowdsource #121

Open

wants to merge 12 commits into base: master

Conversation

@sara-02 (Contributor) commented Nov 13, 2017

@sara-02 changed the title from "Gremlin crowdsource" to "[WIP]: Gremlin crowdsource" on Nov 13, 2017
@sara-02 (Contributor, Author) commented Nov 13, 2017

For user story openshiftio/openshift.io#1286

@sara-02 requested a review from pkajaba on November 28, 2017 16:48
@centos-ci (Collaborator)

@sara-02 Your image is available in the registry: docker pull registry.devshift.net/bayesian/kronos:SNAPSHOT-PR-121

1 similar comment

@sara-02 changed the title from "[WIP]: Gremlin crowdsource" to "Gremlin crowdsource" on Nov 28, 2017
@miteshvp (Contributor) left a comment

Some minor nitpick, but LGTM otherwise

input_package_topic_data_store,
output_package_topic_data_store,
additional_path)
untagged_pakcage_data = TagListPruner.clean_file(package_file_name,
Contributor:

Typo. You may want to fix this and all subsequent occurrences.

result_package_topic_json = []
untagged_pakcage_data = {}
Contributor:

typo again

# TODO: use singleton object, with updated package_topic_list
if ecosystem in untagged_pakcage_data.keys():
    current_untagged_set = set(untagged_pakcage_data[ecosystem])
    new_untagged_set = current_untagged_list.union(
Contributor:

Did you intend to use current_untagged_set here instead of the list?
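
For reference, a minimal sketch of the fix the reviewer is pointing at: the union should be taken on the set that was just built, not the undefined list. The incoming new_packages name is a hypothetical stand-in for illustration only.

if ecosystem in untagged_pakcage_data:
    current_untagged_set = set(untagged_pakcage_data[ecosystem])
    # new_packages is hypothetical here; the point is to call .union() on the set above
    new_untagged_set = current_untagged_set.union(new_packages)
    untagged_pakcage_data[ecosystem] = list(new_untagged_set)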

@centos-ci (Collaborator)

@sara-02 Your image is available in the registry: docker pull registry.devshift.net/bayesian/kronos:SNAPSHOT-PR-121

@miteshvp (Contributor)

LGTM. Will wait for @pkajaba approval

@centos-ci (Collaborator)

@sara-02 Your image is available in the registry: docker pull registry.devshift.net/bayesian/kronos:SNAPSHOT-PR-121

1 similar comment

@pkajaba (Contributor) commented Dec 4, 2017

@sara-02 Please rebase instead of a merge commit.

@@ -12,6 +12,8 @@ AWS_BUCKET_NAME = os.environ.get("AWS_BUCKET_NAME","dev-stack-analysis-clean-dat
KRONOS_SCORING_REGION = os.environ.get("KRONOS_SCORING_REGION", "")
KRONOS_MODEL_PATH = os.environ.get("KRONOS_MODEL_PATH", KRONOS_SCORING_REGION + "/github/")
DEPLOYMENT_PREFIX = os.environ.get("DEPLOYMENT_PREFIX", "")

GREMLIN_REST_URL = "http://{host}:{port}".format(
Contributor:

I have one question, not strictly related to this PR: why do you have a template instead of just having config.py?

Contributor Author:

So that the credentials don't get committed by mistake if someone changes their config.py. config.py is in .gitignore.

Contributor:

Oh, I can see your point now, but you are sourcing all those values from environment variables.

Contributor Author:

Not always; sometimes I write them directly in the config. It is easier that way, as they don't change over the testing period.

@pkajaba (Contributor) Dec 6, 2017

I would go with an approach where every developer has a script in which those secrets are stored. This script would not be in the repo (it might be in .gitignore). It would basically export the stored secrets:

#!/bin/bash

export TOP_SECRET1="foo_bar"
export TOP_SECRET2="foo_bar"
# ...
export TOP_SECRET_N="foo_bar"
./run_actual_code.py

It's another extra script, but I find it clearer than copying configs every time.

Contributor:

@sara-02 Can you elaborate more?

Contributor:

through environment variables.

Contributor Author:

OK, will add a PR for that separately.

"""Generate the clean aggregated package_topic list as required by Gnosis.

:param input_package_topic_data_store: The Data store to pick the package_topic files from.
:param output_package_topic_data_store: The Data store to save the clean package_topic to.
:param additional_path: The directory to pick the package_topic files from."""

if mode == "test":
Contributor:

Why do you need this mode in the first place? You should rename it to data_path, and then you don't need this if.

Contributor Author:

Because the value of data_path is different when running the test cases; that is why mode is needed.

Contributor:

And my point here is that you don't need to solve this through conditions. You can set this value once. This will work because you are not running tests and real code in the same instance, are you?

Contributor Author:

So basically pass the APOLLO_PATH instead of the mode value?

Contributor:

Pretty much, or you can have it as a class variable, for example.

Contributor Author:

@pkajaba Ack, thanks. Have updated using a temp_path variable instead of the mode setting.
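
As an illustration of the shape this change takes (the constructor signature and the default path below are assumptions, not the PR's actual code), the path is simply passed in, so no mode flag or if-branch is needed:

class TagListPruner:
    def __init__(self, input_store, output_store, temp_path="/tmp/apollo"):
        # Callers supply the path; tests just pass their own test directory.
        self.input_store = input_store
        self.output_store = output_store
        self.temp_path = temp_path

# In tests (hypothetical path):
pruner = TagListPruner(input_store=None, output_store=None, temp_path="tests/data/apollo_temp")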

@centos-ci (Collaborator)

@sara-02 Your image is available in the registry: docker pull registry.devshift.net/bayesian/kronos:SNAPSHOT-PR-121

1 similar comment

def __init__(self, src_dir):
self.src_dir = src_dir
# ensure path ends with a forward slash
self.src_dir = self.src_dir if self.src_dir.endswith("/") else self.src_dir + "/"
self.src_dir = self.src_dir if self.src_dir.endswith(
Contributor:

You don't need this if you use os.path.join everywhere src_dir is used.

Contributor Author:

Currently, removing this will cause other test cases to fail; it requires a module-wide fix to use os.path.join. I will create an issue and work on a PR to fix this for all files.

Contributor:

I already fixed it for all files that were present in the source code when I did my Python 3 changes.

Contributor:

The "apollo" module should be the only problem, ergo not too hard to fix.
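
For reference, with os.path.join the trailing-slash normalisation in __init__ becomes unnecessary; a generic sketch, not the repo's actual code:

import os

def full_path(src_dir, file_name):
    # os.path.join inserts the separator itself, whether or not src_dir ends with "/"
    return os.path.join(src_dir, file_name)

full_path("/tmp/apollo", "packages.json")   # '/tmp/apollo/packages.json'
full_path("/tmp/apollo/", "packages.json")  # '/tmp/apollo/packages.json'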

def test_gremlin_updater_generate_payload(self):
    expected_pay_load = {
        'gremlin':
            "g.V().has('ecosystem', 'ruby')." +
Contributor:

You don't need the + sign here if you use Python's implicit string literal concatenation.
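
That is, adjacent string literals are joined by the parser at compile time, so the query can span lines without +. The continuation below is illustrative, not the PR's actual query:

expected_pay_load = {
    'gremlin': "g.V().has('ecosystem', 'ruby')."
               "count()"   # adjacent literals are concatenated automatically; "count()" is illustrative
}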

@pkajaba (Contributor) commented Dec 12, 2017

@sara-02 would you kindly rebase? :-)

@pkajaba (Contributor) commented Dec 12, 2017

@sara-02 It's looking good to me, but I would appreciate some unit tests for the new functions.

Have added them here. Please leave your review comments on this file.

@centos-ci (Collaborator)

@sara-02 Your image is available in the registry: docker pull registry.devshift.net/bayesian/kronos:SNAPSHOT-PR-121

@sara-02 (Contributor, Author) commented Dec 13, 2017

@pkajaba PTAL again.

@pkajaba (Contributor) left a comment

@sara-02 would you kindly take a look?

file_list = local_data_obj.list_files()
for file_name in file_list:
    data = local_data_obj.read_json_file(file_name)
    # TODO: use a singleton object with updated datafile.
Contributor:

Can you implement this as a singleton? Anyway, I don't really like that you are initializing an instance of the object inside the same class as the initialized object.

Is it some design pattern, or what is the reason behind it?

Contributor Author:

Ack
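
One common way to honour the TODO is to cache a single instance on the class; this is a generic sketch (the class and attribute names are hypothetical), not the PR's implementation:

class PackageTopicStore:
    _instance = None

    @classmethod
    def instance(cls, *args, **kwargs):
        # Create the object on first use and return the same one afterwards.
        if cls._instance is None:
            cls._instance = cls(*args, **kwargs)
        return cls._instance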

for each_ecosystem in self.untagged_data:
    package_list = self.untagged_data[each_ecosystem]
    pck_len = len(package_list)
    if pck_len == 0:
Contributor:

I would delete this if, because it is not required in the context of this method.

Contributor Author:

Ack.

        # If pck_len =0 then, no package of that ecosystem requires
        # tags. Hence, do nothing.
        continue
    for index in range(0, pck_len, 100):
Contributor:

Is 100 really the value you want to have here? It means that just every 100th element in the range will be used.

Contributor Author:

Yes, this can be passed as a parameter, but that is what is intended, as we want to break the list into chunks of 100 packages each.

Contributor:

ack
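
For context, the stepped range combined with slicing walks the list in batches of 100 rather than sampling every 100th element; a toy example:

package_list = ["pkg-{}".format(i) for i in range(250)]   # toy data
for index in range(0, len(package_list), 100):
    sub_package_list = package_list[index:index + 100]
    print(len(sub_package_list))                          # prints 100, 100, 50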

        continue
    for index in range(0, pck_len, 100):
        sub_package_list = package_list[index:index + 100]
        pay_load = self.generate_payload(
Contributor:

Why do you have an underscore here?

Contributor Author:

I don't see any dangling underscore :/

Contributor:

"payload" (without the underscore) is the correct word, right?

Contributor Author:

Ack

            each_ecosystem, sub_package_list)
        self.execute_gremlin_dsl(pay_load)

    def generate_payload(self, ecosystem, package_list):
Contributor:

This method can be a classmethod/staticmethod.

Contributor Author:

Ack
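
In other words, since the method depends only on its arguments, it can be declared static. A rough sketch (the class name is taken from the tests above; the query shape is illustrative, not the exact one built in the PR):

class GraphUpdater:
    @staticmethod
    def generate_payload(ecosystem, package_list):
        # No instance state is needed; everything comes from the arguments.
        query = "g.V().has('ecosystem', '{}')".format(ecosystem)   # illustrative query shape
        return {'gremlin': query, 'bindings': {'packages': package_list}}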

        self.untagged_data = untagged_data

    @classmethod
    def generate_and_update_packages(cls, apollo_temp_path):
Contributor:

You have some tests but you are not testing this method. Any specific reason?

@sara-02 (Contributor, Author) Dec 15, 2017

@pkajaba This method talks to the graph, so every time we test it we need a working instance of gremlin-http up and running.

Contributor:

So I would advise mocking the HTTP response here to emulate the behavior of the graph DB.
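
For example, assuming update_graph() posts its query via the requests library (an assumption for this sketch), the HTTP layer can be patched in a unit test so no live gremlin-http instance is needed:

from unittest import TestCase, mock

class TestGraphUpdaterMocked(TestCase):
    @mock.patch('requests.post')   # assumption: the updater posts via requests
    def test_update_graph(self, mock_post):
        mock_post.return_value.json.return_value = {'result': {'data': []}}   # canned Gremlin reply
        # GraphUpdater(...).update_graph() would now talk to the mock instead of a real
        # graph DB; assertions can then be made on mock_post.call_args.
        self.assertFalse(mock_post.called)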

            graph_obj.update_graph()
            local_data_obj.remove_json_file(file_name)

    def update_graph(self):
Contributor:

The same comment about testing applies here.

Contributor Author:

Same reason.

'str_packages': ['service_identity']}}

unknown_data_obj = LocalFileSystem(APOLLO_TEMP_TEST_DATA)
self.assertTrue(unknown_data_obj is not None)
Contributor:

assert unknown_data_obj would be enough here.

Contributor Author:

Ack
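
For completeness, either shorter form expresses the same check:

assert unknown_data_obj                    # plain assert, as suggested above
self.assertIsNotNone(unknown_data_obj)     # or unittest's dedicated assertion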

# IMPORTANT: TestGraphUpdater needs to run after TestTagListPruner

# Test class TestGraphUpdater(TestCase):
def test_gremlin_updater_generate_payload(self):
Contributor:

What is this method supposed to test? You don't really have to read the package list from the file system; just create a fixture of the package list and test whether the query is created correctly.

Contributor Author:

Ack, I don't need to load the list, but the generation of the list needs to be checked, so I am adding it to the previous test instead of this one.

Contributor:

Yeah, you can split it, but if the generation of the list really has to be tested, it should be extracted into a function.

Contributor:

@sara-02 ^^ :-)

Contributor Author:

I am testing the extraction here itself and then deleting the files. Otherwise it will create a dependency between tests, as we would have to make sure that the extraction always gets tested after the generation if they are two separate tests.

Contributor Author:

This function only checks the payload now.

class TestPruneAndUpdate(TestCase):

    # Test Class TagListPruner
    def test_generate_and_save_pruned_list_local(self):
Contributor:

I am struggling to see which method this test is testing. Can you elaborate?

Contributor Author:

The input list contains more than 4 tags; the prune method will generate a tag list of up to 4 tags based on frequency. So this test checks whether the desired 4 tags are generated or not.
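
For reference, the behaviour under test (keep only the four most frequent tags) can be sketched with collections.Counter; this is an illustration, not necessarily the PR's actual pruning logic:

from collections import Counter

def prune_tags(tag_list, keep=4):
    # Return the `keep` most frequent tags from the raw list.
    return [tag for tag, _ in Counter(tag_list).most_common(keep)]

prune_tags(['web', 'web', 'http', 'cli', 'json', 'web', 'http'])   # ['web', 'http', 'cli', 'json']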

Contributor:

But which method does it test in the code? I can't find a test_generate_and_save_pruned_list_local method in the repository.

Unit tests should test methods and their behavior on various inputs.

@centos-ci (Collaborator)

@sara-02 Your image is available in the registry: docker pull registry.devshift.net/bayesian/kronos:SNAPSHOT-PR-121

@sara-02 (Contributor, Author) commented Dec 15, 2017

@pkajaba PTAL, I think all major concerns have been addressed. Two things that need a separate PR: the env.sh for the repo and the use of os.path.join where it is not yet in use.
