Tests Deadliness and Higher Order Mutations #174

rasenmaeher92 · 2020-06-04T11:08:17Z

Hey boxed and mutmut contributors,
thanks for this great library! It really helps in developing tests, which test the real corner cases!

I was wondering, whether it would be possible to implement two new features, namely test deadliness and higher order mutations.

Test Deadliness

As far as I am aware, mutmut does not report the number of tests which killed a mutant. The more tests fail after applying a mutant, the higher the deadliness for said mutant is. I think this would be interesting, as this reveals which locations are really critical for the correctness of the code base. It also reveals which locations are well tested and which are not.
As an example, a surviving mutant has a test deadliness of 0, a mutant killed by a single failed test has a test deadliness of 1 and so on.
Maybe this could be normalized by the number of tests to account for small or large test bases.

Reporting an aggregate measure (average, median) or the histogram of test deadliness over all mutants reveals whether

there are too few tests (aggregate test deadliness close to 0),
there are just a few independent tests, which address individual lines (aggregate test deadliness around 1) or
there are also very sensitive tests, which test corner cases of higher level functions (aggregate test deadliness above 1).

As this should be quite easy to obtain I would propose just outputting the results when calling mutmut results.

Higher Order Mutations

Mutmut currently only applies a single mutation before running tests. Combining these mutations (two mutations -> second order, three mutations -> third order, an so on) would allow to evaluate the test robustness in a more extreme setting (edit: I noticed, that you mentioned this idea already in #113 (comment)). I would argue that this setting (for reasonable low orders) is still quite realistic, as the example below shows:

for i in range(0,5): ... the reference code
for i in range(1,5): ... a normal first order mutant, with changed iteration start and thus the iteration length
for i in range(0,6): ... a normal first order mutant, with changed iteration end and thus the iteration length
for i in range(1,6): ... a second order mutant, with changed iteration start and end, but not the iteration length

Another example would be an if A and B or (C and not D): ... statement with multiple logical expressions A, B, C and D. A single mutant will likely change the behavior, but multiple mutations could partially restore parts of the logical behavior and thus have tests wrongly pass.

Obviously this does not scale well for high orders if the number of mutants N is large, I think there are potentially N*(N-1)/2 higher order mutants, which would need to be tested. But within all potential mutant combinations there are some, which are more interesting and more likely to survive. Those combinations are most likely those, of which the independent mutant weren't able to kill (m)any tests.
Thus, building on the test deadliness proposed above, one could order or limit the higher order mutations by the test deadliness of their individual tests. Afterall two (or more) mutations, which were not killed, might also survive in conjunction.

Note, that test deadliness should also be measured for these higher order mutants, which allows us to do the following:
For example, if mutations A and B have td=0, testing AB would be interesting. On the other hand if mutations A and B have td>>0, testing AB would not be interesting.
If if mutations A ,B and C have td=0, the second order mutations AB, AC and CB can be tested. Furthermore the third order mutation ABC could be tested, and this would be determined by the test deadliness of he second order mutations AB, AC and CB.

To use this feature, I would propose a new CLI argument --max_order, which defaults to 1 and thus defaults to the current mutmut behavior. A usage like --max_order=2 would additionally test all possible pair combinations of the mutants. Low digit values of --max_order should reveal the most interesting combinations, as more and more concurrent mutations increase the probability of a test or simply a execution failure.
Something like --td_limit_higher_order could be used to provide a test deadliness limit to skip mutant combinations, where the individuals already have a high average test deadliness.

I would be happy to hear your or any of the mutmut contributors opinion on the usefulness of these two features!

The text was updated successfully, but these errors were encountered:

boxed · 2020-06-04T11:33:46Z

Deadliness

Interesting.. I have had someone suggest to use mutmut to find irrelevant tests that can be deleted before, but deadliness is a new idea to me. Deadliness isn't as easy as you might expect though. There are a few factors that make it more difficult, the primary one being that mutmut is specifically designed to not have any insight into the test runner. So fail/pass is all it knows. This is what makes it so easy to support all test runners. What you suggest would require 1) that we get detailed information on which tests failed and 2) that we don't run the test runner in fail fast mode which is slower. I would think a deadliness of > 1 would be seen as unwanted, as it either means you're running more tests than you need for each mutant.

I believe this can be implemented today by having your test runner output a test report at the end and then giving mutmut a post_mutation step where you can parse that file and do whatever you want with that data.

Higher order mutants

This is an interesting area but so far I don't know of anyone who has actually tried it successfully. Your arguments make sense a priori, but I have actually tried to implement higher order mutants and it doesn't seem to do anything useful. It makes everything slower for negligible gain.

As far as I know there are precious few code bases that have reached zero surviving mutants, which you really should be doing before beginning to think about higher order mutants. And you probably have much more to gain by property based testing in a big chunk of those too before higher order mutants become relevant. If you still need more security and robustness you should probably change programming language I think :P

All that being said.. do you have a code base where you have reached zero and still feel there is more to be done? If so, can I see it?

rasenmaeher92 · 2020-06-12T10:41:55Z

Deadliness

mutmut is specifically designed to not have any insight into the test runner. So fail/pass is all it knows. This is what makes it so easy to support all test runners.

I did not know that. Maybe the slow down from running all tests instead of fail fast could be compensated by using parallelisation of tests such as xdist for pytest. I passed it via the --runner flag (python -m pytest -n auto) and it accelerated my mutmut testing (and it solved the issues I had with timeouts).

I would think a deadliness of > 1 would be seen as unwanted, as it either means you're running more tests than you need for each mutant.

I think there are two different ways to think about (a high) deadliness.

A priori (without designing tests for high deadliness) this means, that a mutant and thus a certain code part, is critical for the correct functioning of the entire code base. This should often not come at great surprise, because one knows the really important parts of the code base. But I could imagine settings in large code bases, where this uncovers "surprisingly critical code paths".
A posteriori (after designing tests for high deadliness) this means, that instead of many tests which kill individual mutants, a few tests are written in such a way that they kill many mutants. As such the deadliness is an estimate how compact your tests are.

Higher Order Mutants

I have actually tried to implement higher order mutants and it doesn't seem to do anything useful. It makes everything slower for negligible gain.

I would totally agree with that. Up to second/third/... order mutants takes roughly double/triple/... the number of mutants to test. Maybe parallelisation of tests (like I mentioned above) and of mutants (like mentioned in another issue) could help with that though.
The idea behind my proposal was that there is a difference between the current implementation and the ground truth implementation a programmer was actually trying to achieve. Often but not always, the difference will be single mutant and I would argue, that second or thrid degree mutants would cover most of the usual differences.

As far as I know there are precious few code bases that have reached zero surviving mutants, which you really should be doing before beginning to think about higher order mutants.

I think I would argue that a surviving higher order mutant is more critical than a lower order mutant. After all this tells us, that there is similar implementation, which still satisfies all tests and has more changes than the lower order mutant applied to the code base. Or put differently, the test base is particularly insensitive to a larger set of changes.
But then you are right, addressing all first order mutants will most likely also kill most to all of the higher order mutants.

Anyways, thank you for your thoughtful response! I can see that the implementation of my ideas come with some challenges, but I just wanted to see if anybody had thought about these concepts before.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests Deadliness and Higher Order Mutations #174

Tests Deadliness and Higher Order Mutations #174

rasenmaeher92 commented Jun 4, 2020 •

edited

Loading

boxed commented Jun 4, 2020

rasenmaeher92 commented Jun 12, 2020

Tests Deadliness and Higher Order Mutations #174

Tests Deadliness and Higher Order Mutations #174

Comments

rasenmaeher92 commented Jun 4, 2020 • edited Loading

Test Deadliness

Higher Order Mutations

boxed commented Jun 4, 2020

rasenmaeher92 commented Jun 12, 2020

Deadliness

Higher Order Mutants

rasenmaeher92 commented Jun 4, 2020 •

edited

Loading