Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thread sanitizer finds real bugs in fractal tree software #396

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

prohaska7
Copy link
Contributor

The thread sanitizer finds thousands of issues (lock deadlocks, data races, etc) when running the fractal tree tests. Luckily, there are only about a hundred unique issues since a couple of issues dominate the reporting. While some of these issues were previously found with valgrind's data race detectors, the thread sanitizer identified new issues. Many of these issues can be fixed by using atomic types in place of base types. However, many of the issues have yet to be diagnosed. These are discussed in the open issues section of this document.

The fractal tree software was developed on the Intel X86 machines, so there is a bias to its memory consistency model in the code. Some of the changes to the fractal tree softare to address issues found by the thread sanitizer are probably necessary to successfully run the fractal tree software on hardware architectures that have a memory consistency model that differs from the Intel X86 architecture.

Suppressions

The thread sanitizer allows one to suppress issues that either are not of interest right now or are known to be false positives. These suppression rules are specified in a thread sanitizer suppression file. Right now, there are several suppressions necessary to allow useful analysis of the remaining fractal tree software. There are a few rules for unsafe operations that are thought to be acceptable. IMO, the use of these operations should be temporary until atomic variables are used or some other software change is made.

Fixed issues

  • Many of the fractal tree tests had data races on some test specific global variables. These races are easily fixed with atomic types in place of base types.

  • Data races in the lock tree. Atomic types used to fix races on the current lock memory variable and the lock tree reference count variable.

  • Data races in the cache table. Atomic types used to fix races on the current cache table size.

Open issues

  • Lock order inversion in the concurrent lock tree. When the concurrent lock tree gets rebalanced, the lock order from root to leaf changes. This change is annotated for helgrind, so helgrind ignores any lock order inversions. Since this issue occurs frequently, a suppression rule exists to ignore this issue.

  • Data races with the in memory row count variable. In particular, there are a couple of data races in the adjust logical row count function due to the fact that the in memory row count is updated with atomic functions but read normally. Perhaps this function could use a atomic variable instead. Since this issue occurs frequenty, a suppression rule exists to ignore this issue until it can be investigated.

  • Lock order inversion during recovery. Needs to be investigate. No suppression rule exists which makes analysis of the races difficult.

  • Data race with fractal tree flow variables. Flow variable updates are atomic, reads are not. Could be fixed with atomic variable type.

  • Data race on rollback log detected between a client thread retiring a transaction and a hot indexer thread building a new index. This looks like a serious bug.

  • Data race on update of cachetable attributes. Looks like a bug.

  • Data race on fractal tree max MSN. Updates of the max MSN are atomic, reads are not. Could be fixed with atomic variable type.

  • Data race on txn state. Need to investigate impact of this race on correctness of code.

There will inevitably be more issue unearthed by the thread sanitizer as more tests are run.

Status variables

Most of the status variables are racy since we want lightweight overhead for maintaining them and are willing to live with some incorrect results. The impact of using atomic types for all of the status variables should be investigated.

Tools

Ubuntu 17.10
Clang 4.0.1
Cmake 3.9.1

Directions

CXXFLAGS=-fsanitize=thread CC=clang CXX=clang++ cmake -D CMAKE_BUILD_TYPE=Debug ../tokuft
make -j8
TSAN_OPTIONS=”suppressions=tokuft.tsan.suppressions history_size=7 second_deadlock_stack=1” ctest -j8 -E ‘valgrind|helgrind|drd|try-’

Copyright (c) 2017, Rik Prohaska
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

…ized, otherwise the ft library leaks memory.
…ractal tree tests. Unfortunately, not all of these warnings can be immediately addressed. The suppression file allows some of these issues to be suppressed. There are two reasons for suppressing thread sanitizer issues. First, there are a couple of issues that dominate the warnings and cause other less frequent issues to be lost like a needle in a haystack. The adjustment of row counts, for example, occurs frequently. Second, some of the issues are not bugs but result from algorithm design. For example, the lock order inversion issue in the locktree's concurrent tree occur when the tree is rebalanced and the mutex order from root to leaf in the tree is changed. Since there is not way (that I know) to inform the thread sanitizer that this is to be expected, we need to suppress this issue instead.

Valgrind was used as a race detector on the fractal tree software before the thread sanitizer existed.  Its helgrind and drd tools allow one to annotate address regions in which data races should be ignored.  The fractal tree software is littered with helgrind annotations.  The thread sanitizer does not have a similar facility.  However, issues identified by the thread sanitizer can be suppressed by code location, either by function or by file.  We can use the suppression facility to suppress data race reporting on variables in particular functions.  If these functions are sufficiently small, then we can be confident that we are suppressing races on specific variables.

Many data races identified by the thread sanitizer can be fixed by replacing standard C+ variables with C++ atomic variables.  This replacement will remove the need for thread sanitizer suppressions.
@@ -498,12 +499,12 @@ class evictor {

pair_list* m_pl;
cachefile_list* m_cf_list;
int64_t m_size_current; // the sum of the sizes of the pairs in the cachetable
std::atomic_int64_t m_size_current; // the sum of the sizes of the pairs in the cachetable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes 'class evictor' not a POD and triggers static assertion via 'ENSURE_POD(evictor);' in cachetable.cc around line 3611

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, all of these atomic_type_t specializations are incorrect as they do not exist in all supported back versions of glib, must use std::atomic instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so for backward compatibility, use std::atomic in place of std::atomic_int?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, these all need to be the older style specializations std::atomic<uint32_t>, etc... I easily changed those but the bigger issue is that by adding this to evictor (and possibly other class/structs) introduces non trivial layout and construction/destruction/copy semantics, which then causes failure of std::is_pod and triggers the static_assertion in ENSURE_POD. I do not recall offhand what the need for evictor and others to be POD was and what it would take to eliminate it.

Alternately, since the evictor is really just a singleton, a hack could be to just to move this to global scope, but that IMHO is going the wrong direction in the greater scheme to properly "C++-ize" the FT library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants