TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions

Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual perception and linguistic interpretation. Despite their impressive capabilities across various tasks, LVLMs still suffer from hallucination, i.e., generating content that is incorrect or unfaithful to the visual or textual inputs. Traditional benchmarks, such as MME and POPE, evaluate hallucination in LVLMs within the scope of visual question answering (VQA) using answerable questions. However, some questions are unanswerable due to insufficient information in the images, and the performance of LVLMs on such unanswerable questions remains underexplored. To fill this research gap, we propose TUBench, a benchmark specifically designed to evaluate the reliability of LVLMs using unanswerable questions. TUBench comprises an extensive collection of high-quality, unanswerable questions that are meticulously crafted using ten distinct strategies. To thoroughly evaluate LVLMs, the unanswerable questions in TUBench use images from four diverse domains as visual contexts: screenshots of code snippets, natural images, geometry diagrams, and screenshots of statistical tables. These unanswerable questions are tailored to test LVLMs' trustworthiness in code reasoning, commonsense reasoning, geometric reasoning, and mathematical reasoning related to tables, respectively. We conducted a comprehensive quantitative evaluation of 28 leading foundation models on TUBench: in determining whether questions are answerable, Gemini-1.5-Pro, the top-performing model, achieves an average accuracy of 69.2%, and GPT-4o, the third-ranked model, reaches 66.7%. Furthermore, our manual analysis of the model outputs reveals that (1) Gemini-1.5-Pro provides both correct answers and explanations in only 41% of cases, and (2) hallucinations are the primary cause of error, accounting for 58.5% of the incorrect explanations generated by Gemini-1.5-Pro. These findings highlight that TUBench presents a significant challenge to current LVLMs and offers a new perspective for evaluating hallucination and trustworthiness through the lens of unanswerable questions.

TUBench includes four distinct datasets: unanswerable code reasoning (UCR), unanswerable VQA (UVQA), unanswerable GeoQA (UGeoQA), and unanswerable TabMWP (UTabMWP). Among these, UCR and UVQA are entirely new datasets created by us with answers to questions restricted to 'Yes', 'No', or 'Unanswerable'. UGeoQA and UTabMWP are extensions of GeoQA and TabMWP, respectively, with unanswerable questions added.

Unanswerable Code Reasoning (UCR)

For UCR, we create unanswerable questions using the following three strategies:

  • S.1. Introduce uncertainties into code screenshots by adding random functions.
  • S.2. Introduce uncertainties into code screenshots by omitting variable initialization.
  • S.3. Introduce uncertainties into code screenshots by deliberately leaving certain lines of code incomplete.

The UCR dataset is available at datasets/UCR.zip. After extraction, you will find 20 subdirectories. In each subdirectory, n.png contains the original code snippet, while n.txt includes the corresponding answerable questions designed for that image. Similarly, n_u1.png and n_u1.txt are the image and questions constructed using strategy S.1; n_u2.png and n_u2.txt are for strategy S.2; and n_u3.png and n_u3.txt are for strategy S.3.
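As a convenience, the layout above can be traversed with a few lines of Python. The sketch below is illustrative only: the helper name load_ucr is not part of the release, and it assumes each .txt file stores one question per line.

```python
from pathlib import Path

def load_ucr(root="UCR"):
    """Pair every code screenshot with its question file in each subdirectory."""
    samples = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for txt_file in sorted(subdir.glob("*.txt")):
            img_file = txt_file.with_suffix(".png")
            if not img_file.exists():
                continue
            # n.txt holds answerable questions; n_u1/n_u2/n_u3.txt hold questions
            # built with strategies S.1, S.2, and S.3, respectively.
            stem = txt_file.stem
            strategy = f"S.{stem.rsplit('_u', 1)[1]}" if "_u" in stem else None
            questions = [q.strip() for q in txt_file.read_text().splitlines() if q.strip()]
            samples.append({"image": str(img_file), "questions": questions,
                            "strategy": strategy})  # strategy None = original snippet
    return samples
```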

Unanswerable Visual Question Answering (UVQA)

For UVQA, we create unanswerable questions using the following five strategies:

  • S.4. The information required to answer the question is occluded in the image.
  • S.5. The details necessary to answer the question are hard or impossible to discern.
  • S.6. The required information is out of the picture frame.
  • S.7. The spatial relationship is indeterminate.
  • S.8. The required information is not indicated in the image.

The UVQA dataset is available at datasets/UVQA.zip. After extraction, you will find 107 subdirectories. Each subdirectory contains an image and a corresponding text file, which includes both answerable and unanswerable questions designed for that image. Each line in the text file includes the question number, the question, the answer, and the strategy used to construct the unanswerable question (strategy codes 1, 2, 3, 4, and 5 correspond to S.4, S.5, S.6, S.7, and S.8, respectively).
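The text files can be parsed along the following lines. This is only a sketch: the tab delimiter and the helper name load_uvqa are assumptions, so adjust the split to the actual file format.

```python
from pathlib import Path

# Strategy codes stored in the text files, mapped to the strategy names above.
STRATEGY = {"1": "S.4", "2": "S.5", "3": "S.6", "4": "S.7", "5": "S.8"}

def load_uvqa(root="UVQA"):
    samples = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        images = sorted(subdir.glob("*.png")) + sorted(subdir.glob("*.jpg"))
        txts = sorted(subdir.glob("*.txt"))
        if not images or not txts:
            continue
        for line in txts[0].read_text().splitlines():
            # Assumed line format: number <TAB> question <TAB> answer [<TAB> strategy code]
            parts = [p.strip() for p in line.split("\t") if p.strip()]
            if len(parts) < 3:
                continue
            samples.append({
                "image": str(images[0]),
                "question": parts[1],
                "answer": parts[2],
                "strategy": STRATEGY.get(parts[3]) if len(parts) > 3 else None,
            })
    return samples
```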

Unanswerable GeoQA (UGeoQA)

For UGeoQA, we create unanswerable questions using the following strategy:

  • S.9. To construct an unanswerable question, we deliberately remove a condition from the answerable question.

The UGeoQA dataset is available at datasets/UGeoQA.zip. After extraction, you will find 487 subdirectories. Each subdirectory contains an image and a corresponding JSON file, which includes both answerable and unanswerable questions designed for that image. In the JSON file, "subject" refers to answerable questions, while "unanswerable_subject" indicates unanswerable questions.
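Reading the JSON files could look like the sketch below. Only the keys "subject" and "unanswerable_subject" come from the description above; the helper name and everything else are illustrative.

```python
import json
from pathlib import Path

def load_ugeoqa(root="UGeoQA"):
    samples = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        json_files = sorted(subdir.glob("*.json"))
        images = sorted(subdir.glob("*.png")) + sorted(subdir.glob("*.jpg"))
        if not json_files or not images:
            continue
        record = json.loads(json_files[0].read_text(encoding="utf-8"))
        samples.append({
            "image": str(images[0]),
            "answerable_question": record.get("subject"),
            "unanswerable_question": record.get("unanswerable_subject"),
        })
    return samples
```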

Unanswerable TabMWP (UTabMWP)

For UTabMWP, we create unanswerable questions using the following strategy:

  • S.10. To render the original question unanswerable, we deliberately occlude crucial information in the left image, thus creating the altered image displayed on the right.

The UTabMWP dataset is available at datasets/UTabMWP1.zip and datasets/UTabMWP2.zip. After extraction, you will find 200 subdirectories. Each subdirectory contains a JSON file with questions and options. If the corresponding image for a question is n.png, the question is answerable; if the corresponding image is n_u1.png, the question is unanswerable.
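Answerability can therefore be derived from the image filename alone, as in the sketch below. The JSON field names ("image", "question", "options") are assumptions and should be checked against the released files.

```python
import json
from pathlib import Path

def is_answerable(image_name: str) -> bool:
    # Questions paired with n.png are answerable; n_u1.png is the occluded,
    # unanswerable variant produced by strategy S.10.
    return not Path(image_name).stem.endswith("_u1")

def load_utabmwp(root="UTabMWP"):
    samples = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for json_file in sorted(subdir.glob("*.json")):
            record = json.loads(json_file.read_text(encoding="utf-8"))
            image_name = record.get("image", "")  # assumed field name
            samples.append({
                "image": str(subdir / image_name) if image_name else None,
                "question": record.get("question"),
                "options": record.get("options"),
                "answerable": is_answerable(image_name),
            })
    return samples
```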
