Abstract and 1. Introduction
Background
Method
Experiments
4.1 Multi-hop Reasoning Performance
4.2 Reasoning with Distractors
4.3 Generalization to Real-World Knowledge
4.4 Run-time Analysis
4.5 Memorizing Knowledge
Related Work
Conclusion, Acknowledgements, and References
A. Dataset
B. In-context Reasoning with Distractors
C. Implementation Details
D. Adaptive Learning Rate
E. Experiments with Large Language Models
Recently, Large Language Models (LLMs) with large parameter sizes, further trained on human preferences, have shown remarkable performance in language understanding and generation. These LLMs are powerful zero-shot and few-shot reasoners. Recent work finds that LLMs can perform multi-step reasoning by first generating a reasoning chain and then predicting the answer. In this experiment, we benchmark a popular recent LLM, GPT-3.5, on the two multi-hop reasoning datasets used in our paper. We first evaluate GPT-3.5's zero-shot performance at predicting the correct answers. As Table 10 shows, zero-shot prompting of GPT-3.5 significantly underperforms RECKONING. With few-shot prompting, GPT-3.5 improves on ProofWriter without distractors but still falls behind RECKONING. When distractors are present in the context, RECKONING performs much better than both zero-shot and few-shot GPT-3.5 prompting. This highlights RECKONING's strength in disentangling irrelevant information from useful knowledge, an ability that even powerful LLMs like GPT-3.5 lack.
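To make the evaluation setup concrete, the snippet below sketches how a single ProofWriter-style question might be posed to GPT-3.5 in the zero-shot setting through the OpenAI chat API. The model identifier, prompt wording, and the example facts and rules are illustrative assumptions, not the exact prompts or harness used in our experiments; the few-shot variant would simply prepend a handful of solved examples (with their reasoning chains) to the same prompt.

```python
# Illustrative sketch (not the authors' exact evaluation harness):
# zero-shot prompting of GPT-3.5 on a ProofWriter-style multi-hop question.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical example in the ProofWriter format: facts and rules as context,
# plus a hypothesis to label as true or false.
context = (
    "Facts: The bear is red. The bear chases the cat.\n"
    "Rules: If something is red and chases the cat, then it is rough."
)
question = "Is the following statement true or false? The bear is rough."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # assumed model identifier
    temperature=0.0,         # deterministic decoding for evaluation
    messages=[
        {
            "role": "user",
            "content": f"{context}\n\n{question}\nAnswer with 'true' or 'false'.",
        }
    ],
)

prediction = response.choices[0].message.content.strip().lower()
print(prediction)  # compared against the gold label when computing accuracy
```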
:::info Authors:
(1) Zeming Chen, EPFL (zeming.chen@epfl.ch);
(2) Gail Weiss, EPFL (antoine.bosselut@epfl.ch);
(3) Eric Mitchell, Stanford University (eric.mitchell@cs.stanford.edu);
(4) Asli Celikyilmaz, Meta AI Research (aslic@meta.com);
(5) Antoine Bosselut, EPFL (antoine.bosselut@epfl.ch).
:::
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::


