Fine-Tuning Large Language Models with Sequential Instructions

1School of Informatics, University of Edinburgh
*Equal contribution
†Corresponding author: hanxu.hu@ed.ac.uk, eponti@ed.ac.uk

"Finish the paper, release the code and data, and then tweet about this paper"

— Can language models handle this kind of sequential instruction well?

Introduction

Despite the success of existing instruction-tuned models, we find that they usually struggle to respond to queries that contain multiple sequential instructions. This impairs their performance on complex problems whose solution consists of multiple intermediate tasks. Therefore, we contend that part of the fine-tuning data mixture should be sequential, containing a chain of interrelated tasks. We first approach sequential instruction tuning from a task-driven perspective, manually creating multilingual and multimodal tasks: namely "translate then predict" and "caption then answer". Next, we push the boundary further by turning instructions in existing datasets (e.g., Alpaca, FlanCoT, and Tulu) into diverse and complex sequential instructions, making our method general-purpose. Models fine-tuned with our sequential instructions show improved results in reasoning and open-ended generation. Moreover, we put forward a new benchmark named SeqEval to evaluate a model's ability to follow all the instructions in a sequence, which further corroborates the benefits of our fine-tuning method.

Overview

The issue: Instruction-tuned models perform poorly in sequential tasks
Although models have shown remarkable capabilities on single instructions, we find that they usually struggle to respond to queries with multiple instructions. This impairs their performance on complex problems whose solution consists of multiple intermediate tasks.

Why is it the case and what does it mean "Sequential Instruction"?
The main reason is that most instruction data consist of single queries (e.g., Alpaca) or are generated from NLP tasks (e.g., FlanCoT). We contend that part of the fine-tuning data mixture should be sequential, containing a chain of interrelated tasks. Some of these examples are intermediate tasks for multilingual and VQA settings: namely "translate then predict" and "caption then answer".

What is SIT?
We propose Sequential Instruction Tuning (SIT), a novel fine-tuning method that leverages sequential instructions to enhance the performance of large language models (LLMs) on both generic tasks and sequential tasks.
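The core idea can be illustrated with a minimal sketch of how a sequential training example might be assembled from interrelated sub-tasks, such as "translate then answer". The function name, field names, and joining template below are illustrative assumptions, not the paper's actual data schema or generation pipeline.

```python
# Hypothetical sketch: chain several single instructions into one
# sequential instruction. Field names ("instruction", "input") follow
# the Alpaca-style convention but are assumptions here.

def make_sequential_example(instructions, query):
    """Join interrelated sub-instructions into one sequential instruction."""
    chained = ", then ".join(instructions)
    # Capitalize the first letter and terminate the combined instruction.
    return {
        "instruction": f"{chained[0].upper()}{chained[1:]}.",
        "input": query,
    }

example = make_sequential_example(
    ["translate the question into English", "answer it"],
    "¿Cuál es la capital de Francia?",
)
print(example["instruction"])
# Translate the question into English, then answer it.
```

The model is then fine-tuned to produce all intermediate outputs in order, rather than only the final answer.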

How good is SIT?
Experiments show that SIT significantly outperforms vanilla IT and WizardLM on both generic and sequential tasks. Specifically, SIT shows superior results on reasoning and coding tasks (GSM8K, Codex HumanEval) and on multilingual tasks (XQuAD, MGSM), and it shows significant improvement on SeqEval (evolved from AlpacaEval).

Results and Analysis


Main Results

SIT outperforms vanilla IT and WizardLM (with responses regenerated using the same models) across generic tasks, such as reasoning and coding, and outperforms them on sequential tasks by a large margin.


Analysis


(A) Ablations

Our ablation results show that our method is agnostic to the choice of both the base model and the generation model.

(B) Is Sequence Length the Driving Factor Behind Performance?

A variable factor in our comparison of IT and SIT is the length of the training data. Based on this, we prepare three ablation experiments to investigate whether SIT's higher metric scores are merely attributable to having more training tokens.
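One way to control for length is to cap the sequential data at the same token budget as the vanilla IT data. The sketch below is a hypothetical illustration of such a length-matched control, not the paper's actual ablation setup; whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical length-matched control: keep sequential examples only
# until their total token count reaches the vanilla IT token budget.
# Whitespace tokenization is a stand-in for a real tokenizer.

def total_tokens(dataset):
    return sum(len(example.split()) for example in dataset)

def match_budget(dataset, budget):
    """Keep examples in order until the token budget is exhausted."""
    kept, used = [], 0
    for example in dataset:
        n = len(example.split())
        if used + n > budget:
            break
        kept.append(example)
        used += n
    return kept

it_data = ["answer the question", "write a poem"]
sit_data = ["translate then answer the question", "write then critique a poem"]
matched = match_budget(sit_data, total_tokens(it_data))
print(total_tokens(matched) <= total_tokens(it_data))  # True
```

If SIT still outperforms IT under an equal token budget, the gains cannot be explained by training-data length alone.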

(C) Root Verb Analysis in New Instructions

We check the kinds of instructions generated via Seq-Instruct and draw potential links to model improvements in different skill types. We identify the verb-noun structure in the generated instructions using the Berkeley Neural Parser.
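The statistic being collected can be illustrated with a toy approximation. The actual analysis uses the Berkeley Neural Parser; the crude heuristic below (treating the first word of an imperative instruction as the root verb and matching its object against a small hand-picked noun list) is an assumption made purely to keep the sketch self-contained.

```python
# Toy stand-in for the verb-noun analysis: the real pipeline uses the
# Berkeley Neural Parser, whereas this heuristic takes the first word
# as the imperative root verb and looks up its object in a tiny,
# hand-picked noun list.
from collections import Counter

NOUNS = {"code", "paper", "poem", "summary", "email", "story"}

def root_verb_noun(instruction):
    words = instruction.lower().rstrip(".").split()
    verb = words[0]  # imperative instructions start with the verb
    noun = next((w for w in words[1:] if w in NOUNS), None)
    return verb, noun

instructions = [
    "Write a poem about autumn.",
    "Summarize the paper in two sentences.",
    "Write an email to the reviewers.",
]
counts = Counter(root_verb_noun(i) for i in instructions)
print(counts.most_common(1))
# [(('write', 'poem'), 1)]
```

Aggregating these verb-noun pairs over the generated data reveals which skill types (writing, summarizing, coding, etc.) dominate the sequential instructions.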

BibTeX

@article{Hu2024FinetuningLL,
          title={SIT: Fine-tuning Large Language Models with Sequential Instructions},
          author={Hanxu Hu and Simon Yu and Pinzhen Chen and Edoardo Maria Ponti},
          journal={ArXiv},
          year={2024},
          volume={abs/2403.07794}
        }