LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Jan 1, 2024

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K. Surikuchi, Ece Takmaz, Alberto Testoni


Type: Journal article

Publication: CoRR

Last updated on Jan 1, 2024
