Evaluation
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our ability to benchmark them, which in turn has resulted in poor evaluations and hyped capabilities.
-
Gender Bias in GPT-4: A Short Demo
A brief demonstration of gender bias in GPT-4, observed across several downstream tasks, ft. Taylor Swift
-
Unstable Theory of Mind in Sparks of AGI
Discussing the prospect of deriving intent and purpose from a prompt, constructing example evaluation problems focused on the Sally-Anne false-belief test, and summarizing when GPT-4 and GPT-3.5 pass or fail it.
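A minimal sketch of how such a false-belief probe might be scripted, assuming the OpenAI Python client; the prompt wording, model identifiers, and keyword-based pass criterion here are illustrative placeholders, not the post's exact setup.

```python
# Minimal sketch of a Sally-Anne false-belief probe (illustrative only).
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

SALLY_ANNE = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble from the basket to the box. "
    "Sally comes back. Where will Sally look for her marble first?"
)

def passes_false_belief(model: str) -> bool:
    """Return True if the model predicts Sally searches the original location."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SALLY_ANNE}],
        temperature=0,
    )
    answer = reply.choices[0].message.content.lower()
    # Crude keyword check: the correct (false-belief) answer is the basket,
    # where Sally last saw the marble, not the box it was moved to.
    return "basket" in answer

for model in ("gpt-4", "gpt-3.5-turbo"):  # placeholder model identifiers
    print(model, "passes" if passes_false_belief(model) else "fails")
```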
-
Prompt Cap | Making Sure Your Model Benchmarking is Cap or Not-Cap
Enhancing classification with a text-annotation framework that brings systematic structure to prompt-based language model evaluation