Posts
-
A Collection of Good LLM Analogies
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluations and hyped capabilities.
-
Gender Bias in GPT4: A Short Demo
A brief demonstration of gender bias in GPT4, as observed from various downstream-task perspectives, ft. Taylor Swift
-
Unstable Theory of Mind in Sparks of AGI
Discussing the prospect of deriving instinct and purpose from a prompt, creating examples of evaluation problems focusing on the Sally-Anne False-Belief Test, and providing a summary of when GPT4 and GPT3.5 pass or fail the test.
-
Prompt Cap | Making Sure Your Model Benchmarking is Cap or Not-Cap
Enhancing classification with a text annotation framework for improved systematization of prompt-based language model evaluation
-
Rethinking User Study Design for Evaluating Model Explanations: A sketchnote
-
Can Rationalization Improve Robustness?: A sketchnote
-
Beware the Rationalization Trap!: A sketchnote