Posts
-
A Collection of Good LLM Analogies
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluations and hyped capabilities.
-
Gender Bias in GPT4: A Short Demo
A brief demonstration of gender bias in GPT4, as observed from various downstream-task perspectives, ft. Taylor Swift
-
Unstable Theory of Mind in Sparks of AGI
Discussing the prospect of deriving instinct and purpose from a prompt, creating examples of evaluation problems focusing on the Sally-Anne False-Belief Test, and providing a summary of when GPT4 and GPT3.5 pass or fail the test.
-
Prompt Cap | Making Sure Your Model Benchmarking is Cap or Not-Cap
Enhancing classification with a text annotation framework for improved systematization of prompt-based language model evaluation
-
Rethinking User Study Design for Evaluating Model Explanations: A sketchnote
-
Can Rationalization Improve Robustness?: A sketchnote
-
Beware the Rationalization Trap!: A sketchnote