LSE.AI

We are a student-run AI research lab at the London School of Economics, focusing on the mechanistic interpretability of LLMs.

Our Research

Find out more about what we do, how we do it, and why.

Objectives
Publish at top venues such as NeurIPS, ACL, ICLR, and ICML, and produce work with significant impact.
Our Methodology
We improve sparse autoencoders (SAEs) and apply them to open problems across the sciences, with a particular focus on mechanistic interpretability.
Open Problems
Which circuits do LLMs use for a given task, and how do they work? How can SAEs be trained more cheaply? What do LLMs tell us about language?
Benefits of Interpretability
An AI lie detector would significantly reduce AI-related risks. Moreover, interpretable AI systems are well suited to fields like medicine and finance, where confidence in a model's output is critical.

Our team

We’re a dynamic group of individuals who are passionate about research and dedicated to delivering the best work we can.

Selected Papers

Find out more about the work we've done so far.

Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Sparse Autoencoders (SAEs) have gained popularity as a tool for enhancing the interpretability of Large Language Models (LLMs). However, training SAEs can be computationally intensive, especially as model complexity grows. In this study, we explore the potential of transfer learning to accelerate SAE training by capitalizing on the shared representations found across adjacent layers of LLMs. Our experimental results demonstrate that fine-tuning SAEs using pre-trained models from nearby layers not only maintains but often improves the quality of learned representations, while significantly accelerating convergence. These findings indicate that the strategic reuse of pre-trained SAEs is a promising approach, particularly in settings where computational resources are constrained.

Davide Ghilardi, Federico Belotti, Marco Molinari, Jaehyuk Lim
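To give a flavour of the layer-wise transfer idea, here is a minimal, hypothetical PyTorch sketch (not the paper's code): an SAE trained on one layer's activations warm-starts the SAE for an adjacent layer, which is then fine-tuned for far fewer steps. The activation tensors, dimensions, and hyperparameters below are placeholders.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """A minimal ReLU SAE: encode to an overcomplete sparse code, decode back."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(z), z


def train_sae(sae, acts, steps, l1_coeff=1e-3, lr=1e-4, batch_size=256):
    """Reconstruction loss plus an L1 sparsity penalty on the feature activations."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (batch_size,))]
        recon, z = sae(batch)
        loss = ((recon - batch) ** 2).mean() + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


# Placeholder activations; in practice these would be residual-stream
# activations collected from two adjacent layers of an LLM.
d_model, d_hidden = 768, 8 * 768
acts_layer_k = torch.randn(10_000, d_model)
acts_layer_k_plus_1 = torch.randn(10_000, d_model)

# Train from scratch on layer k, then warm-start layer k+1 from layer k's
# weights and fine-tune with a much smaller step budget.
sae_k = train_sae(SparseAutoencoder(d_model, d_hidden), acts_layer_k, steps=1_000)
sae_k_plus_1 = SparseAutoencoder(d_model, d_hidden)
sae_k_plus_1.load_state_dict(sae_k.state_dict())  # layer-wise transfer initialisation
sae_k_plus_1 = train_sae(sae_k_plus_1, acts_layer_k_plus_1, steps=200)
```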

Interpretable Company Similarity with Sparse Autoencoders

Determining company similarity is a vital task in finance, underpinning risk management, hedging, and portfolio diversification. Practitioners often rely on sector and industry classifications such as SIC and GICS codes to gauge similarity, the former being used by the U.S. Securities and Exchange Commission (SEC) and the latter widely used by the investment community. Since these classifications lack granularity and need regular updating, clustering embeddings of company descriptions has been proposed as an alternative, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders (SAEs) have shown promise in enhancing the interpretability of Large Language Models (LLMs) by decomposing their activations into interpretable features. Moreover, SAE features capture an LLM's internal representation of a company description, rather than semantic similarity alone, as is the case with embeddings. We apply SAEs to company descriptions and obtain meaningful clusters of equities. We benchmark SAE features against SIC codes, industry codes, and embeddings. Our results demonstrate that SAE features surpass sector classifications and embeddings in capturing fundamental company characteristics. This is evidenced by their superior performance in correlating logged monthly returns, a proxy for similarity, and in generating higher Sharpe ratios in co-integration trading strategies, which underscores deeper fundamental similarities among companies. Finally, we verify that sparse features form simple and interpretable explanations for our clusters.

Marco Molinari, Victor Shao, Luca Imeneo, Mateusz Mikolajczak, Vladimir Tregubiak, Abhimanyu Pandey, Sebastian K. R. T. Pereira
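As a rough illustration of the clustering step described above, the sketch below (hypothetical, not the paper's pipeline) clusters companies in SAE feature space and compares the clusters with baseline sector labels. The data are random placeholders, and the paper's actual evaluation uses return correlations and co-integration trading Sharpe ratios rather than the simple agreement score shown here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_companies, n_features = 500, 1024

# Placeholder data: sparse SAE feature activations per company description,
# and a baseline sector label per company (e.g. derived from SIC codes).
sae_features = rng.random((n_companies, n_features)) * (rng.random((n_companies, n_features)) < 0.05)
sector_labels = rng.integers(0, 10, size=n_companies)

# Cluster companies in SAE feature space.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(sae_features)

# One simple benchmark: how much the SAE clusters agree with sector labels.
print("ARI vs. sector labels:", adjusted_rand_score(sector_labels, clusters))
```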

Fixed Point Explainability

This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.

Emanuele La Malfa, Jon Vadillo, Marco Molinari, Michael Wooldridge
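The recursion at the heart of the paper can be sketched in a few lines. The toy example below is purely illustrative: the function names, the toy model, and the toy explainer are invented for this page, and the paper's formal definitions and convergence conditions are far more general. The idea is simply to feed an explanation back into the explainer until it stops changing.

```python
def fixed_point_explanation(model, explainer, x, max_iters=20):
    """Recursively apply the explainer until the explanation stops changing."""
    explanation = explainer(model, x, explanation=None)  # initial explanation
    for _ in range(max_iters):
        new_explanation = explainer(model, x, explanation=explanation)
        if new_explanation == explanation:  # reached a fixed point
            return new_explanation, True
        explanation = new_explanation
    return explanation, False  # did not converge within the budget


# Toy model: sums the input features that the mask keeps.
def toy_model(x, mask=None):
    mask = mask if mask is not None else [1] * len(x)
    return sum(v for v, m in zip(x, mask) if m)


# Toy explainer: keep a feature only if it is currently kept and ablating it
# (zeroing its mask entry) changes the model's output.
def toy_explainer(model, x, explanation=None):
    mask = explanation if explanation is not None else [1] * len(x)
    base = model(x, mask)
    new_mask = []
    for i, m in enumerate(mask):
        ablated = [0 if j == i else mj for j, mj in enumerate(mask)]
        new_mask.append(int(m == 1 and model(x, ablated) != base))
    return new_mask


explanation, converged = fixed_point_explanation(toy_model, toy_explainer, [3, 0, 5, 0])
print(explanation, converged)  # keeps only the features that matter, e.g. [1, 0, 1, 0] True
```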