Inside the Black Box: Artificial Intelligence Safety and Mechanistic Interpretability

UCLA CS 269, Fall 2025

M/W 4-5:50pm, Royce Hall 156

Instructor: Saadia Gabriel
Email: skgabrie@cs.ucla.edu
Office: Eng VI 295A
Office Hours: 1:45-2:45pm on Mondays

Course Description: Large language models (LLMs) are becoming ubiquitous in our society. They are used in many real-world applications ranging from content moderation and online advertisement to healthcare. Given their increasing role in what we see, how we think, and what is publicly known about us, it is critical to consider risks to public safety when deploying LLM-based systems. This seminar will provide a lens on historical and current safety problems in natural language processing (NLP). We will first discuss ethics challenges introduced by the deployment of LLMs across domains. We will then read literature attempting to understand how "black box" LLMs work and address how lack of transparency hinders AI safety. These discussions will be accompanied by guest lectures from domain experts. There will be a group coding project where students will explore a mechanistic interpretability topic in-depth through the lens of AI safety.

Schedule:

Date Topic Description Assignment(s)
9/29 Intro We will go over the syllabus, schedule, reading list, and course expectations. There will be an overview of historical challenges. [Slides]
  • Reading assignment #1, due by 10/1 11:59pm PT.
  • Sign up for a presentation slot.
10/1 Causal Interventions & Student Presentations We'll discuss attempts to understand the internals of neural networks through causal approaches, including counterfactuals. We will have our first student presentation. [Slides]
  • Reading assignment #2, due by 10/14 11:59pm PT.
10/6-10/8 Group project brainstorming Free time to meet in-person and coordinate final project plans. Guidelines for the final project proposal are here.
  • Final project groups and proposals due by 10/10 at 11:59pm PT.
10/13 Guest Lecture Sarah Wiegreffe (University of Maryland)
  • Reading assignment #3, due by 10/19 11:59pm PT.
10/15 Student Presentations TBD, sign up here
  • Reading assignment #4, due by 10/21 11:59pm PT.
10/20 Student Presentations TBD, sign up here
  • Reading assignment #5, due by 10/28 11:59pm PT.
10/22 Circuit Analysis & Activation Steering & Student Presentations We'll cover work on decomposing neural networks to find subcomponents associated with specific concepts or behaviors, and discuss how deeper understanding of these associations has impacted controllability. We will conclude with student presentations. [Slides]
  • Reading assignment #6, due by 11/9 11:59pm PT.
10/27 Guest Lecture Sophie Hao (Boston University)
  • Reading assignment #7, due by 11/11 11:59pm PT.
10/29 Student Presentations TBD, sign up here
  • Reading assignment #8, due by 11/16 11:59pm PT.
11/3-11/5 Peer Feedback Sessions A paper-clinic-style peer review session; at least one member of each final project team must be present. Every team should bring its mid-quarter report draft, and every member of the team will independently provide feedback to at least two other teams.
  • Mid-quarter final project report, due by 11/2 11:59pm PT.
  • Reading assignment #9, due by 11/18 11:59pm PT.
11/10 Student Presentations TBD, sign up here
11/12 Mechanistic Interpretability in the Real World & Student Presentations We'll discuss factors that have been empirically shown to affect trust in AI. We'll have a critical conversation about whether approaches covered so far are aligned with improving trust, and focus areas for future work. We will conclude with student presentations. [Slides]
11/17 Student Presentations TBD, sign up here
11/19 Student Presentations & Concluding Remarks TBD, sign up here
11/24 Guest Lecture Ana Marasović (University of Utah)
  • Final project slides due by 11/25 11:59pm PT.
11/26 Final Presentations Schedule TBD
12/1 Final Presentations Schedule TBD
12/3 Final Presentations Schedule TBD, may be virtual-only due to travel
  • Final project papers due by 12/12 11:59pm PT.

Resources:

We will be using Perusall for collaborative paper note-taking and course discussion.

Grading:

Detailed guidelines for assignments will be released later in the quarter.

  • Reading assignments (40%)
    • Students will read the assigned papers and post an original comment or question for each paper on Perusall. (36%)
    • In pairs, students will sign up to present one of the assigned papers and summarize online discussion from Perusall. Each student will only present once. (4%)
  • Project (55%)
  • Students will form groups and write a short (max 5 pages) paper on an AI policy framework for addressing concerns raised during one of the class discussions. This paper should include a technical coding component, either demonstrating a safety risk or a proposed solution.
    • This will be graded based on a proposal (5%), a mid-quarter progress report (5%), final in-person presentations (15%), and a final write-up (30%).
  • Peer Feedback (5%)
    • Students will be asked to provide short, constructive feedback to their peers' paper drafts and final presentations that can aid in finalizing project write-ups.

Course Policies:

Late Policy. Out of courtesy to peers, students are expected to complete reading assignments on time; however, each student may turn in one reading assignment up to a week late without penalty. Since the final project is a group assignment, there are no late days, but extensions will be considered under extraordinary circumstances. Students are expected to communicate potential presentation conflicts (e.g. illness, conference travel) to the instructor in advance.

Academic Honesty. Reading assignments are expected to be completed individually, apart from the paper presentation, and the instructor will check for overlap between posted comments/questions. For all assignments, any collaborators or other sources of help should be explicitly acknowledged. Violations of academic integrity (please consult the student conduct code) will be handled based on UCLA guidelines.

Accommodations. Our goal is to have a fair and welcoming learning environment. Students should contact the instructor at the beginning of the quarter if they will need special accommodations or have any concerns.

Use of ChatGPT and Other LLM Tools. Students are expected to first draft writing without any LLMs, and all ideas presented must be their own. Students may use LLMs for grammar correction and minimal editing if they add an acknowledgement of this use. Any work suspected to be entirely AI-generated will be given a grade of 0.

Acknowledgements: This course was very much inspired by two UW courses: Yulia Tsvetkov's Ethics in AI course and Amy X. Zhang's Social Computing course. It was also inspired by Marzyeh Ghassemi's Ethical ML in Human Deployments course at MIT.