Mentors & Projects

BASE Mentors are pushing AI Safety, Governance, & Security forward while investing in up and coming researchers, practitioners, and leaders.

Tobi Olaiya

Tobi Olaiya is a Senior Manager of Ethical Use Policy at Salesforce, where she leads the operationalization of the company's AI acceptable use policies. She specializes in AI governance, product safety, and cross-functional policy initiatives that balance innovation with ethical technology practices.

Prior to Salesforce, Tobi held Trust & Safety roles at Twitter, where she led the development of the platform's first recommendations explainer, advancing algorithmic transparency and user choice. She holds a Master of Public Policy from the University of Maryland.

Ethical Use & Policy, Salesforce
  • Summary

    This research investigates the systemic gap in AI safety governance, focusing on how standardized models often fail to account for the unique socio-technical risks and cultural nuances found in diverse technological environments. The final framework proposes interoperable safety standards that integrate localized cultural values with universal safety requirements to ensure robust and inclusive governance across different contexts.

    Deliverables: A playbook or template outlining practical steps for ensuring inclusive governance.

    Number of Fellows:


  • Skills needed: Qualitative research; background in areas such as public policy, technology policy, sociology; strong analytical writing.

    Time commitment: 10–15 hours per week.

    Helpful backgrounds: Responsible AI frameworks, Trust & Safety, AI Governance

Alisar Mustafa

Head of AI Policy, Duco

Track: Policy & Governance

Alisar Mustafa is the Head of AI Policy at Duco, where she leads AI safety and governance initiatives for enterprise clients to ensure they operate safely, securely and responsibly. A strategic leader in AI governance, she specializes in model fine-tuning, adversarial testing, and regulatory compliance. She has held AI governance roles at Meta, the Federation of American Scientists, and the U.S. Census Bureau.

She also authors a widely read AI policy newsletter, providing insights on emerging regulations and industry developments. Her expertise spans AI risk mitigation, model evaluation, policy analysis and stakeholder engagement across enterprise and government sectors.

  • Summary

    This project investigates how AI safety training datasets vary across languages. Fellows will select 2–3 languages and conduct a structured literature review to identify gaps in language coverage, harm categories, and whether datasets use translations or native-language content. The focus is on understanding what safety training resources currently exist and what's missing—not on building new datasets or testing models.

    Deliverables

    Public spreadsheet mapping safety resources by language, memo on key gaps and research opportunities, and potential workshop/conference submission

    Number of Fellows: 2-3

  • Skills needed: Ability to critically evaluate dataset methodology (translation approaches, annotation quality, sampling, limitations) and experience conducting structured literature reviews with clear criteria and synthesis across sources.

    Time commitment: 8–10 hours per week for 8 weeks.

    Helpful backgrounds: Prior exposure to multilingual NLP or cross-lingual datasets is strongly preferred. Familiarity with dataset documentation standards (like datasheets for datasets, model cards, or transparency reporting) is also valuable.

KRYstal Jackson

Non-Resident Fellow, Center for Long-Term Cybersecurity

Krystal Jackson is a Non-Resident Research Fellow at the Center for Long-Term Cybersecurity, AI Security Initiative, where she conducts research into the global security implications of artificial intelligence. Before this role, she worked as a Research Associate at the Frontier Model Forum, where she focused on advancing AI-cyber safety and AI security with industry leaders.

Krystal also previously served as an AI Capabilities Analyst at the Cybersecurity and Infrastructure Security Agency, driving critical AI initiatives within the Infrastructure Security Division. Krystal's research experience includes leadership with the Center for AI and Digital Policy Research Clinic, as a Junior AI Fellow at the Center for Security and Emerging Technology, and as a Public Interest Technology Fellow at the U.S. Census Bureau. She acts as the Research Director of BASE.

  • Summary

    This project investigates how software architecture ("scaffolding") transforms foundation models into effective autonomous agents. Recent research shows identical models perform dramatically differently based on their scaffolding configurations, with performance gaps that exceed differences between human skill levels. The project will develop a taxonomy of scaffolding components, create metrics to measure capability uplift, and produce policy-relevant outputs for government agencies and the AI safety community. Fellows will contribute to controlled experiments, architecture analysis, and policy translation work that bridges technical findings with practical evaluation frameworks.

    Deliverables:

    Research paper on scaffolding taxonomy and measurement frameworks, policy recommendations report. Additionally, fellows may contribute to an open-source scaffolding component database, measurement toolkit, or a possible workshop/conference submission.

    Number of Fellows: 2

    Requirements for Fellows

    Skills needed: Research engineering skills, familiarity with AI agent frameworks and cybersecurity evaluation, and ability to translate technical findings for policy audiences. Professional engagement with policymakers. 

    Time commitment: Variable (part-time over 3-4 months within 12-month project timeline)

    Helpful backgrounds: Experience with LLM-based agents, penetration testing, AI safety evaluation frameworks, or policy analysis

  • Summary

    General-purpose AI systems pose unique risks that require adapting traditional safety-critical risk management approaches. An emerging field of AI risk management is gaining traction in national and international standards-setting organizations and benefits from civil society research. This project translates technical research and high-level guidance into actionable best practices and recommendations for AI risk management that can inform policy and standards development. Fellows will work on converting complex technical findings and policy frameworks into practical implementation guidance for organizations managing AI risks.

    Deliverables:

    Possible co-authorship on publications (white papers and policy reports), speaking opportunities at events.

    Number of Fellows: 2

    Requirements for Fellows

    Skills needed: Ability to translate technical research into actionable recommendations, interest in AI risk management, and policy advocacy.

    Time commitment: To be determined

    Helpful backgrounds: Familiarity with risk management frameworks, experience with standards-setting processes, understanding of AI safety principles or governance structures

Track: Alignment & Security

Delali Kwasi Dake

Professor, University of Education, Winneba, Ghana

Prof. Delali Kwasi Dake is an Associate Professor of Computing and Information Technology with a PhD in Computer Engineering and Head of the Department of ICT Education at the University of Education, Winneba, Ghana. His research focuses on machine learning, AI ethics, governance, and responsible digital innovation, particularly in African contexts. He is the Founder of Jobweb Africa, Partnerships Manager for the Ghana AI Research Network (GAIN), and a speaker at major AI and technology summits including the Africa Fintech Summit, Ghana Data Science Summit, and Digital Asset Summit Africa.


  • Summary:
    Generative AI has created ethical tensions in assessment systems designed before AI tools became widely accessible. This project examines whether current assessment practices remain ethically defensible and proposes governance-oriented ethical principles for redesigning assessments that balance academic integrity, fairness, and institutional responsibility.

    Deliverables:

    Practical ethics-informed guidance for universities or regulators, policy memo(s), or a conference submission.

    Number of Fellows: 2


  • Skills needed: Strong analytical and writing skills, with interest in education policy, ethics, or governance.

    Time commitment: 8–10 hours per week.

    Helpful backgrounds: Education studies, ethics, policy analysis, or related fields.

Track: Policy & Governance

Cozmin Ududec

Science of Evaluation Lead, UK AISI

Currently leading the Science of Evaluation team at the UK AI Security Institute. He was previously Chief Scientist at Invenia Labs , doing research on electricity grids and machine learning. My background is in mathematical physics, in particular in the foundations of quantum mechanics and in quantum information theory. He has broad research experience with conceptual, pure, and applied mathematical problems.

  • Summary:

    The EU AI Code of Practice defines a new model based on weight initialization, but regulators actually care about behavioral changes that emerge from weight updates during training. This project investigates how much a model's weights can change before significant performance shifts occur on benchmarks, helping clarify when updated models should require re-auditing.

    Deliverables:

    Academic publication on methods and findings, open source tools for perturbing model weights, possible CSET policy explainer

    Number of Fellows: 2

  • Skills needed: Python programming and hands-on experience with neural networks.

    Time commitment: 10–20 hours per week.

    Helpful backgrounds: Deep learning theory, optimization, and adversarial machine learning.

Track: Alignment

Track: Policy & Governance

Latisha Harry

Senior Fellow, Portulans Institute

Track: Policy & Governance

Latisha is an independent research and policy consultant specializing in AI governance, digital rights, and technology-driven institutional innovation. She designs and leads complex, multi-stakeholder research programs that bridge the worlds of policy, civil society, and emerging technology.

Her work spans AI risk assessment, data governance, misinformation analysis, and legislative mapping, and has worked with organisations such as the Effective Institutions Project, OpenAI, Stanford HAI, CIVICUS, Global Witness, and the International Labour Organization, among others. Latisha’s contributions range from red-teaming advanced AI systems to analysing global surveillance trends, evaluating government responses to AI risks, building cross-national policy datasets, and producing high-impact reports for policymakers.


  • Summary

    This project creates a comprehensive map of AI safety regulations across 10–15 key countries, focusing on critical safeguards like safety testing, model evaluation, supply-chain security, compute tracking, and incident reporting.

    Deliverables

    Fellows will help build a comparative dataset, assess regulatory maturity using a structured framework, and contribute to a report identifying the most important gaps in catastrophic risk prevention. The final output—including country profiles and cross-national analysis—will guide policymakers, researchers, and civil society on where safety measures are lacking and which interventions should be prioritized.

    Number of Fellows: 2-3

  • Skills needed: Strong qualitative research and document synthesis abilities, clear writing skills, and interest in AI safety and governance.

    Time commitment: 10–15 hours per week.

    Helpful backgrounds: Public policy, political science, international governance, technology policy, or AI ethics/safety. Experience with comparative policy analysis is a plus but not required.

  • Summary

    This project creates a comprehensive map of AI safety regulations across 10–15 key countries, focusing on critical safeguards like safety testing, model evaluation, supply-chain security, compute tracking, and incident reporting.

    Deliverables

    Fellows will help build a comparative dataset, assess regulatory maturity using a structured framework, and contribute to a report identifying the most important gaps in catastrophic risk prevention. The final output—including country profiles and cross-national analysis—will guide policymakers, researchers, and civil society on where safety measures are lacking and which interventions should be prioritized.

    Number of Fellows: 2-3

  • Skills needed: Strong qualitative research and document synthesis abilities, clear writing skills, and interest in AI safety and governance.

    Time commitment: 10–15 hours per week.

    Helpful backgrounds: Public policy, political science, international governance, technology policy, or AI ethics/safety. Experience with comparative policy analysis is a plus but not required.

Titi Akinsanmi

 Global Policy Team Lead, Google

With over four years as Global Policy Team Lead at Google, I focus on shaping policies for the responsible use and access to generative AI products and hardware platforms. My team works to develop trustworthy technologies that respect individual and societal rights while addressing safety and ethical considerations in the digital space.  

A dedicated advocate for responsible innovation, I bring extensive expertise in public access, government consultations, and engaging with officials to address critical issues in technology governance. My mission is to ensure that digital tools and policies empower and protect users worldwide, fostering a safer, more inclusive future in the evolving digital economy.


  • This project explores how AI privacy and data protection frameworks can be adapted to reflect African cultural values like Ubuntu and communal wealth. It bridges the gap between global standards like GDPR and local realities in the Digital South, challenging Western individualistic privacy concepts and preventing "digital colonialism" in AI governance.

    Fellows will conduct comparative analysis of African data privacy legislation, create accessible educational content for everyday users, and develop policy briefs on personal data rights in African contexts.

    Deliverables:

    Featured blog post for the demistef.ai portal, research paper comparing African data legislation, and plain language guide to AI privacy rights for everyday users.

    Number of Fellows: 2

    Requirements for Fellows

    Skills needed: Qualitative research, interest in AI policy/law, strong science communication skills (translating complex topics for non-experts), AI engineering and coding capabilities

    Time commitment: 10–15 hours per week

    Helpful backgrounds: Familiarity with GDPR, AI ethics principles, or African socio-political contexts

  • Summary:

    This project identifies and documents real-world instances of algorithmic bias and unethical surveillance in African societies. It creates a knowledge base that empowers everyday users to recognize and report AI-driven harms, putting end-users back at the center of the technology lifecycle as key decision-makers and informing policy development.

    Fellows will map AI deployments across Africa, conduct case studies on bias incidents, and create accessible educational materials explaining technical AI/ML concepts.

    Deliverables:

    Living blog post for the portal. Mapped directory of AI/ML initiatives in the Digital South. Updated "AI/ML Terms Dictionary" written in everyday language.

    Number of Fellows: 2

    Requirements for Fellows

    Skills needed: Tech-savviness, data mapping, creative content creation (video/blogging), AI engineering and coding capabilities

    Time commitment: 10–15 hours per week

    Helpful backgrounds: Basics of Machine Learning, familiarity with social justice issues in tech, interest in "Human-in-the-loop" AI

DR. GASPARD BAYE

Track: Security

Track: Policy & Governance

COLIN SHEA BLYMYER

Research Fellow, CSET

Colin Shea-Blymyer is a Research Fellow at Georgetown’s Center for Security and Emerging Technology (CSET), where he works on the CyberAI Project. His research has spanned safe reinforcement learning, formal methods, adversarial machine learning, and AI ethics. Previously, he was a graduate researcher with MITRE where he helped establish the National Institute of Standards and Technology (NIST) program on adversarial machine learning research at the

National Cybersecurity Center of Excellence (NCCOE). He holds an MS and BS in Computer

Science from Virginia Tech. He has a PhD in Computer Science and Artificial Intelligence from


Track: Alignment

Amari Cowan

Emerging Technology Fellow U.S. Census Bureau

Amari is an AI policy and governance professional working at the intersection of AI and public policy, translating technical safety, risk, and performance considerations into practical governance frameworks. Amari’s experience includes senior roles across Big Tech and the U.S. federal government, including serving as the first AI Officer at the Federal Energy Regulatory Commission and working on global technology governance initiatives at Meta and TikTok.

She currently serves as a Technologist-in-Residence, Emerging Technology Fellow at the U.S. Census Bureau, where she works on experimental policy frameworks and advises leadership across the federal landscape on ethical AI governance at scale.


  • Summary

    This project explores how AI systems increasingly shape human-to-human interactions and decision-making, often in subtle but consequential ways. By analyzing concrete, real-world case studies, the research identifies measurable changes in human-to-human behavior, attitudes, and social dynamics, and evaluates how existing governance frameworks address, or fail to address, these “influence risks.”

    Deliverables:

    1. Short paper or abstract submission (to be determined pending relevance to available conferences and journals).

    2. Draft policy memo to complement the paper as an example of governance in practice.

    3. Optional: one public comment to a government agency of the fellow’s choice relevant to our work, to be submitted independently.

    Number of Fellows: 1-2

  • Skills needed: Excellent (or strong aspirations for) academic and persuasive writing skills. 

    Time Commitment (Hours per week):

    Helpful Backgrounds: Some knowledge of AI governance standards, policies, or regulations, relevant either globally, in the United States, the European Union, or within a region or country of the fellow’s choice.

    A proficient understanding of the regulatory or legislative framework of a country or region of interest.

    General interest in the intersection of AI policy and human computer interaction. 

Track: Governance

Gabrielle Hibbert

AI Policy Lead, Commonwealth of Pennsylvania

Gabrielle Hibbert is currently the AI Policy Lead for the Commonwealth of Pennsylvania, where she writes and develops governance solutions for the Commonwealth. With her experience designing policy that is user driven and backed by industry-leading research, Gabrielle has helped establish innovative and transparency policy with the needs of risk, security, and data privacy. In 2023, she was named a non-resident fellow at New America, where she developed and published a paper on user-informed nutrition labels for generative AI tools. Her work can be found at Rubrik, the Kapor Center, and the Bipartisan Policy Center, among other outlets.

She has served as a pro bono technical expert and Board Member of the Heller School for Social Policy's Tech Policy center since 2022.



  • Summary

    This project will analyze the current gaps within current AI governance frameworks. By understanding user behavior and current trends in AI use, organizations can learn to build effective AI governance frameworks beyond the top-down governance

    Deliverables: (1) A published policy report with actionable policy outcomes,

    (2) Submission to a conference or journal.

    Number of Fellows: 2

  • Skills needed: Experience with writing literature reviews, ability to conduct user interviews, sentiment analysis, and data analysis, & experience with past research contributions, and interest in Ai governance.

    Time Commitment: 8-10 hours per week

    Helpful Backgrounds: User behavior, international frameworks, & sentiment analysis

Track: Policy & Governance

ELFREDAH KEVIN-ALERECHI

Chief Innovation Officer, Journotech

  • Summary:
    Fellows will collaborate within NewsAssist AI and JournoTech to conduct adversarial audits of diaspora-centric AI systems, identifying vulnerabilities related to misinformation and cultural hallucination. Through hands-on stress-testing of these content synthesis tools, they will develop robust AI governance frameworks and truthfulness guardrails that protect information integrity for global users. The project culminates in actionable safety standards and ethical training modules designed to ensure reliable AI deployment across diverse socio-technical landscapes.

    Deliverables:

    A comprehensive technical blog post that will document their methodology for stress-testing the models and the resulting safety improvements.

    A set of adversarial prompt-test cases and a governance framework for NewsAssist AI.

    A structured Standard Operating Procedure(SOP) for users of your platforms. 

    Policy White Paper

    Number of Fellows: 3


  • Skills needed: Anyone interested in learning how they can solve this issue

    Time Commitment: 20- 25 hours per week

    Helpful Backgrounds: Diverse backgrounds, including ethics, social sciences, media studies, human-computer interaction, and public policy, who possess a foundational interest in AI literacy and a commitment to developing safer, more equitable information systems for the global diaspora.

Elfredah Kevin-Alerechi is an AI innovator, journalist, researcher, and ethical technology leader based in the UK. She is the founder of NewsAssist AI and Journotech, where she also serves as Chief Innovation Officer, leading responsible AI innovation, policy development, and ethical deployment. Her work focuses on AI safety, ethics, governance, and security, with a strong emphasis on building inclusive, human centered AI systems that serve communities rather than marginalize them.

Her background spans AI product development, AI policy drafting, governance design, and research mentorship

  • Through Journotech, she has trained over 300 professionals across 21 countries and built a network of nearly 1,000 practitioners, including educators, researchers, journalists, newsrooms, and civil society organisations. She designs and delivers training on responsible AI use, AI governance frameworks, secure AI deployment, and ethical innovation. She has also spoken at international conferences on AI security, responsible usage, and ethical AI implementation.

  • Summary

    Current frontier AI models are not fully understandable or predictable. Expectations of explainability and algorithmic transparency are misguided and sometimes naive. AI systems' unpredictability challenges current policy approaches. Deep learning models like GPT-4 function as "black boxes."

    Studies from Google DeepMind and OpenAI highlight gaps in mechanistic interpretability.

    Amidst all of this, policy analysts and policymakers have very little to no understanding of what are the fundamental gaps and bottlenecks to us gaining a better understanding of these systems. This project aims to bridge that gap.

    Goal: Communicate to policymakers the current state of our understanding of inner

    workings of AI models so they understand where gaps are. This would be useful for them to calibrate for future legislation.

    Deliverables:

    Blog Post + Longer policy paper if feasible

    Number of Fellows:  1- 2

    Skills needed: A willingness to read/understand technical work. Some basic qualitative knowledge of how LLMs/ neural nets work could be nice.

    Time commitment: 10-15 hours per week

    Helpful backgrounds: Basic understanding of neural nets, AI Policy

  • Summary

    Many countries are rushing to integrate AI systems into their public services for greater accessibility and scale. However, there are risks associated with procurement such as data storage, public profiling, facial recognition and infringement of privacy among others. This project would provide policy recommendations for guardrails on procurement of AI systems that governments could use.

    Deliverables:

    Blog Post + Longer policy paper if feasible

    Number of Fellows:  1- 2

    Requirements for Fellows

    Skills needed: Research analysis, Data privacy. Data Policy

    Time commitment: 10-15 hours per week

    Helpful backgrounds: Digital Public Infrastructure such as M-pesa or Aadhar (but not necessary)

Track: Security & Governance

Founder & CEO, Valix AI

Security AI Scientist and Ph.D. with 10+ years of experience building AI-driven offensive and defensive security solutions. I have 12+ publications in venues such as NeurIPS, HASP, and IEEE Access (140+ citations) and hold CVE recognition and multiple top cybersecurity certifications, including OSCP, PNPT, and CEH Practical. His work has been showcased at DEFCON, OWASP, BSides, and The Diana Initiative, with Hall of Fame honors from Nokia and Ford. He specializes in developing security AI algorithms, conducting penetration testing, and building intelligent threat detection systems. Currently, he founded and serves as CTO at Valix AI, where he leads the development of foundational AI security platforms that enable intelligent agents to detect, analyze, and neutralize both conventional and AI-powered threats.

  • Summary

    This project investigates the security risks of deploying large language models (LLMs) in Security Operations Center (SOC) workflows, with a focus on prompt injection and adversarial manipulation. We design a systematic red-teaming framework to evaluate how on-premises LLMs behave under adversarial inputs during malware analysis tasks and develop mitigation strategies to improve robustness and reliability.

    Deliverables:

    Blog post, research paper draft (IEEE S&P workshop/ USENIX), open-source toolkit, adversarial prompt dataset

    Number of Fellows: 2

  • Skills needed: Python, basic ML/LLM knowledge

    Time commitment: 10-15 hours per week

    Helpful backgrounds: Cybersecurity basics, NLP/LLMs

Serena Dokuaa

Policy Manager, Data & Society Institute

Serena Oduro is an AI policy expert and writer driven by her dedication to realizing an AI ecosystem that truly benefits us all. As Data & Society Research Institute’s policy manager, Serena Oduro leads the organization’s state-level policy engagement. Before her work on state policy, Serena led Data & Society’s engagement as a founding member within the US AI Safety Institute Consortium, where she advocated for a sociotechnical approach to AI safety. She is a HUMAN Residency Fellow, awarded by Ragdale, Lake Forest College, and The Mellon Foundation in support of her developing poetry collection which centers a Black feminist analysis and approach to AI.

  • Serena Oduro is an AI policy expert and writer driven by her dedication to realizing an AI ecosystem that truly benefits us all. As Data & Society Research Institute’s policy manager, Serena Oduro leads the organization’s state-level policy engagement. Before her work on state policy, Serena led Data & Society’s engagement as a founding member within the US AI Safety Institute Consortium, where she advocated for a sociotechnical approach to AI safety. She is a HUMAN Residency Fellow, awarded by Ragdale, Lake Forest College, and The Mellon Foundation in support of her developing poetry collection which centers a Black feminist analysis and approach to AI. Her work has appeared in academic journals and news media, including Politico, Internet Policy Review, Meatspace Press, and Patterns. Previously, Serena was a technology equity fellow at The Greenlining Institute, where she provided key support for Greenlining’s sponsorship of the Automated Decision Systems Accountability Act of 2021.

  • Summary:
    The field of AI safety has dominated political discourse for AI assessment and evaluation over the past two years – and its dominance has included shifting away from addressing on-the-ground harms to focus on existential risks. This project aims to assess the state of AI safety evaluation practices and whether these assessment and evaluation practices are able to address AI harms Black communities face. Through this assessment, this project aims to provide policy + research recommendations to align AI safety/broader AI evaluation practices with the interests of Black communities.

    Deliverables:

    Blogpost, Policy Brief, Conference paper

    Number of Fellows: 2

  • Skills needed: Technical knowledge to analyze AI evaluation practices, understand systems of domination, ability to analyze and synthesize policy and research

    Time Commitment: 5-7 hours per week

    Helpful Backgrounds: AI assessment, AI evaluation, AI’s impact on Black and marginalized communities

Track: Alignment

HeramB Podar

Fellow, Center for AI and Digital Policy

Heramb Podar is an AI policy fellow at the Center for AI and Digital Policy and has previously been a GovAI winter fellow and did the ERA and FIG fellowships. Currently, he works with Encode on their International Task Force to coordinate global activity among chapters. Heramb holds Bachelor's and Master's degrees in chemistry from IIT Roorkee

Track: Policy & Governance

Lawrence Krukrubo

AI Safety Researcher, University of Wolverhampton

Lawrence is a Researcher and Lecturer specializing in AI Safety, Causal Fairness, and Explainable AI (XAI). His work focuses on mitigating bias in Large Language Models and designing "Safe-by-Design" systems. In his recent paper, he introduced the LRR-TED framework, demonstrating that hybrid human-AI teams can achieve 94% accuracy by treating experts as "Exception Handlers." Lawrence is a Member of the London Initiative for Safe AI (LISA). At work, he mentors students on bridging the gap between theoretical fairness frameworks and robust, deployable code.

  • Summary

    This project applies mechanistic interpretability techniques to identify the specific model components (attention heads and MLP layers) responsible for sycophantic behavior. Fellows will use "Activation Patching" and "Causal Tracing" to map the flow of information in open-weights models (like Llama-3-8B) to understand how the model constructs an answer that agrees with a user's incorrect bias rather than objective truth.

    Deliverables:

    1. A "Circuit Map" diagram identifying the specific attention heads that correlate with sycophantic outputs.

    2. A `Jupyter Notebook` demonstrating "Activation Patching" on a sycophancy dataset using `TransformerLens`.

    3. A technical blog post explaining the internal mechanism of the behavior.

    Number of Fellows: 1-2

    Requirements for Fellows

    Skills needed: Strong Python, familiarity with PyTorch (tensor shapes, broadcasting), basic understanding of Transformer architecture (Residual stream, Key/Query/Value vectors).

    Time commitment: 10- 15 hours per week

    Helpful backgrounds:Reading "A Mathematical Framework for Transformer Circuits" (Elhage et al.) or similar mechanistic interpretability literature. More relevant literature will be provided for fellows.

  • Summary

    This project investigates whether training on simple, synthetic data can reduce sycophancy (the tendency of models to agree with users' incorrect beliefs) in open-source LLMs. Fellows will replicate the methodology from the 2023 DeepMind paper "Simple synthetic data reduces sycophancy in large language models”, by generating a synthetic dataset of "opinion vs. fact" claims and fine-tuning a model (e.g., Llama-3-8B) to prioritize factual accuracy over user agreement.

    Deliverables:

    1. A blog post visualising the reduction in sycophancy before and after fine-tuning.

    2. An open-source GitHub repository containing the synthetic dataset generation script and the fine-tuning notebook.

    3. (Optional) Evaluation on the "Sycophancy Eval" from Anthropic’s model-written evaluations.

    Number of Fellows: 2 (Pair Programming/Research)

    Requirements for Fellows

    Skills needed:Python, PyTorch, basic familiarity with Hugging Face (transformers, peft).

    Time commitment: 10- 15 hours per week

    Helpful backgrounds: Completion of a basic "Intro to LLMs" course or equivalent.

Track: Alignment