<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://danielsc4.it/feed.xml" rel="self" type="application/atom+xml"/><link href="https://danielsc4.it/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-03T10:25:53+00:00</updated><id>https://danielsc4.it/feed.xml</id><title type="html">blank</title><subtitle>A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design. </subtitle><entry><title type="html">In-depth Notes to understand more about Transformers</title><link href="https://danielsc4.it/blog/2023/transformers-notes/" rel="alternate" type="text/html" title="In-depth Notes to understand more about Transformers"/><published>2023-09-10T15:59:00+00:00</published><updated>2023-09-10T15:59:00+00:00</updated><id>https://danielsc4.it/blog/2023/transformers-notes</id><content type="html" xml:base="https://danielsc4.it/blog/2023/transformers-notes/"><![CDATA[<h1 id="chapter-1-transformers-of-course"><strong>Chapter 1: Transformers, of course</strong></h1> <p>Relying mainly on assorted online resources on transformers (<a href="https://transformer-circuits.pub/2021/framework/index.html">here</a> and <a href="https://deepgram.com/learn/capturing-attention-decoding-the-success-of-transformer-models-in-natural-language-processing">here</a>), I started to take notes on how they work. I will not cover the architecture as a whole; instead, I have tried to focus on the attention mechanism and the MLP, and on how these components act on the <a href="https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=DHp9vZ0h9lA9OCrzG2Y3rrzH">residual stream</a> of a transformer. 
I will therefore leave out, for now, topics such as embeddings, and will try to implement everything using <a href="https://en.wikipedia.org/wiki/Einstein_notation">Einstein notation</a> (<a href="https://einops.rocks/1-einops-basics/">here</a> is a Python tutorial) for the various matrix operations, which I knew existed but had never used in practice. Below are my handwritten notes on the subject.</p> <div style="width: 100%; max-width: 100%;"> <object data="https://www.danielsc4.it/assets/pdf/interpretability_study/Transformer notes.pdf" type="application/pdf" style="width: 100%; height: 500px;"> <embed src="https://www.danielsc4.it/assets/pdf/interpretability_study/Transformer notes.pdf"/> <p>This browser does not support PDFs. Please download the PDF to view it: <a href="https://www.danielsc4.it/assets/pdf/interpretability_study/Transformer notes.pdf">Download PDF</a>.</p> </object> </div>]]></content><author><name></name></author><category term="study"/><summary type="html"><![CDATA[Start of the truly in-depth study of the state-of-the-art for Mechanistic Interpretability]]></summary></entry><entry><title type="html">Let the Models Respond: Interpreting the Detoxification process of LMs</title><link href="https://danielsc4.it/blog/2023/interpreting-detox-LM/" rel="alternate" type="text/html" title="Let the Models Respond: Interpreting the Detoxification process of LMs"/><published>2023-07-12T15:59:00+00:00</published><updated>2023-07-12T15:59:00+00:00</updated><id>https://danielsc4.it/blog/2023/interpreting-detox-LM</id><content type="html" xml:base="https://danielsc4.it/blog/2023/interpreting-detox-LM/"><![CDATA[<h4 id="framing"><strong>Framing</strong></h4> <p>🚨 This blog post contains examples which are offensive in nature.</p> <p>This research project was carried out by <a href="https://www.danielsc4.it/">me 👋🏼</a> during my internship at the <a href="https://www.rug.nl/research/clcg/research/cl/?lang=en">Computational Linguistics Research 
Lab</a> at the <a href="https://www.rug.nl/">University of Groningen</a>. The work is still in progress, though nearing completion. The results and status reported here do not represent the final state of the research.</p> <p>The work is supervised by:</p> <ul> <li><a href="https://gsarti.com/">Gabriele Sarti</a>, PhD student @ University of Groningen</li> <li><a href="https://www.rug.nl/staff/m.nissim/">Malvina Nissim</a>, Full professor @ University of Groningen</li> <li><a href="https://en.unimib.it/elisabetta-fersini">Elisabetta Fersini</a>, Associate professor @ University of Milano - Bicocca</li> </ul> <hr/> <p><br/></p> <h2 id="-abstract"><strong>📜 Abstract</strong></h2> <p><strong>Language Models</strong> (LMs) are complex systems that are difficult to manage and deploy safely. For this reason, various techniques have been proposed over time with the aim of detoxifying and controlling the behaviour of the models after their training process. With this in mind, this research project aims to <strong>explore the potential of the model detoxification process</strong>. Established techniques such as <em>fine-tuning</em> and <em>Reinforcement Learning from Human Feedback</em> (RLHF) will be explored as ways of obtaining less toxic models. 
The work also aims to <strong>understand the detoxification process through an exploration of the interpretability of the models</strong> themselves, with the ultimate goal of <strong>not limiting their responses</strong> but offering a counter-narrative to potentially toxic prompts.</p> <p><br/></p> <h2 id="-introduction-and-state-of-the-art"><strong>🎨 Introduction and State Of The Art</strong></h2> <p>In recent years, LMs have grown in number of parameters and complexity and, consequently, in the results they achieve, which in some cases even exceed human performance on specific tasks <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">(Radford and Narasimhan, 2018)</a>. All this power, however, comes from the large amounts of data used in the pre-training phase of LMs, which learn primarily from corpora extracted from the Internet, forums and social media. The wide availability of text on these platforms makes it easy to extract many aspects of language that are useful for the learning process, but it also raises issues concerning the quality and content of the text itself. Indeed, it is not at all uncommon to find toxic, dangerous or privacy-compromising content, or more complex phenomena such as unintended bias hidden in the text <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">(Bender et al., 2021)</a>. All these aspects, which are difficult to control <em>a priori</em>, inevitably end up in the data that make up the LMs’ pre-training datasets, leading the models to generate language that cannot always be considered safe and harmless <a href="https://aclanthology.org/2020.findings-emnlp.301.pdf">(Gehman et al., 2020)</a>.</p> <p>For this reason, research efforts have been made to mitigate these phenomena as much as possible, both from the point of view of the data and from the point of view of the pre-trained LMs. 
Among the best-known techniques are fine-tuning, RLHF <a href="https://arxiv.org/abs/2204.05862">(Bai et al., 2022)</a> and model steering <a href="https://arxiv.org/abs/1912.02164">(Dathathri et al., 2020)</a>. These techniques turn out to be quite effective in controlling the toxicity of model input/output but, especially with particularly “tendentious” prompts, it remains possible to fool the models, which still end up generating potentially toxic or unsafe responses. In addition, the most common response pattern to prompts deemed dangerous is to halt the conversation in order to avoid toxic behaviour (e.g., “As an AI Language Model I cannot answer this question, …”).</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/detox_LMs/Example_chatGPT_toxic.png" sizes="95vw"/> <img src="/assets/img/detox_LMs/Example_chatGPT_toxic.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p><em>A toxic prompt on <a href="https://openai.com/blog/chatgpt">ChatGPT</a> that causes the conversation to be blocked</em></p> <h4 id="goals">Goals</h4> <p>With this research project, we therefore want to <strong>investigate the detoxification process</strong>, not only pushing the models to be safer but also exploring their potential by <strong>allowing them to respond even to potentially toxic prompts</strong>, offering a useful counter-narrative that moves the conversation forward and reasons with the user who wrote the original prompt.</p> <p>Naturally, it is imperative that such <strong>a process be as transparent as possible</strong>. For this reason, interpretability techniques will be employed to discover how the models change their generations. 
This will hopefully lead to discovering not only new features of the models but also which techniques are most effective for <strong>the safety and effectiveness of the LMs</strong> themselves.</p> <p><br/></p> <h2 id="-approach"><strong>🦾 Approach</strong></h2> <p>Of the various techniques previously listed, fine-tuning and reinforcement learning represent the state of the art and are also employed by industry for the most recent LMs. The main problem with these techniques, however, is the size of the models themselves. Over the last few years, the number of parameters in language models has kept growing, reaching and exceeding hundreds of billions in the case of the largest models (GPT-3/4, Bard, …). For this reason, even just fine-tuning or applying reinforcement learning is all but impossible on consumer hardware, or on the hardware otherwise accessible to much of the research community. Even just holding the weights of a 7B-parameter model in RAM or VRAM takes roughly 28GB in full precision (4 bytes per parameter), and training requires considerably more.</p> <h4 id="how-to-deal-with-large-lms">How to deal with Large LMs?</h4> <p>However, several techniques have emerged in the literature that aim to mitigate this type of issue. Indeed, it is possible to load models in Half Precision (16 bits instead of 32 bits) or, more recently, in 8 bits and 4 bits through quantization techniques <a href="https://arxiv.org/abs/2208.07339">(Dettmers et al., 2022)</a>. 
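</p> <p>As a rough back-of-the-envelope sketch (not taken from the original work), the memory needed just to hold the weights at each precision can be estimated as parameters × bits / 8. The helper name <code class="language-plaintext highlighter-rouge">weight_memory_gb</code> is hypothetical, and activations, gradients and optimizer states are deliberately ignored:</p>

```python
# Illustrative estimate only: memory for the weights alone, without
# activations, gradients or optimizer states.
# `weight_memory_gb` is a hypothetical helper, not part of any library.

def weight_memory_gb(n_params, bits_per_param):
    """Return the size of the weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS_7B = 7_000_000_000

# Full precision, half precision, 8-bit and 4-bit quantization.
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: {weight_memory_gb(N_PARAMS_7B, bits):5.1f} GB")
```

<p>These figures are lower bounds for the weights only; real usage is higher once activations, gradients and optimizer states are included.</p> <p>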
These techniques allow dynamic mapping of tensors from the original 32-bit Full Precision model to 16-bit tensors and, further, to 8-bit tensors, allowing a theoretical memory reduction of up to 4× (ideal case without training/inference data; in practice less, given that some parameters must be preserved at higher precision).</p> <p><em>A half-precision input matrix</em> \(X_{f16} \in \mathbb{R}^{s \times h}\) <em>can be quantized as follows:</em></p> \[X_{i8} = \biggl\lceil \frac{127 \cdot X_{f16}}{\max_{i,j}(|{X_{f16}}_{i,j}|)} \biggr\rfloor = \biggl\lceil \frac{127}{\|X_{f16}\|_{\infty}} \cdot X_{f16} \biggr\rfloor = \lceil {s_x}_{f16} X_{f16} \rfloor\] <p><em>Scaling a tensor to its 8-bit version forces it into the range</em> \([-127, +127]\) <em>by multiplying by</em> \({s_x}_{f16}\)<em>, which is 127 divided by the absolute maximum of the entire tensor. This is equivalent to dividing by the infinity norm and multiplying by 127. More details can be found in the original paper.</em></p> <p>This more compact representation, however, comes at a cost: the quantized matrices can no longer be modified effectively, in other words the weights cannot be trained directly.</p> <h4 id="how-to-efficiently-train-quantized-large-lms">How to efficiently train quantized Large LMs?</h4> <p>In order to fine-tune or otherwise modify the weights of the model, the weights must be in FP32 or FP16 representation. For this very reason, Low-Rank Adaptation (LoRA) <a href="https://arxiv.org/abs/2106.09685">(Hu et al., 2021)</a> creates adapters that work in parallel with the frozen weights of the model, circumventing the problem by offering trainable lower-rank matrices on top of the frozen model. The details of this operation are not covered here (for more information, see the paper cited above), but it is important to mention that this solution not only allows larger models to be trained but is also reported to partially mitigate the catastrophic forgetting problem. 
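</p> <p>Without going into the details of the paper, the core idea can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the PEFT implementation: the layer sizes, the rank and the <code class="language-plaintext highlighter-rouge">alpha</code> scaling value are arbitrary assumptions, and <code class="language-plaintext highlighter-rouge">lora_forward</code> is a hypothetical helper.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
d_in, d_out, rank, alpha = 64, 32, 4, 8

W_frozen = rng.normal(size=(d_in, d_out))      # pre-trained weight, never updated
A = rng.normal(scale=0.01, size=(d_in, rank))  # trainable down-projection
B = np.zeros((rank, d_out))                    # trainable up-projection, starts at zero

def lora_forward(x, W, A, B, alpha, rank):
    """Frozen path plus the scaled low-rank update: x W + (alpha / rank) x A B."""
    return x @ W + (alpha / rank) * (x @ A @ B)

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W_frozen, A, B, alpha, rank)

# Only A and B would receive gradients during training.
full_params = W_frozen.size
lora_params = A.size + B.size
print(full_params, lora_params)
```

<p>Because <code class="language-plaintext highlighter-rouge">B</code> starts at zero, the adapted layer initially behaves exactly like the frozen one, and only the two small matrices need to be kept in a trainable precision.</p> <p>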
The most convenient implementation, as it is integrated with 🤗 HuggingFace, is the one provided by 🤗 <a href="https://huggingface.co/blog/peft">PEFT</a>.</p> <h4 id="letting-lms-respond-with-contronarrative">Letting LMs Respond with a Counter-Narrative</h4> <p>As previously mentioned, the state of the art so far has focused on generic detoxification of LMs, making them less toxic either by having them avoid responding to compromising prompts or by imposing strong constraints on both the optimization process and the output of the model itself. Yet, based on what has been observed, the same models may be able to articulate more complex responses that capture even the most delicate aspects of the dialogue. We want to explore precisely this idea further, using fine-tuning to bring <strong>the model to generate a counter-narrative that is responsive to the given prompt</strong>.</p> <p>For this purpose, a dataset curated by experts <a href="https://aclanthology.org/2022.emnlp-main.549/">(Bonaldi et al., 2022)</a> is employed to provide accurate answers to prompts regarding topics and/or people particularly susceptible and vulnerable to online hate speech. The dataset consists mainly of dialogues (thus with multiple prompt-response pairs); we chose to use each pair together with all the turns that precede it in the dialogue, exploiting the potential of Chain-of-Thought <a href="https://arxiv.org/abs/2201.11903">(Wei et al., 2022)</a>.</p> <h4 id="fine-tuning-and-reinforcement-learning-from-automatic-feedback">Fine-tuning and Reinforcement Learning from (Automatic) Feedback</h4> <p>Fine-tuning and reinforcement learning of the models were carried out using the <a href="https://github.com/DanielSc4/RewardLM">🥞 RewardLM</a> library. The library integrates the models with 🤗 HuggingFace (the <em>de facto</em> standard for open-source model sharing and manipulation) and allows efficient training and monitoring of the results obtained. 
In the case of Reinforcement Learning (<a href="https://github.com/DanielSc4/RewardLM#-reinforcement-learning-with-automatic-feedback-rlaf">RLAF</a>), besides all the hyperparameters involved, it is possible to specify different details of the reward model, choosing any classifier or a set of several of them for greater efficiency.</p> <p>Particular attention should be paid to the Reinforcement Learning process where, following the diagram below, two initially identical models are kept in memory: one as a frozen reference and one that is updated in the direction imposed by the reward model.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/detox_LMs/rlhf.png" sizes="95vw"/> <img src="/assets/img/detox_LMs/rlhf.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p><em>Training scheme for the RLAF algorithm implemented in <a href="https://github.com/DanielSc4/RewardLM">🥞 RewardLM</a>. Image <a href="https://huggingface.co/blog/rlhf">source</a>.</em></p> <p>Specifically, we begin by having both models produce a response that follows a given generation configuration. The <a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence">Kullback-Leibler divergence</a> between the output distributions of the two models is then calculated.</p> \[D_{KL}(\pi_{PPO}(y \vert x) \,\Vert\, \pi_{base}(y \vert x))\] <p><em>with</em> \(\pi_{PPO}\) <em>and</em> \(\pi_{base}\) <em>denoting the policies (output distributions) of the tuned model and of the reference model, respectively.</em></p> <p>In parallel with this process, the reward \(r_{\theta} (y \vert x)\) given by the reward model is calculated and then combined with the KL penalty from the previous step. 
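</p> <p>A minimal sketch of this combination on toy distributions is given here. It assumes the convention common in RLHF implementations, where the KL term is subtracted from the reward with a coefficient \(\beta\); the value of <code class="language-plaintext highlighter-rouge">beta</code>, the toy probabilities and both function names are illustrative assumptions, and RewardLM may combine the two terms differently.</p>

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi != 0)

def penalized_reward(reward, p_tuned, p_ref, beta=0.2):
    """Reward-model score minus a KL penalty weighted by beta (assumed convention)."""
    return reward - beta * kl_divergence(p_tuned, p_ref)

# Toy next-token distributions over a three-token vocabulary (made-up values).
p_ref = [0.5, 0.3, 0.2]     # frozen reference model
p_tuned = [0.7, 0.2, 0.1]   # model being tuned

print(kl_divergence(p_tuned, p_ref))
print(penalized_reward(1.0, p_tuned, p_ref))
```

<p>The further the tuned model drifts from the reference, the larger the penalty, which discourages it from maximizing the reward with degenerate text.</p> <p>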
At this point it is the job of the <a href="https://openai.com/research/openai-baselines-ppo">PPO optimization</a> algorithm to update the weights of the tuned model based on the signal computed in the previous steps (the reward combined with the KL penalty).</p> <h5 id="toxicity-meter-an-easy-way-to-measure-lms-toxicity"><code class="language-plaintext highlighter-rouge">Toxicity Meter</code>: an easy way to measure LM toxicity</h5> <p>Also provided in the <a href="https://github.com/DanielSc4/RewardLM">🥞 RewardLM</a> library is a tool for measuring the average toxicity of models, ⚖️ <code class="language-plaintext highlighter-rouge">Toxicity Meter</code>. By default, the tool employs the <code class="language-plaintext highlighter-rouge">RealToxicityPrompts</code> dataset <a href="https://aclanthology.org/2020.findings-emnlp.301/">(Gehman et al., 2020)</a>. It was therefore possible to quantitatively measure not only the initial toxicity of the different models, but also their toxicity after fine-tuning and after RLAF. The toxicity itself can be measured either with any of the classifier configuration(s) used as reward model for RLAF, or with <a href="https://perspectiveapi.com/">Perspective API</a>, which offers finer granularity across the different types of toxicity.</p> <p><br/></p> <h2 id="-experiment-and-results"><strong>🔬 Experiment and results</strong></h2> <p>As mentioned, generative models were chosen to carry out the first experiments. The models selected from the HuggingFace Hub are <code class="language-plaintext highlighter-rouge">togethercomputer/RedPajama-INCITE-Chat-3B-v1</code> (<a href="https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1">🤗 Hub ref.</a>) and <code class="language-plaintext highlighter-rouge">tiiuae/falcon-7b-instruct</code> (<a href="https://huggingface.co/tiiuae/falcon-7b-instruct">🤗 Hub ref.</a>), with 3 and 7 billion parameters, respectively. 
Their chat/instructed versions were chosen in order to retain their conversational capabilities, similar to what has been observed with the more popular <a href="https://openai.com/blog/chatgpt">ChatGPT</a> from OpenAI and <a href="https://bard.google.com">Bard</a> from Google.</p> <h3 id="result">Result</h3> <p>Results are calculated with the toxicity level reported by ⚖️ <code class="language-plaintext highlighter-rouge">Toxicity Meter</code>. They are broken down into the two tables below: the first covers all prompts from RealToxicityPrompts, the second only those considered toxic by the reward model itself.</p> <table> <thead> <tr> <th>Toxicity, all prompts</th> <th style="text-align: center">PT (Baseline)</th> <th style="text-align: center">Fine-tuned</th> <th style="text-align: center">RLAF</th> </tr> </thead> <tbody> <tr> <td>RedPajama-INCITE-Chat-3B</td> <td style="text-align: center">0.130</td> <td style="text-align: center"><strong>0.092</strong></td> <td style="text-align: center">0.099</td> </tr> <tr> <td>falcon-7b-instruct</td> <td style="text-align: center">0.095</td> <td style="text-align: center"><strong>0.078</strong></td> <td style="text-align: center">0.082</td> </tr> </tbody> </table> <table> <thead> <tr> <th>Toxicity, only toxic prompts</th> <th style="text-align: center">PT (Baseline)</th> <th style="text-align: center">Fine-tuned</th> <th style="text-align: center">RLAF</th> </tr> </thead> <tbody> <tr> <td>RedPajama-INCITE-Chat-3B</td> <td style="text-align: center">0.217</td> <td style="text-align: center"><strong>0.129</strong></td> <td style="text-align: center">0.160</td> </tr> <tr> <td>falcon-7b-instruct</td> <td style="text-align: center">0.140</td> <td style="text-align: center"><strong>0.107</strong></td> <td style="text-align: center">0.125</td> </tr> </tbody> </table> <p><em>Toxicity level, lower is better. 
<code class="language-plaintext highlighter-rouge">PT</code> stands for the pre-trained model, i.e. the model after its pre-training and instruction fine-tuning phases (as described in the original paper of each model).</em></p> <p>The results obtained show that even <strong>without any limitation</strong> imposed on the models, a <strong>~30% reduction in toxicity is observed for the <code class="language-plaintext highlighter-rouge">RedPajama</code> fine-tuned model</strong> (~18% for the <code class="language-plaintext highlighter-rouge">falcon</code> model) and ~24% for the RLAF model (~14% for <code class="language-plaintext highlighter-rouge">falcon</code>). <strong>The results improve when considering only the most toxic prompts, with a ~40% reduction for the <code class="language-plaintext highlighter-rouge">RedPajama</code> fine-tuned model</strong> (~24% for <code class="language-plaintext highlighter-rouge">falcon</code>) and ~26% for the RLAF model (~11% for <code class="language-plaintext highlighter-rouge">falcon</code>).</p> <p>The following flow diagram shows how the toxic responses shifted towards content considered less toxic by <a href="https://perspectiveapi.com/">Perspective API</a>; the toxicity buckets are defined as low ( \(x &lt; 0.33\) ), medium ( \(0.33 \leq x \leq 0.66\) ) and high ( \(x &gt; 0.66\) ):</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/detox_LMs/sankeymatic.png" sizes="95vw"/> <img src="/assets/img/detox_LMs/sankeymatic.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p><em>Flow diagram highlighting how the responses of the <code class="language-plaintext highlighter-rouge">RedPajama</code> model shift between toxicity buckets. 
Starting from the center with the pre-trained model (<code class="language-plaintext highlighter-rouge">PT</code>), the responses move to the left for the fine-tuned model (<code class="language-plaintext highlighter-rouge">FT</code>) and to the right for the model trained with Reinforcement Learning (<code class="language-plaintext highlighter-rouge">RL</code>).</em></p> <p><br/></p> <h2 id="-current-status-and-new-research-questions"><strong>🚀 Current status and new research questions</strong></h2> <blockquote> <p>Editor Note: The following chapter contains content that is mainly derived from assumptions about future directions. None of this represents a constraint or formal expression of the work’s intentions.</p> </blockquote> <p>Given the pipeline for training through fine-tuning and reinforcement learning, it would be interesting to extend the research to larger models. Specifically, 4-bit quantization makes it possible to scale up the number of parameters and to observe the behaviour of larger and more accurate models with stronger “reasoning” capabilities.</p> <p>Possible future work with the models now available could be a sanity check based on the responses generated by the pre-trained model. The token-level log probability that the fine-tuned or RLAF model assigns to those responses is expected to be lower than the one assigned by the pre-trained model itself. 
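</p> <p>A toy sketch of such a check is shown here. The per-token probabilities are made-up numbers standing in for what each model would assign to the same response tokens; in practice they would come from scoring the pre-trained model’s response with both models. The helper name <code class="language-plaintext highlighter-rouge">mean_token_logprob</code> is hypothetical.</p>

```python
import math

def mean_token_logprob(token_probs):
    """Average log probability of the tokens of one response."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Probabilities that each model assigns to the SAME response tokens,
# i.e. the response generated by the pre-trained model (made-up values).
probs_pretrained = [0.60, 0.45, 0.70, 0.50]
probs_finetuned = [0.20, 0.10, 0.35, 0.15]

lp_pt = mean_token_logprob(probs_pretrained)
lp_ft = mean_token_logprob(probs_finetuned)

# A negative shift means the tuned model finds the original response less likely.
shift = lp_ft - lp_pt
print(lp_pt, lp_ft, shift)
```

<p>If the shift were close to zero or positive on toxic responses, the further training would not have changed the model in the intended direction.</p> <p>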
Moreover, from an interpretability point of view, measures such as entropy could be exploited to measure the <em>uncertainty</em> of the model when producing a response; if the per-token attribution entropy of the fine-tuned/RLAF model turns out to be much higher than that of the pre-trained one, the further-trained model is no longer relying on the prompt, suggesting a strange behaviour that deserves further analysis.</p> <p>Further analysis can be carried out on the RL models and on how their responses relate to the prompt, comparing their output with that of the fine-tuned model: following the reward model, the LM should avoid toxic responses, but does it still generate meaningful responses? Is it trying to avoid the penalty by generating random text? Semantic similarity could be employed here to identify the connection between the prompt and the response.</p>]]></content><author><name></name></author><category term="journal"/><summary type="html"><![CDATA[Journal to keep track of work during internship @RUG]]></summary></entry></feed>