About Our Model

The Goal: Build an accurate model capable of categorizing online comments considered "Toxic."

Defining Toxic

We define "Toxic" as any unreasonably rude or hateful content, comprising "threats, extreme obscenity, insults, and identity-based hate" (Jigsaw).


We present the following examples:

"why are we having all these people from shithole countries coming here" (Toxic)

"Ling is a vile christian" (Toxic)

"scum = chhhhinese" (Toxic)

"This area is Asian-predominant just to let you know in case you notice an ugly smell…" (Toxic)

"yeah just fuck off, mailbox" (Toxic)

"This site won't improve until the fucktards are banned." (Toxic)

"The conservative punditocracy was always anti Trump." (Neutral)

"Because you're everything to me DEMI LOVATO" (Neutral)

"Curse you, Boston Bruins." (Neutral)

"@thelostlolli *offers you chocolate... and gets out of the way*" (Neutral)

"I love those fucking pineapples" (Neutral)

"Fuck me - I just failed my math final." (Neutral)

"Crazy that people make so much just because they have the ability to speak computer" (Neutral)


Diverse datasets are key to gaining an accurate understanding of linguistic style, landscape, and common terms. The following datasets and attributions were incorporated, in part or in whole.

HateXplain: Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. "HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection". Accepted at AAAI 2021.

Hate Speech and Offensive Language: Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." ICWSM.

Hate speech from white supremacist forum: O. de Gibert, N. Perez, A. García-Pablos, M. Cuadros. "Hate Speech Dataset from a White Supremacist Forum." In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pp. 11-20, 2018.

MLMA Hate Speech: Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. (2019). "Multilingual and Multi-Aspect Hate Speech Analysis."

Ethos: Mollas, I., Chrysopoulou, Z., Karlos, S., and Tsoumakas, G. (2022). "ETHOS: a multi-label hate speech detection dataset." Complex & Intelligent Systems.

Wikipedia Detox: Wulczyn, Ellery; Thain, Nithum; Dixon, Lucas (2016). Wikipedia Detox. figshare. doi.org/10.6084/m9.figshare.4054689

Dynamically Generated Hate Speech Dataset: Bertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. (2021). "Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection."

Sentiment140: Go, A., Bhayani, R., and Huang, L., 2009. "Twitter sentiment classification using distant supervision." CS224N Project Report, Stanford, 1(2009), p. 12.

Surge AI Toxicity Dataset: By Surge AI, an NLP data labeling platform and workforce. Accessed: https://github.com/surge-ai/toxicity

The ISMA Project: Anshul Gupta, Welton Wang, and Timothy Trick. Islamophobia in Social Media Analysis. Accessed: https://theismaproject.com/

Reddit Norm Violations: Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. "The Internet's Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales." Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 32.

Hacker News Corpus: Hacker News, Y Combinator. Hacker News Corpus. Accessed: https://www.kaggle.com/hacker-news/hacker-news-corpus

NYTimes Comments: Aashita Kesarwani. New York Times Comments. Accessed: https://www.kaggle.com/aashita/nyt-comments

Conversation AI: Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman (2018). "Measuring and Mitigating Unintended Bias in Text Classification."

BiasCorp: Olawale Onabola, Zhuang Ma, Yang Xie, Benjamin Akera, Abdulrahman Ibraheem, Jia Xue, Dianbo Liu, and Yoshua Bengio. (2021). "HBert + BiasCorp – Fighting Racism on the Web."

Unhealthy Comments Corpus: Ilan Price, Jordan Gifford-Moore, Jory Fleming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. (2020). "Six Attributes of Unhealthy Conversation."

Facebook News Comments: John Bencina. Facebook News. Accessed: https://github.com/jbencina/facebook-news

Additional custom data sourced from Reddit, Twitter, forums, blogs, and Parler was used, providing 293,822 data entries in aggregate. The data was split into 93,131 entries labeled "hateful" and 200,691 entries labeled "normal."
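The split above is roughly one "hateful" entry for every two "normal" ones. One common way to compensate for such imbalance during training is inverse-frequency class weighting; the sketch below is illustrative only, as the exact balancing technique is not stated here.

```python
def class_weights(counts):
    """Inverse-frequency weights: rarer classes get proportionally
    larger weights so the loss is not dominated by the majority class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * c) for label, c in counts.items()}

# Counts from the aggregate dataset described above.
weights = class_weights({"hateful": 93_131, "normal": 200_691})
```

With these counts, "hateful" examples receive a weight of roughly 1.58 and "normal" examples roughly 0.73, so each class contributes equally to the loss in expectation.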


We tested several pre-trained transformer models: BERT, ALBERT, DistilBERT, XLNet, ERNIE, DeBERTa, ELECTRA, and RoBERTa, with RoBERTa ultimately performing best.

Training on TPUv3 cores with a batch size of 18 per core, we achieved a raw accuracy of 98% with k-fold validation, from which the most accurate model was selected for deployment.
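The k-fold selection step can be sketched in plain Python; `train_fn` and `eval_fn` are hypothetical stand-ins for the actual RoBERTa training and evaluation routines, which are not shown here.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (train, validation) folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, folds[held_out]))
    return splits

def select_best_model(data, k, train_fn, eval_fn):
    """Train one model per fold and keep the one with the best
    held-out accuracy, mirroring the selection step described above."""
    best_model, best_acc = None, -1.0
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        acc = eval_fn(model, [data[i] for i in val_idx])
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc
```

Selecting the single most accurate fold, as opposed to ensembling all k models, keeps inference cost to one forward pass per request.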

Several iterations provided incremental improvements in accuracy and bias mitigation. Our Changelog documents these published changes to our models.

Bias and Trust

Bias plays a documented role in machine learning - both in the pre-trained model itself, and the dataset. Measuring and understanding bias is notoriously challenging as machine learning models are largely black boxes.

In line with past work by Jigsaw's Conversation AI team, we implemented testing using sentence templates to better understand the biases present in the model towards certain groups.
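A sentence-template probe of this kind can be sketched in a few lines; `score_toxicity` is a hypothetical stand-in for the deployed model, and the terms and templates below are illustrative rather than the exact set used.

```python
IDENTITY_TERMS = ["gay", "muslim", "black", "chinese", "old"]
TEMPLATES = ["Being {} is cool.", "I am {}.", "My friend is {}."]

def bias_scores(score_toxicity, terms=IDENTITY_TERMS, templates=TEMPLATES):
    """Average a model's toxicity score for each identity term across
    otherwise-neutral templates; a high average signals identity bias."""
    results = {}
    for term in terms:
        scores = [score_toxicity(template.format(term)) for template in templates]
        results[term] = sum(scores) / len(scores)
    return results
```

Comparing each term's average against a neutral baseline term (e.g. "old") separates identity bias from any score contributed by the templates themselves.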

Our first model iteration exhibited several severe biases, including the following:

Gay: 0.98
LGBTQ: 0.96
Black: 0.86
Muslim: 0.84
Baseline: 0.01

We theorize that these biases result from the greater prevalence of derogatory statements towards these groups of people. To mitigate them, we complemented the existing dataset with targeted, balanced data and performed counterfactual data augmentation (CDA), swapping the identity targets of various "Toxic" and "Neutral" data points with others. After our initial mitigation attempt, we reduced bias to the following:
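A minimal sketch of the CDA step, assuming a simple word-level swap over a small illustrative term list (the real augmentation pipeline and term inventory are not specified in detail here):

```python
import re

# Illustrative identity-term list; the real inventory is larger.
IDENTITY_TERMS = ["muslim", "christian", "black", "white", "gay", "straight"]

def counterfactual_augment(texts):
    """For each text mentioning an identity term, emit copies with that
    term swapped for every other term, so each identity appears in the
    same "Toxic" and "Neutral" contexts under the same label."""
    augmented = []
    for text in texts:
        lowered = text.lower()
        for term in IDENTITY_TERMS:
            if re.search(r"\b" + term + r"\b", lowered):
                for other in IDENTITY_TERMS:
                    if other != term:
                        augmented.append(
                            re.sub(r"\b" + term + r"\b", other, lowered))
                break  # swap only the first matching term, for simplicity
    return augmented
```

Because each counterfactual copy keeps the original label, the model is pushed to learn that the surrounding context, not the identity term itself, determines toxicity.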

Middle Eastern: 0.41
Muslim: 0.39
Homosexual: 0.23
Baseline: 0.01

We found biases relatively neutral after performing mitigations on the dataset, except for the terms above in certain contexts. Raw values are listed, denoting the confidence with which an otherwise neutral string is flagged because of a term (scaled between 0 and 1).

For example, given the following sentence template, we present some effects of the bias (where black text is correctly labeled as neutral, and red text is incorrectly labeled as "Toxic"; confidence scores for the predicted class are shown):

"Being middle eastern is cool." (0.68)

"Being chinese is cool." (0.91)

"Being muslim is cool." (0.65)

"Being gay is cool." (0.81)

"Being lgbtq is cool." (0.82)

"Being homosexual is cool." (0.63)

"Being black is cool." (0.71)

"Being old is cool." (0.95)

Our final iteration, targeted at bringing bias to baseline levels, was largely successful in reducing identity bias to below significance.

We see this reflected in the following sentence templates:

"Being middle eastern is cool." (0.97)

"Being chinese is cool." (0.97)

"Being muslim is cool." (0.95)

"Being gay is cool." (0.99)

"Being lgbtq is cool." (0.99)

"Being homosexual is cool." (0.98)

"Being black is cool." (0.97)

"Being latino is cool." (0.99)

"Being old is cool." (0.97)

It is important to note that this study of bias is not comprehensive, and unforeseen biases may be present. Continued work will focus on further augmentation to target and measure remaining biases.

However, given the contextual use-case of our API, these biases do not present significant unintended consequences. The model is not perfect and serves as a first-line-of-defense for content moderators.

It should not make any conclusive, irreversible decisions.

Beyond Toxicity

Presence of toxicity isn't always actionable data. While certain content can be toxic, it may not necessarily warrant removal. For moderation purposes, we developed a new model, fine-tuning the generalized toxicity model on a significantly smaller subset of engineered data.

Across ~30 epochs, we fine-tuned on a manually curated dataset of 2,038 comments, collected through direct collaboration with moderators of various communities.

Thus, we created a model that flags content more conservatively, but with higher confidence. Unlike our toxicity model, this model focuses on personal attacks and explicitly hateful content.
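The conservative flagging behavior can be approximated as a confidence threshold over the fine-tuned model's scores; `score_fn` below is a stand-in for that model, and the 0.9 threshold is an assumed illustrative value, not one stated here.

```python
def moderate_batch(score_fn, comments, threshold=0.9):
    """Flag a comment for moderator review only when the model's
    confidence crosses a high threshold: fewer flags, higher precision."""
    flagged, passed = [], []
    for comment in comments:
        (flagged if score_fn(comment) >= threshold else passed).append(comment)
    return flagged, passed
```

Raising the threshold trades recall for precision, which matches the stated goal of a first-line-of-defense tool that moderators can trust.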

This model can be accessed via our /moderate API endpoint.