User:ByteBard/plagiarism

The corpus of all transcripts is used to check for too-close similarities in article text, by running the text through Natural Language Toolkit (nltk) processes in Python. Stopwords are removed from candidate sentences, and the resulting n-grams from sentences in the transcripts cited in an article are compared pairwise to the article text, which sometimes results in over a million comparisons per article. (The most common n-grams across all transcripts are also excluded from analysis.)
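The comparison described above might look roughly like the following sketch. It is not the actual script: it uses plain Python stand-ins instead of nltk so that it runs without any downloaded data, and the stopword list, the regex sentence splitter, the trigram size, the Jaccard score, the 0.5 threshold, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real run would presumably use nltk's.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "that", "was", "on", "for", "with", "as", "at", "by"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def content_tokens(sentence):
    """Lowercase the sentence, tokenize it, and drop stopwords."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def most_common_ngrams(transcripts, n=3, top_k=100):
    """Collect the top_k most frequent n-grams across all transcripts."""
    counts = Counter()
    for text in transcripts:
        for sentence in split_sentences(text):
            counts.update(ngrams(content_tokens(sentence), n))
    return {gram for gram, _ in counts.most_common(top_k)}


def similar_pairs(article, transcript, n=3, threshold=0.5, common=frozenset()):
    """Compare every article sentence with every transcript sentence.

    Returns (article_sentence, transcript_sentence, score) tuples whose
    Jaccard overlap of n-grams meets the (assumed) threshold. N-grams in
    `common` are excluded before scoring.
    """
    transcript_grams = [
        (s, set(ngrams(content_tokens(s), n)) - common)
        for s in split_sentences(transcript)
    ]
    pairs = []
    for a_sent in split_sentences(article):
        a_grams = set(ngrams(content_tokens(a_sent), n)) - common
        if not a_grams:
            continue  # sentence too short, or only common n-grams
        for t_sent, t_grams in transcript_grams:
            if not t_grams:
                continue
            score = len(a_grams & t_grams) / len(a_grams | t_grams)
            if score >= threshold:
                pairs.append((a_sent, t_sent, score))
    return pairs
```

Because every article sentence is compared with every sentence in each cited transcript, the number of comparisons grows with the product of the two sentence counts, which is consistent with the counts in the millions mentioned above.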

Tagged categories from first pass

 * Category:Non-player characters: 290/968 articles, around 30%.
 * Category:Player characters: fixed or tagged approximately 40/217 articles, around 18%.
 * Category:Locations: fixed or tagged approximately 110/425 articles, around 26%. Articles most needing work: Sour Nest, Urukayxl, Nicodranas, Widogast's Nascent Nein-Sided Tower, Rumblecusp, King's Cage, Rosohna, Folding Halls of Halas
 * Category:Episodes: plan to fix or tag approximately 19/548 articles, around 3%.