User:ByteBard/plagiarism

The corpus of all transcripts is used to check article text for too-close similarities, by running the text through Natural Language Toolkit (nltk) processes in Python. Stopwords are removed from candidate sentences, and the resulting n-grams from sentences in the transcripts cited in an article are compared pairwise with the article text, which sometimes results in over a million comparisons per article. The most common n-grams across all transcripts, such as (nerdy, ass, voice, actors), are also excluded from analysis.
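A minimal sketch of this kind of comparison follows. It is not the script actually used: it is plain Python standing in for the nltk steps (nltk's English stopword corpus and `nltk.util.ngrams` could fill the same roles), and the stopword list, the function names, the n-gram length of 4 and the `top_k` cutoff are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); nltk ships a fuller English list.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it", "that",
             "was", "with", "on", "for", "as", "at", "by", "he", "she", "they"}


def tokenize(sentence):
    """Lowercase the sentence, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def common_ngrams(sentences, n, top_k):
    """Find the top_k most frequent n-grams in a corpus, to exclude later."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(tokenize(sentence), n))
    return {gram for gram, _ in counts.most_common(top_k)}


def find_matches(article_sentences, transcript_sentences, n=4, excluded=frozenset()):
    """Compare each article sentence with each transcript sentence pairwise.

    Returns (article sentence, transcript sentence, shared n-grams) for every
    pair that shares at least one n-gram not in the excluded set.
    """
    matches = []
    for a in article_sentences:
        a_grams = set(ngrams(tokenize(a), n)) - set(excluded)
        if not a_grams:
            continue
        for t in transcript_sentences:
            shared = a_grams & set(ngrams(tokenize(t), n))
            if shared:
                matches.append((a, t, sorted(shared)))
    return matches
```

Because every article sentence is checked against every transcript sentence, the number of comparisons is the product of the two sentence counts, which is why the total per article grows so quickly. Passing the result of `common_ngrams` as `excluded` keeps stock phrases that recur across all transcripts from being flagged.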

Next categories to check

 * Category:Non-player characters. Articles most in need of editing: Isharnai, Vokodo (in most other NPC articles the issues are very short and confined to the Appearance section)

Completed categories

 * Category:Player characters: fixed or tagged approximately 40/217 articles, around 18%.
 * Category:Locations: fixed or tagged approximately 110/425 articles, around 26%.