User:ByteBard/plagiarism

This project uses the corpus of all transcripts to check article text for too-close similarities, by running the text through Natural Language Toolkit (NLTK) processes in Python. Candidate sentences have stopwords removed, and the resulting n-grams from sentences in the transcripts cited in an article are compared pairwise to the article text. This sometimes results in over a million comparisons per article.
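The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: it relies only on the standard library, and the stopword list, the trigram size, the overlap score, the 0.5 threshold, and every function name are assumptions made for the example. In practice NLTK's stopwords corpus and `nltk.util.ngrams` could stand in for the hand-rolled helpers.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# NLTK's English stopwords corpus could be substituted.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was",
             "it", "that", "on", "for", "with", "as", "at", "by"}


def content_tokens(sentence):
    """Lowercase the sentence, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(sentence_a, sentence_b, n=3):
    """Share of n-grams in common, relative to the smaller n-gram set."""
    grams_a = ngrams(content_tokens(sentence_a), n)
    grams_b = ngrams(content_tokens(sentence_b), n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


def flag_pairs(article_sentences, transcript_sentences, n=3, threshold=0.5):
    """Compare every article sentence with every transcript sentence.

    The number of comparisons is len(article) * len(transcript), which is
    why a long article citing long transcripts gets expensive quickly.
    """
    hits = []
    for art in article_sentences:
        for tr in transcript_sentences:
            score = overlap(art, tr, n)
            if score >= threshold:
                hits.append((score, art, tr))
    return sorted(hits, reverse=True)


# Hypothetical sample sentences, invented for illustration only.
article = ["The wizard opened the ancient door with a silver key."]
transcript = [
    "He opened the ancient door with a silver key and laughed.",
    "We went to the market.",
]
hits = flag_pairs(article, transcript)
for score, art, tr in hits:
    print(f"{score:.2f}: {art!r} <-> {tr!r}")
```

Flagged pairs would still need a human read, since a high n-gram overlap can also come from a properly attributed quotation.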

Next categories to check

 * Category:Locations
 * Category:Non-player characters

Completed categories

 * Category:Player characters: fixed or tagged approximately 40 of 217 articles (around 18%).