Plagiarism detection with Pysimilar in Python

Hi guys,

In this blog post I will show you how to detect textual plagiarism with Python using the Pysimilar library in just two lines of code.

This blog post is a continuation of a previously published article titled How to detect plagiarism in text using python, in which I showed how you can easily detect plagiarism between documents using cosine similarity with scikit-learn.

That article is one of my most popular of all time in terms of views and feedback on my personal blog and GitHub repository. As a reward to my readers, I decided to build a very light library, pysimilar, to make the process as simple as a print() statement, so that even absolute beginners can build sample projects on top of it.

Getting started with Pysimilar

To get started with pysimilar for comparing text documents, you first need to install it, either with pip or directly from GitHub.

Here is how to install pysimilar using pip:

$ pip install pysimilar

Here is how to install directly from GitHub:

$ git clone https://github.com/Kalebu/pysimilar
$ cd pysimilar
$ python setup.py install
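
If you want to confirm that the installation worked before moving on, a small check like the one below can help. This is only a sketch using the standard library; the helper name installed_version is my own, and it assumes the package is registered under the distribution name pysimilar.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "pysimilar" is assumed to be the distribution name used by pip.
found = installed_version("pysimilar")
print(f"pysimilar version: {found}" if found else "pysimilar is not installed")
```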

With Pysimilar you can either compare text documents as strings or specify the paths to the files containing the text.

Comparing strings directly

You can easily compare strings with pysimilar using the compare() function, as illustrated below:

>>> from pysimilar import compare
>>> compare('very light indeed', 'how fast is light')
0.17077611319011649
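
To give some intuition for what a similarity score like this means, here is a minimal pure-Python sketch of cosine similarity over raw word counts. It is not pysimilar's implementation (the helper names tokenize and cosine_similarity are my own), and because it uses plain counts rather than whatever term weighting the library applies, the number it prints will not match the score above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two strings."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Dot product over the words that appear in both texts.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction, so treat it as not similar to anything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


score = cosine_similarity("very light indeed", "how fast is light")
print(round(score, 4))
```

The two sentences share only the word "light", so the score is low; identical texts score 1 and texts with no words in common score 0.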

Comparing strings contained in files

To compare strings contained in files, you just need to explicitly set the isfile parameter to True, as illustrated below:

>>> compare('README.md', 'LICENSE', isfile=True)
0.25545580376557886

You can also compare documents with a particular extension in a given directory. For instance, let's say I want to compare all the documents with the .txt extension in a documents directory; here is what I will do.

The directory of documents used by the example below looks like this:

documents/
├── anomalie.zeta
├── hello.txt
├── hi.txt
└── welcome.txt

Here is how to compare files with a particular extension:

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = '.txt'
>>> comparison_result = pysimilar.compare_documents('documents')
>>> pprint(comparison_result)
[['welcome.txt vs hi.txt', 0.6053485081062917],
 ['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0]]
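
If you are curious how a directory comparison of this kind could be put together, here is a rough standard-library sketch. It is not how pysimilar does it: the function name compare_folder is hypothetical, the file contents are made up for the demo, and the scoring uses difflib's SequenceMatcher as a stand-in for the library's own similarity measure.

```python
import tempfile
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path


def compare_folder(directory, extensions=(".txt",)):
    """Score every pair of files in `directory` whose suffix is in `extensions`."""
    files = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in extensions
    )
    texts = {p.name: p.read_text(encoding="utf-8") for p in files}

    results = []
    for name_a, name_b in combinations(texts, 2):
        # Stand-in scorer: difflib's ratio, NOT pysimilar's similarity measure.
        score = SequenceMatcher(None, texts[name_a], texts[name_b]).ratio()
        results.append([f"{name_a} vs {name_b}", score])
    return results


# Demo on a throwaway directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "hello.txt").write_text("hello there", encoding="utf-8")
    Path(tmp, "hi.txt").write_text("hello there", encoding="utf-8")
    Path(tmp, "anomalie.zeta").write_text("ignored by default", encoding="utf-8")
    demo_results = compare_folder(tmp)

print(demo_results)
```

The .zeta file is skipped here because only the .txt suffix is in the default extensions tuple; passing extensions=(".txt", ".zeta") would include it.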

You can also sort the comparison results by score by setting the ascending parameter, as shown below:

>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['welcome.txt vs hi.txt', 0.6053485081062917]]

You can also set pysimilar to include files with multiple extensions:

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = ['.txt', '.zeta']
>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['anomalie.zeta vs hi.txt', 0.4968161174826459],
 ['welcome.txt vs hi.txt', 0.6292275146695526],
 ['welcome.txt vs anomalie.zeta', 0.7895651507603823]]
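
Once you have a list of scores like this, a natural next step for plagiarism detection is to flag the pairs that look too similar. The sketch below does that on the results shown above. The 0.6 cut-off and the helper name flag_suspicious are my own assumptions for illustration, not something pysimilar provides.

```python
# Scores copied from the comparison output above.
comparison_result = [
    ["welcome.txt vs hello.txt", 0.0],
    ["hi.txt vs hello.txt", 0.0],
    ["anomalie.zeta vs hi.txt", 0.4968161174826459],
    ["welcome.txt vs hi.txt", 0.6292275146695526],
    ["welcome.txt vs anomalie.zeta", 0.7895651507603823],
]

# Hypothetical cut-off; tune it for your own documents.
THRESHOLD = 0.6


def flag_suspicious(results, threshold=THRESHOLD):
    """Return the pairs scoring at or above `threshold`, highest score first."""
    flagged = [pair for pair in results if pair[1] >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for label, score in flag_suspicious(comparison_result):
    print(f"{label}: {score:.2f}")
```

With this cut-off only two pairs are reported; a lower threshold would also pull in the anomalie.zeta vs hi.txt pair.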

Well, that's all for this article. I'm excited to see what you will build with it.

Here is a link to the GitHub repository: https://github.com/Kalebu/pysimilar

Kalebu Jordan
I am a Mechatronics engineer and professional Python developer passionate about innovation, open source and technical writing. Here I will be writing mostly on Python, DevOps, AI/ML and security.
Dar es Salaam, Tanzania