An approach to detect semantically similar Python repositories using pre-trained language models.
This repository contains the notebooks and scripts for our approach to detecting semantically similar Python repositories using pre-trained language models.
Currently, our best-performing model is UniXCoder fine-tuned on the code search task with the AdvTest dataset. For evaluations of different language models on repository similarity comparison, please refer to this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb
More details on the implementation and applications of our approach can be found under the scripts folder.
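To make the idea of comparing repositories through embeddings more concrete, here is a minimal sketch that uses only the Python standard library. It is not the code from this repository: the helper names `mean_pool` and `cosine_similarity`, the toy 3-dimensional vectors, and the choice of mean pooling followed by cosine similarity are all assumptions made for illustration. In practice the vectors would come from a language model such as UniXCoder; see the scripts folder for the actual implementation.

```python
import math


def mean_pool(vectors):
    """Average equal-length vectors into one repository-level vector (assumed aggregation)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "function embeddings", made up for illustration only.
# A real model would produce much higher-dimensional vectors.
repo_a_functions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
repo_b_functions = [[1.0, 1.0, 0.0]]
repo_c_functions = [[0.0, 0.0, 1.0]]

# One vector per repository.
repo_a = mean_pool(repo_a_functions)
repo_b = mean_pool(repo_b_functions)
repo_c = mean_pool(repo_c_functions)

# Higher score means the two repositories point in a more similar direction.
sim_ab = cosine_similarity(repo_a, repo_b)
sim_ac = cosine_similarity(repo_a, repo_c)
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

In this toy setup repositories A and B point in the same direction and score higher than A and C, which share no component. Ranking candidate repositories by such a score is one plausible way to retrieve similar ones.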
RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Please feel free to give it a try!
RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py        # Convert PoolC dataset for clone detection fine-tuning script
│   ├── repo_topic.json  # Topic-Repos mapping
│   └── repo_topic.py    # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb                # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb     # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results by HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py       # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

Distributed under the MIT License. See LICENSE for more information.