RepoSim

An approach to detecting semantically similar Python repositories using pre-trained language models.

About

This repository contains the notebooks and scripts for our approach to detecting semantically similar Python repositories using pre-trained language models.

Currently, our best-performing model is UniXCoder fine-tuned on the code search task with the AdvTest dataset. For evaluations of different language models on repository similarity comparison, please refer to this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb

More details on the implementation and applications of our approach can be found in the scripts folder.
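To give a rough idea of what embedding-based repository comparison can look like, here is a minimal sketch. It is not the code in this repository: it assumes that each function in a repository has already been embedded by a language model, that a repository embedding is the mean of its function embeddings, and that repositories are compared by cosine similarity. The helper names (mean_vector, cosine_similarity, rank_similar) and the toy vectors are hypothetical and only serve the illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty list of equal-length vectors component-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query: str, repo_embeddings: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Rank all other repositories by cosine similarity to the query repository."""
    q = repo_embeddings[query]
    scores = [
        (name, cosine_similarity(q, emb))
        for name, emb in repo_embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy, made-up function embeddings standing in for real model output (assumption).
function_embeddings = {
    "repo_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "repo_b": [[0.9, 0.1, 0.0]],
    "repo_c": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]],
}

# Assumed aggregation: one embedding per repository, the mean of its function embeddings.
repo_embeddings = {name: mean_vector(vs) for name, vs in function_embeddings.items()}

ranking = rank_similar("repo_a", repo_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real setting the toy vectors would be replaced by embeddings produced by the chosen language model; the notebooks and scripts listed below contain the actual implementation.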

Applications

RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Please feel free to give it a try!

Directory Structure

RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py  # Convert PoolC dataset for clone detection fine-tuning script
│   ├── repo_topic.json # Topic-Repos mapping
│   └── repo_topic.py  # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb  # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results by HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py  # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

License

Distributed under the MIT License. See LICENSE for more information.
