Code to generate the PTMTorrent dataset
This repository contains the scripts to generate the PTMTorrent dataset. The
dataset contains sets of pre-trained machine learning models (PTM)
git repositories hosted on popular model hubs.
Supporting metadata from each model hub, as well as standardized metadata
specified by this JSON Schema, is also included.
The following model hubs are supported by our software: Hugging Face, Modelhub, ModelZoo, ONNX Model Zoo, and PyTorch Hub.
This project is dependent upon the following software:
Package dependencies are given in
`pyproject.toml` and handled by `poetry`.
To run this project, it must be packaged and installed first.
The package can either be installed from our GitHub Releases or built and installed From Source.
- Download the latest `.tar.gz` or `.whl` file from here.
- Install via `pip`: `python3.10 -m pip install ptm_torrent-*`
Instructions were written for Linux operating systems
- Clone the project locally: `git clone https://github.com/SoftwareSystemsLaboratory/PTM-Torrent`
- `cd` into the project: `cd PTM-Torrent`
- Create a `Python 3.10` virtual environment: `python3.10 -m venv env`
- Activate the virtual environment: `source env/bin/activate`
- Upgrade `pip`: `python -m pip install --upgrade pip`
- Install `poetry`: `python -m pip install poetry`
- Install `Python` dependencies through `poetry`: `python -m poetry install`
- Build with `poetry`: `python -m poetry build`
- Install with `pip`: `python -m pip install dist/ptm_torrent*.tar.gz`
After installing the package, this project can be run as individual scripts per model hub.
Each model hub's scripts are separated by directory in the
`ptm_torrent` directory. The scripts directory, main runner script, and
download size for each model hub are listed in the table
below:
| Model Hub | Scripts Directory | Script Name | Download Size |
|---|---|---|---|
| Hugging Face | ptm_torrent/huggingface | __main__.py | 61 TB |
| Modelhub | ptm_torrent/modelhub | __main__.py | 721 MB |
| ModelZoo | ptm_torrent/modelzoo | __main__.py | 151 GB |
| ONNX Model Zoo | ptm_torrent/onnxmodelzoo | __main__.py | 441 MB |
| PyTorch Hub | ptm_torrent/pytorchhub | __main__.py | 1.5 GB |
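Given the download sizes in the table above, it may be worth checking free disk space before starting a run. The following is a minimal sketch and not part of the project: the `DOWNLOAD_SIZE_BYTES` mapping and the `has_enough_space` helper are hypothetical names, and the conversion assumes decimal units (1 GB = 10^9 bytes), which this README does not specify.

```python
import shutil

# Download sizes copied from the table above, converted to bytes.
# Assumption: decimal units (1 GB = 10**9 bytes); the README does not say.
DOWNLOAD_SIZE_BYTES = {
    "huggingface": 61 * 10**12,   # 61 TB
    "modelhub": 721 * 10**6,      # 721 MB
    "modelzoo": 151 * 10**9,      # 151 GB
    "onnxmodelzoo": 441 * 10**6,  # 441 MB
    "pytorchhub": 15 * 10**8,     # 1.5 GB
}


def has_enough_space(hub: str, path: str = ".") -> bool:
    """Return True if the filesystem holding `path` has room for `hub`."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= DOWNLOAD_SIZE_BYTES[hub]


if __name__ == "__main__":
    # Report, per hub, whether the current directory's filesystem is large enough.
    for hub in DOWNLOAD_SIZE_BYTES:
        status = "ok" if has_enough_space(hub) else "not enough free space"
        print(f"{hub}: {status}")
```

The listed sizes are treated as lower bounds here; intermediate files may require additional space.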
There are other supporting scripts within each model hub's scripts directory.
These scripts are run in order by the model hub's __main__.py file. The order
in which to run these scripts (should the __main__.py file be insufficient) is
described in each model hub's README.md file within the scripts directory.
NOTE: Hugging Face's `__main__.py` can be parameterized to download a specific percentage of the model hub. By default, it downloads the first 0.1 (10%) of models, sorted by downloads in descending order.
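The selection rule described in the note can be sketched as follows. This is an illustration only, not the project's code: the `select_top_fraction` function, the `downloads` key, and the choice to round the count up are all assumptions.

```python
import math


def select_top_fraction(models, fraction=0.1):
    """Return the most-downloaded `fraction` of `models`.

    `models` is a list of dicts with a hypothetical "downloads" key.
    Rounding the count up is an assumption, not documented behavior.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Sort by download count, most downloaded first.
    ranked = sorted(models, key=lambda m: m["downloads"], reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


if __name__ == "__main__":
    sample = [{"id": f"model-{i}", "downloads": i * 100} for i in range(10)]
    for model in select_top_fraction(sample, fraction=0.5):
        print(model["id"], model["downloads"])
```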
To run any of the scripts, execute the following command pattern:
python 'Scripts Directory'/'Script Name'
For example, to run Hugging Face's scripts:
python ptm_torrent/huggingface/__main__.py
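The same command pattern can also be driven from Python, for instance to run several model hubs from one wrapper. The sketch below is a hypothetical helper, not something shipped with the project. It assumes it is run from the repository root and that the directory names match the table above.

```python
import subprocess
import sys
from pathlib import Path

# Directory names taken from the "Scripts Directory" column of the table.
HUBS = ("huggingface", "modelhub", "modelzoo", "onnxmodelzoo", "pytorchhub")


def build_command(hub: str) -> list[str]:
    """Build the `python <Scripts Directory>/<Script Name>` command for a hub."""
    if hub not in HUBS:
        raise ValueError(f"unknown model hub: {hub}")
    script = Path("ptm_torrent") / hub / "__main__.py"
    return [sys.executable, script.as_posix()]


def run_hub(hub: str) -> int:
    """Run the hub's script in the current directory and return its exit code."""
    return subprocess.run(build_command(hub), check=False).returncode
```

Calling `run_hub("onnxmodelzoo")` would start that hub's download in the current working directory.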
Each model hub script generates the following directory structure per model hub:
📦data
┗ 📂MODELHUB
┃ ┣ 📂html
┃ ┃ ┗ 📂metadata
┃ ┃ ┃ ┃ ┗ 📂models
┃ ┃ ┣ 📂json
┃ ┃ ┃ ┗ 📂metadata
┃ ┃ ┃ ┃ ┗ 📂models
┃ ┃ ┗ 📂repos
┃ ┃ ┃ ┗ 📂AUTHOR
┃ ┃ ┃ ┃ ┗ 📂MODEL
This directory structure is generated relative to where the script is run from. Example: if the script was run from the home directory (`~`), then the `data` directory would be stored at `~/data`.
Where:
- data/`MODELHUB` is the same name as the `Python` module directory that contained the script.
- data/MODELHUB/repos/`AUTHOR` is the author name of the repository that was cloned.
- data/MODELHUB/repos/AUTHOR/`MODEL` is the name of the repository that was cloned.
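To locate a cloned repository in this layout, the path can be composed from the three placeholders. The helper below is a hypothetical sketch (`repo_dir` is not a function from the project); it assumes the layout exactly as described above and defaults to the current working directory as the root.

```python
from pathlib import Path


def repo_dir(hub: str, author: str, model: str, root=None) -> Path:
    """Return data/MODELHUB/repos/AUTHOR/MODEL under `root`.

    `root` defaults to the current working directory, mirroring the fact
    that the scripts create `data` relative to where they are run from.
    """
    base = Path.cwd() if root is None else Path(root)
    return base / "data" / hub / "repos" / author / model


if __name__ == "__main__":
    # Placeholder author and model names, for illustration only.
    print(repo_dir("huggingface", "some-author", "some-model"))
```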
Model hub scripts do not overwrite the directory. In other words, it is a safe operation to run multiple model hub scripts from the same directory sequentially or concurrently.
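One common way to get this kind of non-destructive behavior is to create directories with `exist_ok=True`, so that repeated or concurrent runs leave existing content in place. The sketch below illustrates that pattern; it is not taken from the project's source. The `ensure_hub_dirs` helper is hypothetical and assumes `html`, `json`, and `repos` sit directly under the model hub directory.

```python
import tempfile
from pathlib import Path


def ensure_hub_dirs(root, hub: str) -> Path:
    """Create data/<hub>/{html,json,repos} under `root` without overwriting."""
    hub_dir = Path(root) / "data" / hub
    for sub in ("html", "json", "repos"):
        # exist_ok=True makes the call a no-op when the directory exists.
        (hub_dir / sub).mkdir(parents=True, exist_ok=True)
    return hub_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hub_dir = ensure_hub_dirs(tmp, "modelhub")
        marker = hub_dir / "json" / "keep.json"
        marker.write_text("{}")
        # A second run, and a run for another hub, leave the file untouched.
        ensure_hub_dirs(tmp, "modelhub")
        ensure_hub_dirs(tmp, "modelzoo")
        print(marker.exists())
```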
Specifics about the types of metadata files and content that are produced by the
scripts can be found in each model hub's script directory's README.md file.
An existing dataset is available on this Purdue University Globus share.
If you are unfamiliar with Globus, we prepared a guide in the globus-docs/ directory.
An example usage of the dataset is described within the
example directory.
This project has a DOI on Zenodo. Please visit our Zenodo page for the latest citation information.
References are sorted by alphabetical order and not how they appear in this document.
[1] “Git.” https://git-scm.com/ (accessed Jan. 25, 2023).
[2] “Git Large File Storage,” Git Large File Storage. https://git-lfs.com/ (accessed Jan. 25, 2023).
[3] “Hugging Face – The AI community building the future.,” Jan. 03, 2023. https://huggingface.co/ (accessed Jan. 25, 2023).
[4] “Model Zoo - Deep learning code and pretrained models for transfer learning, educational purposes, and more.” https://modelzoo.co/ (accessed Jan. 25, 2023).
[5] “Modelhub.” https://modelhub.ai/ (accessed Jan. 25, 2023).
[6] “MSR 2023 - Data and Tool Showcase Track - MSR 2023.” https://conf.researchr.org/track/msr-2023/msr-2023-data-showcase (accessed Jan. 25, 2023).
[7] “ONNX Model Zoo.” Open Neural Network Exchange, Jan. 25, 2023. Accessed: Jan. 25, 2023. [Online]. Available: https://github.com/onnx/models
[8] “pip documentation v22.3.1.” https://pip.pypa.io/en/stable/ (accessed Jan. 25, 2023).
[9] “Poetry - Python dependency management and packaging made easy.” https://python-poetry.org/ (accessed Jan. 25, 2023).
[10] “Python Release Python 3.10.9,” Python.org. https://www.python.org/downloads/release/python-3109/ (accessed Jan. 25, 2023).
[11] “PyTorch Hub.” https://www.pytorch.org/hub (accessed Jan. 25, 2023).
[12] W. Jiang et al., “SoftwareSystemsLaboratory/PTMTorrent.” Zenodo, Jan. 25, 2023. doi: 10.5281/zenodo.7570357.