Contribution guidelines

Welcome to the scpdata package, and thank you for your interest in contributing!

The scpdata data package is a repository of curated mass spectrometry-based single-cell proteomics (SCP) datasets. The purpose of scpdata is to provide users with streamlined access to high-quality SCP data, alleviating the need for time-consuming data wrangling. We currently provide data at the peptide-to-spectrum match (PSM) level, the peptide level and/or the protein level. The package also encompasses a large diversity of technologies, including DDA and DIA, label-free and multiplexed experiments from various laboratories such as the Slavov Lab, the Kelly Lab, and the Schoof Lab.

Contributions are very much welcome. We happily accept major contributions, such as adding a new dataset, as well as minor contributions, such as fixing typos or improving the current documentation.

To facilitate our collaboration, this vignette will guide you through the process of adding a new dataset to the package. We will first get you started with some basic guidelines on how to contribute using GitHub. We’ll proceed with a description of the data structure and the data pieces we expect. Next, we will provide an overview of the package’s folder structure to help you navigate through the project. Finally, we’ll explain the workflow you should follow to add your dataset to the repository.

Getting started with GitHub

  1. Fork the scpdata GitHub repository.
  2. Clone the forked repo locally using git:
git clone git@github.com:YOUR_USER_NAME/scpdata
  3. Adapt the cloned repo as desired. Do not forget to regularly `git commit` your changes.
  4. Once finished, send your improvements and/or new features as a pull request.

If you have any questions or face any hurdles, do not hesitate to open a new issue and we’ll be happy to provide additional guidance.

What do we expect?

QFeatures object

All datasets in scpdata are stored in a QFeatures object (see intro vignette). The object is created following the scp data framework, as described in this short demo.

Feature data

We refer to feature data as the data generated by MS data identification and quantification tools. Depending on the tool, features may represent PSMs, peptides and/or proteins. For instance, MaxQuant provides an evidence.txt file with PSM-level information, a peptides.txt file with peptide-level information and a proteinGroups.txt file with protein-level information. We encourage adding as many of the three feature levels as possible when contributing a dataset to scpdata.

For each feature, the tools provide quantification data as well as feature annotations. These two pieces of information should be separated in a SingleCellExperiment object. Feature annotations are stored in the rowData and the quantitative values are stored in the assay.
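As a minimal sketch of this separation, the example below splits a toy PSM table into an assay and a rowData. The table, its column names and its values are hypothetical stand-ins for the output of an identification and quantification tool; only the SingleCellExperiment constructor comes from the package of the same name.

```r
library(SingleCellExperiment)

# Toy PSM table (hypothetical): annotation columns plus one
# intensity column per sample.
psm_table <- data.frame(
    Sequence = c("PEPTIDEK", "SAMPLER", "ANOTHERK"),
    Protein = c("P1", "P1", "P2"),
    cell_1 = c(1200, 850, NA),
    cell_2 = c(1100, NA, 430),
    row.names = c("psm1", "psm2", "psm3")
)
quant_cols <- c("cell_1", "cell_2")

# Quantitative values go in the assay, feature annotations in the rowData.
psms <- SingleCellExperiment(
    assays = list(as.matrix(psm_table[, quant_cols])),
    rowData = DataFrame(psm_table[, c("Sequence", "Protein")])
)
psms
```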

Sample annotations

Sample annotations contain information about each sample (single cell) in the dataset. This information is generated by the experimenter and should contain biological descriptors, such as the cell line or the treatment applied, and technical descriptors, such as the day of acquisition, the acquisition batch, the LC batch, etc. The sample annotations are stored in the colData of the QFeatures object.
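For illustration, a sample annotation table could look like the sketch below. The descriptor names and values are hypothetical; what matters is that each row describes one sample and that the row names match the sample names used in the quantification data.

```r
# Hypothetical sample annotation table: one row per sample (single cell),
# with biological and technical descriptors as columns.
sample_annotations <- data.frame(
    CellLine = c("lineA", "lineB"),
    Treatment = c("control", "control"),
    AcquisitionDate = as.Date(c("2023-01-10", "2023-01-11")),
    LcBatch = c("LC1", "LC1"),
    row.names = c("cell_1", "cell_2")
)

# Row names must match the sample (column) names of the quantification data.
quant_colnames <- c("cell_1", "cell_2")
stopifnot(identical(rownames(sample_annotations), quant_colnames))
```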

If you want to contribute to scpdata with a dataset you generated yourself, we suggest you read the last section of initial recommendations for SCP experiments that provides a comprehensive discussion about descriptors of interest you should collect:

Gatto, Laurent, Ruedi Aebersold, Juergen Cox, Vadim Demichev, Jason Derks, Edward Emmott, Alexander M. Franks, et al. 2023. “Initial Recommendations for Performing, Benchmarking and Reporting Single-Cell Proteomics Experiments.” Nature Methods 20 (3): 375–86.

Experiment description

We also require the collection of experimental data that describes the dataset. This information is commonly retrieved from the publication associated with the dataset and provides a scientific context to the dataset. This information is used for building the dataset documentation.

Data source information

Finally, the ExperimentHub project, on which scpdata relies, requires a thorough description of the data sources for every dataset.

Folder structure

We here provide an overview of the key folders and files relevant when contributing a new dataset. The current files may provide a source of inspiration when preparing a new dataset.

inst/scripts/

The folder contains all R scripts used to generate the QFeatures objects from the source files, one script for each dataset. Each script is named as follows: make-data_ + DATASET_NAME + .R.
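As a small illustration of the naming rule, the script path for a dataset can be derived as follows; dataset_name is a placeholder for the actual name of your dataset.

```r
# Placeholder dataset name; replace it with the name of your dataset.
dataset_name <- "dataset_name"

# Build the script path following make-data_ + DATASET_NAME + .R
script_path <- file.path(
    "inst", "scripts", paste0("make-data_", dataset_name, ".R")
)
script_path
```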

Note the file called make-metadata.R. It generates a CSV table required by ExperimentHub where each line corresponds to a dataset and the columns contain the data source information. The table is stored in inst/extdata/metadata.csv, which should never be changed manually.

R/

The folder contains 3 R scripts, but new contributions should only consider data.R and can safely ignore the other two. The data.R script contains the documentation for each dataset, formatted using roxygen2 markup.

man/

The folder contains the compiled documentation manuals, one for each dataset. These were automatically generated by roxygen2 and should never be changed manually.

Workflow

In practice, contributing a new dataset involves 6 steps.

1. Collect data

If you want to contribute an already published dataset, identify the data sources for all feature data and the sample annotations. This is generally provided in the article, but you may need to request additional information from the authors.

If you want to contribute your own dataset, make sure that all feature data and the sample annotation table are available from a public repository (e.g. PRIDE, MassIVE or Zenodo).

2. Create the QFeatures object

Create a new R script, inst/scripts/make-data_DATASET_NAME.R, which contains all the code to convert the source data into the QFeatures object. Here are some tips and tricks for generating a high-quality dataset:

  • Sample annotations are often cluttered, spread over different tables or contained within sample names. Generating high-quality sample annotations may be time-consuming and frustrating. Don’t overlook this task: sample annotations are essential for rigorous and accurate downstream analysis.
  • Converting feature data tables and annotation tables into QFeatures or SingleCellExperiment objects can be streamlined using scp::readSCP() and scp::readSingleCellExperiment(), respectively.
  • Always start with the lowest feature level (e.g. PSMs). If available, you should add peptide and protein data using QFeatures::addAssay(). You should then add links between the assays. This is streamlined using QFeatures::addAssayLink().
  • Make sure to add data with as little processing as possible. For instance, MaxQuant provides peptide intensities, but also iBAQ and MaxLFQ normalised values. You should favour the former over the latter two, which you could add as supplementary assays.
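The tips above can be sketched on toy data as follows. This is an illustrative outline, not the script of any existing dataset: the matrices, the Sequence and CellType variables and all values are hypothetical, and the objects are built with the plain SingleCellExperiment() and QFeatures() constructors rather than the scp reading functions, so the example stays independent of any input file.

```r
library(QFeatures)
library(SingleCellExperiment)

# Toy quantification matrices for two feature levels (hypothetical values).
samples <- c("cell_1", "cell_2")
psm_quant <- matrix(
    c(1200, 850, 430, 1100, 900, 410), ncol = 2,
    dimnames = list(c("psm1", "psm2", "psm3"), samples)
)
pep_quant <- matrix(
    c(2050, 430, 2000, 410), ncol = 2,
    dimnames = list(c("PEPTIDEK", "ANOTHERK"), samples)
)

psms <- SingleCellExperiment(
    assays = list(psm_quant),
    rowData = DataFrame(
        Sequence = c("PEPTIDEK", "PEPTIDEK", "ANOTHERK"),
        row.names = rownames(psm_quant)
    )
)
peptides <- SingleCellExperiment(
    assays = list(pep_quant),
    rowData = DataFrame(
        Sequence = rownames(pep_quant),
        row.names = rownames(pep_quant)
    )
)

# Start from the lowest feature level and attach the sample annotations.
dataset <- QFeatures(
    list(psms = psms),
    colData = DataFrame(CellType = c("A", "B"), row.names = samples)
)

# Add the peptide level, then link peptides to PSMs through a shared variable.
dataset <- addAssay(dataset, peptides, name = "peptides")
dataset <- addAssayLink(
    dataset, from = "psms", to = "peptides",
    varFrom = "Sequence", varTo = "Sequence"
)
dataset
```

Protein-level data, when available, would be added the same way, with a second link from the peptide assay to the protein assay.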

3. Document the dataset

Add the data documentation and the data collection procedure in scpdata/R/data.R. Use roxygen2 markup language. The documentation is structured as follows, but you can best use the documentation of an existing dataset as a template:

  • Title: First author et al. Year (Journal): minimal description.
  • Description: short description of the dataset. What and how many cells were acquired? What technology? What is the research question?
  • Format: describe your QFeatures object. Describe each assay, namely which feature level it contains, the number of features and the number of cells/samples.
  • Data acquisition: summarise the data acquisition protocol, namely the sample isolation, sample preparation, liquid chromatography, mass spectrometry and raw data processing.
  • Data collection: summarise the steps you undertook to generate the QFeatures object, and where to find the script you created.
  • Source: link the public repository with the source data.
  • References: if published, refer to the original work that acquired the data.
  • Example: add an example to show how to retrieve the dataset. To avoid the associated overhead when testing the package, we recommend adding the example as follows:
##' \donttest{
##' dataset_name()
##' }
  • Keywords: add the line ##' @keywords datasets
  • "dataset_name": end the documentation with the name of your dataset, ensuring your data set is correctly exported.
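Put together, a documentation entry in R/data.R could follow the skeleton below. This is a hedged outline assembled from the items listed above, not a copy of an existing entry: every description is a placeholder, the assay names are hypothetical, and the section titles should be aligned with those used by the existing datasets.

```r
##' First author et al. Year (Journal): minimal description
##'
##' Short description of the dataset: what and how many cells were
##' acquired, with which technology, to address which research question.
##'
##' @format A QFeatures object with N assays (placeholder), each assay
##'     being a SingleCellExperiment object. Describe the feature level,
##'     the number of features and the number of samples of each assay.
##'
##' @section Data acquisition:
##'
##' Summary of the sample isolation, sample preparation, liquid
##' chromatography, mass spectrometry and raw data processing.
##'
##' @section Data collection:
##'
##' Summary of the steps that generated the QFeatures object, pointing
##' to inst/scripts/make-data_dataset_name.R.
##'
##' @source Link to the public repository with the source data.
##'
##' @references Reference to the original work, if published.
##'
##' @examples
##' \donttest{
##' dataset_name()
##' }
##'
##' @keywords datasets
##'
"dataset_name"
```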

4. Update metadata

Add the data source information in the inst/scripts/make-metadata.R script and run the complete script, which will update inst/extdata/metadata.csv. You can use a previous dataset as a template. All fields are mandatory: Title, Description, BiocVersion, Genome, SourceType, SourceUrl, SourceVersion, Species, TaxonomyId, Coordinate_1_based, DataProvider, Maintainer, RDataClass, DispatchClass, PublicationDate, NumberAssays, PreprocessingSoftware, LabelingProtocol, PsmsAvailable, PeptidesAvailable, ProteinsAvailable, ContainsSingleCells, Notes. See ?ExperimentHubData::makeExperimentHubMetadata for a comprehensive description of the fields.

Next, ensure that your updated metadata.csv file is valid by running ExperimentHubData::makeExperimentHubMetadata("scpdata").
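Before running that validation, a quick base R check can confirm that no mandatory column is missing from the table. The helper check_metadata_fields() below is a hypothetical convenience function, not part of scpdata or ExperimentHubData, and the example runs on a temporary toy file rather than the real metadata.csv.

```r
# Mandatory fields, as listed above.
mandatory_fields <- c(
    "Title", "Description", "BiocVersion", "Genome", "SourceType",
    "SourceUrl", "SourceVersion", "Species", "TaxonomyId",
    "Coordinate_1_based", "DataProvider", "Maintainer", "RDataClass",
    "DispatchClass", "PublicationDate", "NumberAssays",
    "PreprocessingSoftware", "LabelingProtocol", "PsmsAvailable",
    "PeptidesAvailable", "ProteinsAvailable", "ContainsSingleCells",
    "Notes"
)

# Hypothetical helper: return the mandatory fields absent from a CSV file.
check_metadata_fields <- function(path, fields) {
    metadata <- read.csv(path, check.names = FALSE)
    setdiff(fields, colnames(metadata))
}

# Toy metadata table with one field deliberately dropped.
toy <- as.data.frame(
    setNames(rep(list("placeholder"), length(mandatory_fields)),
             mandatory_fields)
)
toy$Notes <- NULL
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

missing_fields <- check_metadata_fields(tmp, mandatory_fields)
missing_fields
```

On the real package, the same helper would be called with "inst/extdata/metadata.csv" as path and should return an empty character vector.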

5. Create a pull request

Push any change you made to GitHub and open a pull request to notify us of your contribution. The pull request should include all the commits related to the dataset you want to contribute. Provide in the description where we can retrieve your QFeatures object, e.g. through Zenodo.

6. Almost done!

Once your pull request is submitted, we will take over and proceed with the following steps:

  1. We will review your changes to ensure they comply with the above guidelines. We may request changes.
  2. We will contact the Bioconductor team (hubs@bioconductor.org) to upload your Rda to Microsoft Azure, if needed, and to update the metadata.csv on their server. See the help page for more information.
  3. We will compile the documentation with roxygen2 and check that the package is still valid. We may request changes.
  4. We will update the NEWS.md file and bump the package version.
  5. If this is your first contribution, we will add your name to the package authors.