6 changes: 5 additions & 1 deletion docs/_includes/common-submission-info.rst
@@ -1,3 +1,7 @@
Large submissions
~~~~~~~~~~~~~~~~~
We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
We recommend using the web server for submissions of **10,000 or fewer variants or sequences**. You will experience degraded performance with larger submissions, and the absolute maximum per submission is **20,000**. For larger sets, we suggest one of the following:

* Split the set into multiple submissions of 10,000 or fewer variants each, submitting them sequentially (wait for each to complete before submitting the next).
* Run the standalone version on your local machine.
* Contact our group directly.
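
As a rough illustration of the first option, a variant list can be split into batches of 10,000 or fewer before submission. This is only a sketch: ``chunk_variants`` and ``MAX_PER_SUBMISSION`` are hypothetical names, the identifiers are placeholders, and the submission step itself is not shown.

```python
from typing import Iterator, List, Sequence

MAX_PER_SUBMISSION = 10_000  # recommended per-submission size from the text above


def chunk_variants(
    variants: Sequence[str], size: int = MAX_PER_SUBMISSION
) -> Iterator[List[str]]:
    """Yield consecutive batches containing at most `size` variants each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(variants), size):
        yield list(variants[start:start + size])


# Hypothetical input: 25,000 placeholder variant identifiers.
variants = [f"variant_{i}" for i in range(25_000)]
batches = list(chunk_variants(variants))
# Each batch would then be submitted on its own, waiting for one
# submission to complete before starting the next.
```

Batches preserve the input order, so results can be concatenated in the same order once all submissions have completed.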
38 changes: 30 additions & 8 deletions docs/functional-networks.rst
@@ -1,24 +1,31 @@
Tissue-specific Networks
===========================
In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular tissue or process context.
In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, weighted and integrated, focused on a particular tissue, cell, or process context.

It is important to consider gene relationships within a tissue context as the precise actions of genes are frequently dependent on their tissue context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging.
It is important to consider gene relationships within a tissue or cell type as the precise actions of genes are frequently dependent on their context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging.

Tissue-specific network construction is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks <https://www.nature.com/articles/ng.3259>`_. Nature Genetics.

Method
---------------------------
Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, and an edge between two genes is probabilistically weighted based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner.
Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene and each edge a functional relationship, with the edge between two genes weighted by a probability based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner.

One naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set.
For GIANT, one naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue or a specific biological process) using the appropriate gold standard for that biological context, in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consists of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set.

Parameter regularization is performed as described in `Steck and Jaakkola (2002) <https://proceedings.neurips.cc/paper_files/paper/2002/file/1819932ff5cf474f4f19e7c7024640c2-Paper.pdf>`_ using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness.
Parameter regularization is performed as described in Steck and Jaakkola (2002) using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness.

MAGE constructs networks in two stages.
In stage 1 (representation learning), each dataset is converted into a gene graph with edges derived from coexpression or protein/gene interactions. MAGE trains a masked graph autoencoder that hides a fraction of edges and learns to reconstruct them using information from neighboring genes in the graph. The decoder outputs a reconstruction probability for each gene pair, which serves as dataset-level evidence for functional relatedness.

In stage 2 (context-specific integration), MAGE learns a tissue- or cell-type-specific mapping from dataset-level evidence to a functional relationship probability. This supervised model is trained using a tissue- or cell-type-specific functional gold standard derived from Gene Ontology biological process relationships together with tissue expression patterns. The output is a tissue- or cell-type-specific functional network where each edge weight is the predicted probability that two genes participate in shared biological processes in that context.

Data integration
---------------------------
We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.
GIANT integrates 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage-specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.

MAGE integrates 7,463 genome-scale datasets representing more than 250,000 experiments across multiple data types. These include protein–protein interaction resources, transcription factor binding motif information, perturbation and microRNA target profiles, and large collections of gene expression studies. Each dataset is processed into a graph representation, and the full collection of dataset-level edge evidence is then integrated into 289 tissue and cell-type networks.

* Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO). Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation.
* Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO) for GIANT and refine.bio for MAGE. Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation.

* Protein-interaction: Interaction data are collected from BioGRID, IntAct, MINT, and MIPS.
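
The co-expression scoring described in the list above (Pearson correlation, Fisher's z transform, then mean subtraction and division by the standard deviation) can be sketched in plain Python. This is an illustrative sketch, not the pipeline's code: the toy expression values, the function names, the clamping constant ``eps``, and the use of the population standard deviation are all assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fisher_z(r: float, eps: float = 1e-6) -> float:
    """Fisher's z transform; r is clamped (assumption) to avoid infinities at +/-1."""
    r = max(min(r, 1.0 - eps), -1.0 + eps)
    return math.atanh(r)


def standardize(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


# Toy data set (made up): three genes measured across five conditions.
expression: Dict[str, List[float]] = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 1.0, 4.0, 3.0, 5.0],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}
pairs = list(combinations(sorted(expression), 2))
z_scores = [fisher_z(pearson(expression[a], expression[b])) for a, b in pairs]
standardized = standardize(z_scores)
```

Standardizing within each data set puts the scores from different data sets on a comparable scale before integration.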

@@ -29,6 +36,7 @@ We collected and integrated 987 genome-scale data sets encompassing approximatel

Evidence
---------------------------
For GIANT:
The "evidence" for an edge is measured as the contribution or "influence" of each dataset on the posterior classification probability. Each dataset contribution is calculated as the posterior probability of a functional relationship given only that dataset, minus the prior probability.

Contribution of dataset D to an edge functional relationship prediction (FR)::
@@ -37,11 +45,25 @@ Contribution of dataset D to an edge functional relationship prediction (FR)::

Note that the contributions will not sum to 1.0, as each contribution is measured separately. Generally, individual gene expression datasets will not contribute much to the posterior probability but cumulatively can make a significant contribution.
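
The per-dataset contribution for GIANT (posterior given only that dataset, minus the prior) can be sketched with Bayes' rule. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the prior and likelihood values are invented for illustration rather than taken from any trained classifier.

```python
def posterior_single_dataset(prior: float, lik_fr: float, lik_not_fr: float) -> float:
    """P(FR | D = d) from Bayes' rule, using one dataset only.

    prior      -- P(FR), prior probability of a functional relationship
    lik_fr     -- P(D = d | FR)
    lik_not_fr -- P(D = d | not FR)
    """
    numerator = lik_fr * prior
    denominator = numerator + lik_not_fr * (1.0 - prior)
    return numerator / denominator


def contribution(prior: float, lik_fr: float, lik_not_fr: float) -> float:
    """Contribution of one dataset: P(FR | D = d) - P(FR)."""
    return posterior_single_dataset(prior, lik_fr, lik_not_fr) - prior


# Invented numbers: an informative dataset and an uninformative one.
prior = 0.1
informative = contribution(prior, lik_fr=0.6, lik_not_fr=0.2)
uninformative = contribution(prior, lik_fr=0.3, lik_not_fr=0.3)
```

A dataset whose observed value is equally likely with or without a functional relationship leaves the posterior at the prior, so its contribution is zero. Because each contribution is computed in isolation, the values across datasets are not expected to sum to 1.0.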

For MAGE:
In each tissue- or cell-type-specific MAGE network, an edge between genes *u* and *v* is assigned a single score produced by the stage 2 (context-specific integration) gradient-boosting integration model (XGBoost). Each gene pair is represented by a 7,463-dimensional feature vector (one feature per dataset) derived from the stage 1 (representation learning) masked-edge reconstruction probabilities, and the boosting model maps these features to a predicted score between 0 and 1, where the score represents the probability of a functional relationship in that context.

The final network edge weight is the predicted score::

    edge_weight(u, v) ∈ [0, 1]

Higher values indicate a higher predicted probability that the two genes participate in a functional relationship in the selected tissue or cell type.


Example
---------------------------

IL1B in blood vessel
~~~~~~~~~~~~~~~~~~~~~~~~~
We examined and experimentally verified the tissue-specific molecular response of blood vessel cells to stimulation by IL-1β (IL1B), a pro-inflammatory cytokine. We anticipated that the genes most tightly connected to IL1B in the blood vessel network would be among those responding to IL-1β stimulation in blood vessel cells. We tested this hypothesis by profiling the gene expression of human aortic smooth muscle cells (HASMCs; the predominant cell type in blood vessels) stimulated with IL-1β.

Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10−23). The blood vessel network was the most accurate tissue network in predicting this experimental outcome; none of the other 143 tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells.
Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10−23). The blood vessel network was the most accurate GIANT tissue network in predicting this experimental outcome; none of the other 143 GIANT tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells.

Reproducing legacy results
--------------------------

Between February 2024 and May 2026, the table of GO Enriched functions on GIANT network pages used a larger set of genes and annotations, including some not relevant to human. To reproduce values shown during that window, see :doc:`Reproducing legacy results <reproducibility>`.
Binary file added docs/img/use-cases/functional-module-3.png
Binary file added docs/img/use-cases/functional-module-4.png
6 changes: 4 additions & 2 deletions docs/modules.rst
@@ -2,7 +2,7 @@
Functional Module Detection
===========================

HumanBase applies community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function.
HumanBase applies community detection to find cohesive gene clusters in a network from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function.

Functional module detection is described in: Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder <https://www.nature.com/articles/nn.4353>`_. Nature Neuroscience.

@@ -22,5 +22,7 @@ This approach has two key desirable characteristics:

We use a dynamic :code:`k = min(50, 0.2 * |V|)`, where :code:`|V|` is the number of query genes, to obtain the shared-nearest-neighbor tissue-specific network, and apply the Louvain algorithm to cluster this network into distinct modules. Krishnan et al. (2016) showed that module node membership and cluster sizes are robust by testing a range of values for k from 10 to 100. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate, for each pair of genes, a comembership score equal to the fraction of runs (out of 100) in which the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score is ≥ 0.9.
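
The dynamic choice of k and the comembership scoring can be sketched as follows. This is a hedged illustration, not the HumanBase implementation: the Louvain runs themselves are not performed here and are represented by hand-written cluster labels, only 10 toy runs are used instead of 100, the function names are hypothetical, and rounding ``0.2 * |V|`` down to an integer is an assumption.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def dynamic_k(n_query_genes: int) -> int:
    """k = min(50, 0.2 * |V|), rounded down (assumption) via integer division."""
    return min(50, n_query_genes // 5)


def comembership(runs: List[Dict[str, int]]) -> Dict[Tuple[str, str], float]:
    """Fraction of runs in which each gene pair shares a cluster label."""
    genes = sorted(runs[0])
    counts = {pair: 0 for pair in combinations(genes, 2)}
    for labels in runs:
        for a, b in counts:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: count / len(runs) for pair, count in counts.items()}


# Stand-in for repeated Louvain runs: gene -> cluster label per run.
runs = [{"A": 0, "B": 0, "C": 1} for _ in range(9)] + [{"A": 0, "B": 0, "C": 0}]
scores = comembership(runs)

# Keep only pairs that co-cluster in at least 90% of runs.
stable_pairs = [pair for pair, score in scores.items() if score >= 0.9]
```

In this toy input, genes A and B share a cluster in every run, while C joins them in only one run of ten, so only the A and B pair passes the 0.9 threshold.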

Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms. Representative processes and pathways enriched within each cluster are presented alongside of the cluster with their resulting Q value. The Q value of each term associated to the modules is calculated using one-sided Fisher's exact tests and Benjamini–Hochberg corrections to correct for multiple tests.
Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms. GIANT networks use annotations from UniProt-GOA (experimental evidence codes), while MAGE networks use annotations from NCBI gene2go (all evidence codes, including computationally inferred). Enrichment is also performed against Disease Ontology and MSigDB gene sets. Representative processes and pathways enriched within each cluster are presented alongside the cluster with their resulting Q value. The Q value of each term associated with the modules is calculated using one-sided Fisher's exact tests and Benjamini-Hochberg corrections to correct for multiple tests.
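
The Q value calculation can be sketched with a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) followed by a Benjamini-Hochberg adjustment. This is a minimal standard-library sketch, not the production code: the function names, the background size, and the example counts are all hypothetical.

```python
from math import comb
from typing import List


def fisher_one_sided(k: int, module_size: int, term_size: int, background: int) -> float:
    """P(X >= k): chance of seeing at least k term genes in the module.

    Equivalent to a one-sided (enrichment) Fisher's exact test on the
    2x2 table of module membership versus term annotation.
    """
    total = comb(background, module_size)
    upper = min(module_size, term_size)
    tail = sum(
        comb(term_size, i) * comb(background - term_size, module_size - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical module of 20 genes tested against three terms,
# with an assumed background of 20,000 genes.
# Each tuple: (module genes annotated to the term, term size).
tests = [(8, 100), (2, 500), (1, 50)]
pvalues = [fisher_one_sided(k, 20, size, 20_000) for k, size in tests]
qvalues = benjamini_hochberg(pvalues)
```

The test is one-sided because only over-representation of a term within a module is of interest; depletion is not reported.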

To reproduce the GO term-enrichment values shown by FMD in HumanBase between February 2024 and May 2026 (e.g. for results cited in a publication or saved link), see :doc:`Reproducing legacy results <reproducibility>`.

2 changes: 1 addition & 1 deletion docs/netwas.rst
@@ -1,7 +1,7 @@
=======================================
NetWAS - Network-wide Association Study
=======================================
Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. We developed an approach, termed network-wide association study (NetWAS). In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. This reprioritization method is driven by discovery and does not depend on prior disease knowledge. NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from distinct GWAS to identify disease-associated genes, and tissue-specific NetWAS better identifies genes associated with hypertension than either GWAS or tissue-naive NetWAS.
Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. We developed an approach, termed network-wide association study (NetWAS). In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. This reprioritization method is driven by discovery and does not depend on prior disease knowledge. NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from distinct GWAS to identify disease-associated genes, and tissue-specific NetWAS better identifies genes associated with hypertension than either GWAS or tissue-naive NetWAS. NetWAS supports both GIANT and MAGE tissue networks.

The NetWAS method is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks <https://www.nature.com/articles/ng.3259>`_. Nature Genetics.
