Customize Gen3 Search¶
The ability to search and parse through data (structured, unstructured, and semi-structured) in a data commons or mesh is a core feature of Gen3. The sections below describe which services must be running and which actions must be taken to enable search via the different front-end pages.
Exploration Page¶
The primary tool for exploring data within a Gen3 data commons is the Exploration Page, which offers faceted search of data across projects, for example, https://gen3.datacommons.io/explorer. This page can be accessed from the /explorer endpoint or the top navigation bar, by clicking on the “Exploration” icon. See the user guide Exploration Page section for details on how to interact with the Gen3 Exploration Page.
To have a functioning Exploration Page, you must have the Tube and Guppy microservices running, as well as an ETL mapping.
ETL¶
Three microservices, Peregrine, Guppy, and Tube, were developed to support both “graph” and “flat” abstracted data models. Peregrine translates GraphQL queries into Structured Query Language (SQL) statements that run against the data commons backend. Tube flattens the graph data into an ElasticSearch format, which can then be queried by Guppy. GraphQL queries can be executed through both Gen3-developed APIs and browser-based GUIs.
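To make the idea of flattening more concrete, the sketch below joins parent records with a count of their child records to produce one flat document per parent. It is only an illustration of the concept, not Tube's implementation; the node names (case, sample), the field names, and the `_samples_count` key are hypothetical.

```python
from collections import defaultdict


def flatten_cases(cases, samples):
    """Return one flat document per case, with a count of its child samples."""
    sample_counts = defaultdict(int)
    for sample in samples:
        # Each sample points at its parent case (hypothetical link field).
        sample_counts[sample["case_id"]] += 1
    return [
        {**case, "_samples_count": sample_counts[case["case_id"]]}
        for case in cases
    ]


# Hypothetical graph records: two cases, one of which has two samples.
cases = [
    {"case_id": "c1", "gender": "female"},
    {"case_id": "c2", "gender": "male"},
]
samples = [
    {"sample_id": "s1", "case_id": "c1"},
    {"sample_id": "s2", "case_id": "c1"},
]
flat_docs = flatten_cases(cases, samples)
```

Each resulting document carries everything needed to filter or display a case without walking the graph again, which is the property the flat model relies on.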
Construction of the “flat” model starts with the Tube microservice, which extracts, transforms, and loads (ETL) the underlying relational database into the flattened ElasticSearch format. ElasticSearch is used because it performs well on the type of metadata queries run by users. The ETL process is run when changes are made to the graph so that subsequent queries match between the “graph” and “flat” abstractions. Once the data is transformed, the Guppy microservice, like Peregrine, translates GraphQL statements, in this case into ElasticSearch Query DSL.
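The translation step can be pictured with a toy example. The function below converts a small, Guppy-style nested filter into an ElasticSearch Query DSL dictionary. It is a simplified sketch written for this page, not Guppy's actual code; the supported operators and the field names in the example are assumptions chosen for illustration.

```python
def to_es_query(filter_obj):
    """Translate a small nested filter into an ElasticSearch Query DSL dict."""
    # Each filter object is expected to hold exactly one operator.
    (op, body), = filter_obj.items()
    if op in ("AND", "OR"):
        clause = "must" if op == "AND" else "should"
        return {"bool": {clause: [to_es_query(f) for f in body]}}
    # Leaf operators hold exactly one field/value pair.
    (field, value), = body.items()
    if op == "=":
        return {"term": {field: value}}
    if op == "IN":
        return {"terms": {field: value}}
    if op in (">", ">=", "<", "<="):
        es_op = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}[op]
        return {"range": {field: {es_op: value}}}
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical filter: female subjects aged 30 or older.
es_query = to_es_query(
    {"AND": [{"=": {"gender": "female"}}, {">=": {"age": 30}}]}
)
```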
Data newly submitted to the Sheepdog service can be queried immediately via Peregrine, but not in the Data Explorer, which is powered by Guppy on the backend. During the ETL process, Tube populates the ElasticSearch indices, and Guppy makes those indices available to the Data Explorer for queries.
In practice, Guppy and Tube need to be configured with the ElasticSearch indices in the manifest.json (versions in the versions block and indices in the guppy block), and the etlMapping.yaml file has to list those indices. Additionally, aws-es-proxy needs to be included in the versions block of the manifest.json, unless a customized endpoint to access ElasticSearch can be provided.
Note that the configuration of etlMapping.yaml depends on what users want to display on the Explorer page and needs to match the Data Dictionary. The etlMapping.yaml can be validated against the Data Dictionary as described here.
After configuring etlMapping.yaml, indices need to be created, cleaned, and/or re-populated using the gen3 gitops configmaps command to read the new etlMapping.yaml, and the gen3 job run etl command to run the ETL. Note that new indices need to be added to both etlMapping.yaml and manifest.json.
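Because new indices must appear in both files, a quick consistency check before running the ETL can catch omissions. The sketch below works on already-parsed contents of the two files. It assumes that etlMapping.yaml holds a `mappings` list whose entries have a `name`, and that the guppy block of manifest.json holds an `indices` list whose entries have an `index`; verify both against your own files, as this layout is an assumption.

```python
def indices_missing_from_manifest(etl_mapping, manifest):
    """List index names defined in the ETL mapping but absent from the guppy block."""
    etl_indices = {m["name"] for m in etl_mapping.get("mappings", [])}
    guppy_indices = {
        entry["index"] for entry in manifest.get("guppy", {}).get("indices", [])
    }
    return sorted(etl_indices - guppy_indices)


# Hypothetical parsed contents of etlMapping.yaml and manifest.json.
etl_mapping = {"mappings": [{"name": "dev_case"}, {"name": "dev_file"}]}
manifest = {"guppy": {"indices": [{"index": "dev_case", "type": "case"}]}}

missing = indices_missing_from_manifest(etl_mapping, manifest)
```

In this example `dev_file` is reported, because it is defined in the mapping but not listed for Guppy.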
Configure Exploration Page¶
In the next step, the gitops.json needs to be configured to display and populate the indices of interest in the Data Explorer. The Exploration Page has one or more tabs at the top, each representing a flattened ElasticSearch document of structured metadata records across all the projects in the data commons. These records are displayed as a table towards the bottom center of the page.
Remember that only the properties occurring in the etlMapping.yaml can be brought into the gitops.json. The gitops.json can be tested locally and validated against the Data Dictionary and etlMapping.yaml file. Finally, after new indices are introduced, Guppy needs to be rolled using the command gen3 roll guppy. A comprehensive list of commands in cloud automation is given here (https://github.com/uc-cdis/cloud-automation/blob/master/doc/README.md).
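The rule that only properties from etlMapping.yaml can appear in gitops.json can be checked with a short script. The sketch below is a rough local check, not the official validation tooling. It assumes an explorer configuration with `filters.tabs[].fields` and `table.fields`, and a mapping entry with `props` and `aggregated_props` lists whose items have a `name`; these keys are assumptions about the file layout, and a real mapping may define properties in further sections that this sketch ignores.

```python
def fields_not_in_mapping(explorer_config, mapping):
    """Return explorer fields that the given ETL mapping entry does not define."""
    available = {p["name"] for p in mapping.get("props", [])}
    available |= {p["name"] for p in mapping.get("aggregated_props", [])}

    requested = set(explorer_config.get("table", {}).get("fields", []))
    for tab in explorer_config.get("filters", {}).get("tabs", []):
        requested.update(tab.get("fields", []))
    return sorted(requested - available)


# Hypothetical parsed fragments of gitops.json and etlMapping.yaml.
explorer_config = {
    "filters": {"tabs": [{"title": "Subject", "fields": ["gender", "race"]}]},
    "table": {"enabled": True, "fields": ["gender", "_samples_count", "age"]},
}
mapping = {
    "name": "dev_case",
    "props": [{"name": "gender"}, {"name": "race"}],
    "aggregated_props": [{"name": "_samples_count"}],
}

unknown_fields = fields_not_in_mapping(explorer_config, mapping)
```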
Query Page¶
The query page in the data portal provides users with an interactive interface for querying the structured data in a Gen3 system via GraphQL API requests. See the user guide Query Page section for details on how to interact with the Gen3 structured data API.
For the page to be functional and the data ready to query, you must have Sheepdog, Peregrine, and Guppy installed and configured.
The graph model depends on Sheepdog and Peregrine. The flat model depends on the ETL mapping, Tube, Guppy, and ElasticSearch.
For details on the ETL mapping please see the ETL section.
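Outside the browser, the same kind of GraphQL request can be prepared programmatically. The sketch below builds (but does not send) a POST request for a graph-model query using only the Python standard library. The `/api/v0/submission/graphql` path, the example query, and the bearer-token header are assumptions about a typical deployment; check the API documentation of your commons before relying on them.

```python
import json
from urllib.request import Request


def build_graphql_request(commons_url, query, access_token, variables=None):
    """Build a POST request carrying a GraphQL query; the caller decides when to send it."""
    payload = {"query": query, "variables": variables or {}}
    return Request(
        url=f"{commons_url.rstrip('/')}/api/v0/submission/graphql",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


# Hypothetical query and placeholder token.
request = build_graphql_request(
    "https://gen3.datacommons.io/",
    "{ project(first: 5) { project_id } }",
    "<ACCESS_TOKEN>",
)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here so the example runs without network access or credentials.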
Discovery Page¶
The Gen3 Discovery Page allows the visualization of metadata from within the metadata service (MDS) for data commons or the aggregated metadata services (AggMDS) for data meshes. This typically includes public metadata about a project to make it discoverable. Users should be able to search based on free text or filter based on tags.
For more information on using the Discovery Page please see the User Guide Discovery Page section.
Metadata Service¶
To view data in the Discovery Page, you must have a populated metadata service or, alternatively, an aggregated metadata service (AggMDS), which caches the metadata from two or more metadata sources to provide a unified view of the commons on the Discovery Page.
Instructions for the creation and modification of an MDS record can be found here as part of the Gen3 SDK. Every data commons is different as there is no standardization of MDS and therefore any example we provide may not apply to your particular system.
To view the MDS for the Gen3 Data Hub you can go here. The snippet below shows some summary metadata for the 1000 Genomes Project, which is part of the Gen3 Data Hub:
{
  "1000_Genomes_Project": {
    "_guid_type": "discovery_metadata",
    "gen3_discovery": {
      "link": "https://www.genome.gov/27528684/1000-genomes-project",
      "tags": [
        {
          "name": "Aligned Reads",
          "category": "Condition"
        }
      ],
      "authz": "/programs/OpenAccess/projects/1000_Genomes_Project",
      "source": "1000 Genomes Project",
      "commons": "Open Access Data Commons",
      "funding": "",
      "summary": "The 1000 Genomes Project is a collaboration among research groups in the US, UK, and China and Germany to produce an extensive catalog of human genetic variation that will support future medical research studies. It will extend the data from the International HapMap Project, which created a resource that has been used to find more than 100 regions of the genome that are associated with common human diseases such as coronary artery disease and diabetes. The goal of the 1000 Genomes Project is to provide a resource of almost all variants, including SNPs and structural variants, and their haplotype contexts. This resource will allow genome-wide association studies to focus on almost all variants that exist in regions found to be associated with disease. The genomes of over 1000 unidentified individuals from around the world will be sequenced using next generation sequencing technologies. The results of the study will be publicly accessible to researchers worldwide.",
      "study_id": "1000_Genomes_Project",
      "full_name": "1000 Genomes Project",
      "study_url": "https://www.genome.gov/27528684/1000-genomes-project",
      "__manifest": [
        {
          "md5sum": "e1e56e29efad64c002e5e9749f85350f",
          "file_name": "ALL.chrY.phase3_integrated_v2b.20130502.genotypes.vcf.gz",
          "file_size": 5656911,
          "object_id": "dg.OADC/60afa140-d2ab-4e32-bf73-40bf48787655",
          "commons_url": "gen3.datacommons.io/"
        },
Instructions for working with the API are found here.
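Once a response shaped like the snippet above has been retrieved, the fields most relevant to discovery can be pulled out with a few lines of code. The sketch below only follows the structure visible in that snippet (`gen3_discovery`, `tags`, `__manifest`); since MDS records are not standardized, the keys may differ in your commons, and the helper name is hypothetical.

```python
def summarize_discovery_records(mds_response):
    """Collect name, tags, and manifest file names for each record in an MDS response."""
    summaries = []
    for guid, record in mds_response.items():
        study = record.get("gen3_discovery", {})
        summaries.append(
            {
                "guid": guid,
                "full_name": study.get("full_name"),
                "tags": [tag["name"] for tag in study.get("tags", [])],
                "file_names": [f["file_name"] for f in study.get("__manifest", [])],
            }
        )
    return summaries


# Trimmed copy of the record shown above.
mds_response = {
    "1000_Genomes_Project": {
        "_guid_type": "discovery_metadata",
        "gen3_discovery": {
            "tags": [{"name": "Aligned Reads", "category": "Condition"}],
            "full_name": "1000 Genomes Project",
            "__manifest": [
                {"file_name": "ALL.chrY.phase3_integrated_v2b.20130502.genotypes.vcf.gz"}
            ],
        },
    }
}

summaries = summarize_discovery_records(mds_response)
```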
Aggregated Metadata Service¶
An Aggregated metadata service (aggMDS) caches the metadata from two or more metadata sources to provide a unified view of the commons on the discovery page.
This is performed via the AggMDS sync job, which copies metadata from multiple data commons into a single data store.
The AggMDS sync job depends on a JSON-based configuration file, which defines information about:
- Data source and data adapter information
- Normalizing data fields
- Adding optional individual overrides
An example JSON config file from the Biomedical Research Hub can be found here. A snippet of the adapter section for the Gen3 Data Hub (formerly known as the Open Access Data Commons) is shown below. For more information on constructing the JSON config file, please review the instructions here.
"Open Access Data Commons": {
  "mds_url": "https://gen3.datacommons.io/",
  "commons_url": "gen3.datacommons.io",
  "adapter": "gen3",
  "config": {
    "guid_type": "discovery_metadata",
    "study_field": "gen3_discovery"
  },
  "keep_original_fields": false,
  "field_mappings": {
    "authz": "path:authz",
    "tags": "path:tags",
    "_unique_id": "path:study_id",
    "study_id": "path:study_id",
    "study_description": "path:study_description",
    "full_name": "path:full_name",
    "short_name": "path:short_name",
    "commons": "Open Access Data Commons",
    "study_url": "path:study_url",
    "_subjects_count": {"path": "_subjects_count", "default": 0},
    "__manifest": "path:__manifest",
    "commons_url": "gen3.datacommons.io"
  }
The aggregated metadata service for the Biomedical Research Hub can be seen here.
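The field_mappings block in the adapter snippet above mixes three kinds of rules: `path:` strings that copy a field from the source record, objects that give a path together with a default, and plain strings used as constants. The sketch below shows one way such rules could be applied to a single study. It is a simplified illustration, not the sync job's implementation: paths are treated as plain top-level keys, and a missing field without a default becomes `None`, both of which are assumptions.

```python
def normalize_study(study, field_mappings):
    """Apply simplified field-mapping rules to one source study record."""
    normalized = {}
    for target, rule in field_mappings.items():
        if isinstance(rule, dict):
            # Object rule: look up the path, falling back to the default.
            normalized[target] = study.get(rule["path"], rule.get("default"))
        elif isinstance(rule, str) and rule.startswith("path:"):
            # String rule with a path prefix: copy the named source field.
            normalized[target] = study.get(rule[len("path:"):])
        else:
            # Anything else is treated as a constant value.
            normalized[target] = rule
    return normalized


# Rules taken from the snippet above, applied to a trimmed source record.
field_mappings = {
    "_unique_id": "path:study_id",
    "full_name": "path:full_name",
    "commons": "Open Access Data Commons",
    "_subjects_count": {"path": "_subjects_count", "default": 0},
}
study = {"study_id": "1000_Genomes_Project", "full_name": "1000 Genomes Project"}

normalized = normalize_study(study, field_mappings)
```

Here the source record has no `_subjects_count`, so the default of 0 is used, while `commons` is filled with the constant string.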
Configure Discovery Page¶
Customize and enable the Discovery Page by editing the table items, advanced search fields, tags, and study page fields (i.e., the page that opens upon clicking on a study).
To enable your Discovery Page you must modify your gitops.json file.
Configure your Discovery page by further modifying your gitops.json file.
Workspace token service¶
Setting up a functioning mesh where you can access files from individual commons also requires the workspace token service. You can see a demo of what is needed to link a commons to a mesh in the Data Mesh Webinar.