Configure Data-Related Services for Helm Deployment

Indexd

What does it do

Indexd is a core service of the commons. It indexes files within the commons so that Fence can use those records to download data.

Note: Indexd holds information about files in the commons. You can index any files you want, but make sure the buckets referenced in Indexd are also configured in Fence so that downloads work. Several tools are available for indexing. Data upload automatically creates Indexd records for uploaded files. To index files from external buckets, you can use indexd-utils or, if the commons has dirm set up, create a manifest and upload it to the /indexing endpoint of the commons. GUIDs are then created and/or assigned to the objects. To view a record, hit the (commons url)/index/(GUID) endpoint. To test that download works for a file, hit the (commons url)/user/data/download/(GUID) endpoint while making sure your user has the access required by the ACL/AuthZ assigned to the Indexd record.
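
The two endpoints in the note above can be exercised from a shell. The following is a minimal sketch, not an official Gen3 tool: the function names are hypothetical, the commons URL, GUID, and token are placeholders you supply, and sending the access token as a bearer Authorization header is an assumption about how your commons authenticates API requests.

```shell
# Print the URL that returns the Indexd record for a GUID.
indexd_record_url() {
  printf '%s/index/%s\n' "${1%/}" "$2"
}

# Print the Fence URL used to test downloading the file behind a GUID.
fence_download_url() {
  printf '%s/user/data/download/%s\n' "${1%/}" "$2"
}

# Fetch the record, then request a download as the user who owns the token.
# Arguments: <commons url> <GUID> <access token> (all placeholders).
check_guid() {
  curl --fail --silent --show-error "$(indexd_record_url "$1" "$2")" &&
    curl --fail --silent --show-error \
      -H "Authorization: Bearer $3" \
      "$(fence_download_url "$1" "$2")"
}
```

For example, `check_guid "https://commons.example.org" "<GUID>" "$TOKEN"` would print the record and then the download response. If the second request fails, the likely cause is that the user lacks the access required by the record's ACL/AuthZ.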

Default settings

If you deploy Helm without customizing any configuration, you can see the default Indexd values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for Indexd or read the Indexd values.yaml directly.

indexd:
  enabled: true

  image:
    repository:
    tag:

  # default prefix that gets added to all indexd records.
  defaultPrefix: "TEST/"

  # Secrets for fence and sheepdog to use to authenticate with indexd.
  # If left blank, will be autogenerated.
  secrets:
    userdb:
      fence:
      sheepdog:
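
To compare overrides like the ones above against the chart defaults, Helm's own commands can be wrapped as follows. This is a sketch under stated assumptions: the Gen3 charts are available from a Helm repository you have already added under the alias `gen3`, each service is published there as its own chart, and the umbrella chart is named `gen3/gen3`. The function names, release name, and values file name are hypothetical.

```shell
# Print a chart's default values, i.e. what you get with no customization.
# Argument: chart name within the assumed "gen3" repo alias, e.g. indexd.
show_defaults() {
  helm show values "gen3/$1"
}

# Install or upgrade a release using your overrides file.
# Arguments: <release name> <path to values file> (both placeholders).
apply_overrides() {
  helm upgrade --install "$1" gen3/gen3 --values "$2"
}
```

For example, `show_defaults indexd` prints the defaults to compare against, and `apply_overrides dev values.yaml` applies a file containing the indexd block shown above.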

Sower

What does it do

Sower is a job-dispatching service. Jobs are configured within the manifest, and Sower dispatches them.

Default settings

If you deploy Helm without customizing any configuration, you can see the default Sower values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for Sower or read the Sower values.yaml directly.

Sheepdog

What does it do

Sheepdog is a core service that handles data submission. Data is submitted to the commons using the dictionary as a schema, and that schema is reflected in the Sheepdog database.

Default settings

If you deploy Helm without customizing any configuration, you can see the default Sheepdog values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for Sheepdog or read the Sheepdog values.yaml directly.

Peregrine

What does it do

The Peregrine service is used to query data in Postgres. It works similarly to Guppy, but queries Postgres directly. It powers the charts on the front page of the commons as well as the /query endpoint of a commons.

Default settings

If you deploy Helm without customizing any configuration, you can see the default Peregrine values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for Peregrine or read the Peregrine values.yaml directly.

To configure Peregrine, you must have an entry in the versions block. It also requires a dictionary in the global block.

ETL (Tube)

What does it do

The Gen3 Tube ETL translates data from a graph data model, stored in a PostgreSQL database, into indexed documents in Elasticsearch (ES), which the front end can query efficiently. Its purpose is to create indexed documents that reduce the response time of data queries. It is configured through an etlMapping.yaml file, which describes which tables and fields to ETL into Elasticsearch.

Default settings

If you deploy Helm without customizing any configuration, you can see the default ETL values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for ETL or read the ETL values.yaml directly.

You can configure the ETL like this:

etl:
  enabled: true
  esEndpoint: ""
  etlMapping:
    <your etl mapping here>

To kick off the ETL job, run this command:

kubectl create job --from=cronjob/etl-cronjob etl

If you already have a job called etl, run the following. This will delete the old job and create a new instance.

kubectl delete job etl
kubectl create job --from=cronjob/etl-cronjob etl

For more information about our ETL, read more in our Tube repo.
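
The delete-and-recreate steps above can be combined into one helper that also waits for the job and prints its logs. This is a sketch rather than part of the Gen3 tooling: the function name and the default 30-minute wait are assumptions, and it relies only on standard kubectl subcommands.

```shell
# Re-run the ETL: remove any previous job named "etl", start a new one
# from the cronjob, wait for completion, then print its logs.
# Optional argument: how long to wait (default 30m, an arbitrary choice).
rerun_etl() {
  local wait_for="${1:-30m}"
  # --ignore-not-found makes the delete a no-op on the first run.
  kubectl delete job etl --ignore-not-found &&
    kubectl create job --from=cronjob/etl-cronjob etl &&
    kubectl wait --for=condition=complete job/etl --timeout="$wait_for" &&
    kubectl logs job/etl --all-containers=true
}
```

Calling `rerun_etl` with no argument uses the default wait; `rerun_etl 2h` allows more time for a large dataset. If the job fails rather than completes, the wait step times out, and `kubectl logs job/etl --all-containers=true` can then be run by hand to see why.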

Guppy

What does it do

Guppy is used to render the Explorer page. It uses Elasticsearch indices to render the page, so it depends on ETL.

Note: Guppy relies on indices being created to run; if there are no indices created, Guppy will fail to start up.

To create these indices, you can run ETL; however, a valid ETL mapping file must be created and data must be submitted to the commons before you can run it.
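
Because Guppy will not start without its indices, it can help to confirm they exist before deploying it. The sketch below assumes Elasticsearch is reachable at `ES_URL` (defaulting to `http://localhost:9200`, for example from inside a pod or through a port-forward). The function names are hypothetical; the requests use the standard Elasticsearch `_cat/indices` and index-exists (HEAD) APIs.

```shell
# List all Elasticsearch indices in tabular form.
list_es_indices() {
  curl --fail --silent --show-error "${ES_URL:-http://localhost:9200}/_cat/indices?v"
}

# Succeed only if every named index exists.
# A HEAD request on /<index> returns 200 when present and 404 when missing.
indices_exist() {
  local index
  for index in "$@"; do
    curl --fail --silent --head --output /dev/null \
      "${ES_URL:-http://localhost:9200}/${index}" || {
      echo "missing index: ${index}" >&2
      return 1
    }
  done
}
```

For example, `indices_exist dev_case dev_file && echo "indices present"` checks two index names before starting Guppy.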

Default settings

If you deploy Helm without customizing any configuration, you can see the default Guppy values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for Guppy or read the Guppy values.yaml directly.

Configuration around tier access must also be set: the tier access level in the global block defines how the Explorer page handles unauthorized files, and the tier access limit controls how far unauthorized users can filter down files. Lastly, a guppy block must be configured with the Elasticsearch indices that Guppy will use to render the Explorer page.

global:
  tierAccessLevel: "(libre|regular|private)"

guppy:
  # -- (int) Only relevant if tierAccessLevel is set to "regular".
  # The minimum number of files unauthorized users can filter down to
  tierAccessLimit: 1000

  # -- (list) Elasticsearch index configurations
  indices:
    - index: dev_case
      type: case
    - index: dev_file
      type: file

  # -- (string) The Elasticsearch configuration index
  configIndex: dev_case-array-config
  # -- (string) The field used for access control and authorization filters
  authFilterField: auth_resource_path
  # -- (bool) Whether or not to enable encryption for specified fields
  enableEncryptWhitelist: true
  # -- (string) A comma-separated list of fields to encrypt
  encryptWhitelist: test1


  # -- (string) Elasticsearch endpoint.
  # defaults to "elasticsearch:9200"
  esEndpoint: ""

You will also need a mapping file to map the fields you want to pull from Postgres into the Elasticsearch indices. There are too many fields to describe here, but an example BDC mapping file can be found here.

Finally, Guppy works closely with Portal to render the Explorer page. You will need to ensure a proper dataExplorer block (see this BDC example) is set up within the gitops.json file, referencing fields that have been pulled from Postgres into the Elasticsearch indices.

aws-es-proxy

What does it do

aws-es-proxy is a small web server application that sits between Gen3 services and the Amazon Elasticsearch Service.

Note:

* This service is only needed when you deploy Gen3 on AWS and use the AWS OpenSearch Service.
* This pod can also be used to make direct queries to Elasticsearch. To make a manual query, exec into the aws-es-proxy pod and run the following, filling in the endpoint you want to query:

kubectl exec -it <aws-es-proxy-pod-name-here> -- bash
curl http://localhost:9200/_cluster/health

Default settings

If you deploy Helm without customizing any configuration, you can see the default aws-es-proxy values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for aws-es-proxy or read the aws-es-proxy values.yaml directly.

Some important configuration items for aws-es-proxy in Helm:

# -- AWS user to use to connect to ES
aws-es-proxy:
  # Whether to deploy the service
  enabled: true

  # Which image and tag to pull
  image:
    repository:
    tag:

  # AWS secrets
  secrets:
    awsAccessKeyId: ""
    awsSecretAccessKey: ""

  # Elasticsearch endpoint in AWS
  esEndpoint: test.us-east-1.es.amazonaws.com

Metadata

What does it do

The Metadata Service (also called MDS) provides an API for retrieving JSON metadata of GUIDs. It is a flexible option for "semi-structured" data (key:value mappings).

The GUID (the key) can be any string that is unique within the instance. The value is the metadata associated with the GUID; it is a JSON blob whose structure is not enforced on the server side.
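
Retrieving the metadata for a GUID might look like the sketch below. The `/mds/metadata/<GUID>` path is an assumption about how the service is exposed on your commons and is not stated on this page; the function names are hypothetical, and the commons URL and GUID are placeholders.

```shell
# Print the URL for a GUID's metadata.
# The /mds/metadata path is an assumption; adjust it to match your deployment.
mds_metadata_url() {
  printf '%s/mds/metadata/%s\n' "${1%/}" "$2"
}

# Fetch the JSON blob stored for a GUID.
# Arguments: <commons url> <GUID> (both placeholders).
get_metadata() {
  curl --fail --silent --show-error "$(mds_metadata_url "$1" "$2")"
}
```

Since the server does not enforce a structure on the value, the caller has to know what shape of JSON to expect back.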

Default settings

If you deploy Helm without customizing any configuration, you can see the default Metadata values in the values.yaml here.

How to configure it

For the full set of configuration options, see the Helm README.md for Metadata or read the Metadata values.yaml directly.