Getting Started with the AMP PDRD Workbench

Overview of Verily Workbench

Verily Workbench is a cloud-based platform designed for collaborative biomedical research. It integrates data management, analysis tools, and governance controls within a unified interface.

Key features include:

Workspaces
- Workspaces are where researchers and teams connect, collaborate, and organize all the elements of their research, including data, documentation, code, and analysis.
Cloud apps
- Cloud apps are a configurable pool of cloud computing resources. They’re ideal for interactive analysis and data visualization, and can be finely tuned to suit analysis needs.
Data Collections
- Data collections represent multimodal datasets that you can publish to Verily Workbench’s data catalog, so users can reference these data in workspaces.
Access, browse, save, and share data
Verily Workbench provides a variety of features to browse and interact with data in your workspace, making accessing, browsing, saving, and sharing your data easy.

Using AMP PDRD Data

To support your explorations, AMP PDRD offers “Getting Started” workspaces within Workbench, providing sample notebooks and guidance tailored to AMP PDRD datasets. This companion document outlines the specific steps and resources necessary to use the AMP PDRD datasets in Verily Workbench.

Workspaces and Data Resources
The following workspaces and data catalogs are available to support your explorations of AMP PDRD data. These resources provide program-specific examples, documentation, and data references. This section, Using AMP PDRD Data, provides step-by-step guidance for setting up your own workspace, incorporating data from the program, and running analyses across your selected datasets.

AMP PDRD Resources
Getting Started Workspaces

Data Catalogs

Before You Begin

In order to run the following example of an AMP PDRD analysis, you will need to have access to the repository, with a signed Data Use Agreement (DUA). Registration for AMP PDRD must be done with the same Google account you are using to log into Verily Workbench. If you do not have access to the repository, the following link provides instructions for registration:

AMP PDRD Registration

In addition to having access to the repository, users must have access to billing configured in Verily workbench. In Verily Workbench, Pods are created that are associated with cloud billing accounts, and when Workspaces and Data Collections are created, they are associated with Pods. This allows any cloud service costs generated during the use of Verily Workbench to be charged to the correct billing account.

Many organizations that use Verily Workbench will have billing accounts and Pods set up. If you do not see any Pods available to use in Verily Workbench, contact your administration to determine whether a billing Pod is available for you to use and how to request access.

For more information about how to set up billing with pods in Verily workbench, please see the following link:

Setting up billing with pods in Verily Workbench

Migrating Workspaces for Legacy Terra Users

Some AMP PDRD researchers may already have active workspaces in Terra. If you are a legacy Terra user and would like to continue your work in Verily Workbench, a dedicated migration guide is available to help you move your existing workspaces, notebooks, and data. This process is only relevant for users who previously worked in Terra; new users do not need to complete these steps. Please refer to the AMP PDRD Terra to Verily Workbench Workspace Migration guide [here] for detailed instructions.

Create a New Workspace in Verily Workbench

Create your own new blank workspace and title it:

Assign a title and description, and select the pod you wish to associate this workspace with; click Next
Choose whether to limit access to groups
Select the bucket region for your workspace; click Next
Optionally add a description; click Create Workspace

Add AMP PDRD Resources to Workspace

Click the + Data from Catalog button

Select AMP PDRD Tier 1 from the list of available Collections; click Next
Select checkboxes for the Resource(s) needed for your analysis (gs_path_amp_pd_tier1_release); click Next
(If prompted) Accept the policy acknowledgement statement; click Next
Choose Add to an existing Folder; click Add to your workspace

Select Files for Analysis

From AMP PDRD Resource

Select the AMP PDRD resource and choose Browse from the info panel

Navigate to the clinical folder; click Demographics
Click Add as reference from the info panel
Accept defaults; click Add to resources
(Optionally) Choose additional resources
Click Close to return to the Resources page

Create an Output Bucket

From the Resources tab, click + New Resource

Select New Cloud Storage Bucket
Assign resource ID; click Create bucket

Create a Virtual Machine to Run Analysis

From the Apps tab, click New app instance, select JupyterLab; click Next

Choose Default configuration; click Next
(Optionally) Reduce Machine Type to n1-standard-1
(Optionally) Reduce Data disk size to 100 GB
Accept defaults; click Next
Review and confirm the app summary; click Create App
(Note) This step may take several minutes

Create a Notebook for Analysis

From the Apps tab, Launch the newly created virtual machine

In the JupyterLab Interface that opens, click the Folder Icon in the left menu bar
In the file browser panel, select the Output Bucket you created earlier.
In the bucket contents section, right click and select New Notebook
Select Python 3 (local) kernel type; click Next

Configure Notebook to Analyze Selected Resources

Import Required Libraries

# Use the os package to interact with the environment

import os    
# Bring in Pandas for Dataframe functionality

import pandas as pd

# Use StringIO for working with file contents

from io import StringIO

# Enable IPython to display matplotlib graphs.

import matplotlib.pyplot as plt
%matplotlib inline

Create Utility Routines

# Utility routines for reading files from Google Cloud Storage

def gcs_read_file(path):

"Return the contents of a file in GCS"

contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}

return '\n'.join(contents)

Read GCS Locations from Workspace Resources

# Get the GCP billing project ID from workbench environment variables

OUTPUT = !wb utility execute env | grep "GOOGLE_CLOUD_PROJECT" | cut -d"=" -f2

BILLING_PROJECT_ID = OUTPUT[0]


# Get the AMP PD Demographics file location from workbench environment variables

OUTPUT = !wb resource resolve --name=Demographics-csv

GS_AMP_PD_DATA_PATH = OUTPUT[0]
print(f'BILLING_PROJECT_ID = {BILLING_PROJECT_ID}')


print(f'GS_AMP_PD_DATA_PATH = {GS_AMP_PD_DATA_PATH}')

Read AMP PD Demographics Data

# Read CSV file from GCS into dataframe


amp_pd_demographics_df = pd.read_csv(StringIO(gcs_read_file(GS_AMP_PD_DATA_PATH)), engine='python', index_col=False)
amp_pd_demographics_df.head()

Create Age Distribution Bar Chart

# Configure plot



fig = plt.figure()
fig.set_figheight(6)
fig.set_figwidth(10)

# Create plot


plt.hist(amp_pd_demographics_df ['age_at_baseline'].dropna(), bins=20)
plt.xlabel('Age')


plt.ylabel('Number of Participants')


plt.show()

Create Sex Distribution Pie Chart

# Configure plot



counts_by_sex = amp_pd_demographics_df ['sex'].value_counts()
fig = plt.figure()


fig.set_figheight(6)


fig.set_figwidth(10)

# Create plot


plt.pie(counts_by_sex.values, labels=counts_by_sex.index, autopct="%1.1f%%")


plt.figure(figsize=(10, 6))


plt.show()