Accessing the NCCID data
The data is stored in Amazon Web Services (AWS) S3. Once your organisation has been granted access, NHSX will send AWS credentials by encrypted email. These credentials grant access to the data.
We recommend accessing the data using the Amazon Web Services Command Line Interface (AWS CLI), or client libraries that interact with S3 such as Boto3. Some examples are provided below.
Warehouse structure
The warehouse data is stored in the nccid-data-warehouse-prod S3 bucket, and access is granted to different prefixes.
The warehouse training data is organised into prefixes within the training prefix, based on image type (or "modality"), patient ID, and date, as follows:
# CT images & metadata
training/ct/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.dcm
training/ct-metadata/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.json
# MRI images & metadata
training/mri/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.dcm
training/mri-metadata/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.json
# X-ray images & metadata
training/xray/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.dcm
training/xray-metadata/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.json
# Patient clinical data
training/data/PATIENT_ID/status_DATE.json
training/data/PATIENT_ID/data_DATE.json
- The ct, mri and xray folders hold the DICOM images of the given kind.
- The de-identified PATIENT_ID value is equivalent to the (0010,0020) DICOM tag from the images and to the Pseudonym field from the status_DATE.json and data_DATE.json clinical data files. STUDY_UID and SERIES_UID are equivalent to the (0020,000D) and (0020,000E) DICOM tags in the given images.
- The ...-metadata folders hold the DICOM tags exported as JSON from the corresponding image file IMAGE_UUID.dcm into IMAGE_UUID.json, to enable quick parsing without the need to download the given image.
- The data folder holds the patient medical data: status_DATE.json files for negative results, and one or more data_DATE.json files for positive results. DATE is formatted as YYYY-MM-DD, for example 2020-04-21.
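The key layout above can also be unpacked programmatically. The following is a minimal sketch, assuming image keys follow the layout exactly as listed. The `ImageKey` type and the `parse_image_key` and `metadata_key_for` helpers are hypothetical names introduced here for illustration; they are not part of any NCCID tooling.

```python
from typing import NamedTuple


class ImageKey(NamedTuple):
    """Components of an image key (hypothetical helper type)."""
    modality: str
    patient_id: str
    study_uid: str
    series_uid: str
    image_uuid: str


def parse_image_key(key: str) -> ImageKey:
    """Split training/MODALITY/PATIENT_ID/STUDY_UID/SERIES_UID/IMAGE_UUID.dcm."""
    parts = key.split("/")
    if len(parts) != 6 or parts[0] != "training" or not parts[5].endswith(".dcm"):
        raise ValueError(f"Not an image key: {key}")
    _, modality, patient_id, study_uid, series_uid, filename = parts
    # Strip the ".dcm" extension to recover IMAGE_UUID
    return ImageKey(modality, patient_id, study_uid, series_uid, filename[: -len(".dcm")])


def metadata_key_for(key: str) -> str:
    """Return the key of the JSON metadata file matching an image key."""
    k = parse_image_key(key)
    return (
        f"training/{k.modality}-metadata/{k.patient_id}/"
        f"{k.study_uid}/{k.series_uid}/{k.image_uuid}.json"
    )


example = "training/ct/Covid1/1.2.3/A.B.C/x.y.z.dcm"
print(parse_image_key(example))
print(metadata_key_for(example))
```

Deriving the metadata key this way makes it possible to fetch the small JSON file first and decide afterwards whether the full image is needed.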
Using the AWS Command Line Interface
The simplest way to retrieve the imaging data is using the AWS CLI:
$ aws s3 sync s3://nccid-data-warehouse-prod/training/ct ct
download: s3://nccid-data-warehouse-prod/training/ct/Covid1/1.2.3/A.B.C/x.y.z.dcm to ct/Covid1/1.2.3/A.B.C/x.y.z.dcm
...
Repeating this for each relevant prefix downloads the latest data and images that you don't yet have locally:
# Remove items from this array that you don't want to download
modalities=("data" "ct" "ct-metadata" "mri" "mri-metadata" "xray" "xray-metadata")
for modality in "${modalities[@]}"; do
    aws s3 sync "s3://nccid-data-warehouse-prod/training/${modality}" "${modality}"
done
In the above example, a Bash array (the modalities variable) was used.
For more information check the AWS CLI documentation. If you encounter any problems, open an issue on our GitHub repository.
Using Python and Boto3
If you are scripting access to files, we recommend using Python and Boto3.
For more information check the Boto3 documentation. If you encounter any problems, open an issue on our GitHub repository.
The examples below show several ways of accessing the data with Python and Boto3.
Listing files
import boto3
s3 = boto3.resource("s3")
bucket = s3.Bucket(name="nccid-data-warehouse-prod")
# List the objects at a given prefix
for obj in bucket.objects.filter(Prefix="training/data"):
    print(f"{obj.key}\t{obj.size}\t{obj.last_modified}")
This will result in a list such as:
training/data/Covid1/data_2020-05-14.json 1416 2020-05-22 13:38:30+00:00
training/data/Covid6/data_2020-05-15.json 1560 2020-05-22 13:38:31+00:00
....
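A listing like the one above can be summarised before anything is downloaded. The sketch below groups keys by patient ID and totals the object sizes. It works on plain `(key, size)` pairs so that it runs without AWS access, and the `summarise_by_patient` name is hypothetical. With Boto3, the pairs could be produced from the objects returned by `bucket.objects.filter(...)`, using each object's `key` and `size`.

```python
from collections import defaultdict


def summarise_by_patient(objects):
    """Map patient ID -> (file count, total bytes) for training/... keys.

    `objects` is an iterable of (key, size) pairs. The patient ID is
    assumed to be the third path component, as in the warehouse layout.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for key, size in objects:
        parts = key.split("/")
        if len(parts) < 4:
            continue  # skip keys that do not reach a file under a patient
        patient_id = parts[2]
        counts[patient_id] += 1
        sizes[patient_id] += size
    return {pid: (counts[pid], sizes[pid]) for pid in counts}


listing = [
    ("training/data/Covid1/data_2020-05-14.json", 1416),
    ("training/data/Covid6/data_2020-05-15.json", 1560),
]
for patient_id, (count, total) in sorted(summarise_by_patient(listing).items()):
    print(f"{patient_id}\t{count} file(s)\t{total} bytes")
```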
Downloading image files
To download files with Boto3 that you don't already have locally:
import os
import boto3
BUCKET_NAME = "nccid-data-warehouse-prod"
def downloadPrefixFromS3(bucketName, prefix):
    """This function takes a remote S3 bucket and a prefix,
    and downloads all the objects from there that are not
    already stored locally.
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(name=bucketName)
    for obj in bucket.objects.filter(Prefix=prefix):
        key = obj.key
        if os.path.exists(key) and os.stat(key).st_size == obj.size:
            # If the file exists and it's the right size, we should be done
            print(f"{key}: already have locally")
            continue
        # Create the local folder for this key if it doesn't exist yet
        os.makedirs(os.path.dirname(key), exist_ok=True)
        print(f"{key}: downloading")
        bucket.download_file(key, key)
# Download a specific prefix. Don't forget the final "/" to limit to the exact prefix
downloadPrefixFromS3(BUCKET_NAME, "training/mri/")
The above code will create the folders corresponding to the remote prefixes in the current working directory as needed, and only download files that are not yet downloaded (similar to aws s3 sync).
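The skip decision made by the download function above can be isolated and tried out without touching S3. This is a minimal sketch; the `needs_download` helper is a hypothetical name. Like the function above, it compares file sizes only, so a remote object whose content changed while its size stayed the same would not be fetched again.

```python
import os
import tempfile


def needs_download(local_path, remote_size):
    """Return True unless a local file of exactly `remote_size` bytes exists."""
    try:
        return os.stat(local_path).st_size != remote_size
    except FileNotFoundError:
        return True


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.dcm")
    print(needs_download(path, 4))   # file is missing
    with open(path, "wb") as f:
        f.write(b"abcd")
    print(needs_download(path, 4))   # file exists with the same size
    print(needs_download(path, 10))  # size mismatch, e.g. an interrupted download
```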
Opening image files
You can also download a remote DICOM image into memory and open it with, for example, the PyDICOM library:
from io import BytesIO
import boto3
import pydicom
s3 = boto3.resource("s3")
bucket = s3.Bucket(name="nccid-data-warehouse-prod")
image_name = "training/xray/Covid1/1.2.3/A.B.C/x.y.z.dcm"
with BytesIO() as tmp:
    print(f"Downloading: {image_name}")
    bucket.Object(key=image_name).download_fileobj(tmp)
    tmp.seek(0)
    # Read only the metadata here, not the image (pixel) data.
    # To also read the image, set stop_before_pixels to False
    image_data = pydicom.dcmread(tmp, stop_before_pixels=True)
print(image_data)
This code would result in an output such as:
Downloading: training/xray/Covid1/1.2.3/A.B.C/x.y.z.dcm
(0008, 0005) Specific Character Set CS: 'ISO_IR 100'
(0008, 0008) Image Type CS: ['ORIGINAL', 'PRIMARY', '', 'RT', '', '', '', '', '150000']
(0008, 0016) SOP Class UID UI: Digital X-Ray Image Storage - For Presentation
...
Loading a JSON file
As with the image download above, JSON files can also be accessed directly and parsed with Python's built-in json library:
import json
from io import BytesIO
import boto3
s3 = boto3.resource("s3")
bucket = s3.Bucket(name="nccid-data-warehouse-prod")
json_name = "training/data/Covid1/data_2020-05-14.json"
with BytesIO() as tmp:
    print(f"Downloading: {json_name}")
    bucket.Object(key=json_name).download_fileobj(tmp)
    tmp.seek(0)
    json_data = json.load(tmp)
print(json.dumps(json_data, indent=4, sort_keys=True))
The output of the above code would be similar to this:
Downloading: training/data/Covid1/data_2020-05-14.json
{
"Pseudonym": "Covid1",
...
}
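Because the Pseudonym field matches the PATIENT_ID path component, a loaded clinical record can be used to locate the same patient's images. The following is a minimal sketch under that assumption; `image_prefixes_for` and `IMAGE_MODALITIES` are hypothetical names, and the `record` dictionary is a stand-in for the parsed JSON rather than real data.

```python
IMAGE_MODALITIES = ("ct", "mri", "xray")


def image_prefixes_for(clinical_record, modalities=IMAGE_MODALITIES):
    """Build the S3 prefixes that would hold one patient's images.

    Relies on the Pseudonym field matching the PATIENT_ID path component.
    """
    patient_id = clinical_record["Pseudonym"]
    # The trailing "/" limits each prefix to exactly this patient
    return [f"training/{modality}/{patient_id}/" for modality in modalities]


record = {"Pseudonym": "Covid1"}  # stand-in for a parsed clinical data file
for prefix in image_prefixes_for(record):
    print(prefix)
```

Each resulting prefix could then be passed as the Prefix argument of `bucket.objects.filter` to list that patient's images for one modality.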