This article demonstrates how to build a machine learning pipeline for breast cancer classification using Kubeflow Pipelines (KFP) and Vertex AI. The pipeline uses the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) dataset, TensorFlow Datasets for data processing, and a custom Flax CNN model for training.
https://github.com/Davidnet/breast-cancer-detection-nnx-pipeline
The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information.
The pipeline consists of three main stages: downloading the dataset, creating TF Records, and training the model. Let's break each of these down:
1. Downloading the Dataset:
Downloading and processing of the dataset are encapsulated in this repo, a fork of Lazaros Tsochatzidis' repository. It handles the pre-processing of the CBIS-DDSM mammographic database: downloading the whole database, converting the DICOM images to PNG, and parsing the database files, then producing a Kubeflow artifact. The code is packaged in a custom Docker image (davidnet/cbis_ddsm_dataloader:1.0.2) that performs the downloading and preprocessing.
2. Creating the TF Records:

```python
import tarfile

import tensorflow_datasets as tfds

# Extract the dataset artifact produced by the previous component
with tarfile.open(dataset.path) as tar:
    tar.extractall(path="./extracted_dataset")

# Prepare the dataset with the TFDS builder, pointing it at the extracted files
curated_breast_imaging_ddsm = tfds.builder("curated_breast_imaging_ddsm")
curated_breast_imaging_ddsm.download_and_prepare(
    download_config=tfds.download.DownloadConfig(manual_dir="./extracted_dataset")
)
```
The create_tf_records component uses TensorFlow Datasets to process the downloaded dataset and convert it into TF Records. It extracts the dataset artifact from the previous component, uses the curated_breast_imaging_ddsm builder from TensorFlow Datasets to prepare the data, and finally packages the processed data into a tarball containing the TF Records.
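The final packaging step can be sketched with the standard library. The function and directory names here are illustrative, not the component's actual code:

```python
import os
import tarfile


def package_tfrecords(data_dir: str, output_tar: str) -> None:
    """Bundle a prepared TFDS data directory into a gzip tarball.

    In the pipeline, output_tar would be the path of the component's
    KFP output artifact (hypothetical sketch, not the real component).
    """
    with tarfile.open(output_tar, "w:gz") as tar:
        # arcname keeps archive paths relative to the data directory
        tar.add(data_dir, arcname=os.path.basename(data_dir))
```

The resulting tarball is what the training stage later downloads and extracts.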
3. Training the Model: