This article demonstrates how to build a machine learning pipeline for breast cancer classification using Kubeflow Pipelines (KFP) and Vertex AI. The pipeline leverages the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) dataset and utilizes TensorFlow Datasets for data processing and a custom Flax CNN model for training.

https://github.com/Davidnet/breast-cancer-detection-nnx-pipeline

The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information.

Pipeline Overview

The pipeline consists of three main stages:

  1. Download Dataset: Downloads the CBIS-DDSM dataset using a custom container image.
  2. Create TF Records: Processes the downloaded dataset and converts it into TensorFlow Records format for efficient training.
  3. Train Model: Trains a Flax CNN model using the prepared TF Records.

Let us break this down into its pieces:

Code Breakdown

1. Downloading the Dataset:

The download and processing of the dataset are encapsulated in this repo, which is a fork of Lazaros Tsochatzidis's repo.

The fork mainly handles the pre-processing of the CBIS-DDSM mammographic database: it downloads the whole database, converts the DICOM images to PNG, parses the database files, and then creates a Kubeflow artifact. The code is packaged in a custom Docker image (davidnet/cbis_ddsm_dataloader:1.0.2) that the pipeline runs for this stage.

2. Create TF Records:
import tarfile

import tensorflow_datasets as tfds

# `dataset` is the input artifact produced by the download step.
with tarfile.open(dataset.path) as tar:
    # Extract all contents to the specified directory
    tar.extractall(path="./extracted_dataset")

# Build the TF Records from the manually extracted files.
curated_breast_imaging_ddsm = tfds.builder("curated_breast_imaging_ddsm")
curated_breast_imaging_ddsm.download_and_prepare(
    download_config=tfds.download.DownloadConfig(manual_dir="./extracted_dataset")
)

The create_tf_records component uses TensorFlow Datasets to convert the downloaded dataset into TF Records. It extracts the dataset produced by the previous component, uses the curated_breast_imaging_ddsm builder from TensorFlow Datasets to prepare the data, and finally packages the processed data into a tarball containing the TF Records.
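The extract and re-package steps can be tried out in isolation with the standard library alone. The sketch below is an assumption-laden illustration rather than the component's real code: the helper names pack_directory and unpack_tarball, the gzip compression, and the fake shard file are all made up for the example, and no real TF Records are produced.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(source_dir: Path, tarball_path: Path) -> None:
    """Write every file under source_dir into a gzip-compressed tarball."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir, not absolute paths.
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())


def unpack_tarball(tarball_path: Path, target_dir: Path) -> None:
    """Extract a tarball; read mode auto-detects the compression."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball_path) as tar:
        tar.extractall(path=target_dir)


# Round trip with a fake shard standing in for the prepared TF Records.
with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    records_dir = work / "tf_records"
    records_dir.mkdir()
    shard_name = "train.tfrecord-00000-of-00001"  # hypothetical shard name
    (records_dir / shard_name).write_bytes(b"fake shard bytes")

    tarball = work / "tf_records.tar.gz"
    pack_directory(records_dir, tarball)

    restored = work / "restored"
    unpack_tarball(tarball, restored)

    restored_names = sorted(p.name for p in restored.rglob("*") if p.is_file())
    restored_bytes = (restored / shard_name).read_bytes()
```

Storing relative paths in the archive keeps the tarball portable between pipeline steps, since each step extracts it under its own working directory.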

Reference: the curated_breast_imaging_ddsm page in the TensorFlow Datasets catalog.

3. Training the Model: