– Nirali Rajnikant Gandhi
Dicom Data Source for Apache Spark:
This project provides a scalable, Spark-based mechanism for efficiently reading DICOM images into a Spark SQL DataFrame.
A year ago, we encountered a problem in the healthcare domain that involved reading a huge number of DICOM images for analysis on a Spark-Hadoop cluster. Although Spark provides APIs to read many input formats, such as CSV, JSON, Parquet, and even JPEG/PNG images, it had no direct API for reading DICOM images.
So we, the big data engineers at Abzooba, built a Spark-Scala library that parses DICOM images into a Spark DataFrame, improving both developer productivity and performance when DICOM images are analyzed in Spark.
The project is built with Spark 2.4, Scala 2.11, and the Java-based open-source library “dcm4che”.
Introduction to Dicom Images:
DICOM (Digital Imaging and Communications in Medicine) is the international standard for communicating and managing medical images and related data, and it is used in many healthcare facilities. Its mission is to ensure the interoperability of systems used to produce, store, share, display, send, query, process, retrieve, and print medical images, as well as to manage related workflows.
According to NEMA, vendors that manufacture imaging equipment (e.g., MRI scanners), imaging information systems (e.g., PACS), and related equipment often follow the DICOM standard.
The standard can apply to any field of medicine where medical imaging technology is predominantly used, such as radiology, cardiology, oncology, obstetrics, and dentistry.
A DICOM image file consists of the image itself and metadata (information associated with the image, such as patient and device details).
Sample Dicom Image:
Sample Use Case with Architecture Diagram:
In the medical domain, there is often a requirement to search all available DICOM images by some metadata value.
For example: get all DICOM images where the patient's age is between 20 and 25 and the patient's gender is female. There could be billions of images in the data source system.
In the traditional approach, the search team has to check all the images manually and then prepare them for further analysis. With this library, the search can be automated: a Spark batch job extracts the metadata and stores the metadata values, together with a reference to each image, in a data store such as HBase, where the metadata can be queried directly. Alternatively, the metadata JSON can be fed directly to a search engine such as Solr or Elasticsearch. The accompanying sample architecture diagram illustrates this case.
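The metadata search described above could be sketched as a Spark SQL filter over the extracted metadata column. This is only an illustration under stated assumptions: the JSON keys PatientAge and PatientSex, the age being stored as a plain number, the sample rows, and the object and method names are all hypothetical. The JSON layout the library actually emits may differ (raw DICOM ages, for instance, are usually encoded like "023Y").

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, get_json_object}

object MetadataSearchSketch {
  // Assumed JSON keys (PatientAge, PatientSex); adjust to the real metadata layout.
  def femalePatientsAged20To25(dicomDf: DataFrame): DataFrame =
    dicomDf
      .withColumn("age", get_json_object(col("metadata"), "$.PatientAge").cast("int"))
      .withColumn("sex", get_json_object(col("metadata"), "$.PatientSex"))
      .filter(col("age").between(20, 25) && col("sex") === "F")
      .select("origin", "age", "sex")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("metadata-search-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hand-written stand-in rows; a real job would use the DataFrame produced by the library.
    val sample = Seq(
      ("/data/a.dcm", """{"PatientAge":"23","PatientSex":"F"}"""),
      ("/data/b.dcm", """{"PatientAge":"41","PatientSex":"F"}"""),
      ("/data/c.dcm", """{"PatientAge":"22","PatientSex":"M"}""")
    ).toDF("origin", "metadata")

    femalePatientsAged20To25(sample).show(false)
    spark.stop()
  }
}
```

The filter returns only file references and the two matched fields, so the result stays small enough to hand to a downstream store or search index even when the image collection itself is very large.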
Dicom Images Reading in Spark:
This library reads DICOM images from the specified location and generates a DataFrame with the following schema:
- origin contains the file path of the DICOM file.
- metadata contains the image's associated information (patient, device) as JSON.
- pixeldata contains the pixel data as an array of bytes.
It also returns the files that could not be parsed in a separate DataFrame.
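Expressed as a Spark schema, the three columns listed above might look like the sketch below. Only the column names and their contents come from the description; the exact Spark types (a string for the JSON, a binary column for the bytes), the nullability flags, and the object name are assumptions.

```scala
import org.apache.spark.sql.types.{BinaryType, StringType, StructField, StructType}

object DicomSchemaSketch {
  // Assumed types: the article only names the columns and describes their contents.
  val assumedSchema: StructType = StructType(Seq(
    StructField("origin", StringType, nullable = false),   // file path of the DICOM file
    StructField("metadata", StringType, nullable = true),  // patient/device information as a JSON string
    StructField("pixeldata", BinaryType, nullable = true)  // pixel data as an array of bytes
  ))

  def main(args: Array[String]): Unit =
    println(assumedSchema.treeString)
}
```

Calling printSchema() on the DataFrame the library returns is the reliable way to confirm the real types.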
In this section, we look at how to load DICOM images into a Spark DataFrame. The Apache Spark third-party package for DICOM reading can be found at https://spark-packages.org/package/abzoobabd/spark-dicom, and the complete project is available on GitHub.
val (dcmdf, cdf) = dicomread.readDicom(path, sparksession, numpartitions)
Here, dcmdf is the DICOM DataFrame and cdf is the exception DataFrame, which lists corrupt DICOM files and non-DICOM files.
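Once both DataFrames are available, a job might report how many files were rejected and keep only the columns needed for a metadata index. The following is a minimal sketch that works on stand-in DataFrames rather than calling the library. The helper names, the sample rows, and the single origin column assumed for cdf are hypothetical, since the layout of the exception DataFrame is not described here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadResultSketch {
  // Keep only the file reference and metadata, e.g. before loading a search index.
  def metadataIndex(dcmdf: DataFrame): DataFrame =
    dcmdf.select("origin", "metadata")

  // One-line summary of how many files parsed and how many were rejected.
  def summary(dcmdf: DataFrame, cdf: DataFrame): String =
    s"parsed=${dcmdf.count()}, rejected=${cdf.count()}"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("read-result-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the two DataFrames returned by readDicom (cdf layout is assumed).
    val dcmdf = Seq(("/data/a.dcm", """{"PatientSex":"F"}""", Array[Byte](1, 2, 3)))
      .toDF("origin", "metadata", "pixeldata")
    val cdf = Seq("/data/notes.txt").toDF("origin")

    println(summary(dcmdf, cdf))
    metadataIndex(dcmdf).show(false)
    spark.stop()
  }
}
```

Dropping pixeldata before writing the index keeps the stored rows small; the origin column still points back to the full image when it is needed.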