Formatting datasets
This section describes how to format your own datasets for import into HistomicsML. A dataset consists of whole-slide images (.tif), a slide description (.csv), object boundaries (.txt), and histomic features (.h5).
Whole-slide images
Whole-slide images need to be converted to a pyramidal .tif format that is compatible with the IIPImage server (http://iipimage.sourceforge.net/documentation/server/). We have used Vips (http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS) to perform this conversion for our datasets.
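As a rough sketch of the conversion step, the snippet below assembles a `vips tiffsave` command line that writes a tiled, pyramidal TIFF. The file names, the 256-pixel tile size, and the JPEG compression are illustrative assumptions, not the settings used for the sample data, and reading .svs input assumes a VIPS build with OpenSlide support.

```python
import subprocess


def vips_pyramid_command(source, destination, tile_size=256, compression="jpeg"):
    """Build the argument list for a tiled, pyramidal TIFF conversion.

    tile_size and compression are assumed defaults; adjust them to
    whatever your IIPImage deployment expects.
    """
    return [
        "vips", "tiffsave", source, destination,
        "--tile",
        "--pyramid",
        "--compression", compression,
        "--tile-width", str(tile_size),
        "--tile-height", str(tile_size),
    ]


def convert_to_pyramid(source, destination):
    """Run the conversion; requires the vips executable on the PATH."""
    subprocess.run(vips_pyramid_command(source, destination), check=True)


# Hypothetical file names, shown only to print the resulting command.
command = vips_pyramid_command("slide.svs", "slide.tif")
print(" ".join(command))
```

Building the argument list separately from running it makes it easy to review the exact command before converting a large batch of slides.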
Note
The path to the image needs to be saved in the database. HistomicsML uses the database to look up the path when forming a request to the IIPImage server.
Slide description
A table (.csv) needs to be created to capture the dimensions, magnification, and location of the files for each slide image:
<slide name>,<width in pixels>,<height in pixels>,<path to the pyramid on IIPServer>,<scale>
where scale = 1 for 20x and scale = 2 for 40x.
For the sample data provided in the database container, our slide description file (GBM-pyramids.csv) has the following contents:
TCGA-02-0010-01Z-00-DX4,32001,38474,/fastdata/pyramids/GBM/TCGA-02-0010-01Z-00-DX4.svs.dzi.tif,1
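The row format above can be generated programmatically. The following is a minimal sketch using Python's standard `csv` module; the helper names `slide_row` and `write_slide_csv` are hypothetical, and only the 20x and 40x magnifications described above are handled.

```python
import csv
import io

# Mapping stated in the text: scale = 1 for 20x, scale = 2 for 40x.
SCALE_BY_MAGNIFICATION = {20: 1, 40: 2}


def slide_row(name, width, height, pyramid_path, magnification):
    """Return the fields of one slide-description row."""
    if magnification not in SCALE_BY_MAGNIFICATION:
        raise ValueError("only 20x and 40x are described in this section")
    return [name, width, height, pyramid_path,
            SCALE_BY_MAGNIFICATION[magnification]]


def write_slide_csv(rows, handle):
    """Write rows as comma-separated lines without a header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(rows)


# Reproduce the sample row from GBM-pyramids.csv in an in-memory buffer.
buffer = io.StringIO()
write_slide_csv(
    [slide_row("TCGA-02-0010-01Z-00-DX4", 32001, 38474,
               "/fastdata/pyramids/GBM/TCGA-02-0010-01Z-00-DX4.svs.dzi.tif",
               20)],
    buffer,
)
print(buffer.getvalue(), end="")
```

To write a real file, pass an open file handle (for example `open("my-slides.csv", "w", newline="")`, a hypothetical name) instead of the in-memory buffer.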
Object boundaries
Boundary information is formatted as a tab-delimited text file where each line describes the centroid and boundary coordinates of one object:
<slide name> \t <centroid x coordinate> \t <centroid y coordinate> \t <boundary points>
where \t is a tab character and <boundary points> are formatted as: x1,y1 x2,y2 x3,y3 … xN,yN (with spaces between coordinate pairs)
One line from the sample data boundaries file (GBM-boundaries.txt):
TCGA-02-0010-01Z-00-DX4 2250.1 4043.0 2246,4043 2247,4043 2247 ... 2247,4043 2246,4043
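To make the line format concrete, here is a small sketch that writes and parses one boundary line. The function names, the slide name, and the coordinates are hypothetical. Printing centroids with one decimal place and treating boundary points as integers are assumptions based on the sample line, not documented requirements.

```python
def format_boundary_line(slide, cx, cy, points):
    """Format one object as: slide, centroid x, centroid y, boundary points."""
    boundary = " ".join("{},{}".format(x, y) for x, y in points)
    return "\t".join([slide, "{:.1f}".format(cx), "{:.1f}".format(cy), boundary])


def parse_boundary_line(line):
    """Split a boundary line back into its four fields."""
    slide, cx, cy, boundary = line.rstrip("\n").split("\t")
    points = [tuple(int(v) for v in pair.split(","))
              for pair in boundary.split(" ")]
    return slide, float(cx), float(cy), points


# Hypothetical square object whose outline returns to its first point.
outline = [(2246, 4043), (2247, 4043), (2247, 4044), (2246, 4044), (2246, 4043)]
line = format_boundary_line("slide-A", 2246.5, 4043.5, outline)
parsed = parse_boundary_line(line)
print(repr(line))
```

Round-tripping a few lines through both functions is a cheap way to catch stray spaces or missing tabs before importing a large boundaries file.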
Histomic features
Features are stored in an HDF5 binary array format. The HDF5 file contains the following variables:
/features - A D x N array of floats containing the feature values for each object in the dataset (N objects, each with D features). Each feature/row should be normalized by z-score.
/slides - Names of the slides/images in the dataset
/slideIdx - N-length array containing the slide index of each object. These indices can be used with the 'slides' variable to determine what slide each object originates from.
/x_centroid - N-length array of floats containing the x coordinate of object centroids.
/y_centroid - N-length array of floats containing the y coordinate of object centroids.
/dataIdx - Array containing the index of the first object of each slide in 'features', 'x_centroid', and 'y_centroid' (this information can also be obtained from 'slideIdx' and will be eliminated in the future).
/mean - A D-length array containing the mean of each feature prior to normalization. This provides a record of z-score normalization parameters so that the data can be de-normalized if needed.
/std_dev - A D-length array containing the standard deviation of each feature prior to normalization. This provides a record of z-score normalization parameters so that the data can be de-normalized if needed.
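The variables listed above can be assembled as in the following sketch, which z-scores a small made-up feature matrix and then de-normalizes it using the stored parameters. Several points are assumptions rather than documented behavior: the arrays are laid out with one row per object as seen from Python (transpose if your tooling expects D x N), indices are zero-based, the population standard deviation is used, and objects are already grouped by slide. The helper names are hypothetical, and `write_hdf5` requires h5py.

```python
import numpy as np


def build_feature_arrays(raw, slide_names, slide_idx, x_centroid, y_centroid):
    """Z-score raw features and collect the variables listed above."""
    raw = np.asarray(raw, dtype=np.float64)       # assumed shape: (N, D)
    slide_idx = np.asarray(slide_idx, dtype=np.int64)
    if np.any(np.diff(slide_idx) < 0):
        raise ValueError("objects must be grouped by slide")
    mean = raw.mean(axis=0)
    std_dev = raw.std(axis=0)                     # population std (assumption)
    if np.any(std_dev == 0):
        raise ValueError("constant features cannot be z-scored")
    return {
        "features": ((raw - mean) / std_dev).astype(np.float32),
        "slides": np.array(slide_names, dtype="S"),
        "slideIdx": slide_idx,
        "x_centroid": np.asarray(x_centroid, dtype=np.float32),
        "y_centroid": np.asarray(y_centroid, dtype=np.float32),
        # First object of each slide, zero-based (assumption).
        "dataIdx": np.searchsorted(slide_idx, np.arange(len(slide_names))),
        "mean": mean.astype(np.float32),
        "std_dev": std_dev.astype(np.float32),
    }


def write_hdf5(path, arrays):
    """Write each array as a dataset under the root of an HDF5 file."""
    import h5py  # imported here so the rest of the sketch runs without it
    with h5py.File(path, "w") as handle:
        for name, values in arrays.items():
            handle.create_dataset(name, data=values)


# Made-up data: four objects, two features, two slides.
raw = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]
arrays = build_feature_arrays(raw, ["slide-A", "slide-B"], [0, 0, 1, 1],
                              [10.5, 20.5, 30.5, 40.5], [1.0, 2.0, 3.0, 4.0])

# De-normalize with the stored parameters to recover the raw values.
restored = arrays["features"] * arrays["std_dev"] + arrays["mean"]
```

Calling `write_hdf5("my-features.h5", arrays)` (a hypothetical file name) would then store the datasets under the names used in this section.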
The sample file (GBM-features.h5) provided in the database Docker container can be inspected with the following commands:
>>> import h5py
>>> path = "GBM-features.h5"
>>> contents = h5py.File(path, "r")
>>> for name in contents:
...     print(name)
...
# The loop prints the names of the datasets stored under the root of the HDF5 file:
dataIdx
features
mean
slideIdx
slides
std_dev
x_centroid
y_centroid
# To inspect the contents further, index into a dataset, for example the first row of 'features':
>>> contents['features'][0]
array([ -7.30991781e-01, -8.36540878e-01, -1.07858682e+00,
9.26770031e-01, -9.31272805e-01, -4.36136842e-01,
-1.13033086e-01, 5.28297901e-01, 6.85962856e-01,
5.07918596e-01, -5.27561486e-01, -7.48096228e-01,
-6.84849143e-01, -8.79032671e-01, -1.41368553e-01,
-3.24195564e-01, -4.50991303e-01, -1.32366025e+00,
9.17324543e-01, 8.36400129e-03, -2.92657673e-01,
2.01028720e-01, -1.93680093e-01, 8.68237793e-01,
5.72155595e-01, 3.29810083e-01, -3.63551527e-01,
-2.87026823e-01, -8.47819634e-03, -4.55458522e-01,
1.43787396e+00, 5.24487114e+00, -9.62561846e-01,
5.94001710e-01, 3.57634330e+00, -2.94562435e+00,
-9.18125820e+00, 2.87391472e+01, -9.34123135e+00,
2.55983505e+01, -2.99653459e+00, -1.17376029e-01,
-5.40324259e+00, 1.01094952e+01, 5.87054205e+00,
6.21094942e+00, -2.59355903e+00, -4.27142763e+00], dtype=float32)