Bio-medical image analysis


With the rapid advancement of microscopy imaging techniques, the task of analyzing images and converting them into quantitative data is becoming the limiting factor in many areas. Almost all of the analysis is manual and thus extremely expensive. I use a combination of computer vision and machine learning to create software for automating this type of analysis.

The main machine learning algorithms that I use are Adaboost and active learning.

Adaboost is an algorithm for learning a classification rule from manually labeled examples. It does so by combining many simple “rules of thumb”, each of which is only slightly correlated with the label to be predicted. The combined rule can be far more accurate than any single rule of thumb. To apply Adaboost to a particular classification task, such as differentiating between images of cancerous and non-cancerous prostate tissue, we take the following steps:

  1. In collaboration with a domain expert (in this case, a pathologist), we create a list of quantitative measures that capture salient features of the class we want to detect. In the case of prostate cancer, the size and concentration of lumens in a region is an important feature, because a high concentration of small glands is indicative of prostate cancer.

  2. The domain expert labels a set of training examples. In the case of prostate cancer, the pathologist identifies cancerous regions in prostate images.

  3. Adaboost is run on the training set and finds a combination of the selected features that accurately predicts the desired class. If a feature is not informative, Adaboost will ignore it, so in step 1 we can include many candidate features without worrying that some of them are irrelevant or unreliable.
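The three steps above can be illustrated with a minimal sketch. It assumes the rules of thumb are single-feature threshold tests (decision stumps), which is a common choice but not necessarily the one used in this work. The feature names and the toy data are invented for illustration only.

```python
import math

def stump_predict(row, j, t, polarity):
    """Rule of thumb: predict `polarity` if feature j exceeds threshold t, else -polarity."""
    return polarity if row[j] > t else -polarity

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: one below the minimum, then midpoints between values.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def adaboost(X, y, rounds=5):
    """Combine `rounds` stumps into a weighted vote; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, j, t, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def score(ensemble, row):
    """Signed vote; the sign is the predicted label, the magnitude a confidence."""
    return sum(a * stump_predict(row, j, t, p) for a, j, t, p in ensemble)

def predict(ensemble, row):
    return 1 if score(ensemble, row) > 0 else -1

# Step 1 (hypothetical features): [mean lumen size, lumen density, unrelated noise]
# Step 2 (hypothetical labels): +1 = cancerous region, -1 = non-cancerous region
X = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.1], [0.7, 0.3, 0.9],
     [0.3, 0.8, 0.4], [0.2, 0.9, 0.6], [0.1, 0.7, 0.2]]
y = [-1, -1, -1, 1, 1, 1]

# Step 3: run Adaboost on the labeled training set.
ensemble = adaboost(X, y)
used_features = sorted({j for _, j, _, _ in ensemble})
print("features used:", used_features)
print("training predictions:", [predict(ensemble, row) for row in X])
```

On this toy set the noise feature never yields the lowest weighted error, so no stump is ever built on it. This is the sense in which an uninformative feature is ignored.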

Active learning is a methodology we have developed over time. Using Adaboost made it possible to solve classification problems that were previously considered intractable. However, it became clear that achieving high accuracy requires a large number of accurately labeled examples and a large variety of features. This places a high up-front cost on creating classifiers for new problems. Our solution is to use an iterative approach. We start by collecting a large set of unlabeled examples. Initially, a small fraction of these examples are labeled and a small number of features are combined. The rule generated by Adaboost from these small sets is too inaccurate to use as a classifier, but it is accurate enough to reliably classify a significant fraction of the unlabeled set. Based on this reliability measure, the computer can identify the examples whose labels will be most valuable for increasing the accuracy of the classifier. In addition, by browsing these hard-to-classify examples, the domain expert can identify additional features that are useful for classifying them. We thus repeat steps 1-3 described above, concentrating more and more on the hardest examples, until we converge to a rule whose performance is close to that of a human. More precisely, we find a rule whose prediction is confident when most human experts agree on the correct label, and is not confident otherwise.
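One way to picture the selection step in this loop is sketched below. It assumes the reliability measure is the magnitude of the classifier's signed score, which is an assumption and not something stated above. The function names, the threshold and the scores are hypothetical.

```python
def split_by_confidence(scores, threshold):
    """Split example indices into confidently and unconfidently classified ones.

    `scores` are signed classifier outputs; a large magnitude is assumed to
    indicate a reliable prediction.
    """
    confident = [i for i, s in enumerate(scores) if abs(s) >= threshold]
    uncertain = [i for i, s in enumerate(scores) if abs(s) < threshold]
    return confident, uncertain

def select_queries(scores, k):
    """Return the k example indices with the least confident scores.

    These are the examples shown to the domain expert for labeling next.
    """
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:k]

# Hypothetical scores of the current rule on six unlabeled examples.
scores = [2.4, -0.1, 0.3, -3.0, 0.05, 1.2]

confident, uncertain = split_by_confidence(scores, threshold=1.0)
queries = select_queries(scores, k=2)

print("reliably classified:", confident)
print("hard examples:", uncertain)
print("ask the expert to label:", queries)
```

In each iteration the expert labels the queried examples and may propose new features after browsing them. The rule is then retrained and the unlabeled set is rescored.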

For a video of a talk on this subject, click here.

Summarizing prostate pathology images

The digital image of a 10 mm × 10 mm pathology slide, taken at 400X, is a 20 Gbyte file (1 Gbyte compressed). This large size makes it very hard to send such files over the Internet. We use machine learning techniques to identify the most suspicious locations in the specimen and create a summary file of 10 Mbyte (2000 times smaller than the original), small enough to be sent as a standard email attachment. The summary contains sufficient detail to enable the pathologist to diagnose the specimen.
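The idea of keeping only the most suspicious locations within a fixed size budget can be sketched as follows. This is a hypothetical greedy selection, not the actual summarization method. The tile identifiers, suspicion scores and sizes are invented for illustration.

```python
def summarize(tiles, budget):
    """Greedily keep the most suspicious tiles that fit within `budget`.

    `tiles` is a list of (tile_id, suspicion_score, size) tuples, where size
    and budget share the same unit. Returns the kept ids and the total size.
    """
    chosen, used = [], 0
    for tile_id, _score, size in sorted(tiles, key=lambda t: t[1], reverse=True):
        if used + size <= budget:
            chosen.append(tile_id)
            used += size
    return chosen, used

# Hypothetical tiles: (id, suspicion score from a classifier, size in Mbyte).
tiles = [("a", 0.9, 4), ("b", 0.2, 3), ("c", 0.7, 5), ("d", 0.5, 2)]
chosen, used = summarize(tiles, budget=10)
print("tiles in summary:", chosen, "total Mbyte:", used)

# Size reduction quoted in the text: 20 Gbyte original versus 10 Mbyte summary.
ratio = (20 * 10**9) // (10 * 10**6)
print("reduction factor:", ratio)
```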

Analysis of gene activity in Drosophila embryos

Fluorescent In Situ Hybridization (FISH) is a technique for tagging RNA and protein molecules in situ. Using a confocal microscope, one can measure the concentration of these molecules with sub-cellular resolution. Using FISH on Drosophila embryos, the Bier Lab and McGinnis Lab have collected detailed data on the gene regulation networks that control development. However, transforming this image data into usable quantitative data is extremely laborious. My student William Beaver is working with these labs to automate this analysis.

Detection of protein micro-crystals

One of the largest projects in structural biology is the reconstruction of the 3D structure of proteins using X-ray crystallography. One of the biggest challenges of this project is to crystallize the protein. Highly automated methods are used to sample many different solutions of the protein with various additional small molecules that can help it form a crystal. A major bottleneck in this process is the human labor required to identify micro-crystals in the solution. In collaboration with Glenn Spraggon of Novartis research, we have developed a method to aid the manual search by sorting crystal images according to the likelihood that they contain a micro-crystal.

Quantifying the movement of the lamellipodia

The Sheetz lab has studied the dynamics of cell motility for over twenty years. Cell motility proceeds by cycles of edge protrusion, adhesion, and retraction. We used computer vision and machine learning methods to identify the backward-moving waves in time-lapse microscopy images. Using this analysis, the dynamics of these waves have been quantified.

Giannone, Dubin-Thaler, Rossier, Cai, Chaga, Jiang, Beaver, Dobereiner, Freund, Borisy and Sheetz
Lamellipodial Actin Mechanically Links Myosin Activity with Adhesion-Site Formation
Cell, Vol 128, 561-575, 09 February 2007