GTF Facility coordinator
The Genomic Technologies Facility (GTF)
Since November 2003, Dr. Keith Harshman has been the Coordinator of the Lausanne Genomic Technologies Facility (GTF), providing customers with state-of-the-art technologies to detect and measure quantitative and qualitative variations in nucleic acids. GTF services cover every step of the project workflow: experiment design, sample processing, data generation, data analysis and management.
GTF: the early years
– Dr. Harshman, I would not be surprised at all if you were to tell me that you have been interested in genomics and sequencing since you were a small boy, like those whizz-kids who can play the violin before they can properly walk.
KH: (laughs) Not exactly. As a matter of fact, I was trained as an analytical chemist before my scientific interests turned to biochemistry, more specifically towards the isolation and characterization of eukaryotic transcription factors. My fascination for genomics and genomics-related issues came later – somewhere around the early nineties – when I was employed at Myriad Genetics Inc., a small biotech company in Salt Lake City, Utah. They had a project running that overlaid a disease database with human genealogies, searching for familial clustering of diseases and trying to find a causal link with particular genes; at that time, they were focusing on breast cancer and melanoma. The ultimate goal was to develop diagnostic tests based upon the genes that were identified. This strategy implied a genomic analysis, dealing with data generation on a large scale and processing lots of samples…
– … which managed to capture your full attention, I presume?
KH: Exactly. I was impressed by the multidisciplinarity and the workflow characteristics: molecular biologists, population geneticists and bioinformaticians were working side-by-side with medical doctors and automation specialists – a rather non-academic setting for those days. As a matter of fact, this was the first time I had seen liquid-handling robots for processing samples.
– This must have been a solid experience for your later work at the CIG.
KH: Definitely. I arrived at the CIG around 2002-2003. Actually, we were the first group to enter the Genopode building, which was still called the Pharmacy building in those days.
– What did you find? Were genomics technologies “up and running”?
KH: Some of those technologies were indeed available – such as microarrays for Arabidopsis and human and mouse cDNAs – but they were scattered over different laboratories with relatively little coordination. Consequently, the idea arose to establish a core facility at the service of the entire scientific community in Lausanne. And this became my job, as I was made head of the DNA Array Facility (DAF), as it was called in those days. I made use, of course, of the expertise that was available and focused on expanding and establishing the bioinformatics infrastructure, interacting with people and facilities that were being established or were already present at the time. Perhaps surprisingly, sequencing in those days was not part of the DAF and did not become important for us until around 2008. And I must add that the transition to high-throughput sequencing was fairly smooth, because much of the molecular biology and data analysis is quite similar to the experimental work our team had already been doing.
Massive sequencing enters the stage
– Have you seen major developments in your field over the past decade(s)?
KH: One can fairly say that the new sequencing paradigm has changed just about everything. Most of the questions that used to be answered with microarrays are now being addressed, mostly in a more comprehensive manner, with high-throughput sequencing.
– Why “in a more comprehensive manner”?
KH: Because, no matter how many answers microarray technology can provide, you have to make assumptions – to some extent you have to know what you are searching for within your experimental set-up – in order to represent a particular nucleic acid sequence with an oligonucleotide on a gene chip. When performing high-throughput sequencing, on the other hand, there is no need for such a priori assumptions: whereas a microarray tells you whether a sequence complementary to a given pre-designed oligo is present in your sample, massive sequencing also reveals the presence of nucleic acids that were not sought after. That is a fundamental difference between the two technologies, and its biological relevance can hardly be overestimated.
– For an outsider it is not always obvious to comprehend why a “wild sequencing approach” gives you relevant information.
KH: Obviously, the information isn’t very useful in itself, but this is where bioinformatics comes in. That is another major change, by the way: the evolution of bioinformatics, which accelerated dramatically with the introduction of massive sequencing and the huge data sets it generates.
– Massive sequencing has become one of the core activities of the GTF?
KH: Indeed. For example, a high-throughput RNA sequencing experiment will generate around 30 million or more reads for one sample, each read covering about one hundred nucleotides of a complex nucleic acid population. The next step is to map these nucleic acids (quite often they are cDNAs) to the genome and determine how many individual reads map to a given gene. This tells you how many transcripts originate from a particular gene – an accurate measure of RNA abundance and hence of the gene’s expression level – in both samples, for example a wild-type liver compared to a cancerous liver.
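The counting step Dr. Harshman describes can be sketched in a few lines of Python. The gene names and mapped-read lists below are invented for illustration; in a real pipeline an aligner would first map each read to the genome.

```python
from collections import Counter

# Hypothetical outputs of a read aligner: each entry is the gene that one
# sequencing read mapped to (gene names are illustrative only).
wild_type_hits = ["TP53", "ALB", "ALB", "MYC", "ALB", "TP53"]
tumor_hits     = ["TP53", "MYC", "MYC", "MYC", "ALB", "MYC"]

def expression_counts(mapped_reads):
    """Count how many reads map to each gene: a proxy for transcript abundance."""
    return Counter(mapped_reads)

wt = expression_counts(wild_type_hits)
tumor = expression_counts(tumor_hits)

# Compare expression levels gene by gene between the two samples.
for gene in sorted(set(wt) | set(tumor)):
    print(f"{gene}: wild type {wt[gene]} reads, tumor {tumor[gene]} reads")
```

In practice the counts are in the millions per sample and are normalized before comparison, but the principle is the same: more reads mapping to a gene means more transcripts from that gene.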
– How many genes can you detect?
KH: A typical experiment allows us to determine the expression level of 10,000 – 15,000 genes. We always try to minimize cost while maximizing the utility of an experiment. Reaching high sensitivity – that is to say, detecting very low-abundance transcripts with high accuracy – depends to some extent on the user’s budget. For example, on a customer’s request, we can generate 40 or 50 million or more reads in one experiment by doing more sequencing reactions, but at some point you reach saturation. So, based on our experience, we find that, except in special cases, 20 – 30 million reads per sample is adequate, even for genes that are expressed at a low level.
– Which technologies are you using?
KH: At the GTF, we are using two different technologies:
* Illumina HiSeq and MiSeq, which are essentially nucleic acid molecule counting technologies that generate huge amounts of data: using an Illumina flow cell, you can monitor 2 billion reactions in one experiment in a cost-effective manner. For example, this technique allows you to detect the 10 sevenless RNA transcripts present in a wild-type cell, compared to the 50 sevenless transcripts in a mutant cell.
* On the other hand, counting nucleic acid molecules will most likely not have a high priority when sequencing and de novo assembling a genome. In such a case, you want to generate long sequence reads of the target nucleic acid, allowing you to reassemble the genome. To that end, we are using the SMRT or “single molecule real time” sequencing technology from Pacific Biosciences, which monitors a sequencing reaction – a polymerase extending a primer on a template – in real time. Each of the four nucleotides is labeled with a different fluorescent dye and, when incorporated into the growing chain, generates a signal that is captured by a detector focused on one growing nucleic acid chain. So, when you detect a green flash, you know that an A has been incorporated; when you detect a red flash, a T has been incorporated, and so on. For the Illumina technology, the maximum read length is around 300 nucleotides. With the SMRT technology we can get reads of up to 60,000 nucleotides, with an average read length of 15,000 – 20,000 nucleotides. In many cases, that is adequate for assembling genomes.
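The “green flash means A” idea above amounts to a simple lookup: each detected fluorescence colour is translated into one base of the growing sequence. The sketch below illustrates that principle; the colour-to-base mapping is invented for illustration, and the real instrument uses its own dye assignments plus substantial signal processing.

```python
# Illustrative colour-to-base table (assumed, not the actual PacBio dye set).
DYE_TO_BASE = {"green": "A", "red": "T", "blue": "C", "yellow": "G"}

def call_sequence(flashes):
    """Translate a series of detected fluorescence flashes into a base sequence."""
    return "".join(DYE_TO_BASE[colour] for colour in flashes)

print(call_sequence(["green", "red", "blue", "green"]))  # -> ATCA
```

Real base calling must also handle missed or spurious flashes, which is why raw long reads carry a higher error rate than this idealized picture suggests.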
Looking into the future
– Do you expect major developments for the future, in your field?
KH: These days, there is much ado about “single cell genomics”, a hotly pursued topic among experts in the field.
– Single cell genomics?
KH: Yes. Nowadays, for a standard high-throughput sequencing experiment, we need around 100 nanograms of total RNA – usually obtained from tissue or cell samples – collected from tens of thousands of cells. Needless to say, the expression profile we obtain this way is an average over all the cells in the original sample. In other words, we lose cell-to-cell variation. Determining single-cell expression profiles, on the other hand, would enable us to measure intercellular variability within a single cell population. Now, depending on the experimental design and the questions our customers would like to see answered, such cell-to-cell variation might or might not be biologically relevant. Gene expression differences between single cells can be important, for example, in particular immunological, oncological or neurobiological settings.
– But somehow you need to amplify the nucleic acid obtained from a single cell sample.
KH: That is correct: this technology incorporates a PCR amplification step, which might jeopardize the integrity of the sample. Improvements are still needed – to some extent, single cell genomics already exists today – to minimize experimental bias and to enable performing the sequencing reactions in smaller volumes. At present, a lot of effort is directed towards developing microfluidic methods for processing single cell samples.
– Other developments in sequencing technology appearing on the horizon?
KH: People are developing methods that remove the need for fluorescent dyes and optically based detection. In many cases this would be a big advantage over the technology we are using today…
– … because it would be much cheaper.
KH: Not only that, it also simplifies things. The entire optical system behind high-throughput sequencing detection would become obsolete: lasers, fluorescence detectors, fluidic systems… That is where nanopore technology comes in, which essentially measures the perturbation of an electrical potential across a channel as a single nucleic acid polymer passes through it: depending on the characteristics of the perturbation, the identity of the nucleic acid passing through the channel can be determined.
– Does this technology exist already?
KH: Yes, we’ve already done a few experiments, but it is not something that we offer as a service yet.