Bioinformatics – Biostatistics Service Facility coordinator
Dr. Frédéric Schütz is Maître d’Enseignement et de Recherche at the Center for Integrative Genomics. His main responsibilities include training in statistics and related topics, for bachelor and master students. He is also responsible for the Biostatistics Service at the SIB (Swiss Institute of Bioinformatics), where his tasks include training in statistics as well as consulting on all aspects linked to data analysis.
About building bridges…
– Dr. Schütz, something tells me that computers are never far away where you are. Am I right?
FS: (laughs) It’s apparently difficult to hide, no? Indeed, I have always had a fascination for computers, since I was a kid.
– Did you study computer science?
FS: Yes. I studied Mathematics and Computer Science at the University of Geneva and did my PhD work at the University of Melbourne on developing algorithms to improve the statistical interpretation of tandem mass spectrometry data, applied to proteomics.
– Presently, be it at the CIG or at the SIB, you are employed as a biostatistician, right?
FS: That is correct.
– And you have teaching obligations?
FS: Quite a few. There are not that many statisticians around at the University of Lausanne – although this is slowly changing now – and nowadays both students and senior scientists need to have an understanding of biostatistics, as research in the field of Life Sciences quite often deals with the interpretation of large data sets. My job is mostly to build bridges, although that seems to be a well-kept secret.
– You build bridges?
FS: As a matter of fact, I do. To put it somewhat “black-and-white”: at one extreme end of the research spectrum you will find biologists who want to have solid results – from, let’s say, a micro-array experiment – without being too involved in the bioinformatics. At the other end of that spectrum, you will find hardcore statisticians or computer programmers who are developing algorithms, doing the calculations and so on, but who do not have much contact with the biologists and the details of their experiments. My position is situated somewhere “in the middle”, so to speak. I talk to both the bioinformaticians and the biologists, to whom I explain the fundamentals of biostatistics. I adapt both existing and newly developed statistical methods to their experimental settings and particular types of data.
Statistics has penetrated the biologist’s mind
– Have you witnessed major developments in the relationship between biologists and biostatistics?
FS: Yes, there have been quite a few changes over the past decade. As I already mentioned – and there are few exceptions to that rule – life scientists cannot do without statistics. Not anymore. Over the past years, I have witnessed an evolution in the mindset of many biologists at the CIG: a lot of researchers want to understand the fundamentals of biostatistics, so that they can analyze their complex data themselves. They consider biostatistics an integral part of their work. This was different 10 – 12 years ago.
– Other developments worth mentioning?
FS: Obviously, something that has changed dramatically is the nature of biological data. About 15 or 20 years ago, many biologists could still employ classical statistics on a sample size of 10 or 100, or in that range. Nowadays, hypothesis building is much more at the forefront of biostatistics: the number of hypotheses that researchers consider, given a set of complex data, is orders of magnitude higher than it used to be and can no longer be handled with an Excel sheet. I do want to stress, though, that we are not yet dealing with so-called “Big Data”, even though many researchers believe they are.
– But would you not consider the massive sequencing data created by an Illumina sequencer to be large data?
FS: It depends of course on your perspective, but even measuring the expression levels of 30,000 genes in 1,000 patients yields what is, from a purely statistical point of view, still considered a relatively small dataset. And of course, you create highly complex data by high-throughput sequencing, but at the end of the day you have a solid dataset that does not change anymore, and where computing power is unlikely to become a limiting factor. On the other hand, a database of 100 million records that is augmented with 100,000 new records each and every day – like some scenarios encountered in the pharmaceutical industry, for example – is statistically more demanding. What turns data into “Big Data” is not only the number of records, but the fact that the data are heterogeneous – let’s say from a lot of different sources – and highly dynamic.
Biostatistics: a back-and-forth game
– Are new, alternative methods being developed in your field?
FS: Yes, because new types of data constitute new statistical challenges, as I just mentioned. However, I would be cautious about using the word “alternative” in this context. It suggests that entirely new methods are being developed, which is rather rare in our field. More often than not, statisticians adapt existing methods to the specific types of questions they need to see answered. In that sense, statistics quite often involves a back-and-forth game, adapting existing methods or algorithms to particular data. That is why it can never be wrong to keep the history of statistical methods in the back of one’s head.
– Dr. Schütz, allow me a provocative question. Suppose I have analyzed a set of raw transcriptome data with a given algorithm and I have picked up 100 statistically significant differentially expressed genes between two samples. Is it possible that I would pick an entirely different set of genes when analyzing the same data with another algorithm?
FS: It is not impossible, but it would most likely mean that you are not using the appropriate statistical method and/or that the quality of your data is inferior, meaning that they contain more noise than signal. As a matter of fact, in order to find out whether a statistical method is well adapted to a particular dataset – for example, data gathered from a micro-array experiment – we sometimes include some nonsense expression data (highly deviating from the expression data in this particular experimental setting) and then redo the analysis, to see whether the algorithm can indeed discriminate between statistically significant and statistically insignificant results.
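The check described in this answer can be illustrated with a small simulation. What follows is only a minimal sketch of one possible reading of it, not Dr. Schütz's actual procedure: the data are simulated Gaussian noise, a handful of genes are deliberately shifted to deviate strongly from the rest, and a plain Welch t statistic stands in for whatever method is being evaluated. The function names `welch_t` and `spike_in_check`, and all parameter values, are hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def spike_in_check(n_genes=1000, n_spiked=10, n_per_group=10, shift=10.0, seed=0):
    """Spike strongly deviating genes into pure noise and test whether they are recovered.

    Returns (spiked, top): the indices of the spiked genes and the indices of the
    n_spiked genes with the largest absolute t statistic.
    """
    rng = random.Random(seed)
    # Simulated expression values: both groups are pure noise for every gene.
    control = [[rng.gauss(0, 1) for _ in range(n_per_group)] for _ in range(n_genes)]
    treated = [[rng.gauss(0, 1) for _ in range(n_per_group)] for _ in range(n_genes)]

    # Assumption: the "nonsense" data are modelled as a large shift in one group.
    spiked = set(rng.sample(range(n_genes), n_spiked))
    for g in spiked:
        treated[g] = [x + shift for x in treated[g]]

    # Redo the analysis and rank genes by the strength of the evidence.
    scores = {g: abs(welch_t(treated[g], control[g])) for g in range(n_genes)}
    top = set(sorted(scores, key=scores.get, reverse=True)[:n_spiked])
    return spiked, top


spiked, top = spike_in_check()
print(f"spiked genes recovered in the top {len(spiked)}: {len(spiked & top)}")
```

If a method fails to separate such obvious outliers from the background, that is a hint it may not be well adapted to the data at hand.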
– One has to be cautious.
FS: Absolutely. Too many scientists are still blindly using some algorithm, and are happy with their results as soon as a significant P value is attached to them. I cannot stress this enough: complex data gathering in the field of Life Sciences is prone to errors and researchers have to be aware of that. They should scrutinize their data and be skeptical about them. Keep in mind that, for example, a pathway analysis performed on transcriptome data will nearly always return a list of genes. And an analysis can be very misleading if the quality of your raw data was inferior from the very beginning. One should never forget: “garbage comes in, garbage goes out”.
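The warning about significant P values can be made concrete with a short simulation. This is a hedged sketch under stated assumptions rather than anything taken from the interview: every "gene" is pure Gaussian noise with a known standard deviation of 1, so a simple two-sample z-test applies, and the threshold of 0.05 is the conventional one. The function name `count_false_positives` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist, mean


def count_false_positives(n_genes=2000, n_per_group=10, alpha=0.05, seed=1):
    """Count "significant" genes in data that contain no real signal at all."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference of two means when the sd is known to be 1.
    se = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(a) - mean(b)) / se
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided P value
        if p < alpha:
            hits += 1
    return hits


hits = count_false_positives()
print(f"{hits} of 2000 noise-only genes have P < 0.05")
```

With a threshold of 0.05, roughly 5% of the noise-only genes are expected to come out as "significant", which is one reason a list of genes with small P values is not, on its own, evidence of a real effect.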
– And a misleading list of genes often ends up being the focus of many researchers in a team.
FS: Exactly. You know, the human brain is really good at detecting patterns. If I show a list of 50 randomly chosen genes to a PI and tell him that they are the result of a transcriptome profiling experiment in his field, chances are good that he will start to see connections that make sense to him. And I am no different: our grey matter has evolved to detect patterns.
– Evolutionary collateral damage for the interpretation of gene expression data.
FS: (laughs) That would be one way of putting it. But it clearly illustrates the importance of the design and the set-up of a complex data gathering experiment, including the need for negative controls. Bad experimental design and/or sloppy data analysis can put an entire research team on the wrong track.
A look into the future
– Do you expect big developments in your field in the future?
FS: My answer is no, certainly not in the same way as high-throughput sequencing has constituted a paradigm shift in genomics. Changes in the field of biostatistics will most likely be more gradual. A key concept for the future is probably “data integration”: for example, integrating expression data from massive sequencing with proteomics data from the same experimental setting. This is also a trend that will be seen in more clinically oriented research: scientists will want to integrate sequencing data and expression data with 30 different physiological or clinical parameters (when dealing with patients, for example). So, I do not expect a dramatic increase in the size of the data – they will most probably still fit on my USB key – but I foresee the generation of more heterogeneous data, which have to be matched with each other and integrated, and this will obviously pose additional challenges in bioinformatics and biostatistics.