Bioinformatics and the importance of curation

I recently finished a class on bioinformatics, the study of how best to use all the information scientists have accumulated and are accumulating in databases scattered around the world and web. And there’s a lot of information out there. Sometimes you hear the word “inventory” used to describe massive studies that characterize a lot of cellular components at once, as in: at some point in the future we will have a complete inventory of all proteins in all cancer cells, or in all stages of embryonic development, or what have you. I have to confess, I always picture an attic full of boxes, with structures and sequences and expression patterns, all minimally labeled, gathering dust.

Huge aggregations of information are scary. Even a single database for a single field of study (say, the NCBI website), a tiny slice of All The Knowledge in the World, forces us to face the fact that no one person will ever be able to assimilate it all. My feelings about bioinformatics are a lot like my feelings as a child, when I realized I would never be able to read all the books in the world: a crestfallen sense of having to miss out on something really, really interesting.

And at this point, maybe everyone is missing out. Even the authors of big-data studies (say, microarray experiments that assay the expression level of every gene in the genome in diseased and healthy tissues) can’t follow every lead to its source. Often a few differentially regulated genes get followed up, but the rest just get put out there for others to work with. Sometimes I suspect that we (that collective, Internet-era “we”) have all the information to answer any cell-level biological question we could ask, but no idea of what the questions will be or how best to frame them.

One of the solutions to this problem is data mining: enlisting a computer’s help to sift through the masses of information with a program that looks for the connections a human might see, if a human had the time and brainpower. Reading up on the art of data mining, I stumbled across a company called Narrative Science with a novel idea for how to present the results. Instead of making graphs from data, it makes sentences or even short passages, computer-generated but composed so that they read like something a human wrote. The company calls it a novel kind of visualization. I think it’s absolutely brilliant, both as a way forward for handling a huge amount of information, and for the amount of cleverness that must have gone into programming a computer to take scores from a game and generate something like this:

WISCONSIN appears to be in the driver’s seat en route to a win, as it leads 51-10 after the third quarter…

As I’ve mentioned, I like my science in story form; decontextualized data, however useful they may be, are automatically less compelling to me. So I’m glad that, in this age when the zeitgeist seems to be moving from the longform newspapers of the past to the Twitters of the future, there’s still some consensus that narrative is a good vehicle for understanding the world. According to some neuroscientists, perhaps it is the best vehicle.

When I think it over, I realize that nature has, or is, at least as staggering a dataset as anything they’ve got at NCBI, with considerably more information, much better encrypted. People have been trying to extract laws and trends from it for generations. All we’ve accomplished with our microarrays and our high-throughput proteomic studies is to remove one step between framing a question and finding out the answer. Plenty of work of interpretation and meaning-making remains to be done.



