Big Data Part III: Other People's Data
October 4, 2012
One afternoon in March 2012, a casually dressed crowd of more than 60 researchers crammed themselves into the largest room at NEON headquarters in Boulder, Colorado. The principal investigators of the National Science Foundation’s new Macrosystems Biology program had gathered to exchange knowledge and ideas and to explore the needs and implications of these data-intensive, large-team research projects.
The researchers discussed their data needs. All of the projects were planning to collect huge amounts of data in the field and/or harvest large amounts of existing data, in some cases from multiple sources or from NEON itself. Each team had a dedicated data manager taking a lead role on issues like data storage and best practices to make their data sets easy to use and cite.
The researchers also talked about careers, particularly the careers of the younger members of their teams. Academic scientists need to establish a professional reputation to land jobs, promotions and funding, and the primary way to do so is still by authoring articles in peer-reviewed scientific journals – the more articles and the more prestigious the journals, the better. Large team projects can produce journal articles with many co-authors, and the more co-authors a paper has, the smaller each author’s perceived share of the credit. Would hiring committees select against team players in favor of researchers who publish more scientific articles with fewer co-authors, pursue more of their own smaller projects and control more funding in their own labs? What’s more, how could or should the macrosystems researchers acknowledge contributions of data or expertise from hundreds or even thousands of other scientists?
The potential payoffs of Big Data ecology to science and society are too compelling to ignore. Ecologists and scientists in other disciplines need scientific Big Data to ask and answer a totally new category of broad and deep questions about how our world works.
But the structure and rewards of Big Data ecology projects are very different from those of the much smaller research projects that have traditionally dominated ecological science. That ecologists working with large, curated data sets are still worried about how their work fits into the academic reward system is an indicator that large-scale collaboration and data handling are still not familiar or well-established ways of doing business in ecology. That is already changing, and all signs point to it continuing to do so.
From one data set, many publications
The National Center for Ecological Analysis and Synthesis (NCEAS) carved out a prominent place in ecological science for collaborative meta-analyses and ecological modeling, both of which often depend on the sharing and re-use of existing data sets. But despite pioneering efforts from NCEAS, the Long-Term Ecological Research (LTER) Network and other organizations to support data sharing and reuse among ecologists, data sharing and data-intensive ecology have not been embraced by much of the ecological research community. As Marc Cadotte and Caroline Tucker pointed out in an informal analysis last year, most of the articles published recently in prominent ecology journals are still about classic observational or experimental research, and most of that research is conducted by scientists who both collected and analyzed the data themselves.
The culture of ecology has long distinguished between field-going or laboratory researchers and “armchair” or “indoor” ecologists who specialize in theoretical work or statistical analyses. In addition, some of the reluctance to support more data-centric ecological study may come from researchers who dismiss the practice of sharing their own data or re-using other people’s as too much work for too little personal reward.
Indeed, the context-dependent nature of ecological data is both an incentive for its preservation and a barrier to its re-use. In a 2008 survey of ecologists who re-used data, Ann Zimmerman noted that the single most important prerequisite to re-using other researchers’ data was understanding enough about the data and how it was collected to confidently interpret its meaning and its usefulness in other analyses. This can be a huge challenge when data formatting practices and collection methods are largely idiosyncratic or tied to unique research questions, and when technicians or researchers vary greatly in skill.
The time and effort required to document, standardize, and explain a data set well enough to make it useful to someone else makes data sharing an unattractive proposition, especially when there are no clear professional rewards for sharing and when a researcher has already spent much time and funding on collecting the data in the first place.
NEON, on the other hand, is employing a huge amount of field and other expertise to make its data products as useful and accessible to the widest audience possible. A massive amount of documentation, automation, standardization and quality control, including an audit system for biological samples and outsourced analyses, goes into guaranteeing the quality, consistency and accessibility of NEON data.
Large, public science infrastructures like NEON have an inherent incentive to make their data and other resources accessible and useful to a broad user group, as they must maximize the scientific, educational and societal returns from large investments of public funds. Participants in data synthesis projects and other large collaborations similarly face more incentives to share resources and data than individuals whose careers are based on smaller, self-contained projects. Sizeable, shared scientific resources such as enormous telescopes and supercolliders have advanced more open and collaborative work models in other science disciplines. National Science Foundation initiatives like NEON, LTER, NCEAS, NESCent, and DataONE are doing the same for ecological science.
It’s possible now to build a whole career around analyzing previously published data, which is faster and cheaper than relying on going into the field or the lab to get your data.
-David Hembry, "What’s changed in evolution and ecology since I started my PhD"
Social networks in science
As the scope of scientific hypotheses has changed, the shape and structure of scientific work have changed as well. For most of the history of ecology, individual researchers and small teams of researchers have worked in pristine areas with limited physical infrastructure to address the hypotheses that captured the imaginations of the scientific community at the time. The advent of field stations introduced more collaboration and coordination within and between teams, but it also reinforced communities of practice that enabled the emergence of scientific monopolies.
The size and diversity of scientific teams are increasing along with the nuance and complexity of scientific hypotheses. At the same time, rapid advances in computational power and Internet technology have made new kinds and cultures of innovation and communication possible. As scientists increasingly adopt the tools and more open values of the online world and become accustomed to having ready access to huge amounts of data and large networks of collaborators, the broader scientific community of practice will grow, and some of the tallest cultural barriers to massively collaborative, data-centric ecology will continue to shrink.
At NEON and other large science projects, successfully employing the complementary talents of a large team of experts to address broad and complex questions requires significant investments in project management and cross-discipline communication – two skills that academic training doesn’t usually address. As team science becomes more common and teams of scientists become larger, more professional scientists are beginning to recognize the need for the skills required to work well in larger teams, although these skills are still not accorded the same value as the ability to publish papers in prominent scientific journals.
Science magazine, for one, offers a basic tutorial on Project Management for Scientists in its career advice section, and at least one business has developed project management software specifically for research labs. Opportunities for written and verbal communication training are increasingly being offered to scientists as specialized fellowships and workshops and as an integral part of academic training.
Historically, competitive research advantage accrued to those individuals and groups who first conducted the experiments and captured new data, for they could ask and then answer questions before others. The rise of large-scale, shared instrumentation is necessitating new models of sharing and collaboration across disciplines and research cultures. When many groups have access to the same data, advantage shifts to those who can ask and answer better questions.
-Daniel Reed, "My Scientific Big Data Are Lonely"
It’s worth noting that other changes to career incentives in ecological science are afoot. Policies that encourage or require the publication of data sets as their own citable data papers are helping formalize the role of data sharing in the science ecosystem. Heavy criticism of the peer review process that vets scientific articles for publication has gone hand in hand with criticism of many publication-based metrics that are widely used to evaluate scientific productivity, and innovative teams of scientists have put forth several alternatives to both traditional peer review and publication-based metrics. Many of these alternatives have been formulated to take into account measurable forms of engagement with the science community, such as data sharing and posting constructive feedback in online fora.
Meanwhile, early-career scientists are coming of age in a world where it takes a few keystrokes to find potential collaborators, sift through massive amounts of scientific literature, and locate and download a useful piece of computer code. Innovations like Internet search, social media, and open-source software are changing the way science is done. In an era of digital information and communication, the price of resource sharing and communication is swiftly dropping. The timing couldn’t be better for the rise of large-scale, long-term scientific collaboration. As the research ecosystem shifts to support more projects that require more extensive collaboration, the price of not sharing and communicating may rise.
The most successful Big Data ecologists of the very near future will be the ones who can complement traditional science training with the skills to effectively locate and manage the community resources they need – data sets, analytical tools, and their knowledgeable colleagues – to help them pursue the answers to some of the most exciting science questions at the edge of human knowledge.
Brian Wee contributed to this piece.