COMMENTARY

On the issue of rapid and free data distribution

Barbara Romanowicz, University of California, Berkeley

Should seismological data collected by individual investigators or research groups be made freely and immediately available to the general scientific community?

As recording and archival of data have entered the digital world on a routine basis, widespread means now exist to organize efficient on-line archives that are readily accessible from anywhere over the Internet. Gone are the days when you had to travel across the world to consult a valuable and unique dataset, set a few weeks or months aside to analyze the data on-site and engage in a collaborative project with the scientific team that originally collected the data. Now, you can sit at your computer terminal and, with minimal effort, "ftp" gigabytes of wiggles from remote sites without ever needing to move a toe. Meanwhile, it still takes years to design a data collection program, obtain the funding, deploy the instruments and verify the quality of the data produced, and only at the end of this crusade can you finally sit down and glean the fruits of your labor. No wonder there is some resistance to those vultures waiting around, ready to grab your data as soon as they are available and scoop you and your students before you even have time to realize it.
How do we reconcile the need for widespread use of data that, in one form or another, have been collected owing to community fund-raising efforts, and at the same time protect the rights of those scientific teams whose specific efforts have produced them?
This is not a problem for such programs as the IRIS GSN, where data generation, collection and archival are clearly separated from data usage: the Consortium members participate collectively in the design of the system, but its actual implementation is in the hands of specially funded groups, the DCC and DMC. However, it is the subject of debate regarding PASSCAL datasets, and also, to some extent, data produced by individual university-funded groups such as the one I am currently in charge of, at UC Berkeley.
I believe in making data available freely and rapidly. There are many advantages for all of us, collectively. First, no one research group can squeeze everything out of a good dataset, and there is enough creativity around to find ever new applications for it. This way you get more research results for the buck, and consequently reinforced support for continued funding. Second, the aggregate dataset available at any time to the global research community becomes much richer, since different datasets often complement each other and, by combining them, you can take your research farther than if you are restricted to the data collected in your own backyard. However, I do get frustrated too, each time I find out that some outside group has been using our data for the same purpose as we are, without even referencing any of our work.

How do we deal with such issues and make everybody happy?

One solution, currently adopted by the PASSCAL program, is to restrict access to such special datasets for a limited period of time after completion of the project (i.e., 1-2 years), in order to give the "owners" time to advance the research project which motivated the data collection. This works well, to some extent, but has some anachronistic resonances in our era of rapid access capabilities. And it does not readily apply to longer-term deployments or permanent networks. There are many instances when you might wish to utilize someone's dataset to complement other data, for a current research project unrelated to the PI's motivations, and if you have to wait that long, your interest and perhaps your funding will have waned. I would like to suggest a potentially more viable solution. All that may be needed is to develop the appropriate ethics in the community, supported by well-organized information made available through the appropriate data centers.
A simple rule can be established whereby active research topics and related publications (on-going Ph.D. theses, funded research) by data "owners" would be described and posted on the relevant data center web page. Those outside researchers who wish to access the data should take care to educate themselves about the existence of these topics and either stay away from them, or negotiate cooperative arrangements with the data "owners", and make sure to reference the latter in their publications.
How do we enforce this? There's no completely iron-clad way, of course. But there are helpful measures. For example, access to a given dataset can be conditional on having familiarized yourself with the work of the "owners", and having filled out a web questionnaire describing your intentions. In addition, reviewers for major journals could be routinely asked whether a given paper submitted for publication properly acknowledges the source of the data used. Proposals could also be reviewed with that in mind.
Much of this is already happening. In fact, the seismological community has a long tradition of no-big-deal widespread data exchange, much longer than in many other fields. In particular, the role and organization of IRIS data archival and distribution are now often cited as a model in other communities. All it might take to make most seismologists happy is to spell out the rules somewhat better. I'd suggest that the discussion of these rules also be extended to other types of geophysical data that are increasingly facing the same issues, such as GPS data.
