APL Ocean Remote Sensing
 The Practical Oceanographer 

Data Acquisition and Analysis

Typically getting the instruments successfully deployed in the field is only the first half of the battle. The second half is to acquire, record and analyze the data from these instruments. Usually these data will have to be combined with data from other instruments, so timing of the data streams can become important. These data are incredibly expensive to obtain. The volume of data from a typical multi-million dollar experiment may amount to anywhere from just a few hundred kilobytes up to a few hundred gigabytes. You can't afford to lose the data once it has been acquired so you'll need to consider schemes for protecting and backing it up. Finally, there are various levels of analysis to be considered, including analysis done on board and quick look analysis performed back on shore to validate the data set prior to distribution to others.

Timing

Timing should be easy. Man regularly measures time to sub-nanosecond accuracy. Timing seems to be easy. After all, every computer built since the Apple II has a clock built into it (actually since the Apple IIe). Timing often turns out to be a problem though, precisely because it seems so easy. People rarely devote the time and effort to timing that they should. I cannot tell you how many times I have seen allegedly mature scientists synchronizing their systems to their wristwatch. This is almost never adequate!

The importance of experiment-wide timing solutions and planning has already been discussed in the chapter on preparations. In this section I discuss some of the details of how to provide timing for individual instruments and data acquisition systems. The key in data acquisition is to understand what your timing requirements are, both for individual as well as joint data streams, and to include timing data in the data streams so that these requirements can be met. The same golden rule that guides instrument design (a belief that something will fail) should guide your design of timing within your data acquisition system.

Requirements

Timing requirements depend on the rates of the data involved and inter-system synchronization requirements. If you are on a cruise making CTD stations every hour, then accuracies of several minutes may be all that is required. If, on the other hand, you are acquiring high resolution temperature chain data on one system, and ship roll on another, with the plan to later merge the two data streams, then the timing requirements will be set by the roll period of the ship which may be on the order of a few seconds. In this case the timing accuracy should be sufficient to ensure a small phase difference between the common fluctuations in the two data streams, leading to a timing accuracy requirement of less than a second for a moderate-size vessel. At the other extreme, consider the problem of comparing radar backscatter with video imagery of breaking waves. In this case the timing accuracy will be set by the rates of breaking wave evolution, which has scales of less than a second. Timing accuracies for such a system would then have to be an order of magnitude better than the fastest phenomenon and hence be measured in the tens of milliseconds.
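To make the phase argument concrete, here is a minimal sketch that converts a timing offset into a phase error at a given period, and the reverse. The function names, the 8 second roll period, and the 10 degree phase budget are illustrative assumptions, not values from any particular experiment.

```python
def phase_error_degrees(timing_error_s: float, period_s: float) -> float:
    """Phase error, in degrees, that a timing offset introduces at a given period."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return 360.0 * timing_error_s / period_s


def max_timing_error_s(period_s: float, max_phase_deg: float) -> float:
    """Largest timing offset that keeps the phase error under max_phase_deg."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return period_s * max_phase_deg / 360.0


if __name__ == "__main__":
    # Assumed numbers: an 8 s roll period and a 10 degree phase budget.
    print(phase_error_degrees(1.0, 8.0))  # a 1 s offset is 45 degrees of phase
    print(max_timing_error_s(8.0, 10.0))  # roughly 0.22 s
```

With these assumed numbers a one second offset already costs an eighth of a cycle, which is why the requirement for merged roll data lands below a second.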

When deciding on the timing accuracy requirement for a particular system, always consider the fastest data stream that might possibly be combined with this system's data. Disk storage is cheap, so when in doubt, it pays to acquire data at higher rates and specify better timing accuracy than you might otherwise deem necessary. As my friend, Michael Paulkovich, has pointed out (Paulkovich's Rule of Resource Margins): "Whatever you think you need, triple it; then add a fudge factor. (And this will still not be enough.)"

Sources of Time for Data Streams

Timing information for a data stream is typically obtained from the instrument itself, the data acquisition computer, or an external clock. Timing derived from the instrument or data acquisition computer usually suffers from similar accuracy limitations, while external clocks range from poor to more accurate than any experimenter will ever need.

Many modern instruments are internally computerized and provide their own time signals within their output data streams. Data from less able instruments are often time tagged using time information derived from the computer clock within the data acquisition system. Invariably both of these alternatives rely on a quartz crystal oscillator of one sort or another, and both suffer from the same accuracy limitations. In the field, my rule of thumb is that the standard quartz crystal oscillator can drift as much as 10 seconds per day. The causes for this drift are elucidated below. While this is a worst-case estimate, you should count on it. (Remember Murphy's Law.)

If you need more accuracy, then use an external clock. External clocks can be oven-stabilized quartz clocks, cesium clocks, or satellite clocks. For oceanographic work, GPS-based clocks can provide accurate timing and navigation, so they are hard to beat. If you go this route, just be sure to use a GPS receiver that is designed for timing. The cheapest GPS navigation receivers that you can buy use asynchronous computer designs that spit out the navigation and time information when the relatively slow internal computer gets around to it. These systems, which can be off by up to several seconds, are not suitable for high accuracy timing. It doesn't cost much more to buy a GPS receiver designed to provide accurate timing.

Clock Drift

Above, I mentioned an outrageously large clock drift value of 10 seconds per day for quartz crystal clocks. I can hear your objection now, because I have heard it many times before. You are thinking, "The clock in my PC (or Mac, or Unix workstation) has never drifted that much." I am sure that you are right. In the office, a PC system clock usually is good to a few seconds during a month. In the field though, that exact same clock can exhibit significantly worse drift. Quartz crystal oscillators are amazingly good at keeping time, but they have well known dependencies on operating voltage and temperature. In the office, the line voltage and ambient temperature are well regulated. This is in sharp contrast to the field, where line voltages and ambient temperatures can fluctuate all over the place. Quartz clocks in the field are never as accurate as those in the office.
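The 10 seconds per day rule of thumb can be turned into a planning number. The sketch below assumes the drift accumulates linearly at the worst-case rate, which is a simplification, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400.0


def drift_ppm(drift_s_per_day: float) -> float:
    """Express a drift rate in parts per million of elapsed time."""
    return drift_s_per_day / SECONDS_PER_DAY * 1e6


def resync_interval_s(drift_s_per_day: float, max_error_s: float) -> float:
    """Longest time between clock resets that keeps accumulated drift
    under max_error_s, assuming linear worst-case drift."""
    if drift_s_per_day <= 0:
        raise ValueError("drift_s_per_day must be positive")
    return max_error_s * SECONDS_PER_DAY / drift_s_per_day


if __name__ == "__main__":
    print(round(drift_ppm(10.0), 1))           # about 115.7 ppm
    print(resync_interval_s(10.0, 1.0) / 3600)  # hours between resets for a 1 s budget
    print(resync_interval_s(10.0, 0.05) / 60)   # minutes between resets for a 50 ms budget
```

Under this assumption a one second budget calls for a reset at least every 2.4 hours, and a 50 millisecond budget for one about every 7 minutes.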

In addition to quartz oscillator drift, the common IBM PC has another, sometimes more dangerous clock problem built into it. Within a standard IBM-compatible PC there are two clocks, the CPU system clock and a real time clock. The CPU system clock is maintained by software from the quartz oscillator that drives the CPU. The real time clock runs from a separate oscillator within the computer. When you use the lowest-level system calls to obtain the time, you are actually reading the CPU system clock. The problem is that this clock, because it is maintained by software, can occasionally miss a tick if the system interrupts are turned off for any length of time. Most programs do not turn off the system interrupts so this is rarely a problem for the typical user. Unfortunately, scientific users are not typical. One of the only classes of programs to turn off system interrupts are those that need to maximize their response time to external events, such as data acquisition programs. This is not good! The problem is so severe that I have seen data acquisition systems lose minutes per day because of a reliance on the CPU system clock that was missing ticks due to the data acquisition system design. While software can be designed to use the system clock without losing ticks, I prefer to invoke a few more software commands to rely on the real time clock instead. It will suffer the same drift that all quartz oscillators do, but it won't drift according to the CPU loading.

Solutions

Many of the experiments that we go on require timing accuracies on the order of a second or two. A typical solution for us is to take a GPS or GOES clock along on the experiment. The RS232 data output from this clock is then piped around to all of the computer data acquisition systems. (The RS232 outputs from the computers are left unconnected.) The individual data acquisition systems are designed to rely on the internal real time clocks, but are set up to periodically set the real time clocks from the serial time stream. This design synchronizes all of the data acquisition clocks automatically and has the advantage that the systems will continue to work if the satellite clock fails.
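The format of the serial time stream depends on the clock. As a rough sketch only, the code below assumes the clock emits NMEA 0183 ZDA sentences (a time and date sentence that many GPS receivers can output), checks the checksum, and reports how far a local clock reading is from the reference. The function names are hypothetical, and actually setting the real time clock is platform-specific and left out.

```python
from datetime import datetime, timezone
from functools import reduce


def nmea_checksum(payload: str) -> str:
    """XOR of all characters between '$' and '*', as two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), payload, 0):02X}"


def parse_zda(sentence: str) -> datetime:
    """Parse an assumed '$..ZDA,hhmmss.ss,dd,mm,yyyy,...*CS' sentence into UTC."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        raise ValueError("not an NMEA sentence")
    payload, checksum = sentence[1:].split("*", 1)
    if nmea_checksum(payload) != checksum.strip().upper():
        raise ValueError("checksum mismatch")
    fields = payload.split(",")
    if len(fields) < 5 or not fields[0].endswith("ZDA"):
        raise ValueError("not a ZDA sentence")
    hhmmss, day, month, year = fields[1:5]
    seconds = float(hhmmss[4:])
    whole = int(seconds)
    # Clamp so rounding can never produce a full extra second of microseconds.
    micro = min(int(round((seconds - whole) * 1_000_000)), 999_999)
    return datetime(int(year), int(month), int(day),
                    int(hhmmss[0:2]), int(hhmmss[2:4]), whole, micro,
                    tzinfo=timezone.utc)


def clock_offset_s(reference: datetime, local: datetime) -> float:
    """Seconds the local clock is ahead of (+) or behind (-) the reference."""
    return (local - reference).total_seconds()
```

A data acquisition program built this way would keep time-tagging from its own real time clock and apply the offset only when a valid sentence arrives, so a failed satellite clock simply means no further corrections.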

For even higher accuracy timing, specialized boards are usually needed or efforts have to be made to design the higher accuracy timing directly into the instrument output data streams.

On a related topic, you should carefully consider those data streams that you know will have to be merged after the experiment. A good rule of thumb is to record such data streams onto the same data acquisition system, interleaving the data streams to be combined into the same file. The alternative design, of using two separate data acquisition systems, requires that the timing within both systems be accurate. If, on the other hand, the data are interleaved into a single file, then the data can be merged after the experiment, even if the timing data are lost.
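One way to picture the interleaving idea is a single log in which every record carries a running sequence number, a time tag, and the name of the stream it came from. This is a minimal sketch with a made-up tab-separated layout and hypothetical names; the point is that the record order survives even if the time column turns out to be useless.

```python
import io
import time
from typing import Callable, Dict, List, TextIO, Tuple


class InterleavedLogger:
    """Write records from several streams into one file, in arrival order."""

    def __init__(self, sink: TextIO, clock: Callable[[], float] = time.time):
        self._sink = sink
        self._clock = clock
        self._seq = 0

    def write(self, stream: str, value: str) -> None:
        # Sequence number first, so ordering does not depend on the clock.
        self._seq += 1
        self._sink.write(f"{self._seq}\t{self._clock():.3f}\t{stream}\t{value}\n")


def split_streams(text: str) -> Dict[str, List[Tuple[int, str]]]:
    """Recover each stream with its sequence numbers, ignoring the time column."""
    streams: Dict[str, List[Tuple[int, str]]] = {}
    for line in text.splitlines():
        seq, _time_tag, stream, value = line.split("\t", 3)
        streams.setdefault(stream, []).append((int(seq), value))
    return streams


if __name__ == "__main__":
    buffer = io.StringIO()
    log = InterleavedLogger(buffer)
    log.write("tchain", "12.31")
    log.write("roll", "-3.2")
    log.write("tchain", "12.29")
    print(split_streams(buffer.getvalue()))
```

Because the roll sample sits between the two temperature samples in the file, the two streams can still be lined up after the fact without trusting either clock.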

Backup

I have stressed the dollar costs of performing an experiment at sea. If you have been to sea you also know the personal costs - the time spent in preparation and execution can be immense. Because of these costs, the digital data that you acquire at sea are incredibly valuable. These data are invariably recorded on some form of magnetic media, either disks or tape. When we add to this the simple fact that all magnetic media will eventually fail, then we have a possible recipe for disaster.

You may think I exaggerate. Typical lifetimes of data on magnetic tape run 10 years. Typical lifetimes on floppies are shorter and on hard drives longer. Still, all of these media will fail at some point. The oxide will flake, the magnetic field strength will decay, or the drive will fail. Even more of a danger comes from accidents. Something can be spilled onto the media. Or, fear of all fears, a magnet will come near the media. When I go through airports, I always fear the X-ray machines. The guards will always tell you that the X-rays are safe and they are right. I don't fear the X-rays. I fear the magnetic fields set up by the motors that drive the conveyor belt. I nearly caused an international incident in the Hamburg airport once when I refused to allow them to run 80 Gigabytes of data on Exabyte tapes through their X-ray machine. I know I was being a bit fanatical, but those tapes had cost me a month of my life, and I was determined to get them home intact.

The solution to the problem of unreliability of magnetic media is simple: you should back up your data. Typically, I will back up data each day of an experiment onto a separate backup disk or tape. I fully label these disks or tapes, set their write protect tabs to prevent accidental reuse and keep them separate from the main data disks. At the end of the experiment, if the volume of data is not too great, I'll then combine all the data from the main disks onto a few tapes or disks. I then hand carry these summary tapes back home. I always carry one set back and ship the other. That way if the shipment is lost, or my prize data and I get run over by a pizza delivery guy, then at least one set of data survives.
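A daily backup step is easy to script. The sketch below is one possible approach, not a description of any particular system: it copies a file to a separate directory, verifies the copy with a SHA-256 checksum, refuses to overwrite an earlier backup, and marks the copy read-only as a software stand-in for the write protect tab. The function names and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir, verify the copy, and make it read-only."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"verification failed for {target}")
    target.chmod(0o444)  # read-only: guards against accidental reuse
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "day01.dat"
        data.write_bytes(b"example record\n")
        print(backup_file(data, Path(tmp) / "backup"))
```

Keeping the checksums alongside the labels also gives a quick way to confirm, back at the lab, that the carried set and the shipped set are still identical.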

Data backup is not hard. While the risk of data loss on modern media is small, a backup plan should be a part of every experiment.

On Board Analysis

It used to be that scientists would go into the field, acquire their data, and then go back to their labs to perform the analysis. This separation of the acquisition and analysis phases was dictated by the lack of computing power and time on board the research vessel. While I won't suggest that all analysis needs to be done on board, especially for those systems whose data are inaccessible like deep moorings, it is of critical importance to perform some on board examination of the data that are available.

There are two reasons for this desire to perform on board data analysis. First, and foremost, you can best determine if your instruments and system are working if you analyze the data. I have seen the anguish of scientists who discovered that mistakes were made in the field months after the experiment was over, and it is not a pretty sight. In this day and age there is no reason for this. I think it was Archimedes who said that if given a sufficiently big computer and a place to stand he could rule the world. (Or was that a slide rule?) Now we have that level of computing power available to us on a desktop along with advanced data analysis languages such as Matlab or IDL. Your goal here should be to perform sufficient analysis to ensure that your instruments are working and that you are acquiring data of sufficient quality and quantity to satisfy the needs of your final analysis.

The second reason for on board data analysis is to provide data to guide the conduct of the experiment. For example, I regularly perform a detailed real-time analysis of CTD data during experiments involving internal waves. Calculation of the solutions to the Taylor-Goldstein equations provides information regarding both the vertical structure of internal waves trapped in the pycnocline as well as their dispersion relation. The dispersion relation is then used to better estimate the timing of key observations of internal waves, thus affecting the actual conduct of the experiment.

In other cases, real time data analysis will reveal features that warrant additional investigation. On one test in a coastal environment, I had been regularly examining current data from 3 instruments on the APL spar buoy in real time. When the system was started up on one particular day, I noticed that the vertical channel of the instrument at 3 m depth was very noisy, while the instruments at 1 m and 8 m depth were quiet. I concluded with dismay that the middle instrument was failing. Later in the day, I noticed that the 3 m vertical channel had quieted down. When I looked at all of the current time series for the day, it was clear that the vertical channel at 3 m depth was correlated with especially strong shears at the base of the mixed layer, which happened to be only 3 m deep. I quickly worked up some estimates for the bulk Richardson number, a measure of the stability of the water column, and found a near perfect correlation with the noise at 3 m depth. So instead of equipment failure, I was directly observing the baroclinic instabilities induced by the near-surface current shear. Neat stuff. And even better yet, because I was able to identify the situation, we redoubled our efforts to obtain profiles of currents and stratification to get a more complete picture of this phenomenon.
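A quick estimate of this kind takes only a few lines. The sketch below uses one common form of the bulk Richardson number, Ri = g Δρ Δz / (ρ0 ΔU²), computed from the density and velocity differences across a layer of thickness Δz. The function name, the reference density, and the sample values are assumptions for illustration, not the numbers from that experiment.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def bulk_richardson(delta_rho: float, delta_u: float, delta_z: float,
                    rho0: float = 1025.0) -> float:
    """Bulk Richardson number across a layer.

    delta_rho: density difference across the layer, kg/m^3
    delta_u:   velocity difference across the layer, m/s
    delta_z:   layer thickness, m
    rho0:      reference density, kg/m^3 (assumed seawater value)
    """
    if delta_u == 0:
        return math.inf  # no shear: the layer is as stable as it can be
    return G * delta_rho * delta_z / (rho0 * delta_u ** 2)


if __name__ == "__main__":
    # Made-up values: 0.5 kg/m^3 and 0.2 m/s across a 2 m layer.
    print(round(bulk_richardson(0.5, 0.2, 2.0), 3))
```

Small values indicate that shear is winning over stratification. The classical threshold of 1/4 applies to the gradient form of the number, so a bulk estimate like this is best read as a relative indicator from one time to the next.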

On the Nova expedition, undertaken in the late 60's to study the geology of the South Pacific, the goal was to analyze all of the data as it was taken. As Menard put it:

...We expect to analyze all the data as rapidly as they accumulate, which will require a newly developed method of computer handling. If it works, all the preliminary analysis and much of the final analysis of all the millions of observations will be completed by the time the ships return to San Diego. I certainly hope so, not only because it is essential for planning the later parts of the expedition but also because I shall not go to sea again until it is done. American oceanographic laboratories are loaded with undigested observations, and I do not want to be guilty of adding to the pile.

It is useful to recall that of the two ships participating in the Nova expedition, only one was equipped with a computer. Now, with computers everywhere, on board analysis should be simple!

You should be aware of at least two problems with onboard analysis. The first is mental inertia; don't fall into the trap of treating on board analysis products as the final word on the subject. It is a rare experiment indeed that you can walk off the ship with all of your analysis work done. I am always careful to label on board data results as preliminary and make a point of warning my colleagues about using such products prior to a more complete work up. The second problem is that the analysis results may influence you to alter your test plans in ways that are not actually for the best. For example, if I had done no further analysis on the noisy 3 m current data, I would have continued to believe that the instrument was failing. Since this was an important instrument I might have actually called out a diver to replace the system at some expense and disruption to the experiment. It is all too easy to initially jump to the wrong conclusion. For this reason you need to guard against the use of misleading or incomplete analyses during the experiment.

Goals for Cruise End

My typical goal is to walk off the ship at cruise end with a complete cruise report, containing my cruise logs along with sample data products and analyses performed on board, as well as a complete data catalog. My goal is to have this report on computer, formatted and ready to print as soon as I get back to the lab. I'll admit that this is an ambitious goal and that I rarely achieve it, but it is good to set your goals high. Furthermore, you'll need all of this information at some point anyway, so why not try to assemble it all on board ship?

I also find that putting out a cruise report within a few weeks of the end of a cruise works to document the cruise for yourself and your colleagues in ways that handwritten notes and logs cannot. While this report may be full of the blemishes associated with a preliminary effort, it should have all of the detail necessary for someone to evaluate the cruise and the quantity, if not quality, of the data that you obtained.

Quick Look Analysis

The next phase of analysis occurs after you leave the ship. In a typical ONR sponsored experiment, there will be an experimenters meeting within one to two months of the end of the cruise. I usually set my next goals to coincide with this meeting. Prior to this meeting I work hard to complete all pertinent data catalogs. I also perform sufficient analysis to evaluate the quality of my complete data sets. Finally, I attempt to process any general-purpose environmental data that I have agreed to share with the community to the point that it can be distributed. I refer to this work as quick look analysis. It is only after the completion of the quick look analysis that I actually turn to performing the research that I set out to do.

My approach to the question of priorities here is clearly different from that of most of my colleagues. On a typical experiment, I will be responsible for a number of measurements. Some of these are quite general in nature, such as wind speed and direction or mean currents, and are needed by all of my colleagues for their research. Other measurements I make, for example high-speed vertical velocity measurements to estimate turbulence, are solely for my own research and are not needed by my colleagues. When confronted with the choice of which to analyze first, I prefer to work up the general-purpose data to the point that it can be distributed.

There are several reasons. First, I typically need the general-purpose data myself for my own research. Second, the general-purpose data typically will be used to select times and environments favorable for the more detailed analysis to come. Thus it serves as a guide into the full data set. Third, I feel that I owe it to my colleagues to produce promised data sets in a timely fashion so that they can get on with their research. It is also good to use this as moral leverage to get your colleagues to work up their data that is of use to you. (I don't believe in holding data back as an inducement. Remember it's not really your data, it belongs to the sponsors and, at least usually, the taxpayers!) Last, but not least, I am a notorious procrastinator and so I treat my research as the carrot to get me to do the general work up front.

Naturally I understand that each data set and experimental group is different. On one experiment, it took my colleagues two years to work up the data into a preliminarily useful form! Their data set did contain over 160 gigabytes so I suppose this is not an unreasonable length of time. So as with everything in this book, don't go overboard on my suggestions. I do suggest though that you treat your commitment to your colleagues as if it were as important as your own work.

One final point on quick look data analysis is worth making. I was on one experiment a few years ago where three different groups measured the wind speed from three different locations about the experiment site. Afterwards each produced time series of the winds that they distributed to other researchers. Naturally none of the measurements agreed exactly, which led to numerous arguments about which wind speed was "correct" for a particular analysis. In a subsequent experiment, we took it upon ourselves to coordinate the wind analyses. A single report was produced which included all of the data along with the locations of each measurement and a mean wind value for the area. This approach eliminated any squabbling about which measurement should be used and at the same time, by intercomparing the measurements, provided a mechanism for checking all of the measurements. Everyone wins when they cooperate in joint analyses.

Data Distribution

I cannot tell you how many programs I have been involved with where the reality of post-experiment data distribution bears no resemblance to the pre-experiment promises of various experimenters. Too often investigators promise sponsors that their data will be made available to their colleagues, only to conveniently forget their promises later. The other out that I often see taken is to provide copies of books of data plots to interested colleagues. I have always thought of this as a cop out, because in this age of the computer, paper copies of data do me little good.

To me, the only good way to distribute data is digitally. I prefer to provide my colleagues with computer data files that they can use in their own analyses. I have personally progressed through three stages in the evolution of electronic data transfer. In the first stage I would distribute a report providing an overview of the data set. I would then tell everyone that received a copy of the report that the data were available in digital form, typically 9-track tapes. (How's that for convenient!) The next step up the evolutionary ladder was to include an IBM-compatible floppy disk right in the report itself. The disks were cheap and easy to duplicate, so it didn't matter that some got the data who didn't really want it. There still was a problem with these approaches. Inevitably, several years would pass and someone would pop up and ask for a copy of the data. Unless I stockpiled a sufficient number of the data disks, it was always a chore to dig up the old data to create a new copy. Discovering a mistake in the data, which required the distribution of a corrected data file, was another problem. While the work in sending out one set of disks was acceptable to me, the distribution of corrected disks was a pain.

This led me to the ultimate step in the evolutionary chain, to a higher plane where all is sweet and light; where children and adults play unimpeded by the encumbrances of physical reality. I now distribute data via the Internet. (Sorry if I got your hopes up a little too high with that lead in.) When I get involved in a new experiment I offer to set up a group FTP site for the program. Notice I said group FTP site, not anonymous. To do this we set up a single account with username and password on one of our systems. We limit this account so it can access only a selected portion of our file system and further limit upload and modification privileges to a single upload directory. As the FTP administrators we set up a beginning directory structure and stock this with our own data sets. The site is then announced to the community, who all share the same username and password to gain access. All accesses are logged so that we can keep track of who is utilizing the data. Finally we encourage everyone to upload useful data to share with others in the community. In this way the program gains a central repository of data of all types.

In each experiment I am involved in I still find that I have to sell the utility of this approach to the community, but each time the sales pitch is easier to make. Some resistance will come from those who don't want to share all of their data. To this I say, fine, just share what you want. Some resistance comes from those who are afraid that they will lose control of their data. To this I say the data are only available to our colleagues in the program and making the data available electronically in no way reduces the responsibility of a data "user" to obtain permission from the data "owner".

Of course this is not really the ultimate system. The data are cataloged, archived, and easily available, but it is not easy to browse a bunch of ASCII data files. Some of my colleagues at APL have begun using Netscape and the HyperText Markup Language (HTML) to document their work. I imagine that after my next experiment, we will use these powerful tools to enhance the group databases with easy browsing features allowing our colleagues to peruse the data graphically to select just what they need.

Independent of how you choose to distribute your data after an experiment, it is of the utmost importance to also distribute a description of the data. I cannot tell you how many times I have gotten data from someone, only to find that I had to decipher the contents as if I were some cryptographer working to break an enemy code. Every set of files should be accompanied by a README file, containing a complete description of the file format (a few example lines can be especially useful), the units of the values (including time), a description of any processing that was performed on the data within the file, and any other information that would be useful in understanding and interpreting the data. It is most important for the description to be complete. While I have argued strongly for the use of universal time for all oceanographic data sets, not everyone agrees and so you should always specify the time zone utilized for your measurements. Further details should also be included. For example, if the files contain averaged wind speed and direction, you should note whether you used the meteorological convention for measuring wind direction (the preferred choice where 0° indicates a wind blowing from the North, toward the South) and even the method used to average the winds. Many students are surprised to learn that there is no standard method for averaging wind speed and directions, and thus the details of the processing methods used are important.

(The process of averaging geophysical vector quantities is not as straightforward as one might think. Take wind speed as an example. The mathematically obvious solution is to perform a pure vector average, where the average is the vector sum of all the measurements divided by the number of measurements. For wind speeds, this would represent the mean distance that a parcel of air would have traveled in unit time. If in fact, you were attempting to compute the motions of a balloon, then this would be the appropriate measure. In contrast, what if you were attempting to compute the depth of the mixed layer induced by the wind and the data showed that the wind had been from the North for an hour at 10 m/s and then from the South for an hour at 10 m/s. The vector averaged wind of 0 m/s would be clearly inappropriate for this analysis. In this case a better average would be obtained by averaging the wind speed and direction as separate scalar quantities. These two examples show that the appropriateness of methods used to perform vector averaging depends on the use of the averages. Thus, whatever method is used should be specified when distributing data.)
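The North-then-South example can be checked numerically. This is a minimal sketch using the meteorological convention (direction the wind blows from, in degrees clockwise from North); the function names are hypothetical. Only the scalar mean of speed is shown, since a scalar mean of direction needs extra care around the 0°/360° wrap.

```python
import math
from typing import Iterable, Tuple

Sample = Tuple[float, float]  # (speed in m/s, direction the wind blows FROM in degrees)


def to_uv(speed: float, direction_from_deg: float) -> Tuple[float, float]:
    """Convert to eastward (u) and northward (v) components."""
    theta = math.radians(direction_from_deg)
    return -speed * math.sin(theta), -speed * math.cos(theta)


def vector_average(samples: Iterable[Sample]) -> Tuple[float, float]:
    """Average the components, then convert back to speed and direction."""
    components = [to_uv(speed, direction) for speed, direction in samples]
    u = sum(c[0] for c in components) / len(components)
    v = sum(c[1] for c in components) / len(components)
    speed = math.hypot(u, v)
    # The direction is meaningless when the averaged speed is near zero.
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction


def scalar_mean_speed(samples: Iterable[Sample]) -> float:
    """Average the speeds alone, ignoring direction."""
    speeds = [speed for speed, _ in samples]
    return sum(speeds) / len(speeds)


if __name__ == "__main__":
    winds = [(10.0, 0.0), (10.0, 180.0)]  # an hour from the North, an hour from the South
    print(vector_average(winds)[0])  # essentially zero
    print(scalar_mean_speed(winds))  # 10.0
```

The two functions give very different answers for the same two hours of data, which is exactly why the averaging method belongs in the README.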

The message here is simple. A supervisor of mine once told me that if you didn't publish your work, that you might as well not have done it. I now think if you don't make the effort to distribute data to your colleagues, that you might as well not have acquired it. And by distribute, I mean in a form that others can actually understand and use. The modern tools of the information age make the distribution and archiving of data simple and painless, so you have no excuse to do otherwise.

Data Acquisition and Analysis Checklist

  • Design for necessary timing; rely on automatic GPS clocks when possible. Beware of crystal clock drift
  • Make a backup plan and stick with it
  • Analyze data on board, if possible, to assure that systems are working and to guide the conduct of the experiment
  • Keep up with cruise logs as a way of producing a cruise report soon after the end of the experiment
  • Analyze and distribute key data as quickly as possible
  • Distribute data in a form useful to other investigators
rick.chapman@jhuapl.edu
© Rick Chapman, 1997-2004, All Rights Reserved