Data format

The lack of interoperability among data collected by different sonar systems is currently a major obstacle toward integrative analysis of sonar data at large scales. echopype aims at addressing this problem by providing tools for converting data from manufacturer-specific formats into a standardized netCDF file format. NetCDF is the current defacto standard in climate research and is supported by many powerful Python packages for efficient computation, of which echopype take advantage in its data analysis modules.

Interoperable netCDF files

Echopype follows the ICES SONAR-netCDF4 convention when possible to create an interoperable data format to which all data are converted to. We made modifications to the file structure in the convention so that the computation can take full advantage of the power of xarray in manipulating labelled multi-dimensional arrays. See Modifications to SONAR-netCDF4 for details of this modification.

Echopype also supports converting raw data files into the zarr format for cloud-optimized data storage and access, following the same structure as in the netCDF files. However, computing based on the zarr format via Process is still being developed.

Modifications to SONAR-netCDF4

Echopype is designed to handle multi-dimensional labelled data sets efficiently, using xarray under the hood. Therefore, we store backscatter data (the echoes) from different frequency channels in a multi-dimensional array under a single Beam group within a netCDF file. Because of this change, all frequency-dependent parameters, such as absorption coefficients, sample intervals, etc., are stored as an array with a frequency coordinate.

This is different from the SONAR-netCDF4 convention, in which data and parameters from different frequency channels are stored in different beam groups under the Sonar group. In the convention this was designed to accommodate potential differences in the number of bins along range, or when there is a change of the temporal length of data collection in the middle of a file.

However, it is more convenient to store and slice data directly by the time, range, and frequency/beam direction coordinates (see pandas and xarray documentation for more info about coordinates and dimensions) when the data are stored in a cubic form. To accommodate this change, in the above two cases, echopype

  • handles the uneven number of data samples along range by filling in NaN for the shorter channels, and
  • splits the raw data file into multiple files when there is a change of the temporal length of data collection along range in the middle of a file.

In addition to computational efficiency, another advantage of echopype’s approach in restructuring the netCDF format is to enhance the code readability and make data analysis computations more tractable. For example, to extract data from a particular frequency, users can simply do the following without worrying about the numerical sequence of the index of the selected frequency:

import xarray as xr
fname = 'some-path/'
ds = xr.open_dataset(fname, group='Beam')  # open file as an xarray DataSet
data_120k = ds.backscatter_r.sel(frequency=120000)  # explicit indexing for frequency