Working with converted files

Open a converted netCDF or Zarr dataset

Converted netCDF files can be opened with the open_converted function, which returns a lazy-loaded EchoData object (only metadata are read during opening):

import echopype as ep
file_path = "./converted_files/file.nc"      # path to a converted nc file
ed = ep.open_converted(file_path)            # create an EchoData object

Likewise, specify the path to open a Zarr dataset. To open such a dataset from cloud storage, use the same storage_options parameter as with open_raw. For example:

s3_path = "s3://s3bucketname/directory_path/dataset.zarr"     # S3 dataset path
ed = ep.open_converted(s3_path, storage_options={"anon": True})

Combine EchoData objects

Data collected by the same instrument deployment across multiple files can be combined into a single EchoData object using combine_echodata. Since echopype version 0.6.3, a large number of files can be combined in parallel (using Dask) while maintaining stable memory usage. Under the hood, this is done by concatenating data directly into a Zarr store that backs the final combined EchoData object.

To use combine_echodata, the following criteria must be met (a minimal pre-combine check is sketched after this list):

  • Each EchoData object must have the same sonar_model

  • The EchoData objects to be combined must correspond to different raw data files (i.e., no duplicated files)

  • The EchoData objects in the list must be in sequential time order; specifically, the first timestamp of each EchoData object must be earlier than the first timestamp of the subsequent EchoData object

  • The EchoData objects must contain the same frequency channels and the same number of channels

  • The following attribute criteria must be satisfied for all groups under each of the EchoData objects to be combined:

    • the names of all attributes must be the same

    • the values of all attributes must be identical (except for date_created and conversion_time, which need only have the same data type)
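
Below is a minimal sketch of a pre-combine check covering the sonar_model and time-ordering criteria above. The group path Sonar/Beam_group1 and the helper name precombine_check are illustrative assumptions, not part of the echopype API:

def precombine_check(ed_list):
    # All objects must report the same sonar_model
    models = {ed.sonar_model for ed in ed_list}
    if len(models) != 1:
        raise ValueError(f"Mixed sonar models: {models}")
    # First timestamps must be strictly increasing across objects;
    # ping_time is read here from Sonar/Beam_group1 (adjust the path as needed)
    first_pings = [ed["Sonar/Beam_group1"]["ping_time"].values[0] for ed in ed_list]
    if not all(t0 < t1 for t0, t1 in zip(first_pings, first_pings[1:])):
        raise ValueError("EchoData objects are not in sequential time order")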

Attention

In previous versions, combine_echodata corrected reversed timestamps and stored the uncorrected timestamps in the Provenance group. Starting from 0.6.3, combine_echodata preserves time coordinates that contain reversed timestamps; no correction is performed.

The first step in combining data is to establish a Dask client with a scheduler. On a local machine, this can be done as follows:

from dask.distributed import Client

client = Client()  # create a client with a local scheduler

When working with distributed resources, we highly recommend reviewing the Dask documentation on deploying Dask clusters.
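
For instance, if a Dask scheduler is already running on a cluster, a client can be pointed at its address instead (the address below is a placeholder):

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address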

Next, we assemble a list of EchoData objects. This list can be built from converted files (netCDF or Zarr), as in the example below, or from in-memory EchoData objects:

ed_list = []
for converted_file in ["convertedfile1.zarr", "convertedfile2.zarr"]:
    ed_list.append(ep.open_converted(converted_file))  # already converted files are lazy-loaded
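
When combining many files, the list can also be assembled programmatically. Here is a sketch, assuming the converted Zarr stores live under a hypothetical ./converted_files directory (note that sorting by filename preserves time order only if the names encode acquisition time):

from pathlib import Path

ed_list = [
    ep.open_converted(p)
    for p in sorted(Path("./converted_files").glob("*.zarr"))
]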

Finally, we apply combine_echodata to this list to combine all the data into a single EchoData object. Here, we store the final combined data at the Zarr path path_to/combined_echodata.zarr and use the client we established above:

combined_ed = ep.combine_echodata(
    ed_list, 
    zarr_path='path_to/combined_echodata.zarr', 
    client=client
)

Once executed, combine_echodata returns a lazy-loaded EchoData object (obtained from zarr_path) with all data from the input EchoData objects combined.
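
The combined object can then be used like any other EchoData object. For example, one might calibrate it to volume backscattering strength (Sv); this is a sketch, and some instruments (e.g., EK80) require additional calibration parameters:

ds_Sv = ep.calibrate.compute_Sv(combined_ed)  # calibrate the combined data to Sv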

Note

As shown in the above example, the path of the combined Zarr store is given by the keyword argument zarr_path, and the Dask client that parallel tasks will be submitted to is given by the keyword argument client. When either (or both) of these is not provided, the default values listed in the Notes section of combine_echodata are used.
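
For example, relying entirely on those defaults reduces the call to:

combined_ed = ep.combine_echodata(ed_list)  # default zarr_path and client are used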