__init__

FileSet.__init__(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)

Initialize a FileSet object.

Parameters
  • path – A string with the complete path to the files. The string can contain placeholders such as {year}, {month}, etc. See below for a complete list. Restricted regular expressions can also be used directly. Note that the asterisk ‘*’, not the dot ‘.’, is interpreted as the wildcard. If no placeholders are given, the path must point to a single file, and this fileset is then treated as a single-file fileset. You can also define your own placeholders by using the parameter placeholder.

  • name – The name of the fileset.

  • handler – An object which can handle the fileset files. When a file handler object is given, this fileset class does not care which format its files have. You can use a file handler class from typhon.files, use FileHandler or write your own class. If no file handler is given, an adequate one is automatically selected for the most common filename suffixes. Please note that if no file handler is specified (and none could be set automatically), this fileset’s functionality is restricted.

  • info_via – Defines how further information about a file (e.g. its time coverage) is retrieved. Possible options are filename, handler or both. Default is filename, which means that the placeholders in the file’s path are parsed to obtain the information. If this is handler, the handler’s get_info() method is used. If this is both, both options are executed, but information from the file handler overwrites conflicting information from the filename.

  • info_cache – Retrieving further information (such as time coverage) about a file may take a while, especially when info_via is set to handler. If the file information is cached, multiple calls of find() (for time periods that are close to each other) are significantly faster. Specify a file name here (the file need not exist yet) if you wish to save the information to a file. When you restart your script, this cache is used.

  • time_coverage – If this fileset consists of multiple files, this parameter is the relative time coverage of each file. It can be a timedelta object or a string with time information (e.g. “1 hour” or “2 seconds”). If the ending time of a file cannot be retrieved from its file handler or filename, it is set to the file’s starting time + time_coverage. If time_coverage is not given, the missing ending time of each file is set to its starting time. If this fileset consists of a single file, this parameter is its absolute time coverage and must be a tuple of timestamps (datetime objects or strings). If it is not given, the period between year 1 and 9999 is used as the default time coverage.

  • exclude – A list of time periods (tuples of two timestamps) or filenames (strings) that will be excluded when searching for files of this fileset.

  • placeholder – A dictionary with pairs of placeholder name and a regular expression matching its content. These are user-defined placeholders; the standard temporal placeholders do not have to be defined.

  • max_threads – Maximum number of threads that will be used to parallelise some methods (e.g. writing in background). This also sets the default for map()-like methods (default is 3).

  • max_processes – Maximum number of processes that will be used to parallelise some methods. This also sets the default for map()-like methods (default is 8).

  • worker_type – The type of the workers that will be used to parallelise some methods. Can be process (default) or thread.

  • read_args – Additional keyword arguments in a dictionary that should always be passed to read().

  • write_args – Additional keyword arguments in a dictionary that should always be passed to write().

  • post_reader – A reference to a function that will be called after reading a file. Can be used for post-processing or field selection, etc. Its signature must be callable(file_info, file_data).

  • temp_dir – A temporary directory that this FileSet object should use for compressing and decompressing files. By default, it uses the directory returned by tempfile.gettempdir().

  • compress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), newly created files will be compressed after writing them to disk. Default value is true.

  • decompress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), files will be decompressed before reading them. Default value is true.

  • fs – An instance of an implementation of fsspec.spec.AbstractFileSystem. Passing a remote filesystem implementation allows searching for and opening files on remote file systems, for example Amazon S3 via s3fs.S3FileSystem.
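
As a rough illustration of how several of these parameters fit together, the following sketch assembles a set of keyword arguments and a post_reader function. The path, the fileset name, the satellite placeholder, the field names and the assumption that file_data behaves like a dictionary are all hypothetical, as is the assumption that the value returned by post_reader replaces the data that was read. The FileSet call itself is left as a comment because it requires typhon to be installed.

```python
from datetime import datetime


def select_fields(file_info, file_data):
    """Hypothetical post_reader with the signature callable(file_info, file_data).

    Assumes file_data is dictionary-like; keeps only two example fields.
    """
    wanted = ("time", "temperature")
    return {key: file_data[key] for key in wanted if key in file_data}


# Keyword arguments named after the parameters documented above.
# All values are made up for illustration.
fileset_kwargs = dict(
    path="/data/{satellite}/{year}/{month}/{day}/{hour}{minute}{second}.nc.gz",
    name="example",
    info_via="filename",            # parse the placeholders in the path
    time_coverage="1 hour",         # relative coverage of each file
    placeholder={"satellite": r"[a-z]+\d+"},  # user-defined placeholder
    exclude=[(datetime(2018, 1, 1), datetime(2018, 1, 2))],  # skip one period
    worker_type="process",
    max_processes=4,
    post_reader=select_fields,
    decompress=True,                # path ends with .gz
)

# With typhon installed, the object could then be created like this:
# from typhon.files import FileSet
# fileset = FileSet(**fileset_kwargs)
```

Keeping the arguments in a dictionary is only a convenience for the sketch; passing them directly as keyword arguments is equivalent.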

You can use regular expressions or placeholders in path to generalize the file path. Placeholders are captured and returned by file-finding methods such as find(). Temporal placeholders are converted to datetime objects and represent a file’s time coverage. Allowed temporal placeholders in the path argument are:

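+-------------+------------------------------------------+------------+
| Placeholder | Description                              | Example    |
+=============+==========================================+============+
| year        | Four digits indicating the year.         | 1999       |
+-------------+------------------------------------------+------------+
| year2       | Two digits indicating the year. [1]      | 58 (=2058) |
+-------------+------------------------------------------+------------+
| month       | Two digits indicating the month.         | 09         |
+-------------+------------------------------------------+------------+
| day         | Two digits indicating the day.           | 08         |
+-------------+------------------------------------------+------------+
| doy         | Three digits indicating the day of the   | 002        |
|             | year.                                    |            |
+-------------+------------------------------------------+------------+
| hour        | Two digits indicating the hour.          | 22         |
+-------------+------------------------------------------+------------+
| minute      | Two digits indicating the minute.        | 58         |
+-------------+------------------------------------------+------------+
| second      | Two digits indicating the second.        | 58         |
+-------------+------------------------------------------+------------+
| millisecond | Three digits indicating the millisecond. | 999        |
+-------------+------------------------------------------+------------+

[1] Numbers lower than 65 are interpreted as 20XX, while numbers equal to or greater than 65 are interpreted as 19XX (e.g. 65 = 1965, 99 = 1999).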
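
The two-digit year rule from the footnote can be written down in a few lines. This is only a sketch of the stated rule, not the library's own code; the function name expand_two_digit_year is made up for this example.

```python
def expand_two_digit_year(year2):
    """Expand a two-digit year using the pivot value 65 described above.

    Values below 65 map to 20XX, values from 65 upwards map to 19XX.
    """
    value = int(year2)
    if not 0 <= value <= 99:
        raise ValueError("year2 must be between 0 and 99")
    return 2000 + value if value < 65 else 1900 + value


print(expand_two_digit_year("58"))  # 2058
print(expand_two_digit_year("65"))  # 1965
print(expand_two_digit_year("99"))  # 1999
```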

All of these placeholders may also carry the prefix end_ (e.g. end_year). They then represent the end of the time coverage.

Moreover, you can define your own placeholders by using the parameter placeholder or set_placeholders(). Their names must consist of alphanumeric characters (underscores are also allowed).
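
To make the placeholder mechanism more concrete, the following sketch shows one way a path with temporal, end_-prefixed and user-defined placeholders could be turned into a regular expression and parsed into a time coverage. It is a simplified illustration using only the standard library, not typhon's actual implementation: the helper names compile_path and parse_filename, the example path and the satellite placeholder are hypothetical, and year2, doy, millisecond and the ‘*’ wildcard are left out for brevity. The sketch also assumes that end_ fields missing from the path fall back to the corresponding start values.

```python
import re
from datetime import datetime

_UNITS = ("year", "month", "day", "hour", "minute", "second")
# Regular expression for each temporal placeholder (subset of the table above).
_TEMPORAL = {"year": r"\d{4}", **{unit: r"\d{2}" for unit in _UNITS[1:]}}


def compile_path(path, placeholder=None):
    """Turn a path with {placeholders} into a regex with named groups."""
    patterns = dict(_TEMPORAL)
    patterns.update({"end_" + unit: p for unit, p in _TEMPORAL.items()})
    patterns.update(placeholder or {})  # user-defined placeholders
    # re.split with a capturing group yields literal text at even indices
    # and placeholder names at odd indices.
    parts = re.split(r"\{(\w+)\}", path)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>{patterns[part]})"
        for i, part in enumerate(parts)
    )
    return re.compile(regex)


def parse_filename(path, filename, placeholder=None):
    """Return start time, end time and user-defined fields, or None."""
    match = compile_path(path, placeholder).fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # The sketch requires {year}; other missing units get a default value.
    defaults = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}
    start = {u: int(fields[u]) if u in fields else defaults[u] for u in _UNITS}
    end = {
        u: int(fields["end_" + u]) if "end_" + u in fields else start[u]
        for u in _UNITS
    }
    temporal_names = set(_UNITS) | {"end_" + u for u in _UNITS}
    user = {k: v for k, v in fields.items() if k not in temporal_names}
    return {"start": datetime(**start), "end": datetime(**end), "user": user}


info = parse_filename(
    "/data/{satellite}/{year}{month}{day}_{hour}{minute}-{end_hour}{end_minute}.nc",
    "/data/noaa18/20180102_0315-0455.nc",
    placeholder={"satellite": r"[a-z]+\d+"},
)
print(info["start"], info["end"], info["user"])
```

Each placeholder becomes a named group, so the captured values can be looked up by name; this mirrors how the documented placeholders are described as being captured and returned by find().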