Subsetting Data#

Using existing xarray functions#

Tracks are loaded as an xarray.Dataset which have lots of built in methods for subsetting data. e.g. for indexing see xarray indexing.

For more specific selection of data, the best method is to use xarray.Dataset.where with the argument drop=True. e.g.

[2]:

import huracanpy

tracks = huracanpy.load(huracanpy.example_csv_file)

# Select all points with longitude > 60
print(tracks.lon, "\n")
tracks_subset = tracks.where(tracks.lon > 60, drop=True)
print(tracks_subset.lon)

<xarray.DataArray 'lon' (record: 99)> Size: 792B
array([120.5 , 119.  , 119.  , ...,  58.5 ,  60.25,  62.25], shape=(99,))
Dimensions without coordinates: record

<xarray.DataArray 'lon' (record: 53)> Size: 424B
array([120.5 , 119.  , 119.  , ..., 156.  ,  60.25,  62.25], shape=(53,))
Dimensions without coordinates: record

Selecting times#

Generally the time array will be loaded in as an np.datetime64 array. This means it doesn’t work to compare it with the standard datetime

[3]:

import datetime

# Try to select a subset of times based on datetime
print(tracks.time)
tracks_subset = tracks.where(tracks.time > datetime.datetime(1980, 1, 10), drop=True)

<xarray.DataArray 'time' (record: 99)> Size: 792B
array(['1980-01-06T06:00:00.000000000', '1980-01-06T12:00:00.000000000',
       '1980-01-06T18:00:00.000000000', ...,
       '1980-01-30T06:00:00.000000000', '1980-01-30T12:00:00.000000000',
       '1980-01-30T18:00:00.000000000'],
      shape=(99,), dtype='datetime64[ns]')
Dimensions without coordinates: record

TypeError: '>' not supported between instances of 'int' and 'datetime.datetime'

However, the same comparison can be done using datetime64, the syntax is just a bit different

[4]:

import numpy as np

tracks_subset = tracks.where(tracks.time > np.datetime64("1980-01-10"), drop=True)
print(tracks_subset.time)

<xarray.DataArray 'time' (record: 72)> Size: 576B
array(['1980-01-10T06:00:00.000000000', '1980-01-10T12:00:00.000000000',
       '1980-01-10T18:00:00.000000000', ...,
       '1980-01-30T06:00:00.000000000', '1980-01-30T12:00:00.000000000',
       '1980-01-30T18:00:00.000000000'],
      shape=(72,), dtype='datetime64[ns]')
Dimensions without coordinates: record

Note, that this isn’t always the case. If the tracks are loaded in with a different calendar, then the times will use cftime which is not converted to datetime64 by xarray.

[5]:

# The tracks don't actually use a 360_day calendar.
# I'm just passing this as an argument to show an example of it loading this way
tracks = huracanpy.load(
    huracanpy.example_TRACK_file, source="track", track_calendar="360_day"
)
print(tracks.time)

<xarray.DataArray 'time' (record: 46)> Size: 368B
array([cftime.datetime(2022, 1, 13, 18, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 14, 0, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 14, 6, 0, 0, 0, calendar='360_day', has_year_zero=True),
       ...,
       cftime.datetime(2022, 1, 28, 18, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 29, 0, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 29, 6, 0, 0, 0, calendar='360_day', has_year_zero=True)],
      shape=(46,), dtype=object)
Dimensions without coordinates: record

In this case, neither the datetime or the datetime64 comparison will work and you have to compare to a cftime.datetime object with the same calendar

[6]:

tracks_subset = tracks.where(tracks.time > datetime.datetime(1980, 1, 10), drop=True)

TypeError: cannot compare cftime.datetime(2022, 1, 13, 18, 0, 0, 0, calendar='360_day', has_year_zero=True) and datetime.datetime(1980, 1, 10, 0, 0) (different calendars)

[7]:

tracks_subset = tracks.where(tracks.time > np.datetime64("1980-01-10"), drop=True)

TypeError: '>' not supported between instances of 'cftime._cftime.datetime' and 'datetime.date'

[8]:

import cftime

tracks_subset = tracks.where(
    tracks.time > cftime.datetime(1980, 1, 10, calendar="360_day"), drop=True
)
print(tracks_subset.time)

<xarray.DataArray 'time' (record: 46)> Size: 368B
array([cftime.datetime(2022, 1, 13, 18, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 14, 0, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 14, 6, 0, 0, 0, calendar='360_day', has_year_zero=True),
       ...,
       cftime.datetime(2022, 1, 28, 18, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 29, 0, 0, 0, 0, calendar='360_day', has_year_zero=True),
       cftime.datetime(2022, 1, 29, 6, 0, 0, 0, calendar='360_day', has_year_zero=True)],
      shape=(46,), dtype=object)
Dimensions without coordinates: record

Selecting an individual track#

This can fairly easily be achieved by groupby

[9]:

print(np.unique(tracks.track_id))
track = tracks.groupby("track_id")[840]
print(track)

[ 840 1601]
<xarray.Dataset> Size: 5kB
Dimensions:        (record: 17)
Dimensions without coordinates: record
Data variables: (12/35)
    track_id       (record) int64 136B 840 840 840 840 840 ... 840 840 840 840
    time           (record) object 136B 2022-01-13 18:00:00 ... 2022-01-17 18...
    lon            (record) float64 136B 284.1 287.8 292.5 ... 317.2 318.4 321.9
    lat            (record) float64 136B 29.5 29.99 30.88 ... 58.15 58.47 60.38
    vorticity      (record) float64 136B 2.273 4.124 5.978 ... 8.996 6.404 6.274
    feature_0_lon  (record) float64 136B 285.4 287.7 292.7 ... 316.3 319.1 322.3
    ...             ...
    feature_8_lon  (record) float64 136B 289.4 289.4 293.6 ... 328.5 328.2 316.7
    feature_8_lat  (record) float64 136B 29.65 31.05 29.09 ... 58.88 61.97 59.72
    feature_8      (record) float64 136B 16.1 24.5 25.88 ... 29.98 23.18 20.72
    feature_9_lon  (record) float64 136B 289.7 289.7 290.5 ... 319.5 323.2 326.5
    feature_9_lat  (record) float64 136B 29.65 31.05 33.02 ... 56.91 59.16 60.56
    feature_9      (record) float64 136B 11.22 15.69 16.05 ... 20.47 16.66 15.37

However, this can be fairly slow if you have a large amount of tracks and you are doing nothing else with groupby. Instead, you can use sel_id to quickly get an individual track

[10]:

track = huracanpy.sel_id(tracks, tracks.track_id, 840)
print(track)

<xarray.Dataset> Size: 5kB
Dimensions:        (record: 17)
Dimensions without coordinates: record
Data variables: (12/35)
    track_id       (record) int64 136B 840 840 840 840 840 ... 840 840 840 840
    time           (record) object 136B 2022-01-13 18:00:00 ... 2022-01-17 18...
    lon            (record) float64 136B 284.1 287.8 292.5 ... 317.2 318.4 321.9
    lat            (record) float64 136B 29.5 29.99 30.88 ... 58.15 58.47 60.38
    vorticity      (record) float64 136B 2.273 4.124 5.978 ... 8.996 6.404 6.274
    feature_0_lon  (record) float64 136B 285.4 287.7 292.7 ... 316.3 319.1 322.3
    ...             ...
    feature_8_lon  (record) float64 136B 289.4 289.4 293.6 ... 328.5 328.2 316.7
    feature_8_lat  (record) float64 136B 29.65 31.05 29.09 ... 58.88 61.97 59.72
    feature_8      (record) float64 136B 16.1 24.5 25.88 ... 29.98 23.18 20.72
    feature_9_lon  (record) float64 136B 289.7 289.7 290.5 ... 319.5 323.2 326.5
    feature_9_lat  (record) float64 136B 29.65 31.05 33.02 ... 56.91 59.16 60.56
    feature_9      (record) float64 136B 11.22 15.69 16.05 ... 20.47 16.66 15.37

Subsetting by track#

To apply a criteria to each track in the dataset, use huracanpy.trackswhere

[11]:

# Add storm category by pressure to each track and filter those that don't reach
# category 2
tracks = huracanpy.load(huracanpy.example_csv_file)
tracks = tracks.hrcn.add_pressure_category(slp_units="Pa")

# Show the categories for each storm
# Storms 0 and 2 reach category 2, and storm 1 only reaches category 1
for track_id, track in tracks.groupby("track_id"):
    print("track", track_id, "category", int(track.pressure_category.max()))

# Subset the tracks by category threshold which will remove track 1
track_subset = huracanpy.trackswhere(
    tracks, tracks.track_id, lambda track: track.pressure_category.max() >= 2
)

# Confirm that track 1 has been filtered out
print("\n", "tracks remaining -", set(track_subset.track_id.data))

track 0 category 2
track 1 category 1
track 2 category 2

 tracks remaining - {np.int64(0), np.int64(2)}