dates
Date string parsing, formatting, and inference utilities.
Supports ISO8601 timezone offsets (via strftime_extra / strptime_extra),
permutation-based format inference (parse_date, infer_from_series),
and date randomization for PII replacement (randomize).
Classes:
| Name | Description |
|---|---|
| TokenizedStr | Represents a date string that has been broken up into individual date components. |
| ParsedDate | Wrapper for a parsed date and associated model_metadata |
Functions:
| Name | Description |
|---|---|
| strftime_extra | Formats a datetime object, supporting ISO8601 timezone offsets. |
| strptime_extra | Parses a string as a datetime object, supporting ISO8601 timezone offsets. |
| date_component_permutations | Return the Cartesian product of per-component format strings. |
| gen_date_str_fmt_permutations | Return the set of all unique date format permutations. |
| tokenize_date_str | Given a raw input date, will return an instance of TokenizedStr. |
| maybe_match | Attempt to parse date with format, returning None on failure. |
| parse_date | Parse a date string and return the first matching ParsedDate, or None. |
| parse_date_multiple | Yield all valid ParsedDate interpretations of input_date across known formats. |
| randomize | Given a date string of some unknown format, returns a randomly shifted version of that date. |
| d_str_to_fmt_multiple | Yield all plausible strftime format strings for a date string. |
| maybe_d_str_to_fmt_multiple | Like d_str_to_fmt_multiple but silently yields nothing on ValueError. |
| d_str_to_fmt | Infer the most likely strftime format string for a date string, or None. |
| infer_from_series | Infer the best strftime format for a series of date strings. |
| fit_and_transform_dates | Detect date columns, convert them to elapsed seconds, and record the transformation. |
| transform_dates | Apply a previously fitted date-to-seconds transformation to a DataFrame. |
Attributes:
| Name | Type | Description |
|---|---|---|
| date_component_orders | | This list contains date orderings by component. |
| component_formats | | For every date component, there may exist multiple formats. This dictionary maps components to any number of format variations. |
| component_seperators | | Characters from this list will be removed from a date string and used to build up a string containing only date components. |
| date_str_fmt_permutations | | A unique list of date string formats |
date_component_orders = [lambda y, m, d, hms, tz: f'{d} {m} {y}', lambda y, m, d, hms, tz: f'{m} {d} {y}', lambda y, m, d, hms, tz: f'{y} {m} {d}', lambda y, m, d, hms, tz: f'{y} {m} {d}', lambda y, m, d, hms, tz: f'{y} {m} {d} {hms}', lambda y, m, d, hms, tz: f'{y} {m} {d} {hms} {tz}']
module-attribute
This list contains date orderings by component.
component_formats = {'y': {'%y', '%Y'}, 'm': {'%b', '%B', '%m'}, 'd': {'%a', '%A', '%d'}, 'hms': {'%X', '%X %f'}, 'tz': {'%z', '%Z', '%!z'}}
module-attribute
For every date component, there may exist multiple formats. This dictionary maps
components to any number of format variations. Used in conjunction with
date_component_orders, it lets us build up permutations of valid date string formats.
component_seperators = ['/', '.', '-', ' ', ',', 'T', 'Z', '+']
module-attribute
Characters from this list are removed from a date string to build up a string
containing only date components, which can then be matched against the orderings
in date_component_orders.
date_str_fmt_permutations = gen_date_str_fmt_permutations()
module-attribute
A unique list of date string formats
TokenizedStr(original_str, masked_str, components, seperators)
dataclass
Represents a date string that has been broken up into individual date components. This class is useful when trying to rebuild a new string with the same format.
Methods:
| Name | Description |
|---|---|
| assemble_str_from_components | Given a new set of components, rebuild the string with formatting preserved. |
Attributes:
| Name | Type | Description |
|---|---|---|
| original_str | str | The original source string |
| masked_str | str | A masked version of the string. Masked strings only contain the mask characters and component separators. |
| components | list[tuple[str, tuple[int, int]]] | A list of components and their string index mapped from the source string |
| seperators | list[str] | A list of component separators. Zipping this list with components yields the original string. |
| component_str | str | A string containing only the components of the date. This is used to match a date with a date format. |
original_str
instance-attribute
The original source string
masked_str
instance-attribute
A masked version of the string. Masked strings only contain the mask characters and component separators.
components
instance-attribute
A list of components and their string index mapped from the source string
seperators
instance-attribute
A list of component separators. Zipping this list with components yields
the original string.
component_str
property
A string containing only the components of the date. This is used to match a date with a date format.
assemble_str_from_components(new_components)
Given a new set of components, rebuild the string with formatting preserved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_components | list[str] | The new set of components to reassemble the string with. | required |
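The relationship between components and separators can be illustrated with a minimal sketch. This is not the module's implementation: split_components and assemble are hypothetical helpers, and only a subset of the characters in component_seperators is used.

```python
import re
from itertools import zip_longest

# Assumption: a reduced separator set, chosen for illustration only.
SEPARATORS = "/.- ,"


def split_components(date_str: str) -> tuple[list[str], list[str]]:
    """Split a date string into its components and the separators between them."""
    # The capture group makes re.split keep the separator runs in the result.
    parts = re.split(f"([{re.escape(SEPARATORS)}]+)", date_str)
    return parts[0::2], parts[1::2]


def assemble(components: list[str], separators: list[str]) -> str:
    """Interleave new components with the original separators."""
    pairs = zip_longest(components, separators, fillvalue="")
    return "".join(comp + sep for comp, sep in pairs)


components, separators = split_components("12/31/2023")
rebuilt = assemble(["01", "15", "2024"], separators)
```

Because the separators are stored apart from the components, a replacement date keeps the punctuation of the original string.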
Source code in src/nemo_safe_synthesizer/data_processing/actions/dates.py
ParsedDate(component_order, date, tokenized_date)
dataclass
Wrapper for a parsed date and associated model_metadata
Methods:
| Name | Description |
|---|---|
| date_to_fmt_str | Given a new date object, returns that date in the parsed date format |
| shift | Given a date shift in days or milliseconds or a timedelta object, will return a new date using the same original string format. |
Attributes:
| Name | Type | Description |
|---|---|---|
| component_order | str | Matched date string format order from date_component_orders. |
| date | datetime | The parsed datetime object |
| tokenized_date | TokenizedStr | A reference to the tokenized date string |
| fmt_str | str | The date format string used to build the original date. This can be used with functions like strftime or strptime. |
component_order
instance-attribute
Matched date string format order from date_component_orders. This can be
used to reconstruct the original date string format including separators.
date
instance-attribute
The parsed datetime object
tokenized_date
instance-attribute
A reference to the tokenized date string
fmt_str
property
The date format string used to build the original date. This can be used
with functions like strftime or strptime.
Returns:
| Type | Description |
|---|---|
| str | Date format string such as "%m/%d/%Y". |
date_to_fmt_str(date)
Given a new date object, returns that date in the parsed date format
shift(days=None, ms=None, delta=None)
Given a date shift in days or milliseconds or a timedelta object,
will return a new date using the same original string format.
Shifting by milliseconds is useful if the date is a timestamp.
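The idea of shifting while preserving the original string format can be sketched with the standard library alone. shift_date_str is a hypothetical helper that takes an explicit format string; the real method works from the format already stored on the ParsedDate.

```python
from datetime import datetime, timedelta


def shift_date_str(date_str: str, fmt: str, days: int = 0, ms: int = 0) -> str:
    """Parse date_str with fmt, shift it, and render it with the same fmt."""
    parsed = datetime.strptime(date_str, fmt)
    shifted = parsed + timedelta(days=days, milliseconds=ms)
    return shifted.strftime(fmt)


next_day = shift_date_str("01/31/2024", "%m/%d/%Y", days=1)
later = shift_date_str("2024-01-01 00:00:00", "%Y-%m-%d %H:%M:%S", ms=1500)
```

Note that a millisecond shift is only visible in the output when the format includes a sufficiently fine time component.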
strftime_extra(dt, fmt)
Formats a datetime object, supporting ISO8601 timezone offsets.
strftime_extra(dt, fmt) behaves like dt.strftime(fmt), with the exception that it supports
the special %!z format directive. %!z is formatted as an ISO8601 UTC offset. For naive datetimes,
it always expands to the empty string (same as %z); otherwise, it expands to a colon-separated
[+-]hh:mm offset or Z for UTC. For the sake of an easier implementation, %!z is only allowed
at the end of the format string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dt | datetime | the datetime object to format. | required |
| fmt | str | the format string to use (which may include %!z). | required |
Returns:
| Type | Description |
|---|---|
| str | the formatted datetime. |
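The %!z semantics described above can be approximated as follows. This is a sketch under stated assumptions, not the library's code: iso_offset and strftime_sketch are hypothetical names, and any zero offset is treated as UTC.

```python
from datetime import datetime, timedelta, timezone


def iso_offset(dt: datetime) -> str:
    """Render the UTC offset as '', 'Z', or a colon-separated [+-]hh:mm."""
    offset = dt.utcoffset()
    if offset is None:  # naive datetime: expand to the empty string
        return ""
    if offset == timedelta(0):  # assumption: a zero offset is rendered as Z
        return "Z"
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{sign}{hours:02d}:{minutes:02d}"


def strftime_sketch(dt: datetime, fmt: str) -> str:
    """Like dt.strftime(fmt), but handles a trailing %!z directive."""
    if fmt.endswith("%!z"):
        return dt.strftime(fmt[:-3]) + iso_offset(dt)
    return dt.strftime(fmt)


fmt = "%Y-%m-%dT%H:%M:%S%!z"
tz = timezone(timedelta(hours=-5, minutes=-30))
aware = strftime_sketch(datetime(2024, 1, 2, 3, 4, 5, tzinfo=tz), fmt)
utc = strftime_sketch(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc), fmt)
naive = strftime_sketch(datetime(2024, 1, 2, 3, 4, 5), fmt)
```

Restricting %!z to the end of the format string keeps the sketch to a single suffix check, which matches the restriction stated in the description.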
strptime_extra(date_string, fmt)
Parses a string as a datetime object, supporting ISO8601 timezone offsets.
See the documentation on strftime_extra regarding the semantics of the new %!z format
specifier.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date_string | str | the datetime in string format. | required |
| fmt | str | the format string to use for parsing (which may include %!z). | required |
Returns:
| Type | Description |
|---|---|
| datetime | the parsed datetime. |
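One way to approximate parsing with a trailing %!z is to lean on the standard %z directive, which accepts colon-separated offsets and Z when parsing (Python 3.7 and later). strptime_sketch is a hypothetical name, and the fallback to a naive datetime when no offset is present is an assumption, not documented behavior.

```python
from datetime import datetime, timedelta


def strptime_sketch(date_string: str, fmt: str) -> datetime:
    """Like datetime.strptime, but handles a trailing %!z directive."""
    if not fmt.endswith("%!z"):
        return datetime.strptime(date_string, fmt)
    base = fmt[:-3]
    try:
        # %z accepts "+05:30" and "Z" when parsing.
        return datetime.strptime(date_string, base + "%z")
    except ValueError:
        # Assumption: a missing offset yields a naive datetime.
        return datetime.strptime(date_string, base)


fmt = "%Y-%m-%dT%H:%M:%S%!z"
with_offset = strptime_sketch("2024-01-02T03:04:05+05:30", fmt)
with_z = strptime_sketch("2024-01-02T03:04:05Z", fmt)
without = strptime_sketch("2024-01-02T03:04:05", fmt)
```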
date_component_permutations()
Return the Cartesian product of per-component format strings.
Each tuple is indexed by (year, month, day, hms, tz) and can be
passed into a formatter from date_component_orders.
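The Cartesian product and its use with the ordering formatters can be sketched as below. The values are copied from the module attributes listed on this page (with the repeated year-month-day ordering listed once); sketch_permutations is a hypothetical stand-in for the real functions, whose internals are not shown here.

```python
from itertools import product

# Copied from component_formats; lists are used so iteration order is stable.
component_formats = {
    "y": ["%y", "%Y"],
    "m": ["%b", "%B", "%m"],
    "d": ["%a", "%A", "%d"],
    "hms": ["%X", "%X %f"],
    "tz": ["%z", "%Z", "%!z"],
}

# Copied from date_component_orders, without the duplicate entry.
date_component_orders = [
    lambda y, m, d, hms, tz: f"{d} {m} {y}",
    lambda y, m, d, hms, tz: f"{m} {d} {y}",
    lambda y, m, d, hms, tz: f"{y} {m} {d}",
    lambda y, m, d, hms, tz: f"{y} {m} {d} {hms}",
    lambda y, m, d, hms, tz: f"{y} {m} {d} {hms} {tz}",
]


def sketch_permutations() -> set[str]:
    """Apply every ordering to every (y, m, d, hms, tz) combination."""
    keys = ("y", "m", "d", "hms", "tz")
    combos = list(product(*(component_formats[k] for k in keys)))
    # A set removes duplicates produced by orderings that ignore hms or tz.
    return {order(*combo) for combo in combos for order in date_component_orders}


perms = sketch_permutations()
```

Orderings that ignore some components produce the same string for many combinations, which is why the result is collected into a set.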
gen_date_str_fmt_permutations()
Return the set of all unique date format permutations.
tokenize_date_str(input)
Given a raw input date, will return an instance of TokenizedStr. Any
business logic or edge cases for tokenizing a string belong in this method.
maybe_match(date, format)
Attempt to parse date with format, returning None on failure.
parse_date(input_date, date_str_fmts=date_str_fmt_permutations)
Parse a date string and return the first matching ParsedDate, or None.
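The "first matching format" behavior can be sketched as a loop over candidate formats. This is a simplification: the real function tokenizes the input and returns a ParsedDate, while the hypothetical helpers below call datetime.strptime directly on the raw string.

```python
from datetime import datetime
from typing import Iterable, Optional


def maybe_match_sketch(date: str, fmt: str) -> Optional[datetime]:
    """Return the parsed datetime, or None if fmt does not match."""
    try:
        return datetime.strptime(date, fmt)
    except ValueError:
        return None


def first_match(date: str, fmts: Iterable[str]) -> Optional[tuple[str, datetime]]:
    """Return the first (format, datetime) pair that parses, or None."""
    for fmt in fmts:
        parsed = maybe_match_sketch(date, fmt)
        if parsed is not None:
            return fmt, parsed
    return None


candidates = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]
hit = first_match("12/31/2023", candidates)
miss = first_match("not a date", candidates)
```

Because the first match wins, the order of the candidate formats decides how ambiguous strings such as "01/02/2024" are read.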
parse_date_multiple(input_date, date_str_fmts=date_str_fmt_permutations)
Yield all valid ParsedDate interpretations of input_date across known formats.
randomize(date, days)
Given a date string of some unknown format, returns a randomly shifted version of that date.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date | str | The date string to shift | required |
| days | int | The maximum number of days to shift the date by. The shift is drawn from the range [-days, days]. | required |
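A random shift bounded by [-days, days] can be sketched as follows. randomize_sketch is a hypothetical helper: it takes an explicit format and an injectable random source, whereas randomize infers the format from the string itself.

```python
import random
from datetime import datetime, timedelta


def randomize_sketch(date_str: str, fmt: str, days: int, rng=random) -> str:
    """Shift date_str by a whole number of days drawn from [-days, days]."""
    offset = rng.randint(-days, days)  # randint is inclusive on both ends
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=offset)
    return shifted.strftime(fmt)


rng = random.Random(0)  # seeded only so the example is repeatable
shifted = randomize_sketch("2024-03-15", "%Y-%m-%d", 7, rng)
```

Passing a seeded random.Random instance keeps the sketch testable; production use for PII replacement would normally rely on an unseeded source.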
d_str_to_fmt_multiple(input_date)
Yield all plausible strftime format strings for a date string.
maybe_d_str_to_fmt_multiple(input_date)
Like d_str_to_fmt_multiple but silently yields nothing on ValueError.
d_str_to_fmt(input_date)
Infer the most likely strftime format string for a date string, or None.
infer_from_series(date_series)
Infer the best strftime format for a series of date strings.
Evaluates each date against all known format permutations and returns
the most frequently matched format. This is more reliable than
single-string inference, which can confuse ambiguous components like
%m and %d.
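The majority-vote idea can be sketched with a counter over candidate formats. infer_format is a hypothetical helper operating on a plain list with an explicit candidate list; the real function evaluates against the module's full set of format permutations.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional


def infer_format(dates: Iterable[str], candidate_fmts: list[str]) -> Optional[str]:
    """Return the candidate format that parses the most dates, or None."""
    counts: Counter = Counter()
    for date in dates:
        for fmt in candidate_fmts:
            try:
                datetime.strptime(date, fmt)
            except ValueError:
                continue
            counts[fmt] += 1
    return counts.most_common(1)[0][0] if counts else None


dates = ["01/02/2024", "13/02/2024", "25/12/2023"]
best = infer_format(dates, ["%m/%d/%Y", "%d/%m/%Y"])
```

The first string is ambiguous on its own, but the other two only parse as day-first, so the vote settles on "%d/%m/%Y".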
fit_and_transform_dates(df, inplace=False)
Detect date columns, convert them to elapsed seconds, and record the transformation.
For each object-typed column, samples values to infer a date format. If successful, converts the column to seconds elapsed since the column minimum and records the format and min date for later reversal.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. | required |
| inplace | bool | If True, mutate | False |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, str]] | A tuple of (date_min_dict, result_df). |
| DataFrame | names to |
| tuple[dict[str, dict[str, str]], DataFrame] | |
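The elapsed-seconds conversion and its reversal can be sketched on a plain list of strings. This uses only the standard library rather than a DataFrame, and the helper names and the "format" and "min" metadata keys are assumptions made for illustration, not the keys the module records.

```python
from datetime import datetime, timedelta


def to_elapsed_seconds(values: list[str], fmt: str) -> tuple[dict[str, str], list[float]]:
    """Convert date strings to seconds elapsed since the earliest date."""
    parsed = [datetime.strptime(v, fmt) for v in values]
    start = min(parsed)
    seconds = [(p - start).total_seconds() for p in parsed]
    # Hypothetical metadata layout: enough to reverse the transformation.
    return {"format": fmt, "min": start.strftime(fmt)}, seconds


def from_elapsed_seconds(meta: dict[str, str], seconds: list[float]) -> list[str]:
    """Reverse to_elapsed_seconds using the recorded format and minimum."""
    start = datetime.strptime(meta["min"], meta["format"])
    return [(start + timedelta(seconds=s)).strftime(meta["format"]) for s in seconds]


column = ["2024-01-02", "2024-01-01", "2024-01-04"]
meta, seconds = to_elapsed_seconds(column, "%Y-%m-%d")
restored = from_elapsed_seconds(meta, seconds)
```

Storing the minimum as a string in the original format keeps the metadata to string values, in line with the dict[str, dict[str, str]] type shown above.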
transform_dates(dates, df)
Apply a previously fitted date-to-seconds transformation to a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dates | dict[str, dict[str, str]] | Mapping from column names to | required |
| df | DataFrame | DataFrame to transform. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | A copy of |