
Series

Bases: _SeriesCoreMixin, _SeriesSummaryMixin, pandas.Series


[Inheritance diagram: _SeriesCoreMixin --> Series; _SeriesSummaryMixin --> Series]

Extended pandas Series with dataframe-aware helpers and summaries.

This subclass behaves like pandas.Series but guarantees that operations returning new objects preserve the custom Series or project DataFrame types. It also provides additional helpers for:

  • construction from DataFrames or Index objects
  • regex matching
  • structured statistical summaries
Source code in metaframe/src/series/base.py
class Series(
    _SeriesCoreMixin,
    _SeriesSummaryMixin,
    pd.Series
):
    """
    Extended pandas ``Series`` with dataframe-aware helpers and summaries.

    This subclass behaves like `pandas.Series` but guarantees that
    operations returning new objects preserve the custom ``Series`` or
    project ``DataFrame`` types. It also provides additional helpers for:

    * construction from DataFrames or Index objects
    * regex matching
    * structured statistical summaries
    """

    # ------------------------------------------------------------------
    # Constructors
    # ------------------------------------------------------------------

    @property
    def _constructor(self) -> Type["Series"]:
        """
        Series constructor used internally by pandas.

        Ensures pandas operations that produce a new Series return this
        subclass instead of ``pandas.Series``.

        Returns
        -------
        Series
        """
        return Series

    @property
    def _constructor_expanddim(self) -> Type:
        """
        DataFrame constructor used when dimensionality increases.

        Used internally when a Series becomes a DataFrame.

        Returns
        -------
        DataFrame
        """
        from metaframe.src.dataframe import DataFrame
        return DataFrame

_constructor property

Series constructor used internally by pandas.

Ensures pandas operations that produce a new Series return this subclass instead of pandas.Series.

Returns:

Type Description
Series

_constructor_expanddim property

DataFrame constructor used when dimensionality increases.

Used internally when a Series becomes a DataFrame.

Returns:

Type Description
DataFrame
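
To illustrate the mechanism these two properties rely on, here is a minimal sketch using plain pandas. TaggedSeries and TaggedFrame are hypothetical stand-ins, not metaframe classes, and the sketch assumes pandas is installed.

```python
import pandas as pd


class TaggedSeries(pd.Series):
    """Hypothetical stand-in for a custom Series subclass."""

    @property
    def _constructor(self):
        # Consulted by pandas when an operation yields a new 1-D result.
        return TaggedSeries

    @property
    def _constructor_expanddim(self):
        # Consulted by pandas when the result gains a dimension (e.g. to_frame).
        return TaggedFrame


class TaggedFrame(pd.DataFrame):
    """Hypothetical stand-in for a custom DataFrame subclass."""

    @property
    def _constructor(self):
        return TaggedFrame

    @property
    def _constructor_sliced(self):
        # Selecting a single column returns the custom Series type.
        return TaggedSeries


s = TaggedSeries([1, 2, 3], name="x")
doubled = s * 2       # arithmetic goes through _constructor
frame = s.to_frame()  # the dimensionality increase goes through _constructor_expanddim
column = frame["x"]   # slicing back down goes through _constructor_sliced

print(type(doubled).__name__, type(frame).__name__, type(column).__name__)
```

Without these overrides, the same operations would hand back plain pandas.Series and pandas.DataFrame objects, and any custom methods would be lost after the first transformation.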

summary(**kwargs)

Compute a structured summary of the Series.

Produces a MultiIndex Series describing counts, missing values, descriptive statistics, value frequencies, and optional custom metrics.

The resulting Series will have a MultiIndex with the following levels:

  • dtype -> 'all', 'Not Numeric' or 'Numeric'. Indicates which subset of the original Series (by data type) the metric was computed on.
  • Mode -> 'Describe', 'Value Count' or 'Custom'. Indicates which source produced the metric:
    • Describe: pandas describe method
    • Value Count: pandas value_counts method
    • Custom: user-defined function from the d_func parameter
  • Metric: name of the computed metric.
  • Type: type of the metric. Each 'count' metric is followed by its associated '%' metric.

Behavior depends on value_counts:

  • None -> automatically split numeric and non-numeric data
  • True -> frequency-based summary only (on numeric and non-numeric data)
  • False -> descriptive statistics only (on numeric data)
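
When value_counts is None, the source listed further down classifies each non-missing value with isinstance checks. A minimal standard-library sketch of that rule follows; is_numeric_value is a hypothetical helper name, and the NumPy boolean check from the original is left out so the sketch has no third-party dependency.

```python
from numbers import Number


def is_numeric_value(value):
    """Hypothetical helper: True for numbers, False for booleans and everything else."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    return not isinstance(value, bool) and isinstance(value, Number)


values = [1, 2, "a", 2.5, True, "3"]
numeric = [v for v in values if is_numeric_value(v)]
not_numeric = [v for v in values if not is_numeric_value(v)]

print(numeric)      # [1, 2, 2.5]
print(not_numeric)  # ['a', True, '3']
```

Note that numeric-looking strings such as "3" land on the non-numeric side: the check looks at the Python type of each value, not at whether it could be parsed as a number.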

Parameters:

Name    Type  Description                           Default
kwargs        SeriesSummaryOpts keyword arguments.  {}

Returns:

Type    Description
Series  MultiIndex Series containing the summary statistics.

Examples:

>>> s = Series([1, 2, 'a', 2, 'b', 'a', 3, 'a', None])
>>> s.summary()
dtype        Mode         Metric         Type      
all          Describe     Num. elements  count             9
                          NAs            count           1.0
                                         %              11.1
Not Numeric  Describe     Num. elements  count           4.0
                                         %              44.4
                          unique         count             2
                                         %              50.0
                          top            top               a
                          freq           freq              3
             Value Count  a              count             3
                                         %              75.0
                          b              count             1
                                         %              25.0
Numeric      Describe     Num. elements  count           4.0
                                         %              44.4
                          mean           mean            2.0
                          std            std            0.82
                          min            min             1.0
                          25%            percentile     1.75
                          50%            percentile      2.0
                          75%            percentile     2.25
                          max            max             3.0
                          sum            sum             8.0
             Custom       zeros          count           0.0
                                         %               0.0
                          filled         count           4.0
                                         %             100.0
dtype: object
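
The % rows in this example can be checked by hand: each one is the count above it divided by a reference total, times 100. The sketch below reproduces a few of them. pct is a hypothetical helper, and rounding to one decimal is an assumption inferred from the displayed output, not taken from summary_perc.

```python
def pct(count, total, ndigits=1):
    """Hypothetical helper: share of `count` in `total`, as a rounded percentage."""
    return round(100 * count / total, ndigits)


total_elements = 9  # length of the example Series
non_numeric = 4     # 'a', 'b', 'a', 'a'

print(pct(1, total_elements))            # all / NAs
print(pct(non_numeric, total_elements))  # Not Numeric / Num. elements
print(pct(3, non_numeric))               # Not Numeric / Value Count / a
```

The reference total differs by row: element counts are compared against the full Series length, while value frequencies are compared against the size of their own subset.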
Source code in metaframe/src/series/summary.py
def summary(self, **kwargs) -> Self:
    """
    Compute a structured summary of the Series.

    Produces a MultiIndex Series describing counts, missing values,
    descriptive statistics, value frequencies, and optional custom metrics.

    The resulting Series will have a MultiIndex with the following levels:

    - dtype
        -> 'all', 'Not Numeric' or 'Numeric'
        Indicates which subset of the original Series (by data type) the metric was computed on
    - Mode
        -> 'Describe', 'Value Count' or 'Custom'
        Indicates which source produced the metric
        - Describe: pandas describe method
        - Value Count: pandas value_counts method
        - Custom: user-defined function from the d_func parameter
    - Metric
        Name of the computed metric
    - Type
        Type of the metric
        Each 'count' metric is followed by its associated '%' metric

    Behavior depends on `value_counts`:

    - None  -> automatically split numeric and non-numeric data
    - True  -> frequency-based summary only (on numeric and non-numeric data)
    - False -> descriptive statistics only (on numeric data)

    Parameters
    ----------
    kwargs:
        SeriesSummaryOpts keyword arguments.

    Returns
    -------
    Series
        MultiIndex Series containing the summary statistics.

    Examples
    --------
    >>> s = Series([1, 2, 'a', 2, 'b', 'a', 3, 'a', None])
    >>> s.summary()
    dtype        Mode         Metric         Type      
    all          Describe     Num. elements  count             9
                              NAs            count           1.0
                                             %              11.1
    Not Numeric  Describe     Num. elements  count           4.0
                                             %              44.4
                              unique         count             2
                                             %              50.0
                              top            top               a
                              freq           freq              3
                 Value Count  a              count             3
                                             %              75.0
                              b              count             1
                                             %              25.0
    Numeric      Describe     Num. elements  count           4.0
                                             %              44.4
                              mean           mean            2.0
                              std            std            0.82
                              min            min             1.0
                              25%            percentile     1.75
                              50%            percentile      2.0
                              75%            percentile     2.25
                              max            max             3.0
                              sum            sum             8.0
                 Custom       zeros          count           0.0
                                             %               0.0
                              filled         count           4.0
                                             %             100.0
    dtype: object
    """
    if self.empty:
        return self._constructor()
    opts = SeriesSummaryOpts(**kwargs)
    l_summary_names = [opts.label_type, opts.label_mode, opts.label_metric, opts.label_metric_type]
    summary_mi = pd.MultiIndex.from_tuples([], names=l_summary_names)
    processed_summary = self._constructor(index=summary_mi)
    na_summary = self._constructor(index=summary_mi)
    count_summary = self._constructor(index=summary_mi)
    count_summary[(opts.label_type_all, opts.label_mode_desc, opts.label_metric_count, opts.label_metric_type_count)] = self.shape[0]
    if not opts.skip_na:
        na_summary[(opts.label_type_all, opts.label_mode_desc, opts.label_metric_nas, opts.label_metric_type_count)] = self.isna().sum()
        na_summary = summary_perc(na_summary, self.shape[0], opts)
    if opts.value_counts is None:
        num_mask = self.dropna().apply(lambda x: not isinstance(x, (bool, np_bool)) and isinstance(x, Number))
        processed_summary_not_num = self[num_mask[~num_mask].index].summary(**asdict(replace(opts, value_counts=True, skip_na=True, _tot_shape=self.shape[0])))
        if not processed_summary_not_num.empty:
            processed_summary_not_num = processed_summary_not_num.rename({opts.label_type_all: opts.label_type_not_num}, level=opts.label_type)
        processed_summary_num = self[num_mask[num_mask].index].infer_objects().summary(**asdict(replace(opts, value_counts=False, skip_na=True, _tot_shape=self.shape[0])))
        if not processed_summary_num.empty:
            processed_summary_num = processed_summary_num.rename({opts.label_type_all: opts.label_type_num}, level=opts.label_type)
        to_concat = [e for e in [processed_summary_not_num, processed_summary_num] if not e.empty]
        if to_concat:
            processed_summary = pd.concat(to_concat)
    elif opts.value_counts:
        processed_summary = self.astype(str)._summary_not_num(l_summary_names=l_summary_names, opts=opts)
    else:
        if pd.api.types.is_numeric_dtype(self.dtype):
            processed_summary = self._summary_num(l_summary_names=l_summary_names, summary_mi=summary_mi, opts=opts)
        else:
            count_summary.iloc[0] = 0
    if opts._tot_shape is not None:
        count_summary = summary_perc(count_summary, opts._tot_shape, opts)
    summary = pd.concat([e for e in [count_summary, na_summary, processed_summary] if not e.empty])
    summary.name = self.name
    return summary

_summary_not_num(l_summary_names, opts)

Compute summary statistics for non-numeric data.

Includes descriptive metrics and value frequencies, with optional percentage computation.

Parameters:

Name             Type               Description  Default
l_summary_names  List[str]                       required
opts             SeriesSummaryOpts               required

Returns:

Type    Description
Series  MultiIndex summary for non-numeric values.

Source code in metaframe/src/series/summary.py
def _summary_not_num(self, l_summary_names: List[str], opts: SeriesSummaryOpts) -> Self:
    """
    Compute summary statistics for non-numeric data.

    Includes descriptive metrics and value frequencies, with optional
    percentage computation.

    Parameters
    ----------
    l_summary_names: List[str]
    opts : SeriesSummaryOpts

    Returns
    -------
    Series
        MultiIndex summary for non-numeric values.
    """
    s_summary_desc = self.describe(include='all', **opts.describe_kwargs).drop('count')
    s_summary_desc.index = pd.MultiIndex.from_tuples([(opts.label_type_all, opts.label_mode_desc, e, e if e!='unique' else opts.label_metric_type_count) for e in s_summary_desc.index], names=l_summary_names)
    s_summary_values = self.value_counts().fillna(0)
    s_summary_values.index = pd.MultiIndex.from_tuples([(opts.label_type_all, opts.label_mode_value_count, str(e), opts.label_metric_type_count) for e in s_summary_values.index], names=l_summary_names)
    return summary_perc(pd.concat([s_summary_desc, s_summary_values]), self.shape[0], opts)

_summary_num(l_summary_names, summary_mi, opts)

Compute summary statistics for numeric data.

Includes descriptive statistics (mean, std, min, percentiles, max, sum) and optional custom metrics provided through d_func.

Parameters:

Name             Type               Description  Default
l_summary_names  List[str]                       required
summary_mi       MultiIndex                      required
opts             SeriesSummaryOpts               required

Returns:

Type    Description
Series  MultiIndex summary for numeric values.

Source code in metaframe/src/series/summary.py
def _summary_num(self, 
                 l_summary_names: List[str], 
                 summary_mi: pd.MultiIndex, 
                 opts: SeriesSummaryOpts) -> Self:
    """
    Compute summary statistics for numeric data.

    Includes descriptive statistics (mean, std, min, percentiles, max, sum)
    and optional custom metrics provided through `d_func`.

    Parameters
    ----------
    l_summary_names: List[str]
    summary_mi: pd.MultiIndex
    opts: SeriesSummaryOpts

    Returns
    -------
    Series
        MultiIndex summary for numeric values.
    """
    summary_desc = self.describe(**opts.describe_kwargs).drop('count').round(opts.round_desc)
    summary_desc.index = pd.MultiIndex.from_tuples([(opts.label_type_num, opts.label_mode_desc, e, e if not e.endswith('%') else opts.label_metric_type_percentile) for e in summary_desc.index], names=l_summary_names)
    processed_summary = self._constructor(name=self.name, index=summary_mi)
    processed_summary[(opts.label_type_num, opts.label_mode_desc, 'sum', 'sum')] = self.sum()
    for metric_type, d in opts.d_func.items():
        for metric_name, func in d.items():
            processed_summary[(opts.label_type_num, opts.label_mode_custom, metric_name, metric_type)] = func(self)
    return summary_perc(pd.concat([summary_desc, processed_summary.round(opts.round_desc)]), self.shape[0], opts)
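
The nested loop over opts.d_func above implies a two-level mapping: metric type, then metric name, then a callable applied to the data. A minimal sketch of that shape on a plain list follows. The 'zeros' and 'filled' functions are guesses modelled on the metric names in the summary() example output, not the library's defaults.

```python
# Assumed shape, inferred from the loop: {metric_type: {metric_name: callable}}
d_func = {
    "count": {
        "zeros": lambda values: sum(1 for v in values if v == 0),
        "filled": lambda values: sum(1 for v in values if v is not None),
    },
}

values = [1, 2, 2, 3]
custom = {}
for metric_type, metrics in d_func.items():
    for metric_name, func in metrics.items():
        # The key mirrors the last two MultiIndex levels: (Metric, Type).
        custom[(metric_name, metric_type)] = func(values)

print(custom)
```

Grouping by metric type first means every callable filed under "count" can be given a matching % row, whereas callables under another type would be reported as-is.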

fullmatch(pattern, **kwargs)

Test whether each value fully matches a regex pattern.

Each value is cast to string and matched using re.fullmatch. Missing values return False.

Parameters:

Name      Type  Description                                      Default
pattern   str   Regular expression pattern.                      required
**kwargs        Additional arguments forwarded to re.fullmatch.  {}

Returns:

Type Description
Series of bool

Examples:

>>> s = Series(["A1", "B2", "AA"])
>>> s.fullmatch(r"[A-Z]\d")
0     True
1     True
2    False
dtype: bool
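
For comparison, here is a standard-library sketch of the same per-value test on a plain list, next to re.match to show why the distinction matters. The list values and the None handling are illustrative assumptions, not the library's implementation.

```python
import re

pattern = r"[A-Z]\d"
values = ["A1", "B2", "AA", "A12", None]

# Missing values are mapped to False; everything else is cast to str first.
full = [v is not None and re.fullmatch(pattern, str(v)) is not None for v in values]
# re.match only anchors at the start, so "A12" also passes.
prefix = [v is not None and re.match(pattern, str(v)) is not None for v in values]

print(full)    # [True, True, False, False, False]
print(prefix)  # [True, True, False, True, False]

# Extra keyword arguments correspond to re.fullmatch's `flags` parameter.
case_insensitive = re.fullmatch(pattern, "a1", flags=re.IGNORECASE) is not None
print(case_insensitive)  # True
```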
Source code in metaframe/src/series/core.py
def fullmatch(self, pattern: str, **kwargs) -> Self:
    """
    Test whether each value fully matches a regex pattern.

    Each value is cast to string and matched using `re.fullmatch`.
    Missing values return ``False``.

    Parameters
    ----------
    pattern : str
        Regular expression pattern.
    **kwargs
        Additional arguments forwarded to ``re.fullmatch``.

    Returns
    -------
    Series of bool

    Examples
    --------
    >>> s = Series(["A1", "B2", "AA"])
    >>> s.fullmatch(r"[A-Z]\\d")
    0     True
    1     True
    2    False
    dtype: bool
    """
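    if not isinstance(pattern, str):
        pattern = str(pattern)
    # Record missing values before the string cast: astype(str) can turn NaN into "nan",
    # after which an `x == x` check can no longer detect it.
    not_missing = self.notna()
    matched = self.astype(str).apply(lambda x: isinstance(x, str) and re_fullmatch(pattern, x, **kwargs) is not None)
    return matched & not_missing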

to_int(start_at=0)

Encode unique values as consecutive integers.

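Identical values receive identical integers, assigned in order of first appearance. Missing values are not dropped: they are filled with a placeholder first and therefore share one integer code of their own.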

Parameters:

Name      Type  Description              Default
start_at  int   Starting integer label.  0

Returns:

Type Description
Series of int

Examples:

>>> s = Series(["a", "b", "a"])
>>> s.to_int()
0    0
1    1
2    0
dtype: int64
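
The encoding can be pictured with a pure-Python analogue: fill missing values with a sentinel, then number the distinct values in order of first appearance. encode and the PLACEHOLDER value below are hypothetical; the library's own placeholder is not shown on this page.

```python
import math

PLACEHOLDER = "__missing__"  # hypothetical sentinel, chosen for this sketch only


def encode(values, start_at=0):
    """Hypothetical pure-Python analogue of the encoding step."""
    codes = {}
    result = []
    for v in values:
        # Treat None and float NaN as missing and route them to one sentinel.
        if v is None or (isinstance(v, float) and math.isnan(v)):
            v = PLACEHOLDER
        # The first appearance decides the code; repeats reuse it.
        if v not in codes:
            codes[v] = len(codes) + start_at
        result.append(codes[v])
    return result


print(encode(["a", "b", "a"]))                      # [0, 1, 0]
print(encode(["a", None, "b", None], start_at=1))   # [1, 2, 3, 2]
```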
Source code in metaframe/src/series/core.py
def to_int(self, start_at: int=0) -> Self:
    """
    Encode unique values as consecutive integers.

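    Identical values receive identical integers, assigned in order of first
    appearance. Missing values are not dropped: they are filled with a
    placeholder first and therefore share one integer code of their own.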

    Parameters
    ----------
    start_at : int, default 0
        Starting integer label.

    Returns
    -------
    Series of int

    Examples
    --------
    >>> s = Series(["a", "b", "a"])
    >>> s.to_int()
    0    0
    1    1
    2    0
    dtype: int64
    """
    # When dtype is 'O' and the series is pseudo-numeric (numbers, numeric strings and NaN),
    # replace() leaves NaN untouched, so missing values are filled with a placeholder first.
    filled = self.fillna(PLACEHOLDER)
    codes = {v: i + start_at for i, v in enumerate(filled.unique())}
    return filled.replace(codes).astype(int)