7 Pandas Tricks to Improve Your Machine Learning Model Development

Image by Author | ChatGPT

Introduction

If you’re reading this, you likely already know that a machine learning model’s performance is not just a function of the chosen algorithm. It is also heavily influenced by the quality and representation of the data the model is trained on.

Data preprocessing and feature engineering are among the most important steps in your machine learning workflow. In the Python ecosystem, Pandas is the go-to library for these data manipulation tasks, something you also likely know. Mastering a few select Pandas transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better-performing models.

This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.

Preparing Our Data

To demonstrate these tricks, we’ll use the classic Titanic dataset. It is a useful example because it contains a mix of numerical and categorical data, as well as missing values: challenges you will frequently encounter in real-world machine learning tasks.

We can easily load the dataset into a Pandas DataFrame directly from a URL:

import pandas as pd
import numpy as np

# Load the Titanic dataset from URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Output shape and first 5 rows
print("Dataset shape:", df.shape)
print(df.head())

Output:

Dataset shape: (891, 12)

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

This gives us a DataFrame with columns like Survived (our target variable), Pclass (passenger class), Sex, Age, and more.

Now, let’s reach into our bag of tricks.

1. Using query() for Cleaner Data Filtering

Filtering data is a never-ending task, whether you are creating subsets for training or exploring specific segments. The standard approach, boolean indexing, can become clumsy and convoluted with multiple conditions. The query() method offers a more readable and intuitive alternative by allowing you to filter with a string expression.

Standard Filtering

# Filter for first-class passengers over 30 who survived
filtered_df = df[(df['Pclass'] == 1) & (df['Age'] > 30) & (df['Survived'] == 1)]
print(filtered_df.head())

Filtering with query()

# Same filter, but using the query() method
query_df = df.query('Pclass == 1 and Age > 30 and Survived == 1')
print(query_df.head())

Same output:

    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch    Ticket     Fare Cabin Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599  71.2833   C85        C
3             4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0    113803  53.1000  C123        S
11           12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0    113783  26.5500  C103        S
52           53         1       1           Harper, Mrs. Henry Sleeper (Myna Haxtun)  female  49.0      1      0  PC 17572  76.7292   D33        C
61           62         1       1                                Icard, Miss. Amelie  female  38.0      0      0    113572  80.0000   B28      NaN

I doubt you would disagree that the query() version is cleaner and easier to read, especially as the number of conditions grows.
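One more convenience worth knowing: query() can reference local Python variables with the @ prefix, which keeps thresholds out of hard-coded strings. A minimal sketch (the variable names are illustrative, not part of the original example):

# Filter thresholds held in ordinary Python variables
min_age = 30
target_class = 1

# The @ prefix lets query() read local variables directly
filtered = df.query('Pclass == @target_class and Age > @min_age and Survived == 1')
print(filtered.shape)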

2. Creating Bins for Continuous Variables with cut()

Some models, linear models in particular, can benefit from discretizing continuous variables, which can help the model capture non-linear relationships. The pd.cut() function bins data into custom ranges. To demonstrate, let’s create age groups.

# Define the bins and labels for age groups
bins = [0, 12, 18, 60, np.inf]
labels = ['Child', 'Teenager', 'Adult', 'Senior']

# Create the new 'AgeGroup' feature
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Display the counts of each age group
print(df['AgeGroup'].value_counts())

Output:

AgeGroup
Adult       575
Child        68
Teenager     45
Senior       26
Name: count, dtype: int64

This new AgeGroup feature is a powerful categorical variable that your model can now use.
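If you don’t have domain-driven boundaries in mind, the closely related pd.qcut() bins by quantiles instead of fixed edges, producing groups of roughly equal size. A minimal sketch (the fare band labels are illustrative, and the result is kept in a standalone Series so the DataFrame used later is unchanged):

# Quantile-based binning: four fare bands with roughly equal counts
fare_band = pd.qcut(df['Fare'], q=4, labels=['Low', 'Mid', 'High', 'VeryHigh'])
print(fare_band.value_counts())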

3. Extracting Features from Text with the .str Accessor

Text columns often contain valuable, structured information. The .str accessor in Pandas provides a whole host of string processing methods that work on an entire series at once. We can use the .str accessor with a regular expression to extract passenger titles (e.g. ‘Mr.’, ‘Miss.’, ‘Dr.’) from the Name column.

# Use a regular expression to extract titles from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Display the value counts of the new Title feature
print(df['Title'].value_counts())

Output:

Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64

This Title feature has often proven to be a strong predictor of survival in Titanic models.
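Many of these titles are too rare to carry signal on their own, so a common follow-up step (not part of the extraction itself) is to collapse them into a single bucket. A hedged sketch, kept in a separate variable so the DataFrame is not modified:

# Keep the four frequent titles; collapse everything else into 'Rare'
common_titles = ['Mr', 'Miss', 'Mrs', 'Master']
title_grouped = df['Title'].where(df['Title'].isin(common_titles), 'Rare')
print(title_grouped.value_counts())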

4. Performing Advanced Imputation with transform()

Simply dropping rows with missing data is often not an option, as it discards information your model could learn from. In many situations, a better strategy is imputation. While filling with a global mean or median is common, a more sophisticated approach is to impute based on a related group. For example, we can fill missing Age values with the median age of passengers in the same Pclass. The groupby() and transform() methods make this straightforward and elegant.

# Check missing values before imputation
print("Missing Age values before imputation:", df['Age'].isnull().sum())

# Calculate the median age for each passenger class, aligned to each row
median_age_by_pclass = df.groupby('Pclass')['Age'].transform('median')

# Fill missing Age values with the matching class median
df['Age'] = df['Age'].fillna(median_age_by_pclass)

# Verify that there are no more missing Age values
print("Missing Age values after imputation:", df['Age'].isnull().sum())

Output:

Missing Age values before imputation: 177
Missing Age values after imputation: 0

We did it; there are no more missing ages. This group-based imputation is often more accurate than a single global value because it respects systematic differences between groups: first-class passengers, for instance, tended to be older than third-class passengers.
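The reason transform() is the right tool here, rather than agg(), is that it broadcasts each group’s statistic back onto the original index, so the result aligns row-for-row with the Age column. A quick way to see the difference:

# agg() collapses to one row per group; transform() keeps the original index
print(df.groupby('Pclass')['Age'].agg('median'))               # 3 rows, one per class
print(df.groupby('Pclass')['Age'].transform('median').head())  # 891 rows, aligned to df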

5. Streamlining Workflows with Method Chaining and pipe()

A machine learning preprocessing pipeline often involves multiple steps. Chaining these operations together can make the code more readable and help to avoid creating unnecessary intermediate DataFrames. The pipe() method takes this a step further by allowing you to integrate your own custom functions into the chain along the way.

First, let’s define a custom function to drop columns, and another to encode the Sex column as 0 for male and 1 for female. Then we can build a pipeline that integrates these two custom functions into our chain with pipe().

# A custom function to drop columns
def drop_cols(df, cols_to_drop):
    return df.drop(columns=cols_to_drop)

# A custom function to encode 'Sex'
def encode_sex(df):
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
    return df

# Create a chained pipeline
processed_df = (df.copy()
                  .pipe(drop_cols, cols_to_drop=['Ticket', 'Cabin', 'Name'])
                  .pipe(encode_sex)
               )

print(processed_df.head())

And our output:

   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare Embarked AgeGroup Title
0            1         0       3    0  22.0      1      0   7.2500        S    Adult    Mr
1            2         1       1    1  38.0      1      0  71.2833        C    Adult   Mrs
2            3         1       3    1  26.0      0      0   7.9250        S    Adult  Miss
3            4         1       1    1  35.0      1      0  53.1000        S    Adult   Mrs
4            5         0       3    0  35.0      0      0   8.0500        S    Adult    Mr

This approach is effective for building clean, reproducible machine learning pipelines.
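pipe() also composes naturally with pandas’ built-in chainable methods, so custom steps and library steps can mix freely. A sketch of the same chain with an extra assign() step (FamilySize is an illustrative feature, not part of the original pipeline):

# Mix a built-in assign() step into the custom pipeline
with_family = (df.copy()
                 .pipe(drop_cols, cols_to_drop=['Ticket', 'Cabin', 'Name'])
                 .pipe(encode_sex)
                 .assign(FamilySize=lambda d: d['SibSp'] + d['Parch'] + 1)
              )
print(with_family[['SibSp', 'Parch', 'FamilySize']].head())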

6. Mapping Ordinal Categories Efficiently with map()

While one-hot encoding is standard for nominal categorical data, ordinal data (where categories have a natural order) is better handled by mapping to integers. A dictionary and the map() method are perfect for this. Let’s imagine the port of embarkation has a meaningful ordering.

# Let's assume Embarked has an order: S > C > Q
embarked_mapping = {'S': 2, 'C': 1, 'Q': 0}
df['Embarked_mapped'] = df['Embarked'].map(embarked_mapping)
print(df[['Embarked', 'Embarked_mapped']].head())

And here is our output:

  Embarked  Embarked_mapped
0        S              2.0
1        C              1.0
2        S              2.0
3        S              2.0
4        S              2.0

This is a fast and explicit way to encode ordinal relationships for your model to learn.
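One caveat: map() returns NaN for any value absent from the dictionary, including the two genuinely missing Embarked entries in this dataset, which is why the mapped column above is float rather than int. If your model needs a complete integer column, an explicit fill is a sensible follow-up (the -1 sentinel here is an illustrative choice, computed on a separate Series so the DataFrame is untouched):

# Unmapped values become NaN; fill them explicitly before casting to int
embarked_int = df['Embarked'].map(embarked_mapping).fillna(-1).astype('int8')
print(embarked_int.value_counts())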

7. Optimizing Memory with astype()

When working with large datasets, memory usage can become a bottleneck. Pandas defaults to large data types (like int64 and float64), but you can often use smaller types without losing information. Converting low-cardinality object columns to the category dtype is especially effective.

# Check original memory usage
print("Original memory usage:")
print(df.info(memory_usage='deep'))

# Optimize data types
df_optimized = df.copy()
df_optimized['Pclass'] = df_optimized['Pclass'].astype('int8')
df_optimized['Sex'] = df_optimized['Sex'].astype('category')
df_optimized['Age'] = df_optimized['Age'].astype('float32')
df_optimized['Embarked'] = df_optimized['Embarked'].astype('category')

# Check new memory usage
print("\nOptimized memory usage:")
print(df_optimized.info(memory_usage='deep'))

The output:

Original memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   PassengerId      891 non-null    int64
 1   Survived         891 non-null    int64
 2   Pclass           891 non-null    int64
 3   Name             891 non-null    object
 4   Sex              891 non-null    object
 5   Age              891 non-null    float64
 6   SibSp            891 non-null    int64
 7   Parch            891 non-null    int64
 8   Ticket           891 non-null    object
 9   Fare             891 non-null    float64
 10  Cabin            204 non-null    object
 11  Embarked         889 non-null    object
 12  AgeGroup         714 non-null    category
 13  Title            891 non-null    object
 14  Embarked_mapped  889 non-null    float64
dtypes: category(1), float64(3), int64(5), object(6)
memory usage: 338.9 KB
None

Optimized memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   PassengerId      891 non-null    int64
 1   Survived         891 non-null    int64
 2   Pclass           891 non-null    int8
 3   Name             891 non-null    object
 4   Sex              891 non-null    category
 5   Age              891 non-null    float32
 6   SibSp            891 non-null    int64
 7   Parch            891 non-null    int64
 8   Ticket           891 non-null    object
 9   Fare             891 non-null    float64
 10  Cabin            204 non-null    object
 11  Embarked         889 non-null    category
 12  AgeGroup         714 non-null    category
 13  Title            891 non-null    object
 14  Embarked_mapped  889 non-null    float64
dtypes: category(3), float32(1), float64(2), int64(4), int8(1), object(4)
memory usage: 241.3 KB
None

Here the footprint drops from 338.9 KB to 241.3 KB, a reduction of almost 30%. On large datasets, savings like this can be the difference between training comfortably in memory and crashing your machine.
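If you would rather not pick integer widths by hand, pd.to_numeric() with the downcast argument chooses the smallest safe type automatically. A minimal sketch on the remaining int64 columns:

# Automatically downcast 64-bit integer columns to the smallest safe type
for col in ['PassengerId', 'Survived', 'SibSp', 'Parch']:
    df_optimized[col] = pd.to_numeric(df_optimized[col], downcast='integer')
print(df_optimized[['PassengerId', 'Survived', 'SibSp', 'Parch']].dtypes)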

Wrapping Up

Machine learning always starts with well-prepared data. While the complexity of algorithms, their hyperparameters, and the model-building process often capture the spotlight, the efficient manipulation of data is where the real leverage lies.

The seven Pandas tricks covered here are more than just coding shortcuts — they represent powerful strategies for cleaning your data, engineering insightful features, and building robust, reproducible models.
