Encoding Categorical Values, Python- Scikit-Learn

In Data science, we work with datasets which has multiple labels in one or more columns. They can be in numerical or text format of any encoding. This will be ideal and understandable by humans. Label Encoding converts text data into numerical data, machine-readable form. It is an important pre-processing step for structured data in supervised learning.

These variables are typically stored as text values which represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country) or gender(“Male”,”Female”). Regardless of what the value is used for, the challenge is determining how to use this data in the analysis.

Many machine learning algorithms can support categorical values without further manipulation but there are many more algorithms that do not. Therefore, the analyst is faced with the challenge of figuring out how to turn these text attributes into numerical values for further processing. As with many other aspects of the Data Science world, there is no single answer on how to approach this problem. Each approach has trade-offs and has a potential impact on the outcome of the analysis. Fortunately, the python tools of pandas and scikit-learn provide several approaches that can be applied to transform the categorical data into suitable numeric values.

Preparing the dataset

I am considering the dataset from UCI Machine Learning Repository, Pittsburgh Bridges . This data has mix of categorical values and continous values.

Let’s clean up the data, using pandas to do this task.

Read the data.

import pandas as pd
import numpy as np#Defining Headers for the data
headers=["IDENTIF","RIVER","LOCATION","ERECTED","PURPOSE","LENGTH","LANES","CLEAR-G","T-OR-D","MATERIAL","SPAN","REL-L","TYPE"]#Read the data from the repo
df=pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/bridges/bridges.data.version1",header=None,names=headers,na_values="?")#Print sample data
df.head()

df.dtypes #to check the data format

Label encoding will be applied over object data types. For the same let us copy the fields with object datatype into a new dataframe.

obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

int_df = df.select_dtypes(include=['int64']).copy()
float_df=df.select_dtypes(include=['float64']).copy()df_int_float = pd.concat([float_df,int_df], axis=1, join_axes=[int_df.index])

This can concatenate columns with other data types.

For the process of label encoding, Import scikit-learn.

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

let us encode data with respect to an object column

le = preprocessing.LabelEncoder()
le.fit(obj_df["PURPOSE"].astype(str))
list(le.classes_)

The data in the “PURPOSE” column is classified into [‘AQUEDUCT’, ‘HIGHWAY’, ‘RR’, ‘WALK’]

Let us encode for all the columns now.

obj_df_trf=obj_df.astype(str).apply(le.fit_transform)

All the columns are now encoded to a numerical weight on each value in the column. Let’s concatenate all the columns to get the encoded data.

df_final = pd.concat([df_int_float,obj_df_trf], axis=1, join_axes=[df_int_float.index])
df_final.head()

Finally, we have our encoded data and we may use this encoded data for any machine learning application.

Thanks for reading.

Encoding Categorical Values, Python- Scikit-Learn

Trilok Chowdary Maddipudi

Trilok Chowdary Maddipudi

Image Processing using Python

Pandas-Handling data in Python