Principal Component Analysis on Open Data using IBM DSX – Environmental Water Quality !!

Level : Expert Data Scientist

Principal Component Analysis (PCA) is one way of doing linear dimensionality reduction, setting the stage for predictive analytics.

PCA is often applied to images, where the image is projected into a lower-dimensional space to avoid high computational cost.

In this article, I have taken a data set with multiple parameters measured in different units. It demonstrates how PCA helps to represent such multi-parameter, multi-unit variables visually.

Data set: Environmental_Water_Quality_in_India

https://data.gov.in/

 

data

In this dataset, the parameters are in different units. For example:

  • Dissolved Oxygen (D.O.) in mg/l,
  • Conductivity in µmhos/cm,
  • Coliform in MPN/100ml.

How can this be represented in a plot? Anything beyond 2 or 3 parameters is hard to visualize, and here the parameters are also in different units.

Let us use PCA to plot the above, in 5 simple steps:
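To see why the mixed units matter, here is a small sketch on made-up numbers. The values below are hypothetical and only mimic the different magnitudes of the three parameters; they are not taken from the data set. Without standardization, the parameter with the largest numeric spread dominates the direction of largest variance, which is why the data is standardized before PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical readings, chosen only to mimic the different magnitudes
do_mg_l = rng.normal(6.0, 1.0, n)           # dissolved oxygen, mg/l
conductivity = rng.normal(300.0, 80.0, n)   # conductivity, µmhos/cm
coliform = rng.normal(2000.0, 900.0, n)     # coliform, MPN/100ml
raw = np.column_stack([do_mg_l, conductivity, coliform])


def first_direction(data):
    """Return the unit vector along which the centered data varies most."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Standardize: zero mean and unit standard deviation per column
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

raw_direction = first_direction(raw)
std_direction = first_direction(standardized)
print("raw data:         ", np.round(np.abs(raw_direction), 3))
print("standardized data:", np.round(np.abs(std_direction), 3))
```

On the raw data the first direction points almost entirely along the coliform column, simply because its numbers are the largest.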

Step 1: Download the sample data

(Search for Environmental_Water_Quality_in_India, or take any other data set of your choice.)

Step 2: Register, log in and create a notebook on Data Science Experience (DSX)

projects

  • Create a new Python notebook using the URL option as shown:

add_note

create_note

Note: URL to provide: https://raw.githubusercontent.com/RajeshJeyapaul/PCA_River/master/PCA_river_environment.ipynb

create_arrow

Step 3: Add the dataset and insert it into the code

  • Open the notebook created in the previous step
  • Import the data file as shown below

add_data

Note: Sample file https://raw.githubusercontent.com/RajeshJeyapaul/PCA_River/master/Environmental_Water_Quality_Ker_TN_Kar_total_new.csv

  • Insert the dataset into the code as shown below. Use the pandas DataFrame option.

insert_code

insert_code_py
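Inside DSX the insert option generates the loading code for you. Outside DSX, a roughly equivalent result can be sketched with pandas.read_csv. The helper name load_numeric_frame, the column names and the three sample rows below are hypothetical and only stand in for the real file; keeping only the numeric columns is my assumption, made because PCA needs numeric input.

```python
import io

import pandas as pd


def load_numeric_frame(source):
    """Read a CSV (path, URL or file-like object) and keep only numeric columns."""
    frame = pd.read_csv(source)
    # PCA needs numeric input, so drop text columns such as station names
    return frame.select_dtypes(include="number").dropna()


# Hypothetical three-row sample standing in for the real file
sample_csv = io.StringIO(
    "STATION,DO,CONDUCTIVITY,COLIFORM\n"
    "A,6.5,210,1500\n"
    "B,5.9,340,2300\n"
    "C,7.1,180,900\n"
)
df_data_1 = load_numeric_frame(sample_csv)
print(df_data_1.head())
```

The same helper should also accept the path or URL of the sample CSV in place of the in-memory sample.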

Step 4: Fit PCA to the data:

  • Using the scale function from sklearn, standardize the data to have zero mean and a standard deviation of 1
    • X = pd.DataFrame(scale(df_data_1), index=df_data_1.index, columns=df_data_1.columns)
  • Compute the loading vectors (the loading vectors are the eigenvectors of XᵀX)
    • pca_loadings = pd.DataFrame(PCA().fit(X).components_.T, index=df_data_1.columns, columns=['V1', 'V2', 'V3', 'V4', 'V5', 'V6'])
  • Fit the PCA model and transform X to get the principal components
    • df_plot = pd.DataFrame(PCA().fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6'], index=X.index)
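The steps above can be sketched end to end on a small synthetic table. The three-column df_data_1 built here is a hypothetical stand-in for the real data (which has six parameters), so the numbers mean nothing in themselves. The sketch also checks numerically that the loading vectors returned by sklearn agree, up to sign, with the eigenvectors of XᵀX.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
# Hypothetical stand-in for df_data_1: 50 samples, 3 parameters
df_data_1 = pd.DataFrame(
    rng.normal(size=(50, 3)) * [1.0, 80.0, 900.0] + [6.0, 300.0, 2000.0],
    columns=["DO", "CONDUCTIVITY", "COLIFORM"],
)

# Standardize every column to zero mean and unit standard deviation
X = pd.DataFrame(scale(df_data_1), index=df_data_1.index, columns=df_data_1.columns)

pca = PCA()
scores = pca.fit_transform(X)   # principal components, one row per sample
loadings = pca.components_.T    # one column per loading vector

# Eigenvectors of X^T X, reordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(X.values.T @ X.values)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Eigenvectors are only defined up to sign, so compare absolute values
same_up_to_sign = np.allclose(np.abs(loadings), np.abs(eigvecs), atol=1e-6)
print("loadings match eigenvectors of X^T X:", same_up_to_sign)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Because X is already centered, the principal components are simply X multiplied by the loading matrix.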

 

Step 5: Plot the data

  • Plot the principal component loading vectors on a second pair of axes:

for i in pca_loadings[['V1', 'V2']].index:
    ax2.annotate(i, (-pca_loadings.V1.loc[i]*a, -pca_loadings.V2.loc[i]*a), color='red')

 

plot
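The annotation loop in this step uses ax2 and a without showing where they come from. Below is a minimal self-contained sketch of such a plot, under my own assumptions: the data is synthetic, the column names are hypothetical, ax2 is a second pair of axes layered over the scatter plot, and a is an offset factor I picked so that the labels sit just beyond the arrow tips. The sign of each loading vector is arbitrary, so this sketch plots them without the minus sign used in the loop.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(2)
cols = ["DO", "CONDUCTIVITY", "COLIFORM"]  # hypothetical column names
X = pd.DataFrame(scale(rng.normal(size=(40, 3))), columns=cols)

pca = PCA()
df_plot = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2", "PC3"], index=X.index)
pca_loadings = pd.DataFrame(pca.components_.T, index=cols, columns=["V1", "V2", "V3"])

# Scatter of the samples in the plane of the first two principal components
fig, ax1 = plt.subplots(figsize=(7, 7))
ax1.scatter(df_plot.PC1, df_plot.PC2, s=10)
ax1.set_xlabel("First principal component")
ax1.set_ylabel("Second principal component")

# Second pair of axes, with its own scale, for the loading vectors
ax2 = ax1.twinx().twiny()
ax2.set_xlim(-1, 1)
ax2.set_ylim(-1, 1)

a = 1.07  # assumed offset factor for label placement
for name in pca_loadings.index:
    v1 = pca_loadings.loc[name, "V1"]
    v2 = pca_loadings.loc[name, "V2"]
    ax2.annotate(name, (v1 * a, v2 * a), color="red")
    ax2.arrow(0, 0, v1, v2, color="red", head_width=0.02)
```

In a notebook the figure renders inline; elsewhere, fig.savefig with a file name of your choice writes it to disk.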

Wow!! This is how PCA is applied to visualize multi-parameter data measured in different units.

Go back to school for the basic mathematical understanding, and aim to be an expert Data Scientist by getting into algorithms and model training & validation. All the best!!

Note: Git Repo : https://github.com/RajeshJeyapaul/PCA_River

Author: Rajesh K Jeyapaul

Currently working as Developer Advocate and Startup Mentor @ IBM India, Bangalore, with a primary focus on IoT, Cognitive and Data Science. Apart from technical work, interested in exploring biblical histories and loves to play basketball. Interested in piano and violin. Loves spending time with family. Native of Tuticorin.
