# Principal Component Analysis on Open Data using IBM DSX – Environmental Water Quality

Level: Expert Data Scientist

Principal Component Analysis (PCA) is one way of performing linear dimensionality reduction, setting the stage for predictive analytics.

PCA is often applied to images, where the image is projected into a lower-dimensional space to avoid high computational cost.

In this article, I have taken a data set with multiple parameters measured in different units, and I demonstrate how PCA helps to represent such multi-parameter, multi-unit variables visually.

Data set: Environmental_Water_Quality_in_India

https://data.gov.in/

In this dataset, the parameters are in different units. For example:

• Dissolved Oxygen (D.O.) in mg/l,
• Conductivity in µmhos/cm,
• Coliform in MPN/100ml.

How can this be represented in a plot? Anything beyond 2 or 3 parameters is quite hard to visualize, and on top of that the parameters are in different units.
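To make the unit problem concrete, here is a small sketch using synthetic, made-up readings (not values from the actual dataset; the scales are only assumptions chosen to resemble the three units above). It shows how the parameter with the largest numeric range dominates the total variance until every column is standardized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up readings (NOT taken from the real dataset),
# drawn on rough scales that the three units might suggest.
do_mg_l = rng.normal(6.0, 1.0, size=200)           # dissolved oxygen, mg/l
conductivity = rng.normal(400.0, 150.0, size=200)  # conductivity, µmhos/cm
coliform = rng.normal(2000.0, 900.0, size=200)     # coliform, MPN/100ml

raw = np.column_stack([do_mg_l, conductivity, coliform])

# Share of the total variance contributed by each column in raw units.
raw_var_share = raw.var(axis=0) / raw.var(axis=0).sum()

# Standardize: subtract the column mean, divide by the column standard deviation.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)
std_var_share = standardized.var(axis=0) / standardized.var(axis=0).sum()

print("variance share, raw units:   ", raw_var_share.round(4))
print("variance share, standardized:", std_var_share.round(4))
```

Because PCA looks for directions of maximum variance, running it on the raw columns would mostly rediscover the column with the largest numbers. After standardization each parameter contributes equally.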

Let us use PCA to plot the above in five simple steps:

### Step 1: Get the dataset

• Search https://data.gov.in/ for Environmental_Water_Quality_in_India (or consider taking any other data of your choice)

### Step 2: Register, log in and create a notebook on Data Science Experience (DSX)

• Create a new Python notebook using the URL option

### Step 3: Add the dataset and insert it into the code

• Open the notebook that was created in the previous step
• Import the data file
• Insert the dataset into the code as a pandas DataFrame

### Step 4: Fit the PCA model to the data
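The PCA steps below work on the DataFrame inserted in Step 3, which this article calls `df_data_1`. Outside DSX, an equivalent DataFrame could be built roughly as follows. This is only a sketch: the CSV text, the station labels and the column names are made-up placeholders, not the real file's contents.

```python
import io

import pandas as pd

# Stand-in for the uploaded data file: made-up rows and hypothetical
# column names, used only so that the example runs on its own.
csv_text = """STATION,DO,CONDUCTIVITY,COLIFORM
A,6.5,320,1500
B,5.8,410,2300
C,7.1,280,900
"""

# With a real file, pass its path to pd.read_csv instead of the StringIO object.
df_data_1 = pd.read_csv(io.StringIO(csv_text), index_col="STATION")

# PCA needs numeric, complete data: keep numeric columns and drop rows with gaps.
df_data_1 = df_data_1.select_dtypes(include="number").dropna()

print(df_data_1.head())
```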

• Using the `scale` function from sklearn, standardize the data to have zero mean and a standard deviation of 1
• `X = pd.DataFrame(scale(df_data_1), index=df_data_1.index, columns=df_data_1.columns)`
• Fit the PCA model (`pca = PCA()` from `sklearn.decomposition`) and transform X to get the principal components
• `df_plot = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6'], index=X.index)`
• Keep the principal component loading vectors; they are plotted using a second y-axis in Step 5
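Put together, the steps in this section could look like the sketch below. The names `X`, `df_data_1` and `df_plot` come from the article; the six-column synthetic DataFrame and the `param_*` names are assumptions made only so that the snippet runs without the real dataset, and `pca_loadings` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(42)

# Synthetic stand-in for df_data_1: 50 rows, 6 hypothetical parameters
# on very different numeric scales (made-up values, not the real data).
columns = [f"param_{i + 1}" for i in range(6)]
scales = np.array([1.0, 10.0, 100.0, 1000.0, 5.0, 50.0])
df_data_1 = pd.DataFrame(rng.normal(size=(50, 6)) * scales, columns=columns)

# Standardize every column to zero mean and unit standard deviation.
X = pd.DataFrame(scale(df_data_1), index=df_data_1.index, columns=df_data_1.columns)

# Fit PCA and project the standardized data onto the principal components.
pca = PCA()
pc_names = [f"PC{i + 1}" for i in range(X.shape[1])]
df_plot = pd.DataFrame(pca.fit_transform(X), columns=pc_names, index=X.index)

# Loading vectors: one row per original parameter, one column per component.
pca_loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=pc_names)

# Fraction of the total variance explained by each component.
print(pca.explained_variance_ratio_.round(3))
```

The explained variance ratios give a quick check on how much information the first two components retain before plotting them.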

### Step 5: Plot the data

• Use the principal component loading vectors, drawn using a second y-axis, for plotting

Wow! This is how PCA is applied to visualize data with multiple parameters in different units.
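The plotting step can be sketched with matplotlib as shown here. This is one possible layout, not necessarily the one used in the original notebook: the observations are drawn on the PC1/PC2 axes, and the loading vectors are drawn as arrows on a second pair of axes. The data is again synthetic and the `param_*` names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(42)

# Synthetic stand-in data (made-up values, hypothetical column names).
columns = [f"param_{i + 1}" for i in range(6)]
df_data_1 = pd.DataFrame(rng.normal(size=(50, 6)), columns=columns)

X = pd.DataFrame(scale(df_data_1), index=df_data_1.index, columns=df_data_1.columns)
pca = PCA()
pc_names = [f"PC{i + 1}" for i in range(X.shape[1])]
df_plot = pd.DataFrame(pca.fit_transform(X), columns=pc_names, index=X.index)
pca_loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=pc_names)

fig, ax1 = plt.subplots(figsize=(8, 8))

# Observations projected onto the first two principal components.
ax1.scatter(df_plot["PC1"], df_plot["PC2"], s=15, alpha=0.6)
ax1.set_xlabel("First principal component")
ax1.set_ylabel("Second principal component")

# Second pair of axes (own x and y limits) for the loading vectors.
ax2 = ax1.twinx().twiny()
ax2.set_xlim(-1, 1)
ax2.set_ylim(-1, 1)
ax2.set_xlabel("Principal component loading vectors", color="red")

# One arrow per original parameter, from the origin to its (PC1, PC2) loading.
for name in pca_loadings.index:
    x, y = pca_loadings.loc[name, ["PC1", "PC2"]]
    ax2.arrow(0, 0, x, y, color="red", head_width=0.03)
    ax2.annotate(name, (x, y), color="red")
```

Arrows pointing in similar directions indicate parameters that vary together, regardless of the units they were originally measured in.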

Go back to school for the basic mathematical understanding, and aim to be an expert Data Scientist by getting into algorithms and model training & validation. All the best!

Note: Git repo: https://github.com/RajeshJeyapaul/PCA_River

## Author: Rajesh K Jeyapaul

Currently working as a Developer Advocate and Startup Mentor @ IBM India, Bangalore, with a primary focus on IoT, Cognitive and Data Science. Apart from technical work, he is interested in exploring biblical history and loves to play basketball. He also has an interest in piano and violin, and loves spending time with family. Native of Tuticorin.