How Can R Users Learn Python for Data Science?

Introduction

The best way to learn a new skill is by doing it!

This article is meant to help R users enhance their set of skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.

Python is a powerful, multi-purpose programming language that has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis and machine learning, which are relatively new areas for Python.

For a beginner in data science, learning Python for data analysis can be really painful. Why?

Try Googling "learn python" and you'll get tons of tutorials meant only for learning Python for web development. How can you find a way then?

In this tutorial, we'll explore the basics of Python for performing data manipulation tasks. Alongside, we'll also look at how you do the same tasks in R. This parallel comparison will help you relate the tasks you do in R to how you do them in Python! In the end, we'll take up a data set and practice our newly acquired Python skills.

Note: This article is best suited for people who have a basic knowledge of the R language.

 

Table of Contents

  1. Why learn Python (even if you already know R)
  2. Understanding Data Types and Structures in Python vs. R
  3. Writing Code in Python vs. R
  4. Practicing Python on a Data Set

 

Why learn Python (even if you already know R)

No doubt, R is tremendously great at what it does. In fact, it was originally designed for doing statistical computing and manipulations. Its incredible community support allows a beginner to learn R quickly.

But Python is catching up fast. Established companies and startups have embraced Python at a much larger scale than R.

[Figure: R machine learning vs. Python machine learning]

According to indeed.com (from Jan 2016 to November 2016), the number of job postings seeking "machine learning python" increased much faster (approx. 123%) than "machine learning in R" jobs. Do you know why? It is because:

  1. Python supports the entire spectrum of machine learning in a much better way.
  2. Python not only supports model building but also supports model deployment.
  3. Powerful deep learning libraries such as keras, convnet, theano, and tensorflow are better supported in Python than in R.
  4. You don't need to juggle several packages to locate a function in Python, as you do in R. Python has relatively fewer libraries, with each having most of the functions a data scientist would need.

 

Understanding Data Types and Structures in Python vs. R

These programming languages understand a data set through its variables and their data types. Let's say you have a data set with one million rows and 50 columns. How would these programming languages understand the data?

Basically, both R and Python have pre-defined data types. The dependent and independent variables get classified among these data types. And, based on the data type, the interpreter allots memory for use. Python supports the following data types:

  1. Numbers - These store numeric values. Python 2 had four numeric types: integer, long, float, and complex. Python 3 merges integer and long into a single int type. Let's understand them.
    • Integer - It refers to whole numbers such as 10, 13, 91, 102, etc. It is the same as R's integer type.
    • Long - In Python 2, it refers to integers too large for the plain integer type (like any integer, they can be written in decimal, octal, or hexadecimal notation). In Python 3, int has no size limit, so there is no separate long type. In R, the bit64 package provides 64-bit integers.
    • Float - It refers to decimal values such as 1.23, 9.89, etc. It is the same as R's numeric (double) type.
    • Complex - It refers to complex numbers such as 2 + 3j or 5j (Python writes the imaginary unit as j, whereas R writes 2 + 3i). However, this data type is rarely found in data.
  2. Boolean - It stores two values (True and False). It corresponds to R's logical type. There is a tiny difference in letter case: R writes TRUE and FALSE, whereas Python writes True and False.
  3. Strings - It stores text (character) data such as "elephant," "lotus," etc. It is the same as R's character type.
  4. Lists - It is the same as R's list data type. It is capable of storing values of multiple types such as string, integer, Boolean, etc.
  5. Tuples - R has no direct equivalent of a tuple. Think of a tuple as a list whose values can't be changed; i.e., it is immutable.
  6. Dictionary - It stores key : value pairs. In simple words, think of a key as a column name and its value as the column's values.
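
To make the list above concrete, here is a minimal sketch that creates one value of each built-in type and inspects it with type(). The variable names are illustrative only, and the sketch assumes Python 3, where a single int type covers arbitrarily large whole numbers.

```python
# One value of each built-in type discussed above (names are illustrative).
whole = 10                         # int
decimal = 1.23                     # float
z = 2 + 3j                         # complex (the imaginary unit is written j)
flag = True                        # bool
word = "elephant"                  # str
mixed = ["monday", 24, True]       # list: mutable, may mix types
fixed = ("monday", 24, True)       # tuple: like a list, but immutable
scores = {"Sam": 45, "Paul": 89}   # dict: key -> value pairs

for value in (whole, decimal, z, flag, word, mixed, fixed, scores):
    print(type(value).__name__, value)

# Lists can be changed in place; tuples cannot.
mixed[1] = 25
try:
    fixed[1] = 25
except TypeError as err:
    tuple_error = str(err)
    print("tuple is immutable:", tuple_error)
```

In R terms, the failed assignment on the tuple is the behaviour you would get from a list that refuses modification.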

Since R is a statistical computing language, functions for reading and manipulating data are available out of the box. Python, on the other hand, gets its data analysis, manipulation, and visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are:

  1. Numpy - It is used for numerical computing in Python. It provides access to numerous mathematical functions for linear algebra, statistics, etc. It is largely used to create arrays. In R terms, think of an array as an atomic vector (or a matrix when it has more dimensions): it holds values of a single class (numeric, string, or boolean), and mixed inputs are coerced to a common class. It can be unidimensional or multidimensional.
  2. Scipy - It is used for doing scientific computing in python.
  3. Matplotlib - It is used for doing data visualization in python. For R, we use the famous ggplot2 library.
  4. Pandas - It is the powerhouse for doing data manipulation tasks. In R, we use packages like dplyr, data.table etc.
  5. Scikit Learn - It is the powerhouse for implementing machine learning algorithms. In fact, it's the best part about doing machine learning in python. It contains all the functions you would require for model building.

In a way, python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:

  1. Array - This is similar to R's vector (or array). It can be multidimensional. It holds data of a single class; if you supply multiple classes, the coercion effect takes place.
  2. List - This is similar to R's list.
  3. Data Frame - It's a two-dimensional structure comprising several columns of equal length. R has a built-in function data.frame and Python uses the DataFrame function from the pandas library.
  4. Matrix - It's a two (or multi) dimensional structure comprising values of the same class. Think of a matrix as a 2D version of a vector. In R, we use the matrix function. In Python, we use numpy.array, or build a matrix from 1D arrays with the numpy.column_stack function.
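
The coercion effect mentioned above can be seen directly. The following is a small sketch, assuming numpy and pandas are installed; all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype; mixed inputs are coerced,
# much like c(1, "a", TRUE) becomes a character vector in R.
numeric_arr = np.array([1, 2, 3])
mixed_arr = np.array([1, "a", True])
print(numeric_arr.dtype)   # an integer dtype (exact width depends on the platform)
print(mixed_arr.dtype)     # a Unicode string dtype

# A plain Python list keeps each element's own type.
mixed_list = [1, "a", True]
print([type(x).__name__ for x in mixed_list])

# A matrix is simply a two-dimensional array.
mat = np.array([[1, 2, 3], [4, 5, 6]])
print(mat.shape)

# A DataFrame is built from named columns of equal length.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(df.dtypes)
```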

By now, I hope you've understood the basics of data types and data structures in R and Python. Now, let's start working with them!

 

Writing Code in Python vs. R

Let's use the knowledge gained in the previous section and understand its practical implications. But before that, you should install Python through Anaconda, which includes the Jupyter Notebook (previously called the IPython Notebook). You can download it here. You can also download other Python IDEs for data analysis. I hope you already have RStudio installed on your laptop.

1. Creating Lists

In R, lists are created using the base list function:

my_list <- list ('monday','specter',24,TRUE)
typeof(my_list)
[1] "list"

In Python, lists are created using square brackets:

my_list = ['monday','specter',24,True]
type(my_list)
list

You can store the same values using the pandas library. In pandas, a one-dimensional labelled array is called a Series. To load pandas in Python and create one, write:

#importing pandas library as pd notation (you can use any notation)
import pandas as pd
pd_list = pd.Series(my_list)
pd_list

The numbers (0, 1, 2, 3) in the output denote the index. Did you notice anything? Python uses zero-based indexing, whereas indexing in R starts from 1. Let's proceed and understand the difference between list subsetting in R and Python.
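
Before moving on to subsetting, here is a quick sketch of zero-based indexing on a plain Python list. The variable names are illustrative.

```python
my_list = ["monday", "specter", 24, True]

first = my_list[0]         # first element (R would use my_list[[1]])
last = my_list[-1]         # negative indices count from the end
first_two = my_list[0:2]   # slice: start is included, stop is excluded

print(first, last, first_two)
```

Note that a negative index means something different in R, where it drops the element at that position instead of counting from the end.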

#create a list
new_list <- list(roll_number = 1:10, Start_Name = LETTERS[1:10])

Think of a new_list as a train. This train has two coaches named roll_number and Start_Name. In each of these coaches, there are 10 people. So, in list subsetting, we can extract the value of coaches, people sitting in the coaches, etc.

#extract first coach information
new_list[1] #or
new_list['roll_number']

$roll_number
 [1] 1 2 3 4 5 6 7 8 9 10

#extract only people sitting in first coach 
new_list[[1]] #or
new_list$roll_number

#[1] 1 2 3 4 5 6 7 8 9 10

If you check the type of new_list[1], you'll find that it's a list, whereas new_list[[1]] is an integer vector. Similarly, in Python, you can extract list components like this:

#create a new list (list() is needed in Python 3, where range and map are lazy)
new_list = pd.Series({'Roll_number' : list(range(1,10)),
                      'Start_Name' : list(map(chr, range(65,70)))})

Roll_number [1, 2, 3, 4, 5, 6, 7, 8, 9]
Start_Name [A, B, C, D, E]
dtype: object

#extracting first coach
new_list[['Roll_number']] #or
new_list.iloc[[0]]

Roll_number [1, 2, 3, 4, 5, 6, 7, 8, 9]
dtype: object

#extract people sitting in first coach
new_list['Roll_number']  #or
new_list.Roll_number
[1, 2, 3, 4, 5, 6, 7, 8, 9]

There's a confusing difference in list indexing between R and Python. As you may have noticed, [[ ]] extracts the elements of a coach in R, whereas [[ ]] extracts the coach itself in Python.
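
The difference is easiest to see by checking what each form returns. This is a minimal sketch, assuming pandas is installed; it rebuilds the same Series so it can run on its own.

```python
import pandas as pd

new_list = pd.Series({
    "Roll_number": list(range(1, 10)),
    "Start_Name": [chr(code) for code in range(65, 70)],
})

coach = new_list[["Roll_number"]]   # list of labels -> a Series with one entry
people = new_list["Roll_number"]    # single label -> the stored value itself

print(type(coach).__name__)    # the coach is still a Series
print(type(people).__name__)   # the people are the plain Python list inside it
```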

 

2. Matrix

A matrix is a 2D structure created by a combination of vectors (or arrays). Generally, a matrix contains elements of the same class. If you mix elements from different classes (string, boolean, numeric, etc.), it will still work, but the values are coerced to a common type. The method of subsetting a matrix is quite similar in both languages except for the index numbers. To reiterate, Python indexing starts at 0 and R indexing starts at 1.

In R, a matrix can be created as:

my_mat <- matrix(1:10,nrow = 5)
my_mat


Subsetting a matrix is really easy.

#to select first row
my_mat[1,]

#to select second column
my_mat[,2]

In Python, we'll take the help of numpy arrays to create a matrix. Therefore, first we'll load the numpy library.

import numpy as np
a=np.array(range(10,15))
b=np.array(range(20,25))
c=np.array(range(30,35))
my_mat = np.column_stack([a,b,c])

#to select first row
my_mat[0,]

#to select second column
my_mat[:,1]
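
For a closer match to the R example, the sketch below builds the same 5 x 2 matrix as matrix(1:10, nrow = 5). It assumes numpy is installed; order="F" asks NumPy to fill column by column, which is R's default.

```python
import numpy as np

# Fill column by column, like R's matrix(1:10, nrow = 5)
my_mat = np.arange(1, 11).reshape(5, 2, order="F")
print(my_mat)

first_row = my_mat[0, :]    # R: my_mat[1, ]
second_col = my_mat[:, 1]   # R: my_mat[, 2]
print(first_row, second_col)
```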

 

3. Data Frames

Data frames provide a much-needed skeleton for loosely collected data from multiple sources. A data frame is a spreadsheet-like structure which gives a data scientist a nice picture of how the data set looks. In R, we can create a data frame using the data.frame() function:

data_set <- data.frame(Name = c("Sam","Paul","Tracy","Peter"),
Hair_Colour = c("Brown","White","Black","Black"),
Score = c(45,89,34,39))

So, we know that a data frame is created from a collection of vectors (or lists). To create a data frame in Python, we'll create a dictionary (a collection of lists) and pass it to the DataFrame function from the pandas library.

data_set = pd.DataFrame({'Name' : ["Sam","Paul","Tracy","Peter"],
'Hair_Colour' : ["Brown","White","Black","Black"],
'Score' : [45,89,34,39]})

Now, let's look at the most crucial aspect of working with dataframe, i.e., subsetting. In fact, most of the data manipulation revolves around slicing and dicing a dataframe from every possible angle. Let's look at the tasks one by one:

#select first column in R
data_set$Name # or
data_set[["Name"]] #or
data_set[1] #returns a one-column data frame

#select first column in Python
data_set['Name'] #or
data_set.Name #or
data_set.iloc[:,0]

#select multiple columns in R
data_set[c('Name','Hair_Colour')] #or
data_set[,c('Name','Hair_Colour')]

#select multiple columns in Python
data_set[['Name','Hair_Colour']] #or
data_set.loc[:,['Name','Hair_Colour']]

.loc is used for label-based indexing, and .iloc for position-based indexing.

So far, we've understood the skeleton of data types, structures, and formats in R and Python. Let's now take up a data set and look at other aspects of exploring data in Python.
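
As a small illustration of label-based versus position-based selection, here is a sketch that reuses the data frame from above. It assumes pandas is installed; the names by_label, by_position, and high are made up for this example.

```python
import pandas as pd

data_set = pd.DataFrame({
    "Name": ["Sam", "Paul", "Tracy", "Peter"],
    "Hair_Colour": ["Brown", "White", "Black", "Black"],
    "Score": [45, 89, 34, 39],
})

by_label = data_set.loc[:, ["Name", "Score"]]   # columns chosen by name
by_position = data_set.iloc[:, [0, 2]]          # the same columns chosen by position

# Rows chosen by a condition, one column chosen by label
high = data_set.loc[data_set["Score"] > 40, "Name"]

# Label slices include the end point; position slices do not
rows_loc = data_set.loc[0:1]     # rows labelled 0 and 1
rows_iloc = data_set.iloc[0:1]   # only the row at position 0

print(by_label.equals(by_position), high.tolist(), len(rows_loc), len(rows_iloc))
```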

 

Practicing Python on a Data Set

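The wonderful scikit-learn library contains a built-in repository of data sets. For practice, we'll use the Boston housing data set, a popular data set in data analysis. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below needs an older scikit-learn release.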

#import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston

#store in a variable
boston = load_boston()

The variable boston is a dictionary-like object (a scikit-learn Bunch). Just to refresh, a dictionary is a collection of key-value pairs. Let's look at the keys:

boston.keys()
['data', 'feature_names', 'DESCR', 'target']

Now we know our required data set resides in the key data. We also see that there is a separate key for feature names, so I suppose the data itself will not have column names attached. Let's check the column names we are going to deal with.

print(boston['feature_names'])
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

Can you understand these names? Me neither. Now, let's check the data description and understand the significance of each variable.

print(boston['DESCR'])

This data set has 506 rows and 13 columns. It comprises various characteristics which help in determining the prices of houses in Boston (U.S.). Now, let's create the dataframe and start exploring.

bos_data = pd.DataFrame(boston['data'])

Similar to R, python also has a head() function to peek into data:

bos_data.head()

The output shows that data set has no column names (as anticipated above). Attributing column names to a dataframe is easy.

bos_data.columns = boston['feature_names']
bos_data.head()

Just like R's dim() function, pandas has the shape attribute (bos_data.shape, written without parentheses) to check the dimensions of the data set. To get the statistical summary of the data set, we can write:

bos_data.describe()

It shows us column-wise statistical summary of the data. Let's quickly explore other aspects of this data.
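
If you cannot load the Boston data, the same calls can be tried on any data frame. The sketch below uses a tiny made-up frame (the values are invented, only the column names are borrowed) and assumes pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Boston data set
toy = pd.DataFrame({"CRIM": [0.02, 0.07, 0.03, 0.10], "CHAS": [0, 0, 1, 0]})

n_rows, n_cols = toy.shape   # an attribute, so no parentheses (R: dim(toy))
summary = toy.describe()     # count, mean, std, min, quartiles, max per column

print(n_rows, n_cols)
print(summary)
```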

#get first 10 rows
bos_data.iloc[:10]

#select first 5 columns
bos_data.loc[:,'CRIM':'NOX'] #or
bos_data.iloc[:,:5]

#filter rows based on a condition
bos_data.query("CRIM > 0.05 & CHAS == 0")

#sample the data set
bos_data.sample(n=10)

#sort values - default is ascending
bos_data.sort_values(['CRIM']).head()
#sort values in descending order
bos_data.sort_values(['CRIM'],ascending=False).head()

#rename a column (returns a new data frame; assign the result to keep it)
bos_data.rename(columns={'CRIM' : 'CRIM_NEW'})

#find mean of selected columns
bos_data[['ZN','RM']].mean()

#transform a numeric data into categorical
bos_data['ZN_Cat'] = pd.cut(bos_data['ZN'],bins=5,labels=['a','b','c','d','e'])

#calculate the mean age for each ZN_Cat category
bos_data.groupby('ZN_Cat')['AGE'].mean()
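
To see what pd.cut, groupby, and query do on numbers you can check by hand, here is a sketch on a small made-up frame. The values and the labels "low" and "high" are invented for illustration, and pandas is assumed to be installed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Boston data set
toy = pd.DataFrame({
    "ZN": [0.0, 10.0, 20.0, 80.0, 100.0, 95.0],
    "AGE": [60.0, 70.0, 50.0, 30.0, 20.0, 40.0],
})

# query() and a boolean mask are two ways to filter rows
by_query = toy.query("ZN > 50 & AGE < 35")
by_mask = toy[(toy["ZN"] > 50) & (toy["AGE"] < 35)]

# bins=2 splits the observed range of ZN into two equal-width intervals
toy["ZN_Cat"] = pd.cut(toy["ZN"], bins=2, labels=["low", "high"])

# Mean AGE per category; observed=True keeps only categories that occur
mean_age = toy.groupby("ZN_Cat", observed=True)["AGE"].mean()
print(mean_age)
```

For R users, the last line plays the role of a dplyr group_by() followed by summarise().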

In addition, pandas allows us to create pivot tables. Just like in MS Excel or any other spreadsheet software, you can create a pivot table and understand the data more closely. Unfortunately, creating a pivot table in R is quite a complex process. In Python, a pivot table requires row names, column names, and the value to be calculated. If we don't pass any column name, the result is just like what you would get using the groupby function. Therefore, let's create another categorical variable.

#create a new categorical variable
bos_data['NEW_AGE'] = pd.cut(bos_data['AGE'],bins=3,labels=['Young','Old','Very_Old'])

#create a pivot table calculating mean DIS for each ZN_Cat and NEW_AGE combination
bos_data.pivot_table(values='DIS',index='ZN_Cat',columns= 'NEW_AGE',aggfunc='mean')
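
The relationship between pivot_table and groupby described above can be checked on a small example. This is a sketch on invented data (the zone and age_group columns are hypothetical), assuming pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Boston data set
toy = pd.DataFrame({
    "zone": ["a", "a", "b", "b", "a", "b"],
    "age_group": ["Young", "Old", "Young", "Old", "Young", "Old"],
    "DIS": [2.0, 4.0, 6.0, 8.0, 4.0, 10.0],
})

# Rows = zone, columns = age_group, cells = mean DIS
pivot = toy.pivot_table(values="DIS", index="zone", columns="age_group", aggfunc="mean")

# Without a columns argument, the result matches a plain groupby
no_cols = toy.pivot_table(values="DIS", index="zone", aggfunc="mean")
grouped = toy.groupby("zone")["DIS"].mean()

print(pivot)
print(no_cols["DIS"].tolist() == grouped.tolist())
```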

This was just the tip of the iceberg. Where to go next? Just as we used the Boston data, you should now work with the iris data. It is available in the sklearn.datasets module. Try to explore it in depth. Remember, the more you practice and the more time you spend coding, the better you'll become.
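
As a possible starting point, the sketch below loads iris straight into a pandas data frame. It assumes scikit-learn 0.23 or newer (for the as_frame argument) and pandas are installed; what you explore afterwards is up to you.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)   # as_frame needs scikit-learn 0.23 or newer
iris_data = iris.frame            # four feature columns plus a "target" column

print(iris_data.shape)
print(iris_data.head())

# Mean of every feature for each of the three species codes
print(iris_data.groupby("target").mean())
```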

 

Summary

While coding in Python, I realized that there is not much difference in the amount of code you write, although some functions are shorter in R than in Python. However, R has really awesome packages which handle big data quite conveniently. Do let me know if you wish to learn about them!

Overall, learning both languages will give you enough confidence to handle any type of data set. In fact, the best part about learning Python is the comprehensive documentation available for the numpy, pandas, and scikit-learn libraries, which is sufficient to help you overcome all initial obstacles.

In this article, we just touched on the basics of Python. There's a long way to go. Next week, we'll learn about data manipulation in Python in detail. After that, we'll look into data visualization and the powerful machine learning library in Python.

Do share your experience, suggestions, and questions below while practicing this tutorial!

About the Author

Making an effort to help people understand Machine Learning. I believe your educational background doesn't stop you from pursuing ML & Data Science. Earned a Masters in F/M; a self-taught data science professional. Previously worked at Analytics Vidhya. Now solving ML & Growth challenges at HackerEarth!