
The Data Daily

Web Scraping and Data Analysis using Selenium Webdriver and Python


We are all surrounded by data, and it reveals a lot that informs our decisions and suggests the next steps. Data is collected from different sources such as the web, databases and log files. It is then thoroughly cleaned and reshaped, and explored to uncover the hidden patterns and trends that are essential for business decision making. Extracting data from the web is easy when a website offers an API, but what if it doesn't? In that case, web scraping is an excellent way to extract unstructured data from the web and put it into a structured format such as Excel, CSV or a database.

Just provide the address (an XPath or any other locator) of the data to be extracted, and Selenium WebDriver extracts it from the page with a single API call: find_element_by_xpath, or find_element(By.XPATH, ...) in Selenium 4.3 and later, where the older helper methods were removed. See how easy it is?
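As a rough illustration of that single call, here is a minimal sketch using the Selenium 4 style API. The helper names collect_text and scrape_movie_names, the url and the name_xpath arguments are all hypothetical placeholders, not taken from the original walkthrough, and the sketch assumes Chrome and a matching driver are available.

```python
def collect_text(driver, locator, by="xpath"):
    """Return the visible text of every element matching the locator.

    `by` is a Selenium locator strategy; the string "xpath" is the
    value of selenium.webdriver.common.by.By.XPATH.
    """
    return [element.text for element in driver.find_elements(by, locator)]


def scrape_movie_names(url, name_xpath):
    """Open `url` in Chrome and return the text matched by `name_xpath`.

    Both arguments are placeholders supplied by the caller.
    """
    # Imported here so collect_text can be reused without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return collect_text(driver, name_xpath)
    finally:
        driver.quit()  # always release the browser session
```

Calling collect_text once per field (name, votes, director and so on) would give one list per field.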

The data is extracted from the page with the help of WebDriver and stored in lists, so an individual list is created for each field being scraped.

A sample of the data for Movie Name, Votes and Director is displayed here; the rest of the data is stored in individual Python lists as well.

There is no visible relationship between these lists: if someone has to pick the release year and director for a given movie, it is difficult to get them from separate lists. So let's put the data in a structured, more meaningful format that makes sense to anyone looking at it, by binding the data in a Python dictionary.
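One possible way to bind the lists is sketched below. The values are made-up placeholders rather than scraped results, and the column names and the record_at helper are assumptions chosen for illustration.

```python
# Hypothetical placeholder values standing in for the scraped lists.
movie_names = ["Movie A", "Movie B", "Movie C"]
release_years = ["2001", "2002", "2003"]
directors = ["Director: Person A", "Director: Person B", "Director: Person C"]
ratings = ["8.1", "7.4", "6.9"]
votes = ["1,234", "56,789", "321"]
runtimes = ["120", None, "95"]  # None marks a value missing from the page

# One key per field; position i in every list describes the same movie.
movie_data = {
    "MovieName": movie_names,
    "ReleaseYear": release_years,
    "Director": directors,
    "Rating": ratings,
    "Votes": votes,
    "RunTime": runtimes,
}


def record_at(data, index):
    """Return all fields of the movie stored at position `index`."""
    return {column: values[index] for column, values in data.items()}


print(record_at(movie_data, 1))
```

With this layout, the release year and director of a movie can be read from a single record instead of being looked up in separate lists.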

The data in a Python dictionary (key: value pairs) looks good and makes more sense now. However, if you look carefully, it is not yet in the correct format for data manipulation: the Votes values contain commas, Director contains the unwanted text "Director:", and Ratings and Runtime are not of the correct data type. Let's clean this data to bring it into shape for analysis.
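The cleaning steps described above could look roughly like this. The function names and sample values are hypothetical, and the sketch assumes ratings and runtimes arrive as plain numeric strings, with None where the page had no value.

```python
def clean_votes(raw):
    """Drop thousands separators and convert to int, e.g. "1,234" to 1234."""
    return int(raw.replace(",", ""))


def clean_director(raw):
    """Remove the unwanted "Director:" label and surrounding whitespace."""
    return raw.replace("Director:", "").strip()


def to_float(raw):
    """Convert a numeric string to float; keep missing values as None."""
    return float(raw) if raw not in (None, "") else None


# Hypothetical raw values in the shape described in the text.
raw_data = {
    "Votes": ["1,234", "56,789"],
    "Director": ["Director: Person A", "Director: Person B"],
    "Rating": ["8.1", "7.4"],
    "RunTime": ["120", None],
}

clean_data = {
    "Votes": [clean_votes(v) for v in raw_data["Votes"]],
    "Director": [clean_director(d) for d in raw_data["Director"]],
    "Rating": [to_float(r) for r in raw_data["Rating"]],
    "RunTime": [to_float(t) for t in raw_data["RunTime"]],
}
print(clean_data)
```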

The entire movie data set is stored in a Python dictionary, but for further analysis it needs to be loaded into a Pandas DataFrame so that we can use Pandas' rich data structures and built-in functions. Import the data into a DataFrame.
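A minimal sketch of that import, assuming the cleaned dictionary maps column names to equal-length lists. The values shown are placeholders, not the scraped data.

```python
import pandas as pd

# Hypothetical cleaned data; None marks a runtime missing from the page.
movie_data = {
    "MovieName": ["Movie A", "Movie B", "Movie C"],
    "Director": ["Person A", "Person B", "Person C"],
    "Rating": [8.1, 7.4, 6.9],
    "Votes": [1234, 56789, 321],
    "RunTime": [120.0, None, 95.0],
}

# Each dictionary key becomes a column, each list position a row.
df = pd.DataFrame(movie_data)

print(df.dtypes)  # check that numeric columns were inferred as numbers
print(df.head())
```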

There are some missing values in this data, but Pandas provides excellent features for handling missing and null values. For these 3 movies, the RunTime data is not available on the page, so for further analysis we will replace the missing values with the mean of the available values in the RunTime column.
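The mean replacement can be sketched as follows. The small DataFrame here is a made-up stand-in for the real one, with a single missing runtime.

```python
import pandas as pd

# Hypothetical stand-in for the movie DataFrame.
df = pd.DataFrame({
    "MovieName": ["Movie A", "Movie B", "Movie C"],
    "RunTime": [120.0, None, 100.0],
})

# Series.mean() skips missing values by default, so this is the
# mean of the runtimes that are actually available.
mean_runtime = df["RunTime"].mean()

# Replace only the missing entries with that mean.
df["RunTime"] = df["RunTime"].fillna(mean_runtime)
print(df)
```

Mean imputation keeps every row available for later analysis, at the cost of understating the variance of the column.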
