The Data Daily

The Five Essential Skills All Data Scientists Should Master

Most data science problems begin with finding a dataset relevant to the task. Often, however, no ready-made dataset for your problem exists, and you must build your own. Being familiar with image crawling is a necessary skill for those planning on working with computer vision. When making your first dataset, you will also realize that this is harder than you think.

Exercise 1: Build a Python script to automatically fetch N images for each of Q queries using Google Images. Save them in one sub-folder per query. Use N = 150 and the following ten queries: cats, dogs, lizards, fish, birds, sharks, frogs, cows, dinosaurs, and monsters.

Exercise 2: Manually inspect each folder and remove all images that do not adequately match the query, such as memes and drawings. Count how many photos were deleted for each query and the reasons for deletion. Did any of the queries overlap, such as cat pictures in the dog’s folder?

Exercise 3: Adapt the script from Exercise 1 to download more images for each query, to compensate for the ones you deleted in Exercise 2. Run it with N = 1000.

Exercise 4: As in Exercise 2, manually inspect each folder, removing overlaps and unwanted images. Is the frequency of bogus images higher or lower this time? How long would it take to build a dataset of ten thousand images per class?

As with image mining, we often have to gather a sizeable amount of text ourselves. While general-purpose datasets exist, we often deal with specific vocabularies or less common languages. Besides, we frequently have to do the text cleaning ourselves, which is a very error-prone task.

To hone our text mining skills, we will look at scraping text from Wikipedia. Text is noisy; a couple of articles might hold more than a thousand unique words. Some may be misspelled. Some will be hyphenated. Cleaning images is natural; cleaning text not so much.

Exercise 1: Create a script that fetches N articles from Wikipedia and extracts the title and text of each. There are many ways to do this: REST APIs, pip packages, Beautiful Soup, and so on. To get random pages, you can use the “random page” link. Try downloading a hundred articles.
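
One possible starting point, using only the standard library, is sketched below. It assumes the MediaWiki action API with the TextExtracts `prop=extracts` module; the parameter names are written from memory and should be checked against the API documentation, and the `User-Agent` string and function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_random_query():
    """Build the URL for a request that returns one random article as plain text."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "random",
        "grnnamespace": 0,   # main (article) namespace only
        "grnlimit": 1,       # one random page per request
        "prop": "extracts",
        "explaintext": 1,    # plain text instead of HTML
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_pages(payload):
    """Turn a decoded API response into a list of (title, text) pairs."""
    pages = payload.get("query", {}).get("pages", {})
    return [(page["title"], page.get("extract", "")) for page in pages.values()]


def fetch_random_articles(n, user_agent="data-skills-exercise/0.1 (hypothetical)"):
    """Fetch n random articles, one HTTP request per article."""
    articles = []
    while len(articles) < n:
        request = urllib.request.Request(
            build_random_query(), headers={"User-Agent": user_agent}
        )
        with urllib.request.urlopen(request) as response:
            articles.extend(parse_pages(json.load(response)))
    return articles[:n]
```

Calling `fetch_random_articles(100)` needs network access and makes a hundred requests, so it is worth adding a short pause between requests and caching the results on disk.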

Exercise 2: Create a script that combines all articles into a single string, counts the frequency of each unique word, sorts them by frequency, and prints everything. Which are the hundred least/most common words in the dataset? How many words appear more than a hundred times?
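
As a rough sketch of the counting step, `collections.Counter` does most of the work. The two sample sentences are made up for illustration, and the regular expression is a deliberately naive tokenizer that only keeps ASCII letters and apostrophes.

```python
import re
from collections import Counter


def word_frequencies(articles):
    """Count words across all article texts, ignoring case."""
    corpus = " ".join(articles).lower()
    words = re.findall(r"[a-z']+", corpus)  # naive ASCII-only tokenizer
    return Counter(words)


# Made-up sample texts standing in for the downloaded articles.
articles = ["The cat sat on the mat.", "The dog chased the cat."]
counts = word_frequencies(articles)

# most_common() returns (word, count) pairs sorted by decreasing count.
for word, count in counts.most_common():
    print(word, count)
```

With a real corpus, `counts.most_common(100)` gives the hundred most common words, and `counts.most_common()[:-101:-1]` gives the hundred least common ones.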

Exercise 3: Using a plotting library, create a line plot of the word frequencies. What is the shape of the obtained curve? Repeat the plot using a logarithmic scale. How does the curve look now?

Exercise 4: Manually compute the TF-IDF of each word for each document. This is one of the most widely used metrics in Natural Language Processing for determining which words are relevant to each document. Then sort the words by their TF-IDF values and print the top 10. Compare them to the article’s title. Are they relevant words?
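
TF-IDF has several variants. The sketch below assumes one of the simplest: term frequency is the word count divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the word. The function names and toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_words(score, k=10):
    """Return the k words with the highest score."""
    return sorted(score, key=score.get, reverse=True)[:k]


# Toy documents, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)
print(top_words(scores[2], k=2))
```

Note that a word appearing in every document, such as "the" here, gets a score of zero under this variant, which is exactly why the metric highlights document-specific words.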

Exercise 5: To clean a text corpus, you must convert everything to lower case, remove stop-words, remove numbers, fix typos, undo contractions, and normalize each word (stemming and lemmatization). This procedure effectively removes all articles, prepositions, gender markings, and plural markings, and converts all verbs to the infinitive form. Finally, the least common 10 to 20% of words are usually pruned. This entire procedure is one of the most critical tasks for doing good NLP. Look for tutorials on how to perform this. NLTK is an excellent tool with plenty of tutorials.
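
To give a feeling for the shape of such a pipeline, here is a minimal pure-Python sketch. It covers only lower-casing, contractions, number removal, stop-words, and pruning; typo fixing, stemming, and lemmatization are left out. The stop-word and contraction lists are tiny placeholders, and pruning uses a minimum count rather than a percentage, all of which are simplifying assumptions.

```python
import re
from collections import Counter

# Tiny illustrative lists; a real pipeline would use far larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "not", "of", "on", "in", "and", "to", "it"}
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "it's": "it is"}


def clean(text):
    """Lower-case, undo contractions, drop numbers and punctuation, remove stop-words."""
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)  # letters only, so digits vanish
    return [token for token in tokens if token not in STOP_WORDS]


def prune_rare(docs, min_count=2):
    """Drop words that occur fewer than min_count times in the whole corpus."""
    counts = Counter(token for doc in docs for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in docs]


docs = [clean("It's 2021 and the cat isn't on the mat."), clean("The cat sat.")]
print(docs)              # [['cat', 'mat'], ['cat', 'sat']]
print(prune_rare(docs))  # [['cat'], ['cat']]
```

Treating "not" as a stop-word throws away negation, which is one example of how a cleaning choice can hurt a later task such as sentiment analysis.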

Exercise 6: If you survived cleaning the corpus, repeat Exercise 4. Did the most relevant words for each document change?

Exercise 7: Some words take on a different meaning when they appear together. “The”, “United”, and “States” each have their own meaning in isolation, but together they mean “The United States”. Find such bigrams and trigrams. At a high level, this can be accomplished by comparing the probability that two words would appear together by chance against the frequency with which they actually appear together in the dataset.
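
One common way to make this comparison concrete is pointwise mutual information (PMI): the logarithm of the observed pair probability divided by the product of the individual word probabilities. The sketch below is one possible implementation for bigrams, not the only way to solve the exercise; the `min_count` threshold and the toy sentence are assumptions.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        p_chance = (unigrams[first] / n_unigrams) * (unigrams[second] / n_unigrams)
        scores[(first, second)] = math.log(p_pair / p_chance)
    return scores


tokens = "new york is big and new york is old and big is not old".split()
scores = score_bigrams(tokens)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

A positive score means the pair occurs more often than chance would predict. Trigrams can be handled the same way by zipping three shifted copies of the token list.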

Sometimes we do find a dataset we can use. However, apart from the mainstream ones, getting to the “X_train, y_train, X_test, y_test” stage might not be as trivial as you think. In practice, data scientists have to tame data beasts on a daily basis. Here are some datasets that need some manual work to get into shape for deep learning.

Exercise 1: Process the original CIFAR-10 and CIFAR-100 datasets. This includes downloading them, opening the files with pickle, reshaping the loaded arrays, and normalizing the images to floats in the [0, 1] range.
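
The reshaping step is the part that most often goes wrong, so here is a small sketch of it. It assumes the Python version of CIFAR-10, where each batch file is a pickled dictionary with byte-string keys and every image is a row of 3072 bytes (1024 red, then 1024 green, then 1024 blue values). NumPy is an added dependency, and the function names are illustrative.

```python
import pickle

import numpy as np


def load_batch(path):
    """Read one CIFAR-10 batch file (e.g. data_batch_1) into arrays."""
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], np.array(batch[b"labels"])


def to_images(flat):
    """Turn rows of 3072 bytes into float images of shape (N, 32, 32, 3)."""
    images = flat.reshape(-1, 3, 32, 32)      # (N, channel, row, column)
    images = images.transpose(0, 2, 3, 1)     # (N, row, column, channel)
    return images.astype(np.float32) / 255.0  # scale to the [0, 1] range


# Synthetic stand-in for a real batch, so the sketch runs without the download.
fake = np.zeros((2, 3072), dtype=np.uint8)
fake[0, :1024] = 255  # first image: red channel fully on
x = to_images(fake)
print(x.shape, x[0, 0, 0])
```

For CIFAR-100 the same layout applies to the image data, but the labels are stored under the `b"fine_labels"` and `b"coarse_labels"` keys instead of `b"labels"`.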

Exercise 2: Process the original Flower-17 dataset. Perform the same tasks as in Exercise 1. This time, there is no label information in the download; you have to figure out how to get it.

Exercise 3: Process the original Flower-102 dataset. This time there is a Matlab file you can use to figure out how to open and extract the relevant data. The other option is using this page to get the instance counts per class.

Exercise 4: Previously, we manually downloaded some Wikipedia articles to do some NLP tasks. This time, your mission is to download a dump of the entire English Wikipedia. In other words, all the text of all articles. Luckily, Wikipedia itself provides many ways to download its entire catalog. Do not try to do this article by article. Your job is to find a dump, download, read, and parse it. This means doing the entire cleaning procedure.

Exercise 5: Sometimes, we have to enhance a dataset. CIFAR-100 has a hundred categories, but it only has image data. Your task is to use your Wikipedia scraping skills to find 50 relevant words for each CIFAR-100 class. Measure how much these 50 words overlap among categories and sub-categories.

For most proofs-of-concept, pre-trained models suffice. TensorFlow Hub is an excellent place to get a feeling for what is out there to be used. Many models exist for image classification, object detection, sentiment analysis, text processing, and so on. Knowing how to apply, fine-tune, and retrain some of the most famous models is a necessary skill to perform data tasks quickly.

These exercises are meant to get you going with some of the most commonly used models for some of the most frequently seen tasks.

Exercise 1: Get a pre-trained version of ResNet50 for the ImageNet dataset. Its output is a 1000-element vector holding its confidence for each of the thousand classes used in the ImageNet challenge. Your job is to create a script that takes an image and prints its class using the ResNet model. You can also have it print the five most probable classes by taking the five highest output values.
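
Loading the model depends on the framework you pick, but the last step, turning the output vector into the most probable classes, is framework-independent. A minimal sketch follows; the six-class vector and class names are made up to stand in for the real 1000-element output.

```python
import heapq


def top_k(confidences, class_names, k=5):
    """Return the k (class name, confidence) pairs with the highest confidence."""
    best = heapq.nlargest(k, range(len(confidences)), key=confidences.__getitem__)
    return [(class_names[i], confidences[i]) for i in best]


# Hypothetical 6-class output standing in for the 1000-element ImageNet vector.
names = ["cat", "dog", "lizard", "fish", "bird", "shark"]
output = [0.05, 0.60, 0.02, 0.08, 0.20, 0.05]
print(top_k(output, names, k=3))  # [('dog', 0.6), ('bird', 0.2), ('fish', 0.08)]
```

The predicted class is simply the first entry of the returned list.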

Exercise 2: Redo Exercise 1 for the MobileNetV3 model. Download a thousand images and measure how long it takes to process them with each model. Is the accuracy significantly affected? Which model is better?

Exercise 3: Get the YOLO and Faster R-CNN models up and running. You can use the COCO 2017 validation images, a YouTube video, or your webcam as a data stream for the detection. Find out which types of objects they support. Which is faster, which is more accurate, and by how much?

Exercise 4: For text, most models assign numerical values to each word to represent their meanings. This is known as word embedding. Use the BERT model to perform sentiment analysis with the IMDb Reviews Dataset. For this, you will need to fine-tune BERT. How well does it fare?

The importance of data visualization cannot be overstated. Without it, we are blind to all the opportunities our data can offer. We can use it for tasks such as model interpretation, error analysis, and data understanding. Of all the skills covered in this article, visualization is the one that requires the most ingenious and imaginative ideas. As the saying goes, a picture is worth a thousand words. In the following, I present some problems for which you have to figure out one or more visualizations that answer the posed questions.

Exercise 1: Consider the IMDb Reviews dataset. Can you think of a good way to visualize which words are representative of the positive and negative classes? Does pruning the most frequent words improve your visualization? If you trained BERT to solve this problem, try visualizing the reviews it got wrong. Is there anything special about these reviews?

Exercise 2: Download the CelebA dataset. Plot the frequency of each attribute. Which are the most and least frequent? How can you quickly summarize what each class means to a non-English speaker? Some tags are very subjective. Are all faces under the “attractive” category appealing to you? Are some images outside this category attractive?

Exercise 3: Fine-tune ResNet50 and MobileNetV3 to solve the CIFAR-100 dataset and save their predictions for all test images. How often do both models agree/disagree in their predictions? On which classes do they agree/disagree the most? Which pictures do they both get wrong?
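
Before plotting anything, it helps to reduce the saved predictions to a few numbers. The sketch below shows one way to do that with plain lists of class indices; the function name and the tiny label lists are invented for illustration.

```python
from collections import Counter


def compare_predictions(labels, preds_a, preds_b):
    """Summarize where two models agree, disagree, and are both wrong."""
    agree = [a == b for a, b in zip(preds_a, preds_b)]
    both_wrong = [
        i for i, (y, a, b) in enumerate(zip(labels, preds_a, preds_b))
        if a != y and b != y
    ]
    # Count disagreements by the true class of the image.
    disagreements_per_class = Counter(
        y for y, same in zip(labels, agree) if not same
    )
    return {
        "agreement_rate": sum(agree) / len(agree),
        "both_wrong": both_wrong,
        "disagreements_per_class": disagreements_per_class,
    }


# Invented labels and predictions for six test images.
labels = [0, 0, 1, 1, 2, 2]
resnet = [0, 1, 1, 1, 2, 0]
mobile = [0, 1, 1, 2, 2, 1]
report = compare_predictions(labels, resnet, mobile)
print(report)
```

The `both_wrong` indices are a natural starting point for a grid of images, and `disagreements_per_class` maps directly onto a bar chart.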

Exercise 4: Download the Flower-102 dataset. What are the primary colors of each kind? Are there multi-colored flowers? Flowers can be grouped by their colors. Can they also be meaningfully grouped by name?

Exercise 5: Download the VGG network trained on the ImageNet problem and apply it to some image. Inspect its first convolution layer. Which patterns does each filter recognize? Can you answer this for the second and third layers?
