
Machine learning is the study of systems that imitate human learning with the help of algorithms and data processing. It is a branch of Artificial Intelligence in which you develop systems that improve themselves by learning from data. The system uses the algorithms you provide to analyze the data fed to it and predict likely outcomes.
AI-based chatbots are a practical example of the technology. ML allows a chatbot to check a customer’s past data, analyze their needs, and predict the answer they want to hear. But of course, that’s not its only practical application. In fact, ML now finds applications in almost every field, including healthcare, security, and business. Even large companies such as Facebook, Google, and Amazon are adopting machine learning.
And while the growing use of this technology widens its scope, it also increases the need for data to make predictions. Data is the basic building block of the technology and directly affects both its performance and accuracy. Thus you have to make sure everything you feed into the ML system is clean and accurate, which is not easy. You have to create a dataset with thousands and thousands of samples and even add variations for each of these. For instance, if you are creating a face recognition application, you will have to add images of each face you want the system to recognize. Not to mention the different angles, colors, and lighting conditions for each of them. That would be hectic and costly, wouldn’t it?
Well, luckily you don’t have to do it yourself. There are pre-built collections called datasets that you can use for this task. And here’s all you need to know about them.
What is a Machine Learning Dataset?
A dataset, in simple words, is a collection of data related to a specific subject or task in Machine Learning. For instance, let’s say you collect voice recordings of 50 users, covering the different tones of their voices in different environments. The collection you build is called a dataset.
Top Public Machine Learning Datasets
With many Machine Learning enthusiasts diving into the field, the need for free datasets is increasing. We have organized a list of the top free datasets that you can consider for your work. The datasets are divided into sections according to their type; feel free to jump directly to the one you need.
Facial Recognition Technology
Every person’s face has distinct features and attributes. Facial recognition technology uses these attributes to identify a person.
The technology relies on databases that can include both manually collected files and public libraries. It extracts features from a face and matches them against those stored in the database to identify the person. You can use this kind of biometrics to recognize a person in photographs, in video, and even in real time.
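The matching step can be pictured with a small sketch. It assumes each face has already been turned into a numeric feature vector by some model; the three-dimensional vectors, the names, the `identify` helper, and the 0.8 threshold below are all made up for illustration and do not come from any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(query, database, threshold=0.8):
    """Return the best-matching name, or None if nothing is similar enough."""
    best_name, best_score = None, threshold
    for name, reference in database.items():
        score = cosine_similarity(query, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical feature vectors; real systems use hundreds of dimensions.
database = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}

print(identify([0.88, 0.15, 0.28], database))  # close to alice's vector
print(identify([0.0, 0.0, 1.0], database))     # no confident match
```

The threshold is what lets the system say "unknown person" instead of always returning the nearest face, and a sensible value depends on the model and the data.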
There are many open face recognition datasets that you can use to get started with facial recognition. Here are a few of the best options.
- CelebFaces Attributes Dataset (CelebA) is one of the largest datasets available for face detection and verification. Each image is annotated with 5 facial landmarks and 40 different attributes. It offers a database of around 10K people with more than 200K face images, covering a wide range of poses and backgrounds.
- If you are just starting with facial recognition and are fine with a relatively small dataset, CelebAMask-HQ is a great option. This dataset comes with 30,000 high-resolution face images annotated with masks for 19 classes, covering skin, nose, eyebrows, ears, lips, and many other components. You can use it for face parsing, face recognition, and most of all, GANs for face generation and editing.
- CelebA-Spoof, as the name suggests, is a CelebA-based dataset. It’s a large-scale set that comes with 10,177 subjects and around 43 different attributes. You get 625,537 images that cover most of the expressions, skin types, and other data about each subject. Plus, you get data about many different environments, illumination conditions, and accessories, which come in handy for precise facial recognition.
- Google’s facial expression comparison dataset is one of the best options for working with expressions. It offers 156,000 facial images with around 500,000 image triplets, each inspected by multiple human raters. The dataset is especially good for expression recognition and expression-based image retrieval. It is also an excellent choice for emotion classification, expression synthesis, and more. You can get access to the data by filling in a short Google form.
- IMDb-Face is a large-scale facial recognition dataset built from images on the IMDb website. The dataset gives you access to over 1.7 million faces of 59K identities, manually cleaned from 2.0 million raw images, which means cleaner data for your work.
- The Facial Deformable Models of Animals (FDMA) project proposes an algorithm that can cope with the larger variability often seen in the faces of animals, challenging the state of the art in human facial detection and tracking. The algorithms from the FDMA project can track variations caused by changes in facial expression, pose, or illumination.
Object Detection
Object detection, as the name suggests, is the technique in which software identifies and locates particular objects in an image. The datasets listed below will save you time:
- DOTA (Dataset of Object deTection in Aerial images) is one of the largest datasets for detecting objects in aerial images. It is highly useful for developing and evaluating object detectors for images captured from high altitude.
- COCO (Common Objects in Context) is one of the most renowned image datasets for evaluating state-of-the-art computer vision models. COCO supports keypoint detection, panoptic segmentation, object detection, semantic segmentation, and image captioning tasks.
- Pascal Visual Object Classes (VOC) combines standardized image datasets and annotations generally used for class recognition, object detection, and instance segmentation. It is often used to evaluate and compare different methods.
- Pascal3D+ is a multiview dataset that consists of images of object categories with high variability, captured in uncontrolled settings, in cluttered scenes, and in many different poses. It covers the 12 rigid object categories of the PASCAL VOC 2012 dataset and adds pose-annotated images of these categories from ImageNet.
- LVIS (Large Vocabulary Instance Segmentation) focuses on collecting around 2 million high-quality instance segmentation masks for more than 1000 entry-level object categories in 164k images.
- MOT (Multiple Object Tracking) is used for tracking objects in outdoor and indoor scenes of public places, with pedestrians as the objects of interest. Each video is split into two clips, one used for training and the other for testing. The dataset provides detections of objects in the video frames from three different detectors: SDP, DPM, and Faster-RCNN.
- Visual Genome includes 101,174 images from MSCOCO along with about 1.7 million question and answer pairs, or 17 questions per image on average. It has a multiple-choice setting for Visual Question Answering. Compared with the Visual Question Answering dataset, Visual Genome has a more balanced distribution over six question types: What, Where, Why, How, When, and Who. It also provides 108k images densely annotated with objects, relationships, and attributes.
- The MPII Human Pose Dataset is a state-of-the-art benchmark for evaluating articulated human pose estimation. It contains around 25K images of over 40K people with annotated body joints. The images cover around 410 human activities, with manually annotated poses of up to 16 body joints. The images were extracted from YouTube videos.
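Since object detection datasets pair images with annotated boxes, a common way to compare a predicted box with an annotated one is intersection over union (IoU). The sketch below is a minimal version that assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples; datasets store boxes in differing conventions, so a conversion step may be needed first. The example coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

annotated = (10, 10, 50, 50)   # hypothetical ground-truth box
predicted = (30, 30, 70, 70)   # hypothetical model output

print(round(iou(annotated, predicted), 3))  # 400 / 2800
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap at all; benchmarks usually count a detection as correct once IoU passes a chosen cutoff.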
Video
Collecting an entire dataset and loading it into local storage is not only time-consuming but often impractical, and even more so when it comes to videos. Open datasets cut down these inconvenient and labor-intensive parts of the data pipeline. Below is a list of top open-source video datasets, some of which also appear in our computer vision datasets section.
- BDD100K is a large-scale, diverse open video dataset collected entirely from a driving platform. It is well suited to automotive applications and includes around 100K videos of 40 seconds each. BDD100K also offers temporal information for street scenes, varied data, and annotated footage.
- The Cityscapes dataset is a large-scale collection of stereo videos of urban street scenes. It features video recordings from 50 different cities, mostly in Germany, with pixel-accurate annotations. It also provides outside temperature, right stereo views, GPS coordinates, and ego-motion data.
- The VOT2016 dataset, available through the VOT toolkit, is used for visual object tracking. It includes around 60 video clips and 21,646 ground-truth maps with pixel-wise annotation of the tracked objects.
- The Kinetics dataset is one of the best-known video datasets, offering a large-scale, high-quality collection for human action recognition. It includes around 650,000 video clips covering 700 human action classes, including human interactions such as shaking hands and hugging. All the videos are taken from YouTube, and each class has 400 to 700 clips lasting around 10 seconds each.
- UCF101 is an action recognition dataset of realistic action videos gathered from YouTube. It contains 101 categories grouped into five types: human-human interaction, body motion, playing musical instruments, human-object interaction, and sports. Together these categories include 13,320 video clips with a total duration of around 27 hours.
- The HMDB51 dataset is a collection of real video clips gathered from sources such as movies and web videos. It includes 6,849 videos in 51 action categories such as kiss, jump, and laugh. Each category has at least 101 clips.
- DAVIS (Densely Annotated Video Segmentation) includes 50 video clips with 3,455 frames densely annotated at the pixel level. Of these 50 videos, 30 videos with 2,079 frames are used for training, while the other 20 videos with 1,376 frames are used for validation.
- KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most widely used datasets in autonomous driving and mobile robotics. It contains hours of traffic scenes recorded with several sensor modalities, including a 3D laser scanner, grayscale stereo cameras, and high-resolution RGB cameras. Over time, many researchers have refined the dataset by manually annotating sections of it to fit their needs.
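To get a feel for why handling whole video datasets locally is impractical, the clip counts quoted in the list above can be turned into rough playback hours. The sketch below only does that arithmetic, using the approximate figures from this article rather than official statistics; `total_hours` is a helper name chosen for illustration.

```python
def total_hours(num_clips, seconds_per_clip):
    """Total playback time in hours for equally long clips."""
    return num_clips * seconds_per_clip / 3600

# Figures as quoted in the list above (approximate).
bdd100k_hours = total_hours(100_000, 40)    # 100K videos of 40 seconds
kinetics_hours = total_hours(650_000, 10)   # 650K clips of about 10 seconds
ucf101_avg_seconds = 27 * 3600 / 13_320     # 27 hours over 13,320 clips

print(f"BDD100K: about {bdd100k_hours:.0f} hours of video")
print(f"Kinetics: about {kinetics_hours:.0f} hours of video")
print(f"UCF101: about {ucf101_avg_seconds:.1f} seconds per clip on average")
```

Even before thinking about resolution or file size, more than a thousand hours of footage is a strong hint to stream or sample the data rather than load it all at once.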
Audio
Machine learning models recognize and contextualize the world not only through computer vision; audio and sound also play an important role. Thanks to open-source audio datasets, training speech-enabled applications for even the noisiest environments has become easier.
- LibriSpeech is a corpus of about 1000 hours of audiobooks from the LibriVox project. Almost all the audiobooks come from Project Gutenberg. The training data is split into three portions of 100, 360, and 500 hours, whereas the dev and test sets are about 5 hours each.
- Universal Dependencies (UD) is a framework that aims to develop cross-linguistically consistent treebank annotation of syntax and morphology for many languages. Version 2.7, released in 2020, included 183 treebanks in 104 languages. The annotation includes universal dependency labels, dependency heads, and universal part-of-speech tags.
- VoxCeleb1 is a collection of over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube. The dataset is well suited for emotion recognition, speaker identification, speech separation, and more.
- VoxCeleb2 is a dataset with over a million utterances from around 6K speakers, collected from open-source media. As it includes both audio and video, it helps in applications like speech separation, visual speech synthesis, and cross-modal transfer from face to voice and vice versa.
- AudioSet is a dataset of over 2 million human-labeled 10-second clips drawn from videos. It comes with an ontology of 632 event classes for interpreting the data. Because the ontology is hierarchical, the same sound can be annotated under several labels. For instance, the sound of a meow can be annotated under Animal, Cat, and Pets.
- The CSTR VCTK Corpus is a speech dataset recorded by 110 English speakers with a variety of accents. Each speaker reads about 400 sentences, which were selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.
- The SNIPS Natural Language Understanding benchmark is a dataset that contains around 16,000 crowdsourced queries. These queries are grouped into 7 user intents of varying complexity, like booking a restaurant or asking about the weather. Of these utterances, 13,084 form the training set, while the development and test sets contain 700 utterances each, with 100 queries per intent.
- Common Voice is an audio dataset of unique MP3 files with corresponding text files. It contains a total of around 13,905 hours of recordings, along with demographic data such as age, accent, and sex. The 11,192 validated hours span 76 different languages.
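The AudioSet entry above mentions that one sound can sit under several labels. A minimal sketch of how such a hierarchy can be walked is shown below; the `PARENT` table is a made-up four-class slice arranged to mirror the meow example, not the real AudioSet ontology file, and `expand_labels` is a hypothetical helper.

```python
# Hypothetical slice of a sound ontology: each class points to its parent.
PARENT = {
    "Meow": "Cat",
    "Cat": "Pets",
    "Pets": "Animal",
    "Animal": None,  # top-level class
}

def expand_labels(label, parent=PARENT):
    """Return the label plus every ancestor, most specific first."""
    labels = []
    while label is not None:
        labels.append(label)
        label = parent.get(label)
    return labels

print(expand_labels("Meow"))  # ['Meow', 'Cat', 'Pets', 'Animal']
```

Expanding labels this way lets a model trained on coarse classes such as Animal still learn from clips that were only tagged with a specific class.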
Text
Most of the data you collect from emails, books, online forums, and websites is in text form. You will need natural language processing and optical character recognition to convert this unstructured data into a form machines can use. And considering the thousands of characters, shapes, and sizes involved, you will need a text dataset for the task. Here are a few you can try.
- The English Penn Treebank (PTB) corpus, particularly the section of Wall Street Journal (WSJ) articles, is one of the best-known and most widely used corpora for evaluating sequence labeling models. The task consists of annotating each word with its part-of-speech tag. The corpus is also commonly used for character-level and word-level language modeling.
- The Stanford Question Answering Dataset (SQuAD) is a collection of question and answer pairs drawn from Wikipedia articles. SQuAD 1.1 contains 107,785 question and answer pairs on 536 articles. The latest version, SQuAD2.0, combines the 100,000 questions of SQuAD1.1 with 50,000 unanswerable questions. These 50,000 questions were written by crowd workers to look similar to the answerable ones.
- Visual Question Answering (VQA), as the name suggests, is a dataset containing open-ended questions about images. Answering these questions requires an understanding of language, vision, and commonsense knowledge. The first version of VQA was released in October 2015, and the second version, VQA v2.0, followed in April 2017.
- The IMDb Movie Review dataset is a binary sentiment classification dataset containing around 50,000 reviews from the Internet Movie Database (IMDb), labeled as positive or negative. Positive and negative reviews are present in equal numbers, and some additional unlabeled data is included. No movie in the dataset has more than 30 reviews.
- ConceptNet is a knowledge graph that connects natural language words and phrases. Its knowledge comes from numerous sources, including expert-created resources, crowdsourcing, and games with a purpose.
- SNLI (Stanford Natural Language Inference) is a collection of over 570,000 pairs of human-written English sentences. Each pair is labeled as entailment, contradiction, or neutral, for a balanced classification task. It serves both as a benchmark for evaluating text representation systems and as a resource for developing NLP models.
- CLEVR (Compositional Language and Elementary Visual Reasoning) is a Visual Question Answering dataset. It consists of images of 3D-rendered objects, each accompanied by a number of questions that are grouped into different categories. The dataset includes three sets: the training set has 70,000 images with 700,000 questions, the validation set has 15,000 images with 150,000 questions, and the test set has 15,000 images with 150,000 questions. It also provides scene graphs describing the objects, along with answers for all training and validation questions.
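To make the idea of answerable and unanswerable questions from the SQuAD entry more concrete, here is a simplified sketch of what such records can look like. The field names are loosely modeled on the SQuAD JSON files but flattened for readability, and the passage, questions, and `extract_answer` helper are invented for illustration.

```python
# Hypothetical passage and questions, not taken from the real dataset.
context = "The Amazon rainforest covers much of the Amazon basin of South America."

answer = "much of the Amazon basin"
records = [
    {
        "question": "What does the Amazon rainforest cover?",
        "answer_text": answer,
        "answer_start": context.index(answer),  # character offset in context
        "is_impossible": False,
    },
    {
        # Looks plausible, but the passage does not contain the answer.
        "question": "Who discovered the Amazon rainforest?",
        "answer_text": "",
        "answer_start": -1,
        "is_impossible": True,
    },
]

def extract_answer(context, record):
    """Return the answer span from the context, or None if unanswerable."""
    if record["is_impossible"]:
        return None
    start = record["answer_start"]
    return context[start:start + len(record["answer_text"])]

for record in records:
    print(record["question"], "->", extract_answer(context, record))
```

A model evaluated on this kind of data has to do two things: find the right span when one exists, and abstain when the passage does not support an answer.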
Healthcare
The use of predictive analytics and machine learning techniques in healthcare is on the rise, and there’s a demand for ML-based apps in this sector. If you want to test the waters, the following datasets are good options.
- Breast cancer has become quite a concern in recent years. It’s hard to detect, and many women fall victim to it. ML is lending a hand by providing a way to predict whether a tumor is benign or malignant. The Breast Cancer Wisconsin (Diagnostic) Data Set supports this task with features computed from digitized images of fine needle aspirates (FNA) of breast masses. It consists of 569 instances, with 357 being benign and 212 malignant.
- The Pima Indians Diabetes database is a collection of 768 observations with 8 input features and one output feature. The dataset is used to predict from certain parameters whether a person has diabetes. However, the dataset is not balanced, and some missing values are denoted by 0.
- If you are building a program to predict the height or weight of a human, the SOCR dataset is a good choice. It contains height and weight data for over 25,000 people aged 18 or younger.
- The International Collaboration on Cancer Reporting (ICCR) offers not a single dataset but a collection of 12 datasets. The data are arranged according to 12 anatomical sites that are vulnerable to cancer. The main aim is to collect the required data about different tumors and store it in an organized manner. It can come in handy for the prognosis and management of cancers.
- Hectic lifestyles and poor eating habits expose people to a risk of heart disease. The Heart Disease dataset helps in recognizing individuals prone to heart problems. It provides 76 different attributes, including age, sex, chest pain type, resting blood pressure, and more, for predicting the risk. The dataset includes 303 instances, with values 1, 2, 3, and 4 indicating the presence of heart disease on a scale of severity and 0 representing its absence.
- CDI, or the Chronic Disease Indicators, is a relatively new dataset, updated in April 2021. It consists of information about gender, diseases, mortality, and more between 2008 and 2019. The US Centers for Disease Control and Prevention published the dataset to allow states and metropolitan areas to collect chronic disease data.
- Cardiovascular health problems are not uncommon these days. The Heart Failure dataset consists of data that you can use to predict these problems based on certain traits. It has 12 features for predicting the possibility of major problems or death due to heart problems, with columns for age, sex, presence of diabetes, blood pressure, and more. You can download it in CSV format.
- The MIT Lab for Computational Physiology developed MIMIC-III, which consists of de-identified health data. It contains details of around 40,000 critical care patients, including their vital signs, laboratory tests, medications, and more.
- When you need real-life patient data, the Ocular Disease Intelligent Recognition (ODIR) dataset is a good fit. It consists of data collected by Shanggong Medical Technology Co. Ltd from hospitals and medical centers in China. The collection offers a structured ophthalmic database of 5,000 patients. It includes their age, along with fundus photographs of the left and right eyes and diagnostic keywords from doctors. All of the data is closely checked by trained human readers to maintain accuracy and quality.
- The Fetal Health Classification dataset is useful for determining the health of a fetus. The set consists of 2,126 fetal cardiotocograms (CTGs), processed and labeled by three expert obstetricians. You can use it to classify fetal health as Normal, Suspect, or Pathological based on the CTG data. Such models can help reduce mortality for both mother and child.
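Missing values that are stored as 0, as in the Pima Indians Diabetes entry above, usually need handling before training. One simple option, sketched below, is to replace those zeros with the median of the observed values. The rows and column choices here are invented for illustration, and the approach only makes sense for columns where a zero is physically implausible; it should not be applied blindly to every feature.

```python
from statistics import median

# Hypothetical rows: (glucose, blood_pressure, has_diabetes)
rows = [(148, 72, 1), (85, 0, 0), (0, 64, 1), (89, 66, 0), (137, 40, 1)]

def impute_zeros(values):
    """Replace zeros (treated as missing) with the median of non-zero values."""
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]

glucose = impute_zeros([row[0] for row in rows])
pressure = impute_zeros([row[1] for row in rows])

print(glucose)   # the third value is filled in
print(pressure)  # the second value is filled in
```

Median imputation is only one choice; dropping incomplete rows or using a model-based imputer are common alternatives, and the right one depends on how much data is missing.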
COVID-19
The COVID-19 pandemic shook the world, with the number of infections increasing exponentially, and it is still having a devastating effect on health. Finding solutions requires proper research, and machine learning can help: it can help collect data, analyze it, and predict issues that may arise in the near future.
- The COVID-19 pandemic raised a need for medical masks to prevent infection. The COVID-19 Medical Face Mask Detection Dataset was developed in response to this need. It was published by Mikolaj Witowski and includes 682 pictures and over 3000 medical masked faces. The publishers removed every redundant and low-quality image from the data set to maintain high quality. It still has 1415 images, enough for building a mask detection program.
- During the COVID-19 pandemic, detecting the infection at an early stage was a challenge. COVID-Net is a deep convolutional neural network designed to help in this respect. Created by Alexander Wong and Linda Wang, it helps detect COVID-19 cases from chest X-ray images. It is trained on COVIDx, a dataset of 16,756 chest radiography images from around 13,645 patients.
- When you need to develop an algorithm or software to detect whether someone is wearing a mask properly, MaskedFace-Net is a great dataset. It consists of 133,983 images of people wearing their masks both correctly and incorrectly. It is based on the Flickr-Faces-HQ (FFHQ) dataset.
- CORD-19 is an extensive collection of machine-readable literature. Its AI research challenge invites the AI research community all over the world to apply text mining approaches and study new content related to COVID-19 response efforts worldwide. The dataset is updated regularly, and Kaggle sponsors a $1,000 prize per task.
- The COVID-19 CT scans dataset has 20 CT scans of COVID-19 patients, along with lung and infection segmentations made by experts. Because expert radiologists are in short supply, models built on such CT scans can be useful in COVID-19 diagnosis and treatment.
- The United States COVID-19 County Level of Community Transmission as Originally Posted is a public use dataset containing 7 data elements that show the level of community transmission (low, moderate, substantial, or high). There are two versions of the county-level community transmission data: this dataset has recent data, and the historical dataset has county-level transmission data from 1st January 2021.
- The COVID-19 dataset by Our World in Data has country-level information on vaccinations and vaccination data sources. The dataset is updated daily and includes the number of vaccinations on a particular day, the number of vaccinated people, the number of fully vaccinated people, and more.
Agriculture
Machine learning is an effective technique for increasing crop yields and developing agricultural plans. It can also help share data about the crops and agricultural techniques suited to a specific area.
- The Wine Quality dataset contains various physicochemical measurements of wine, including volatile acidity, residual sugar, fixed acidity, chlorides, and more. The goal is to build a model that predicts whether a wine’s quality is poor, typical, or excellent. With around 4,898 instances, this dataset suits both regression and classification tasks.
- The Food and Agriculture Organization (FAO) of the United Nations offers free access to food and agriculture data for over 245 countries and territories from 1961 to 2013. One of its products, the Food Balance Sheets dataset, shares insights into worldwide food production by focusing on the difference between feed produced for animals and food produced for human consumption.
- The United States wildfire dataset contains wildfire information for the period 1992-2015, collected from US federal, state, and local reporting systems. It was last updated a year ago, with a total of three updates. The data is distributed as an SQLite database that includes information such as the fire code, year, longitude and latitude, fire name, and more.
- The FAO also provides a dataset named FAOSTAT, in which you can view, filter, and download data on food insecurity, hunger, demographics, and more. The data spans 1990-2019 and is well suited to building forecasting models.
- Published in 2020, the Crop Recommendation Dataset is relatively new. Its goal is to increase agricultural yield by suggesting appropriate crops. It was created by augmenting Indian rainfall, climate, and fertilizer datasets, letting users build a predictive model that suggests the most suitable crop for a particular farm depending on factors like rainfall, soil pH value, humidity level, and more.
Security And Fraud
Security and surveillance systems are an important part of the present world. Machine learning can add several benefits by providing technologies like computer vision, motion detection, and fraud detection. And that means you can definitely use some security datasets.
- The Fake and Real News dataset includes two CSV files, one listing fake news and one listing real news. The fake news file contains around 17,903 articles, while the real news file consists of 20,826 unique values. Both files are limited to United States politics.
- Widely used in the literature, the Spam SMS dataset is a good choice for practicing spam detection and text classification. With a total of 5,574 instances, the set is a text file in which each line holds a label (ham or spam) followed by the raw message, gathered from different sources.
- Bank security, and in particular the recognition of fraudulent credit card transactions, is considered an essential part of security. The Credit Card Fraud Detection dataset contains 284,807 transactions made by European cardholders in September 2013. Of these, 492 have been identified as fraudulent, which makes the dataset highly imbalanced. More recently, a simulator for transaction data has been released as part of a practical handbook on machine learning for credit card fraud detection, so check it out if you are interested.
- The Synthetic Financial dataset is another fraud detection dataset, targeting mobile money transactions. It was created with a simulator known as PaySim, which aggregates data from a private dataset to build a synthetic dataset that resembles real transactions. The simulator also injects malicious behavior so that the performance of fraud detection methods can be evaluated.
- Credit scoring is another serious problem these days. The Credit Card Approval Prediction dataset provides the basis for a model that distinguishes good clients from bad ones according to historical data. The dataset comprises two files: application records and credit records.
- The Global Terrorism Database is an open-source database available in CSV form. It records more than 180,000 terrorist attacks around the world from 1970 to 2017. The dataset covers information such as country, region, date of occurrence, attack and target types, and even resolution.
- The SIXray dataset consists of 1,059,231 X-ray images. These images were gathered from subway stations and annotated by human security inspectors. The purpose of the dataset is to detect and classify six prohibited items: guns, knives, pliers, scissors, hammers, and wrenches. It also contains manually added bounding boxes on the testing sets for each prohibited item, to evaluate the performance of object localization.
- The Handgun Detection Dataset was first published by the University of Granada. Its aim is to contribute to public safety by detecting handguns in pictures. The dataset consists of 2,986 images and 3,448 labels across a single annotation class, covering pistols held in hand, pistols on their own, and other types of gun images.
- FireNet is a real-time fire detection project. Its aim is to train ML systems to detect fires instantly while avoiding false alerts. It features annotated datasets and pre-trained models. It includes 502 images, divided into 412 images for training and 90 images for validation.
- The name of the US Accident Dataset already describes its content. This countrywide dataset covers 49 US states with information from February 2016 to December 2020. Currently, there are around 1.5 million accident records available. The data was gathered in real time using various APIs, which broadcast traffic data captured by multiple entities such as traffic cameras, law enforcement agencies, and state departments of transportation.
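The imbalance in the Credit Card Fraud Detection dataset is worth quantifying, because it changes how a model should be judged. The short sketch below uses only the two counts quoted in the list above and compares them against a deliberately useless baseline that labels every transaction as legitimate; the variable names are illustrative.

```python
total_transactions = 284_807
fraud_transactions = 492

# Share of fraudulent transactions in the whole dataset.
fraud_rate = fraud_transactions / total_transactions

# Baseline that never flags anything: right on every legitimate
# transaction, wrong on every fraudulent one.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions
baseline_recall = 0 / fraud_transactions  # fraud cases caught / fraud cases

print(f"Fraud rate:        {fraud_rate:.3%}")
print(f"Baseline accuracy: {baseline_accuracy:.3%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

Because a model that catches no fraud at all still scores well above 99% accuracy here, metrics such as recall, precision, or the area under the precision-recall curve tend to be more informative on data like this.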
Flora and Fauna
AI-powered applications are brilliant at recognizing a species from a mere image, and technology like this definitely needs datasets.
- The Iris Data Set is perhaps the best-known database in the pattern recognition literature. It comes from R.A. Fisher’s classic paper, which is still referenced today. The data set has 3 classes, one per type of iris plant, with 50 instances each.
- This large-scale seafood dataset contains image samples of nine different seafood types: gilt head bream, red sea bream, sea bass, red mullet, horse mackerel, black sea sprat, striped red mullet, trout, and shrimp. It was published in 2020 and collected for segmentation, feature extraction, and classification tasks.
- The INRIA-Horse dataset comprises 170 images of horses annotated with bounding boxes and 170 images without horses. The dataset can be used for object detection, edge detection, and classification tasks. It is challenging, however, because of clutter, intra-class shape variability, and scale changes.
- The Stanford Dogs dataset has 20,580 images of 120 dog breeds from around the world. It was built from ImageNet images and bounding-box annotations for the task of fine-grained image categorization.
- The Mushroom dataset is aimed at predicting whether or not a mushroom is safe to eat. It was donated to the UCI Machine Learning Repository and has information about hypothetical samples of 23 species of gilled mushrooms. Every species is categorized as definitely edible, definitely poisonous, or of unknown edibility and not recommended.
- Animals-10 is a dataset containing 10 categories of animals: dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, and elephant. Its thousands of medium-quality animal images make it suitable for testing image recognition or classification models.
- Another animal dataset useful for image classification or recognition has 3 categories of animals (dog, cat, and wildlife) with high-quality images at 512×512 resolution. It has more than 5,000 images in each category, for a total of 16,130 images.
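With class-labeled datasets such as Iris (3 classes of 50 instances each), it is common to split the data so that every class keeps the same share in the training and test sets. Below is a minimal sketch of such a stratified split using only the standard library; the `stratified_split` helper, the 20% test fraction, and the fixed seed are illustrative choices rather than a prescribed recipe.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same train/test ratio."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test

# Labels laid out like the Iris data set: 3 classes of 50 instances each.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
train, test = stratified_split(labels)

print(len(train), len(test))  # 120 30
```

Libraries such as scikit-learn offer ready-made stratified splitting, so in practice a hand-written version like this is mostly useful for understanding what the option does.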
Final Thoughts
Artificial intelligence technologies, especially Machine Learning, are on the rise. More and more people are getting involved in the field, and more companies are including it in their management systems. Not to mention the applications of ML in websites and chatbots.
Thus, the need for ML resources is increasing, and datasets are just the thing you need to get started. They can help beginners get started with ML and spare experienced developers the hassle of collecting resources.
So, the next time you need a dataset for any of your ML projects, try one of the options provided above.