AutoML Training Guide

12 October 2021

Before you start

Note: Export your Google credentials before running this script.
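A quick pre-flight check can catch missing credentials before any upload starts. The Python 3 sketch below assumes the script relies on a service-account key exported through the GOOGLE_APPLICATION_CREDENTIALS environment variable, which this guide does not confirm; the helper name credentials_problem is illustrative only.

```python
import os


def credentials_problem(env=None):
    """Return a description of the problem, or None if the key file looks usable."""
    env = os.environ if env is None else env
    # Assumed variable name: the guide only says "export Google credentials".
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return "key file not found: " + path
    return None
```

Calling credentials_problem() with no arguments inspects the current environment; a return value of None means the variable is set and points at an existing file, not that the key itself is valid.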

Note: Split the dataset files in the following ratio:

Training Dataset (80%)

Test Dataset (10%)

Validation Dataset (10%)

For example, if you have 100 files, 80 should go to the training dataset, 10 to the validation dataset, and 10 to the test dataset.
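The 80/10/10 split can be computed rather than counted by hand. This is a minimal Python 3 sketch; the function name split_files and the choice to give any rounding remainder to the validation set are assumptions, not part of the original process.

```python
def split_files(files):
    """Split a list of file names into train/test/validation at 80/10/10."""
    ordered = sorted(files)              # stable order so the split is repeatable
    total = len(ordered)
    n_train = total * 80 // 100          # integer arithmetic avoids float rounding
    n_test = total * 10 // 100
    return {
        "train": ordered[:n_train],
        "test": ordered[n_train:n_train + n_test],
        # Whatever is left (10%, plus any remainder) goes to validation.
        "validation": ordered[n_train + n_test:],
    }
```

With 100 files this yields 80, 10, and 10 files. Shuffle the list before splitting if the file names carry any ordering that could bias the sets.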

Manual Annotation Process

Step 1: Create a directory, say dir1, and add all the resumes that you want to use for the training dataset.

Step 2: Create two more directories, dir2 and dir3, for the validation and test datasets respectively.

Step 3: Install the Google Cloud SDK and authenticate with your email ID. Once the installation is complete, run the commands below one at a time to upload the training, validation, and test datasets to Google Cloud Storage.

Train:

python2 script.py -t gs://match_making/tenant1/hr/documents/train train,dir1/*.pdf

Validation:

python2 script.py -t gs://match_making/tenant1/hr/documents/validation validation,dir2/*.pdf

Test:

python2 script.py -t gs://match_making/tenant1/hr/documents/test test,dir3/*.pdf
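The three uploads differ only in the split name, the local directory, and the destination folder, so they can be driven from one table. The Python 3 sketch below repeats the argument order shown in the commands above (-t, destination, then split,pattern). It assumes that python2 and script.py are available in the working directory and that script.py expands the *.pdf pattern itself, since the pattern is passed through unexpanded; the helper names are illustrative.

```python
import subprocess

BASE = "gs://match_making/tenant1/hr/documents"
UPLOADS = [("train", "dir1"), ("validation", "dir2"), ("test", "dir3")]


def upload_command(split, local_dir, base=BASE):
    """Build the argument list for one upload, mirroring the commands above."""
    return ["python2", "script.py", "-t",
            "{}/{}".format(base, split),
            "{},{}/*.pdf".format(split, local_dir)]


def run_uploads(dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for split, local_dir in UPLOADS:
        cmd = upload_command(split, local_dir)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failed upload
```

Calling run_uploads() first as a dry run lets you compare the printed commands with the ones above before anything is uploaded.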

Step 4: To verify that the upload completed successfully, navigate in Google Cloud Storage to the path you gave in the script and check that your resumes/JDs are there.
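The same check can be scripted instead of done by eye in the console. The Python 3 sketch below lists a Cloud Storage folder with gsutil ls and reports local files that have no object of the same name. It assumes gsutil is installed and authenticated, and that script.py keeps the original file names when uploading, which this guide does not state; both helper names are hypothetical.

```python
import posixpath
import subprocess
from pathlib import Path


def list_remote(prefix):
    """Return the object URIs that `gsutil ls` prints for a folder."""
    result = subprocess.run(["gsutil", "ls", prefix],
                            check=True, capture_output=True, text=True)
    return [line for line in result.stdout.splitlines() if line]


def missing_uploads(local_files, remote_uris):
    """Return local file names with no same-named object among remote_uris."""
    remote_names = {posixpath.basename(uri) for uri in remote_uris}
    return sorted(Path(f).name for f in local_files
                  if Path(f).name not in remote_names)
```

For example, missing_uploads(Path("dir1").glob("*.pdf"), list_remote("gs://match_making/tenant1/hr/documents/train")) returns an empty list when every training file has arrived.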

Step 5: After the upload, import the data into AutoML by creating a new dataset or by using an existing one.

Step 6: Navigate to the dataset.csv file in Google Cloud Storage, select it, and click Import.

Step 7: After you click Import, wait 5-10 minutes for all the resumes/JDs to be imported into the AutoML platform.

Step 8: Once the import is done, you will see all the PDF files and can start annotating them.

Step 9: After completing the annotation, start training, which usually takes about 3 hours.

Step 10: Once training is finished, you can test the model.

Auto Annotation Process

Step 1: To upload TXT files and use the auto-annotation feature, first convert each PDF into a TXT file.
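This guide does not name a conversion tool. As one possible approach, the Python 3 sketch below calls the pdftotext command-line utility from Poppler for every PDF in a directory; using pdftotext, the -layout option, and the helper names are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def txt_path_for(pdf_path, out_dir):
    """Map dir/name.pdf to out_dir/name.txt."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def convert_dir(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir to a TXT file in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = txt_path_for(pdf, out_dir)
        # -layout asks pdftotext to keep the original physical layout.
        subprocess.run(["pdftotext", "-layout", str(pdf), str(target)],
                       check=True)
```

Scanned resumes that contain only images produce empty text files with this approach and would need OCR instead.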

Step 2: Keep these files in separate directories, as you did for the PDF files, and use the commands below to upload them to Google Cloud Storage.

Note: The dict.csv file contains all the labels that we want to auto-annotate.

Train:

python2 script.py -d dict.csv -s train,dir1/*.txt gs://match_making/tenant1/hr/documents/train

python2 script.py -d dict_banking.csv -s train,dir1/*.txt gs://match_making/tenant1/banking/documents/train

Validation:

python2 script.py -d dict.csv -s validation,dir2/*.txt gs://match_making/tenant1/hr/documents/validate

python2 script.py -d dict_banking.csv -s validation,dir2/*.txt gs://match_making/tenant1/banking/documents/validate

Test:

python2 script.py -d dict.csv -s test,dir3/*.txt gs://match_making/tenant1/hr/documents/test

python2 script.py -d dict_banking.csv -s test,dir3/*.txt gs://match_making/tenant1/banking/documents/test
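To make the role of dict.csv concrete, here is a minimal Python 3 sketch of dictionary-based annotation. This guide does not show the file's layout or how script.py matches terms, so everything here is an assumption: each dictionary row is taken to be term,label, matching is a case-insensitive whole-term search, and the output record follows the JSONL shape that AutoML Natural Language entity extraction accepts for pre-annotated text.

```python
import csv
import io
import re


def load_dictionary(csv_text):
    """Parse assumed `term,label` rows, skipping short or empty ones."""
    rows = csv.reader(io.StringIO(csv_text))
    return [(r[0].strip(), r[1].strip())
            for r in rows if len(r) >= 2 and r[0].strip()]


def annotate(text, dictionary):
    """Return one record with an annotation for every dictionary hit."""
    annotations = []
    for term, label in dictionary:
        # Lookarounds give whole-term matches even for terms such as "C++".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append({
                "text_extraction": {"text_segment": {
                    "start_offset": match.start(),  # Python slice indices
                    "end_offset": match.end(),
                }},
                "display_name": label,
            })
    return {"annotations": annotations, "text_snippet": {"content": text}}
```

Serializing each record with json.dumps, one per line, gives a JSONL file. Labels produced this way are only as good as the dictionary, so they are best treated as a starting point for review in the AutoML interface.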

Step 3: To verify that the upload completed successfully, navigate in Google Cloud Storage to the path you gave in the script and check that your resumes/JDs are there.

Step 4: After the upload, import the data into AutoML by creating a new dataset or by using an existing one.

Step 5: Navigate to the dataset.csv file in Google Cloud Storage, select it, and click Import.

Step 6: After you click Import, wait 5-10 minutes for all the resumes/JDs to be imported into the AutoML platform.

Step 7: Once the import is done, you will see all the documents and can start the annotation process.

Step 8: After completing the annotation, start training, which usually takes about 3 hours.

Step 9: Once training is finished, you can test the model.
