UiPath PDF File Word Analysis

I have a lot of PDF documentation to read.  I’ve been starting to work with UiPath, and I wondered if I could put something together to perform a simple analysis of how frequently words are used in a document.  My thinking here is that it might help to identify trends in the document and any patterns of thinking, and to highlight key themes.  It turns out you absolutely can, and with the simple way UiPath builds automation frameworks, it really wasn’t anywhere near as difficult as I was expecting.

For everything that follows I’ve used the free, native OCR capabilities; the accuracy of the automation could be improved by paying for an OCR service.

GitHub repository for this is here

PDF Analysis with UiPath

This automation utilises the native capabilities within UiPath to analyse the words used within a PDF file, mapping how frequently each word is used whilst stripping out common English words.

Steps

Set Variables

  1. Requests the folder containing the PDFs for analysis
  2. Requests the path to the downloaded wordCloud.txt (this contains the VBA code)

Analysis

For Each file in the provided directory:
  1. Read PDF Text to string, assign to a string variable.
  2. Generate single column DataTable from PDF Text, utilising ‘space’ as both a Column and NewLine Separator. Assign to DataTable variable.
  3. Assign DataTable column name to a string variable.
  4. Count the Rows in the DataTable and assign that to a string variable
  5. Create Excel file from PDF name with suffix ‘-report’
  6. Write DataTable to Excel file starting from cell ‘A1’ on sheet named ‘StratWords’
  7. Creates a table in the sheet ‘StratWords’, using the variable from the count rows action to determine the correct table size
  8. Invokes VBA stored in wordCloud.txt at ‘Punc’ entry point to strip punctuation from the word list
  9. Invoke VBA stored in wordCloud.txt at ‘commonWords’ entry point to remove common words from the word list
  10. Inserts column ‘Count’
  11. Writes formula ‘COUNTIF(A:A,A2)’ to cells in Count column – this counts the frequency of a word
  12. Inserts column ‘Rank’
  13. Writes formula ‘RANK.EQ(B2,B:B,0)’ to cells in the Rank column – this ranks the words from most to least frequently used.
  14. Creates Pivot table called ‘StratPiv’ in new sheet called ‘PDFPivot’
  15. Invoke VBA stored in wordCloud.txt at ‘PivotConfig’ entry point to reformat the Pivot table
  16. Saves the file
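
For readers who want to see the counting and ranking logic outside UiPath and Excel, steps 2 and 8 to 13 can be sketched in Python. This is only a minimal sketch under some assumptions, not the VBA from wordCloud.txt: the stop-word list here is a short hypothetical stand-in for the real common-word list, punctuation is only stripped from the start and end of each word, and the function names are made up for illustration. Words are lower-cased because COUNTIF compares text case-insensitively. The rank mirrors RANK.EQ applied to a column with one row per word occurrence, so a word's rank is 1 plus the number of rows holding a higher count.

```python
import string
from collections import Counter

# Hypothetical, deliberately short stop-word list (assumption);
# the real list lives in the VBA stored in wordCloud.txt.
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def word_frequencies(text, common_words=COMMON_WORDS):
    """Split on whitespace, strip punctuation, drop common words, count the rest."""
    words = []
    for token in text.split():
        # Strip leading/trailing punctuation and lower-case the word
        # (COUNTIF is case-insensitive, so 'Risk' and 'risk' count together).
        word = token.strip(string.punctuation).lower()
        if word and word not in common_words:
            words.append(word)
    return Counter(words)


def rank_words(counts):
    """Rank words the way RANK.EQ(B2,B:B,0) would on one row per occurrence.

    A word with count c occupies c rows, so the rank of a word is
    1 + the total number of rows belonging to words with a higher count.
    """
    ranks = {}
    for word, count in counts.items():
        rows_above = sum(c for c in counts.values() if c > count)
        ranks[word] = 1 + rows_above
    return ranks


sample = "The strategy, the growth and the strategy of growth: strategy. Risk!"
counts = word_frequencies(sample)
ranks = rank_words(counts)
for word, count in counts.most_common():
    print(word, count, ranks[word])
# strategy 3 1
# growth 2 4
# risk 1 6
```

The gaps in the ranks (1, 4, 6) come from ranking over every row rather than over distinct words, which is what the formula in step 13 does before the Pivot table collapses the duplicates.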

Output Example

Pivot Output Example 1

Pivot Output Example 2

Summary

As you can see from the output examples, the final look is not polished, but then I don’t need it to be.  What you can start to see is the document being expressed through the key words it uses.

Thanks

Simon