UiPath PDF File Word Analysis
I have a lot of PDF documentation to read. I’ve been starting to work with UiPath, and I wondered if I could put something together to perform a simple analysis of how frequently words are used in a document. My thinking is that it might help to identify trends in the document and any patterns of thinking, and to highlight key themes. It turns out you absolutely can, and with the simple way UiPath builds automation frameworks, it really wasn’t anywhere near as difficult as I was expecting.
For everything that follows I’ve used the free, native OCR capabilities. The accuracy of the automation could be improved by paying for an OCR service.
GitHub repository for this is here
PDF Analysis with UiPath
This automation utilises the native capabilities within UiPath to analyse the words used within a PDF file, mapping how frequently each word is used whilst stripping out common English words.
- Requests the folder containing the PDFs for analysis
- Requests the path to the downloaded wordCloud.txt (this contains the VBA code)
For each file in the provided directory:
- Read PDF Text to string, assign to a string variable.
- Generate single column DataTable from PDF Text, utilising ‘space’ as both a Column and NewLine Separator. Assign to DataTable variable.
- Assign DataTable column name to a string variable.
- Count the Rows in the DataTable and assign that to a string variable
- Create Excel file from PDF name with suffix ‘-report’
- Write DataTable to Excel file starting from cell ‘A1’ on sheet named ‘StratWords’
- Creates a table in the sheet ‘StratWords’, using the variable from the count rows action to determine the correct table size
- Invokes VBA stored in wordCloud.txt at ‘Punc’ entry point to strip punctuation from the word list
- Invoke VBA stored in wordCloud.txt at ‘commonWords’ entry point to remove common words from the word list
- Inserts column ‘Count’
- Writes formula ‘COUNTIF(A:A,A2)’ to cells in Count column – this counts the frequency of a word
- Inserts column ‘Rank’
- Writes formula ‘RANK.EQ(B2,B:B,0)’ to cells in the Rank column – this ranks the words from most to least frequently used.
- Creates Pivot table called ‘StratPiv’ in a new sheet called ‘PDFPivot’
- Invoke VBA stored in wordCloud.txt at ‘PivotConfig’ entry point to reformat the Pivot table
- Saves the file
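The per-file steps above boil down to a small word-count routine: split the text on spaces, strip punctuation, drop common words, count, then rank. As a rough sketch of that logic outside UiPath, here is how it might look in plain Python. This is not the workflow's activities or the VBA in wordCloud.txt. The function names are illustrative, and COMMON_WORDS is a hypothetical placeholder rather than the actual list the VBA uses. The ranking follows the RANK.EQ idea of tied counts sharing a rank, but it ranks unique words, so the numbers may not match the workbook exactly.

```python
import string
from collections import Counter

# Hypothetical placeholder list; the real common-word list lives in the VBA.
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}


def word_frequencies(text, common_words=COMMON_WORDS):
    """Count how often each word appears, ignoring common words."""
    # Split on whitespace, similar to using 'space' as the separator.
    words = text.split()
    # Strip leading/trailing ASCII punctuation and lower-case each word.
    # Note: string.punctuation does not cover curly quotes.
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    # Drop empty strings and anything in the common-word list.
    kept = [w for w in cleaned if w and w not in common_words]
    return Counter(kept)


def rank_words(counts):
    """Return (word, count, rank) tuples, most frequent first.

    Tied counts share a rank and the following rank is skipped,
    in the style of RANK.EQ, applied here to unique words.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    for position, (word, count) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == count:
            rank = ranked[-1][2]  # same count as previous word: share its rank
        else:
            rank = position
        ranked.append((word, count, rank))
    return ranked


if __name__ == "__main__":
    sample = "The strategy is clear: strategy, growth and growth. Risk!"
    for word, count, rank in rank_words(word_frequencies(sample)):
        print(rank, word, count)
```

Reading the PDF text itself is left out of the sketch; in the automation that part is handled by UiPath's PDF and OCR activities.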
As you can see from the output examples, the final look is not polished, but then I don’t need it to be. What you can start to see is the document being expressed through the key words it uses.