Powershell

April 21, 2017

Parsing PDF to text using PowerShell

There are times when you come across an important PDF file (a low-level design document or something similar). The point is, I really hate PDF files...

I was googling for a way to read PDF files using PowerShell. This can be helpful if you have a lot of PDF files and want to filter them based on a certain word or string (using a regex or something similar).

I know I have been talking too much, so let's cut to the chase:

I found this amazing .NET library which you can use to parse a PDF file (convert it into text). The library is called "iTextSharp"; you can find some information about it here:

http://sourceforge.net/projects/itextsharp/?source=typ_redirect

I assume that you already know what the hell I am talking about, so this is the code which I got from Stack Overflow:

http://stackoverflow.com/questions/15684699/how-to-parse-pdf-content-to-database-with-powershell

The above code (click on the link to get it) reads one PDF file and parses its entire content into text.

The code which I wrote, with the help of the above code, is this:

# The code starts here :

# Load the iTextSharp assembly once, before the loop
Add-Type -Path "c:\NonOfficialPSModules\itextsharp-all-5.5.8\itextsharp-dll-core\itextsharp.dll"

$pdflist = Get-ChildItem -Path "D:\Resumes\" -Filter "*.pdf"

foreach ($pdff in $pdflist) {
    $pdffile = $pdff.Name
    # Use the full path of the file so the folder is not hard-coded a second time
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $pdff.FullName

    Write-Host "Reading file $pdffile" -BackgroundColor Black -ForegroundColor Green

    # Reset the buffer so text from the previous file does not carry over
    [string[]]$Text = @()

    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        # Create a fresh extraction strategy for every page
        $strategy = New-Object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
        $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy)
        # Re-encoding step kept from the Stack Overflow answer (system default encoding to UTF-8)
        $Text += [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::Convert([System.Text.Encoding]::Default, [System.Text.Encoding]::UTF8, [System.Text.Encoding]::Default.GetBytes($currentText)))
    }

    # Print the text of the current file, one array element per page
    $Text
    #[regex]::matches( $text, '(?<=IMEI\s+)(\d+)(?=\s+)' ) | select *
    $reader.Close()
}

# The code ends here...

What I really did is that, instead of reading one PDF file, I read an entire folder which contains a lot of PDF files, extract the text from each of them and show it on the console screen. This can be useful if (for example) I want to know how many PDF files contain a certain word or string; the commented-out regex line in the code shows where such a filter would go.
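To make the "how many PDF files contain a certain word" idea a bit more concrete, here is a rough sketch. It does not touch iTextSharp at all: it assumes you have already collected the extracted text per file into a hashtable (file name as key, text as value), which the script above does not do as written. The function name Get-PdfNamesContaining and the sample data are made up for illustration only.

```powershell
# Hypothetical helper, not part of iTextSharp or of the script above.
# Returns the names of the files whose extracted text contains the given word.
function Get-PdfNamesContaining {
    param(
        [Parameter(Mandatory)] [hashtable] $TextByFile,  # file name -> extracted text
        [Parameter(Mandatory)] [string]    $Word
    )

    # Escape the word so characters such as "." or "+" are matched literally
    $pattern = [regex]::Escape($Word)

    foreach ($name in $TextByFile.Keys) {
        # Join the pages into one string in case the value is a string array;
        # -match is case-insensitive by default
        if (($TextByFile[$name] -join "`n") -match $pattern) {
            $name
        }
    }
}

# Sample data standing in for the text extracted from each PDF (assumption)
$sample = @{
    'alice.pdf' = 'Experienced PowerShell and C# developer'
    'bob.pdf'   = 'Network engineer, Cisco certified'
    'carol.pdf' = 'Automation with powershell and Python'
}

$hits = @(Get-PdfNamesContaining -TextByFile $sample -Word 'PowerShell' | Sort-Object)
Write-Host "$($hits.Count) of $($sample.Count) files contain the word"
$hits
```

In the real script you would fill the hashtable inside the foreach loop, for example by storing $Text under $pdffile, and then call the helper once after the loop has finished.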

Comments

Gary Bijoy, August 19, 2020 at 4:12 AM
Is there a way to extract the keyword plus a certain number of characters? For example, the keyword I am searching for in the PDF is "Address", and I want the output to be "Address" plus the next 100 characters.
morris, March 10, 2021 at 6:38 AM
Thank you very much for writing such an interesting article on this topic. This has really made me think and I hope to read more.