Categories
Data

Extract Hidden URLs from PDF files

To search for URLs in a PDF file, one can use the built-in search function in PDF readers and look for strings like “http.” However, when the URL is hidden, a simple string search will not work in the reader. A more reliable way to get all the URLs is to use the “pdftotext” program to convert the PDF file to text format, then use the “grep” command to catch all the URLs. The “-raw” option helps to keep the order of the text in place. The example code below converts the file to pdf, and identify the generated txt file by looking at the latest file file, grep the http, then delete the txt file.

pdftotext -raw ml.pdf && file=ls -tr|tail -1; grep http $file; rm $file

Leave a Reply

Your email address will not be published.