Offline Data Extraction Tool - web-scraping

I have a large collection of web pages that were scraped from a website and saved to disk. I want to extract specific data from those pages, so I need to implement a tool that walks through every folder and finds the HTML files in it. Afterwards, each HTML file should be parsed to pull out the required information. Can anyone help me with the first phase of this project?
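For the first phase (walking the folders and finding the HTML files), a minimal sketch in Python could look like the following. The folder name "scrapes" and the function name find_html_files are assumptions for illustration, and treating only .html and .htm as HTML files is also an assumption about how the pages were saved.

```python
from pathlib import Path
from typing import Iterator


def find_html_files(root: str) -> Iterator[Path]:
    """Yield every .html/.htm file under root, recursing into all subfolders."""
    for path in sorted(Path(root).rglob("*")):
        # Compare the extension case-insensitively so PAGE.HTML is found too.
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            yield path


if __name__ == "__main__":
    # "scrapes" is a hypothetical top-level folder holding the downloaded pages.
    for html_path in find_html_files("scrapes"):
        print(html_path)
```

Each yielded path can then be handed to whatever HTML parser is chosen for the second phase.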


Creating a dataset by web crawling

I want to build a dataset consisting of about 2000-3000 web pages, starting with several seed URLs. I tried the Nutch crawler but was unable to get it done (I could not convert the fetched 'segments' data into HTML pages).
Can you suggest a different crawler or any other tool that you have used? Also, what if the web pages contain absolute URLs, which would make offline use of the dataset impossible?
You cannot convert the Nutch crawled segments directly to HTML files.
I suggest these options:
1. Modify the source code to do that. Study the org.apache.nutch.segment.SegmentReader class, then dig into it and adapt it to your use case.
2. Easy solution if you don't want to invest time studying the code: use Nutch to crawl all the required pages, then get the URLs that were actually crawled with the "bin/nutch readdb" command (use the dump option). Then write a script that runs wget on those URLs and saves the pages as HTML. Done!
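The download script from the easy solution could be sketched like this in Python. Note that it swaps wget for the standard library's urllib, and it assumes the dump has been saved as a text file in which each record line starts with the URL; the file name "crawldb_dump.txt", the output folder "pages", and all function names here are hypothetical.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def extract_urls(lines: Iterable[str]) -> List[str]:
    """Keep the first token of every line that starts with http:// or https://."""
    urls = []
    for line in lines:
        token = line.split(maxsplit=1)[0] if line.strip() else ""
        if token.startswith(("http://", "https://")):
            urls.append(token)
    return urls


def filename_for(url: str) -> str:
    """Build a filesystem-safe name from the URL (a SHA-1 hash plus .html)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def fetch(url: str) -> bytes:
    """Download one page; this stands in for the wget call."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(urls: Iterable[str], out_dir: str,
               fetcher: Callable[[str], bytes] = fetch) -> int:
    """Save every URL into out_dir and return how many pages were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for url in urls:
        try:
            body = fetcher(url)
        except OSError as err:  # urllib's URLError and HTTPError are OSErrors
            print(f"skipped {url}: {err}")
            continue
        (out / filename_for(url)).write_bytes(body)
        saved += 1
    return saved


if __name__ == "__main__":
    # "crawldb_dump.txt" is a hypothetical copy of the readdb dump output.
    with open("crawldb_dump.txt", encoding="utf-8") as dump:
        print(save_pages(extract_urls(dump), "pages"), "pages saved")
```

Hashing the URL for the file name avoids problems with slashes and query strings; keeping a separate URL-to-file mapping would be needed if the original addresses matter later.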

Can Apache Solr store the actual files that are uploaded to it?

This is my first time on Stack Overflow. Thanks to all for providing valuable information and helping one another.
I am currently working with Apache Solr 7. I need to complete a POC and I am short on time, so I am putting this question here. I have set up Solr on my Windows machine, created a core, and uploaded a PDF document using /update/extract from the Admin UI. After uploading, I can see the metadata of the file if I query from the Admin UI using the query button. I was wondering if I can get the actual content of the PDF as well. I can see that a tlog file gets generated under /data/tlog/tlog000... with raw PDF data, but not the actual file.
So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
a. If it does, where does it store it?
b. If it does not, is there a way to store the file?
Munish Arora
Solr will not store the actual file anywhere.
Depending on your config, it can store the binary content though.
With the extract request handler, Apache Solr relies on Apache Tika[1] to extract the content from the document[2].
So you can search and return the content of the PDF and a lot of other metadata if you like.

How to generate PDF files using Liferay?

I tried to find proper services for generating PDF files in Liferay, but I have found only the class PDFProcessorUtil. How do I use it to generate a PDF file? And how do I save the generated file afterwards? I think I should use DLAppLocalServiceUtil.addFileEntry to save the file into Liferay storage.
Liferay's PDF-conversion works by converting documents in the document library and offering them for download - this is implemented through Open Office. Install Open Office or Libre Office, run it in server mode and configure Liferay to use it, then you can choose to select downloads as PDF. The HTML format has a few limitations, as it can include so many external resources, so I'm not sure what your result will be.
If you're generating the HTML output yourself, you might want to consider any other (Liferay-independent) means of generating PDF, as you might not need to upload your files to the Document Library (e.g. if you're generating reports on the fly and just want the generator result to be PDF, but not store them). If this is what you need, you can use any pdf converter library you want - Liferay does not limit you in your choice.
You can also generate the PDFs from the serve resource phase of a portlet.
You put a button or a link somewhere, and when you click on it, you download the PDF.
In this simple example, the PDF is generated from a Freemarker template that generates an HTML that is converted to PDF:

Use EC2 for PDF Generation, provide public URL to user

I have developed an application which allows users to select multiple "transactions"; each of these is directly related to a PDF file.
When a user multi-selects them and "prints" them, these PDF files are merged into one longer file for ease of printing.
Currently, "transaction" PDFs are generated on request, and so is the PDF merging.
I'm trying to scale this up by relying on Amazon infrastructure, and some questions have arisen.
Should I implement a queue for the PDF generation per "transaction"? If so, how can I give the user a seamless experience? We don't want them to "wait".
Can I use EC2 to generate these PDF files for me? If so, can I provide a "public" link so the user downloads the file directly from Amazon instead of using our resources?
Thanks a lot!
EDIT ---- More details
User inputs some information through a regular form
System generates a PDF per request, using the provided information for the document
The PDF generated by the system is kept under Amazon S3
We provide an API which allows you to "print" multiple PDFs at once; to do so, we merge the selected PDF files from S3 into one file for ease of printing
When you multi-print documents, a new window opens showing your merged file directly; the user needs to wait around 20 seconds for it to display.
We want to move the resources used to generate the PDFs onto Amazon infrastructure, but we need to keep the same flow, meaning we should provide an instant public link for the user to download and print the files.
Based on my understanding of your question, I think you just need the link to be created immediately after the user requests the file, while the PDF merge runs in parallel. Here is an idea that may work in your situation.
First, generate a unique PDF file name, for example a random string. At the same time, generate the PDF in the background and save it under the name created in the first step. This gives the user the file name and download link instantly, even though the file creation is still in progress.
Make sure you use threads if using PHP, or the event loop if using Node.js, to run both steps at the same time. This will avoid a 404 error for a file that is not found.
Transferring files from EC2 to S3 would also add latency. But if you want to keep the files for later or repeated use, S3 is a good idea, since it can serve the PDF files directly and deliver them faster; S3 is meant for static file storage. Otherwise, simply compute everything and generate the files on EC2.
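To make the "name first, generate in the background" idea concrete, here is a minimal sketch in Python rather than PHP or Node.js. The function start_merge and its build_pdf argument are hypothetical names, and writing to a local folder stands in for whatever storage (EC2 disk or S3) is actually used.

```python
import threading
import uuid
from pathlib import Path
from typing import Callable, Tuple


def start_merge(out_dir: str,
                build_pdf: Callable[[], bytes]) -> Tuple[str, threading.Thread]:
    """Pick a unique file name right away and build the PDF in the background."""
    name = f"{uuid.uuid4().hex}.pdf"  # random, collision-resistant name
    target = Path(out_dir) / name

    def worker() -> None:
        data = build_pdf()  # the slow merge step, supplied by the caller
        # Write under a temporary name, then rename, so a reader never
        # sees a half-written PDF at the final path.
        tmp = target.with_suffix(".part")
        tmp.write_bytes(data)
        tmp.replace(target)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return name, thread
```

The name can be turned into a link and returned to the user immediately. The file only exists once the worker finishes, so the page that opens the link would still need to wait or poll until then.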

Extract data from PDF using web harvesting

How can I extract data from PDFs using web harvesting? I am getting all the relevant PDF URLs on a page, but I am not able to extract data out of those PDFs. I am using Web Harvest version 2.0 to extract the PDF URLs. Please help.
How would I incorporate pdfcommand in web harvesting to get the text? Is there any other way to do it without running a batch file?
I think Web Harvest is not sufficient for this. You should use wget and pdfbox to get your result. First download all the PDFs from your URLs into a folder with the help of wget or Web Harvest itself. Then run the pdfbox command to get the text from the PDFs. You can learn more about pdfbox from its documentation. You can also create a batch file to run these steps in order.
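As an alternative to a batch file, the text-extraction step could be driven from a small Python script like the sketch below. It assumes the PDFs are already downloaded into one folder and that the PDFBox 2.x standalone app jar is available, whose command line is "ExtractText input output" (newer PDFBox versions use a different syntax). The jar name "pdfbox-app.jar", the folder "pdfs", and the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def extract_text_command(pdfbox_jar: str, pdf: Path, txt: Path) -> List[str]:
    """Build the PDFBox 2.x command line that writes the text of pdf to txt."""
    return ["java", "-jar", pdfbox_jar, "ExtractText", str(pdf), str(txt)]


def extract_all(pdf_dir: str, pdfbox_jar: str,
                run: Callable[..., object] = subprocess.run) -> List[Path]:
    """Run the extraction for every PDF in pdf_dir; return the .txt paths."""
    outputs = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        # check=True makes a failed extraction raise instead of passing silently.
        run(extract_text_command(pdfbox_jar, pdf, txt), check=True)
        outputs.append(txt)
    return outputs


if __name__ == "__main__":
    # Both names below are placeholders for the real jar and download folder.
    for text_file in extract_all("pdfs", "pdfbox-app.jar"):
        print("wrote", text_file)
```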