jaweedkhan

Elasticsearch: index DOCX / PDF files

Hello, I would like some help with an Elasticsearch solution for one piece of functionality in my project.

Functionality:

  1. When a user uploads a DOCX or PDF file, the file's contents should be indexed in Elasticsearch. Currently the form data is indexed without problems, but the contents of the uploaded file (DOCX or PDF) are not.
  2. Once the uploaded file's contents are indexed, how can I search within that indexed data?
  3. How can we check whether the same file is being uploaded again and its data is already indexed, so that we can skip indexing it a second time?
umarhabib

Google's Tesseract OCR engine can be used to read text from images, and the extracted text can then be indexed as JSON just like your form data. For PDF files, first convert the PDF pages into images and then run Tesseract on each image in the same way.

Here is how you can read text from an image using Tesseract:

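// Controller method. Assumes the TesseractOCR class is imported at the top of the file
// (it looks like the thiagoalessio/tesseract_ocr wrapper) and that the tesseract binary is installed.
public function processUploadedFile(Request $request) {

    // Accept image uploads only; Tesseract reads images, not PDF or DOCX files directly.
    $request->validate([
        'uploaded_file' => 'required|image|mimes:jpeg,png,jpg,gif',
    ]);

    // Store the upload on the "public" disk under uploads/, keeping the client's file name.
    $uploadedFile = $request->file('uploaded_file');
    $filePath = $uploadedFile->storeAs('uploads', $uploadedFile->getClientOriginalName(), 'public');

    // Run OCR on the stored image and get the recognized text back as a string.
    $text = (new TesseractOCR(storage_path("app/public/{$filePath}")))->run();

    // Return the text as JSON; this is the value you would send to Elasticsearch.
    return response()->json(['text' => $text]);
}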
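
For the third point in the question (not indexing the same file twice), one possible approach is to hash the file's bytes and use that hash as the document ID, so an identical upload always maps to the same ID. The sketch below is only an illustration using built-in PHP functions; the helper names `buildIndexDocument` and `isAlreadyIndexed` and the field names `file_name` and `content` are made up for this example and are not part of any package.

```php
<?php

/**
 * Build a JSON-ready document for a file whose text has already been extracted
 * (for example by the OCR step above). The SHA-256 of the file bytes is used
 * as the document ID, so two uploads with identical bytes get the same ID.
 * Hypothetical helper, not part of any library.
 */
function buildIndexDocument(string $filePath, string $extractedText): array
{
    $hash = hash_file('sha256', $filePath);
    if ($hash === false) {
        throw new RuntimeException("Cannot read {$filePath}");
    }

    return [
        'id'   => $hash,
        'body' => [
            'file_name' => basename($filePath), // illustrative field name
            'content'   => $extractedText,      // illustrative field name
        ],
    ];
}

/**
 * Return true when a file with the same hash has already been indexed.
 * Here the known hashes are a plain array; in a real application this
 * would be a lookup against wherever the indexed IDs are stored.
 */
function isAlreadyIndexed(string $hash, array $knownHashes): bool
{
    return in_array($hash, $knownHashes, true);
}
```

In a real application the array lookup would probably be replaced by checking whether a document with that ID already exists in the Elasticsearch index, and the `content` field is what a full-text query would search for the second point in the question. Note that this only catches byte-identical files; the same document re-saved or re-scanned will produce a different hash.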
