kevinwakhisi's avatar

How can I clean data performing excel imports

I am importing a large number of records ('probably 10,000+) but the issue is that one column has this variation. The row name is certifications and from the file, I get these multiple variations that need to be standardized, For example, I when I dd() the group certifications I get

^ array:25 [▼
  "RA CERTIFICATION" => Illuminate\Database\Eloquent\Collection {#1707 ▶}
  "RA" => Illuminate\Database\Eloquent\Collection {#1689 ▶}
  "RFA" => Illuminate\Database\Eloquent\Collection {#7561 ▶}
  "RA/ISO CERTIFICATION" => Illuminate\Database\Eloquent\Collection {#7560 ▶}
  "" => Illuminate\Database\Eloquent\Collection {#7559 ▶}
  "R/A" => Illuminate\Database\Eloquent\Collection {#7558 ▶}
  "RFA/ISO" => Illuminate\Database\Eloquent\Collection {#7557 ▶}
  "RA CERTIFIED" => Illuminate\Database\Eloquent\Collection {#7556 ▶}
  "RA CERTIFICATION AND ISO" => Illuminate\Database\Eloquent\Collection {#7555 ▶}
  "RA AND ISO CERTIFIED" => Illuminate\Database\Eloquent\Collection {#7554 ▶}
  "ISO RA" => Illuminate\Database\Eloquent\Collection {#7553 ▶}
  "ISO" => Illuminate\Database\Eloquent\Collection {#7552 ▶}
  "FSMS,RA" => Illuminate\Database\Eloquent\Collection {#7551 ▶}
  "ISO AND RA CERTIFICATION" => Illuminate\Database\Eloquent\Collection {#7550 ▶}
  "RA CERTIFICARION" => Illuminate\Database\Eloquent\Collection {#7549 ▶}
  "N/A" => Illuminate\Database\Eloquent\Collection {#7548 ▶}
  "RA/ISO22000" => Illuminate\Database\Eloquent\Collection {#7547 ▶}
  "ISO 22000 :2018" => Illuminate\Database\Eloquent\Collection {#7546 ▶}
  "RFA CERTIFIED" => Illuminate\Database\Eloquent\Collection {#7545 ▶}
  "RA & FT CERTIFICATION" => Illuminate\Database\Eloquent\Collection {#7544 ▶}
  "ISO 22000, RA" => Illuminate\Database\Eloquent\Collection {#7543 ▶}
  "RAIN FOREST" => Illuminate\Database\Eloquent\Collection {#7542 ▶}
  "RAIN FOREST AND FAIR TRADE" => Illuminate\Database\Eloquent\Collection {#7541 ▶}
  "RA/ ISO CERTIFICATION" => Illuminate\Database\Eloquent\Collection {#7540 ▶}
  "NON RA CERTIFICATION" => Illuminate\Database\Eloquent\Collection {#7539 ▶}
]

as you can see, RA, RFA, and RAIN FOREST all mean the same but still have variations and might include CERTIFICATION or CERTIFIED. How can I clean this data before saving them in my data base while importing an excel file?

0 likes
6 replies
SilenceBringer's avatar

@kevinwakhisi as a way - write lookup table with the mapping between imported column names and actual db columns and check this lookup table on import

Really - it's always helpful to see your actual code to understand what you're trying to do and suggest something more detailed

kevinwakhisi's avatar

@SilenceBringer what i did was

$catalogues = Auctioncatalogue::get()->groupBy(function ($data) {
            return $data->certification;
        })->dd();

that gave me the end result as shown above. Is there a certain condition that I could apply to format the data ta. all the RA and RAIN FOREST to RA

SilenceBringer's avatar
Level 55

@kevinwakhisi

/** The mapping between values in import and column names you need */
$lookupTable = [
	'RA CERTIFICATION' => 'RA',
	'RFA' => 'RA',
	'RA/ISO CERTIFICATION' => 'RA',
	// ... all others
];

$catalogues = Auctioncatalogue::get()->groupBy(function ($data) use ($lookupTable) {
            return Arr::get($lookupTable, $data->certification, $data->certification);
        })->dd();
kevinwakhisi's avatar

@SilenceBringer Well this works thanks, How can I reuse it in my model let's say is certified and still maintaining the lookup that you used here

sr57's avatar

@kevinwakhisi

as you can see, RA, RFA, and RAIN FOREST all mean the same

You are the "only one" who knows this, have a synonyms list somewhere and decide which is the master to enter the db.

qualitcertservices's avatar

Standardize certification values Create a fixed master list (e.g., RA, ISO 22000, RA + ISO, RFA) and map all variations like RA CERTIFICARION, R/A, RAIN FOREST to the correct standard value.

Please or to participate in this conversation.